## Survey Data Cleaning
This notebook cleans and integrates a dataset published by The Associated Press. The data comes from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security, and social dynamics related to COVID-19 in the United States.

## Step 1: Importing all the libraries

In [36]:
import pandas as pd
import numpy as np

## Step 2: Importing all the datasets that we need to process
The datasets are three CSV files containing survey results from different months. We load each one as a pandas DataFrame so that we can process it further.

In [106]:
df1 = pd.read_csv('01_April_30_covid_impact_survey.csv')
df2 = pd.read_csv('02_May_12_covid_impact_survey.csv')
df3 = pd.read_csv('03_June_9_covid_impact_survey.csv')

## Step 3: Do necessary data cleaning and integration

The raw data needs cleaning before it can be used. For example, almost every answer column from the survey holds an integer code followed by a text description. Because we know where the data came from and what each integer code means, we can drop the text descriptions and keep only the codes.
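As a minimal sketch of this idea, the snippet below strips the descriptions from a small made-up table. The column names `Q1` and `Q2` and the answer strings are hypothetical and only illustrate the "integer code plus description" shape described above.

```python
import pandas as pd

# Hypothetical answers: an integer code followed by a text description.
sample = pd.DataFrame({
    "Q1": ["(1) Yes", "(2) No", "(1) Yes"],
    "Q2": ["(3) Sometimes", "(1) Always", None],
})

# Remove every run of non-digit characters, then convert what is left to numbers.
codes = sample.replace(r"\D+", "", regex=True).apply(pd.to_numeric)
print(codes)
```

Missing answers stay missing after the conversion, so they can still be handled separately later.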

In these datasets, we are mainly interested in the economic results, so we extract the economy-related columns. We also keep a few other columns of interest, namely household income and people's responses to the coronavirus.

### 01_April_30_covid_impact_survey
Most of the columns are easy to clean: we remove all non-numeric characters and convert the remaining values to numeric types for later work. Some descriptions, however, contain the string COVID-19, so removing non-numeric characters alone would leave the digits 19 attached to the answer code. For those columns, we additionally keep only the first character of the result.
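The effect of a description that contains digits can be seen on a few made-up answers. The strings below are hypothetical, and the `str.extract` variant at the end is an alternative shown for comparison, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical answers; the second description itself contains digits.
answers = pd.Series(["(1) No", "(2) Yes, because of COVID-19", "(77) Not sure"])

# Removing non-digits alone glues the "19" from COVID-19 onto the code.
digits_only = answers.replace(r"\D+", "", regex=True)

# Keeping the first character recovers single-digit codes only.
first_digit = pd.to_numeric(digits_only.str[:1])

# Alternative (an assumption, not the notebook's method): take the first run of digits.
leading_code = pd.to_numeric(answers.str.extract(r"(\d+)", expand=False))

print(digits_only.tolist())   # ['1', '219', '77']
print(first_digit.tolist())   # [1, 2, 7]
print(leading_code.tolist())  # [1, 2, 77]
```

Keeping the first character is enough when every code in the affected columns has a single digit. If any of them used multi-digit codes, extracting the leading run of digits would be the safer choice.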

In [110]:
# people's response to the coronavirus
df1_PHYS = df1.iloc[:, 38:60].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
# economic related questions
df1_ECON8 = df1.iloc[:, 65:84].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df1_ECON7 = df1.iloc[:, 84:95].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df1_ECON1 = pd.to_numeric(df1.iloc[:, 95].replace(r'\D+', '', regex=True))
# these descriptions mention COVID-19, so keep only the first digit
df1_ECON4 = df1.iloc[:, [97, 99, 100]].replace(r'\D+', '', regex=True).apply(lambda x: pd.to_numeric(x.str[:1]))
df1_ECON6 = df1.iloc[:, 101:113].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df1_ECON5 = df1.iloc[:, [113, 114]].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df1_ECON = pd.concat([df1_ECON8, df1_ECON7, df1_ECON1, df1_ECON4, df1_ECON6, df1_ECON5], axis=1)
# home income
df1_income = df1.iloc[:, 153]

df1_clean = pd.concat([df1_PHYS, df1_ECON, df1_income], axis=1)
    

### 02_May_12_covid_impact_survey
At first, we wrapped the repeated cleaning operation in a function. However, the other two datasets have a different column layout from the first one, so we repeat the operation here with adjusted column indices to make sure we get all the columns we want.
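In the cells of this notebook, every column position used for the May and June files is the April position plus 2. Under that observation, a helper with an offset parameter is one way to avoid the repetition. The function name `clean_slice`, its parameters, and the toy frames below are hypothetical; this is a sketch, not the function originally tried.

```python
import pandas as pd

def clean_slice(df, start, stop, offset=0):
    """Strip non-digits from columns [start + offset, stop + offset) and convert to numbers."""
    block = df.iloc[:, start + offset:stop + offset]
    return block.replace(r"\D+", "", regex=True).apply(pd.to_numeric)

# Toy frames: the second one has two extra leading columns, which shifts
# every later column by 2 positions.
wave1 = pd.DataFrame({"A": ["(1) Yes", "(2) No"], "B": ["(3) Often", "(1) Never"]})
wave2 = wave1.copy()
wave2.insert(0, "X1", ["x", "y"])
wave2.insert(1, "X2", ["x", "y"])

c1 = clean_slice(wave1, 0, 2)
c2 = clean_slice(wave2, 0, 2, offset=2)
print(c1.equals(c2))
```

With such a helper, a call like `clean_slice(df2, 38, 60, offset=2)` would select the same positions as `df2.iloc[:, 40:62]`. Selecting columns by name instead of position would be more robust still, if the column names are stable across files.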

In [112]:
# people's response to the coronavirus
df2_PHYS = df2.iloc[:, 40:62].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
# economic related questions
df2_ECON8 = df2.iloc[:, 67:86].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df2_ECON7 = df2.iloc[:, 86:97].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df2_ECON1 = pd.to_numeric(df2.iloc[:, 97].replace(r'\D+', '', regex=True))
# these descriptions mention COVID-19, so keep only the first digit
df2_ECON4 = df2.iloc[:, [99, 101, 102]].replace(r'\D+', '', regex=True).apply(lambda x: pd.to_numeric(x.str[:1]))
df2_ECON6 = df2.iloc[:, 103:115].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df2_ECON5 = df2.iloc[:, [115, 116]].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df2_ECON = pd.concat([df2_ECON8, df2_ECON7, df2_ECON1, df2_ECON4, df2_ECON6, df2_ECON5], axis=1)
# home income
df2_income = df2.iloc[:, 155]

df2_clean = pd.concat([df2_PHYS, df2_ECON, df2_income], axis=1)

### 03_June_9_covid_impact_survey

In [114]:
# people's response to the coronavirus
df3_PHYS = df3.iloc[:, 40:62].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
# economic related questions
df3_ECON8 = df3.iloc[:, 67:86].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df3_ECON7 = df3.iloc[:, 86:97].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df3_ECON1 = pd.to_numeric(df3.iloc[:, 97].replace(r'\D+', '', regex=True))
# these descriptions mention COVID-19, so keep only the first digit
df3_ECON4 = df3.iloc[:, [99, 101, 102]].replace(r'\D+', '', regex=True).apply(lambda x: pd.to_numeric(x.str[:1]))
df3_ECON6 = df3.iloc[:, 103:115].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df3_ECON5 = df3.iloc[:, [115, 116]].replace(r'\D+', '', regex=True).apply(pd.to_numeric)
df3_ECON = pd.concat([df3_ECON8, df3_ECON7, df3_ECON1, df3_ECON4, df3_ECON6, df3_ECON5], axis=1)
# home income
df3_income = df3.iloc[:, 155]

df3_clean = pd.concat([df3_PHYS, df3_ECON, df3_income], axis=1)
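
One possible way to integrate the three cleaned frames is to stack them and record which survey wave each row came from. The sketch below uses small toy frames in place of `df1_clean`, `df2_clean`, and `df3_clean`. The column names `q1` and `income` and the wave labels are hypothetical, and it assumes the three frames share the same column names.

```python
import pandas as pd

# Toy stand-ins for the three cleaned frames (hypothetical column names).
april = pd.DataFrame({"q1": [1, 2], "income": ["a", "b"]})
may = pd.DataFrame({"q1": [2, 2], "income": ["c", "a"]})
june = pd.DataFrame({"q1": [1, 3], "income": ["b", "b"]})

# Stack the waves; the keys become an index level that labels each row's origin.
combined = pd.concat(
    [april, may, june],
    keys=["April", "May", "June"],
    names=["wave", "row"],
)

# Turn the wave label into an ordinary column for grouping and filtering.
combined = combined.reset_index(level="wave")
print(combined.groupby("wave")["q1"].mean())
```

If the column names differ between files, `pd.concat` aligns on names and fills the gaps with missing values, so it is worth comparing the column lists before stacking.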