# Data Cleaning

## Team 14 (Vidisha, Yijin, Yvette)


#### Newly defined questions:

**1. Can socio-economic features such as race, gender, family income, sexual orientation, marital status, and region predict 1) anxiety disorder (ANXEV_A), 2) depression (DEPEV_A), 3) difficulty participating in social activities?**
- Classification (KNN, Naive Bayes) or Logistic Regression


**2. Can features related to physical health conditions such as diabetes, cardiovascular disease, chronic disease, sleep quality, as well as habits such as alcohol consumption, cigarette consumption, or physical activities, predict 1) frequency of feeling worried, nervous, or anxious (ANXFREQ_A), 2) frequency of feeling depressed (DEPFREQ_A)?**
- Classification (KNN, Naive Bayes) or Logistic Regression

**3. Can socio-economic features predict whether an individual will seek mental health treatment?**
- Classification (KNN, Naive Bayes) or Logistic Regression

In [None]:
import pandas as pd

In [18]:
survey = pd.read_csv('adult20.csv') # Reading Data for 2020

survey

Unnamed: 0,URBRRL,RATCAT_A,INCGRP_A,INCTCFLG_A,FAMINCTC_A,IMPINCFLG_A,RJWKCLSOFT_A,RJWCLSNOSD_A,RJWRKCLSSD_A,RECJOBSD_A,...,PHSTAT_A,PROXYREL_A,PROXY_A,AVAIL_A,HHSTAT_A,INTV_MON,RECTYPE,WTFA_A,HHX,POVRATTC_A
0,3,14,5,0,100000,0,,,,,...,2,,,1,1,11,10,4526.109,H066706,6.47
1,3,11,4,0,75000,0,,,,,...,2,,,1,1,8,10,12809.039,H034928,3.64
2,3,14,4,0,90000,0,,,,,...,3,,,1,1,8,10,10322.534,H018289,6.76
3,3,11,3,0,65000,0,,,,,...,1,,,1,1,3,10,7743.375,H006876,3.79
4,3,8,1,0,25762,2,,,,,...,3,,,1,1,6,10,4144.724,H028842,2.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31563,4,13,4,0,79000,0,,,,,...,3,,,1,1,2,10,2857.585,H065697,4.61
31564,4,11,3,0,60000,0,,,,,...,3,,,1,1,2,10,2994.763,H061937,3.50
31565,4,8,1,0,27500,0,,,,,...,2,,,1,1,2,10,1328.907,H005331,2.24
31566,4,8,3,0,61880,0,,,,,...,2,,,1,1,2,10,3481.003,H047025,2.38


### Step 1: Dealing with missing values

Because we are using survey data, many attributes have missing values due to non-response. To deal with these, we keep only the columns in which fewer than 10% of the observations are missing.

We consider 10% a reasonable threshold for a data set with 31,568 observations. Within the columns we keep, we then drop every row that still contains a missing value, since each of these columns is missing only a small proportion of the total observations.

In [38]:
null = survey.isna().sum()/ len(survey) # Proportion of missing values in every attribute
keep = list(null[null < 0.1].index) # Attributes with less than 10% missing observations

survey_cl = survey[keep]

### Remaining Missing Values


In [74]:
percent = null[keep] #quantifying proportion of missing observations
df_miss = pd.DataFrame({'Column Name': keep,
                        'Percent Missing': round(percent * 100, 2)}) #displaying variable name and proportion of missing observations

df_miss.sort_values(by = ['Percent Missing'], ascending = False).reset_index(drop = True).head(20)

Unnamed: 0,Column Name,Percent Missing
0,WLK13M_A,9.37
1,WLK100_A,8.79
2,STEPS_A,8.79
3,USPLKIND_A,8.14
4,WRKHLTHFC_A,7.82
5,HINOTYR_A,7.23
6,INCSSRR_A,3.71
7,INCOTHR_A,3.71
8,INCRETIRE_A,3.71
9,INCWELF_A,3.71


In [80]:
df_org = survey_cl.dropna().reset_index(drop = True) #Dropping missing values in remaining variables
print("Number of Missing Values: %1.0f"%df_org.isnull().sum().sum())

Number of Missing Values: 0
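
Because `dropna()` removes a row when any retained column is missing, the share of rows lost can be larger than the missing rate of any single column. The following is a minimal sketch of how one might report that share. The toy frame, its column names, and the helper `rows_retained` are illustrative assumptions, not part of the survey data or the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for survey_cl; values are invented for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10],
    "b": [1, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],
    "c": [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],
})

def rows_retained(frame: pd.DataFrame) -> float:
    """Fraction of rows that have no missing value in any column."""
    return len(frame.dropna()) / len(frame)

per_column_missing = toy.isna().mean()   # each column is 10% missing
retained = rows_retained(toy)            # yet 3 of the 10 rows are dropped
print(per_column_missing.max(), retained)
```

On the real data the same check would be `rows_retained(survey_cl)`, which shows how much of the sample the row-wise drop actually costs.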


In [82]:
df_org.head()

Unnamed: 0,URBRRL,RATCAT_A,INCGRP_A,INCTCFLG_A,FAMINCTC_A,IMPINCFLG_A,PPSU,PSTRAT,HISPALLP_A,RACEALLP_A,...,CHLEV_A,HYPEV_A,PHSTAT_A,AVAIL_A,HHSTAT_A,INTV_MON,RECTYPE,WTFA_A,HHX,POVRATTC_A
0,3,14,5,0,100000,0,2,103,2,1,...,2,2,2,1,1,11,10,4526.109,H066706,6.47
1,3,11,3,0,65000,0,2,103,1,1,...,2,2,1,1,1,3,10,7743.375,H006876,3.79
2,3,9,2,0,36000,0,2,103,2,1,...,1,1,4,1,1,6,10,3164.668,H004811,2.93
3,3,3,1,0,30105,2,2,103,2,1,...,2,2,2,8,1,6,10,8423.396,H068043,0.98
4,2,2,1,0,9000,1,2,115,3,2,...,2,2,4,1,1,12,10,3231.773,H028696,0.73


### Step 2: Removing irrelevant columns

Our original dataset included over 200 variables. In this step, we manually selected variables using the dataset codebook (https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2020/adult-codebook.pdf) and created a csv file (relevant_variables.csv) with all the variables relevant to our study topic. We took this approach because many of the variables were similar. For example, there were several variables related to food insecurity (e.g., Couldn't afford to eat balanced meals; food didn't last; worry food would run out; receive food stamps, past 12m). For each group of similar variables, we chose the one that captured the most information.

Many of the remaining features were irrelevant to our research questions, so we eliminated those manually as well.

In [117]:
rv = pd.read_csv('relevant_variables.csv') 
df = df_org[list(rv["Variable Names"])]

print(df.describe())
df.head(10)

             AGEP_A       ANXEV_A     ANXFREQ_A      ANXMED_A      ARTHEV_A  \
count  21890.000000  21890.000000  21890.000000  21890.000000  21890.000000   
mean      54.106898      1.859479      3.616583      1.879488      1.755505   
std       17.604185      0.408785      1.368223      0.395050      0.485406   
min       18.000000      1.000000      1.000000      1.000000      1.000000   
25%       40.000000      2.000000      3.000000      2.000000      1.000000   
50%       56.000000      2.000000      4.000000      2.000000      2.000000   
75%       68.000000      2.000000      5.000000      2.000000      2.000000   
max       99.000000      9.000000      9.000000      9.000000      9.000000   

           BMICAT_A       CHDEV_A       CHLEV_A   COGMEMDFF_A      COPDEV_A  \
count  21890.000000  21890.000000  21890.000000  21890.000000  21890.000000   
mean       3.080219      1.958657      1.691000      1.181270      1.957972   
std        1.172404      0.357119      0.571848    

Unnamed: 0,AGEP_A,ANXEV_A,ANXFREQ_A,ANXMED_A,ARTHEV_A,BMICAT_A,CHDEV_A,CHLEV_A,COGMEMDFF_A,COPDEV_A,...,SEX_A,SLPFLL_A,SLPHOURS_A,SLPREST_A,SLPSTY_A,SMKCIGST_A,SOCSCLPAR_A,STRFREQW_A,URBRRL,VIGFREQW_A
0,85,2,5,2,2,2,2,2,2,2,...,1,2,9,4,1,4,1,94,3,94
1,32,2,3,2,2,2,2,2,1,2,...,1,1,8,3,2,4,1,3,3,5
2,70,2,4,2,1,3,2,1,1,2,...,2,2,7,2,2,4,1,2,3,94
3,32,2,1,2,2,4,2,2,1,2,...,2,1,8,1,1,4,1,94,3,94
4,77,2,5,2,2,9,2,2,1,2,...,2,2,8,2,2,4,1,94,2,94
5,71,2,5,2,2,3,2,2,1,2,...,1,1,8,4,1,4,1,94,2,94
6,60,2,1,2,1,3,2,2,1,2,...,2,4,7,2,2,4,1,94,2,94
7,62,2,5,2,1,4,2,2,1,2,...,1,1,6,4,1,4,1,94,2,94
8,58,2,4,2,2,3,2,2,1,2,...,2,2,6,3,2,4,1,94,4,2
9,56,1,3,1,1,4,2,2,1,2,...,1,1,6,3,1,3,1,94,4,0
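
Since Step 1 removes every column with 10% or more missing values, a variable listed in relevant_variables.csv may no longer exist in `df_org`, and indexing with a missing name raises a `KeyError`. Below is a hedged sketch of a guard for that case. The helper `select_existing` and the toy frame are assumptions made for illustration; only the column names are borrowed from the notebook.

```python
import pandas as pd

def select_existing(frame: pd.DataFrame, wanted: list):
    """Return the wanted columns that exist in `frame`, plus the names not found."""
    present = [name for name in wanted if name in frame.columns]
    missing = [name for name in wanted if name not in frame.columns]
    return frame[present], missing

# Toy data with invented values.
toy = pd.DataFrame({"AGEP_A": [85, 32], "SEX_A": [1, 1], "HHX": ["H1", "H2"]})
subset, not_found = select_existing(toy, ["AGEP_A", "SEX_A", "WLK13M_A"])
print(list(subset.columns), not_found)
```

In the notebook this would be called as `select_existing(df_org, list(rv["Variable Names"]))`, and a non-empty second return value would flag variables that were lost in Step 1.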


### Step 3: Transforming Categorical Variables

Below, we convert the categorical features into dummy variables.

In [178]:
#check which variables are categorical
rv_cat = rv[rv['Type'] == 'Categorical']

df_cat = df[rv_cat['Variable Names'].tolist()]
print("Data Type of the categorical variables (Before cleaning)\n")
print('=' * 22)
print(df_cat.dtypes)
print('=' * 22)

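# Note that INCWELF_A is float64, most likely because it contained NaNs when
# the CSV was read (pandas stores integer columns with NaN as floats).
# Those rows were dropped in Step 1, so cast every categorical column to integer.
df_cate = df_cat.astype('int64')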
df_cate

Data Type of the categorical variables (Before cleaning)

ANXEV_A          int64
ANXFREQ_A        int64
ANXMED_A         int64
ARTHEV_A         int64
BMICAT_A         int64
CHDEV_A          int64
CHLEV_A          int64
COGMEMDFF_A      int64
COPDEV_A         int64
DEMENEV_A        int64
DEPEV_A          int64
DEPFREQ_A        int64
DEPMED_A         int64
DIBEV_A          int64
DRKSTAT_A        int64
EDUC_A           int64
FGEFRQTRD_A      int64
FUNWLK_A         int64
HYPEV_A          int64
INCGRP_A         int64
INCWELF_A      float64
MARSTAT_A        int64
MHTHRPY_A        int64
NOTCOV_A         int64
ORIENT_A         int64
PEOPLEWLK_A      int64
RACEALLP_A       int64
REGION           int64
SCHCURENR_A      int64
SEX_A            int64
SLPFLL_A         int64
SLPHOURS_A       int64
SLPREST_A        int64
SLPSTY_A         int64
SMKCIGST_A       int64
SOCSCLPAR_A      int64
URBRRL           int64
dtype: object


Unnamed: 0,ANXEV_A,ANXFREQ_A,ANXMED_A,ARTHEV_A,BMICAT_A,CHDEV_A,CHLEV_A,COGMEMDFF_A,COPDEV_A,DEMENEV_A,...,REGION,SCHCURENR_A,SEX_A,SLPFLL_A,SLPHOURS_A,SLPREST_A,SLPSTY_A,SMKCIGST_A,SOCSCLPAR_A,URBRRL
0,2,5,2,2,2,2,2,2,2,2,...,3,2,1,2,9,4,1,4,1,3
1,2,3,2,2,2,2,2,1,2,2,...,3,2,1,1,8,3,2,4,1,3
2,2,4,2,1,3,2,1,1,2,2,...,3,2,2,2,7,2,2,4,1,3
3,2,1,2,2,4,2,2,1,2,2,...,3,2,2,1,8,1,1,4,1,3
4,2,5,2,2,9,2,2,1,2,2,...,3,2,2,2,8,2,2,4,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21885,2,4,2,1,4,2,2,1,2,2,...,4,2,1,2,8,3,1,3,1,4
21886,2,4,2,2,2,2,1,1,2,2,...,4,2,2,1,7,3,2,3,1,4
21887,2,4,2,2,4,2,1,1,2,2,...,4,2,2,2,8,3,2,4,1,4
21888,2,4,2,1,4,2,2,2,2,2,...,4,2,2,1,6,3,2,3,1,4
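
`astype('int64')` truncates any fractional part silently, so it can be worth confirming that a float column such as INCWELF_A holds only whole numbers before casting. This is a small sketch under that assumption. The helper `to_int_if_whole` is hypothetical and the values are invented.

```python
import pandas as pd

def to_int_if_whole(col: pd.Series) -> pd.Series:
    """Cast to int64 only when every value is a whole number; otherwise raise."""
    if col.isna().any():
        raise ValueError(f"{col.name} contains missing values")
    if not (col == col.round()).all():
        raise ValueError(f"{col.name} has non-integer values")
    return col.astype("int64")

toy = pd.Series([2.0, 1.0, 2.0], name="INCWELF_A")  # invented values
converted = to_int_if_whole(toy)
print(converted.dtype)
```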


In [181]:
# Non-reporting codes in the categorical variables:
# 7: Refused
# 8: Not Ascertained
# 9: Don't know
# These carry no information for the model, so either drop them
# or pool the three codes into a single class with a value of 0.

# First, check whether 0 already has a substantive meaning in any column
zero = df_cate.min() == 0
for col in zero[zero].index:
    print(col, 'has a minimum value of 0.')

# From the codebook, we know that:
# EDUC_A = 0:
#   Variable Description: Educational level of sample adult
#   Category Description: Never attended/kindergarten only
#            Frequency  : 59
#            Percent    : 0.2

# Caution: this replaces 7, 8 and 9 in every column, including any column
# where they are valid categories, and in EDUC_A the pooled 0 becomes
# indistinguishable from "Never attended/kindergarten only".
df_category = df_cate.replace([7, 8, 9], 0)

EDUC_A has a minimum value of 0.
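
A blanket `replace([7, 8, 9], 0)` also touches columns where 7, 8 and 9 appear to be legitimate values (SLPHOURS_A shows 7, 8 and 9 in the preview above), and it collides with the meaningful 0 in EDUC_A. One possible alternative is to recode column by column. The sketch below assumes a hand-built mapping `NONRESPONSE` and a sentinel of -1. Both are hypothetical, and every code set would need to be confirmed in the codebook before use.

```python
import pandas as pd

# Hypothetical per-column non-response codes; confirm each set in the codebook.
NONRESPONSE = {
    "ANXEV_A": [7, 8, 9],
    "SLPHOURS_A": [97, 98, 99],
}
SENTINEL = -1  # assumed marker, chosen because 0 is a real category in EDUC_A

def recode_nonresponse(frame, codes, sentinel=SENTINEL):
    """Replace only the listed codes, column by column; other columns are untouched."""
    out = frame.copy()
    for col, bad in codes.items():
        if col in out.columns:
            out[col] = out[col].replace(bad, sentinel)
    return out

# Toy data with invented values.
toy = pd.DataFrame({
    "ANXEV_A": [1, 2, 9, 7],
    "SLPHOURS_A": [7, 8, 9, 99],
    "EDUC_A": [0, 7, 8, 9],
})
recoded = recode_nonresponse(toy, NONRESPONSE)
print(recoded)
```

Columns missing from the mapping, such as EDUC_A in this toy frame, pass through unchanged, so each variable is recoded only after someone has checked its codes.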


###### LADIES, HELP ME FIGURE THIS OUT!!!
###### My concern here is: when `pd.get_dummies(drop_first = True)` automatically drops the first (base) class, what counts as the 'first class'? Is it the class with the lowest frequency? Or, if it is the class with the lowest numeric code (class = 0), then the 'non-reporting' class will be dropped, and the remaining dummy variables will be highly collinear.
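
On the question of which class is "first": `drop_first=True` drops the first category level, and for a column converted with `astype('category')` the levels are sorted, so the dropped level is the smallest code rather than the least frequent one. The toy example below is a sketch for checking this. The column `X` and its values are invented, and the manual baseline at the end is just one option.

```python
import pandas as pd

# Toy column: code 0 is the most frequent, code 1 the least frequent.
toy = pd.DataFrame({"X": [0, 0, 0, 1, 2, 2]}).astype("category")

# Category order that get_dummies follows (sorted codes).
print(toy["X"].cat.categories.tolist())

# drop_first removes the first level in that order, here code 0.
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())

# To choose the baseline yourself, keep all levels and drop one column by name.
full = pd.get_dummies(toy)
manual = full.drop(columns=["X_2"])
print(manual.columns.tolist())
```

Keeping all levels and dropping a named column makes the reference class an explicit choice instead of a side effect of the coding scheme.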

In [182]:
# Transform into dummy variables
df_dummy = pd.get_dummies(df_category.astype('category'), drop_first = True)

df_dummy

Unnamed: 0,ANXEV_A_2,ANXFREQ_A_2,ANXFREQ_A_3,ANXFREQ_A_4,ANXFREQ_A_5,ANXMED_A_2,ARTHEV_A_2,BMICAT_A_2,BMICAT_A_3,BMICAT_A_4,...,SMKCIGST_A_2,SMKCIGST_A_3,SMKCIGST_A_4,SMKCIGST_A_5,SOCSCLPAR_A_2,SOCSCLPAR_A_3,SOCSCLPAR_A_4,URBRRL_2,URBRRL_3,URBRRL_4
0,1,0,0,0,1,1,1,1,0,0,...,0,0,1,0,0,0,0,0,1,0
1,1,0,1,0,0,1,1,1,0,0,...,0,0,1,0,0,0,0,0,1,0
2,1,0,0,1,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
3,1,0,0,0,0,1,1,0,0,1,...,0,0,1,0,0,0,0,0,1,0
4,1,0,0,0,1,1,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21885,1,0,0,1,0,1,0,0,0,1,...,0,1,0,0,0,0,0,0,0,1
21886,1,0,0,1,0,1,1,1,0,0,...,0,1,0,0,0,0,0,0,0,1
21887,1,0,0,1,0,1,1,0,0,1,...,0,0,1,0,0,0,0,0,0,1
21888,1,0,0,1,0,1,0,0,0,1,...,0,1,0,0,0,0,0,0,0,1


### Step 5: Visualization

**Yijin: Correlation Matrix/ Heatmap**

**Yvette: Barplot/Boxplot on: gender/sexual orientation/races against count(depression/anxiety == True)**

### Step 6: Preliminary Test

**Vidisha: train one model**