# Wrangling NHANES Data

## Summary

### Quantify Missing Data

In this notebook we examine the quality of the data, specifically how much of it is missing. We drop rows and columns that are missing an excessive number of values. We also single out columns that need special handling for feature engineering.

### Split train, validation, test data

Once the dataset is rid of rows and columns missing excessive values, we split the remaining data into train, validation, and test sets.

### Impute Missing Data

We will see that some data is still missing. Decisions about data imputation are made using the training set only and then applied to the validation and test sets.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import GroupImputer

In [None]:
df = pd.read_pickle("preprocessed_data.pkl")

In [None]:
n,m = df.shape

print(f'The dataframe consists of {n} rows and {m} columns.')

## View missing data by column

In [None]:
def col_frac_missing(df, threshold = 0.05):
    fraction_null = df.isnull().sum()/len(df)
    plt.figure(figsize=(16,8))
    plt.xticks(np.arange(len(fraction_null)),fraction_null.index,rotation='vertical')
    plt.ylabel('fraction of rows with missing data')
    plt.bar(np.arange(len(fraction_null)),fraction_null)
    plt.axhline(2*threshold,linewidth=2, color='r')
    plt.axhline(threshold,linewidth=2, color='g')
    plt.title('Proportion of missing values by column.')
    plt.show()
    return fraction_null

In [None]:
column_fraction_null = col_frac_missing(df,threshold = 0.05)

We can see that quite a few columns have missing data above the 5% and 10% thresholds (the green and red lines). The standout column is DiabAge -- the age at which a person was diagnosed with diabetes -- but of course if someone was never diagnosed this value is missing by design. We therefore come back to this column later. Next we will attempt to drop the rows missing the most values.
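To complement the plot, the columns above a threshold can also be listed directly. The following is a minimal sketch on a toy dataframe; the column names `a`, `b`, `c` and the threshold of 0.3 are illustrative assumptions, not values from the NHANES data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the NHANES data (values are illustrative only).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

threshold = 0.3  # illustrative cutoff
fraction_null = toy.isnull().mean()  # equivalent to isnull().sum() / len(toy)

# Columns whose missing fraction exceeds the threshold, worst first.
over_threshold = fraction_null[fraction_null > threshold].sort_values(ascending=False)
print(over_threshold)
```

The same filter could be applied to the `fraction_null` series returned by `col_frac_missing`.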

## View missing data by row

In [None]:
# Count missing values per row, sorted from most to fewest
n_null_per_row = df.isnull().sum(axis=1).sort_values(ascending = False)

n_null_per_row.reset_index(drop=True).plot()
plt.title('Number of missing values per row (sorted).')
plt.show()

There seem to be roughly 5000 SPs (sample persons) missing well over 5 values. We will investigate whether dropping such rows reduces the proportion of missing data by column.
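Before committing to a cutoff, it can help to count how many rows each candidate cutoff would remove. This is a small sketch on toy data; the cutoffs 1, 2, and 3 are illustrative and were not used in the analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with a different number of missing values in each row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

missing_per_row = toy.isnull().sum(axis=1)

# Number of rows that would be dropped under each candidate cutoff
# (a row is dropped when it is missing `cutoff` or more values).
dropped = {cutoff: int((missing_per_row >= cutoff).sum()) for cutoff in (1, 2, 3)}
print(dropped)
```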

In [None]:
# Drop SPs missing 5 or more values
df_depleted = df[df.isnull().sum(1)<5]

In [None]:
column_fraction_null = col_frac_missing(df_depleted)

We can see that LBXGLU, LDL, and Triglicerides are missing a very large number of values. These columns are not part of our predictive analysis but will be used in exploratory analysis, so we leave them in for now. FastFood and PregnantNow are still missing a large number of values, so we will drop them from the original dataframe.

## Drop columns with many missing values

In [None]:
# Drop columns from original dataframe missing many values
df.drop(['PregnantNow','FastFood'],axis=1,inplace=True)
# Drop SPs missing 5 or more values
df_depleted = df[df.isnull().sum(1)<5].copy()
column_fraction_null = col_frac_missing(df_depleted)

The columns Alcohol and CholHist still seem to be missing a high number of values. Let us view the portion missing for these columns.

In [None]:
print('Portion missing:')
column_fraction_null.loc[['Alcohol','CholHist']]

Alcohol is missing a high proportion of values at 8.2%. Unfortunately there is no similar feature to stand in for it, so it is better not to remove it. CholHist (whether a doctor has told you that you have high cholesterol) likely correlates well with HyperHist (whether a doctor has told you that you have hypertension). Let us view the correlation.

In [None]:
corr = (df_depleted[['HyperHist','CholHist']].corr()).iloc[0,1]

print(f'The correlation coefficient for Hypertensive History with Cholesterol History is {corr:.3}.')

Although the correlation appears high, the proportion of missing values in CholHist is only borderline, so we will keep the feature.

## Fixing the DiabAge variable

In [None]:
x = df['DiabHist'].value_counts()
print(f'SPs not told they have diabetes: {x[0.0]}')
print(f'SPs told they have diabetes: {x[1.0]}')

The majority of SPs have not been diagnosed with diabetes, which explains the large number of missing values in DiabAge. We will construct a new feature DiabHistAge which combines these two variables.

In [None]:
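# Encode diabetes history: 0 = never diagnosed (DiabAge missing),
# 1 = diagnosed at age 50 or younger, 2 = diagnosed after age 50
df_depleted['DiabHistAge'] = 0
df_depleted.loc[(df_depleted['DiabAge'] <= 50),'DiabHistAge'] = 1
df_depleted.loc[(df_depleted['DiabAge'] > 50),'DiabHistAge'] = 2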
df_depleted.drop(['DiabHist','DiabAge'],axis=1,inplace=True)
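A combined feature of this kind can also be built in a single vectorized step. The sketch below assumes the encoding is 0 = never diagnosed, 1 = diagnosed at age 50 or younger, 2 = diagnosed after age 50; the cutoff of 50 comes from the cell above, while the exact encoding is an assumption. It runs on a toy dataframe rather than the NHANES data.

```python
import numpy as np
import pandas as pd

# Toy ages at diagnosis; NaN means the person was never diagnosed.
toy = pd.DataFrame({"DiabAge": [np.nan, 35.0, 50.0, 62.0]})

# Comparisons against NaN evaluate to False, so undiagnosed rows fall
# through to the default value of 0.
conditions = [toy["DiabAge"] <= 50, toy["DiabAge"] > 50]
toy["DiabHistAge"] = np.select(conditions, [1, 2], default=0)
print(toy)
```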

Again, let us view the missing values by column.

In [None]:
column_fraction_null = col_frac_missing(df_depleted)

In [None]:
n,m=df_depleted.shape

print(f'The dataframe consists of {n} rows and {m} columns.')

## Split data

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_depleted, test_size = 0.2)
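The cell above produces a train and a test set only. If a separate validation set is needed, one option is to call `train_test_split` twice. The following is a sketch on toy data; the 60/20/20 proportions and the `random_state` value are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows standing in for the wrangled data.
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

# First hold out 20% for testing, then take 25% of the remaining 80%
# for validation, which gives a 60/20/20 split overall.
train_val, test = train_test_split(toy, test_size=0.2, random_state=0)
train, val = train_test_split(train_val, test_size=0.25, random_state=0)

print(len(train), len(val), len(test))
```

Fixing `random_state` makes the split reproducible across runs.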

## Imputations

In [None]:
print('Missing proportions in demographic columns')
column_fraction_null[['Age','Gender','Ethnicity']]

No values are missing from these demographic columns, so we will attempt to impute within the groups they define. First we check which columns are missing sufficiently few values in every group. We convert Age to AgeGroup (20-year bands) for imputation.

In [None]:
df_train = df_train.join((df_train['Age'].apply(lambda x: np.floor(x/20))).rename('AgeGroup')) 
df_test = df_test.join((df_test['Age'].apply(lambda x: np.floor(x/20))).rename('AgeGroup'))

demo = ['Gender','AgeGroup','Ethnicity']

# Find the proportion of non-missing values per demographic group
# (count of non-null entries divided by the largest count in that group)
min_prop = (df_train.groupby(by = demo).count().apply(lambda x: x/max(x),axis = 1)).min()
max_prop = (df_train.groupby(by = demo).count().apply(lambda x: x/max(x),axis = 1)).max()
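To make the per-group calculation concrete, here is a small sketch on toy data. It divides by the group size from `size()` rather than by the row-wise maximum used above; the two agree whenever at least one column is complete within each group. The toy columns `Alcohol` and `Smoker` and their values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Gender": [1, 1, 1, 1, 2, 2],
    "Alcohol": [1.0, np.nan, 0.0, 1.0, 1.0, 0.0],
    "Smoker": [0.0, 1.0, 1.0, 0.0, np.nan, 1.0],
})

grouped = toy.groupby("Gender")

# count() ignores NaN while size() does not, so their ratio is the
# fraction of values present in each group.
present = grouped.count().div(grouped.size(), axis=0)

# Worst-case fraction present across groups, per column.
min_present = present.min()
print(min_present)
```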


In [None]:
print('Columns requiring imputation missing 5% data or less in all demo groups.')
min_prop[(min_prop >= 0.95) & (min_prop < 1)]

In [None]:
print('Columns missing over 5% data in at least one demo group.')
min_prop[min_prop < 0.95]

In [None]:
# Impute by demographic group for columns missing 5% or less in every group
demo_impute = min_prop[(min_prop >= 0.95) & (min_prop < 1)].index

for col in demo_impute:
    DemoImputer = GroupImputer(demo, col, metric = 'mode')
    DemoImputer.fit(df_train)
    df_train = pd.DataFrame(DemoImputer.transform(df_train),columns = df_train.columns)
    df_test = pd.DataFrame(DemoImputer.transform(df_test),columns = df_test.columns)
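`GroupImputer` comes from the local `utils` module and its implementation is not shown here. As a rough illustration of the idea only, the sketch below learns the per-group mode on the training set and uses it to fill missing values in another set. The helper names `fit_group_modes` and `apply_group_modes` are hypothetical and are not part of `utils`.

```python
import numpy as np
import pandas as pd

def fit_group_modes(train, group_cols, target):
    """Return the most frequent non-missing value of `target` per group."""
    return (
        train.dropna(subset=[target])
        .groupby(group_cols)[target]
        .agg(lambda s: s.mode().iloc[0])
        .rename("_mode")
        .reset_index()
    )

def apply_group_modes(df, modes, group_cols, target):
    """Fill missing `target` values with modes learned from the training set."""
    out = df.copy()
    # A left merge keeps the row order of `out`; unseen groups yield NaN.
    lookup = out[group_cols].merge(modes, on=group_cols, how="left")["_mode"].to_numpy()
    out[target] = out[target].where(out[target].notna(), lookup)
    return out

toy_train = pd.DataFrame({
    "Gender": [1, 1, 1, 2, 2, 2],
    "Alcohol": [1.0, 1.0, np.nan, 0.0, np.nan, 0.0],
})
toy_test = pd.DataFrame({"Gender": [1, 2], "Alcohol": [np.nan, np.nan]})

modes = fit_group_modes(toy_train, ["Gender"], "Alcohol")
train_filled = apply_group_modes(toy_train, modes, ["Gender"], "Alcohol")
test_filled = apply_group_modes(toy_test, modes, ["Gender"], "Alcohol")
print(test_filled)
```

Fitting on the training set only and reusing the learned modes keeps information from the test set out of the imputation.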

### Check the data types.

Most variables in the dataset are categorical. We ensure that such variables are set as integers.

In [None]:
df_depleted.columns

In [None]:
# Subset of continuous variables:
float_vars = {'WTINT2YR',
                'HoursSlept',
                'MaxWeight',
                'LegLen',
                'ArmCirc',
                'ArmLen',
                'Weight',
                'Systolic',
                'Diastolic'
            }

int_vars = (df_depleted.columns).difference(float_vars)

int_vars = {var:'int8' for var in int_vars}

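# astype returns a new dataframe, so assign the result back.
# Casting to int8 requires that these columns contain no missing values.
df_depleted = df_depleted.astype(int_vars)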
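A plain `int8` cast raises an error when a column still contains NaN. If some categorical columns have not been imputed, one alternative is the pandas nullable integer dtype `Int8`, which keeps missing entries as `<NA>`. This is a sketch on toy data and not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Gender": [1.0, 2.0, np.nan],    # categorical code with a missing value
    "Weight": [70.5, 82.1, 64.0],    # continuous, left as float
})

# The nullable "Int8" extension dtype stores the NaN as <NA> instead of failing.
toy = toy.astype({"Gender": "Int8"})
print(toy.dtypes)
```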


## Finally we save the data

Most of the columns have low numbers of missing values, aside from the laboratory data, which we keep only to study in the EDA, not for the purpose of predictive analysis.

In [None]:
df_depleted.to_pickle("wrangled_data.pkl")