# Preprocessing NHANES Data

## Summary

In this notebook we take the first step towards wrangling data collected from the NHANES repository.

### Steps taken in this notebook

#### Dropping participants aged 20 and under
We are interested in hypertension, which typically becomes a concern among older patients.


#### Treating refused / don't know as missing
NHANES interviewers record when a sample person (SP) refuses to answer or responds 'don't know'. These answers are coded as numbers that are documented on the NHANES website. There are too few of these values overall to treat them as a separate category, so we treat them like the other missing values in the data.


#### Filling in cells skipped by design in the survey (dependent column)
The NHANES survey skips some questions based on previous answers. The resulting missing values should be filled in with the answer implied by the earlier response.

* For example, if the answer to the question 'Have you smoked 100 cigarettes in your lifetime?' is no, then the following question 'Are you currently smoking?' is skipped. In this case we fill in missing values as 'no'.


#### Combining similar columns
The questions in the NHANES survey have changed over the years. When the wording of a question changes slightly, the variable is renamed. We combine such variables into single columns. The combined columns are described below.

  * For example, the alcohol survey question 'Was there ever a period of your life when you drank 5 alcoholic drinks per day?' was changed to 'Was there ever a period of your life when you drank 4/5 alcoholic drinks per day?' (4 for Women, 5 for Men). These columns will be merged into a single column.

#### Recoding
Some columns are coded in such a way that is unusual or counterproductive for analysis. We recode these columns.

  * For example, yes / no questions are coded yes = 1 and no = 2; we recode no = 0, as is typical in data analysis. As another example, SPs diagnosed with diabetes at or before age 1 are coded as 666; we recode these as 1.

#### Rename columns
The NHANES repository names are replaced with more descriptive names.  

#### Average Systolic and Diastolic blood pressure measurements
Each SP's systolic and diastolic blood pressure is measured up to 3 times. We average the systolic readings and the diastolic readings separately.


In [1]:
import pandas as pd
import numpy as np
import json 

In [2]:
df = pd.read_pickle("raw_data.pkl")

## Drop SPs age 20 and below

In [3]:
df = df[df['RIDAGEYR'] > 20]

## Recode refused/don't know to missing

In [4]:
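with open('config/encode_missing.json', 'r') as f:
    encode_missing = json.load(f)

# Set refused / don't know codes to NaN in their own column only,
# leaving the rest of the row untouched
for col, codes in encode_missing.items():
    df.loc[df[col].isin(codes), col] = np.nan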
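
As a small illustration of this recoding step, the sketch below applies a per-column code mapping to a toy DataFrame. The layout of `config/encode_missing.json` is not shown in this notebook, so the mapping used here (column name to a list of codes) and the toy values are assumptions.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical mapping: column name -> codes meaning refused / don't know
encode_missing = json.loads('{"SMQ020": [7, 9], "OCQ180": [77777, 99999]}')

# Toy data (assumed values), stored as floats so NaN can be assigned
toy = pd.DataFrame({
    "SMQ020": [1.0, 2.0, 7.0, 9.0],
    "OCQ180": [40.0, 99999.0, 35.0, 77777.0],
})

for col, codes in encode_missing.items():
    # Restrict the assignment to `col` so other columns in the row are kept
    toy.loc[toy[col].isin(codes), col] = np.nan
```

Passing the column label as the second argument to `.loc` is what keeps valid answers in the same row from being overwritten.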

## Combine similar columns

DMDHREDU (Education questionnaire) will be recoded as

Education:

  * 1 -- Less than high school
  * 2 -- GED / high school graduate
  * 3 -- College graduate or higher

OHQ011 (Dental health) requires a recoding to align with OHQ845. Both are ordinal variables for similar questions. 

In [5]:
# Before combination, some values must be recoded

# Recode DMDHREDU:
df.loc[(df['DMDHREDU'] == 2),'DMDHREDU'] = 1
df.loc[df['DMDHREDU'].isin([3,4]),'DMDHREDU'] = 2
df.loc[(df['DMDHREDU'] == 5),'DMDHREDU'] = 3

# Recode OHQ011:
df['OHQ011'] = df['OHQ011'] - 10


### Combine:
#### ALQ150 and ALQ151
#### DMDHREDU and DMDHREDZ
#### SLD010H and SLD012
#### OHQ011 and OHQ845

In [6]:
df.loc[df['ALQ150'].isna(), 'ALQ150'] = df['ALQ151']
df = df.drop(['ALQ151'],axis = 1)

df.loc[df['DMDHREDU'].isna(), 'DMDHREDU'] = df['DMDHREDZ']
df = df.drop(['DMDHREDZ'],axis = 1)

df.loc[df['SLD010H'].isna(), 'SLD010H'] = df['SLD012']
df = df.drop(['SLD012'],axis = 1)

df.loc[df['OHQ011'].isna(), 'OHQ011'] = df['OHQ845']
df = df.drop(['OHQ845'],axis = 1)
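
The four merges above follow one pattern: keep the first column where it has a value, otherwise take the value from its counterpart, then drop the counterpart. A minimal sketch of that pattern on toy data, using `Series.fillna` as an equivalent alternative to the `.loc` assignment. The helper name `merge_columns` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def merge_columns(frame, keep, other):
    """Fill gaps in `keep` from `other`, then drop `other` (hypothetical helper)."""
    out = frame.copy()
    out[keep] = out[keep].fillna(out[other])
    return out.drop(columns=[other])


# Toy data (assumed values): each SP answered at most one of the two questions
toy = pd.DataFrame({
    "ALQ150": [1.0, np.nan, np.nan],
    "ALQ151": [np.nan, 2.0, np.nan],
})
merged = merge_columns(toy, "ALQ150", "ALQ151")
```

Rows where both columns are missing stay missing after the merge.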

### Fix dependent columns

The column 'DBD895' gives the number of meals the SP ate out, and the column 'DBD900' gives the number of fast food meals out. If 'DBD895' is zero, we set 'DBD900' to zero.

The column 'OCD150' records whether the SP worked last week, and the column 'OCQ180' records the number of hours worked. If the SP did not work last week, the number of hours is set to zero.

The column 'SMQ020' records whether the SP has smoked 100 cigarettes in their life, and the column 'SMQ040' records whether the SP is currently a smoker. If the SP has not smoked 100 cigarettes, we categorize them as a non-smoker.

In [7]:

# For the number of fast food meals we will fill in 0 if no meals were eaten out
df.loc[(df['DBD895'] == 0 ),'DBD900'] = 0    

# If SP was not working last week, fill in zero hours worked
df.loc[df['OCD150'].isin([2,3,4]),'OCQ180'] = 0    
df = df.drop(['OCD150'],axis = 1)

# If SP has not smoked 100 cigarettes, then not currently smoking
df.loc[(df['SMQ020'] > 1),'SMQ040'] = 3   
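
The skip-pattern fill can be sketched on toy data as follows. Unlike the cell above, this variant also checks that the follow-up answer is actually missing before filling it; that extra condition and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # 1 = yes, 2 = no (smoked 100 cigarettes in lifetime)
    "SMQ020": [1.0, 2.0, 2.0, 1.0],
    # Follow-up question, skipped when SMQ020 is 'no'
    "SMQ040": [1.0, np.nan, np.nan, np.nan],
})

# Only fill rows where the follow-up was skipped by design
skipped = (toy["SMQ020"] == 2) & toy["SMQ040"].isna()
toy.loc[skipped, "SMQ040"] = 3  # same 'not currently smoking' code as the cell above
```

The last row keeps its missing value: that SP was eligible for the follow-up question, so the gap is a genuine non-response and not a skip.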

Column CDQ001 records whether the SP has ever had pain or discomfort in the chest. Column CDQ010 records whether the SP has shortness of breath on stairs or inclines. Neither question was asked of SPs under 40, so we code these values as 'no' (2).

In [8]:
df.loc[(df['RIDAGEYR'] < 40),'CDQ001'] = 2
df.loc[(df['RIDAGEYR'] < 40),'CDQ010'] = 2

### Recoding

#### Diabetes

DID040 -- Age at diagnosis.
A value of 666 in DID040 encodes SPs diagnosed with diabetes at or below age 1; we replace this code with 1.


DIQ010 -- Diagnosis.
SPs not told they have diabetes are coded 0. SPs told they have diabetes or borderline diabetes are coded 1.

In [9]:
# Recode diagnosis
df.loc[(df['DIQ010'] == 2),'DIQ010'] = 0  
df.loc[(df['DIQ010'] == 3),'DIQ010'] = 1   

# Recode age at diagnosis
df.loc[(df['DID040'] == 666),'DID040'] = 1  

#### Income


Recode Household income:

Simplify the household income with the following recoding.

  * 1 -- 0 to under 20K
  * 2 -- 20K to under 45K
  * 3 -- 45K to under 75K
  * 4 -- 75K and above

In [10]:
# Under 20K
df.loc[  df['INDHHIN2'].isin([1,2,3,4,13]), 'INDHHIN2'] = 1
# 20K to 45K
df.loc[  df['INDHHIN2'].isin([5,6,7]), 'INDHHIN2'] = 2
# 45K to 75K
df.loc[  df['INDHHIN2'].isin([8,9,10]), 'INDHHIN2'] = 3
# Over 75K
df.loc[  df['INDHHIN2'].isin([14,15]), 'INDHHIN2'] = 4
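
The same bracket recoding can also be written as a single lookup table, which applies every replacement in one pass so an already recoded value can never be matched again by a later rule. This is an alternative sketch, not the approach used in the cell above; the toy series is hypothetical.

```python
import pandas as pd

# Original INDHHIN2 code -> simplified bracket, mirroring the cell above
income_map = {
    **dict.fromkeys([1, 2, 3, 4, 13], 1),  # under 20K
    **dict.fromkeys([5, 6, 7], 2),         # 20K to under 45K
    **dict.fromkeys([8, 9, 10], 3),        # 45K to under 75K
    **dict.fromkeys([14, 15], 4),          # 75K and above
}

# Toy data (assumed values)
toy = pd.Series([2, 13, 6, 10, 15, 12], name="INDHHIN2")

# Codes absent from the mapping are passed through unchanged
recoded = toy.map(lambda code: income_map.get(code, code))
```

A code that is absent from the mapping (12 in the toy series) is left as it is, which matches the behavior of the `.loc` assignments.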

#### Diet

DBQ197 -- Amount of milk consumed. Ambiguous 'varied' response will be recoded as missing.

DBD895 / DBD900 -- Number of meals out of the home / fast food meals in the last week. For both columns 5555 encodes more than 21 meals; we recode it as 22, the smallest count above 21.


DBD905 / DBD910 -- Number of meals ready to eat / frozen in the last month. 6666 encodes more than 180 meals, recode as 180.



In [11]:

df.loc[(df.DBQ197 > 3), 'DBQ197'] = np.nan

df.loc[(df.DBD895 == 5555), 'DBD895'] = 22

df.loc[(df.DBD900 == 5555), 'DBD900'] = 22 

df.loc[(df.DBD905 == 6666), 'DBD905'] = 180 

df.loc[(df.DBD910 == 6666), 'DBD910'] = 180 

#### Hours worked

OCQ180 encodes the number of hours worked over the last week. Some survey cycles limit the largest value to 80, so we will enforce this limit over all cycles.

In [12]:
df.loc[(df.OCQ180 >= 80),'OCQ180'] = 80
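
Capping a column at an upper limit can also be expressed with `Series.clip`, shown here on hypothetical values. Missing values pass through unchanged.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values)
hours = pd.Series([12.0, 80.0, 95.0, np.nan], name="OCQ180")

# Values above 80 become 80; everything else is untouched
capped = hours.clip(upper=80)
```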

#### Are you / have you ever been pregnant

Males are coded as 'no' (0) for both questions (RHQ131, ever pregnant; RHD143, pregnant now). Females aged 60 or younger are coded as not pregnant now (0).

In [13]:
df.loc[df['RIAGENDR'] == 1, 'RHQ131'] = 0 
df.loc[df['RIAGENDR'] == 1, 'RHD143'] = 0 

df.loc[(df['RIAGENDR'] == 2)&(df['RIDAGEYR'] <= 60), 'RHD143'] = 0 

#### Weight

Self-reported weight (WHD140) is in pounds, but measured weight is in kg. We convert the self-reported weight to kg.

In [14]:
df.loc[:,'WHD140'] = df['WHD140'].div(2.2046)

#### Yes / No recoding

In [15]:
yn_recode = json.load(open(
    'config/yn_recode.json', 'r'))

for col in yn_recode:
    df.loc[df[col]==2,col]=0    
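
A compact way to apply the same yes / no recoding to several columns at once is `DataFrame.replace` on a column subset. The contents of `config/yn_recode.json` are not shown in this notebook, so the column list and toy values below are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical list of yes / no columns
yn_recode = ["SMQ020", "KIQ022"]

# Toy data (assumed values): 1 = yes, 2 = no
toy = pd.DataFrame({
    "SMQ020": [1.0, 2.0, np.nan],
    "KIQ022": [2.0, 2.0, 1.0],
})

# Recode no = 2 to no = 0 in the selected columns only
toy[yn_recode] = toy[yn_recode].replace(2, 0)
```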

### Rename Columns

In [16]:
col_rename = json.load(open(
    'config/col_rename.json', 'r'))

df.rename(columns = col_rename,inplace = True)    

### Average Systolic and Diastolic BP measurements

In [17]:
df['Systolic'] = df.loc[:,['BPXSY1','BPXSY2','BPXSY3']].mean(axis = 1)
df['Diastolic'] = df.loc[:,['BPXDI1','BPXDI2','BPXDI3']].mean(axis = 1)

 
df = df.drop(['BPXSY1','BPXSY2','BPXSY3','BPXDI1','BPXDI2','BPXDI3'],axis = 1)
df = df.dropna(how = 'any', subset = ['Systolic','Diastolic'])
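
Because `mean(axis=1)` skips missing values by default, each SP is averaged over however many readings were actually taken, and only SPs with no readings at all end up missing. A toy illustration with hypothetical systolic readings:

```python
import numpy as np
import pandas as pd

# Toy data (assumed values): three, one, and zero readings taken
toy = pd.DataFrame({
    "BPXSY1": [120.0, 130.0, np.nan],
    "BPXSY2": [124.0, np.nan, np.nan],
    "BPXSY3": [122.0, np.nan, np.nan],
})

# Row-wise mean over the available readings (NaN is skipped)
toy["Systolic"] = toy[["BPXSY1", "BPXSY2", "BPXSY3"]].mean(axis=1)

# An SP with no readings gets NaN and is removed here
toy = toy.dropna(subset=["Systolic"])
```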

In [18]:
df.to_pickle("preprocessed_data.pkl")

In [19]:
df.columns

Index(['Cycle', 'Age', 'Gender', 'Education', 'Ethnicity', 'InterviewWeight',
       'Alcohol', 'CholHist', 'HyperHist', 'ChestPain', 'Shortness', 'DiabAge',
       'DiabHist', 'Milk', 'FeltDown', 'Suicidality', 'FeltBad', 'WeakKidneys',
       'UrineLeak', 'HoursWorked', 'Dental', 'Pesticides', 'PregnantEver',
       'PregnantNow', 'HoursSlept', 'Smoke100', 'SmokeNow', 'MaxWeight',
       'LegLen', 'ArmCirc', 'Weight', 'BMI', 'Waist', 'ArmLen', 'Pulse',
       'LBXGLU', 'SessionTime', 'FoodFastHours', 'HDL', 'LDL', 'Tryglicerides',
       'TChol', 'HHIncome', 'MealsOut', 'ReadytoEat', 'Frozen', 'FastFood',
       'KidneyStones', 'ModWork', 'VigRec', 'VigWork', 'ModRec', 'WalkBike',
       'Systolic', 'Diastolic'],
      dtype='object')