# Data Wrangling for NHANES Data

## Summary

In this notebook we will clean data which was collected from the NHANES website: www.cdc.gov/nchs/nhanes/index.htm. This is the first step towards creating a predictive model for hypertension and diabetes. Aside from typical cleaning tasks carried out in data wrangling, there are special considerations due to the nature of the survey.

### Filling in cells skipped by design in the survey

The NHANES survey design skips some questions based on previous answers. For example, if the answer to the question 'Have you smoked 100 cigarettes in your lifetime?' is no, then the following question 'Are you currently smoking?' is skipped. In such columns we expect large numbers of missing values, and they are easily filled in from the earlier answer.

### Treating refused / don't know as missing

The NHANES interviewer records when a sample person (SP) refuses to answer or answers 'don't know'. These responses are coded as numbers, which are documented on the NHANES website. There are too few of them overall to treat as a separate category, so we treat them like the other missing values in the data.
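
As a minimal sketch of this idea, the snippet below replaces sentinel codes with NaN on a toy column. The helper name `codes_to_missing` and the toy data are assumptions made for illustration; the codes 7 and 9 are the ones used for yes/no questions later in this notebook.

```python
import numpy as np
import pandas as pd

def codes_to_missing(series, codes):
    """Return a copy of `series` with the given sentinel codes replaced by NaN."""
    return series.mask(series.isin(codes))

# Toy yes/no column: 1 = yes, 2 = no, 7 = refused, 9 = don't know
answers = pd.Series([1.0, 2.0, 7.0, 9.0, np.nan])
cleaned = codes_to_missing(answers, [7, 9])
print(cleaned.tolist())
```

Returning a new Series keeps the raw column untouched, which makes it easy to compare counts before and after the recode.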

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_pickle("raw_data.pkl")

In [3]:
# Before we begin cleaning we drop SPs below age 20
df = df[df.RIDAGEYR >= 20]
# Round every value to 0 decimal places; this also sets floating points near zero to exactly zero
df = df.round()

In [4]:
print('The size of the dataset: {0} rows, {1} columns '.format(*df.shape))
# Let us view the data
df.head()

The size of the dataset: 34770 rows, 62 columns 


| SEQN | RIDRETH1 | RIDAGEYR | DMDHREDU | RIAGENDR | INDHHIN2 | ALQ150 | BPQ020 | BPQ080 | CDQ001 | CDQ010 | ... | LBXTR | LBDLDL | LBXTC | PHAFSTHR | PHDSESN | LBXGLU | OHQ845 | ALQ151 | SLD012 | DMDHREDZ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41475.0 | 5.0 | 62.0 | 4.0 | 2.0 | 6.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | ... | NaN | NaN | 179.0 | 7.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 41477.0 | 3.0 | 71.0 | 3.0 | 1.0 | 5.0 | 2.0 | 1.0 | 1.0 | 2.0 | 2.0 | ... | NaN | NaN | 191.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 41479.0 | 1.0 | 52.0 | 1.0 | 1.0 | 8.0 | 2.0 | 2.0 | NaN | 2.0 | 2.0 | ... | 99.0 | 121.0 | 188.0 | 14.0 | 0.0 | 113.0 | NaN | NaN | NaN | NaN |
| 41481.0 | 4.0 | 21.0 | 4.0 | 1.0 | 6.0 | 2.0 | 2.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | 12.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 41482.0 | 1.0 | 64.0 | 4.0 | 1.0 | 15.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | ... | NaN | NaN | 158.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |


In [5]:
# Let us view the details about each column
print('Details about each column.\n')
df.info()


Details about each column.

<class 'pandas.core.frame.DataFrame'>
Float64Index: 34770 entries, 41475.0 to 102956.0
Data columns (total 62 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   RIDRETH1  34770 non-null  float64
 1   RIDAGEYR  34770 non-null  float64
 2   DMDHREDU  28300 non-null  float64
 3   RIAGENDR  34770 non-null  float64
 4   INDHHIN2  33998 non-null  float64
 5   ALQ150    9086 non-null   float64
 6   BPQ020    34770 non-null  float64
 7   BPQ080    31192 non-null  float64
 8   CDQ001    23220 non-null  float64
 9   CDQ010    23219 non-null  float64
 10  DID040    4624 non-null   float64
 11  DIQ010    34770 non-null  float64
 12  DBD910    34735 non-null  float64
 13  DBD900    26393 non-null  float64
 14  DBD905    34734 non-null  float64
 15  DBD895    34770 non-null  float64
 16  DBQ197    34770 non-null  float64
 17  KIQ026    34769 non-null  float64
 18  KIQ022    34769 non-null  float64
 19  KIQ005    30156 non-null  float64
 ...

### Combining similar columns

The column pairs (DMDHREDU, DMDHREDZ), (ALQ150, ALQ151), (SLD010H, SLD012), and (OHQ011, OHQ845) ask essentially the same questions; the wording or categorization was slightly modified over the survey cycles. We first combine these columns.

  * When the variable ALQ150 was replaced with ALQ151 the wording was changed from 'Was there ever a period of your life when you drank 5 alcoholic drinks per day?' to 'Was there ever a period of your life when you drank 4/5 alcoholic drinks per day?' (4 for women, 5 for men).
  * When the variable OHQ011 was replaced with OHQ845 the wording was changed from 'How would you describe the condition of your teeth?' to 'Overall, how would you rate the health of your teeth and gums?'
  * When the variable SLD010H was replaced with SLD012 the method was changed from directly asking 'How many hours do you sleep per night?' to asking for the sleep and wake times and taking the difference.
  * When the variable DMDHREDU was replaced with DMDHREDZ the five education categories were collapsed into three: the levels below a high school degree were merged into one, and the AA degree level was merged with high school graduate.
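
The combination step amounts to keeping the older column where it has a value and falling back to the newer column otherwise. A minimal sketch on toy data, using `Series.combine_first`; the column names `old` and `new` are hypothetical stand-ins for a pair such as ALQ150 / ALQ151, and both columns are assumed to already share the same coding.

```python
import numpy as np
import pandas as pd

# Toy frame: 'old' was asked in early cycles, 'new' in later ones (hypothetical names)
toy = pd.DataFrame({
    "old": [1.0, 2.0, np.nan, np.nan],
    "new": [np.nan, np.nan, 2.0, 1.0],
})

# Keep 'old' where present, otherwise take the value from 'new'
toy["combined"] = toy["old"].combine_first(toy["new"])
toy = toy.drop(columns=["old", "new"])
print(toy["combined"].tolist())
```

Any recoding needed to align the two codings has to happen before the combination, since afterwards the origin of each value is no longer visible.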

In [6]:
# Combine ALQ150 and ALQ151
df.loc[df['ALQ150'].isna(), 'ALQ150'] = df['ALQ151']
df = df.drop(['ALQ151'],axis = 1)

# Combine DMDHREDU and DMDHREDZ
# DMDHREDU must be recoded before combination:
df.loc[(df['DMDHREDU'] == 2),'DMDHREDU'] = 1
df.loc[(df['DMDHREDU'] == 3) | (df['DMDHREDU'] == 4),'DMDHREDU'] = 2
df.loc[(df['DMDHREDU'] == 5),'DMDHREDU'] = 3
# Combination:
df.loc[df['DMDHREDU'].isna(), 'DMDHREDU'] = df['DMDHREDZ']
df = df.drop(['DMDHREDZ'],axis = 1)

# Combine SLD010H and SLD012
df.loc[df['SLD010H'].isna(), 'SLD010H'] = df['SLD012']
df = df.drop(['SLD012'],axis = 1)

# Combine OHQ011 and OHQ845
# OHQ011 must be recoded before combination
df['OHQ011'] = df['OHQ011'] - 10
# Combination:
df.loc[df['OHQ011'].isna(), 'OHQ011'] = df['OHQ845']
df = df.drop(['OHQ845'],axis = 1)


### Fixing dependent columns

Again, we note that some columns are missing a significant number of values. Most of these are due to the survey methodology of skipping certain questions based on previous answers, so we fill in these values first.

In [7]:
# Simplify diabetes age column
# replace < 1 yr w 1 and (refused / don't know) --> missing
df.loc[(df['DID040'] == 666),'DID040'] = 1  
df.loc[(df['DID040'] == 777) | (df['DID040'] == 999) ,'DID040'] = np.nan
#print('The median value SPs were notified they had diabetes was', df['DID040'].median())

#ax = df.plot.hist(y='DID040' )
#ax.set_title('Age SPs were told they had diabetes.')
#plt.show()

Note not all SPs will have a value in this column. Data will be imputed in the model building phase.

In [8]:
# First we will simplify the diabetes column
# DIQ010 (refused / don't know) --> missing
df.loc[(df['DIQ010'] == 7) | (df['DIQ010'] == 9),'DIQ010'] = np.nan

# For those not told they have diabetes, code 0
df.loc[(df['DIQ010'] == 2),'DIQ010'] = 0  
# For those told they have diabetes, or borderline diabetes code 1 
df.loc[(df['DIQ010'] == 1) | (df['DIQ010'] == 3),'DIQ010'] = 1   



The column 'DBD895' gives the number of meals the SP ate out, and the column 'DBD900' gives the number of those meals that were fast food. If 'DBD895' is zero, we set 'DBD900' to zero.

The column 'OCD150' records whether the SP worked last week, and the column 'OCQ180' records the number of hours worked. If the SP did not work last week, we set the number of hours to zero.

The column 'SMQ020' records whether the SP has smoked 100 cigarettes in their life, and the column 'SMQ040' records whether the SP is currently a smoker. If the SP has not smoked 100 cigarettes, we categorize them as a non-smoker.
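
The three rules above share one pattern: a gate question determines the value of a skipped follow-up. A minimal sketch on toy data; the column names `gate` and `follow_up` are hypothetical, and unlike the cell below this variant only fills follow-ups that are actually missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gate": [1.0, 2.0, 2.0, 1.0],                 # 1 = yes, 2 = no
    "follow_up": [1.0, np.nan, np.nan, np.nan],   # skipped when gate == 2
})

# Fill the follow-up only where the gate says 'no' and the answer was skipped
skipped = (toy["gate"] == 2) & toy["follow_up"].isna()
toy.loc[skipped, "follow_up"] = 3                 # 3 = 'not at all' in this toy coding

print(toy["follow_up"].tolist())
```

The last row stays missing because its gate answer was 'yes', so the gap there is a genuine non-response rather than a designed skip.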



In [9]:

# For the Fast food we will fill in 0 if no meals were eaten out
df.loc[(df['DBD895'] == 0 ),'DBD900'] = 0    

# If SP was not working last week, fill in zero hours worked
df.loc[df['OCD150'].isin([2,3,4]),'OCQ180'] = 0    
df = df.drop(['OCD150'],axis = 1)

# If SP has not smoked 100 cigarettes, then not currently smoking
# (the condition > 1 also matches the refused / don't know codes 7 and 9)
df.loc[(df['SMQ020'] > 1),'SMQ040'] = 3    
 

# Now let us view missing values again
print('Number of missing values per column after filling in skipped questions.')
df.isna().sum(axis=0)

Number of missing values per column after filling in skipped questions.


RIDRETH1        0
RIDAGEYR        0
DMDHREDU     1164
RIAGENDR        0
INDHHIN2      772
ALQ150       8913
BPQ020          0
BPQ080       3578
CDQ001      11550
CDQ010      11551
DID040      30193
DIQ010         25
DBD910         35
DBD900         28
DBD905         36
DBD895          0
DBQ197          0
KIQ026          1
KIQ022          1
KIQ005       4614
DPQ090       4777
DPQ020       4759
DPQ060       4769
OCQ180         21
OHQ011       1384
PUQ100       4322
PAQ665          0
PAQ635          0
PAQ650          0
PAQ620          0
PAQ605          0
RHQ131      19488
RHD143      30440
SLD010H        95
SMQ020          1
SMQ040          1
WHD140        113
BPXDI1       4135
BPXSY2       3546
BPXSY1       4135
BPXDI2       3546
BPXDI3       3611
BPXSY3       3611
BPXPLS       2714
BMXWT        1773
BMXARMC      3023
BMXBMI       1831
BMXLEG       3441
BMXARML      3019
BMXWAIST     3385
LBDHDD       3393
LBXTR       19893
LBDLDL      20140
LBXTC        3393
PHAFSTHR     1943
PHDSESN   

Note there are still a large number of missing values for LBXGLU. This is the blood glucose measurement that will be used to define our diabetes target variable, so we will drop the rows with missing values only when we specialize to the predictive diabetes model.

In [10]:
# Count the missing values per row; most rows are missing fewer than 10 values
print('View the number of rows missing k values:')
df.isna().sum(axis=1).value_counts().sort_index()

View the number of rows missing k values:


0       18
1      665
2     3061
3     4433
4     2544
5     5307
6     5673
7     2603
8     2721
9     1599
10     952
11     665
12     630
13     384
14     455
15     284
16     274
17     130
18     222
19     157
20     117
21     112
22      93
23      84
24      57
25      68
26      39
27      28
28     145
29     607
30     160
31     330
32     110
33      39
34       3
35       1
dtype: int64
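
To read a table like this more easily, the counts can be turned into a cumulative share of rows. A minimal sketch on a toy frame (the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Number of missing values in each row, tabulated and sorted by count
missing_per_row = toy.isna().sum(axis=1)
counts = missing_per_row.value_counts().sort_index()

# Share of rows missing at most k values, for each observed k
cumulative_share = counts.cumsum() / counts.sum()
print(cumulative_share)
```

Applied to `df`, the same three lines would show directly what fraction of rows falls below any chosen missing-value threshold.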

## Recoding / Cleaning

We now go through each survey in turn, cleaning and recoding its columns.

### Demographics

The demographics survey includes the following features
  * Age
  * Gender
  * Ethnicity
  * Education
  * Household Income
 

Age is a variable taking on integer values.

Gender
  
  * 1 -- Male
  * 2 -- Female
  
Ethnicity

  * 1 -- Mexican American
  * 2 -- Other Hispanic
  * 3 -- Non-Hispanic White
  * 4 -- Non-Hispanic Black
  * 5 -- Other Race, including Multi-Racial
  

In [11]:
# We will clean each survey one at a time, beginning with the Demographics survey.

# Demographics
# RIDAGEYR age  
# RIAGENDR gender OK
# RIDRETH1 ethnicity OK
 

df = df.rename(columns={'RIAGENDR':'Gender','RIDAGEYR':'Age','RIDRETH1':'Ethnicity'})

### Education and Income

Note above there are missing values in the education (DMDHREDU) and income (INDHHIN2) columns. We fill in the missing education values with the mode. Missing income values are left as missing in this notebook; the commented-out code in the cell below would impute the most common income level for each education level.

Household income:

  * 1 -- 0 to under 20K
  * 2 -- 20K to under 45K
  * 3 -- 45K to under 75K
  * 4 -- 75K and above
  
Education:

  * 1 -- Less than Highschool
  * 2 -- GED / Highschool graduate
  * 3 -- College graduate or higher
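
Imputing income from education can be sketched as a group-wise mode lookup. This is an illustration on toy data, not the notebook's active code; the helper `first_mode` is hypothetical, and ties between modes are resolved by taking the smallest value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Education": [1, 1, 1, 2, 2, 2],
    "HHIncome": [1.0, 1.0, np.nan, 3.0, 3.0, np.nan],
})

def first_mode(values):
    """Most common non-missing value; NaN if the group has no observed values."""
    modes = values.mode()  # mode() drops NaN by default and sorts its result
    return modes.iloc[0] if not modes.empty else np.nan

# One mode per education level, then look it up for every row
mode_by_education = toy.groupby("Education")["HHIncome"].agg(first_mode)
fill_values = toy["Education"].map(mode_by_education)
toy["HHIncome"] = toy["HHIncome"].fillna(fill_values)

print(toy["HHIncome"].tolist())
```

Using `map` followed by `fillna` avoids a row-wise `apply`, which matters on a frame with tens of thousands of rows.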
  

In [12]:
# Education

# Fill in Don't know/ Refused/ Missing --> Mode 
df.loc[ (df['DMDHREDU'] == 7) | (df['DMDHREDU'] == 9), 'DMDHREDU'] = np.nan
df.loc[df['DMDHREDU'].isna(), 'DMDHREDU'] = df['DMDHREDU'].mode()[0]

# Household Income 
# Under 20K
df.loc[  df['INDHHIN2'].isin([1,2,3,4,12]), 'INDHHIN2'] = 1
# 20K to 45K
df.loc[  df['INDHHIN2'].isin([5,6,7]), 'INDHHIN2'] = 2
# 45K to 75K
df.loc[  df['INDHHIN2'].isin([8,9,10]), 'INDHHIN2'] = 3
# Over 75K
df.loc[  df['INDHHIN2'].isin([14,15]), 'INDHHIN2'] = 4
# Fill in Don't know/ Refused/ Over 20K -->  Missing 
df.loc[  df['INDHHIN2'].isin([13,77,99]), 'INDHHIN2'] = np.nan

# Impute most common income per education level:
#edu_inc = df.groupby(by=["DMDHREDU"])["INDHHIN2"].agg(pd.Series.mode).to_dict()
#def edu_inc_impute(a,b):
#    if np.isnan(b):
#        return edu_inc[a]
#    else:
#        return b
#df.loc[df['INDHHIN2'].isna(),'INDHHIN2'] = df.apply(lambda x: edu_inc_impute(x.DMDHREDU,x.INDHHIN2) ,axis = 1)

# Rename columns

df = df.rename(columns={'INDHHIN2':'HHIncome','DMDHREDU':'Education'})


### Alcohol

Ever have 4/5 or more drinks every day?
 
Code as 
  * 0 -- No
  * 1 -- Yes
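
Many of the following sections apply the same yes/no recode (1 stays 1, 2 becomes 0, refused / don't know become missing). A minimal sketch of a reusable helper; the name `recode_yes_no` is hypothetical and the notebook itself repeats the two `df.loc` assignments per column instead.

```python
import numpy as np
import pandas as pd

def recode_yes_no(series):
    """Map 1 -> 1 (yes) and 2 -> 0 (no); every other value becomes NaN."""
    return series.map({1.0: 1.0, 2.0: 0.0})

# Toy column: 1 = yes, 2 = no, 7 = refused, 9 = don't know
answers = pd.Series([1.0, 2.0, 7.0, 9.0, np.nan])
recoded = recode_yes_no(answers)
print(recoded.tolist())
```

Because `Series.map` with a dict returns NaN for any key it does not find, unexpected codes also end up missing rather than silently surviving.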

In [13]:
#Alcohol
df.loc[ (df['ALQ150'] == 7) | (df['ALQ150'] == 9), 'ALQ150'] = np.nan
#df.loc[df['ALQ150'].isna(), 'ALQ150'] = 2
df.loc[df['ALQ150'] == 2, 'ALQ150'] = 0
df = df.rename(columns={'ALQ150':'Alcohol'})

### Hypertension / Cholesterol

Have you been told by your doctor you have hypertension / high cholesterol?

 
Code as 
  * 0 -- No
  * 1 -- Yes

In [14]:
# Blood Pressure & Cholesterol

# Told you have Hypertension -- refused / don't know -- > missing  
df.loc[ (df['BPQ020'] == 7) | (df['BPQ020'] == 9), 'BPQ020'] = np.nan
#df.loc[df['BPQ020'].isna(), 'BPQ020'] = 2
df.loc[df['BPQ020'] == 2, 'BPQ020'] = 0

# Told High Cholesterol -- refused / don't know --> missing 
df.loc[ (df['BPQ080'] == 7) | (df['BPQ080'] == 9), 'BPQ080'] = np.nan
#df.loc[df['BPQ080'].isna(), 'BPQ080'] = 2 
df.loc[df['BPQ080'] == 2, 'BPQ080'] = 0 

df = df.rename(columns={'BPQ020':'HyperHist','BPQ080':'CholHist'})

### Cardio health 

Do you ever have chest pain?

  * 0 -- No
  * 1 -- Yes
  
Shortness of breath on stairs or inclines?

  * 0 -- No
  * 1 -- Yes

In [15]:

# Chest pain -- refused / don't know -- > missing  
df.loc[ (df['CDQ001'] == 7) | (df['CDQ001'] == 9), 'CDQ001'] = np.nan 
df.loc[df['CDQ001'] == 2, 'CDQ001'] = 0

# Shortness of breath on stairs -- refused / don't know -- > missing  
df.loc[ (df['CDQ010'] == 7) | (df['CDQ010'] == 9), 'CDQ010'] = np.nan 
df.loc[df['CDQ010'] == 2, 'CDQ010'] = 0

df = df.rename(columns={'CDQ001':'ChestPain','CDQ010':'Shortness'})

### Diabetes

The diabetes column has already been cleaned above, so here we simply rename the columns.

  * 0 -- SP not told they have diabetes
  * 1 -- SP told they have diabetes or borderline diabetes
  
Diabetes age column, for those told they have diabetes. Integer values.

  

In [16]:
# Diabetes
 

df = df.rename(columns={'DID040':'DiabAge','DIQ010':'DiabHist'})

### Diet Questionnaire

How often does the SP consume milk?
Coded as:
  * 0 -- Never
  * 1 -- Rarely < 1 per week
  * 2 -- Sometimes < 1 per day
  * 3 -- >= 1 per day
  
Meals out of the home over the last week, coded as an integer from 0 to 22.

Fast food meals over the last week, coded as an integer from 0 to 22.

Meals ready to eat over the last 30 days, coded as an integer from 0 to 180.

Frozen meals over the last 30 days, coded as an integer from 0 to 180.
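
These count columns mix real counts with a top-code sentinel and refused / don't know sentinels. A minimal sketch of handling both in one pass on toy data; the helper name `clean_count` is hypothetical, and the sentinel values mirror those used for 'DBD895' in the cell below.

```python
import numpy as np
import pandas as pd

def clean_count(series, top_code, top_value, missing_codes):
    """Replace a top-code sentinel with a number and sentinel answers with NaN."""
    out = series.replace(top_code, top_value)
    return out.mask(out.isin(missing_codes))

# Toy column: 5555 = more than 21 meals, 7777 = refused, 9999 = don't know
meals = pd.Series([0.0, 3.0, 5555.0, 7777.0, 9999.0])
cleaned = clean_count(meals, top_code=5555, top_value=22, missing_codes=[7777, 9999])
print(cleaned.tolist())
```

The top code is replaced first so that it cannot be mistaken for a missing-value sentinel in the second step.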

In [17]:
# Diet Behavior & Nutrition

# Past 30 days milk consumption.
# 'DBQ197' Milk consumption
# Codes above 3 --> missing
df.loc[(df.DBQ197 > 3), 'DBQ197'] = np.nan
#df.loc[(df.DBQ197.isna()), 'DBQ197'] = 2  

# How many meals out of the home?
# DBD895 > 21 meals -> 22 meals
df.loc[(df.DBD895 == 5555), 'DBD895'] = 22
# replace refused / don't know --> missing 
df.loc[(df.DBD895 == 7777) | (df.DBD895 == 9999), 'DBD895'] = np.nan
#df.loc[df.DBD895.isna(), 'DBD895'] = df.DBD895.median()

# How many fast food meals?
# DBD900 
df.loc[(df.DBD895 == 0), 'DBD900'] = 0 
# DBD900 > 21 meals -> 22 meals
df.loc[(df.DBD900 == 5555), 'DBD900'] = 22  
# replace refused / don't know  -- > missing
df.loc[(df.DBD900 == 7777) | (df.DBD900 == 9999), 'DBD900'] = np.nan
#df.loc[df.DBD900.isna(), 'DBD900'] = df.DBD900.median() 

# How many meals ready to eat?
# DBD905
# >= 180 set to 180
df.loc[(df.DBD905 == 6666), 'DBD905'] = 180  
# replace refused / don't know  -- > missing
df.loc[(df.DBD905 == 7777) | (df.DBD905 == 9999), 'DBD905'] = np.nan

# How many frozen meals?
# DBD910
# >= 180 set to 180
df.loc[(df.DBD910 == 6666), 'DBD910'] = 180  
# replace refused / don't know  -- > missing
df.loc[(df.DBD910 == 7777) | (df.DBD910 == 9999), 'DBD910'] = np.nan

df = df.rename(columns={'DBQ197':'Milk','DBD895':'MealsOut',
                        'DBD900':'FastFood','DBD905':'ReadytoEat','DBD910':'Frozen'})

### Kidney Questionnaire
Questions:

Have you ever been told you have weak kidneys?

Have you ever had kidney stones?

  * 0 -- No
  * 1 -- Yes
  
How often do you have urinary leakage?

  * 1 -- Never
  * 2 -- Less than once a month
  * 3 -- A few times a month
  * 4 -- A few times a week
  * 5 -- Every day and or night


In [18]:

# Kidney questionnaire
    
# Told Weak kidney (refused / don't know --->  missing
df.loc[(df.KIQ022 == 7) | (df.KIQ022 == 9),'KIQ022'] = np.nan
#df.loc[df.KIQ022.isna(),'KIQ022'] = 2 
df.loc[df['KIQ022'] == 2, 'KIQ022'] = 0 
    
# Kidney stones (refused / don't know --> missing
df.loc[(df.KIQ026 == 7) | (df.KIQ026 == 9),'KIQ026'] = np.nan
#df.loc[df.KIQ026.isna(), 'KIQ026'] = 2  
df.loc[df['KIQ026'] == 2, 'KIQ026'] = 0 
    
# Urinary leakage (refused / don't know --> missing 
df.loc[(df.KIQ005 == 7) | (df.KIQ005 == 9),'KIQ005'] = np.nan
#df.loc[df.KIQ005.isna(),'KIQ005'] = 1

df = df.rename(columns={'KIQ022':'WeakKidneys','KIQ026':'KidneyStones','KIQ005':'UrineLeak'})

### Dental Health

Rate the overall health of your teeth and gums

  * 1 -- Excellent
  * 2 -- Very good
  * 3 -- Good
  * 4 -- Fair
  * 5 -- Poor
  

In [19]:
# Dental health (refused / don't know --> missing 
df.loc[df.OHQ011 > 5,'OHQ011'] = np.nan
#df.loc[df.OHQ011.isna(),'OHQ011'] = 3

df = df.rename(columns = {'OHQ011':'Dental'})

### Mental health

Questions: Over the last 2 weeks, how often have you
1. Felt down, depressed, or hopeless
2. Felt bad about yourself
3. Thought you would be better off dead

Each coded as

  * 0 -- Not at all
  * 1 -- Several days
  * 2 -- More than half the days 
  * 3 -- Nearly every day

In [20]:


# Felt down  refused / don't know --> missing
df.loc[(df.DPQ020 == 7) | (df.DPQ020 == 9),'DPQ020'] = np.nan

# Felt bad about yourself refused / don't know --> missing
df.loc[(df.DPQ060 == 7) | (df.DPQ060 == 9),'DPQ060'] = np.nan

# Suicidality refused / don't know --> missing
df.loc[(df.DPQ090 == 7) | (df.DPQ090 == 9),'DPQ090'] = np.nan

df = df.rename(columns = {'DPQ020':'FeltDown','DPQ060':'FeltBad','DPQ090':'Suicidality'})

### Occupation

Hours worked over the last week. Integer valued.
Starting in 2015 the values were bounded above by 80,
and in 2017 they were also bounded below by 5. We apply the upper bound of 80 to all years.
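
Enforcing an upper bound across all years can be sketched with `Series.clip`, shown here on a toy column (the values are made up):

```python
import numpy as np
import pandas as pd

hours = pd.Series([0.0, 35.0, 80.0, 96.0, np.nan])

# Cap values at 80; values at or below the cap and NaN are left unchanged
capped = hours.clip(upper=80)
print(capped.tolist())
```

Only the upper limit is passed, so the zero hours filled in earlier for SPs who did not work are left as they are.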

In [21]:
# Hours worked
# Hours worked: refused / don't know --->  missing
df.loc[(df.OCQ180 == 77777) | (df.OCQ180 == 99999),'OCQ180'] = np.nan

# In some years the hours worked variable is capped at 80 hours, so we enforce this cap over all years.
df.loc[(df.OCQ180 >= 80),'OCQ180'] = 80

df = df.rename(columns = {'OCQ180':'HoursWorked'})

### Pesticides in the home?

In the past 7 days have pesticides been used in the home to control insects?

Coded as:
  * 0 -- No
  * 1 -- Yes


In [22]:

# Pesticides  refused / don't know --> missing
df.loc[(df.PUQ100 == 7) | (df.PUQ100 == 9),'PUQ100'] = np.nan

df.loc[df['PUQ100'] == 2, 'PUQ100'] = 0 

df = df.rename(columns = {'PUQ100':'Pesticides'})

### Physical Activity

Does your job involve moderate/vigorous work activity?

Do you walk or bike to work?

Do you participate in moderate/vigorous recreational activity?

Code as,
 
   * 0 -- No
   * 1 -- Yes
   

In [23]:
# Physical Activity questionnaire
    
# Vig work  (refused / don't know --> missing
df.loc[(df.PAQ605 == 7) | (df.PAQ605 == 9),'PAQ605'] = np.nan
#df.loc[df.PAQ605.isna(),'PAQ605'] = 2   
df.loc[df['PAQ605'] == 2, 'PAQ605'] = 0  
    
# Moderate work  (refused / don't know --> missing
df.loc[(df.PAQ620 == 7) | (df.PAQ620 == 9),'PAQ620'] = np.nan
#df.loc[df.PAQ620.isna(),'PAQ620'] = 2  
df.loc[df['PAQ620'] == 2, 'PAQ620'] = 0  
    
# Walk / Bike
# (refused / don't know --> missing
df.loc[(df.PAQ635 == 7) | (df.PAQ635 == 9),'PAQ635'] = np.nan
#df.loc[df.PAQ635.isna(),'PAQ635'] = 2     
df.loc[df['PAQ635'] == 2, 'PAQ635'] = 0  
        
# Vig rec (refused / don't know --> missing
df.loc[(df.PAQ650 == 7) | (df.PAQ650 == 9),'PAQ650'] = np.nan
#df.loc[df.PAQ650.isna(),'PAQ650'] = 2   
df.loc[df['PAQ650'] == 2, 'PAQ650'] = 0    
    
# Moderate rec  (refused / don't know --> missing
df.loc[(df.PAQ665 == 7) | (df.PAQ665 == 9),'PAQ665'] = np.nan
#df.loc[df.PAQ665.isna(),'PAQ665'] = 2    
df.loc[df['PAQ665'] == 2, 'PAQ665'] = 0  
     

df = df.rename(columns={'PAQ635':'WalkBike','PAQ605':'VigWork',
                        'PAQ620':'ModWork','PAQ650':'VigRec','PAQ665':'ModRec'})


### Reproductive Health

Have you ever been pregnant?
 
  * 0 -- No
  * 1 -- Yes
  
Are you pregnant now?

  * 0 -- No
  * 1 -- Yes
  
(Males will be coded as No)


In [24]:
# Have you ever been pregnant?

# Ever pregnant (refused / don't know) --> missing
df.loc[(df.RHQ131 == 7) | (df.RHQ131 == 9),'RHQ131'] = np.nan
df.loc[df['RHQ131'] == 2, 'RHQ131'] = 0  

# Are you pregnant now?
# Pregnant now (refused / don't know) --> missing
df.loc[(df.RHD143 == 7) | (df.RHD143 == 9),'RHD143'] = np.nan
df.loc[df['RHD143'] == 2, 'RHD143'] = 0  

# Males coded no
df.loc[df['Gender'] == 1, 'RHQ131'] = 0 
df.loc[df['Gender'] == 1, 'RHD143'] = 0 

df = df.rename(columns={'RHQ131':'PregnantEver','RHD143':'PregnantNow'})

### Sleep

How many hours of sleep per night? Continuous valued.

In [25]:
# Sleep questionnaire

# refused / don't know --->  missing
df.loc[(df.SLD010H == 77) | (df.SLD010H == 99),'SLD010H'] = np.nan 

df = df.rename(columns ={'SLD010H':'HoursSlept'})

### Smoking

Have you smoked 100 cigarettes in your life?

  * 0 -- No
  * 1 -- Yes
  
Do you now smoke cigarettes?

  * 1 -- Every day
  * 2 -- Some days
  * 3 -- Not at all


In [26]:

# Smoking
    
# Smoking 100 (refused / don't know --> missing
df.loc[(df.SMQ020 == 7) | (df.SMQ020 == 9),'SMQ020'] = np.nan
#df.loc[df.SMQ020.isna(),'SMQ020'] = 2   
df.loc[df['SMQ020'] == 2, 'SMQ020'] = 0  
    
# Smoking 100 = No >> Currently Smoking = No ('No' is coded 0 after the recode above)
df.loc[(df.SMQ020 == 0),'SMQ040'] = 3 
    
# Currently Smoking (refused / don't know --> missing
df.loc[(df.SMQ040 == 7) | (df.SMQ040 == 9),'SMQ040'] = np.nan
#df.loc[df.SMQ040.isna(),'SMQ040'] = 3   
    

df = df.rename(columns={'SMQ020':'Smoke100','SMQ040':'SmokeNow'})

### Weight History

Self reported maximum weight, range of integer values.

In [27]:

# self reported max weight refused / don't know --> missing
df.loc[(df.WHD140 == 7777) | (df.WHD140 == 9999),'WHD140'] = np.nan

# This self-reported value is in pounds, but the measured weight we use later is in kg, so we convert this column to kg.
df.loc[:,'WHD140'] = df['WHD140'].div(2.2046)

df = df.rename(columns={'WHD140':'MaxWeight'})

### Fasting / Session Time
The fasting time originates from the Blood Glucose Laboratory Survey. It is not clear whether it correlates with the blood pressure outcomes, but we import it in case it does.

The fast time is reported as a non negative integer.

The session time is coded as
  * 0 -- Morning
  * 1 -- Afternoon
  * 2 -- Evening

In [28]:
# Food fast hours / Session time

df = df.rename(columns={'PHAFSTHR':'FoodFastHours','PHDSESN':'SessionTime'})

### Body Measurements

Measurements include:
 * BMI
 * Waist
 * Leg Length
 * Arm Length
 * Arm Circumference
 
Missing values are left in place here; the median fill in the cell below is commented out.

In [29]:
# Body measurement

# BMXBMI BMI measurement
# replace missing with median
#df.loc[(df.BMXBMI.isna()),'BMXBMI'] = df.BMXBMI.median()
#df.loc[(df.BMXWAIST.isna()),'BMXWAIST'] = df.BMXWAIST.median()
#df.loc[(df.BMXLEG.isna()),'BMXLEG'] = df.BMXLEG.median()
#df.loc[(df.BMXARML.isna()),'BMXARML'] = df.BMXARML.median()
#df.loc[(df.BMXARMC.isna()),'BMXARMC'] = df.BMXARMC.median()

df = df.rename(
    columns={'BMXBMI':'BMI','BMXWAIST':'Waist','BMXLEG':'LegLen','BMXARML':'ArmLen','BMXARMC':'ArmCirc','BMXWT':'Weight'})

### Circulatory Measurements

Measurements include:
 * Food in the last 30 min
 * Pulse
 * Systolic pressure
 * Diastolic pressure

Missing pulse values are left in place here; the median fill in the cell below is commented out.

There are 3 measurements for each of systolic and diastolic pressure; we take the average of the available readings.

Rows with a missing systolic or diastolic average are dropped.
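
A minimal sketch of the averaging step on toy data (the column names `SY1` to `SY3` are shortened stand-ins for the three systolic readings). `DataFrame.mean(axis=1)` skips missing readings by default, so a row only ends up with a missing average when all three readings are missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SY1": [120.0, np.nan, np.nan],
    "SY2": [124.0, 130.0, np.nan],
    "SY3": [122.0, 134.0, np.nan],
})

# Row-wise mean over the available readings (NaN readings are skipped)
toy["Systolic"] = toy[["SY1", "SY2", "SY3"]].mean(axis=1)

# Drop rows where no reading was available at all
toy = toy.dropna(subset=["Systolic"])
print(toy["Systolic"].tolist())
```

The second row is averaged over two readings and kept, while the third row has no readings and is dropped.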


In [30]:


# Pulse
#df.loc[(df.BPXPLS.isna()),'BPXPLS'] = df.BPXPLS.median()

# Create Systolic / Diastolic pressure as the average of the measurements

df['Systolic'] = df.loc[:,['BPXSY1','BPXSY2','BPXSY3']].mean(axis = 1)
df['Diastolic'] = df.loc[:,['BPXDI1','BPXDI2','BPXDI3']].mean(axis = 1)

 
df = df.drop(['BPXSY1','BPXSY2','BPXSY3','BPXDI1','BPXDI2','BPXDI3'],axis = 1)
df = df.dropna(how = 'any', subset = ['Systolic','Diastolic'])



df = df.rename(columns={'BPXPLS':'Pulse'})

In [31]:


df = df.rename(columns={'LBDHDD':'HDL','LBDLDL':'LDL','LBXTR':'Tryglicerides','LBXTC':'TChol'})

In [32]:
# Let's check missing values
print('Missing values per feature:')
df.isna().sum(axis=0)

Missing values per feature:


Ethnicity            0
Age                  0
Education            0
Gender               0
HHIncome          2108
Alcohol           6698
HyperHist           41
CholHist          3467
ChestPain        10561
Shortness        10588
DiabAge          27728
DiabHist            22
Frozen              45
FastFood            30
ReadytoEat          76
MealsOut            24
Milk               121
KidneyStones        80
WeakKidneys         46
UrineLeak         2597
Suicidality       2708
FeltDown          2689
FeltBad           2712
HoursWorked         38
Dental            1327
Pesticides        2371
ModRec               5
WalkBike             1
VigRec               2
ModWork             14
VigWork              8
PregnantEver      1550
PregnantNow      12293
HoursSlept         103
Smoke100            21
SmokeNow             3
MaxWeight          522
Pulse                2
Weight             303
ArmCirc           1172
BMI                359
LegLen            1561
ArmLen            1168
Waist      

These values will have to be imputed in the EDA notebook.

In [33]:
df.to_pickle("clean_data.pkl")