# Cleaning the NHIS Sleep Data Set and Dealing with NaN Values

In [1]:
import numpy as np
import pandas as pd

In [3]:
# Import the dataset

nhis = pd.read_csv('NHIS_2007.csv')

# Let's look at the dataset info
nhis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4785 entries, 0 to 4784
Data columns (total 9 columns):
HHX       4785 non-null int64
FMX       4785 non-null int64
FPX       4785 non-null int64
SEX       4785 non-null int64
BMI       4785 non-null float64
SLEEP     4785 non-null object
educ      4785 non-null int64
height    4785 non-null int64
weight    4785 non-null int64
dtypes: float64(1), int64(7), object(1)
memory usage: 336.5+ KB


Notice that the 'SLEEP' column is an object when it should be numeric.

In [None]:
# Let's see what the data looks like

nhis.head()

The question marks in SLEEP are keeping the column from being numeric.

Let's fix the problem of SLEEP being a column of objects by replacing 
the ?'s with NaN's and converting the column to numeric.

In [None]:
# Replace ?'s with NaN's
nhis.replace('?', np.nan, inplace = True)

# Convert the items in SLEEP to numeric using pd.to_numeric()
nhis['SLEEP'] = pd.to_numeric(nhis['SLEEP'])

nhis.head()
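
As an aside, the replace-then-convert steps above can be collapsed into one call. The sketch below uses a small made-up Series as a stand-in for the SLEEP column (the values are illustrative, not NHIS data) and relies on the `errors='coerce'` option of `pd.to_numeric`, which turns any entry it cannot parse into NaN.

```python
import pandas as pd

# Toy stand-in for the SLEEP column (made-up values, not NHIS data)
sleep = pd.Series(['7', '8', '?', '6'])

# errors='coerce' converts anything that cannot be parsed as a number into NaN
sleep_numeric = pd.to_numeric(sleep, errors='coerce')

print(sleep_numeric)
print(sleep_numeric.isna().sum())  # number of entries that became NaN
```

One trade-off to keep in mind: coercion silently turns every unparseable entry into NaN, not just the ?'s, so it is worth checking how many NaN's appear afterwards.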

In [None]:
# Make sure we fixed the problem

nhis.info()

In [None]:
# Let's look at the stats for each column

nhis.describe()

In [None]:
import matplotlib.pyplot as plt

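# Plot a histogram of every column
# Drop NaN values first so they cannot affect the min/max range or the binning
for col in nhis:
    values = nhis[col].dropna()
    plt.hist(values, range = [values.min(), values.max()])
    plt.title(col)
    plt.show()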

Notice that some of the histograms contain values that sit far away from the rest of the data. These are likely incorrect values.

In [None]:
# Let's fix the errant values in BMI

# Use the np.where() function to convert all of the values in BMI that are greater than 80 to NaN values
nhis['BMI'] = np.where(nhis['BMI'] > 80, np.nan, nhis['BMI'])

# Plot the histogram
plt.hist(nhis['BMI'].dropna(), range = [nhis['BMI'].min(), nhis['BMI'].max()])
plt.title('BMI')
plt.show()
nhis.describe()

In [None]:
# Let's fix the rest of the columns with errant values

nhis['SLEEP'] = np.where(nhis['SLEEP'] > 24, np.nan, nhis['SLEEP'])
nhis['educ'] = np.where(nhis['educ'] > 40, np.nan, nhis['educ'])
nhis['height'] = np.where(nhis['height'] > 80, np.nan, nhis['height'])
nhis['weight'] = np.where(nhis['weight'] > 400, np.nan, nhis['weight'])


# Let's look at the histograms again to make sure that worked
# Drop NaN values first so they cannot affect the min/max range or the binning
for col in nhis:
    values = nhis[col].dropna()
    plt.hist(values, range = [values.min(), values.max()])
    plt.title(col)
    plt.show()
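
The same cleanup can also be written with pandas' own `Series.mask`, which replaces values with NaN wherever a condition is True. This is only a sketch on a made-up DataFrame. The `limits` dictionary is a hypothetical helper that mirrors the cutoffs used above, and the toy values are not from the NHIS file.

```python
import pandas as pd

# Toy data with one implausible value per column (made-up, not NHIS data)
toy = pd.DataFrame({'BMI': [22.5, 31.0, 99.99],
                    'weight': [150, 210, 999]})

# Hypothetical lookup of upper limits, mirroring the cutoffs used above
limits = {'BMI': 80, 'weight': 400}

for col, upper in limits.items():
    # mask() puts NaN wherever the condition is True and keeps the rest
    toy[col] = toy[col].mask(toy[col] > upper)

print(toy)
print(toy.dtypes)  # integer columns become float once they hold NaN
```

Keeping the cutoffs in one dictionary makes it easier to see and adjust them in a single place.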

Another thing to notice is that the SEX column records male/female as numbers. Let's make that column categorical with identifiers "F" and "M".

Our first problem is figuring out which values correspond to male and which correspond to female. For that, let's make scatter plots of SEX against weight and against height, with the assumption that males weigh more and are taller on average than females.

In [None]:
plt.scatter(x = nhis['weight'], y = nhis['SEX'])
plt.ylabel('sex')
plt.xlabel('weight')
plt.show()

In [None]:
plt.scatter(x = nhis['height'], y = nhis['SEX'])
plt.ylabel('sex')
plt.xlabel('height')
plt.show()

Based on the plots, the higher value (2) seems to correspond to females, and the lower value (1) seems to correspond to males.

Using that knowledge, let's replace every value of 1 in the SEX column with the string "M" and every other value with the string "F".

In [None]:
nhis['SEX'] = np.where(nhis['SEX'] == 1, 'M', 'F')
nhis.head()
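
If we want SEX to be a true pandas categorical rather than a column of strings, one option is to map the codes through a dictionary and convert the result with `astype('category')`. The sketch below uses a made-up Series and assumes the 1 = male, 2 = female coding inferred from the plots above.

```python
import pandas as pd

# Toy stand-in for the SEX column (made-up values)
sex = pd.Series([1, 2, 2, 1])

# Assumed coding, inferred from the scatter plots: 1 = male, 2 = female
sex_labels = sex.map({1: 'M', 2: 'F'}).astype('category')

print(sex_labels)
print(sex_labels.cat.categories.tolist())
```

Unlike the `np.where()` version, `map()` leaves any code that is not in the dictionary as NaN instead of labelling it "F".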

In [None]:
nhis.info()

### Dealing with the missing (NaN) values

There are a few ways in which you can deal with missing data:

##### 1\. Removing the rows with NaN values
    
This is only really valid when a relatively small portion of your dataset contains NaN's. It is never ideal to lose data points.
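
A minimal sketch of this option on a made-up DataFrame (the values are illustrative, not NHIS data): first check what fraction of rows would be lost, then drop them with `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made-up, not NHIS data)
toy = pd.DataFrame({'SLEEP': [7, np.nan, 8, 6],
                    'BMI': [22.5, 31.0, np.nan, 27.1]})

# Fraction of rows that contain at least one NaN
frac_incomplete = toy.isna().any(axis=1).mean()
print(frac_incomplete)

# Keep only the rows with no NaN values
complete_rows = toy.dropna()
print(complete_rows)
```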
    
##### 2\. Removing the columns with NaN values
    
If only one or two non-essential columns contain all of the NaN values, it can sometimes be effective to just remove those columns from the data in order to save the others.
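
A minimal sketch of this option, again on a made-up DataFrame: count the NaN's in each column to see where they are concentrated, then drop the affected columns with `dropna(axis='columns')`.

```python
import numpy as np
import pandas as pd

# Toy data where all of the NaN values sit in one column (made-up values)
toy = pd.DataFrame({'HHX': [1, 2, 3],
                    'SLEEP': [7, np.nan, np.nan],
                    'BMI': [22.5, 31.0, 27.1]})

# Number of NaN values per column
nan_counts = toy.isna().sum()
print(nan_counts)

# Drop every column that contains at least one NaN
reduced = toy.dropna(axis='columns')
print(reduced.columns.tolist())
```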

##### 3\. Imputation

Imputation is something that you could learn about for an entire semester (at least), but its basic definition is the replacement of missing values with estimates. There are two main types of imputation: simple (single) imputation and multiple imputation. Simple imputation uses a single estimate to guess what a missing value should be. Examples of this are overall mean imputation, k-nearest-neighbors imputation, and linear regression imputation. Multiple imputation uses multiple estimates to guess what the missing value should be. Multiple imputation methods are generally more effective but are also far more complicated. A commonly used method for multiple imputation is MICE (Multiple Imputation by Chained Equations), which I encourage you to look up if you are interested.
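
As an illustration of the simplest case, here is a sketch of overall mean imputation using `fillna()` on a made-up Series (the values are illustrative, not NHIS data). It is not meant as a recommendation over the other methods.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a numeric column with one missing value (made-up values)
sleep = pd.Series([7.0, np.nan, 8.0, 6.0])

# mean() skips NaN by default, so this is the mean of the observed values
observed_mean = sleep.mean()

# Replace each NaN with that single estimate
sleep_imputed = sleep.fillna(observed_mean)

print(observed_mean)
print(sleep_imputed.tolist())
```

Because every missing entry receives the same value, mean imputation shrinks the spread of the column, which is one reason the more involved methods exist.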