# Project 1

Predicting diabetes using the NHANES dataset

About our dataset
The National Health and Nutrition Examination Survey (NHANES), run by the National Center for Health Statistics, is designed to assess the general health and nutritional status of adults and children in the United States. The survey runs continuously and its data are released in two-year cycles; this project uses the 2013-2014 cycle.


Data:

https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013


Goals:

- refresh general machine learning principles (train/dev/test)
- refresh neural network implementation
- handle an imbalanced dataset

Keras:

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.


In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd

The autoreload extension loaded above lets Jupyter reload modules automatically. Once you import a file as a module, any saved changes to that file are picked up the next time you run a cell in this notebook. For example, go to exercise_1.py, toggle the comment on the second "print" statement in helloworld(), and rerun the next cell several times.

In [3]:
from exercise_1 import helloworld

Using TensorFlow backend.


In [4]:
helloworld()

Hello world!


# Viewing the dataset
We provide a helper function for you to read the SAS files into a pandas dataframe. 

To load it, you will need to install xport (a Python library for reading SAS transport (XPT) files):

```pip install xport```

In [5]:
from utils import merge_xpt, get_training_data

In [6]:
### MODIFY THIS CELL BY POINTING IT TO WHERE YOU EXTRACTED YOUR DATA ###
data_root = "/Users/furon/Desktop/Project 1/Data/nhanes"

In [7]:
df = get_training_data(data_root)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/anbangwu/Downloads/Project1/Data/nhanes\\2013-2014\\questionnaire\\DIQ_H.XPT'

In [None]:
print(df.shape)
df.describe()

The two commands above give a quick feel for the dataset. The shape tells us how many samples and features we are dealing with. describe() reports the non-null count for each column, from which we can work out how many values are missing, along with a few descriptive statistics.
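
As a small illustration of reading missing values off the count row, here is a sketch on a toy dataframe. The column names and values are made up for the example and are not taken from NHANES.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NHANES dataframe; columns and values are hypothetical.
toy = pd.DataFrame({
    "Age": [34.0, 61.0, np.nan, 47.0],
    "BMI": [22.5, np.nan, np.nan, 30.1],
    "Diabetes": [2.0, 1.0, 2.0, 3.0],
})

# describe() reports the non-null count per column in its "count" row.
non_null = toy.describe().loc["count"]

# Missing values per column: total rows minus the non-null count.
missing_from_describe = (len(toy) - non_null).astype(int)

# The same numbers, computed directly.
missing = toy.isna().sum()
print(missing)
```

On the real dataframe, `df.isna().sum()` gives the same per-column tally without the subtraction.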

# Exercise 1: 

The Diabetes column is coded as 1.0: yes and 2.0: no, and for some reason there are a few rows with a 3.0. We don't know what 3.0 means, so we will drop those rows. We will also remove every sample that contains a NaN; this makes training easier later, and it looks like we will still have enough data.

As most of you are familiar with SQL, this reference may be helpful:
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

Use it to complete clean_data_and_labels() in exercise_1.py.
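
One possible shape for the cleaning step is sketched below on a toy dataframe. The helper name `clean_sketch` and the toy columns are hypothetical; your clean_data_and_labels() may differ.

```python
import numpy as np
import pandas as pd


def clean_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep rows labeled 1.0 or 2.0 that have no NaNs."""
    # SQL analogue: SELECT * FROM frame WHERE Diabetes < 3
    # (rows where Diabetes is NaN also fail this comparison and are dropped).
    kept = frame[frame["Diabetes"] < 3]
    # Drop every remaining row that has at least one missing value.
    return kept.dropna().reset_index(drop=True)


# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [34.0, 61.0, np.nan, 47.0, 52.0],
    "Diabetes": [2.0, 1.0, 2.0, 3.0, np.nan],
})
cleaned = clean_sketch(toy)
print(cleaned)
```

Filtering on the label first and calling dropna() second keeps the two rules separate, which makes each one easy to check against the assertions in the cell below.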

In [None]:
from exercise_1 import clean_data_and_labels

In [None]:
df = clean_data_and_labels(df)

In [None]:
# Run this cell to check your code. 
# If you see no output, you have completed the exercise correctly.
assert np.all(df.Diabetes < 3), "Not all labels are < 3.0"
assert np.all(df.count() == len(df)), "There are still NaNs in your data"

# Exercise 2:

With the clean dataset we are ready to build and train the model.

Please build your neural network by completing build_model() in exercise_1.py. The architecture choice is up to you, but please only use fully connected (Dense) layers.

In [None]:
from exercise_1 import build_model, split_x_y

In [None]:
model = build_model()

In [None]:
# Smoke test: if this cell runs without errors, the model is wired up correctly.
x_sm, y_sm = split_x_y(df[0:10])  # load the first 10 samples
model.fit(x_sm, y_sm, batch_size=1, epochs=2, verbose=1)

# Exercise 3:
Model tuning

Optimize your neural network by modifying the cells below. Here are a few hints:

- You are not provided with a validation set, so it may be helpful to make your own. 
- Check the distribution of the labels using df.Diabetes.hist(). What should you do about this? (Optionally, make any changes in preprocess_dataset() in exercise_1.py.)
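
For the first hint, one way to carve out a validation set is a stratified split, which keeps the label proportions the same in both parts. The sketch below uses random toy data with a 10% positive rate; the array names, sizes, and rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and imbalanced labels (1.0 = yes, 2.0 = no), made up for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = np.array([1.0] * 20 + [2.0] * 180)

# stratify=y_toy preserves the 10% / 90% label ratio in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

train_rate = np.mean(y_train == 1.0)
val_rate = np.mean(y_val == 1.0)
print(train_rate, val_rate)
```

Without stratification, a small validation set drawn from an imbalanced dataset can end up with very few positive samples, which makes validation metrics noisy.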

In [None]:
data_root = "/Users/anbangwu/Downloads/Project1/Data/nhanes"
df = get_training_data(data_root)
df = clean_data_and_labels(df)

In [None]:
# useful imports
from sklearn.model_selection import train_test_split
from exercise_1 import preprocess_dataset

In [None]:
X = df.drop("Diabetes", axis=1)
y = df["Diabetes"]

In [None]:
class_weight = {0: 10.,
                1: 1.}
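
The weights above are hand-picked. A common alternative is to derive them from the label frequencies, so that each class contributes roughly equally to the loss. The sketch below assumes class index 0 is the rarer (diabetic) class and uses made-up counts; `class_weight_sketch` is a hypothetical name.

```python
import numpy as np

# Toy class indices: 0 = diabetic (rare), 1 = non-diabetic (common).
labels = np.array([0] * 10 + [1] * 90)

# Number of samples in each class.
counts = np.bincount(labels)

# Inverse-frequency ("balanced") heuristic: n_samples / (n_classes * count).
weights = len(labels) / (len(counts) * counts)

class_weight_sketch = {i: float(w) for i, w in enumerate(weights)}
print(class_weight_sketch)
```

With these toy counts the rare class is weighted nine times as heavily as the common one, which mirrors the 9:1 imbalance.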

In [None]:
model = build_model()

In [None]:
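from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1]; the fitted scaler is reused on the test set.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
y = pd.get_dummies(y)  # one-hot: column 0 is 1.0 (yes), column 1 is 2.0 (no)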

In [None]:
history = model.fit(x=X_scaled, y=y.values.astype("float32"), epochs=200, verbose=0, class_weight=class_weight)

When you are happy with your model training, run the cell below to generate the final labels for submission.

In [None]:
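from utils import get_testing_data
test_data, _ = get_testing_data(data_root)
test_data = pd.DataFrame(test_data)
test_data = scaler.transform(test_data)
# predict_classes() was removed in newer Keras releases; for a softmax
# output, argmax over the predicted probabilities gives the class indices.
predicted = np.argmax(model.predict(test_data), axis=1)
with open("exercise_1_output.txt", "w") as f:
    for p in predicted:
        # Class index 0/1 maps back to the original 1 (yes) / 2 (no) coding.
        f.write("{}\n".format(p + 1))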

In [None]:
# Number of test samples predicted as diabetic (class index 0).
sum(predicted == 0)

# Bonus Exercise 4:
(OPTIONAL)

Is there a relationship between Age and whether a person has Diabetes?
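
One simple way to start is to compare the share of positive labels across age bands. The sketch below uses invented toy values, so its numbers say nothing about NHANES; the column name "Age" and the band edges are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only (1.0 = yes, 2.0 = no).
toy = pd.DataFrame({
    "Age": [25, 32, 41, 48, 55, 63, 67, 72],
    "Diabetes": [2.0, 2.0, 2.0, 1.0, 2.0, 1.0, 1.0, 1.0],
})

# Boolean indicator for a positive label.
toy["has_diabetes"] = toy["Diabetes"] == 1.0

# Bucket ages into bands; intervals are right-inclusive: (0, 40], (40, 60], (60, 120].
toy["age_band"] = pd.cut(
    toy["Age"], bins=[0, 40, 60, 120], labels=["<=40", "41-60", ">60"]
)

# Share of positive labels within each age band.
rate_by_band = toy.groupby("age_band", observed=True)["has_diabetes"].mean()
print(rate_by_band)
```

A bar plot of `rate_by_band`, or a correlation between Age and the indicator column, would be natural next steps on the real data.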

# Bonus Exercise 5:
(OPTIONAL)

Predict whether a person sleeps less or more than the mean number of sleeping hours across the dataset.

# Bonus Exercise 6:
(OPTIONAL)

Predict the person's alcohol consumption.