# Neural Networks

This homework is due on or before Tuesday 30 October, 11:59pm Eastern time. Publish your code to GitHub and provide a link to it in your Canvas submission.

For this problem set, we will use the CDC Diabetes Health Indicators dataset from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators). You can load it into your development environment as a Pandas dataframe with:

```bash
pip install ucimlrepo
```

```python
import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

# metadata
print(cdc_diabetes_health_indicators.metadata)

# variable information
print(cdc_diabetes_health_indicators.variables)
```

This dataset was created by the Centers for Disease Control and Prevention to better understand the relationship between lifestyle and diabetes in the US. Each row represents a person participating in this study.

## Part 1: Feature Selection

Our dataset contains a participant ID column, `Diabetes_binary` (which is the column we will use as our label), and 21 additional columns that can all serve as possible inputs to our model. A complete data dictionary is available at the [UCI dataset page](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators).

**Create a dataframe with `Diabetes_binary` and as many additional columns from our original dataset as you feel are necessary as features for a predictive model. Explain your choices.**

## Part 2: Data Cleaning

Based on the dataset that you created for Part 1, **normalize any numeric features, dummy- or one-hot encode any categorical features, and remove any outliers or spurious records. Explain your choices.**

You can use TensorFlow's [`CategoryEncoding` preprocessing layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding) for any boolean or categorical columns. For more on using preprocessing layers, check out this [TensorFlow tutorial](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers#apply_the_keras_preprocessing_layers).
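To make the two transformations concrete, here is a minimal, library-free sketch of what z-score normalization and one-hot encoding compute. The helper names `standardize` and `one_hot` are hypothetical, and the example assumes `GenHlth` is coded 1 through 5; in practice you would more likely use the Keras preprocessing layers or pandas.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values, num_categories):
    """Encode integer codes 1..num_categories as 0/1 indicator lists."""
    encoded = []
    for v in values:
        if not 1 <= v <= num_categories:
            raise ValueError(f"category code {v} is out of range")
        row = [0] * num_categories
        row[v - 1] = 1
        encoded.append(row)
    return encoded


# BMI and GenHlth values taken from the first five rows of the dataset.
bmi = [40, 25, 28, 27, 24]
gen_hlth = [5, 3, 5, 2, 2]  # assumption: codes run from 1 to 5

bmi_z = standardize(bmi)
gen_hlth_onehot = one_hot(gen_hlth, num_categories=5)
```

Note that this sketch divides by the population standard deviation, while pandas' `.std()` defaults to the sample version, so the two will differ slightly on small samples.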

## Part 3: Feature Engineering

Based on the dataset that you created for Part 2, **create one or more engineered feature columns and explain why you chose to create these, or explain why you don't feel any are needed.**

## Part 4: Binary classification

Based on the dataset that you created for Part 3:

  - Split your dataset into training and testing samples at an 80:20 ratio;
  - Train a feed-forward neural network to predict whether an individual either has diabetes, or is at risk of developing diabetes.
    - This should include at least 2 hidden layers
    - The output layer should be a single neuron with a sigmoid activation function
    - The model should use a [binary crossentropy loss function](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy)
    - You can use the Adam optimizer, or another if you prefer
    - Your [model's metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) should include accuracy and F1 score
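As a rough illustration of why the sigmoid output and the binary crossentropy loss go together, the sketch below computes both by hand in plain Python. The logits and labels are made-up values, and the function names are illustrative rather than part of any library.

```python
import math


def sigmoid(z):
    """Squash a raw score (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical raw outputs of the final neuron for three people, with labels.
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)
```

The loss is small when confident predictions are correct and grows quickly when they are wrong, which is why the output layer needs to produce a probability rather than an unbounded score.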

**What is your model's accuracy on both the training and testing datasets?**

  - How does the accuracy compare against the F1 score?
  - Looking at a confusion matrix of your model's predictions (i.e., the true and false positive and negative predictions), would you consider your model to be a good classifier or not? Why?
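The relationship between accuracy and F1 score can be sketched directly from the four confusion-matrix counts. The counts below are hypothetical, chosen to mimic a classifier that predicts "no diabetes" for almost everyone on an imbalanced dataset; they are not results from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 100 true positives in 1,000 people, most of them missed.
metrics = classification_metrics(tp=5, fp=5, fn=95, tn=895)
```

With these made-up counts the accuracy is 0.9 while the F1 score is below 0.1, which is the kind of gap to look for when judging whether a model is a good classifier.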

In [1]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [3]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)

# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets

In [7]:
X.head()


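| | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 |
| 1 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 |
| 2 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 |
| 3 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 |
| 4 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 |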


In [23]:
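# Copy the selection so that adding columns later does not modify (or warn about) a view of X.
new_data = X.loc[:, ["Age", "BMI", "PhysHlth", "DiffWalk", "HighChol", "PhysActivity", "MentHlth"]].copy()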
new_data.head()

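| | Age | BMI | PhysHlth | DiffWalk | HighChol | PhysActivity | MentHlth |
|---|---|---|---|---|---|---|---|
| 0 | 9 | 40 | 15 | 1 | 1 | 0 | 18 |
| 1 | 7 | 25 | 0 | 0 | 0 | 1 | 0 |
| 2 | 9 | 28 | 30 | 1 | 1 | 0 | 30 |
| 3 | 11 | 27 | 0 | 0 | 0 | 1 | 0 |
| 4 | 11 | 24 | 0 | 0 | 1 | 1 | 3 |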


In [26]:
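# NOTE: Age is stored as an age-group code (7 to 11 in the rows above), not in years,
# so `Age > 33` is never true and Unhealthy_old is 0 for every row (see describe() below).
new_data["Unhealthy_old"] = (
    (new_data["Age"] > 33) &
    (new_data["PhysHlth"] > 7)
).astype(int)
# Mean of the PhysHlth and MentHlth values.
new_data["overall_health"] = (new_data["PhysHlth"] + new_data["MentHlth"]) / 2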


new_data.head()

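| | Age | BMI | PhysHlth | DiffWalk | HighChol | PhysActivity | MentHlth | Unhealthy_old | overall_health |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 40 | 15 | 1 | 1 | 0 | 18 | 0 | 16.5 |
| 1 | 7 | 25 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 |
| 2 | 9 | 28 | 30 | 1 | 1 | 0 | 30 | 0 | 30.0 |
| 3 | 11 | 27 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 |
| 4 | 11 | 24 | 0 | 0 | 1 | 1 | 3 | 0 | 1.5 |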


In [34]:
avg_age = new_data["Age"].mean()
st_dev = new_data["Age"].std()
new_data["Age"] = (new_data["Age"] - avg_age) / st_dev

In [35]:
new_data.describe()

| | Age | BMI | PhysHlth | DiffWalk | HighChol | PhysActivity | MentHlth | Unhealthy_old | overall_health |
|---|---|---|---|---|---|---|---|---|---|
| count | 253680.0 | 253680.0 | 253680.0 | 253680.0 | 253680.0 | 253680.0 | 253680.0 | 253680.0 | 253680.0 |
| mean | 9.500792e-17 | 28.382364 | 4.242081 | 0.168224 | 0.424121 | 0.756544 | 3.184772 | 0.0 | 3.713426 |
| std | 1.0 | 6.608694 | 8.717951 | 0.374066 | 0.49421 | 0.429169 | 7.412847 | 0.0 | 6.645639 |
| min | -2.302427 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.6653479 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.01051634 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 |
| 75% | 0.6443152 | 31.0 | 3.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 3.5 |
| max | 1.626563 | 98.0 | 30.0 | 1.0 | 1.0 | 1.0 | 30.0 | 0.0 | 30.0 |


In [37]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(new_data,y, test_size=0.2)

X_train.describe()

| | Age | BMI | PhysHlth | DiffWalk | HighChol | PhysActivity | MentHlth | Unhealthy_old | overall_health |
|---|---|---|---|---|---|---|---|---|---|
| count | 202944.0 | 202944.0 | 202944.0 | 202944.0 | 202944.0 | 202944.0 | 202944.0 | 202944.0 | 202944.0 |
| mean | 0.001137 | 28.382381 | 4.247497 | 0.167756 | 0.424871 | 0.75682 | 3.19329 | 0.0 | 3.720393 |
| std | 0.998847 | 6.605753 | 8.72739 | 0.37365 | 0.494325 | 0.429004 | 7.422481 | 0.0 | 6.655536 |
| min | -2.302427 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.665348 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.010516 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 |
| 75% | 0.644315 | 31.0 | 3.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 3.5 |
| max | 1.626563 | 98.0 | 30.0 | 1.0 | 1.0 | 1.0 | 30.0 | 0.0 | 30.0 |


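In [39]:
import tensorflow as tf

In [40]:
def df_to_dataset(dataframe, labels, shuffle=True, batch_size=32):
  """Convert a feature dataframe and its labels into a batched tf.data.Dataset."""
  # Reshape every column to (n, 1) so it matches the Input(shape=(1,)) layers.
  features = {key: value.to_numpy()[:, tf.newaxis] for key, value in dataframe.items()}
  ds = tf.data.Dataset.from_tensor_slices((features, labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  # Let TensorFlow choose how many batches to prepare ahead of time.
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds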

In [41]:
inputs = {
    'Age':
        tf.keras.layers.Input(
            shape=(1,),
            dtype=tf.float32,
            name='Age'),
    'BMI':
        tf.keras.layers.Input(
            shape=(1,),
            dtype=tf.float32,
            name='BMI'),
    'overall_health':
        tf.keras.layers.Input(
            shape=(1,),
            dtype=tf.float32,
            name='overall_health')
}

In [43]:
# Concatenate our inputs into a single tensor.
preprocessing_layers = tf.keras.layers.Concatenate()(
    [inputs.get('Age'), inputs.get('BMI'), inputs.get('overall_health')])

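# ReLU activations make the hidden layers non-linear; without them the
# stacked Dense layers collapse into a single linear transformation.
hidden1 = tf.keras.layers.Dense(
    units=32, activation='relu', name='hidden1')(preprocessing_layers)

hidden2 = tf.keras.layers.Dense(
    units=8, activation='relu', name='hidden2')(hidden1)

# A single sigmoid neuron outputs a probability, which is what
# BinaryCrossentropy expects by default (from_logits=False).
dense_output = tf.keras.layers.Dense(
    units=1,
    activation='sigmoid',
    name='dense_output')(hidden2)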

# Define an output dictionary we'll send to the model constructor.
outputs = {
    'dense_output': dense_output
}

In [None]:
dnn_model = tf.keras.Model(inputs=inputs, outputs=outputs)
dnn_model.compile(
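    # Adam optimizer
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    # Binary crossentropy
    loss=tf.keras.losses.BinaryCrossentropy(),
    # Accuracy and F1, as the assignment requires. threshold=0.5 converts the
    # sigmoid output to a 0/1 prediction; F1Score assumes a recent TensorFlow (2.13+).
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='accuracy'),
        tf.keras.metrics.F1Score(threshold=0.5),
    ]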
)