# Predicting Diabetes Occurrence in Pima Indians

In this notebook, we use the Pima Indians Diabetes dataset to build a model that predicts, from the recorded factors, whether a given Pima Indian develops diabetes. The large number of missing values in the dataset will be filled in using **stochastic regression imputation**.
Owing to the **class imbalance**, we use the precision and recall metrics to get a clearer picture of the model's performance.

These concepts will be explained in more detail in the relevant sections, and links to pages where you can learn more will be provided at the bottom.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Checking out the data

In [None]:
full_dataset = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
#Dataset accessible at https://www.kaggle.com/uciml/pima-indians-diabetes-database

print("Shape of dataset: " + str(full_dataset.shape) + "\n")
print("Number of zero outcomes (Did not develop diabetes): " + str(len(full_dataset[full_dataset["Outcome"]==0]))+ "\n")
print("Number of one outcomes (Developed diabetes): " + str(len(full_dataset[full_dataset["Outcome"]==1]))+ "\n")

pd.options.display.width = 0

print(full_dataset.head())

print("\nStats: \n")

print(full_dataset.describe())

Note that of all 768 records, only 268 developed diabetes, about a third of the dataset. This is a **class imbalance**. A class imbalance can produce a misleadingly high accuracy score: for example, a model that only ever predicts an outcome of zero would be about 65% accurate on this dataset (500 of 768 records).
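
To make that baseline concrete, here is a minimal sketch that rebuilds the label counts quoted above and scores a "model" that always predicts zero. The variable names are made up for illustration, and the labels are reconstructed from the counts rather than read from the CSV.

```python
import numpy as np

# Class counts quoted above: 768 records, 268 of them positive.
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

# Labels for the whole dataset, and a "model" that always predicts 0.
y_true = np.array([1] * n_positive + [0] * n_negative)
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of records where the prediction matches the label.
accuracy = (y_pred == y_true).mean()

# Recall: fraction of actual positives that were predicted positive.
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / n_positive

print(f"Baseline accuracy: {accuracy:.3f}")
print(f"Baseline recall:   {recall:.3f}")
```

The accuracy looks respectable, while the recall is zero because not a single diabetic case is caught. That gap is why accuracy alone is not enough here.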

Note also that some zero values are present where the feature logically cannot be a zero. For example, a person cannot have zero insulin, or they would be dead. The same goes for blood pressure, skin thickness, and BMI. We'll find out, in the first and second code cells below, how many items in each column are zero or NaN.

We will fill these zero values by using some method of data imputation. 

In [None]:
(full_dataset==0).sum(axis=0)  #this shows how many values are zero in each column

In [None]:
full_dataset.isna().sum(axis=0)  #this shows how many values are nan in each column

# Preparing the data

You will notice that while there are a few zero values in some fields - 11 for BMI, 35 for blood pressure, and 5 for glucose - there are many zero values for skin thickness and Insulin (227 and 374 respectively). These numbers represent a large fraction of the dataset, and therefore what we do with them carries more weight.

Let us use a simple approach to fill the BMI, BloodPressure and Glucose columns - we will calculate the mean of each of those columns across the entire dataset, and replace the zero values with the corresponding mean.

In [None]:
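def replace_with_mean(df, feature):
    # Zeros mark missing values here, so leave them out when computing the mean
    feature_mean = df.loc[df[feature] != 0, feature].mean()
    # Replace every zero in the column with that mean
    df[feature] = df[feature].replace(0, feature_mean)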


replace_with_mean(full_dataset,"BloodPressure")
replace_with_mean(full_dataset,"BMI")
replace_with_mean(full_dataset,"Glucose")

You can run this code cell to verify that there are no more zero values in these 3 columns.

In [None]:
(full_dataset==0).sum(axis=0)

Dealing with the Insulin and SkinThickness values will be more complicated. If we simply fill all the missing values with the column mean, that reduces the variance of the data. There will be too many people all having the exact same value of insulin or skin thickness. This fools the model into getting the wrong idea of how much effect those features have on the outcome.
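
Here is a small sketch of that variance shrinkage, using synthetic numbers rather than the real Insulin column. The distribution, the 50% missing rate and the variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: 1000 values, roughly half of them marked as missing.
values = rng.normal(loc=150.0, scale=40.0, size=1000)
missing = rng.random(1000) < 0.5
observed = values[~missing]

# Mean imputation: every missing entry gets the mean of the observed entries.
imputed = values.copy()
imputed[missing] = observed.mean()

print("Std of observed values:    ", observed.std())
print("Std after mean imputation: ", imputed.std())
```

Every imputed entry sits exactly on the mean and contributes nothing to the spread, so the standard deviation of the filled column is smaller than that of the values that were actually observed.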


An approach we can take is to use data from similar cases to estimate a replacement value for the missing feature.

To get an idea of how to choose similar cases, let's plot a correlation matrix. This shows the relationship between every pair of features in the dataset. 

To plot this correlation matrix, we will have to eliminate all records with those zero values. Let's do that on a copy of the full dataset.


In [None]:
modified = full_dataset[(full_dataset["Insulin"]!=0) & (full_dataset["SkinThickness"]!=0)]
(modified==0).sum(axis=0)

In [None]:
corrMatrix = modified.corr()
sns.heatmap(corrMatrix,annot=True)

Observations:
High correlation between skin thickness and BMI, and between insulin and glucose.

So we'll try to fill in the missing values using those. Let's make scatterplots to get an idea of how exactly they are related.

In [None]:
AX = sns.regplot(x="SkinThickness",y="BMI",data=modified)

In [None]:
AX = sns.regplot(x="Glucose",y="Insulin",data=modified)

By eyeballing these two plots, you can tell that linear regression with a little random error added would be a good way to fill in the missing skin thickness and insulin values. We don't want to use the exact values lying on the fitted line, as that would also give a false sense of uniformity.

You can tell that the majority of the data points lie in a band around the line, something like ±10 in the BMI plot and ±50 in the insulin plot. So we can add a random number from that range to the predicted value.

That is what **stochastic regression imputation** is. 
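
The idea can be sketched end to end on synthetic data. This is an illustration under assumptions, not the code used in this notebook: the data is made up, the line is fitted with `np.polyfit` instead of scikit-learn's LinearRegression, and the ±10 error band is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x, plus some noise.
x = rng.uniform(20.0, 50.0, size=200)
y = 0.9 * x + rng.normal(0.0, 5.0, size=200)

# Mark about 30% of y as missing, using 0 as the placeholder like the dataset does.
missing = rng.random(200) < 0.3
y_with_gaps = np.where(missing, 0.0, y)

# Step 1: fit a straight line on the complete rows only.
slope, intercept = np.polyfit(x[~missing], y_with_gaps[~missing], deg=1)

# Step 2: predict the missing values from x; these lie exactly on the line.
on_the_line = slope * x[missing] + intercept

# Step 3: add a random error so the imputed points scatter around the line.
error_range = 10  # assumed band, the kind of number you read off a scatterplot
noise = rng.uniform(-error_range, error_range, size=on_the_line.size)

y_filled = y_with_gaps.copy()
y_filled[missing] = on_the_line + noise
```

Skipping step 3 would give plain regression imputation, where every filled value falls exactly on the fitted line.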

Let's build models to predict insulin and skin thickness based on glucose and BMI respectively. 

We'll be using scikit-learn's inbuilt LinearRegression model to do this. 

In [None]:
BMI = modified["BMI"].to_numpy().reshape((-1,1))  #reshaping the dataframe in a way that the LinearRegression model can accept
SkinT = modified["SkinThickness"].to_numpy()
skint_model = LinearRegression().fit(BMI,SkinT)

In [None]:
#Running a test prediction! The double set of square brackets is because the model expects a 2D array containing the data

skint_model.predict([[40]])

Let's do the same for Insulin.

In [None]:
Glucose = modified["Glucose"].to_numpy().reshape((-1,1))
Insulin = modified["Insulin"].to_numpy()
insulin_model = LinearRegression().fit(Glucose,Insulin)

In [None]:
insulin_model.predict([[100]])

Now to replace the zero values of Insulin and Skin Thickness with the predicted values. We'll write a function that takes the dataset, names of the features, the trained model, and the error range as arguments, and replaces the relevant zero values with the values obtained by getting the model's prediction and adding a random error.

In [None]:
def replace(df,model,x,y,error_range):
    #fill the missing values of the feature y, using values predicted from its x value.

    reshaped = df[x].to_numpy().reshape((-1,1))   #reshaping before feeding to the LinearRegression model.
    y_pred = model.predict(reshaped)              #values predicted to be on the line

    random_err_array = np.random.randint(low=-1*error_range,high=error_range,size=len(y_pred))
    #generating an array of random integers within the given error range (the upper bound is exclusive)

    y_pred = y_pred + random_err_array            #adding the random errors

    missing = (df[y]==0).to_numpy()               #boolean mask of the rows whose y value is missing
    df[y] = df[y].astype(float)                   #the column must hold floats to store the predicted values
    df.loc[missing,y] = y_pred[missing]           #replacing the zero values in the dataset.
    #using .loc with a mask avoids chained indexing (df[y][i] = ...), which pandas warns may not write to df
            


replace(full_dataset,insulin_model,"Glucose","Insulin",50)
replace(full_dataset,skint_model,"BMI","SkinThickness",10)
full_dataset.head()

In [None]:
(full_dataset==0).sum(axis=0)

# Scaling the data and removing outliers (Normalization)

It's important that all the data be in roughly the same range. If not, the model may assign a disproportionately higher importance to features with larger absolute values. Scaling the features to the same range also allows gradient descent to converge faster. 

A popular method of scaling data is to use the Z-score. That's what we do in the following cell.

In [None]:
full_mean = full_dataset.mean()
full_std = full_dataset.std()

full_norm = (full_dataset-full_mean)/full_std

full_norm["Outcome"] = full_dataset["Outcome"] 
#we don't want the outcome column to be scaled, so we replace it with the original
full_norm.head()
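
As a quick sanity check on what Z-scoring does, here is the same computation on a tiny made-up array. The values are hypothetical; `ddof=1` is used because the pandas `std()` method computes the sample standard deviation by default.

```python
import numpy as np

# Hypothetical feature values on a large scale.
feature = np.array([80.0, 95.0, 110.0, 125.0, 140.0])

mean = feature.mean()
std = feature.std(ddof=1)   # sample standard deviation, matching the pandas default

# Z-score: how many standard deviations each value lies from the mean.
z_scores = (feature - mean) / std
print(z_scores)
```

After scaling, the values have a mean of zero and a standard deviation of one, whatever units the feature started in.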


Let's take a look at the data to check for any remaining outliers. We generally expect Z-scores to lie between -3 and +3.

In [None]:
sns.histplot(full_norm["Pregnancies"], kde=True)  #histplot replaces distplot, which is deprecated in newer seaborn versions

In [None]:
sns.histplot(full_norm["Insulin"], kde=True)

Turns out there are some large positive outliers. Let's put a cap on the values in each column - if any value is greater than 3, it gets replaced by 3. This won't affect the Outcome column, as its values are all 0 or 1. 

After running the below cell, you can re-run the two cells above to observe any changes.

In [None]:
#cap every value at 3 (vectorized equivalent of applying min(x,3) to each entry)
full_norm = full_norm.clip(upper=3)

We are finally done with normalizing the data!

# Preparing the model

We start by shuffling the data and splitting into train and test sets. 

In [None]:
final_shuffled = full_norm.sample(frac=1).reset_index(drop=True)  
#this fully shuffles the dataset to avoid any sort of ordering being recognized by the model as a trainable quality

final_shuffled.head(10)

In [None]:
train, test = train_test_split(final_shuffled, test_size=0.2)
#the model will be trained on the train set with a validation split of 0.2
#test will be used to evaluate the model

In [None]:
train_x = train.iloc[:,0:8]
train_y = train.iloc[:,8]

test_x = test.iloc[:,0:8]
test_y = test.iloc[:,8]

In [None]:
train_x.head()

**Defining plotting functions**

This plotting function takes a model's training history and a list of metrics, and plots how each metric changes over successive epochs. This is a better way to visualize how the model learns than reading through lines of verbose output.

In [None]:
def plot_curve(epochs, history, list_of_metrics):  

  plt.figure()
  plt.xlabel("Epoch")
  plt.ylabel("Value")

  for m in list_of_metrics:
    x = history[m]
    plt.plot(epochs[1:], x[1:], label=m)

  plt.legend()

print("Defined the function which plots the learning curve.")

**Creating the model**

We're using a simple Sequential model with one output neuron. Sigmoid activation is used for the neuron, so that we get a probability as an output. For example, a prediction of 0.65 means that the model gives that individual a 65% probability of having diabetes.
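
A single sigmoid neuron computes a weighted sum of the 8 features plus a bias, and squashes it into the range 0 to 1. Here is a rough NumPy sketch of that forward pass. The weights, bias and feature values are hypothetical; a trained model learns its own.

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias, one weight per input feature.
weights = np.array([0.4, 1.1, -0.2, 0.1, 0.3, 0.7, 0.3, 0.5])
bias = -0.8

# One person described by 8 normalized features (made-up values).
features = np.array([0.6, 1.2, -0.1, 0.3, 0.0, 0.9, -0.4, 0.5])

# Weighted sum plus bias, passed through the sigmoid.
probability = sigmoid(features @ weights + bias)
print(f"Predicted probability of diabetes: {probability:.2f}")
```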

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(8,),activation=tf.sigmoid),)

Here come the classification threshold and the precision and recall metrics; the latter two were mentioned at the beginning of the notebook.

**Classification threshold** - the point above which the output probability is classified as a 1. Intuitively, we tend to use 0.5 as the threshold. But, depending on the data, especially in instances of class imbalance like we have here, changing up the threshold and seeing how it affects the metrics will help us tune the model. We'll be starting with a 0.5 here, and changing it up a bit.

**Precision** - As I mentioned earlier, accuracy isn't the greatest metric in situations with a class imbalance. In binary classification problems, the precision metric will tell you the ratio of correctly predicted positives to total predicted positives. For this particular problem, precision is the percentage of times the model is correct when it predicts that a particular individual has diabetes.

**Recall** - The recall metric tells you the ratio of correctly predicted positives to total actual positives. For this problem, it's the percentage of times the model predicts diabetes in people who actually have diabetes. 


Let's say our model is being used to warn people that they might have diabetes, and to get tested. We would want to make sure we warned all of the people who actually had diabetes, right? If we warn a few people who don't actually have diabetes, that's fine. But we want to make sure that everyone who actually has diabetes is warned. 

In other words, we want a high recall. But be careful - high recall usually leads to lower precision, so keep an eye on the precision metric to make sure it's not dropping too much.
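
To see the trade-off in numbers, here is a minimal sketch that computes precision and recall by hand at two thresholds. The labels and probabilities are invented for illustration, and `precision_recall` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def precision_recall(y_true, probabilities, threshold):
    """Precision and recall when probabilities above the threshold count as 1."""
    y_pred = (probabilities > threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())   # correctly predicted positives
    fp = int(((y_pred == 1) & (y_true == 0)).sum())   # false alarms
    fn = int(((y_pred == 0) & (y_true == 1)).sum())   # missed positives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical labels and model outputs for ten people.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
probs = np.array([0.9, 0.6, 0.45, 0.4, 0.55, 0.38, 0.36, 0.2, 0.1, 0.05])

for threshold in (0.5, 0.35):
    p, r = precision_recall(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

With these made-up numbers, lowering the threshold from 0.5 to 0.35 lifts the recall from 0.50 to 1.00, while the precision drops from about 0.67 to about 0.57.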

In [None]:
classification_threshold = 0.5

METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold),
      tf.keras.metrics.Precision(name='precision',thresholds=classification_threshold),
      tf.keras.metrics.Recall(name='recall',thresholds=classification_threshold)
]

In [None]:
model.compile(optimizer='Adam',loss=tf.keras.losses.BinaryCrossentropy(),metrics=METRICS)

In [None]:
history = model.fit(train_x,train_y,batch_size=10,epochs=50,validation_split=0.2, verbose=0)

In [None]:
epochs = history.epoch
hist = pd.DataFrame(history.history)
plot_curve(epochs,hist,['accuracy','precision','recall','loss'])

In [None]:
model.evaluate(test_x,test_y)

You can see that the recall metric isn't doing well during either the training or the testing. Let's lower the threshold to 0.35 and see how it looks.

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(8,),activation=tf.sigmoid),)

classification_threshold = 0.35

METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold),
      tf.keras.metrics.Precision(name='precision',thresholds=classification_threshold),
      tf.keras.metrics.Recall(name='recall',thresholds=classification_threshold)
]

model.compile(optimizer='Adam',loss=tf.keras.losses.BinaryCrossentropy(),metrics=METRICS)
history = model.fit(train_x,train_y,batch_size=10,epochs=50,validation_split=0.2,verbose=0)
epochs = history.epoch
hist = pd.DataFrame(history.history)
plot_curve(epochs,hist,['accuracy','precision','recall','loss'])

This graph looks much better. Let's evaluate the model on the test set.

In [None]:
model.evaluate(test_x,test_y)

Notice that the recall is higher during training than during testing?

This sounds like overfitting. Let's introduce some regularization. 

Regularization works by discouraging the model's weights from continually growing in an attempt to reach zero loss. A penalty term which estimates the complexity of the model is added to the loss; for L2 regularization, that penalty is proportional to the sum of the squared weights. For logistic regression, L2 regularization is often used.
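
As a rough sketch of how the penalty behaves, the snippet below adds an L2 term to a binary cross-entropy loss for two hypothetical weight vectors. The penalty is computed as the rate times the sum of squared weights, which matches how the Keras L2 regularizer is documented; the labels, probabilities and weights are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob):
    # Clip to avoid log(0), then average the per-example losses.
    eps = 1e-7
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def l2_penalty(weights, l2):
    # Regularization rate times the sum of squared weights.
    return float(l2 * np.sum(weights ** 2))

# Hypothetical labels and predicted probabilities.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_prob = np.array([0.8, 0.3, 0.6, 0.1])

# Two hypothetical weight vectors, assumed to give the same predictions.
small_weights = np.array([0.5, -0.5])
large_weights = np.array([3.0, -3.0])
l2 = 0.2

data_loss = binary_crossentropy(y_true, y_prob)
print("Total loss, small weights:", data_loss + l2_penalty(small_weights, l2))
print("Total loss, large weights:", data_loss + l2_penalty(large_weights, l2))
```

The data loss is the same in both cases, but the large weights pay a much bigger penalty, so the optimizer is nudged toward the smaller ones.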

In [None]:
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=1, input_shape=(8,),
                                kernel_regularizer=tf.keras.regularizers.L1L2(l1=0.0, l2=0.2),
                                activation=tf.sigmoid))

classification_threshold = 0.35

METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy', threshold=classification_threshold),
      tf.keras.metrics.Precision(name='precision',thresholds=classification_threshold),
      tf.keras.metrics.Recall(name='recall',thresholds=classification_threshold)
]

model.compile(optimizer='Adam',loss=tf.keras.losses.BinaryCrossentropy(),metrics=METRICS)
history = model.fit(train_x,train_y,batch_size=10,epochs=50,validation_split=0.2,verbose =0)
epochs = history.epoch
hist = pd.DataFrame(history.history)
plot_curve(epochs,hist,['accuracy','precision','recall','loss'])

In [None]:
model.evaluate(test_x,test_y)

We finally have a higher recall than before! It's not perfect, but it looks pretty good. Our precision is also pretty good-looking.

Feel free to copy and edit this notebook, and play around with the classification threshold and L1 and L2 values to see how much more you can optimize these metrics. Make sure you're training a fresh model from scratch each time. You can edit the values in the single cell above and run it repeatedly. 

Happy training!

***References:***

Data imputation - an overview of the methods used:

https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/

Why mean imputation isn't always a good idea:

https://www.theanalysisfactor.com/mean-imputation/

The effect of outliers and how scaling with Z-scores works:

https://developers.google.com/machine-learning/crash-course/representation/cleaning-data

Precision and recall metrics:

https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

L2 regularization and the Lambda value:

https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization
https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/lambda
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
