# Diabetes Prediction
HCDE 530 - Spring 2021 - Mini Project 2

## **Project Overview**

After experimenting with several ideas, I chose a topic that plays a significant role in my day-to-day activities. I first looked at mortality data from the CDC, then scraped Twitter data, and finally pivoted a third time to analyzing a diabetes dataset. I obtained the CSV file, which originates from the UCI Machine Learning Repository, through [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database/data).

Diabetes affects millions of adults in the US, disproportionately [affecting minorities](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3830901/). For this project, I'll explore how well several [scikit-learn](https://scikit-learn.org/) ML models predict diabetes and which factors influence the diagnosis. I'll also compare the accuracy of the models on the diabetes dataset.

[source](https://www.kaggle.com/uciml/pima-indians-diabetes-database/data)


## **Method**
[scikit-learn](https://scikit-learn.org/)
   * kNN
   * Decision Tree
   * Random Forest
   * Logistic Regression
   * Neural Network (multi-layer perceptron)


## **Analysis**

1. Import libraries & load the diabetes dataset

2. Data analysis & visualization, including relationships in the data

3. Data preparation: split & train

4. Exploring the models
    


#### 1. Import libraries and load diabetes.csv as a dataframe

In [None]:
import pandas as pd  # data analysis
import numpy as np  # numerical computing
import matplotlib.pyplot as plt  # visualization library
import seaborn as sns  # statistical visualization
import warnings  # used to hide warnings in the notebook

# ignore warnings
warnings.filterwarnings('ignore')

# show plots inline in the Kaggle notebook
%matplotlib inline
sns.set_style('darkgrid')


In [None]:
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
data.head(10)

In [None]:
data.shape


In [None]:
data.describe()

The dataset has 9 columns and 768 rows. The code below shows the data types. 

In [None]:
#data.describe()
data.info()

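Based on the output above, seven columns have the `int64` data type and two (`BMI` and `DiabetesPedigreeFunction`) are `float64`. `Outcome` is stored as an integer at this point; it becomes an `object` column only after its values are replaced with text labels in the next section.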


#### 2. Data analysis & visualization

#### **Data**

**Pregnancies** Number of times pregnant

**Glucose** Plasma glucose concentration at 2 hours in an oral glucose tolerance test

**BloodPressure** Diastolic blood pressure (mm Hg)

**SkinThickness** Triceps skin fold thickness (mm)

**Insulin** 2-Hour serum insulin (mu U/ml)

**BMI** Body mass index (weight in kg/(height in m)^2)

**DiabetesPedigreeFunction** Diabetes pedigree function

**Age** Age (years)

**Outcome** Class variable (0 or 1); 268 of 768 are 1, the others are 0

In [None]:
data.head(10)

Replacing 1 and 0 with Diabetic and Not Diabetic using the `.replace()` method

In [None]:
data['Outcome'] = data['Outcome'].replace([0],'Not Diabetic')


In [None]:
data['Outcome'] = data['Outcome'].replace([1],'Diabetic')
data.head()

#### Dependent variable

Visualizing the distribution of the dependent variable, `Outcome`, to see whether the dataset is balanced. The target variable is the outcome: diabetic (originally coded 1) or non-diabetic (originally coded 0).

In [None]:
# seaborn count plot of each outcome class
sns.countplot(x='Outcome', data=data, palette="magma")
print(data.groupby('Outcome').size())


The counts show roughly a 1:2 ratio between diabetics (268) and non-diabetics (500), so the classes are moderately imbalanced.
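
To put the imbalance in numbers, the sketch below rebuilds the label column from the class counts listed in the data description (268 diabetic, 500 non-diabetic) and computes each class's share. The `labels` Series and the `baseline_accuracy` name are illustrative stand-ins, not part of the original notebook. The majority-class share is a useful reference point: a model that always predicts "Not Diabetic" would already reach that accuracy.

```python
import pandas as pd

# Stand-in for data['Outcome'], rebuilt from the class counts in the data description
labels = pd.Series(['Diabetic'] * 268 + ['Not Diabetic'] * 500, name='Outcome')

# Share of each class in the dataset
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of always predicting the majority class
baseline_accuracy = class_share.max()
print("Majority-class baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any model explored later should beat this baseline to be considered useful.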

#### Independent variable distributions

In [None]:
#Pregnancies
plt.figure(1)
plt.subplot(121), sns.distplot(data['Pregnancies'])
plt.subplot(122), data['Pregnancies'].plot.box(figsize=(16,5))
plt.show()
#print(data.groupby('Pregnancies').size())


In [None]:
#Glucose
plt.figure(1)
plt.subplot(121), sns.distplot(data['Glucose'])
plt.subplot(122), data['Glucose'].plot.box(figsize=(16,5))
plt.show()

In [None]:
# Blood Pressure
plt.figure(1)
plt.subplot(121), sns.distplot(data['BloodPressure'])
plt.subplot(122), data['BloodPressure'].plot.box(figsize=(16,5))
plt.show()

In [None]:
#Skin Thickness
plt.figure(1)
plt.subplot(121), sns.distplot(data['SkinThickness'])
plt.subplot(122), data['SkinThickness'].plot.box(figsize=(16,5))
plt.show()

In [None]:
#insulin
plt.figure(1)
plt.subplot(121), sns.distplot(data['Insulin'])
plt.subplot(122), data['Insulin'].plot.box(figsize=(16,5))
plt.show()

In [None]:
#BMI 
plt.figure(1)
plt.subplot(121), sns.distplot(data['BMI'])
plt.subplot(122), data['BMI'].plot.box(figsize=(16,5))
plt.show()

In [None]:
# Diabetes Pedigree Function
plt.figure(1)
plt.subplot(121), sns.distplot(data['DiabetesPedigreeFunction'])
plt.subplot(122), data['DiabetesPedigreeFunction'].plot.box(figsize=(16,5))
plt.show()

In [None]:
# Age 
plt.figure(1)
plt.subplot(121), sns.distplot(data['Age'])
plt.subplot(122), data['Age'].plot.box(figsize=(16,5))
plt.show()
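
The box plots above mark points beyond the whiskers as outliers. As a rough sketch of how that is determined, the hypothetical helper `count_iqr_outliers` below applies the common 1.5 × IQR rule (matplotlib's default whisker length) to a small made-up sample. In the notebook it could be called on any column, for example `data['Insulin']`.

```python
import numpy as np

def count_iqr_outliers(values, whisker=1.5):
    """Count values lying more than `whisker` * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up sample: eight ordinary values and one extreme value
sample = [10, 12, 11, 13, 12, 11, 10, 13, 60]
print(count_iqr_outliers(sample))
```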

**Independent Feature Distribution Matrix**

In [None]:
r = data.hist(figsize=(20,20))

**Plotting relationships in the dataset**

In [None]:
#Check for missing or null  values in the dataset.
#data.isnull().values.any()
data.isnull().sum()

In [None]:
sns.scatterplot(x='Age',y='Insulin',data=data)


In [None]:
sns.pointplot(x='Outcome', y= 'BMI', data=data)


In [None]:
f, ax = plt.subplots(figsize=(10, 10))
ax=sns.swarmplot(x="Pregnancies", y="Age", hue="Outcome",
              palette="magma", data=data)


In [None]:
sns.boxplot(x='Outcome', y='Glucose', data=data)

In [None]:
# correlation of skin thickness and insulin
sns.regplot(x='SkinThickness', y= 'Insulin', data=data)

**Display the relationship of diabetics and non-diabetics across the feature matrix**


The pairplot matrix below displays the relationship of diabetics and non-diabetics with respect to the eight independent features in the dataset. 

In [None]:
sns.pairplot(data, hue="Outcome")

Showing the correlation using `corr()`

In [None]:
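# Outcome now holds text labels, so restrict the correlation to numeric columns
data.select_dtypes(include='number').corr()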

Visualizing the correlation of features in the dataset using `heatmap`

In [None]:
# get the pairwise correlations of the numeric features in the dataset
corrmat = data.select_dtypes(include='number').corr()
plt.figure(figsize=(15,12))

# plot the heat map, annotating each cell with its correlation coefficient
g = sns.heatmap(corrmat, annot=True, cmap="magma")
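
To read the heat map as a ranking, one option is to correlate each feature with a numeric copy of the outcome and sort the result. The sketch below uses a small made-up DataFrame in place of `data`; the values and the `toy`, `outcome_numeric` and `feature_corr` names are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the diabetes dataframe
toy = pd.DataFrame({
    'Glucose': [85, 90, 150, 160, 100, 170],
    'Age': [25, 50, 30, 45, 35, 40],
    'Outcome': ['Not Diabetic', 'Not Diabetic', 'Diabetic',
                'Diabetic', 'Not Diabetic', 'Diabetic'],
})

# Map the text labels back to 0/1 so the outcome can be correlated
outcome_numeric = (toy['Outcome'] == 'Diabetic').astype(int)

# Pearson correlation of each numeric feature with the outcome, strongest first
feature_corr = (toy.select_dtypes(include='number')
                   .corrwith(outcome_numeric)
                   .sort_values(ascending=False))
print(feature_corr)
```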

#### 3. Data Preparation, Splitting & Training



**Split the dataset into independent and dependent variables.**

* X: independent variables
* y: dependent variable

In [None]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

**Split the data into training and test sets**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33,
                                                    random_state=101)
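
Because the classes are imbalanced, a plain random split can leave the training and test sets with slightly different class ratios. One option (not used in this notebook) is the `stratify` argument of `train_test_split`, which preserves the ratio in both parts. The sketch below uses placeholder features and labels with the same class counts as the dataset; the `*_demo`, `X_tr` and `X_te` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 768 rows with 268 positives, matching the dataset's class counts
y_demo = np.array([1] * 268 + [0] * 500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=101, stratify=y_demo)

# With stratify, both parts keep close to the overall share of positives
print("overall: {:.3f}, train: {:.3f}, test: {:.3f}".format(
    y_demo.mean(), y_tr.mean(), y_te.mean()))
```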

**Check for zero values that stand in for missing data**

In [None]:
# count the rows where each feature is 0; for most features a 0 means "not recorded"
print("total number of rows: {0}".format(len(data)))
for col in data.columns.drop('Outcome'):
    print("number of rows with 0 in {0}: {1}".format(col, (data[col] == 0).sum()))


To handle the 0 values, I'll use scikit-learn's `SimpleImputer`: every 0 is treated as a missing value and replaced by the column mean. Note that this also replaces legitimate zeros, such as a 0 in `Pregnancies`.

In [None]:
# treat 0 as a missing value and replace it with the column mean
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=0, strategy='mean')

# learn the column means from the training set only, then reuse them on the test set
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
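
Zeros are plausible values for `Pregnancies`, so treating every 0 as missing changes some real data. A more selective variant, sketched below on made-up rows, marks zeros as missing only in columns where a zero cannot be a real measurement and fits the imputer on the training rows alone. The column choice and the `train_demo`, `test_demo` and `zero_is_missing` names are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test rows; 0.0 in Glucose or BMI means "not recorded"
train_demo = pd.DataFrame({'Pregnancies': [0.0, 2.0, 1.0],
                           'Glucose': [100.0, 0.0, 140.0],
                           'BMI': [30.0, 25.0, 0.0]})
test_demo = pd.DataFrame({'Pregnancies': [0.0], 'Glucose': [0.0], 'BMI': [0.0]})

# Columns where a zero cannot be a real measurement (an assumption for this sketch)
zero_is_missing = ['Glucose', 'BMI']

# Mark zeros as NaN only in those columns, leaving Pregnancies untouched
train_demo[zero_is_missing] = train_demo[zero_is_missing].replace(0, np.nan)
test_demo[zero_is_missing] = test_demo[zero_is_missing].replace(0, np.nan)

# Learn the column means on the training rows, then apply them to both sets
imputer = SimpleImputer(strategy='mean')
train_demo[zero_is_missing] = imputer.fit_transform(train_demo[zero_is_missing])
test_demo[zero_is_missing] = imputer.transform(test_demo[zero_is_missing])

print(test_demo)
```

The test row is filled with the training means, while its `Pregnancies` value of 0 is kept as a real observation.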


#### 4. Exploring the Models

### *k-Nearest Neighbor Classifier*
To make a prediction, the algorithm finds the k training points closest to a new data point (its nearest "neighbors") and assigns the class that is most common among them.
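
As a minimal from-scratch sketch of that idea (not scikit-learn's implementation), the hypothetical `knn_predict` function below measures the Euclidean distance from a query point to every training point and takes a majority vote among the k closest. The points are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_points, y_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = np.linalg.norm(X_points - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points forming two well-separated groups
X_pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_pts = np.array(['Not Diabetic'] * 3 + ['Diabetic'] * 3)

print(knn_predict(X_pts, y_pts, np.array([0.5, 0.5])))
print(knn_predict(X_pts, y_pts, np.array([5.5, 5.5])))
```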

In [None]:
from sklearn.neighbors import KNeighborsClassifier

training_accuracy = []
test_accuracy = []

# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)

for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    
    # record training set accuracy
    training_accuracy.append(knn.score(X_train, y_train))
    
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

In [None]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)

print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))

For kNN, training accuracy declines as the number of neighbors increases, because the model becomes less flexible. With `n_neighbors=9`, the accuracy is 0.77 on the training set and 0.74 on the test set.

### *Decision Tree Classifier*


In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

In [None]:
#predictions = tree.predict(X_test)

#### **Feature Importance In Decision Trees**

Feature importance rates how much each feature contributes to the decisions the tree makes. Each value lies between 0 (not used) and 1 (perfectly predicts the target), and the importances sum to 1.
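
A quick way to check how these values behave is to fit a tree on synthetic data. The sketch below uses `make_classification` with made-up settings (8 features, 3 of them informative) rather than the diabetes data, and confirms that scikit-learn's impurity-based importances are non-negative and sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 8 features, of which only 3 carry signal
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)

tree_demo = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
importances = tree_demo.feature_importances_

print(importances.round(3))
print("sum of importances:", importances.sum())
```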

In [None]:
print("Feature importances:\n{}".format(tree.feature_importances_))

Plotting ranked features

In [None]:
(pd.Series(tree.feature_importances_, index=X.columns)
   .plot(kind='barh'))

For decision trees, the top 3 most important features are Glucose, Age, & BMI.

### *Random Forest Classifier*

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(rf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(rf.score(X_test, y_test)))

In [None]:
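from sklearn.metrics import classification_report, confusion_matrix
predictions = rf.predict(X_test)  # assumed: the random forest's test-set predictions
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))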
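
To show how a confusion matrix is read, the sketch below builds one from eight made-up labels and predictions and derives accuracy and recall from its cells. The `y_true_demo` and `y_pred_demo` lists are invented for illustration and are not model output.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for eight patients
y_true_demo = ['Diabetic', 'Diabetic', 'Diabetic', 'Not Diabetic',
               'Not Diabetic', 'Not Diabetic', 'Not Diabetic', 'Not Diabetic']
y_pred_demo = ['Diabetic', 'Diabetic', 'Not Diabetic', 'Not Diabetic',
               'Not Diabetic', 'Not Diabetic', 'Not Diabetic', 'Diabetic']

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=['Diabetic', 'Not Diabetic'])
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = cm.trace() / cm.sum()
# Recall for the Diabetic class: correctly flagged diabetics / all true diabetics
recall_diabetic = cm[0, 0] / cm[0].sum()
print("accuracy: {:.2f}, recall (Diabetic): {:.2f}".format(accuracy, recall_diabetic))
```

On imbalanced data, recall for the minority class is often more informative than accuracy alone, since a missed diabetic case counts the same as any other error in the accuracy figure.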

**Random Forest Feature Importance**

In [None]:
(pd.Series(rf.feature_importances_, index=X.columns)
   .plot(kind='barh'))

For random forest classification, the top 3 most important features are Glucose, BMI, & Age.

### *Logistic Regression*

In [None]:
from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression().fit(X_train, y_train)

print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))

print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

### *Neural Network*

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=42)
mlp.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics on the test set
mlp = MLPClassifier(random_state=0)
mlp.fit(X_train_scaled, y_train)

print("Accuracy on training set: {:.3f}".format(
    mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))
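
One way to make sure the scaler only ever learns from the training data is to chain it with the classifier in a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, not the notebook's actual setup; `max_iter=500` is an arbitrary choice that gives the network more iterations to converge, and the `scaled_mlp`, `X_tr` and `X_te` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features and labels
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=101)

# fit() scales with statistics from X_tr only; score() reuses them on X_te
scaled_mlp = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
scaled_mlp.fit(X_tr, y_tr)

print("Accuracy on training set: {:.3f}".format(scaled_mlp.score(X_tr, y_tr)))
print("Accuracy on test set: {:.3f}".format(scaled_mlp.score(X_te, y_te)))
```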

In [None]:
data_features = [x for i,x in enumerate(data.columns) if i!=8]

plt.figure(figsize=(20, 5))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='magma')
plt.yticks(range(8), data_features)
plt.xlabel("Columns in weight matrix")
plt.ylabel("Input Feature")
plt.colorbar()

## **Conclusion**


In [None]:
# print each model's test-set accuracy without overwriting the fitted models
print('Accuracy of K-NN classifier test set: {:.2f}'.format(knn.score(X_test, y_test)))
print("Accuracy on Decision Tree test set: {:.3f}".format(tree.score(X_test, y_test)))
print("Accuracy on Random Forest test set: {:.3f}".format(rf.score(X_test, y_test)))
print("Accuracy on Logistic Regression test set: {:.3f}".format(logreg.score(X_test, y_test)))
print("Accuracy on Neural Network test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))

Based on my analysis, Glucose, Age and BMI are the most significant features according to the Decision Tree and Random Forest models. The model with the highest test-set accuracy is logistic regression, followed by the multi-layer perceptron (neural network), the random forest classifier, kNN and, lastly, the decision tree.

[Resources](https://towardsdatascience.com/machine-learning-for-diabetes-562dd7df4d42)

In [None]:
models = pd.DataFrame({
    'Model': ['KNN', 'Decision Tree Classifier', 'Random Forest Classifier',  'Logistic Regression','Neural Networks'],
    'Score': [ 0.740, 0.713, 0.744, 0.787, 0.780]
})

models.sort_values(by = 'Score', ascending = False)
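
The scores in the table above are typed in by hand, so they can drift out of date when a model is retrained. A sketch of collecting them programmatically is shown below. It uses synthetic data and only three of the models to stay short; `candidates` and `score_table` are hypothetical names, and in the notebook the existing train/test split would take the place of the synthetic one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's own split would be used instead
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=101)

candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=9),
    'Decision Tree Classifier': DecisionTreeClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

# Fit each model and record its test-set accuracy
rows = [{'Model': name, 'Score': model.fit(X_tr, y_tr).score(X_te, y_te)}
        for name, model in candidates.items()]
score_table = pd.DataFrame(rows).sort_values(by='Score', ascending=False)
print(score_table)
```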

In [None]:
plt.figure(figsize=(16,8))
plt.ylabel("Accuracy")
plt.xlabel("Algorithms")
sns.barplot(x=models['Model'],y=models['Score'], palette='magma' )
plt.show()