![](https://i.pinimg.com/originals/09/53/81/0953813004d675ca814403fbb649f8b7.png)

## Goals
Create a machine learning model to predict whether a patient has diabetes.

## Data
The data sample covers female patients of Pima Native American heritage who are at least 21 years old.

## Conclusions
- Major Takeaways: 
    - The patient's Glucose level is the strongest predictor of a diabetes diagnosis
    - Other features, such as a high BMI, are associated with a higher risk.  
    
- Final model to predict a patient's diagnosis of diabetes: Random Forest

| Model    | RandomForest | (max_depth=5, random_state=123)             | ['Glucose','Age','BMI','insulin_glucose_cluster','DiabetesPedigreeFunction'] |
|----------|--------------|---------------------------------------------|------------------------------------------------------------------------------|
| DF       | Accuracy     | Recall on Positive (predicting diabetic) | Precision on Positive (predicting diabetic)                               |
| Train    | 86%          | 75%                                         | 84%                                                                          |
| Validate | 78%          | 64%                                         | 73%                                                                          |
| Test     | 75%          | 63%                                         | 70%                                                                          |

- Next Steps:
    - Create more features with clustering/binning
    - Statistically test more features

## How to Reproduce:
1. Read the Readme.md file in this project's repository [here](https://github.com/ThompsonBethany01/Predicting-Diabetes-Onset).
2. Download Data_Analysis.ipynb, Prepare.py, and the dataset into your working directory.
3. Run this notebook.

## Thought Process
The target variable is whether the patient is diabetic (1) or not (0), which makes this a classification problem. With a classification problem:
- We create algorithms based on the labeled outcome variable.
- This produces a decision rule to classify future data with.
- We generalize the trends/patterns in the data to predict the future/unseen data.

# Table of Contents <a class="anchor" id="top"></a>
1. [Acquire](#acquire)
2. [Prepare](#prepare)
3. [Explore](#explore)
4. [Modeling](#model)
5. [Final Conclusions](#fin)

In [None]:
# initial imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

#import modules
import pima_prepare as Prepare

# Acquire <a class="anchor" id="acquire"></a>
Dataset from UCI Machine Learning via Kaggle [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database), saved as a .csv file  
#### Steps:
1. Read csv file into df
2. Summarize data
3. Create data dictionary

In [None]:
# needs saved csv file to continue
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
print('The dataframe has', df.shape[0], 'rows and', df.shape[1], 'columns.')

In [None]:
print('The columns are named: ', df.columns.to_list())

In [None]:
# What are the data types and null counts for each column?
df.info()

#### What do we learn from df.info()?
- there are no null values
- most columns are integers
- BMI and DPF are decimals (floats)

In [None]:
# what is the distribution of the numeric columns? (All columns)
df.describe()

#### What do we learn from df.describe()?
- Greatest variation is in Insulin
- Many features have a minimum of 0. Is this feasible?
    - A patient cannot have 0 for BMI, Glucose, or BloodPressure
- Insulin maximum is 846. Is this possible, or an anomaly caused by a typo? Needs research.
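
One quick way to size up the suspicious zeros is to count them per column. The sketch below uses a few made-up rows rather than the real dataset, and the list `cannot_be_zero` is an assumption based on the notes above.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI":           [33.6, 26.6, 0.0, 28.1],
    "Pregnancies":   [6, 1, 0, 0],   # 0 is a legitimate value here
})

# Columns where a value of 0 is not physically plausible (assumed list)
cannot_be_zero = ["Glucose", "BloodPressure", "BMI"]

# Count the zeros in each of those columns
zero_counts = (toy[cannot_be_zero] == 0).sum()
print(zero_counts)
```

Running the same two lines on `df` would show how many observations are affected in each feature.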

In [None]:
# looking at the distribution of features all at once simply with df.hist()
df.hist()
plt.tight_layout()

### Takeaways
768 observations  
- 8 feature columns and 1 target column (diabetic or not)  

All numeric values, integers or floats  
- The most common diagnosis is 0, non-diabetic  
- All features are continuous except Pregnancies and Outcome, which are discrete  

No null values  
- Observations with 0 where 0 is not possible, such as BMI and BloodPressure, have 0 in multiple features  
    - These could be null values that were replaced with 0

[Table of Contents](#top)

# Prepare <a class="anchor" id="prepare"></a>
For Exploration:
- Create new features by binning demographics or clustering
    - age into 20s, 30s, etc
    - bmi into low, middle, high
    - blood pressure into low, good, high
    - create features based on clustering

For Modeling:
- Split into train, validate, test
- Scale the data - fitting on train df only
- Determine if outliers/anomalies to remove (after MVP complete)

## Prepare.py Module contains functions used below
### Prepare.prep_df
- Replaces values of 0 in...
    - BMI
    - Glucose
    - BloodPressure
    - SkinThickness
    - Insulin  
   with the mean of the feature
   
- Bins features with pd.qcut (cuts a feature into bins of roughly equal size, based on the number of bins specified)
    - age
    - bmi
    - bloodpressure

- Creates feature for patient having both high bloodpressure and bmi
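
The Prepare module itself is not shown in this notebook, so the following is only a sketch of what the imputing and binning steps could look like. The function name `prep_sketch`, its parameters, and the choice to average over the non-zero values are all assumptions, not the module's actual code.

```python
import numpy as np
import pandas as pd

def prep_sketch(df, impute_cols, bin_specs):
    """Hypothetical stand-in for Prepare.prep_df; not the module's actual code."""
    df = df.copy()
    for col in impute_cols:
        # Treat 0 as missing, then fill with the mean of the remaining values
        # (assumption: the mean is taken over the non-zero values)
        df[col] = df[col].replace(0, np.nan)
        df[col] = df[col].fillna(df[col].mean())
    for col, (new_col, q) in bin_specs.items():
        # pd.qcut splits on quantiles, so each bin holds roughly the same number of rows
        df[new_col] = pd.qcut(df[col], q=q, labels=False) + 1
    return df

# Made-up rows for illustration
toy = pd.DataFrame({
    "BMI": [22.0, 0.0, 30.0, 38.0, 27.0, 41.0],
    "Age": [21, 25, 33, 47, 52, 60],
})
prepped = prep_sketch(
    toy,
    impute_cols=["BMI"],
    bin_specs={"Age": ("age_bins", 3), "BMI": ("bmi_bins", 3)},
)
print(prepped)
```

Numbering the bins from 1 mirrors how the notebook later refers to `bmi_bins == 3` as the highest bin.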

In [None]:
# prepping df before split with function
df = Prepare.prep_df(df)

In [None]:
# quality control, checking the df looks accurate
df

### Prepare.split_df
Splits Data into
- 70% train
- 20% validate
- 10% test  

printing the returned shape of the split df
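
A 70/20/10 split like this is often built from two calls to scikit-learn's `train_test_split`. The sketch below is one possible way to do it; the function name `split_sketch`, the seed, and the use of stratification are assumptions, since the real `Prepare.split_df` is not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(df, target, seed=123):
    """Hypothetical 70/20/10 split; Prepare.split_df may differ in detail."""
    # First carve off the 10% test set
    train_validate, test = train_test_split(
        df, test_size=0.10, random_state=seed, stratify=df[target])
    # 20% of the full data is 2/9 of the remaining 90%
    train, validate = train_test_split(
        train_validate, test_size=2 / 9, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test

# Made-up data: 100 rows with a 60/40 class balance
toy = pd.DataFrame({"x": range(100), "Outcome": [0, 0, 1, 0, 1] * 20})
train, validate, test = split_sketch(toy, "Outcome")
print(train.shape, validate.shape, test.shape)
```

Stratifying on the target keeps the diabetic rate similar across the three sets, which matters when one class is less common.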

In [None]:
# splitting df with function
train, validate, test = Prepare.split_df(df)

In [None]:
# quality control, checking train df looks like the df above but with smaller mixed index
train

### Prepare.scale_dfs
- Scaling the Data Using Min-Max Scaler
- transforms the range of data points to 0 - 1
- fits scaler to train only, then transforms on all 3 dfs
- returns the split dfs scaled
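
As a rough illustration of fitting on train only, here is a small sketch with scikit-learn's `MinMaxScaler`. The helper name `scale_sketch` and the toy values are assumptions; the actual `Prepare.scale_dfs` may be written differently.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_sketch(train, validate, test, target):
    """Hypothetical version of Prepare.scale_dfs: fit on train, transform all three."""
    features = [c for c in train.columns if c != target]
    scaler = MinMaxScaler().fit(train[features])  # min and max learned from train only
    return tuple(
        pd.DataFrame(scaler.transform(d[features]), columns=features, index=d.index)
        for d in (train, validate, test)
    )

# Made-up values for illustration
train = pd.DataFrame({"Glucose": [80, 120, 200], "Outcome": [0, 0, 1]})
validate = pd.DataFrame({"Glucose": [100, 220], "Outcome": [0, 1]})
test = pd.DataFrame({"Glucose": [140], "Outcome": [1]})

X_train_s, X_validate_s, X_test_s = scale_sketch(train, validate, test, "Outcome")
print(X_validate_s)
```

Because the scaler only knows the train range, a validate or test value above the train maximum (220 here) scales to slightly more than 1. That is expected and avoids leaking information from the held-out sets.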

In [None]:
# calling scale dfs function
X_train_scaled, X_validate_scaled, X_test_scaled = Prepare.scale_dfs(train, validate, test, 'Outcome')

In [None]:
# quality control, does df look the same as train but with scaled values?
X_train_scaled

### Prepare.create_clusters
- Creating Clusters on Scaled Data
- multitude of parameters allow one function to create any cluster
    - train, validate, and test scaled dfs to fit the cluster model to train only, then transform on all dfs
    - train, validate, test to add the clusters to the unscaled dfs as well for exploration
    - features = what to create the clusters on
    - columns = name of the columns when adding the clusters to the dfs
    - n = number of groups within the cluster to make
    - cluster = name of the original cluster before splitting into dummies
    
#### For each cluster:
1. visualize the number to set for n with elbow test
2. call the function with n set from elbow test
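
To make the parameter list above more concrete, here is a trimmed-down sketch of the same idea with two dataframes instead of six. The name `cluster_sketch`, the toy values, and the `ig_cluster` column prefix are all hypothetical; the real `Prepare.create_clusters` is not reproduced here.

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_sketch(X_train, X_other, features, n, cluster):
    """Hypothetical, trimmed-down stand-in for Prepare.create_clusters."""
    # Fit on train only, then assign clusters to every df with the same model
    kmeans = KMeans(n_clusters=n, n_init=10, random_state=123).fit(X_train[features])
    categories = pd.CategoricalDtype(list(range(1, n + 1)))
    out = []
    for X in (X_train, X_other):
        labels = pd.Series(kmeans.predict(X[features]) + 1, index=X.index, name=cluster)
        # Fixed categories give every df the same dummy columns, even if a cluster is empty
        dummies = pd.get_dummies(labels.astype(categories), prefix=cluster, prefix_sep="")
        out.append(pd.concat([X, labels, dummies], axis=1))
    return out

# Made-up scaled values forming three obvious groups
X_train = pd.DataFrame({"Insulin": [0.0, 0.05, 0.5, 0.55, 0.95, 1.0],
                        "Glucose": [0.0, 0.1, 0.5, 0.45, 1.0, 0.9]})
X_validate = pd.DataFrame({"Insulin": [0.02, 0.97], "Glucose": [0.05, 0.95]})

train_c, validate_c = cluster_sketch(
    X_train, X_validate, ["Insulin", "Glucose"], 3, "ig_cluster")
print(validate_c)
```

Keeping both the original cluster label and its dummy columns matches the description above: the label is handy for exploration, while the dummies can go straight into a model.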

### Age and BMI Cluster

In [None]:
# elbow test to determine n
from sklearn.cluster import KMeans

# features to predict cluster on, only fitting model on X(train)
X = X_train_scaled[['age_bins','bmi_bins']]

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(9, 6))
    pd.Series({k: KMeans(k).fit(X).inertia_ for k in range(2, 18)}).plot(marker='x')
    plt.xticks(range(2, 18))
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.title('Change in inertia as k increases')
    
# will start with 4 clusters

In [None]:
# creating cluster with function from Prepare.py

features = ['age_bins','bmi_bins']
columns = ['age_bmi_cluster1','age_bmi_cluster2','age_bmi_cluster3','age_bmi_cluster4']
n = 4
cluster = 'age_bmi_cluster'

X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test = Prepare.create_clusters(X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test, features, n, columns, cluster)

### Pregnancy Cluster

In [None]:
# feature to create cluster on, only fitting model on X(train)
X = X_train_scaled[['Pregnancies']]

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(9, 6))
    pd.Series({k: KMeans(k).fit(X).inertia_ for k in range(2, 18)}).plot(marker='x')
    plt.xticks(range(2, 18))
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.title('Change in inertia as k increases')
    
# will start with 4 clusters

In [None]:
# creating cluster with function from Prepare.py

features = ['Pregnancies']
columns = ['pregnancy_cluster1','pregnancy_cluster2','pregnancy_cluster3','pregnancy_cluster4']
n = 4
cluster = 'pregnancy_cluster'

X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test = Prepare.create_clusters(X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test, features, n, columns, cluster)

### Insulin and Glucose Cluster

In [None]:
# features to predict cluster on, only fitting model on X(train)
X = X_train_scaled[['Insulin','Glucose']]

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(9, 6))
    pd.Series({k: KMeans(k).fit(X).inertia_ for k in range(2, 18)}).plot(marker='x')
    plt.xticks(range(2, 18))
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.title('Change in inertia as k increases')
    
# will start with 5 clusters

In [None]:
# creating cluster with function from Prepare.py

features = ['Insulin','Glucose']
columns = ['insulin_glucose_cluster1','insulin_glucose_cluster2','insulin_glucose_cluster3','insulin_glucose_cluster4','insulin_glucose_cluster5']
n = 5
cluster = 'insulin_glucose_cluster'

X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test = Prepare.create_clusters(X_train_scaled, X_validate_scaled, X_test_scaled, train, validate, test, features, n, columns, cluster)

In [None]:
# quality control, do we see the clusters added to the end of train scaled df?
X_train_scaled.head(3).T

In [None]:
# train.to_csv('train.csv')

### Takeaways
- imputed 0 values that could not be 0 with the mean
- created features based on binning
- split the data for exploration and modeling
- scaled the data based on split train df
- created clusters based on split scaled train df

### Next Steps
- are there other clusters that could be more significant in modeling?
- are there outliers/anomalies to deal with?

[Table of Contents](#top)

# Explore <a class="anchor" id="explore"></a>
- determine trends in patient being diabetic or not
    - X feature(s) vs. Outcome
- test the significance with hypothesis testing, such as with:
    - t-test
    - chi-squared contingency table
    - Pearson correlation test
- explore interaction of independent features to determine what clusters to create
- visualize clusters created

In [None]:
# visualizing distribution of Y feature (predictive variable)
plt.figure(figsize=(10,7))
train.Outcome.value_counts().sort_index().plot.bar()
diabetic_rate = train.Outcome.mean()
plt.title(f"Overall diabetes diagnosis rate: {diabetic_rate:.2%}", size=17)
plt.xlabel('Is diabetic?', size=17)
plt.ylabel('Count of Patients', size=17)

### Overall, most patients are not diagnosed with diabetes.

In [None]:
# visualizing overall linear correlation of all features
corr = train.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, vmin=-1, vmax=1)

### Outcome has the strongest correlation with Glucose, which is also reflected later in modeling. This feature had the highest importance for all modeling algorithms.
#### Note: Features can have non-linear correlation, which would not be captured in this heatmap.

## Looking at Independent Features vs. Diabetic Outcome
### Diagnosis Rate within Subgroups of Age, BMI, and BP Bins

In [None]:
# visualizing subgroups within feature bins, is there a significant difference of diabetic diagnosis?
# categorical features we can compare
features = ['age_bins', 'bmi_bins', 'bp_bins']

# overall diagnosis of diabetes on whole train df
diabetic_rate = train.Outcome.mean()

# plotting subgroups diagnosis rate and comparing to overall with dashed line
_, ax = plt.subplots(nrows=1, ncols=3, figsize=(16, 6), sharey=True)
for i, feature in enumerate(features):
    # keyword x/y arguments work across seaborn versions
    sns.barplot(x=feature, y='Outcome', data=train, ax=ax[i], alpha=.8)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('Diabetes Diagnosis Rate', size=13)
    ax[i].set_title(feature, size=15)
    ax[i].axhline(diabetic_rate, ls='--', color='grey')

Notes:
- Age bins have higher diagnosis rates in the older bins.
- BMI bins have higher diagnosis rates at higher BMI
- BloodPressure bins have higher diagnosis rates at higher bloodpressure

## Statistically Testing the Hypotheses Seen in Visuals

### Correlation Test
- do two samples have a linear relationship?
- null hypothesis is that there is no linear correlation between the two variables
- the correlation coefficient is a unitless continuous numerical measure between -1 and 1, where 1 = perfect correlation and -1 = perfect negative correlation
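
When one of the two variables is a 0/1 outcome, the Pearson coefficient is the same number as the point-biserial correlation. The sketch below checks this on synthetic data (not the Pima data); the variable names and the way `y` is generated are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: the chance of outcome 1 rises with x (illustration only)
x = rng.uniform(21, 80, size=300)
y = (rng.uniform(size=300) < (x - 21) / 80).astype(int)

r, p = stats.pearsonr(x, y)
r_pb, p_pb = stats.pointbiserialr(y, x)  # same statistic when one variable is binary

print(f"r = {r:.3f}, p = {p:.2g}")
```

A small p-value says the linear relationship is unlikely to be zero. The size of `r` says how strong it is, and the two should be read together.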

### Age vs. Diabetes Diagnosis
$H_0$: There is no significant correlation between age and diabetes diagnosis.   
$H_a$: Older populations correlate with a higher rate of diabetes (in female patients +21 with Pima Indian heritage).

In [None]:
# not normally distributed, so we do not do a t-test
plt.hist(train.Age)
print('Age average:', round(train.Age.mean(),2), '\nAge median:', round(train.Age.median(),2))

In [None]:
from scipy import stats

x = train.Age
y = train.Outcome

corr, p = stats.pearsonr(x, y)
print('correlation coefficient:', corr, '\n\np-value:', p)

#### P is less than alpha (.01), we reject the null hypothesis.
There is a significant correlation between age and diagnosis rates of diabetes. The positive correlation coefficient tells us the rate increases as age increases.

### T-test
- Compare the mean for a specific subgroup against the population mean.
- One of the assumptions of the t-test is that the continuous variable is normally distributed. To check this, we can make a quick visualization.
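
The subgroup-versus-population comparison described above is a one-sample t-test. As a sanity check, the sketch below computes it by hand and with `scipy.stats.ttest_1samp` on made-up numbers; the subgroup values and the rate `mu` are hypothetical.

```python
from math import sqrt

import numpy as np
from scipy import stats

# Made-up 0/1 outcomes for one subgroup and an assumed overall rate
subgroup = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1])
mu = 0.35  # hypothetical population diagnosis rate

# Manual calculation
n = len(subgroup)
standard_error = subgroup.std(ddof=1) / sqrt(n)  # ddof=1 gives the sample std, as pandas does
t_manual = (subgroup.mean() - mu) / standard_error
p_manual = stats.t(n - 1).sf(abs(t_manual)) * 2  # abs() keeps the two-tailed p-value at or below 1

# The same test with scipy
t_scipy, p_scipy = stats.ttest_1samp(subgroup, mu)

print(round(t_manual, 4), round(p_manual, 4))
```

Both routes should give the same t and p, so the scipy call can replace the manual arithmetic when convenient.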

### BMI vs. Diabetes Diagnosis
$H_0$: Patients in the highest BMI bin have the same rate of diabetes diagnosis as the overall population.  
$H_a$: Populations with higher BMI have a significantly higher rate of diabetes (in female patients +21 with Pima Indian heritage).

## alpha = .01

In [None]:
# is the continuous variable normally distributed?
plt.hist(train.BMI)
plt.title('Distribution of BMI', size=15)
print('BMI average:', round(train.BMI.mean(),2), '\nBMI median:', round(train.BMI.median(),2))

In [None]:
from math import sqrt

# testing subgroup of bin 3 of BMI
top_BMI = train[train.bmi_bins == 3]

μ = train.Outcome.mean()
xbar = top_BMI.Outcome.mean()
s = top_BMI.Outcome.std()
n = top_BMI.shape[0]
degf = n - 1
standard_error = s / sqrt(n)

t = (xbar - μ) / standard_error
print('t-value:', round(t,4))

# abs() keeps the two-tailed p-value valid when t is negative
p = stats.t(degf).sf(abs(t)) * 2 # *2 for two-tailed test
print('p-value:', round(p,4))

#### P is less than alpha (.01), we reject the null hypothesis.
Populations with higher BMI have a significantly higher rate of diabetes (in female patients +21 with Pima Indian heritage). This is specifically patients in the bmi_bins of 3, those with a BMI higher than 34.867.

### BloodPressure vs. Diabetes Diagnosis
$H_0$: Patients in the highest blood pressure bin have the same rate of diabetes diagnosis as the overall population.  
$H_a$: Populations with higher blood pressure have a significantly higher rate of diabetes (in female patients +21 with Pima Indian heritage).

In [None]:
# is the continuous variable normally distributed?
plt.hist(train.BloodPressure)
plt.title('Distribution of BP', size=15)
print('Blood Pressure average:', round(train.BloodPressure.mean(),2), '\nBlood Pressure median:', round(train.BloodPressure.median(),2))

In [None]:
# testing subgroup of bin 3 of BloodPressure
top_BP = train[train.bp_bins == 3]

μ = train.Outcome.mean()
xbar = top_BP.Outcome.mean()
s = top_BP.Outcome.std()
n = top_BP.shape[0]
degf = n - 1
standard_error = s / sqrt(n)

t = (xbar - μ) / standard_error
print('t-value:', round(t,4))

# abs() keeps the two-tailed p-value valid when t is negative
p = stats.t(degf).sf(abs(t)) * 2 # *2 for two-tailed test
print('p-value:', round(p,4))

#### P is not lower than alpha (.01), we fail to reject the null hypothesis.
There is no significant difference in diabetes diagnosis rate for patients with higher blood pressure. This is specific to patients in the bp_bins of 3. Could splitting the bins into smaller categories reveal a subgroup with a significantly higher rate of diabetes?

### Diagnosis Rates Within Subgroups of Pregnancy Count

In [None]:
# overall diagnosis of diabetes on whole train df
diabetic_rate = train.Outcome.mean()

plt.figure(figsize=(13,8))

# plotting pregnancy count diagnosis rate and comparing to overall with dashed line
sns.barplot(x='Pregnancies', y='Outcome', data=train, alpha=.8)
plt.xlabel('Count of Pregnancies')
plt.ylabel('Diabetes Diagnosis Rate', size=13)
plt.title('Diagnosis Rate by Pregnancy Count', size=15)
plt.axhline(diabetic_rate, ls='--', color='grey')

In [None]:
# note how few diabetic patients have 10 or more pregnancies
train[train.Outcome == 1].Pregnancies.value_counts()

### Count of Pregnancies vs. Diabetes Diagnosis - Correlation Test
$H_0$: There is no significant correlation between count of pregnancies and diabetes diagnosis.   
$H_a$: A higher count of pregnancies correlates with a higher rate of diabetes (in female patients +21 with Pima Indian heritage).

In [None]:
x = train.Pregnancies
y = train.Outcome

corr, p = stats.pearsonr(x, y)
print('correlation coefficient:', corr, '\n\np-value:', p)

#### P is less than alpha (.01), we reject the null hypothesis.
There is a significant correlation between count of pregnancies and diagnosis rates of diabetes. The positive correlation coefficient tells us the rate increases as the count increases. However, the correlation is not as strong as with age: the coefficient is .19, as opposed to .24 for age.

### Visualizing Interaction of Age and BMI Bins

In [None]:
plt.figure(figsize=(13,9))
sns.swarmplot(x="bmi_bins", y="Age", data=train, hue="Outcome", palette="Set2")
plt.legend()
plt.title('Diabetes Diagnosis by Age and BMI Bins', size=15)

### Visualizing Interaction of Glucose and Insulin

In [None]:
sns.relplot(x="Glucose", y="Insulin", hue="Outcome", data=train, height=6, aspect=1.6)
plt.xlim(-5, 250)
plt.title('Diabetes Diagnosis with Insulin vs. Glucose', size=15)

### Exploring Interaction of X Variables
- are there any clear groupings within the independent features?
- are groupings clearer when adding hue for diagnosis?
- what clusters can be created?

### Interaction of Features with Glucose and the Outcome

In [None]:
plt.figure(figsize=(15,4))

plt.subplot(131)
sns.scatterplot(x=train.Glucose, y=train.BloodPressure, hue=train.Outcome)

plt.subplot(132)
sns.scatterplot(x=train.Glucose, y=train.Insulin, hue=train.Outcome)

plt.subplot(133)
sns.scatterplot(x=train.Glucose, y=train.BMI, hue=train.Outcome)

### Scatterplots can show if there is a distinction between the variables
- In all three plots, Glucose appears to have the greater effect on how the diabetic outcomes group together
- i.e., farther right on the Glucose axis (higher glucose), more patients are diabetic
- Moving higher up the y axis (the other feature) does not appear to affect the grouping of diabetic patients

In [None]:
sns.scatterplot(x=train.age_bins, y=train.bmi_bins, hue=train.Outcome)
plt.title('Average Outcome by Age and BMI bins', size=15)

### How to interpret:
Examples
- when age_bin == 1 and bmi_bin == 3, average Outcome == 1 "diabetic"
    - younger patients with higher BMI are, on average, diabetic
- when age_bin == 4 and bmi_bin == 1, average Outcome == 0 "non-diabetic"
    - older patients with lower BMI are, on average, non-diabetic
- when age_bin == 2 and bmi_bin == 2, average Outcome == 1 "diabetic"
## Visualizing Clusters
- Are there subgroups in the clusters that have a higher rate of diabetes diagnosis, i.e. above the dashed line?

In [None]:
# comparing the dummy variables created from the age_bmi_cluster
features = ['age_bmi_cluster1', 'age_bmi_cluster2', 'age_bmi_cluster3','age_bmi_cluster4']

# overall diagnosis of diabetes on whole train df
diabetic_rate = train.Outcome.mean()

# plotting subgroups diagnosis rate and comparing to overall with dashed line
_, ax = plt.subplots(nrows=1, ncols=4, figsize=(16, 6), sharey=True)
for i, feature in enumerate(features):
    # keyword x/y arguments work across seaborn versions
    sns.barplot(x=feature, y='Outcome', data=train, ax=ax[i], alpha=.8)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('Diabetes Diagnosis Rate', size=13)
    ax[i].set_title(feature, size=15)
    ax[i].axhline(diabetic_rate, ls='--', color='grey')

In [None]:
# comparing the dummy variables created from the pregnancy_cluster
features = ['pregnancy_cluster1', 'pregnancy_cluster2', 'pregnancy_cluster3','pregnancy_cluster4']

# overall diagnosis of diabetes on whole train df
diabetic_rate = train.Outcome.mean()

# plotting subgroups diagnosis rate and comparing to overall with dashed line
_, ax = plt.subplots(nrows=1, ncols=4, figsize=(16, 6), sharey=True)
for i, feature in enumerate(features):
    # keyword x/y arguments work across seaborn versions
    sns.barplot(x=feature, y='Outcome', data=train, ax=ax[i], alpha=.8)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('Diabetes Diagnosis Rate', size=13)
    ax[i].set_title(feature, size=15)
    ax[i].axhline(diabetic_rate, ls='--', color='grey')

In [None]:
# comparing the dummy variables created from the insulin_glucose_cluster
features = ['insulin_glucose_cluster1', 'insulin_glucose_cluster2', 'insulin_glucose_cluster3','insulin_glucose_cluster4','insulin_glucose_cluster5']

# overall diagnosis of diabetes on whole train df
diabetic_rate = train.Outcome.mean()

# plotting subgroups diagnosis rate and comparing to overall with dashed line
_, ax = plt.subplots(nrows=1, ncols=5, figsize=(16, 6), sharey=True)
for i, feature in enumerate(features):
    # keyword x/y arguments work across seaborn versions
    sns.barplot(x=feature, y='Outcome', data=train, ax=ax[i], alpha=.8)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('Diabetes Diagnosis Rate', size=13)
    ax[i].set_title(feature, size=15)
    ax[i].axhline(diabetic_rate, ls='--', color='grey')

### Takeaways
- Glucose has the greatest correlation with diabetes diagnosis
- Statistical testing rejected the null hypothesis for Age, BMI, and Pregnancies: each is significantly related to diabetes diagnosis. The test on the highest BloodPressure bin failed to reject the null hypothesis.
- Certain subgroups within clusters have a higher rate of diagnosis than the average

### Next Steps
- can create different clusters
- bin more continuous variables

[Table of Contents](#top)

# Modeling <a class="anchor" id="model"></a>
##### The Outcome of the patient being diabetic or not is the target variable, Y
#### Steps
1. Create the Baseline model for comparison based on most common diagnosis
2. Create models fit to the train df only
3. Validate on top 3 models, tuning hyperparameters
4. Use final top model evaluated on test
5. Determine next steps/conclusions

[Skip to Modeling Summary](#model-summary)

### Baseline
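
The baseline idea, always predicting the most common class, can also be expressed with scikit-learn's `DummyClassifier`. This is a minimal sketch on made-up labels rather than the notebook's data, shown only to make the accuracy and recall of such a baseline concrete.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 6 non-diabetic (0) and 3 diabetic (1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_acc = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print("accuracy:", round(baseline_acc, 2))
print("recall on class 1:", baseline_recall)
```

A baseline that never predicts class 1 has a recall of 0 on diabetic patients, which is why accuracy alone is not enough to judge the models that follow.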

In [None]:
train.Outcome.value_counts()

In [None]:
# taking a look at the same barplot again, overall rate of diabetes diagnosis
train.Outcome.value_counts().sort_index().plot.bar()
diabetic_rate = train.Outcome.mean()
plt.title(f"Overall diabetes diagnosis rate: {diabetic_rate:.2%}", size=15)
plt.xlabel('Is diabetic?', size=13)
plt.ylabel('Count of Patients', size=13)

In [None]:
y_train = train[['Outcome']]

In [None]:
# most common diagnosis is non-diabetic, this will be our baseline
y_train['baseline_prediction'] = 0

baseline_accuracy = (y_train.baseline_prediction == train.Outcome).mean()

print(f'baseline accuracy: {baseline_accuracy:.2%}')

## Creating Classification Models
#### Models Created
- LogisticRegression
- DecisionTree
- RandomForest
- KNN
- RidgeClassifier Model
- SGDClassifier

#### Primary Evaluation Metric
Is it more dangerous to predict diabetic when the patient is not, or to predict not diabetic when the patient is? 
   - It is safer to err toward predicting diabetic, because a missed diagnosis could lead to harm to the patient
   - We want the model to predict class 1 well, i.e. to have a higher recall score and precision
       - recall: 
       - TP / (TP + FN)
       - % of actually positive cases that were predicted as positive
       - Optimize for recall when missing actual positive cases is expensive or deadly
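
To make the recall formula concrete, the sketch below computes recall and precision from the four confusion-matrix counts and compares them with scikit-learn's own functions. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of actual positives that were caught
precision = tp / (tp + fp)   # share of positive predictions that were right

print("recall:", recall, "precision:", precision)
```

These are the same numbers that `classification_report` prints in the row for class 1.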
       
### Determine What Features to Model on Using:
- Recursive feature elimination (RFE)
- model.feature_importances_
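
scikit-learn offers more than one feature selector, and `SelectKBest` and `RFE` work differently. `SelectKBest` scores each feature on its own. `RFE` repeatedly fits an estimator and drops the weakest feature. The sketch below calls both on synthetic data. The column names, the `f_classif` score function, and the `LogisticRegression` estimator are choices made for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(123)

# Synthetic data: only "signal" is related to the label
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.5, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# Univariate selection: keep the k features with the best individual scores
kbest = SelectKBest(score_func=f_classif, k=1).fit(X, y)
kbest_pick = X.columns[kbest.get_support()].tolist()

# Recursive elimination: refit the estimator and drop the weakest feature each round
rfe = RFE(LogisticRegression(), n_features_to_select=1).fit(X, y)
rfe_pick = X.columns[rfe.support_].tolist()

print(kbest_pick, rfe_pick, rfe.ranking_)
```

In `rfe.ranking_`, every selected feature has rank 1 and the eliminated ones are numbered in reverse order of removal.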

In [None]:
# modeling imports
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier

# feature selection with RFE (recursive feature elimination)
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# evaluation metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

### Logistic Regression Model

In [None]:
# does better on all features than when using Top 10 SelectKBest features
X_train = X_train_scaled
y_train = train[['Outcome']]

# create model object
logit = LogisticRegression(C=10, random_state=123)

# fit to train
logit.fit(X_train, y_train)

# predict on train
y_pred = logit.predict(X_train)

#evaluate
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

In [None]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

In [None]:
print(confusion_matrix(y_train, y_pred))

In [None]:
print(classification_report(y_train, y_pred))

## Decision Tree Model

In [None]:
X_train = X_train_scaled

# create the model
clf = DecisionTreeClassifier(max_depth=5, random_state=123)

# fit to train
clf.fit(X_train, y_train)

col = X_train_scaled.columns

#modelname.feature_importance_
y = clf.feature_importances_

#plot
fig, ax = plt.subplots(figsize=(13,9)) 
width = .75 # the width of the bars 
ind = np.arange(len(y)) # the x locations for the groups
plt.barh(ind, y, width, color="green")
ax.set_yticks(ind+width/10)
ax.set_yticklabels(col, minor=False)
plt.title('Feature importance in Decision Tree Classifier')
plt.xlabel('Relative importance')
plt.ylabel('feature')

In [None]:
# features to model on
X_train = X_train_scaled[['Glucose','BMI','DiabetesPedigreeFunction','Age','SkinThickness']]

# create the model
clf = DecisionTreeClassifier(max_depth=5, random_state=123)

# fit to train
clf.fit(X_train, y_train)

# predict on train
y_pred = clf.predict(X_train)

# evaluate
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))

In [None]:
confusion_matrix(y_train, y_pred)

In [None]:
print(classification_report(y_train, y_pred))

## Random Forest Model
- visualize feature importance for model
- model on specific features

In [None]:
X_train = X_train_scaled

In [None]:
# create the model
rf = RandomForestClassifier(max_depth=5, random_state=123)

# fit to train
rf.fit(X_train, y_train)

In [None]:
col = X_train_scaled.columns

#modelname.feature_importance_
y = rf.feature_importances_

In [None]:
#plot
fig, ax = plt.subplots(figsize=(13,9)) 
width = .75 # the width of the bars 
ind = np.arange(len(y)) # the x locations for the groups
plt.barh(ind, y, width, color="green")
ax.set_yticks(ind+width/10)
ax.set_yticklabels(col, minor=False)
plt.title('Feature importance in RandomForest Classifier')
plt.xlabel('Relative importance')
plt.ylabel('feature')

In [None]:
# features to model on
X_train = X_train_scaled[['Glucose','Age','BMI','insulin_glucose_cluster','DiabetesPedigreeFunction']]

# create the model
rf = RandomForestClassifier(max_depth=5, random_state=123)

# fit to train
rf.fit(X_train, y_train)

# predict on train
y_pred = rf.predict(X_train)

# evaluate
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf.score(X_train, y_train)))

In [None]:
print(classification_report(y_train, y_pred))

In [None]:
print(rf.feature_importances_)

## KNN Model

In [None]:
# initialize the ML algorithm
lm = LinearRegression()

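# create the rfe object, indicating the ML object (lm) and the number of features to select
# n_features_to_select is passed by keyword, as newer scikit-learn versions require
rfe = RFE(lm, n_features_to_select=12)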

# fit the data using RFE
rfe.fit(X_train_scaled,y_train.Outcome)  

# get the mask of the columns selected
feature_mask = rfe.support_

# get list of the column names. 
rfe_feature = X_train_scaled.iloc[:,feature_mask].columns.tolist()

In [None]:
# Features selected by RFE
print('RFE Top 12 Features:')
rfe_feature

In [None]:
# the 12 selected features all share rank 1
rfe.ranking_

In [None]:
# select features to model
X_train = X_train_scaled[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4']]

# create the model
knn = KNeighborsClassifier(n_neighbors=5)

# fit to train
knn.fit(X_train, y_train)

# predict on train
y_pred = knn.predict(X_train)

# evaluate
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train, y_train)))

In [None]:
print(classification_report(y_train, y_pred))

## RidgeClassifier Model

In [None]:
# select features to model
X_train = X_train_scaled[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4']]

# create the model object
clf2 = RidgeClassifier(random_state=123)

# fit to train only
clf2.fit(X_train, y_train)

y_pred = clf2.predict(X_train)

# evaluate with score, returns the mean accuracy on the given test data and labels
print('Accuracy of Ridge classifier on training set:', round(clf2.score(X_train, y_train),2))

In [None]:
print(classification_report(y_train, y_pred))

## SGDClassifier Model

In [None]:
# select features to model
X_train = X_train_scaled[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4']]

clf3 = SGDClassifier(max_iter=1000, tol=1e-3, random_state=123)

clf3.fit(X_train, y_train)

y_pred = clf3.predict(X_train)

print('Accuracy of SGD classifier on training set:', round(clf3.score(X_train, y_train),2))

In [None]:
print(classification_report(y_train, y_pred))

## Evaluating Top 3 on Validate - Tuning Hyperparameters
1. RandomForest Model at .86 accuracy, .77 recall
2. DecisionTree at .85 accuracy, .77 recall
3. KNN at .82 accuracy, .72 recall
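
One simple way to tune a hyperparameter against the validate set is to refit the model for a few candidate values and compare recall on train and validate. The sketch below does this for `max_depth` on synthetic data. The candidate depths and the data are assumptions, and the notebook's own splits would replace `X_tr`, `X_val`, `y_tr`, and `y_val`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use its own train and validate splits
X, y = make_classification(n_samples=400, n_features=5, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)

results = {}
for depth in (2, 3, 5, 8):
    rf_candidate = RandomForestClassifier(max_depth=depth, random_state=123).fit(X_tr, y_tr)
    results[depth] = {
        "train_recall": recall_score(y_tr, rf_candidate.predict(X_tr)),
        "validate_recall": recall_score(y_val, rf_candidate.predict(X_val)),
    }

# Pick the depth with the best recall on validate, not on train
best_depth = max(results, key=lambda d: results[d]["validate_recall"])
print(results)
print("best max_depth on validate:", best_depth)
```

A large gap between train and validate recall at deeper settings is a sign of overfitting, which is the same pattern the summary table in this notebook is meant to expose.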

In [None]:
# splitting into y
y_validate = validate[['Outcome']]

### RandomForest on Validate

In [None]:
# features to model on
X_validate = X_validate_scaled[['Glucose','Age','BMI','insulin_glucose_cluster','DiabetesPedigreeFunction']]

# predict on validate
y_pred = rf.predict(X_validate)

# evaluate
print('Accuracy of random forest classifier on validate set: {:.2f}'
     .format(rf.score(X_validate, y_validate)))

In [None]:
print(classification_report(y_validate, y_pred))

### DecisionTree on Validate

In [None]:
X_validate = X_validate_scaled[['Glucose','BMI','DiabetesPedigreeFunction','Age','SkinThickness']]

# predict on validate
y_pred = clf.predict(X_validate)

# evaluate
print('Accuracy of Decision Tree classifier on validate set: {:.2f}'
     .format(clf.score(X_validate, y_validate)))

In [None]:
print(classification_report(y_validate, y_pred))

### KNN on Validate

In [None]:
# features created model on
X_validate = X_validate_scaled[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction','Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4']]

# predict on validate
y_pred = knn.predict(X_validate)

# evaluate
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate, y_validate)))

In [None]:
print(classification_report(y_validate, y_pred))

## Evaluating Top Model on Test - Determine if Overfit
- RandomForest did better on recall

In [None]:
# splitting into X and y
# features to model on
X_test = X_test_scaled[['Glucose','Age','BMI','insulin_glucose_cluster','DiabetesPedigreeFunction']]
y_test = test[['Outcome']]

In [None]:
# predict on test
y_pred = rf.predict(X_test)

# evaluate
print('Accuracy of random forest classifier on test set: {:.2f}'
     .format(rf.score(X_test, y_test)))

In [None]:
print(classification_report(y_test, y_pred))

# Modeling Summary <a class="anchor" id="model-summary"></a>
##### Baseline Accuracy: 66%, recall 0%
## All Models Tested on Train
| Model Type          | Hyperparameters         | Features                                                                                                                                                            | Accuracy | Recall on True Positive (Diabetic Predicted Diabetic) |
|---------------------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-------------------------------------------------------|
| Logistic Regression | C=10                    | All Features                                                                                                                                                        | 79%      | 62%                                                   |
| Decision Tree       | max_depth=5             | ['Glucose','BMI','DiabetesPedigreeFunction', 'Age']                                                                                                                 | 84%      | 70%                                                   |
| Random Forest       | max_depth=5             | ['Glucose','Age','BMI', 'insulin_glucose_cluster','DiabetesPedigreeFunction']                                                                                       | 85%      | 75%                                                   |
| KNN                 | n_neighbors=5           | ['Pregnancies','Glucose','BloodPressure', 'SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction', 'Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4'] | 80%      | 62%                                                   |
| Ridge Classifier    | None                    | ['Pregnancies','Glucose','BloodPressure', 'SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction', 'Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4'] | 78%      | 57%                                                   |
| SGD Classifier      | max_iter=1000, tol=1e-3 | ['Pregnancies','Glucose','BloodPressure', 'SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction', 'Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4'] | 68%      | 7%                                                   |

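A baseline recall of 0% is what a majority-class baseline produces: it predicts the most common class (not diabetic) for every patient, so it never flags a diabetic. A minimal sketch with made-up labels; the counts are illustrative and chosen to mirror a roughly two-thirds majority, not taken from the project's train split.

```python
from collections import Counter

# Illustrative labels only: 0 = not diabetic, 1 = diabetic (assumed counts).
y_true = [0] * 66 + [1] * 34

# Baseline: always predict the most common class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_base = [majority_class] * len(y_true)

# Accuracy is the share of rows where the baseline happens to be right.
accuracy = sum(t == p for t, p in zip(y_true, y_base)) / len(y_true)

# Recall on the positive class: diabetics caught / all diabetics.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_base))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f'baseline accuracy: {accuracy:.2f}, recall: {recall:.2f}')
```

Any model worth keeping should beat this accuracy while also achieving a recall above zero.
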
## All Models Tested on Validate
| Model Type    | Hyperparameters | Features                                                                                                                                                            | Accuracy | Recall on True Positive (Diabetic Predicted Diabetic) |
|---------------|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-------------------------------------------------------|
| Decision Tree | max_depth=5     | ['Glucose','BMI','DiabetesPedigreeFunction', 'Age']                                                                                                                 | 74%      | 58%                                                   |
| Random Forest | max_depth=5     | ['Glucose','Age','BMI', 'insulin_glucose_cluster','DiabetesPedigreeFunction']                                                                                       | 78%      | 64%                                                   |
| KNN           | n_neighbors=5   | ['Pregnancies','Glucose','BloodPressure', 'SkinThickness','Insulin','BMI', 'DiabetesPedigreeFunction', 'Age','age_bins','bp_bins','high_bmi_bp','age_bmi_cluster4'] | 75%      | 55%                                                   |

## Final Model Metrics with Train, Validate, and Test: Random Forest

| Model    | RandomForest | (max_depth=5, random_state=123)             | ['Glucose','Age','BMI','insulin_glucose_cluster','DiabetesPedigreeFunction'] |
|----------|--------------|---------------------------------------------|------------------------------------------------------------------------------|
| Split    | Accuracy     | Recall on Positive (predicting diabetic)    | Precision on Positive (predicting diabetic)                                  |
| Train    | 85%          | 75%                                         | 84%                                                                          |
| Validate | 78%          | 64%                                         | 73%                                                                          |
| Test     | 76%          | 60%                                         | 69%                                                                          |

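One simple way to read this table for overfitting is to look at how far each metric drops from train to the unseen splits. The sketch below uses the percentages from the table above; the 10-point threshold is an arbitrary illustration, not a rule used in this project.

```python
# Metrics copied from the final-model table above (percent).
metrics = {
    'Train':    {'accuracy': 85, 'recall': 75, 'precision': 84},
    'Validate': {'accuracy': 78, 'recall': 64, 'precision': 73},
    'Test':     {'accuracy': 76, 'recall': 60, 'precision': 69},
}

GAP_THRESHOLD = 10  # hypothetical cutoff, in percentage points

# Drop from train to each unseen split, per metric.
gaps = {}
for split in ('Validate', 'Test'):
    gaps[split] = {
        name: metrics['Train'][name] - metrics[split][name]
        for name in metrics['Train']
    }

for split, drop in gaps.items():
    flagged = [name for name, d in drop.items() if d > GAP_THRESHOLD]
    print(split, drop, 'large drop in:', flagged or 'none')
```

With these numbers, recall and precision fall further than accuracy does, which may point to some overfitting on the positive class.
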
[Table of Contents](#top)

# Conclusions <a class="anchor" id="fin"></a>
- final model outperforms baseline (64% accuracy, 0% recall)
- most clusters created were not significant for random forest model
- emphasis on model performance for the positive class (recall and precision when predicting diabetic)
    - diagnosing a patient early prevents further harm if medicine or therapy is needed
    - failing to diagnose a patient can lead to dangerous blood glucose levels

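Recall and precision on the positive class both come from the confusion-matrix counts: recall penalizes missed diagnoses, while precision penalizes false alarms. A small sketch with made-up counts (these are not results from this project):

```python
# Hypothetical confusion-matrix counts for the positive (diabetic) class.
true_pos = 30   # diabetic, predicted diabetic
false_neg = 20  # diabetic, predicted not diabetic (missed diagnosis)
false_pos = 10  # not diabetic, predicted diabetic (false alarm)
true_neg = 90   # not diabetic, predicted not diabetic

total = true_pos + false_neg + false_pos + true_neg

recall = true_pos / (true_pos + false_neg)     # share of diabetics caught
precision = true_pos / (true_pos + false_pos)  # share of flags that are correct
accuracy = (true_pos + true_neg) / total       # share of all rows correct

print(f'recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}')
```

In this made-up case accuracy looks reasonable even though 40% of diabetic patients are missed, which is why accuracy alone can be misleading on imbalanced data.
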
## Next Steps
- hypothesis testing on more features
- create new clusters and test their significance for modeling with visuals and hypothesis testing

[Table of Contents](#top)