# Pima Indians Diabetes ML Framework

Fitting multiple models and evaluating performance. See `pima_diabetes_dev.ipynb` for info.

## Setup

General Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Scikit-learn Specific Imports

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


Import `diabetes.csv` and impute values into relevant columns. See `pima_diabetes_dev.ipynb` for info.

In [3]:
diabetes = pd.read_csv('data/diabetes.csv')
diabetes['BloodPressure'] = diabetes['BloodPressure'].replace(0, diabetes[diabetes['BloodPressure'] > 0].BloodPressure.median())
diabetes['BMI'] = diabetes['BMI'].replace(0, diabetes[diabetes['BMI'] > 0].BloodPressure.mean())
diabetes['SkinThickness'] = diabetes['SkinThickness'].replace(0, diabetes[diabetes['SkinThickness'] > 0].BloodPressure.median())

## Train-Test Splitting 

Let's stratify by diabetes outcome so that there's equal proportions for train and test.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(diabetes.drop(['Outcome'], axis = 1), diabetes['Outcome'], test_size = 0.4, random_state = 42,stratify = diabetes['Outcome'])

## Fitting framework: 

* Fits either Logistic Regression or Support Vector, Random Forest or Gradient Boosting Classifiers
* Tunes hyper-parameters with `GridSearchCV`
* Evaluates Training and Test Accuracy

In [6]:
models = {"logreg": LogisticRegression(), "svc": SVC(), "rf": RandomForestClassifier(), "gb": GradientBoostingClassifier()}

models_desc = {
    "logreg": "Logistic Regression", 
    "svc": "Support Vector Classifier", 
    "rf": "Random Forest Classifier",
    "gb": "Gradient Boosting Classifier"
}

models_out = []

tuning_params_gb = dict(gb__min_samples_leaf = [3, 4, 5], gb__min_samples_split = [0.005, 0.01, 0.02, 0.05], gb__max_depth = [3, 4, 5])
tuning_params_logreg = dict(logreg__penalty = ['l2', 'none'])
tuning_params_svc = dict(svc__C = np.linspace(0.1,2,20), svc__kernel = ['linear','poly', 'rbf'])
tuning_params_rf = dict(rf__max_depth = [1, 3, 5, 10, 15, 20], rf__max_features = ['auto', 'sqrt'], rf__min_samples_leaf = [2, 3, 4], rf__min_samples_split = [0.005, 0.01, 0.02, 0.05])

tuning_params = dict(logreg=tuning_params_logreg, svc = tuning_params_svc, rf = tuning_params_rf, gb = tuning_params_gb)

for model in models:
    pl = Pipeline([("Scale", StandardScaler()), (model, models[model])])
    searcher = GridSearchCV(pl, tuning_params[model])
    searcher.fit(X_train, y_train)

    search_df = pd.DataFrame(searcher.cv_results_)
    mean_test = float(search_df[search_df['params'] == searcher.best_params_]['mean_test_score'].drop_duplicates())
    mean_sd = float(search_df[search_df['params'] == searcher.best_params_]['std_test_score'].drop_duplicates())
    
    models_out.append([models_desc[model],
        mean_test,
        mean_sd,
        searcher.score(X_test, y_test), 
        searcher.best_params_])

summary_stats = pd.DataFrame.from_records(models_out, columns= ['Model', 'Mean_Train_Acc', 'Std_Train_Acc', 'Test_Acc', 'Best_Params'])

display(summary_stats)

Unnamed: 0,Model,Mean_Train_Acc,Std_Train_Acc,Test_Acc,Best_Params
0,Logistic Regression,0.76087,0.032969,0.737013,{'logreg__penalty': 'l2'}
1,Support Vector Classifier,0.793478,0.050047,0.74026,"{'svc__C': 0.7, 'svc__kernel': 'rbf'}"
2,Random Forest Classifier,0.793478,0.035721,0.733766,"{'rf__max_depth': 15, 'rf__max_features': 'aut..."
3,Gradient Boosting Classifier,0.795652,0.046828,0.75,"{'gb__max_depth': 5, 'gb__min_samples_leaf': 3..."


## Summary

* All four models appear sensible choices, being sensible for binary classification problems. 
* There is approximately 3-5% variation in the 5-fold cross validation scores for each estimator.
* The Gradient Boosting classifier appears to have marginally better train and test performance than the other classifers. SVC performs well too.
