<img src="https://media.nature.com/lw800/magazine-assets/d41586-020-01891-8/d41586-020-01891-8_18111016.jpg" height="800px" width="800px">

People with type 1 diabetes are unable to produce the hormone insulin. Photo Credit: Bernard Chantal/Alamy

# About the Data

This dataset was sourced from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset is used to predict whether or not a patient has diabetes, based on certain diagnostic measurements. Several constraints were placed on the selection criteria from a larger database, meaning that the observations are female patients at least 21 years old and of Pima Native American heritage.

## Objective
Conduct EDA on the diabetes dataset and build an ensemble classifier to improve the AUC score.

## Dataset
The dataset consists of eight features and one target variable, Outcome.

* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient
* **Age**: Age (years)
* **Outcome**: Class variable (0 or 1) indicating whether or not a patient has diabetes

# Import Data & Summary Statistics

In [None]:
# Import packages for EDA and Data Visualization
import pandas as pd
pd.set_option('display.max_columns', 50)
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set(style="white")

SEED = 42  # random seed for modeling

# Import Classes for Model Training and Feature Exploration / Engineering
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from scipy.stats import mstats
from sklearn.preprocessing import (StandardScaler, LabelEncoder, PolynomialFeatures)
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import (cross_val_score, RepeatedStratifiedKFold,
                                     train_test_split, GridSearchCV)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# Import several other Classes for Ensembling / Stacking 
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, VotingClassifier
import xgboost as xgb

In [None]:
diabetesDF = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')

Metadata

In [None]:
# info() prints its summary directly and returns None, so no display() is needed
diabetesDF.info()

Descriptive Stats

In [None]:
display(diabetesDF.describe())

First 5 observations

In [None]:
display(diabetesDF.head())

Last 5 observations

In [None]:
display(diabetesDF.tail())

* It's clear that we have zeroes in the BloodPressure, SkinThickness & BMI attributes, which are not physiologically plausible. Other attributes also contain zeroes; however, without knowing how the data was collected, I am hesitant to treat zeroes in features such as Insulin as missing.
* You can think of these zeroes as missing values.
* Let's address them with imputation.
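
As a quick check before imputing, it can help to count how many zeroes each column holds. The sketch below uses a small hypothetical frame (`toyDF` and its values are made up for illustration); in the notebook the same two lines would be applied to `diabetesDF`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real notebook would use diabetesDF instead.
toyDF = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count zeroes per column, then express them as a share of all rows.
zero_counts = (toyDF == 0).sum()
zero_share = (toyDF == 0).mean().round(2)
print(pd.DataFrame({"zeroes": zero_counts, "share": zero_share}))
```

Columns where the share of zeroes is large deserve extra caution, since imputing many values leans heavily on the imputation model.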

## Missing Value Imputation - Iterative Imputer

We are going to use Scikit-Learn's Iterative Imputer to impute values that are currently set to zero.  Iterative Imputer imputes missing values by modeling each feature as a function of other features.
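
To make "modeling each feature as a function of other features" concrete, here is a deliberately simplified sketch in plain NumPy: one column with a missing value is predicted from another column with a least-squares line. This is not how `IterativeImputer` is implemented (it cycles over all features for several rounds and uses a Bayesian regressor by default); the toy array and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data: column 1 is roughly 2 * column 0, with one missing value.
data = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, np.nan],
    [5.0, 9.8],
])

missing = np.isnan(data[:, 1])

# Fit y = a * x + b on the rows where column 1 is observed.
x_obs, y_obs = data[~missing, 0], data[~missing, 1]
design = np.column_stack([x_obs, np.ones_like(x_obs)])
(a, b), *_ = np.linalg.lstsq(design, y_obs, rcond=None)

# Predict the missing entries from column 0 and fill them in.
filled = data.copy()
filled[missing, 1] = a * data[missing, 0] + b
print(filled[missing, 1])
```

The imputed value follows the trend of the observed rows rather than defaulting to a column mean, which is the main appeal of model-based imputation.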

### Before Imputation

In [None]:
# Define columns with missing values
missing_colsDF = diabetesDF[['BloodPressure','SkinThickness','BMI']]

# show the pairplot
sns.pairplot(missing_colsDF)
plt.show()

# replace zeroes with np.nan
missing_colsDF = missing_colsDF.replace(0, np.nan)
missing_colsDF.info()

The visualizations show that these three attributes appear multi-modal: the zeroes standing in for missing values form a second spike away from the main distribution.

### After Imputation

In [None]:
# Iteratively impute
imp_iter = IterativeImputer(max_iter=5, sample_posterior=True, random_state=SEED)
diabetes_imp_iter = imp_iter.fit_transform(missing_colsDF)

# Convert returned array to DataFrame
diabetes_imp_iterDF = pd.DataFrame(diabetes_imp_iter, columns=missing_colsDF.columns)

# show the pairplot
sns.pairplot(diabetes_imp_iterDF)
plt.show()

# Check the DataFrame's info
diabetes_imp_iterDF.info()

The missing values have been imputed and the spike at zero is gone, so the distributions look closer to normal than they did before imputation.

In [None]:
# drop original columns from dataframe 
diabetesDF = diabetesDF.drop(['BloodPressure','SkinThickness','BMI'], axis=1)

# and add new columns containing imputations
diabetesDF = diabetesDF.join(diabetes_imp_iterDF)

# Check the DataFrame's info
diabetesDF.info()

# EDA

Now we will examine the distribution of all the attributes in the dataset by Outcome.

In [None]:
sns.pairplot(diabetesDF, hue='Outcome', palette="Blues")
plt.show()

* Fewer of the younger women have diabetes, compared to women in their 30s-50s.
* Women who have diabetes tend to have a higher BMI.
* Diabetes pedigree function, insulin & glucose tend to be higher for women who have diabetes.
* Pregnancies also tend to be higher among women who have diabetes.
* There are outliers present in almost all of the attributes.

## Outlier Detection & Replacement

Let's take a closer look at the outliers by visualizing box plots by Outcome.

In [None]:
def boxplot(df: pd.DataFrame) -> None:
    """
    Visualize a boxplot for each feature for each class.
    """
    for col in df.columns:
        if col != 'Outcome':
            # draw each feature on its own figure
            plt.figure()
            sns.boxplot(x='Outcome', y=col, data=df, palette='Blues')
            plt.show()
            
# call the function to display the boxplot
boxplot(df=diabetesDF)

Let's winsorize the data: the lowest 5% of values in each feature are set to the 5th percentile value, and the highest 5% are set to the 95th percentile value.

In [None]:
def winsorizeDF(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace outliers with 5th and 95th percentile values.
    """
    return df.apply(lambda x: mstats.winsorize(x, limits=[.05, .05]))

# apply winsorize and show summary stats
diabetes_winDF = winsorizeDF(df=diabetesDF.drop('Outcome', axis=1))
diabetesDF = diabetes_winDF.join(diabetesDF['Outcome'])
display(diabetesDF.describe())

Outliers have been replaced!
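
To see what winsorizing does on a small scale, the sketch below applies the same `mstats.winsorize` call with `limits=[.05, .05]` to a hypothetical column of 20 integers. With 20 observations, 5% corresponds to one value at each end, so only the smallest and largest values are capped.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical toy column: the integers 1 through 20.
values = np.arange(1, 21)

# limits=[.05, .05] caps the lowest 5% and highest 5% of the observations.
# With 20 values that is one value at each end.
capped = mstats.winsorize(values, limits=[0.05, 0.05])

print(values.min(), values.max())
print(capped.min(), capped.max())
```

Unlike trimming, winsorizing keeps every row: extreme values are pulled in to the nearest retained value instead of being dropped.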

## Class Imbalance
Next, let's take a look at the proportion of classes.

In [None]:
sns.countplot(x='Outcome', data=diabetesDF, palette="Blues")
plt.show()

# Show the % share of classes
print(round(diabetesDF['Outcome'].value_counts(normalize=True)*100))

* About 35% of the women in this sample have diabetes.
* The class imbalance is not too bad: a little less than a 2:1 ratio of negative to positive diabetes diagnoses.

## Correlation Heatmap

In [None]:
# Create the correlation matrix
corr = diabetesDF.corr()

# Generate a mask for the upper triangle 
mask = np.triu(np.ones_like(corr, dtype=bool))

# Add the mask to the heatmap
plt.figure(figsize = (10,5))
sns.heatmap(corr, mask=mask, center=0, linewidths=1, annot=True, fmt=".2f", cmap="Blues")
plt.show()

* SkinThickness has a moderately positive relationship with BMI.
* Age has a moderately positive relationship with Pregnancies.
* Glucose has a moderate to low positive relationship with Outcome.

## PCA for Feature Exploration

We can use Principal Component Analysis to project the features into a 2-dimensional space and check how well the two Outcome classes separate.

In [None]:
# create the feature set: X
X = diabetesDF.drop('Outcome', axis=1)

In [None]:
# Build the PCA pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=2, random_state=SEED))])

# Fit it to the dataset and extract the component vectors
pc = pipe.fit_transform(X)
vectors = pipe.named_steps['reducer'].components_.round(2)

# Print feature effects
print('PC 1 effects = ' + str(dict(zip(X.columns, vectors[0]))))
print('PC 2 effects = ' + str(dict(zip(X.columns, vectors[1]))))

* The first PC has a moderate to low positive relationship with SkinThickness, BMI and BloodPressure.
* The second PC has a moderately positive relationship with Pregnancies and Age.

In [None]:
# Add the 2 components to X
X['PC 1'] = pc[:, 0]
X['PC 2'] = pc[:, 1]

# join outcome variable from original DF
X = X.join(diabetesDF['Outcome'])

# Use the Outcome feature to color the PC 1 vs PC 2 in the scatterplot
sns.scatterplot(data=X, x='PC 1', y='PC 2', hue='Outcome', palette="Blues")
plt.show()

There isn't a clear separation of classes when we visualize the PCs by Outcome. This means we'll have to take special care when we engineer features for modeling.
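
When reading a 2-component scatterplot, it is worth knowing how much of the total variance those two components actually carry. The sketch below computes the explained variance ratio with plain NumPy on a hypothetical random matrix (the data and variable names are made up); with the pipeline above, a fitted scikit-learn `PCA` step exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the feature matrix: 200 rows, 8 loosely related columns.
base = rng.normal(size=(200, 3))
features = base @ rng.normal(size=(3, 8)) + 0.5 * rng.normal(size=(200, 8))

# Standardize each column, as the StandardScaler step does.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Squared singular values of the standardized data are proportional
# to the variance captured along each principal component.
singular_values = np.linalg.svd(scaled, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

print("Share of variance in first two PCs:", round(float(explained_ratio[:2].sum()), 3))
```

If the first two components hold only a modest share of the variance, overlap in the scatterplot does not necessarily mean the classes are inseparable in the full feature space.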

# Model Training & Evaluation

We're going to use AUC as the metric to evaluate our model since we noticed the presence of class imbalance earlier.
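
One way to read AUC: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The helper below (`pairwise_auc`, a hypothetical name, with made-up labels and scores) is a minimal sketch of that definition; it is meant for intuition and is not the routine scikit-learn uses internally.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Hypothetical labels and model scores.
y_toy = [0, 0, 1, 1]
print(pairwise_auc(y_toy, [0.1, 0.4, 0.35, 0.8]))   # informative scores
print(pairwise_auc(y_toy, [0.5, 0.5, 0.5, 0.5]))    # same score for everyone
```

Because the measure depends only on how positives rank against negatives, a model that gives every patient the same score lands at 0.5 regardless of how the classes are balanced.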

## Create a Naive Classifier

Let's start by defining our features and target variables and evaluating a Naive Classifier. This is a Dummy Classifier that uses the modal value for its predictions.

In [None]:
def define_X_y(df: pd.DataFrame) -> tuple:
    """
    Create X and y variables for ML Classification
    """
    # create the feature set: X
    X = df.drop('Outcome', axis=1)

    # create the target variable: y
    y = df['Outcome']
    
    # Convert X to 'float32' for consistency
    X = X.astype('float32')

    # Convert y (target) to 'str' so LabelEncoder treats the classes as labels
    y = y.astype('str')

    # Encode the class labels in y as integers (0 and 1)
    y = LabelEncoder().fit_transform(y)
    
    return X, y

# call the function to return features & target
X, y = define_X_y(df=diabetesDF)

In [None]:
# Instantiate a DummyClassifier with 'most_frequent' strategy
naive = DummyClassifier(strategy='most_frequent', random_state=SEED)

In [None]:
def evaluate_model(model, X, y) -> np.ndarray:
    """
    Cross-validate model training and print out AUC.
    """
    # Create RepeatedStratifiedKFold cross-validator with 10 folds, 3 repeats and the global SEED
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=SEED)

    # Calculate AUC for each of the 30 train/test rounds; error_score='raise' surfaces fit errors immediately
    auc_scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1, error_score='raise')

    # Print mean and standard deviation of auc_scores
    print('Model AUC: %.4f (%.4f)' % (np.mean(auc_scores), np.std(auc_scores)))
    print()
    
    # return auc
    return auc_scores

In [None]:
# evaluate Naive Classifier
naive_auc = evaluate_model(naive, X=X, y=y)

The Naive Classifier has an AUC of 0.50, which is no better than a random guess.

## Create a Baseline Classifier

Next, we will create a baseline classifier to generate predictions.  We will use a Decision Tree for our baseline classifier.

In [None]:
# Instantiate a DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=SEED)

# evaluate Baseline Classifier
dt_auc = evaluate_model(dt, X=X, y=y)

Great! We see a ~19 point increase in AUC by using a Decision Tree. Let's use some more robust models now and see how they perform on the data.

## Feature Engineering - Polynomial Interactions

Let's start the feature engineering process by creating polynomial interaction features.

In [None]:
# create the transformer
poly = PolynomialFeatures(include_bias=False, interaction_only=True)

# fit the transformer
poly_X = poly.fit_transform(X)

# Create a list of features
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_list = poly.get_feature_names_out(X.columns)

# Create DF
polyDF = pd.DataFrame(data=poly_X, columns=feature_list)
print(f"New DataFrame columns with Polynomial Features! \n \n {polyDF.columns}")
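
To see how many columns the interaction expansion produces, the sketch below rebuilds the feature names by hand with `itertools.combinations`. It assumes the eight predictors in the order they end up in after the imputation join, and it mirrors the "A B" naming used for interaction terms; it does not call scikit-learn.

```python
from itertools import combinations

# Column names of the eight predictors in this dataset.
columns = ["Pregnancies", "Glucose", "Insulin", "DiabetesPedigreeFunction",
           "Age", "BloodPressure", "SkinThickness", "BMI"]

# Degree-2, interaction-only expansion without a bias column:
# the original columns plus one product per unordered pair.
pair_names = [f"{a} {b}" for a, b in combinations(columns, 2)]
expanded = columns + pair_names

print(len(columns), len(pair_names), len(expanded))
```

Going from 8 to 36 columns on a dataset of under 800 rows is a reminder that interaction features add capacity quickly, so regularization and cross-validation matter more after this step.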

## Better Models + Hyper-Parameter Tuning

Next, let's define some higher-level learners, which include: SVM, Gaussian Naive Bayes, Random Forest, XGBoost, and Logistic Regression.

In [None]:
# instantiate list of models to be evaluated: model_list
model_list = [
    ('SVM', SVC()),
    ('Bayes', GaussianNB()),
    ('RF', RandomForestClassifier()),
    ('XGB', xgb.XGBClassifier()),
    ('LR', LogisticRegression())
]

In [None]:
def best_model(model, param_grid, X, y) -> None:
    """
    Identify best model by running a Grid Search on model hyper-parameters.
    """
    pipe = Pipeline([('scaler', StandardScaler()), 
                   ('classifier', model)])
    
    # Create grid search object using 5-fold cv
    clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=-1)

    # Fit on data
    best_clf = clf.fit(X, y)
    best_hyperparams = best_clf.best_estimator_.get_params()['classifier']

    # Print the values used for both Parameters & Score
    print(best_hyperparams) 
    print()
    print("Best Hyper-parameters: ", best_clf.best_params_)
    print()
    print("Best AUC Score: ", round(best_clf.best_score_, 4))
    print()

In [None]:
def best_model_hyp_params(model_list: list, X: pd.DataFrame, y: np.ndarray) -> None:
    for model in model_list:

        if model[0] == 'SVM':
            # 'precomputed' is omitted: it expects a square kernel matrix, not a feature matrix
            param_grid = {'classifier__kernel' : ['linear', 'poly', 'rbf', 'sigmoid']} 

            # perform hyper-parameter tuning and evaluate model
            best_model(model=model[1], param_grid=param_grid, X=X, y=y)

        if model[0] == 'Bayes': 
            param_grid = {'classifier__var_smoothing' : np.array([1e-09, 1e-08])} 

            # perform hyper-parameter tuning and evaluate model
            best_model(model=model[1], param_grid=param_grid, X=X, y=y)

        if model[0] == 'RF': 
            param_grid = {'classifier__criterion' : np.array(['gini', 'entropy']),
                          'classifier__max_depth' : np.arange(3,8)} 

            # perform hyper-parameter tuning and evaluate model
            best_model(model=model[1], param_grid=param_grid, X=X, y=y)

        if model[0] == 'XGB':
            param_grid = {'classifier__learning_rate' : np.arange(0.022,0.04,.01),
                          'classifier__max_depth' : np.arange(3,8)} 

            # perform hyper-parameter tuning and evaluate model
            best_model(model=model[1], param_grid=param_grid, X=X, y=y)

        if model[0] == 'LR':
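            # liblinear supports both 'l1' and 'l2'; the default lbfgs solver
            # cannot fit 'l1' or 'elasticnet', so those fits would fail
            param_grid = {'classifier__solver' : ['liblinear'],
                          'classifier__penalty' : ['l1', 'l2'],
                          'classifier__C' : [0.001, 0.01, 0.1, 1, 10, 100]} 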

            # perform hyper-parameter tuning and evaluate model
            best_model(model=model[1], param_grid=param_grid, X=X, y=y)
            
# run the Grid Search
best_model_hyp_params(model_list=model_list, X=poly_X, y=y)

Random Forest & Logistic Regression have the best AUC scores among the models that were trained.  Let's eliminate the Gaussian Naive Bayes Classifier from the competition since it was one of the weakest models.  We still have use for XGBoost when we perform model ensembling.  Let's spend some time to understand these models.

## Feature Importance

Now we will look at the Feature Importance for the top 2 performing models: Random Forest & Logistic Regression.

In [None]:
def rf_feature_importance() -> pd.DataFrame:
    """
    Plot feature importance for Random Forest Classifier.
    """
    # fit RandomForestClassifier
    rfc = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=6, random_state=SEED, n_jobs=-1).fit(poly_X, y)

    # Calculate feature importances
    feature_importances = rfc.feature_importances_

    # Create a list of features
    feature_list = poly.get_feature_names_out(X.columns)

    # Save the results inside a DataFrame using feature_list as an index
    feature_importances = pd.DataFrame(
        index=feature_list, 
        data=feature_importances, 
        columns=["Feature Importance"]) \
    .sort_values(by=["Feature Importance"], ascending=False)
    
    # reset index
    feature_importances = feature_importances.reset_index()

    # plot barplot
    pal = sns.color_palette("Blues")
    plt.figure(figsize=(10,8))
    sns.barplot(x="Feature Importance", y="index", data=feature_importances,
                label="Total", color=pal[3])
    plt.ylabel('')
    plt.title('Random Forest Feature Importance \n')
    plt.tight_layout()
    plt.show()
    
    return feature_importances

# show RF feature importance
rfc_feat_importanceDF = rf_feature_importance()

In [None]:
def lr_feature_importance() -> pd.DataFrame:
    """
    Plot feature importance (absolute coefficients) for Logistic Regression.
    """
    # fit LogisticRegression Classifier
    lr = LogisticRegression(C=1.0, penalty='l2', solver='liblinear', random_state=SEED).fit(poly_X, y)

    # Calculate feature importances
    feature_importances = abs(lr.coef_[0])

    # Create a list of features
    feature_list = poly.get_feature_names_out(X.columns)

    # Save the results inside a DataFrame using feature_list as an index
    feature_importances = pd.DataFrame(
        index=feature_list, 
        data=feature_importances, 
        columns=["Coefficients"]) \
    .sort_values(by=["Coefficients"], ascending=False)
    
    # reset index
    feature_importances = feature_importances.reset_index()

    # plot barplot
    pal = sns.color_palette("Blues")
    plt.figure(figsize=(10,8))
    sns.barplot(x="Coefficients", y="index", data=feature_importances,
                label="Total", color=pal[3])
    plt.ylabel('')
    plt.title('Logistic Regression Feature Importance  \n')
    plt.tight_layout()
    plt.show()
    
    return feature_importances

# show LR feature importance
lr_feat_importanceDF = lr_feature_importance()

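It's interesting to see how differently Logistic Regression & Random Forest rank the features:

* Random Forest ranks Glucose BMI, Glucose, Glucose Age, Age BMI, & Glucose BloodPressure as the top 5 important features.
* Logistic Regression ranks Pregnancies DiabetesPedigreeFunction, BloodPressure, Pregnancies, Age & BMI as the top 5 important features. Logistic Regression also has a number of near-zero coefficients. L2 regularization shrinks coefficients but does not set them exactly to zero; the near-zero values likely reflect that the model was fit on unscaled features, where large-valued interaction terms receive small coefficients. Coefficient size is therefore only a rough guide to importance here.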

# Ensemble Modeling - Stacking vs. Voting

Now we will compare the individual models, including Logistic Regression from Scikit-Learn, against two ensemble methods.
We will stack models in layers, and we will also build a soft voting classifier. Lastly, we will calculate and visualize the AUC scores.

In [None]:
def model_stacking():
    """
    Create a stacked ML classifier that uses Random Forest and Logistic Regression for first layer predictions.
    An XGBoost meta learner then makes the final predictions, using the first layer's predictions as features.
    """
    # Create an empty list for the base models called layer1
    layer1 = list()

    # Append a (name, estimator) tuple for each base model: RF and LR
    layer1.append(('RF', 
                   RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=6, random_state=SEED)))
    layer1.append(('LR',
                   LogisticRegression(C=1.0, penalty='l2', solver='liblinear', random_state=SEED)))

    # Instantiate XGBoost as the meta learner called layer2
    layer2 = xgb.XGBClassifier(base_score=0.5, booster='gbtree',importance_type='gain',
                                     learning_rate=0.032, max_depth=3, n_estimators=100, n_jobs=-1, random_state=SEED)

    # Define StackingClassifier() called model, passing the layer1 model list and the meta learner, with 5-fold cross-validation
    model = StackingClassifier(estimators=layer1, final_estimator=layer2, cv=5, n_jobs=-1)

    # return model
    return model

In [None]:
def model_voting():
    """
    Create a voting ML classifier that averages the predictions of Random Forest, XGBoost, and Logistic Regression through a soft vote.
    """
    # instantiate base learners
    rf = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=6, n_jobs=-1, random_state=SEED)
    xgb_clf = xgb.XGBClassifier(base_score=0.5, booster='gbtree',importance_type='gain',
              learning_rate=0.032, max_depth=3, n_estimators=100, n_jobs=-1, random_state=SEED)
    lr = LogisticRegression(C=1.0, penalty='l2', random_state=SEED, n_jobs=-1)
    
    # instantiate Voting Classifier
    model = VotingClassifier([('RF', rf),
                            ('XGB', xgb_clf),
                            ('LR', lr)],
                           voting='soft', n_jobs=-1)
    
    # return model
    return model
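
Soft voting averages the class probabilities predicted by the base learners and picks the class with the highest average, whereas hard voting counts one vote per model. The sketch below illustrates the difference with hypothetical probabilities for four patients (all numbers are made up) and equal weights, which is an assumption matching a `VotingClassifier` with no `weights` argument.

```python
import numpy as np

# Hypothetical class-1 probabilities from three base learners for four patients.
proba_rf = np.array([0.80, 0.55, 0.55, 0.10])
proba_xgb = np.array([0.70, 0.52, 0.35, 0.20])
proba_lr = np.array([0.90, 0.20, 0.45, 0.05])

# Soft voting with equal weights: average the probabilities, then threshold at 0.5.
avg_proba = np.mean([proba_rf, proba_xgb, proba_lr], axis=0)
soft_vote = (avg_proba >= 0.5).astype(int)

# Hard voting for comparison: each model casts a 0/1 vote and the majority wins.
votes = np.array([proba_rf, proba_xgb, proba_lr]) >= 0.5
hard_vote = (votes.sum(axis=0) >= 2).astype(int)

print(avg_proba.round(3), soft_vote, hard_vote)
```

For the second patient two models lean slightly positive while one is confidently negative, so the hard vote says 1 and the soft vote says 0. Soft voting lets a confident model outweigh two marginal ones, and it also yields the continuous scores that AUC needs.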

In [None]:
def create_models():
    """
    Build a dictionary of models keyed by name: Random Forest, XGBoost & Logistic Regression
    as base models, plus the Stacking and Voting ensembles created with the custom functions.
    """
    # Create empty dictionary called models
    models = dict()

    # Add the base models
    models['RF'] = RandomForestClassifier(n_estimators=100, criterion='entropy', max_depth=6, n_jobs=-1, random_state=SEED)
    models['XGB'] = xgb.XGBClassifier(base_score=0.5, booster='gbtree',importance_type='gain',
              learning_rate=0.032, max_depth=3, n_estimators=100, n_jobs=-1, random_state=SEED)
    models['LR'] = LogisticRegression(random_state=SEED, n_jobs=-1)

    # Add key:value pair to dictionary with key called Stacking and value that calls model_stacking() function
    models['Stacking'] = model_stacking()
    
    # Add key:value pair to dictionary with key called Voting and value that calls model_voting() function
    models['Voting'] = model_voting()
    
    # return dictionary
    return models

# Assign the output of create_models() to a variable called models
models = create_models()

In [None]:
def evaluate_stacked_models(models: dict) -> None:
    """
    Evaluate the models and store results.
    """
    # Create an empty list for the results
    results = list()

    # Create an empty list for the model names
    names = list()

    # Create a for loop that iterates over each name, model in models dictionary 
    for name, model in models.items():
        
        print(f'{name}: {model}')
        print()

        # Call evaluate_model(model) and assign it to variable called scores
        scores = evaluate_model(model=Pipeline([
            ('scaler', StandardScaler()), 
            ('classifier', model)
        ]), X=poly_X, y=y)
        
        # Append output from scores to the results list
        results.append(scores)
        
        # Append name to the names list
        names.append(name)

    # Plot model performance for comparison: one box per model, with showmeans set to True
    scoresDF = pd.DataFrame(dict(zip(names, results)))
    plt.figure(figsize=(8,4))
    sns.boxplot(data=scoresDF, showmeans=True, palette="Blues")
    plt.title('Model AUC Comparison')
    plt.show()
    
# evaluate stacked models 
evaluate_stacked_models(models=models)

## Confusion Matrix & Classification Report

Finally let's look at the confusion matrix and classification report.

In [None]:
def classfication_metrics(models: dict) -> None:
    """
    Evaluate the other classification metrics for the voting classifier
    """
    # Create a for loop that iterates over each name, model in models dictionary 
    for name, model in models.items():

        print(f'{name}: {model}')
        print()

        if name == 'Voting':
            
            # instantiate the pipeline
            pipe = Pipeline([
                ('scaler', StandardScaler()), 
                ('classifier', model)
            ])

            # split into train & test sets
            X_train, X_test, y_train, y_test = train_test_split(poly_X, y, test_size=0.2, stratify=y, random_state=SEED)

            # fit the pipeline
            pipe.fit(X_train, y_train)

            # make predictions
            y_pred = pipe.predict(X_test)

            # print the confusion matrix
            print(confusion_matrix(y_true=y_test, y_pred=y_pred))
            print()

            # print the classification report
            print(classification_report(y_true=y_test, y_pred=y_pred))
            print()
            
# evaluate the voting classifier on a held-out test set
classfication_metrics(models=models)

## Observation
* Our Voting Classifier got ~84% AUC, which was .004 higher than Logistic Regression. The difference is small, but the ensemble did edge out the best single model.
* On the held-out test set, the voting classifier predicted 82 true negatives and 29 true positives.
* It also predicted 18 false positives and 25 false negatives.
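
For reference, scikit-learn's `confusion_matrix` puts the true class in rows and the predicted class in columns, so for labels 0 and 1 the flattened matrix reads TN, FP, FN, TP. The sketch below arranges the four reported counts in that layout, which is an assumption since the raw matrix is not reproduced here. The arrangement is consistent with the data: a stratified 20% split of 768 rows gives about 154 test rows, roughly 54 of them positive, which matches 25 + 29.

```python
import numpy as np

# The four counts reported above, arranged in scikit-learn's layout:
# rows are the true class (0, 1), columns are the predicted class (0, 1).
# The arrangement is an assumption; the raw matrix is not reproduced in the text.
cm = np.array([[82, 18],
               [25, 29]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted diabetics, how many truly are
recall = tp / (tp + fn)      # of the true diabetics, how many were found
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Under this reading, recall on the diabetic class is the weakest number, which suggests the default 0.5 decision threshold may be worth revisiting if missed diagnoses are the costlier error.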

## Next Steps
* We could try other models in stacking layer 1.
* We could try other models in stacking layer 2.
* We could add more hyperparameters to our parameter grid, but this will slow the computation time and add additional expense.
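
To put a number on that extra cost: a full grid search fits one model per parameter combination per cross-validation fold. The sketch below counts fits for the Random Forest grid used earlier and for a hypothetical extension with one more parameter (`n_estimators` and its values are made up for illustration). The final refit on the full data is not counted.

```python
from itertools import product

# Grid taken from the Random Forest search above.
grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
}
# Hypothetical extension with one additional hyperparameter.
extended = {**grid, "n_estimators": [100, 200, 500]}

def n_fits(param_grid, cv_folds=5):
    """Number of model fits a full grid search runs: combinations x folds."""
    n_combinations = len(list(product(*param_grid.values())))
    return n_combinations * cv_folds

print(n_fits(grid), n_fits(extended))
```

Each added parameter multiplies the number of fits, so a randomized search over the same space may be a cheaper way to explore larger grids.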