#### Model Training
Import Data and Required Packages

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

Import ML models and evaluation metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, average_precision_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import RandomizedSearchCV

import warnings

Define the evaluate_model function, which computes the following metrics from the true and predicted values:
* Precision - of all rows the model labeled as positive, the fraction that were actually positive
* Recall - of all actual positives, the fraction the model correctly identified
* F1 score - the harmonic mean of precision and recall, calculated as 2 * (Precision * Recall) / (Precision + Recall)
* ROC_AUC - the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various classification thresholds
* PR_AUC - the area under the Precision-Recall curve, which plots precision against recall at various classification thresholds (computed here with average_precision_score)
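
To make the precision, recall and F1 definitions above concrete, here is a small worked example. The labels are made up for illustration and are not taken from the diabetes data; the hand-computed values should agree with what scikit-learn returns.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels (made up for illustration): 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# Confusion counts for the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Same formulas as in the list above
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# Cross-check against scikit-learn
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here 2 of the 3 predicted positives are correct (precision 2/3) and 2 of the 4 actual positives are found (recall 1/2), giving F1 = 4/7.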

In [16]:
def evaluate_model(true, predicted, model, X_test):
    # X_test is whichever feature matrix `predicted` was computed from (train or test)
    # Threshold-based metrics use the hard class predictions
    precision = precision_score(true, predicted, zero_division=0)
    recall = recall_score(true, predicted, zero_division=0)
    f1 = f1_score(true, predicted, zero_division=0)
    # Ranking metrics use the predicted probability of the positive class
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(true, y_pred_proba)
    pr_auc = average_precision_score(true, y_pred_proba)
    return precision, recall, f1, roc_auc, pr_auc

Define a function to train and evaluate each model on the scaled data

* Logistic Regression - maps inputs to a probability between 0 and 1. Note that scikit-learn's default LogisticRegression() already applies an L2 penalty, so it differs from the "Ridge" entry below only in the solver
* Logistic Regression with L1 - (LASSO) adds a penalty on the absolute size of the coefficients to the model's loss function, which can force the coefficients of less important features to exactly zero
* Logistic Regression with L2 - (Ridge) adds a penalty proportional to the sum of the squared coefficients to the loss function, which limits overfitting by shrinking weights towards zero without eliminating them
* Support vector classifier - finds the boundary (hyperplane) that separates the classes while maximizing the margin between them (described here but not included in the models dictionary below)
* K-Neighbors classifier - classifies a new data point by finding its 'k' closest neighbors in the training data and assigning it the most common class (majority vote) among those neighbors
* Decision tree - a tree-like structure of decisions and their possible consequences, used to classify data
* Random forest classifier - builds many individual decision trees on random subsets of the data and features, then combines their predictions (via majority vote for classification)
* XGB classifier - the eXtreme Gradient Boosting algorithm, which combines many weak decision trees sequentially, with each tree correcting the errors of the last
* CatBoost classifier - automatically processes categorical features and builds an ensemble of decision trees, where each new tree corrects the errors of the previous ones
* AdaBoost classifier - combines the outputs of many simple "weak" classifiers, giving more weight to the samples that earlier classifiers got wrong
* Gradient boost classifier - sequentially combines many simple "weak" models (usually decision trees), each one correcting the errors of the previous ones
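
One practical point if a support vector classifier is added to the comparison: evaluate_model calls predict_proba, which some margin-based models such as scikit-learn's LinearSVC do not provide. A minimal sketch of a fallback, assuming a hypothetical helper named positive_class_scores (not part of this notebook) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.svm import LinearSVC


def positive_class_scores(model, X):
    """Return a continuous score for the positive class (hypothetical helper)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    # Margin-based models such as LinearSVC only provide decision_function
    return model.decision_function(X)


# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_results = {}
for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    clf.fit(X_demo, y_demo)
    scores = positive_class_scores(clf, X_demo)
    demo_results[type(clf).__name__] = (
        roc_auc_score(y_demo, scores),
        average_precision_score(y_demo, scores),
    )

for name, (roc_auc, pr_auc) in demo_results.items():
    print(f"{name}: ROC AUC={roc_auc:.4f} PR AUC={pr_auc:.4f}")
```

Both roc_auc_score and average_precision_score only need a ranking of the samples, so decision-function margins work in place of probabilities.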

In [None]:
def classification(X_train, y_train, X_test, y_test):

    models = {
        "Logistic Regression": LogisticRegression(),

        "Lasso": LogisticRegression(penalty='l1', solver='liblinear'),
   
        "Ridge": LogisticRegression(penalty='l2', solver='liblinear'),

        "K-Neighbors Classifier": KNeighborsClassifier(),

        "Decision Tree": DecisionTreeClassifier(),

        "Random Forest Classifier": RandomForestClassifier(),

        "XGBClassifier": XGBClassifier(), 
        
        "CatBoosting Classifier": CatBoostClassifier(verbose=False),

        "AdaBoost Classifier": AdaBoostClassifier(),

        "GradientBoosting Classifier": GradientBoostingClassifier()
    }
 
    # Lists to store results
    results_train = []
    results_test = []
    
    for model_name, model in models.items():
 
        # Train model
        model.fit(X_train, y_train)
        
        # Training predictions
        y_train_pred = model.predict(X_train)
        accuracy_train = accuracy_score(y_train, y_train_pred)
        train_precision, train_recall, train_f1, train_roc_auc, train_pr_auc = evaluate_model(
            y_train, y_train_pred, model, X_train
        )
        
        # Test predictions
        y_test_pred = model.predict(X_test)
        accuracy_test = accuracy_score(y_test, y_test_pred)
        test_precision, test_recall, test_f1, test_roc_auc, test_pr_auc = evaluate_model(
            y_test, y_test_pred, model, X_test
        )
        
        # Store training results
        results_train.append({
            'Model': model_name,
            'Accuracy': accuracy_train,
            'Precision': train_precision,
            'Recall': train_recall,
            'F1 Score': train_f1,
            'ROC AUC': train_roc_auc,
            'PR AUC': train_pr_auc
        })
        
        # Store test results
        results_test.append({
            'Model': model_name,
            'Accuracy': accuracy_test,
            'Precision': test_precision,
            'Recall': test_recall,
            'F1 Score': test_f1,
            'ROC AUC': test_roc_auc,
            'PR AUC': test_pr_auc
        })
    
    # Create DataFrames
    df_train = pd.DataFrame(results_train)
    df_test = pd.DataFrame(results_test)
    
    # Display tables
    print("=" * 120)
    print("TRAINING SET PERFORMANCE")
    print("=" * 120)
    print(df_train.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    print("\n")
    
    print("=" * 120)
    print("TEST SET PERFORMANCE")
    print("=" * 120)
    print(df_test.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    print("\n")
    

Load in the scaled data to train

In [18]:
X_train=pd.read_csv('data/scaled_unbalanced_X_train.csv')
y_train=pd.read_csv('data/y_train.csv')

In [19]:
X_train.shape

(202944, 20)
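
Because pd.read_csv returns a DataFrame, y_train loaded this way is a 2-D column of shape (n, 1), while scikit-learn estimators expect a 1-D target and raise a DataConversionWarning otherwise. A minimal sketch of flattening the target, using a small stand-in DataFrame with a hypothetical column name:

```python
import pandas as pd

# Stand-in for pd.read_csv('data/y_train.csv'): a one-column DataFrame
# (the column name "target" is hypothetical)
y_frame = pd.DataFrame({"target": [0, 1, 0, 0, 1]})
print(y_frame.shape)  # 2-D column

# Flatten to a 1-D array before passing it to fit() or the metric functions
y_flat = y_frame.to_numpy().ravel()
print(y_flat.shape)  # 1-D
```

The same ravel() call could be applied to y_train and y_test right after loading them.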

In [20]:
X_test=pd.read_csv('data/X_test.csv')
y_test=pd.read_csv('data/y_test.csv')

In [21]:
X_test.shape


(50736, 20)
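
The same X_test from data/X_test.csv is passed to every run in this notebook, including models trained on the scaled files and models trained on the unscaled files. Whether that file was scaled is not shown in this section. For the scaled runs, the test features need the transformation that was fitted on the training features. A minimal sketch, assuming StandardScaler was the scaler (an assumption) and using synthetic arrays in place of the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the unscaled train and test features
X_train_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_raw = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit the scaler on the training features only
scaler = StandardScaler().fit(X_train_raw)

# Apply the same fitted parameters to both splits
X_train_scaled = scaler.transform(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

print("train means:", X_train_scaled.mean(axis=0).round(3))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
print("raw test means:", X_test_raw.mean(axis=0).round(3))
```

A quick check of the column means and standard deviations of X_train and X_test shows whether both are on the same scale before the models are compared.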

Run the models with the scaled and unbalanced data

In [22]:
classification(X_train, y_train, X_test, y_test)


TRAINING SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
        Logistic Regression    0.8476     0.5524  0.1836    0.2756   0.8165  0.4321
                      Lasso    0.8476     0.5522  0.1842    0.2763   0.8167  0.4320
                      Ridge    0.8476     0.5524  0.1839    0.2759   0.8167  0.4320
     K-Neighbors Classifier    0.8726     0.6766  0.3701    0.4785   0.8965  0.5484
              Decision Tree    0.9932     0.9987  0.9579    0.9779   0.9998  0.9985
   Random Forest Classifier    0.9931     0.9946  0.9617    0.9779   0.9992  0.9971
              XGBClassifier    0.8623     0.6709  0.2501    0.3644   0.8500  0.5381
     CatBoosting Classifier    0.8651     0.7001  0.2549    0.3737   0.8491  0.5535
        AdaBoost Classifier    0.8498     0.5623  0.2202    0.3165   0.8205  0.4502
GradientBoosting Classifier    0.8513     0.5808  0.2092    0.3076   0.8259  0.4633


TEST SET PERFORMANCE
                      Model 

Observations:
* On ROC_AUC, AdaBoost performs best, and it also has the best PR_AUC.
* Random Forest has the best F1 score.
* The logistic regression models, including Lasso and Ridge, perform poorly, no better than a random guess.
* XGB performs better than GradientBoosting, which makes sense because XGBoost is an optimized, regularized implementation of gradient boosting.
* Note the trend of higher recall and lower precision on the positive class (having diabetes), which is the minority class.



With an unbalanced dataset we want to penalize misclassifying the minority class (1, having diabetes) more heavily. Strictly, missed positives (false negatives) are captured by recall, while precision captures false positives; the selection here prioritizes precision, i.e. minimizing false positives.

AdaBoost has the best precision and the best scores overall, so it would be the model of choice for the scaled and unbalanced dataset.
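
One way to penalize minority-class errors more heavily without resampling is the class_weight parameter that several scikit-learn classifiers accept. This notebook does not use it; the sketch below uses synthetic imbalanced data and LogisticRegression purely as an illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 15% positives
X_imb, y_imb = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], flip_y=0.02, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.25, stratify=y_imb, random_state=0
)

# Unweighted model vs. a model whose loss weights classes by inverse frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
precision_plain = precision_score(y_te, plain.predict(X_te), zero_division=0)
precision_weighted = precision_score(y_te, weighted.predict(X_te), zero_division=0)

print(f"plain:    precision={precision_plain:.4f} recall={recall_plain:.4f}")
print(f"weighted: precision={precision_weighted:.4f} recall={recall_weighted:.4f}")
```

Weighting typically raises recall on the minority class at the cost of precision, so which setting is preferable depends on which error is more costly.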


Load in the scaled and balanced datasets

In [23]:
X_train=pd.read_csv('data/scaled_balanced_X_train.csv')
y_train=pd.read_csv('data/scaled_balanced_y_train.csv')
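
How the balanced files were produced is not shown in this section. One common option is random oversampling of the minority class, applied to the training split only so that the test set keeps its original class distribution. A minimal sketch under that assumption, with a toy DataFrame and hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training set: 8 negatives, 2 positives (column names are hypothetical)
df = pd.DataFrame({"feature": np.arange(10.0), "target": [0] * 8 + [1] * 2})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Combine and shuffle
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["target"].value_counts())
```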

Run the classification models on the scaled and balanced dataset

In [24]:
classification(X_train, y_train, X_test, y_test)


TRAINING SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
        Logistic Regression    0.7464     0.7357  0.7691    0.7520   0.8211  0.7852
                      Lasso    0.7464     0.7357  0.7691    0.7520   0.8212  0.7852
                      Ridge    0.7464     0.7357  0.7691    0.7520   0.8211  0.7852
     K-Neighbors Classifier    0.8601     0.8076  0.9455    0.8711   0.9534  0.9393
              Decision Tree    0.9950     0.9990  0.9909    0.9949   0.9999  0.9999
   Random Forest Classifier    0.9950     0.9970  0.9929    0.9950   0.9997  0.9997
              XGBClassifier    0.8699     0.9102  0.8209    0.8632   0.9476  0.9567
     CatBoosting Classifier    0.8784     0.9224  0.8263    0.8717   0.9518  0.9607
        AdaBoost Classifier    0.7991     0.7872  0.8197    0.8031   0.8877  0.8929
GradientBoosting Classifier    0.8386     0.8396  0.8370    0.8383   0.9254  0.9347


TEST SET PERFORMANCE
                      Model 

Observations of results:

* The top 3 models by ROC_AUC score were AdaBoost, Random Forest and Gradient Boosting.
* These 3 also have the best PR_AUC scores.
* Once again, Random Forest has the best F1 score.
* Recall is very high (0.99) for AdaBoost and Gradient Boosting, probably because the models are trying to capture the minority class.

* For the scaled and balanced data, Random Forest performs best overall and would be the model of choice for this dataset.

Define function for models that do not require scaling

In [25]:
def no_scale_classification(X_train, y_train, X_test, y_test):

    models = {
        "Decision Tree": DecisionTreeClassifier(),

        "Random Forest Classifier": RandomForestClassifier(),

        "XGBClassifier": XGBClassifier(), 
        
        "CatBoosting Classifier": CatBoostClassifier(verbose=False),

        "AdaBoost Classifier": AdaBoostClassifier(),

        "GradientBoosting Classifier": GradientBoostingClassifier()
    }
 
    # Lists to store results
    results_train = []
    results_test = []
    
    for model_name, model in models.items():
        # Train model
        model.fit(X_train, y_train)
        
        # Training predictions
        y_train_pred = model.predict(X_train)
        accuracy_train = accuracy_score(y_train, y_train_pred)
        train_precision, train_recall, train_f1, train_roc_auc, train_pr_auc = evaluate_model(
            y_train, y_train_pred, model, X_train
        )
        
        # Test predictions
        y_test_pred = model.predict(X_test)
        accuracy_test = accuracy_score(y_test, y_test_pred)
        test_precision, test_recall, test_f1, test_roc_auc, test_pr_auc = evaluate_model(
            y_test, y_test_pred, model, X_test
        )
        
        # Store training results
        results_train.append({
            'Model': model_name,
            'Accuracy': accuracy_train,
            'Precision': train_precision,
            'Recall': train_recall,
            'F1 Score': train_f1,
            'ROC AUC': train_roc_auc,
            'PR AUC': train_pr_auc
        })
        
        # Store test results
        results_test.append({
            'Model': model_name,
            'Accuracy': accuracy_test,
            'Precision': test_precision,
            'Recall': test_recall,
            'F1 Score': test_f1,
            'ROC AUC': test_roc_auc,
            'PR AUC': test_pr_auc
        })
    
    # Create DataFrames
    df_train = pd.DataFrame(results_train)
    df_test = pd.DataFrame(results_test)
    
    # Display tables
    print("=" * 120)
    print("TRAINING SET PERFORMANCE")
    print("=" * 120)
    print(df_train.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    print("\n")
    
    print("=" * 120)
    print("TEST SET PERFORMANCE")
    print("=" * 120)
    print(df_test.to_string(index=False, float_format=lambda x: f'{x:.4f}'))
    print("\n")
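
Since no_scale_classification repeats the body of classification with a shorter models dictionary, the shared logic could live in one function that takes the dictionary as an argument. A minimal sketch, assuming hypothetical helper names score_split and run_models and synthetic data (neither helper exists in this notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def score_split(model, X, y):
    """Compute the metric columns used in the result tables for one data split."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]
    return {
        "Accuracy": accuracy_score(y, pred),
        "Precision": precision_score(y, pred, zero_division=0),
        "Recall": recall_score(y, pred, zero_division=0),
        "F1 Score": f1_score(y, pred, zero_division=0),
        "ROC AUC": roc_auc_score(y, proba),
        "PR AUC": average_precision_score(y, proba),
    }


def run_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return (train, test) result DataFrames."""
    rows_train, rows_test = [], []
    for name, model in models.items():
        model.fit(X_train, y_train)
        rows_train.append({"Model": name, **score_split(model, X_train, y_train)})
        rows_test.append({"Model": name, **score_split(model, X_test, y_test)})
    return pd.DataFrame(rows_train), pd.DataFrame(rows_test)


# Synthetic stand-in data and a two-model dictionary for demonstration
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)
demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
df_train_demo, df_test_demo = run_models(demo_models, X_a, y_a, X_b, y_b)
print(df_test_demo.to_string(index=False, float_format=lambda x: f"{x:.4f}"))
```

Returning the DataFrames instead of only printing them also makes it possible to compare the four dataset variants programmatically.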
    

Load in the unscaled and unbalanced data

In [26]:
X_train=pd.read_csv('data/unscaled_unbalanced_X_train.csv')
y_train=pd.read_csv('data/y_train.csv')

Run the classification on the unscaled and unbalanced data

In [27]:
no_scale_classification(X_train, y_train, X_test, y_test)


TRAINING SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
              Decision Tree    0.9932     0.9987  0.9579    0.9779   0.9998  0.9985
   Random Forest Classifier    0.9931     0.9944  0.9619    0.9778   0.9992  0.9971
              XGBClassifier    0.8623     0.6709  0.2501    0.3644   0.8500  0.5381
     CatBoosting Classifier    0.8651     0.7001  0.2549    0.3737   0.8491  0.5535
        AdaBoost Classifier    0.8498     0.5623  0.2202    0.3165   0.8205  0.4502
GradientBoosting Classifier    0.8513     0.5808  0.2092    0.3076   0.8259  0.4633


TEST SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
              Decision Tree    0.7799     0.3140  0.3430    0.3279   0.5999  0.2119
   Random Forest Classifier    0.8429     0.4957  0.2179    0.3027   0.7951  0.3922
              XGBClassifier    0.8516     0.5698  0.2117    0.3087   0.8245  0.4500
     CatBoosting Classifier 

Observations on the unscaled vs scaled unbalanced data:

* All the boosting models (XGB, CatBoost, AdaBoost and GradientBoosting) have similar ROC AUC and PR AUC scores. Gradient Boosting has the best scores here. These scores are better with the unscaled data, and these models do not require feature scaling.
* Note that Gradient Boosting performs better than XGB, which builds upon gradient boosting. This suggests either that the simpler model is sufficient for this dataset or that XGB needs hyperparameter tuning.
* Precision is higher than recall for the unscaled data, whereas the opposite was true for the scaled data.

* Considering all the metrics, AdaBoost and Gradient Boosting perform similarly, but I would choose Gradient Boosting for the unscaled and unbalanced dataset.
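
The tables for this run also show near-perfect training scores for Decision Tree and Random Forest next to much lower test scores, which is the usual sign of overfitting (scikit-learn's trees grow to full depth by default). A small sketch that quantifies the gap, using only the ROC AUC values copied from the rows visible in the tables above:

```python
import pandas as pd

# ROC AUC values copied from the unscaled, unbalanced tables above
# (only the three models whose test rows are shown)
scores = pd.DataFrame({
    "Model": ["Decision Tree", "Random Forest Classifier", "XGBClassifier"],
    "Train ROC AUC": [0.9998, 0.9992, 0.8500],
    "Test ROC AUC": [0.5999, 0.7951, 0.8245],
})

# A large positive gap means the model scores much better on data it has seen
scores["Gap"] = (scores["Train ROC AUC"] - scores["Test ROC AUC"]).round(4)

print(scores.sort_values("Gap", ascending=False).to_string(index=False))
```

The Decision Tree loses about 0.40 ROC AUC between train and test, while XGB loses under 0.03, so the tree's training metrics say little about how it generalizes.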


Load in the unscaled and balanced data

In [28]:
X_train=pd.read_csv('data/unscaled_balanced_X_train.csv')
y_train=pd.read_csv('data/unscaled_balanced_y_train.csv')

Run the classification models on unscaled and balanced data

In [29]:
no_scale_classification(X_train, y_train, X_test, y_test)


TRAINING SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
              Decision Tree    0.9950     0.9990  0.9909    0.9949   0.9999  0.9999
   Random Forest Classifier    0.9950     0.9970  0.9929    0.9950   0.9997  0.9998
              XGBClassifier    0.8704     0.9107  0.8214    0.8637   0.9480  0.9572
     CatBoosting Classifier    0.8792     0.9226  0.8280    0.8727   0.9522  0.9610
        AdaBoost Classifier    0.8254     0.8219  0.8309    0.8264   0.9147  0.9252
GradientBoosting Classifier    0.8517     0.8693  0.8277    0.8480   0.9350  0.9451


TEST SET PERFORMANCE
                      Model  Accuracy  Precision  Recall  F1 Score  ROC AUC  PR AUC
              Decision Tree    0.7480     0.2959  0.4424    0.3546   0.6233  0.2200
   Random Forest Classifier    0.8209     0.4277  0.4266    0.4272   0.7973  0.3905
              XGBClassifier    0.8357     0.4734  0.4419    0.4571   0.8248  0.4525
     CatBoosting Classifier 

Observations of results:

* All the boosting models (XGB, CatBoost, AdaBoost and GradientBoosting) have similar ROC AUC and PR AUC scores. CatBoost has the best scores here. These scores are better with the unscaled data, and these models do not require feature scaling.

* The precision and recall scores across models are closer to each other for the balanced data than for the unbalanced data, which is to be expected.

* The F1 score is also better for the balanced dataset.

* Considering all the metrics, I would choose CatBoost for this dataset.

* Overall, precision and recall are more balanced and the F1 score is better across models for the unscaled and balanced dataset, so I would choose to balance the dataset.

* Overall, ROC_AUC and PR_AUC scores are better for the unscaled data, and the models that require scaling (logistic regression, k-neighbors, etc.) do not perform substantially better than the models that do not. So I will leave the data unscaled for the best model.
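
RandomizedSearchCV is imported at the top of this notebook but not used in this section. As a possible next step for the chosen model, here is a minimal sketch of a randomized hyperparameter search scored on PR AUC. GradientBoostingClassifier is used as a stand-in, the data is synthetic, and the parameter ranges are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Illustrative search space (assumed values, not tuned recommendations)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled parameter combinations
    scoring="average_precision",   # same quantity as the PR AUC column
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Because the score is cross-validated on the training data, the test set stays untouched until the final comparison.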