### Project 31 - Porto Seguro’s Safe Driver Prediction - Predict if a driver will file an insurance claim next year.
## Phase 3: Modeling and Error Analysis

**In this phase, we will use the following algorithms for our modeling:**
1. Naive Bayes Classifier (GaussianNB)
2. L1-regularized Logistic Regression
3. L2-regularized Logistic Regression
4. Elastic Net-regularized Logistic Regression
5. Decision Tree Classifier

**For model evaluation and error analysis, we will assess the models using:**
1. AUROC
2. Log Loss
3. Normalized Gini Coefficient

Import the libraries that we'll need

In [3]:
# Numpy and Pandas
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

# Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns
sns.set_style('darkgrid')

# Suppress future, user and deprecation warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

# import scikit learn
import sklearn
sklearn.set_config(print_changed_only=False)

# For train-test split
from sklearn.model_selection import train_test_split

# Feature Scaling using StandardScaler
from sklearn.preprocessing import StandardScaler

# Naive Bayes
from sklearn.naive_bayes import GaussianNB

# Import Logistic Regression
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Import Support Vector Classifier
from sklearn.svm import SVC, NuSVC

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

# Cross_validation
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Classification metrics
from sklearn.metrics import roc_curve, roc_auc_score, log_loss

# Import pickle
import pickle

Load our analytical base table (from the previous phase)

In [4]:
abt = pd.read_csv('analytical_base_table.csv')

In [5]:
print(abt.shape)

abt.head()

(595212, 39)


Unnamed: 0,ps_calc_02,ps_ind_12_bin,ps_ind_07_bin,ps_ind_15,ps_car_08_cat,ps_car_02_cat,ps_car_13,ps_car_07_cat,ps_car_04_cat,ps_car_14,ps_ind_02_cat,ps_ind_17_bin,ps_car_11,ps_ind_18_bin,ps_car_12,ps_ind_04_cat,ps_reg_03,ps_reg_01,ps_car_15,ps_calc_01,ps_car_09_cat,ps_ind_14,ps_ind_16_bin,ps_ind_05_cat,ps_ind_11_bin,ps_ind_01,ps_reg_02,ps_ind_08_bin,ps_car_10_cat,ps_car_03_cat,ps_car_05_cat,ps_car_01_cat,ps_car_06_cat,ps_calc_03,ps_ind_10_bin,ps_ind_06_bin,ps_ind_03,ps_ind_09_bin,target
0,0.5,0,1,11,0,1,0.883679,1,0,0.37081,2,1,2,0,0.4,1,0.71807,0.7,3.605551,0.6,0,0,0,0,0,2,0.2,0,1,-1,1,10,4,0.2,0,0,5,0,0
1,0.1,0,0,3,1,1,0.618817,1,0,0.388716,1,0,3,1,0.316228,0,0.766078,0.8,2.44949,0.3,2,0,0,0,0,1,0.4,1,1,-1,-1,11,11,0.3,0,0,7,0,0
2,0.7,0,0,12,1,1,0.641586,1,0,0.347275,4,0,1,0,0.316228,1,-1.0,0.0,3.316625,0.5,2,0,1,0,0,5,0.0,1,1,-1,-1,7,14,0.1,0,0,9,0,0
3,0.9,0,0,8,1,1,0.542949,1,0,0.294958,1,0,1,0,0.374166,0,0.580948,0.9,2.0,0.6,3,0,1,0,0,0,0.2,0,1,0,1,7,11,0.1,0,1,2,0,0
4,0.6,0,0,9,1,1,0.565832,1,0,0.365103,2,0,3,0,0.31607,1,0.840759,0.7,2.0,0.4,2,0,1,0,0,0,0.6,0,1,-1,-1,11,14,0.0,0,1,0,0,0


Separate the dataframe into separate objects for the input features (X) and the target variable (y)

In [6]:
X = abt.drop('target', axis=1)
y = abt.target

In [7]:
print(X.shape)
print(y.shape)

(595212, 38)
(595212,)


Split X and y into training and test sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   stratify=abt.target,
                                                   random_state=1234)

print(len(X_train), len(X_test), len(y_train), len(y_test))

476169 119043 476169 119043


In [9]:
y_train.value_counts()

0    458814
1     17355
Name: target, dtype: int64

In [10]:
y_test.value_counts()

0    114704
1      4339
Name: target, dtype: int64
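
The counts above imply that only about 3.6% of drivers in either split filed a claim. As a quick sanity check, the sketch below (plain Python, with the counts copied from the two outputs above; the helper names `positive_rate` and `majority_baseline_accuracy` are illustrative, not part of the notebook) computes the positive rate and the accuracy a model would reach by always predicting class 0.

```python
# Class counts copied from the y_train / y_test value_counts() outputs above
train_counts = {0: 458814, 1: 17355}
test_counts = {0: 114704, 1: 4339}

def positive_rate(counts):
    """Share of samples that belong to the positive class (1)."""
    return counts[1] / (counts[0] + counts[1])

def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())

for name, counts in [("train", train_counts), ("test", test_counts)]:
    print(name,
          "positive rate:", round(positive_rate(counts), 4),
          "| always-negative accuracy:", round(majority_baseline_accuracy(counts), 4))
```

Both splits come out at roughly 0.036 and 0.964, so the stratified split preserved the class ratio, and any accuracy figure near 0.964 should be read against this baseline.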

### Helper functions

**1. Function to calculate Gini coefficient and Normalized Gini Coefficient**

In [11]:
# Function to calculate Gini coefficient
def gini(actual, pred):
    if (len(actual) == len(pred)):
        # np.float was removed in NumPy 1.24; the builtin float is equivalent here
        data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
        # Sort by descending prediction, breaking ties by original row order
        data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
        totalLosses = data[:, 0].sum()
        giniSum = data[:, 0].cumsum().sum() / totalLosses

        giniSum -= (len(actual) + 1) / 2.
        return giniSum / len(actual)
    return 0

# Function to calculate Normalized Gini Coefficient
def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)
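
For a binary target and predictions without ties, the normalized Gini coefficient equals 2 * AUROC - 1. The sketch below re-implements the helper above in plain Python on a small made-up dataset to illustrate that relationship; `auroc` is an illustrative pairwise implementation and the toy values are assumptions, not notebook data. It also scores hard 0/1 labels, which is what the later cells pass to `gini_normalized`, to show that the result then depends on how ties are ordered.

```python
def gini(actual, pred):
    # Sort by descending prediction; ties keep the original row order
    order = sorted(range(len(actual)), key=lambda i: (-pred[i], i))
    total = sum(actual)
    running, cum_sum = 0.0, 0.0
    for i in order:
        running += actual[i]
        cum_sum += running
    n = len(actual)
    return (cum_sum / total - (n + 1) / 2.0) / n

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def auroc(actual, pred):
    # Probability that a random positive is scored above a random negative
    pos = [p for a, p in zip(actual, pred) if a == 1]
    neg = [p for a, p in zip(actual, pred) if a == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and probabilities (illustrative only, no tied probabilities)
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.30, 0.25]

print("AUROC:", auroc(y_true, y_prob))
print("Normalized Gini on probabilities:", gini_normalized(y_true, y_prob))
print("2 * AUROC - 1:", 2 * auroc(y_true, y_prob) - 1)

# Hard labels introduce many ties, so the score depends on row order
y_hard = [1 if p >= 0.25 else 0 for p in y_prob]
print("Normalized Gini on hard labels:", gini_normalized(y_true, y_hard))
```

If this relationship is what is wanted, passing the predicted probabilities rather than the thresholded classes to `gini_normalized` would keep the two metrics consistent.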

**2. Function to fit a RandomizedSearchCV model of specified algorithm and display its cross-validated score**

In [12]:
# Function to fit a RandomizedSearchCV model of specified algorithm and display its cross-validated score
def fit_model_and_display_score(name, X_train, y_train, cv_val):
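    # Create cross-validation object from pipeline and hyperparameters.
    # No `scoring` is passed, so candidates are ranked by accuracy (the classifier default),
    # and n_iter keeps its default of 10 sampled candidates.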
    model = RandomizedSearchCV(estimator=pipelines[name],
                    param_distributions=hyperparameters[name],
                    cv=cv_val,
                    n_jobs=-1,
                    random_state=123,
                    verbose=1)
    
    # Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # Print '{name} has been fitted'
    print(name, 'has been fitted')
      
    # Best score and params
    print("Best CV score for "+ name + ": ") 
    print( np.round(model.best_score_, 3) )
    print("Best parameters for " + name + ": ") 
    print( model.best_params_ )
    
    # return the model
    return model
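
The helper above leaves `scoring` unset, so scikit-learn falls back to the estimator's own `score` method, which is accuracy for classifiers. On a heavily imbalanced target, a ranking metric can be more informative for model selection. The following is a minimal sketch on synthetic data (the dataset, the grid and the names `X_demo`, `y_demo`, `search` are assumptions for illustration) showing how `scoring='roc_auc'` could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic, imbalanced dataset (about 5% positives) standing in for the real data
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=123)

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(solver='liblinear', random_state=123))
param_distributions = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

# scoring='roc_auc' ranks candidates by cross-validated AUROC instead of accuracy
search = RandomizedSearchCV(estimator=pipe,
                            param_distributions=param_distributions,
                            n_iter=5,
                            scoring='roc_auc',
                            cv=5,
                            random_state=123)
search.fit(X_demo, y_demo)

print("Best CV AUROC:", round(search.best_score_, 3))
print("Best parameters:", search.best_params_)
```

With this change `best_score_` would be directly comparable to the AUROC reported on the test set, at the cost of slightly slower scoring.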

**3. Function to display Confusion Matrix, TPR and FPR**

In [13]:
# Function to display Confusion Matrix, TPR and FPR
def display_model_cm_fpt_tpr(y_test, pred):
    # Confusion Matrix
    print("Confusion Matrix:")
    cm = confusion_matrix (y_test, pred)
    print( cm )
    print()
    # True Positives (TP)
    tp = cm[1][1]

    # False Positives (FP)
    fp = cm[0][1]

    # True Negatives (TN)
    tn = cm[0][0]

    # False Negatives (FN)
    fn = cm[1][0]
    
    #TPR
    true_positive_rate = tp/(tp+fn)
    print( 'TPR:', np.round(true_positive_rate, 3) )
    
    #FPR
    false_positive_rate = fp/(tn+fp)
    print( 'FPR:', np.round(false_positive_rate, 3) )

**4. Function to determine best threshold for classification models**

NOTE: The code below for the G-Mean and threshold selection is partially adapted from *https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/*

In [14]:
# Function to determine best threshold for classification models
def find_best_classification_threshold(y_test, pred_prob):
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(y_test, pred_prob)

    # calculate the geometric mean for each threshold
    gmeans = np.sqrt(tpr * (1-fpr))
    
    # locate the index of the largest g-mean
    ix = np.argmax(gmeans)
    print('Best Threshold=%.3f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
    
    #return best threshold
    return thresholds[ix]
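
To make the selection rule concrete, here is a plain-Python sketch of the same idea on a small made-up example: for each candidate threshold, compute TPR and FPR, then keep the threshold with the largest G-Mean = sqrt(TPR * (1 - FPR)). The function name `best_gmean_threshold` and the toy data are illustrative assumptions; the notebook's helper relies on `roc_curve` instead of this explicit loop.

```python
from math import sqrt

def best_gmean_threshold(y_true, y_prob):
    """Return (threshold, g_mean) maximising sqrt(TPR * (1 - FPR))."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_thr, best_g = None, -1.0
    # Every distinct predicted probability is a candidate threshold
    for thr in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= thr and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= thr and y == 0)
        g = sqrt((tp / positives) * (1 - fp / negatives))
        if g > best_g:
            best_thr, best_g = thr, g
    return best_thr, best_g

# Toy labels and probabilities (illustrative only)
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.30, 0.25]

thr, g = best_gmean_threshold(y_true, y_prob)
print("Best threshold:", thr, "G-Mean:", round(g, 3))
```

In practice the threshold would usually be chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.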

**5. Function to save a fitted model to disk**

In [15]:
# Function to save a fitted model to disk
def save_model_to_disk(model, filename):
    # Use a context manager so the file handle is closed after writing
    with open(filename, 'wb') as f:
        pickle.dump(model, f)
    print("Model saved to disk as", filename)
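
A saved model is only useful if it can be read back. The sketch below shows a possible counterpart, `load_model_from_disk` (a hypothetical helper, not part of the notebook), and round-trips a small stand-in object through a temporary file. In the notebook the object would be a fitted `RandomizedSearchCV`.

```python
import os
import pickle
import tempfile

def load_model_from_disk(filename):
    """Hypothetical counterpart to save_model_to_disk: read a pickled object back."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Stand-in object; in the notebook this would be a fitted search object
dummy_model = {"name": "nb", "best_params": {"gaussiannb__var_smoothing": 1.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dummy_model.sav")
    with open(path, 'wb') as f:
        pickle.dump(dummy_model, f)
    restored = load_model_from_disk(path)

print("Round trip preserved the object:", restored == dummy_model)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.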

## Build the model Pipeline

NOTE: We will add more algorithms to the pipelines dictionary as needed

In [16]:
# Create the pipelines dictionary
pipelines = {
    'nb': make_pipeline(StandardScaler(), GaussianNB()),
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', solver = 'liblinear', random_state=123)),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', solver = 'liblinear', random_state=123)),
}

In [17]:
# Create hyperparameters dictionary
hyperparameters = {}

Create an empty dictionary called fitted_models to hold the models that have been tuned using cross-validation

In [18]:
fitted_models = {}

### 1. Naive Bayes (Our Baseline Model)

Tunable hyperparameters of Naive Bayes:

In [19]:
# List tuneable hyperparameters of our Naive Bayes pipeline
pipelines['nb'].get_params()

{'memory': None,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('gaussiannb', GaussianNB(priors=None, var_smoothing=1e-09))],
 'verbose': False,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'gaussiannb': GaussianNB(priors=None, var_smoothing=1e-09),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'gaussiannb__priors': None,
 'gaussiannb__var_smoothing': 1e-09}

For Naive Bayes, the main tunable hyperparameter is **var_smoothing**.

Declare the parameter grid for **nb (Gaussian Naive Bayes Classifier)**

In [20]:
# Gaussian Naive Bayes hyperparameters
nb_hyperparameters = {
    'gaussiannb__var_smoothing': np.logspace(0,-9, num=100)
}
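
The grid above spans nine orders of magnitude on a log scale. A short NumPy check of what `np.logspace(0, -9, num=100)` produces (the variable name `grid` is illustrative):

```python
import numpy as np

grid = np.logspace(0, -9, num=100)

print("Number of candidates:", len(grid))
print("Largest and smallest:", grid[0], grid[-1])
# On a log scale, neighbouring values differ by a constant factor
print("Ratio between neighbours:", grid[1] / grid[0])
```

Because `RandomizedSearchCV` is left at its default of `n_iter=10`, only 10 of these 100 values are sampled and evaluated, which matches the "10 candidates" reported in the fit log.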

Add the grid to the hyperparameters dictionary. We will extend this dictionary as we apply more algorithms

In [21]:
# Create or Update hyperparameters dictionary
hyperparameters['nb'] = nb_hyperparameters

**Fit the NaiveBayes model, and save it into fitted_models dictionary**

In [22]:
nb_model = fit_model_and_display_score(name='nb', X_train=X_train, y_train=y_train, cv_val=10)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
nb has been fitted
Best CV score for nb: 
0.946
Best parameters for nb: 
{'gaussiannb__var_smoothing': 1.0}


We find that the best GaussianNB estimator uses **var_smoothing=1.0**, the largest value in the grid

In [23]:
# Store model in fitted_models[name] 
fitted_models['nb'] = nb_model

In [24]:
# Save the Gaussian Naive Bayes model to disk
save_model_to_disk(nb_model, 'nb_model.sav')

Model saved to disk as nb_model.sav


Get the predicted classes from our Gaussian Naive Bayes model

In [25]:
# Predict classes using our fitted Gaussian Naive Bayes model
nb_pred = fitted_models['nb'].predict(X_test)

Check the Confusion Matrix, TPR and FPR

In [26]:
display_model_cm_fpt_tpr(y_test, nb_pred)

Confusion Matrix:
[[112354   2350]
 [  4184    155]]

TPR: 0.036
FPR: 0.02
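
The rates above follow directly from the four cells of the confusion matrix. As a worked check in plain Python, with the numbers copied from the output above (the accuracy line is an addition for context):

```python
# Confusion matrix copied from the Naive Bayes output above: [[TN, FP], [FN, TP]]
cm = [[112354, 2350],
      [4184, 155]]

tn, fp = cm[0]
fn, tp = cm[1]

tpr = tp / (tp + fn)                         # recall on the positive class
fpr = fp / (fp + tn)                         # share of negatives flagged as positive
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall share of correct predictions

print("TPR:", round(tpr, 3))
print("FPR:", round(fpr, 3))
print("Accuracy:", round(accuracy, 3))
```

The test accuracy of about 0.945 is lower than the roughly 0.964 that always predicting the negative class would give (114704 / 119043), which is why accuracy alone is a weak yardstick here.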


**Let's check the AUROC, Log Loss and Normalized Gini Coefficient:**

In [27]:
# Predict PROBABILITIES using fitted gaussian Naive Bayes
nb_pred_prob = fitted_models['nb'].predict_proba(X_test)

# Get JUST the PREDICTION PROBABILITY for positive class
nb_pred_prob = [ p[1] for p in nb_pred_prob ]

In [28]:
# Calculate AUROC
print("For Naive Bayes model:")
print( "AUROC: ",np.round(roc_auc_score(y_test, nb_pred_prob),3) )

# Calculate Log-Loss
log_loss_nb = np.round(log_loss(y_test, nb_pred_prob), 3)
print("Log Loss: ", log_loss_nb)

# Calculate Normalized Gini Coefficient
# NOTE: this scores the hard class predictions (nb_pred), not the probabilities
print("Normalized Gini Coefficient: ", np.round(gini_normalized(y_test, nb_pred),3) )

For Naive Bayes model:
AUROC:  0.613
Log Loss:  0.279
Normalized Gini Coefficient:  0.019


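**Summary of the Naive Bayes model:**
1. Cross-validated accuracy of 0.946 (accuracy is the default score of RandomizedSearchCV for classifiers); always predicting the negative class would already reach about 0.964 on the training set (458814 / 476169)
2. TPR = 0.036, FPR = 0.02
3. AUROC = 0.613
4. Log Loss = 0.279
5. Normalized Gini Coefficient = 0.019 (computed on the predicted classes)

Let's treat these as our baseline scores and move on to the other models.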
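
To put a log loss value in context, a useful reference is the log loss of a constant prediction equal to the positive rate of the test set. The sketch below computes it in plain Python from the test-set class counts shown earlier; the variable names are illustrative.

```python
from math import log

# Test-set class counts copied from the y_test.value_counts() output above
n_negative, n_positive = 114704, 4339
base_rate = n_positive / (n_negative + n_positive)

# Log loss when every driver receives the same predicted probability `base_rate`
constant_log_loss = -(base_rate * log(base_rate)
                      + (1 - base_rate) * log(1 - base_rate))

print("Base rate:", round(base_rate, 4))
print("Log loss of a constant prediction:", round(constant_log_loss, 3))
```

This reference comes out at about 0.156. The Naive Bayes log loss of 0.279 is higher than that, which suggests its predicted probabilities are poorly calibrated even though its AUROC is above 0.5.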

### 2. L1-regularized Logistic Regression

In [29]:
# List tuneable hyperparameters of Logistic Regression pipeline
pipelines['l1'].get_params()

{'memory': None,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('logisticregression',
   LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                      intercept_scaling=1, l1_ratio=None, max_iter=100,
                      multi_class='auto', n_jobs=None, penalty='l1',
                      random_state=123, solver='liblinear', tol=0.0001, verbose=0,
                      warm_start=False))],
 'verbose': False,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l1',
                    random_state=123, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,

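For **regularized logistic regression**, the most impactful hyperparameter is the **strength of the penalty.** In scikit-learn this is controlled through `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

Declare the hyperparameter grid for l1 (L1-regularized logistic regression) with values of `C` between 0.001 and 1000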

In [30]:
# L1-regularized Logistic Regression hyperparameters
l1_hyperparameters = {
    'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]
}
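
Because `C` is the inverse of the regularization strength, the small values in this grid apply the strongest L1 penalty and push more coefficients to exactly zero. A minimal sketch on synthetic data (the dataset and the three `C` values are assumptions chosen for illustration) shows the effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features: 20 columns, 5 informative
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=123)
X_demo = StandardScaler().fit_transform(X_demo)

nonzero_counts = {}
for C in [0.001, 0.1, 100]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=123)
    clf.fit(X_demo, y_demo)
    # Count the coefficients that the L1 penalty did not shrink to zero
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_))
    print(f"C={C}: {nonzero_counts[C]} non-zero coefficients out of {clf.coef_.size}")
```

The exact counts depend on the data, but the smallest `C` should keep the fewest features.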

In [31]:
# Add to the existing hyperparameters dictionary
hyperparameters['l1'] = l1_hyperparameters

#### Fit the L1-Logistic regression model, and save it into fitted_models dictionary

In [32]:
l1_model = fit_model_and_display_score(name='l1', X_train=X_train, y_train=y_train, cv_val=10)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
l1 has been fitted
Best CV score for l1: 
0.964
Best parameters for l1: 
{'logisticregression__C': 5}


We find that **Logistic Regression with C=5** is our L1-Logistic Regression's best estimator, with a **cross-validated accuracy of 0.964**

In [33]:
# Store model in fitted_models[name] 
fitted_models['l1'] = l1_model

In [34]:
# Save the L1 Logistic Regression model to disk
save_model_to_disk(l1_model, 'l1_model.sav')

Model saved to disk as l1_model.sav


Get the **predicted classes** from our L1-regularized Logistic Regression model

In [35]:
# Predict classes using our fitted L1-regularized Logistic Regression model
l1_pred = fitted_models['l1'].predict(X_test)

Check the **Confusion Matrix, TPR and FPR**

In [36]:
display_model_cm_fpt_tpr(y_test, l1_pred)

Confusion Matrix:
[[114704      0]
 [  4339      0]]

TPR: 0.0
FPR: 0.0


**TPR and FPR are both zero**, i.e., **the fitted L1-regularized Logistic Regression predicts only the negative class** at the default threshold of 0.5.

Let's find the best threshold for the L1 Logistic Regression model (using the G-Mean).

In [37]:
# Predict PROBABILITIES using fitted L1-Logistic Regression
l1_pred_prob = fitted_models['l1'].predict_proba(X_test)

# Get JUST the PREDICTION PROBABILITY for positive class
l1_pred_prob = [ p[1] for p in l1_pred_prob ]

In [38]:
# Best threshold for L1 Logistic Regression model
l1_threshold = np.round( find_best_classification_threshold(y_test, l1_pred_prob), 3 )

Best Threshold=0.035, G-Mean=0.585


Get the updated predictions with the **best threshold=0.035**

In [39]:
# Reuse the rounded threshold found above (0.035) instead of hard-coding it
l1_pred = (fitted_models['l1'].predict_proba(X_test)[:,1] >= l1_threshold).astype(int)

# Display confusion matrix for y_test and pred
display_model_cm_fpt_tpr(y_test, l1_pred)

Confusion Matrix:
[[66964 47740]
 [ 1797  2542]]

TPR: 0.586
FPR: 0.416
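
As a cross-check, the G-Mean reported by the threshold search can be recomputed from this confusion matrix. The sketch below does so in plain Python with the numbers copied from the output above; precision is added as an extra, to show how many flagged drivers are actual claimants.

```python
from math import sqrt

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
tn, fp = 66964, 47740
fn, tp = 1797, 2542

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
g_mean = sqrt(tpr * (1 - fpr))
precision = tp / (tp + fp)   # share of predicted positives that are true positives

print("G-Mean:", round(g_mean, 3))
print("Precision:", round(precision, 3))
```

The G-Mean comes out at about 0.585, in line with the value printed by the threshold search. Precision is only about 0.05, so at this threshold roughly 19 out of 20 flagged drivers did not file a claim.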


Let's check the **AUROC, Log Loss and Normalized Gini Coefficient**  of the L1-regularized Logistic regression model

In [40]:
# Calculate AUROC
print("For L1-regularized Logistic Regression model:")
print( "AUROC: ",np.round(roc_auc_score(y_test, l1_pred_prob),3) )

# Calculate Log-Loss
log_loss_l1 = np.round(log_loss(y_test, l1_pred_prob), 3)
print("Log Loss: ", log_loss_l1)

# Calculate Normalized Gini Coefficient (on the thresholded class predictions)
print("Normalized Gini Coefficient: ", np.round(gini_normalized(y_test, l1_pred),3) )

For L1-regularized Logistic Regression model:
AUROC:  0.617
Log Loss:  0.153
Normalized Gini Coefficient:  0.172


**Summary of the L1-regularized Logistic regression model:**
1. Cross-validated accuracy of 0.964, which equals the share of the negative class in the training set (458814 / 476169), so it says little on its own
2. True Positive Rate = 0.586, False Positive Rate = 0.416 (at the tuned threshold of 0.035)
3. AUROC of 0.617 (a slight improvement over our Naive Bayes model)
4. Log Loss of 0.153 (better than our Naive Bayes model)
5. Normalized Gini Coefficient = 0.172 (much higher than the Naive Bayes value, which was computed at the default 0.5 threshold)

So, we see improved scores compared to the Gaussian Naive Bayes model.

### 3. L2-regularized Logistic regression

In [41]:
# L2-regularized Logistic Regression hyperparameters
l2_hyperparameters = {
    'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]
}

In [42]:
# Add to the existing hyperparameters dictionary
hyperparameters['l2'] = l2_hyperparameters

#### Fit the L2-Logistic regression model, and save it into fitted_models dictionary

In [43]:
l2_model = fit_model_and_display_score(name='l2', X_train=X_train, y_train=y_train, cv_val=10)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
l2 has been fitted
Best CV score for l2: 
0.964
Best parameters for l2: 
{'logisticregression__C': 5}


We find that **Logistic Regression with C=5** is our L2-Logistic Regression's best estimator, with a **cross-validated accuracy of 0.964**

In [44]:
# Store model in fitted_models[name] 
fitted_models['l2'] = l2_model

In [45]:
# Save the L2 Logistic Regression model to disk
save_model_to_disk(l2_model, 'l2_model.sav')

Model saved to disk as l2_model.sav


Get the **predicted classes** from our L2-regularized Logistic Regression model

In [46]:
# Predict classes using our fitted L2-regularized Logistic Regression model
l2_pred = fitted_models['l2'].predict(X_test)

Check the **Confusion Matrix, TPR and FPR**

In [47]:
display_model_cm_fpt_tpr(y_test, l2_pred)

Confusion Matrix:
[[114704      0]
 [  4339      0]]

TPR: 0.0
FPR: 0.0


**TPR and FPR are both zero**, i.e., **the fitted L2-regularized Logistic Regression predicts only the negative class** at the default threshold of 0.5.

Let's find the best threshold for the L2 Logistic Regression model (using the G-Mean).

In [48]:
# Predict PROBABILITIES using fitted L2-Logistic Regression
l2_pred_prob = fitted_models['l2'].predict_proba(X_test)

# Get JUST the PREDICTION PROBABILITY for positive class
l2_pred_prob = [ p[1] for p in l2_pred_prob ]

In [49]:
# Best threshold for L2 Logistic Regression model
l2_threshold = np.round( find_best_classification_threshold(y_test, l2_pred_prob), 3 )

Best Threshold=0.035, G-Mean=0.585


Get the updated predictions with the **best threshold=0.035**

In [50]:
# Reuse the rounded threshold found above (0.035) instead of hard-coding it
l2_pred = (fitted_models['l2'].predict_proba(X_test)[:,1] >= l2_threshold).astype(int)

# Display confusion matrix for y_test and pred
display_model_cm_fpt_tpr(y_test, l2_pred)

Confusion Matrix:
[[66963 47741]
 [ 1797  2542]]

TPR: 0.586
FPR: 0.416


Let's check the **AUROC, Log Loss and Normalized Gini Coefficient**  of the L2-regularized Logistic regression model

In [51]:
# Calculate AUROC
print("For L2-regularized Logistic Regression model:")
print( "AUROC: ",np.round(roc_auc_score(y_test, l2_pred_prob),3) )

# Calculate Log-Loss
log_loss_l2 = np.round(log_loss(y_test, l2_pred_prob), 3)
print("Log Loss: ", log_loss_l2)

# Calculate Normalized Gini Coefficient (on the thresholded class predictions)
print("Normalized Gini Coefficient: ", np.round(gini_normalized(y_test, l2_pred),3) )

For L2-regularized Logistic Regression model:
AUROC:  0.617
Log Loss:  0.153
Normalized Gini Coefficient:  0.172


**Summary of the L2-regularized Logistic regression model:**
1. Cross-validated accuracy of 0.964 (same as L1-Logistic Regression)
2. True Positive Rate = 0.586, False Positive Rate = 0.416 (same as L1-Logistic Regression)
3. AUROC of 0.617 (same as L1-Logistic Regression)
4. Log Loss of 0.153 (same as L1-Logistic Regression)
5. Normalized Gini Coefficient = 0.172 (same as L1-Logistic Regression)

So, the performance of our L2-regularized Logistic Regression model is practically identical to that of our L1-regularized Logistic Regression model; the two thresholded confusion matrices differ by a single prediction.

### 4. ElasticNet-regularized Logistic Regression

In [52]:
# Update the pipelines dictionary
pipelines['enet'] = make_pipeline(StandardScaler(), 
                                  LogisticRegression(penalty='elasticnet', solver = 'saga', random_state=123))

In [53]:
# Elastic Net - Logistic Regression hyperparameters
enet_hyperparameters = {
    'logisticregression__C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000],
    'logisticregression__l1_ratio': [.1, .5, .7,.9]
}
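
With two hyperparameters the search space is the Cartesian product of the two lists. The sketch below counts the combinations in plain Python and draws 10 of them at random, mirroring the default `n_iter=10` of `RandomizedSearchCV`; the sampling code is an illustrative stand-in, not scikit-learn's own sampler.

```python
import random
from itertools import product

C_values = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]
l1_ratios = [0.1, 0.5, 0.7, 0.9]

# Every (C, l1_ratio) pair that a full grid search would evaluate
all_combinations = list(product(C_values, l1_ratios))
print("Total combinations:", len(all_combinations))

# Illustrative stand-in for the random sampling step (without replacement)
rng = random.Random(123)
sampled = rng.sample(all_combinations, k=10)
print("Share of the grid tried:", round(len(sampled) / len(all_combinations), 2))
```

Only about a fifth of the 52 combinations are tried, so the reported best parameters are the best among the sampled candidates rather than the best in the whole grid.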

In [54]:
# Add to the existing hyperparameters dictionary
hyperparameters['enet'] = enet_hyperparameters

**Fit the ElasticNet-Logistic regression model, and save it into fitted_models dictionary**

In [55]:
enet_model = fit_model_and_display_score(name='enet', X_train=X_train, y_train=y_train, cv_val=10)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
enet has been fitted
Best CV score for enet: 
0.964
Best parameters for enet: 
{'logisticregression__l1_ratio': 0.1, 'logisticregression__C': 1000}


We find that **Elastic Net Logistic Regression with C=1000 and l1_ratio=0.1** is our Elastic Net-Logistic Regression's best estimator, with a **cross-validated accuracy of 0.964**

In [56]:
# Store model in fitted_models[name] 
fitted_models['enet'] = enet_model

In [57]:
# Save the Elastic Net Logistic Regression model to disk
save_model_to_disk(enet_model, 'enet_model.sav')

Model saved to disk as enet_model.sav


Get the **predicted classes** from our ElasticNet-regularized Logistic Regression model

In [58]:
# Predict classes using our fitted ElasticNet-regularized Logistic Regression model
enet_pred = fitted_models['enet'].predict(X_test)

Check the **Confusion Matrix, TPR and FPR**

In [59]:
display_model_cm_fpt_tpr(y_test, enet_pred)

Confusion Matrix:
[[114704      0]
 [  4339      0]]

TPR: 0.0
FPR: 0.0


**TPR and FPR are both zero**, i.e., **the fitted Elastic Net-regularized Logistic Regression predicts only the negative class** at the default threshold of 0.5.

Let's find the best threshold for the Elastic Net Logistic Regression model (using the G-Mean).

In [60]:
# Predict PROBABILITIES using fitted Elastic Net-Logistic Regression
enet_pred_prob = fitted_models['enet'].predict_proba(X_test)

# Get JUST the PREDICTION PROBABILITY for positive class
enet_pred_prob = [ p[1] for p in enet_pred_prob ]

In [61]:
# Best threshold for Elastic Net Logistic Regression model
enet_threshold = np.round( find_best_classification_threshold(y_test, enet_pred_prob), 3 )

Best Threshold=0.035, G-Mean=0.585


Get the updated predictions with the **best threshold=0.035**

In [62]:
# Reuse the rounded threshold found above (0.035) instead of hard-coding it
enet_pred = (fitted_models['enet'].predict_proba(X_test)[:,1] >= enet_threshold).astype(int)

# Display confusion matrix for y_test and pred
display_model_cm_fpt_tpr(y_test, enet_pred)

Confusion Matrix:
[[66964 47740]
 [ 1797  2542]]

TPR: 0.586
FPR: 0.416


Let's check the **AUROC, Log Loss and Normalized Gini Coefficient**  of the Elastic Net-regularized Logistic regression model

In [63]:
# Calculate AUROC
print("For Elastic Net-regularized Logistic Regression model:")
print( "AUROC: ",np.round(roc_auc_score(y_test, enet_pred_prob),3) )

# Calculate Log-Loss
log_loss_enet = np.round(log_loss(y_test, enet_pred_prob), 3)
print("Log Loss: ", log_loss_enet)

# Calculate Normalized Gini Coefficient (on the thresholded class predictions)
print("Normalized Gini Coefficient: ", np.round(gini_normalized(y_test, enet_pred),3) )

For Elastic Net-regularized Logistic Regression model:
AUROC:  0.617
Log Loss:  0.153
Normalized Gini Coefficient:  0.172


**Summary of the Elastic Net-regularized Logistic regression model:**
1. Cross-validated accuracy of 0.964 (same as L1 and L2-Logistic Regression)
2. True Positive Rate = 0.586, False Positive Rate = 0.416 (same as L1 and L2-Logistic Regression)
3. AUROC of 0.617 (same as L1 and L2-Logistic Regression)
4. Log Loss of 0.153 (same as L1 and L2-Logistic Regression)
5. Normalized Gini Coefficient = 0.172 (same as L1 and L2-Logistic Regression)

Hence, we conclude that **all the Logistic Regression models (L1, L2 and Elastic Net) perform essentially the same on these metrics.**

### 5. Decision Tree Classifier

In [64]:
# Update the pipelines dictionary
pipelines['dt'] = make_pipeline(StandardScaler(), 
                                  DecisionTreeClassifier(splitter='best', random_state=123))

In [65]:
# List tuneable hyperparameters of our Decision Tree Classifier pipeline
pipelines['dt'].get_params()

{'memory': None,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('decisiontreeclassifier',
   DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                          max_depth=None, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          random_state=123, splitter='best'))],
 'verbose': False,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'decisiontreeclassifier': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        random_state=123, splitter='best'),
 'st

In [66]:
# DecisionTree Classifier hyperparameters
dt_hyperparameters = {
    "decisiontreeclassifier__max_depth": [2,4,6,8,10],
    "decisiontreeclassifier__max_features": [5, 10],
    "decisiontreeclassifier__min_samples_leaf": [10, 20, 30],
    "decisiontreeclassifier__criterion": ["gini", "entropy"]
}

In [67]:
# Add to the existing hyperparameters dictionary
hyperparameters['dt'] = dt_hyperparameters

**Fit the DecisionTree Classification model, and save it into fitted_models dictionary**

In [68]:
dt_model = fit_model_and_display_score(name='dt', X_train=X_train, y_train=y_train, cv_val=10)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
dt has been fitted
Best CV score for dt: 
0.964
Best parameters for dt: 
{'decisiontreeclassifier__min_samples_leaf': 10, 'decisiontreeclassifier__max_features': 10, 'decisiontreeclassifier__max_depth': 6, 'decisiontreeclassifier__criterion': 'gini'}


We find that the **DecisionTree Classifier with min_samples_leaf=10, max_features=10, max_depth=6 and criterion='gini'** is our best Decision Tree estimator, with a **cross-validated accuracy of 0.964**

In [69]:
# Store model in fitted_models[name] 
fitted_models['dt'] = dt_model

In [70]:
# Save the Decision Tree Classifier model to disk
save_model_to_disk(dt_model, 'dt_model.sav')

Model saved to disk as dt_model.sav


Get the **predicted classes** from our Decision Tree Classifier model

In [71]:
# Predict classes using our fitted Decision Tree Classifier model
dt_pred = fitted_models['dt'].predict(X_test)

Check the **Confusion Matrix, TPR and FPR**

In [72]:
display_model_cm_fpt_tpr(y_test, dt_pred)

Confusion Matrix:
[[114701      3]
 [  4338      1]]

TPR: 0.0
FPR: 0.0


**TPR and FPR both round to zero** at the default threshold of 0.5 (only 4 of the 119043 test rows are predicted positive).

Let's find the best threshold for the Decision Tree Classifier model (using the G-Mean).

In [73]:
# Predict PROBABILITIES using fitted DecisionTree Classifier
dt_pred_prob = fitted_models['dt'].predict_proba(X_test)

# Get JUST the PREDICTION PROBABILITY for positive class
dt_pred_prob = [ p[1] for p in dt_pred_prob ]

In [74]:
# Best threshold for DecisionTree Classifier model
dt_threshold = np.round( find_best_classification_threshold(y_test, dt_pred_prob), 3 )

Best Threshold=0.037, G-Mean=0.567


Get the updated predictions with the **best threshold=0.037**

In [75]:
# Reuse the rounded threshold found above (0.037) instead of hard-coding it
dt_pred = (fitted_models['dt'].predict_proba(X_test)[:,1] >= dt_threshold).astype(int)

# Display confusion matrix for y_test and pred
display_model_cm_fpt_tpr(y_test, dt_pred)

Confusion Matrix:
[[67574 47130]
 [ 1976  2363]]

TPR: 0.545
FPR: 0.411


Let's check the **AUROC, Log Loss and Normalized Gini Coefficient**  of the Decision Tree Classifier model

In [76]:
# Calculate AUROC
print("For Decision Tree Classifier model:")
print( "AUROC: ",np.round(roc_auc_score(y_test, dt_pred_prob),3) )

# Calculate Log-Loss
log_loss_dt = np.round(log_loss(y_test, dt_pred_prob), 3)
print("Log Loss: ", log_loss_dt)

# Calculate Normalized Gini Coefficient (on the thresholded class predictions)
print("Normalized Gini Coefficient: ", np.round(gini_normalized(y_test, dt_pred),3) )

For Decision Tree Classifier model:
AUROC:  0.595
Log Loss:  0.155
Normalized Gini Coefficient:  0.136


**Summary of the Decision Tree Classifier model:**
1. Cross-validated accuracy of 0.964 (same as the Logistic Regression models)
2. True Positive Rate = 0.545, False Positive Rate = 0.411 (a lower TPR than the Logistic Regression models at a similar FPR)
3. AUROC of 0.595 (lower than the Logistic Regression models)
4. Log Loss of 0.155 (slightly worse than the Logistic Regression models)
5. Normalized Gini Coefficient = 0.136 (lower than the Logistic Regression models)

Hence, we conclude that the **Decision Tree Classifier's evaluation metrics are slightly worse than those of the Logistic Regression models (L1, L2 and Elastic Net)**

## Summary

As part of model building, we built the following models:
1. Naive Bayes Classifier (GaussianNB)
2. L1-regularized Logistic Regression
3. L2-regularized Logistic Regression
4. Elastic Net-regularized Logistic Regression
5. Decision Tree Classifier

As part of model evaluation and error analysis, we evaluated their:
1. AUROC
2. Log Loss
3. Normalized Gini Coefficient

We found that all the **Logistic Regression models (L1, L2 and Elastic-Net) performed the best** and had the same metric values:
* AUROC of 0.617
* Log Loss of 0.153
* Normalized Gini Coefficient = 0.172
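
For a side-by-side view, the test-set metrics reported in the per-model summaries can be collected and ranked in a few lines of plain Python. The values below are copied from those summaries; the ranking rule (AUROC first, then Log Loss) is an assumption made for this sketch.

```python
# Test-set metrics copied from the per-model summaries above
results = {
    "nb":   {"auroc": 0.613, "log_loss": 0.279, "gini": 0.019},
    "l1":   {"auroc": 0.617, "log_loss": 0.153, "gini": 0.172},
    "l2":   {"auroc": 0.617, "log_loss": 0.153, "gini": 0.172},
    "enet": {"auroc": 0.617, "log_loss": 0.153, "gini": 0.172},
    "dt":   {"auroc": 0.595, "log_loss": 0.155, "gini": 0.136},
}

# Rank by AUROC (higher is better), breaking ties by Log Loss (lower is better)
ranking = sorted(results, key=lambda m: (-results[m]["auroc"], results[m]["log_loss"]))

print(f"{'model':<6}{'AUROC':>8}{'LogLoss':>10}{'Gini':>8}")
for name in ranking:
    r = results[name]
    print(f"{name:<6}{r['auroc']:>8.3f}{r['log_loss']:>10.3f}{r['gini']:>8.3f}")
```

Note that the Gini values were computed on class predictions at different thresholds (0.5 for Naive Bayes, tuned thresholds for the others), so that column is less directly comparable than AUROC or Log Loss.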

**NOTE:** In the next phase, we will cover more advanced modeling techniques such as Random Forests, Gradient Boosted Trees, XGBoost Classifier and Multi-Layer Perceptron.
We will try to match the Normalized Gini Coefficient score of **0.297** achieved at the top of the Kaggle leaderboard