This notebook continues `09_classifier.ipynb`, in which we built some basic classification models using tree-based classifiers. In this notebook, we will tune the hyper-parameters of those models to try to improve their performance. The full version of this notebook is available in `11_classifier_fine_tuning.ipynb`. The dataset used for this exercise comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients).

### Import packages

In [None]:
# data processing
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# modeling
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# grid search
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

### Set-up

In [None]:
# input file location and name
infile = 'https://raw.githubusercontent.com/vishal-git/dapt-631/main/data/credit_default_model_data.csv'

# target variable (column name)
target = 'default payment next month'

sns.set(style='darkgrid')

### Read data

In [None]:
df = pd.read_csv(infile)

df.shape

In [None]:
df.head()

### Set-up X and y

In [None]:
y = df[target]
X = df.drop(target, axis=1)

X_train = X[X['group'] == 'M'].drop('group', axis=1)
X_test = X[X['group'] == 'T'].drop('group', axis=1)
X_valid = X[X['group'] == 'V'].drop('group', axis=1)

y_train = y[X['group'] == 'M']
y_test = y[X['group'] == 'T']
y_valid = y[X['group'] == 'V']

print(len(X_train), len(X_test), len(X_valid))

del df

### Decision Tree

We will now fine-tune the hyper-parameters of a decision (classification) tree.

#### Max Depth

In [None]:
# create a list of all values we would like to test
max_depths = 

We will build a decision tree model using each value of `max_depth`. Once all models are built, we will pick the best value for `max_depth` based on the model performance on the test set.
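
As a rough illustration of this loop, the sketch below runs the same kind of sweep on synthetic data from `make_classification` rather than on the credit default file. The candidate depths, the 70/30 split, and all `demo_` names are assumptions made for the example, not the values used in the full notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the credit default data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, random_state=314)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=314)

demo_depths = [1, 2, 3, 5, 8, 12]  # illustrative candidate values
demo_auc_train, demo_auc_test = [], []

for d in demo_depths:
    # fit one tree per candidate depth
    model = DecisionTreeClassifier(max_depth=d, random_state=314)
    model.fit(X_tr, y_tr)

    # score both sets with the predicted probability of the positive class
    demo_auc_train.append(roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))
    demo_auc_test.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

for d, a_tr, a_te in zip(demo_depths, demo_auc_train, demo_auc_test):
    print(f'max_depth={d:>2}  train AUC={a_tr:.3f}  test AUC={a_te:.3f}')
```

A widening gap between the train and test AUC as the depth grows is the usual sign of over-fitting, which is why the best depth is chosen on the test set rather than on the training set.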

In [None]:
# create empty arrays -- we will use these to store model performance values
auc_train, auc_test = 

for d in max_depths:
    #--

Let's plot the model performances.

In [None]:
plt.figure(figsize=(12, 9))

plt.plot()
plt.plot()

plt.xticks(max_depths)
plt.ylim([0.0, 1.0])

plt.xlabel('Max Depth', fontsize=14)
plt.ylabel('AUC', fontsize=14)
plt.title('Default Risk Model: Decision Tree (Max Depth)', fontsize=16)
plt.legend(loc='best', fontsize=14);

Find the position at which the AUC on the test set is highest.
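
One possible way to do this lookup is with `np.argmax`, which returns the index of the first maximum. The sketch below uses made-up AUC values and hypothetical `demo_` names purely to show the indexing; it does not reproduce results from the credit default data.

```python
import numpy as np

demo_depths = [2, 3, 4, 5, 6]
demo_auc_test = [0.70, 0.74, 0.76, 0.75, 0.73]  # made-up values for illustration

# index of the highest test AUC, then the matching AUC and depth
demo_best_loc = int(np.argmax(demo_auc_test))
demo_best_auc = demo_auc_test[demo_best_loc]
demo_best_depth = demo_depths[demo_best_loc]

print(demo_best_loc, demo_best_auc, demo_best_depth)  # prints: 2 0.76 4
```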

In [None]:
best_loc = 

In [None]:
best_auc = 

In [None]:
best_max_depth = 

In [None]:
plt.figure(figsize=(12, 9))

plt.plot()
plt.plot()

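# mark the best max_depth found above instead of a hard-coded value
plt.plot([best_max_depth, best_max_depth], [0, 1], color='gray', linewidth=1, linestyle='--')
plt.text(best_max_depth + 0.2, 0.4, f'Best AUC={best_auc:.2f} (max_depth={best_max_depth})', fontsize=14,
         color='royalblue', weight='semibold')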

plt.xticks(max_depths)
plt.ylim([0.0, 1.0])

plt.xlabel('Max Depth', fontsize=14)
plt.ylabel('AUC', fontsize=14)
plt.title('Default Risk Model: Decision Tree (Max Depth)', fontsize=16)
plt.legend(loc='best', fontsize=14);

#### Minimum Samples in the leaf nodes
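
When `min_samples_leaf` is passed as a float, scikit-learn treats it as a fraction of the training rows, so every leaf must hold at least `ceil(min_samples_leaf * n_samples)` samples. The sketch below checks this on synthetic data; the data set, the candidate fractions, and the `demo_` and `leaf_report` names are assumptions for illustration only.

```python
import math

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, random_state=314)

leaf_report = {}
for frac in [0.2, 0.05, 0.01]:
    model = DecisionTreeClassifier(min_samples_leaf=frac, random_state=314)
    model.fit(X_demo, y_demo)

    # leaves are the nodes without children (children_left == -1)
    is_leaf = model.tree_.children_left == -1
    smallest_leaf = int(model.tree_.n_node_samples[is_leaf].min())
    required = math.ceil(frac * len(X_demo))

    leaf_report[frac] = (model.get_n_leaves(), smallest_leaf, required)
    print(f'min_samples_leaf={frac}: {model.get_n_leaves()} leaves, '
          f'smallest leaf={smallest_leaf}, required minimum={required}')
```

Smaller fractions allow more and smaller leaves, so the tree can fit the training data more closely.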

In [None]:
# create a list of all values we would like to test
min_smpl_leaf = [0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.001]

# create empty arrays -- we will use these to store model performance values
auc_train, auc_test = [], []

for msl in min_smpl_leaf:
    
    #--

In [None]:
# identify the best value for min_samples_leaf
best_loc = [i for i, auc_test_value in enumerate(auc_test) if auc_test_value == max(auc_test)][0]
best_auc = auc_test[best_loc]
best_msl = min_smpl_leaf[best_loc]

In [None]:
# plot the model performances
plt.figure(figsize=(12, 9))

plt.plot(min_smpl_leaf, auc_train, color='tomato', lw=2, label='Train')

plt.plot(min_smpl_leaf, auc_test, color='royalblue', lw=2, label='Test')

plt.plot([best_msl, best_msl], [0, 1], color='gray', linewidth=1, linestyle='--')
plt.text(0.2, 0.7, f'Best AUC={best_auc:.2f} (min_smpl_leaf={best_msl})', fontsize=14,
         color='royalblue', weight='semibold')

plt.xticks(min_smpl_leaf)
plt.xlim([max(min_smpl_leaf), min(min_smpl_leaf)])
plt.xscale('log')
plt.ylim([0.5, 1.0])

plt.xlabel('Min Samples Leaf', fontsize=14)
plt.ylabel('AUC', fontsize=14)
plt.title('Default Risk Model: Decision Tree (Min Samples Leaf)', fontsize=16)
plt.legend(loc='best', fontsize=14);

#### Grid-search

Instead of tuning one hyper-parameter at a time, we can use grid search to evaluate combinations of hyper-parameters.
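
A minimal sketch of what such a search can look like with `GridSearchCV` is shown below. It uses synthetic data, and the grid values, the `roc_auc` scoring choice, the 3-fold cross-validation, and the `demo_` names are all assumptions for the example rather than the settings of the full notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=314)

# every combination of these values is evaluated: 3 x 3 = 9 candidates
demo_grid = {
    'max_depth': [2, 4, 6],
    'min_samples_leaf': [0.01, 0.05, 0.1],
}

demo_gs = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=314),
    param_grid=demo_grid,
    scoring='roc_auc',  # rank candidates by cross-validated AUC
    cv=3,
)
demo_gs.fit(X_demo, y_demo)

print('Combinations tried:', len(demo_gs.cv_results_['params']))
print('Best parameters:', demo_gs.best_params_)
print(f'Best cross-validated AUC: {demo_gs.best_score_:.3f}')
```

With the default `refit=True`, the search object refits the best candidate on all the data passed to `fit`, so calls such as `demo_gs.predict_proba(...)` use that refitted model.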

In [None]:
tree = DecisionTreeClassifier(random_state=314)

# create a list of all parameters we want to test
param_grid = 

# define the gridsearch object
tree_gs = 

# fit the model
tree_gs.fit(X_train, y_train)

Find the best set of hyper-parameters.

In [None]:
#--

In [None]:
tree_scores_train = tree_gs.predict_proba(X_train)[:, 1]
tree_scores_test = tree_gs.predict_proba(X_test)[:, 1]

tree_fpr_train, tree_tpr_train, _ = roc_curve(y_train, tree_scores_train)
tree_fpr_test, tree_tpr_test, _ = roc_curve(y_test, tree_scores_test)

plt.figure(figsize=(12, 9))

plt.plot(tree_fpr_train, tree_tpr_train, color='green', lw=2, alpha = 0.4, linestyle = '-',
         label=f'DT Train (AUC = {roc_auc_score(y_train, tree_scores_train):0.3f})')

plt.plot(tree_fpr_test, tree_tpr_test, color='green', lw=2, linestyle = '-',
         label=f'DT Test (AUC = {roc_auc_score(y_test, tree_scores_test):0.3f})')

plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize = 14)
plt.ylabel('True Positive Rate', fontsize = 14)
plt.title('Default Risk Model: Decision Tree', fontsize = 16)
plt.legend(loc='lower right', fontsize = 14);

### Random Forest

Instead of testing every combination of hyper-parameters, we can run a random search, which evaluates randomly sampled combinations from the given set.

We will perform a random search to optimize the following hyper-parameters for a Random Forest model.

Number of trees in random forest: `n_estimators = [200, 300]`

Maximum number of levels in tree: `max_depth = [3, 6]`

Minimum fraction of samples required in each leaf node: `min_samples_leaf = [0.02, 0.05]`

Whether to train each tree on a bootstrap sample of the training data: `bootstrap = [True, False]`
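
The settings listed above could be passed to `RandomizedSearchCV` roughly as follows. This is only a sketch on synthetic data: the choice of `n_iter=4` (4 of the 16 possible combinations), the `roc_auc` scoring, the 3-fold cross-validation, and the `demo_` names are assumptions, not necessarily what the full notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the training data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, random_state=314)

# candidate values taken from the list above
demo_distributions = {
    'n_estimators': [200, 300],
    'max_depth': [3, 6],
    'min_samples_leaf': [0.02, 0.05],
    'bootstrap': [True, False],
}

demo_rs = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=314),
    param_distributions=demo_distributions,
    n_iter=4,            # number of sampled combinations (assumption)
    scoring='roc_auc',
    cv=3,
    random_state=314,    # makes the sampled combinations reproducible
)
demo_rs.fit(X_demo, y_demo)

print('Combinations tried:', len(demo_rs.cv_results_['params']))
print('Best parameters:', demo_rs.best_params_)
```

The cost of a random search is controlled by `n_iter` rather than by the size of the grid, which is what makes it attractive when the grid is large.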

In [None]:
forest = RandomForestClassifier(random_state=314)

param_grid = 

forest_gs = 

forest_gs.fit(X_train, y_train)

In [None]:
forest_scores_train = forest_gs.predict_proba(X_train)[:, 1]
forest_scores_test = forest_gs.predict_proba(X_test)[:, 1]

forest_fpr_train, forest_tpr_train, _ = roc_curve(y_train, forest_scores_train)
forest_fpr_test, forest_tpr_test, _ = roc_curve(y_test, forest_scores_test)

plt.figure(figsize=(12, 9))

plt.plot(forest_fpr_train, forest_tpr_train, color='darkorange', lw=2, alpha = 0.5, linestyle = '-',
         label=f'RF Train (AUC = {roc_auc_score(y_train, forest_scores_train):0.3f})')

plt.plot(forest_fpr_test, forest_tpr_test, color='darkorange', lw=2, linestyle = '-',
         label=f'RF Test (AUC = {roc_auc_score(y_test, forest_scores_test):0.3f})')


plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize = 14)
plt.ylabel('True Positive Rate', fontsize = 14)
plt.title('Default Risk Model: Random Forest', fontsize = 16)
plt.legend(loc='lower right', fontsize = 14);

Note: Once you find the best set of hyper-parameters, you can further refine them by performing another random search using a new set of hyper-parameters -- the new values (to be tested) can be chosen based on the results from the first random search.
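
One way to decide which values to try in a second round is to inspect the `cv_results_` attribute of the fitted search object. The sketch below ranks the sampled combinations from a small first search on synthetic data; the decision tree model, the candidate values, and the `first_search` and `results` names are assumptions chosen to keep the example quick to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the training data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=314)

first_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=314),
    param_distributions={
        'max_depth': [2, 4, 6, 8, 10],
        'min_samples_leaf': [0.01, 0.02, 0.05, 0.1],
    },
    n_iter=8,
    scoring='roc_auc',
    cv=3,
    random_state=314,
)
first_search.fit(X_demo, y_demo)

# one row per sampled combination, best first
results = (
    pd.DataFrame(first_search.cv_results_)
    [['param_max_depth', 'param_min_samples_leaf',
      'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(results.head())
```

If the top rows cluster around particular values, a follow-up search can sample more finely around them.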

### Gradient Boosting

Next, we will perform a random search to optimize the following hyper-parameters for a *Gradient Boosting* model.

Number of trees: `n_estimators = [100, 300, 500]`

Learning rate: `learning_rate = [0.05, 0.1]`

Maximum number of levels in tree: `max_depth = [3, 6]`

Minimum fraction of samples required in each leaf node: `min_samples_leaf = [0.01, 0.02, 0.05]`
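
For orientation, here is a sketch of how these settings might be searched and the winning model scored on a held-out set. It runs on synthetic data, and the split, `n_iter=3`, the `roc_auc` scoring, the 3-fold cross-validation, and the `demo_` names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# synthetic stand-in for the credit default data (assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, random_state=314)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=314)

# candidate values taken from the list above
demo_distributions = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 6],
    'min_samples_leaf': [0.01, 0.02, 0.05],
}

demo_gbm_rs = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=314),
    param_distributions=demo_distributions,
    n_iter=3,            # number of sampled combinations (assumption)
    scoring='roc_auc',
    cv=3,
    random_state=314,
)
demo_gbm_rs.fit(X_tr, y_tr)

# score the held-out set with the refitted best model
demo_scores_te = demo_gbm_rs.predict_proba(X_te)[:, 1]
demo_fpr, demo_tpr, _ = roc_curve(y_te, demo_scores_te)
demo_auc_te = roc_auc_score(y_te, demo_scores_te)

print('Best GBM parameters:', demo_gbm_rs.best_params_)
print(f'Held-out AUC: {demo_auc_te:.3f}')
```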


In [None]:
# initialize a model
gbm = 

# create a list of all parameters we want to test
param_grid = 

# define the random-search object
gbm_rs = 

# fit the model
gbm_rs.fit(X_train, y_train)

print('Best GBM Parameters:', gbm_rs.best_params_)

In [None]:
# model scores
gbm_scores_train = 
gbm_scores_test = 

# ROC curve data
gbm_fpr_train, gbm_tpr_train, _ = 
gbm_fpr_test, gbm_tpr_test, _ = 

In [None]:
# ROC Curve
plt.figure(figsize=(12, 9))

plt.plot(gbm_fpr_train, gbm_tpr_train, color='purple', lw=2, alpha = 0.2, linestyle = '-',
         label=f'GBM Train (AUC = {roc_auc_score(y_train, gbm_scores_train):0.3f})')

plt.plot(gbm_fpr_test, gbm_tpr_test, color='purple', lw=2, linestyle = '-',
         label=f'GBM Test (AUC = {roc_auc_score(y_test, gbm_scores_test):0.3f})')

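# reference diagonal, axes and legend, following the Decision Tree and Random Forest plots
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('Default Risk Model: Gradient Boosting', fontsize=16)
plt.legend(loc='lower right', fontsize=14);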