# Model Selection and Tuning

This notebook tries several machine learning algorithms for classification, judged purely on predictive power and ordered from the simplest model to the most complex. It uses the processed data produced by data_preparation.ipynb, saved as the CSV file 'aggregated_train_data.csv'.

The objective of the models is to predict whether a client with given characteristics will default on a loan at the time of application. Because the data is heavily imbalanced (over 90% of the observations are negative), plain accuracy is not a useful target: always predicting negative would be right most of the time. Instead we focus on the area under the ROC curve (ROC AUC), which measures how well a model ranks defaulters above non-defaulters and is not inflated by the majority class.

It is important to use a metric that combines the true positive rate with the true negative rate: if we flagged every defaulting loan (perfect recall) we would avoid the cost of defaults, but at the expense of turning away many customers who would have paid back their loans with interest. As we do not know the benefits and costs associated with these two cases, a threshold-independent metric such as ROC AUC is the best we can do for this imbalanced dataset. We will also track the accuracy and runtime of each model without trying to optimize runtime, since that could be done on the best model when deploying to the cloud, if its performance warrants it.
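To make the contrast between accuracy and ROC AUC concrete, here is a minimal pure-Python sketch. It uses the pairwise-ranking definition of AUC (the share of positive/negative pairs in which the positive gets the higher score, with ties counted as one half). The toy labels and scores are made up for illustration and are not taken from the loan data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Share of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Toy data (hypothetical): nine negatives and one positive.
labels = [0] * 9 + [1]
always_negative = [0] * 10          # hard predictions of a trivial model
constant_scores = [0.0] * 10        # the same trivial model expressed as scores
informative_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.9]

print(accuracy(labels, always_negative))     # 0.9
print(roc_auc(labels, constant_scores))      # 0.5
print(roc_auc(labels, informative_scores))   # 1.0
```

The trivial model looks strong on accuracy but sits at 0.5 AUC, the value expected from a model with no ranking ability.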

Every model will be trained, tuned and cross-validated with the scikit-learn library; the gradient boosted trees come from LightGBM, used through its scikit-learn-compatible interface.

In [1]:
import pandas as pd
import numpy as np
import os
import time
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

jobs = max(1, (os.cpu_count() or 1) - 2) ## Leave two cores free, but always use at least one.

In [2]:
df = pd.read_csv('aggregated_train_data.csv', index_col= 0)
df.head()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,CC_SK_DPD_MEAN,CC_SK_DPD_DEF_MEAN,CC_NAME_CONTRACT_STATUS_Active_MEAN,CC_NAME_CONTRACT_STATUS_Approved_MEAN,CC_NAME_CONTRACT_STATUS_Completed_MEAN,CC_NAME_CONTRACT_STATUS_Demand_MEAN,CC_NAME_CONTRACT_STATUS_Refused_MEAN,CC_NAME_CONTRACT_STATUS_Sent proposal_MEAN,CC_NAME_CONTRACT_STATUS_Signed_MEAN,CC_NAME_CONTRACT_STATUS_nan_MEAN
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,,,,,,,,,,
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,,,,,,,,,,
2,100004,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,...,,,,,,,,,,
3,100006,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,100007,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,...,,,,,,,,,,


In [3]:
df.rename(columns = {col: col.lower() for col in df.columns.values}, inplace = True)
X = df.drop(columns=['target','sk_id_curr'])
y = df.target.copy()

performance_metrics = {}

In [4]:
y.value_counts()

0    282686
1     24825
Name: target, dtype: int64

The counts confirm that the data is imbalanced: the negative class (no default) makes up about 91.9% of the observations, while the positive class (client defaulting on a loan) makes up about 8.1%.
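As a small worked check, the class shares and the accuracy of an "always predict negative" rule follow directly from the counts printed above. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
n_negative = 282686
n_positive = 24825
n_total = n_negative + n_positive

positive_share = n_positive / n_total
majority_baseline_accuracy = n_negative / n_total

print(f"positive share: {positive_share:.4f}")                        # 0.0807
print(f"always-negative accuracy: {majority_baseline_accuracy:.6f}")  # 0.919271
```

An accuracy close to this baseline suggests that a model predicts the negative class for nearly every client at the default decision threshold, which is why accuracy is only tracked here and not optimized.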

### First Model: **Logistic Regression**

For this model we first fit a logistic regression on all the features and take the mean AUC score from a 5-fold cross-validation. We then fit a logistic regression with an L1 (Lasso) penalty to attempt feature selection. A natural last step, not run below, is to refit the regression without the penalty using only the features that the L1 model keeps.

In [5]:
from sklearn.linear_model import SGDClassifier

In [6]:
## Logistic regression and several other models do not accept NaN values, so infinities are mapped to NaN and scikit-learn's SimpleImputer fills every NaN with the column mean.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X.replace([np.inf, -np.inf], np.nan, inplace=True)
transformations = Pipeline([('impute', SimpleImputer(strategy= 'mean')), ('scale', StandardScaler())])
transformed_data = transformations.fit_transform(X)

All variables

In [7]:
start = time.time()
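## SGDClassifier keeps its default L2 penalty here; l1_ratio is only used with penalty = 'elasticnet', so it is omitted.
logistic_regression_full = SGDClassifier(loss = 'log_loss', n_jobs= jobs, random_state= 0, early_stopping= True)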
res = cross_validate(logistic_regression_full, transformed_data,y, cv = 5, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['logistic_regression_full'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}

Regularized Logistic regression

In [8]:
## Model tuning

##param_grid = [{'alpha' : [0.01,0.05,0.1]}]
##logistic_regression_l1 = SGDClassifier(loss = 'log_loss', penalty = 'l1', l1_ratio = 1, n_jobs= jobs, random_state= 0, early_stopping= True)
##grid_search = GridSearchCV(logistic_regression_l1, param_grid, scoring=['roc_auc','accuracy'], refit = 'roc_auc')
##grid_search.fit(transformed_data,y)
##grid_search.best_params_

In [9]:
##grid_search.best_score_

In [10]:
start = time.time()
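## penalty = 'l1' already gives a pure Lasso penalty; l1_ratio is only used with penalty = 'elasticnet', so it is omitted.
logistic_regression_l1 = SGDClassifier(loss = 'log_loss', alpha= 0.05, penalty = 'l1', n_jobs= jobs, random_state= 0, early_stopping= True)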
res = cross_validate(logistic_regression_l1, transformed_data,y, cv = 5, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['logistic_regression_l1'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}

### Second Model: **Support Vector Machines**

In [11]:
from sklearn.linear_model import SGDClassifier ## With loss = 'hinge' this fits a linear SVM by stochastic gradient descent.

In [12]:
## Model tuning

##svc = SGDClassifier(loss = 'hinge', random_state = 0, max_iter = 10000, early_stopping = True)
##param_grid = [{'alpha': [0.001,0.005,0.01]}]
##grid_search = GridSearchCV(svc, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
##grid_search.fit(transformed_data,y)
##grid_search.best_params_

In [13]:
##grid_search.best_score_

In [14]:
start = time.time()
svc = SGDClassifier(loss = 'hinge', alpha= 0.01,random_state= 0, max_iter = 10000, early_stopping= True)
res = cross_validate(svc, transformed_data,y, cv = 5, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['SVC'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}
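A hinge-loss `SGDClassifier` does not expose `predict_proba`, so the ROC AUC for this model has to be computed from the signed distances returned by `decision_function`. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are hypothetical); it reports a training-set AUC only to show the mechanics, not as a performance estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in data (assumption: illustration only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)

linear_svm = SGDClassifier(loss='hinge', alpha=0.01, random_state=0)
linear_svm.fit(X_demo, y_demo)

# Hinge loss yields margins, not probabilities.
has_proba = hasattr(linear_svm, 'predict_proba')
margins = linear_svm.decision_function(X_demo)

# AUC only needs a ranking, so raw margins are sufficient.
train_auc = roc_auc_score(y_demo, margins)
print(has_proba, train_auc)
```

Because AUC depends only on the ordering of the scores, the lack of calibrated probabilities does not prevent a fair comparison with the other models on this metric.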

### Third Model: **Decision Tree**

In [15]:
from sklearn.tree import DecisionTreeClassifier

In [16]:
## Model tuning

##clf = DecisionTreeClassifier(random_state = 0)
##param_grid = [{'max_depth':[15,17,19]}]
##grid_search = GridSearchCV(clf, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
##grid_search.fit(transformed_data,y)
##grid_search.best_params_

In [17]:
##grid_search.best_score_

In [18]:
start = time.time()
dec_tree = DecisionTreeClassifier(max_depth = 17, min_samples_leaf = 7, random_state = 0)
res = cross_validate(dec_tree, transformed_data,y, cv = 5, n_jobs = jobs, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['decision_tree'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}

### Fourth Model: **Random Forest**

In [19]:
from sklearn.ensemble import RandomForestClassifier

In [20]:
## Model tuning

##clf = RandomForestClassifier(random_state = 0)
##param_grid = [{'max_depth':[11,13,15], 'min_samples_leaf': [7,9,11], 'n_estimators': [200]}]
##grid_search = GridSearchCV(clf, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
##grid_search.fit(transformed_data,y)
##grid_search.best_params_

In [21]:
##grid_search.best_score_

In [22]:
start = time.time()
clf = RandomForestClassifier(n_estimators = 400, max_depth = 15, min_samples_leaf = 7, n_jobs = jobs,random_state= 0)
res = cross_validate(clf, transformed_data,y, cv = 5, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['random_forest'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}

### Fifth Model: **Gradient Boosted Trees**

In [23]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

In [24]:
## Model tuning

##LGBM_clf = LGBMClassifier(
##    learning_rate =0.01,
##    n_estimators= 500,
##    num_leaves = 50,
##    min_split_gain= 0.03,
##    colsample_bytree=0.6,
##    verbose =-1,
##    n_jobs = jobs,
##    seed=0)

##param_grid = [{'max_depth':[9,10,11], 'min_child_weight': range(3,13,3), 'reg_alpha': [0.1,1,5,10], 'reg_lambda': [0.1,1,5,10], 'subsample': [0.6,0.7,0.8]}]
##grid_search = RandomizedSearchCV(LGBM_clf, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
##grid_search.fit(transformed_data,y)
##grid_search.best_params_

In [25]:
##grid_search.best_score_

In [26]:
start = time.time()
LGBM_clf = LGBMClassifier(
    learning_rate =0.01,
    n_estimators= 5000,
    num_leaves = 50,
    max_depth= 11,
    min_split_gain= 0.03,
    min_child_weight= 6,
    subsample=0.8,
    colsample_bytree=0.6,
    reg_alpha = 4,
    reg_lambda = 5,
    verbose =-1,
    n_jobs = jobs,
    seed=0)

res = cross_validate(LGBM_clf, transformed_data,y, cv = 5, scoring=['roc_auc','accuracy'])
end = time.time()
performance_metrics['LGBM'] = {'roc_auc': np.mean(res['test_roc_auc']),'accuracy':np.mean(res['test_accuracy']),'runtime': end-start}

### Comparing model performance with runtime:

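All of the metrics were obtained with 5-fold cross-validation, which makes them more robust and a better estimate of performance on unseen data than a single train/test split. Runtime is the wall-clock time of the whole cross-validation, in seconds.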

In [27]:
results = pd.DataFrame(performance_metrics).transpose()
results.sort_values(by = ['roc_auc'], ascending= False)

model,roc_auc,accuracy,runtime
LGBM,0.79097,0.920126,1788.541508
random_forest,0.75829,0.919271,1005.142762
SVC,0.752688,0.915304,12.469002
logistic_regression_full,0.696794,0.909928,30.610333
decision_tree,0.617587,0.897792,78.623214
logistic_regression_l1,0.481147,0.919271,33.661335


In [28]:
results.to_csv('performance_metrics.csv')

In [29]:
## For final model (LGBM):
## res['test_roc_auc']

array([0.79043704, 0.79265997, 0.78639598, 0.79160497, 0.7937533 ])
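As a quick check of how stable the final model is across folds, the mean and sample standard deviation of the five fold scores above can be computed with the standard library. The variable names are illustrative only.

```python
from statistics import mean, stdev

# Per-fold ROC AUC values copied from the output above (LGBM).
fold_auc = [0.79043704, 0.79265997, 0.78639598, 0.79160497, 0.7937533]

mean_auc = mean(fold_auc)
std_auc = stdev(fold_auc)  # sample standard deviation across the folds

print(f"ROC AUC: {mean_auc:.5f} +/- {std_auc:.5f}")
```

The mean reproduces the 0.79097 reported in the results table, and the spread across folds is small compared with the gap to the next-best model.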