# Model Selection and Tuning

This notebook compares several classification algorithms, judged solely on predictive power, moving from the simplest model to the most complex. It uses the processed data produced by the data_preparation.ipynb notebook, stored as the CSV file 'train_data_processed.csv'.

The goal is to predict whether a client with given characteristics will default on a loan at the time of application. Because we know neither the cost of a default nor the benefit of a repaid loan, we cannot choose a decision threshold on business grounds, so we use the area under the ROC curve (ROC AUC), which does not depend on a threshold, as the performance metric.
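To make the metric concrete: ROC AUC equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counting as half. The sketch below computes it by brute force; the helper `pairwise_auc` and the four toy labels and scores are illustrative assumptions, not part of the notebook or the dataset. In practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: two non-defaulters (0) and two defaulters (1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric needs no cost assumptions.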

Every model is trained and cross-validated through the scikit-learn API (XGBoost via its scikit-learn wrapper).

In [1]:
import pandas as pd
import numpy as np
import os
import time
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Use all cores but one. os.cpu_count() can return None, and n_jobs must never be 0.
jobs = max(1, (os.cpu_count() or 2) - 1)

In [2]:
df = pd.read_csv('train_data_processed.csv', index_col= 0)
df.head()

Unnamed: 0,AMT_ANNUITY,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,CNT_CHILDREN,AMT_INCOME_TOTAL,...,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,REGION_POPULATION_RELATIVE,TARGET
0,-0.166149,0.241033,4.163149,0.250314,5.253007,0.083037,0.262949,0.139376,-0.577538,0.142129,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018801,1
1,0.592677,-0.176156,-0.321603,-0.170589,-0.276616,0.311267,0.622246,0.510853,-0.577538,0.426792,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003541,0
2,-1.404676,-0.593345,-0.321603,-0.591491,-0.276616,0.50213,0.555912,0.729567,-0.577538,-0.427196,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010032,0
3,0.177869,0.241033,-0.321603,0.250314,-0.276616,0.50213,0.650442,0.510853,-0.577538,-0.142533,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008019,0
4,-0.361755,-0.593345,-0.321603,-0.591491,-0.276616,0.50213,0.322738,0.510853,-0.577538,-0.199466,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028663,0


The numerical features are already scaled and the categorical features are one-hot encoded, so the data is ready for modelling.

In [3]:
# Lower-case all column names, then split features from the target.
df.columns = df.columns.str.lower()
X = df.drop(columns=['target'])
y = df['target'].copy()

performance_metrics = {}

### First Model: **Logistic Regression**

For this model we first fit a logistic regression on all the features and take the mean ROC AUC over a 5-fold cross-validation. We then run a logistic regression regularized with an L1 (Lasso) penalty to attempt feature selection. If that fit discards features, we finally refit the regression without the L1 penalty, using only the features it kept.

In [4]:
from sklearn.linear_model import SGDClassifier

In [5]:
start = time.time()
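# Note: l1_ratio is only read when penalty='elasticnet', so it is ignored here (default penalty='l2').
# n_jobs only parallelizes one-vs-all multiclass fits and has no effect on this binary problem.
logistic_regression_full = SGDClassifier(loss = 'log_loss', l1_ratio= 1, n_jobs= jobs, random_state= 0, early_stopping= True)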
results_logistic_regression_full = cross_val_score(logistic_regression_full, X,y, cv = 5, scoring='roc_auc')
end = time.time()
performance_metrics['logistic_regression_full'] = {'cv_roc_auc' : np.mean(results_logistic_regression_full),'runtime': end-start}

Regularized Logistic regression

In [6]:
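param_grid = [{'alpha' : [0.0001, 0.001, 0.01, 0.1, 1]}]
# Note: l1_ratio is only read when penalty='elasticnet'. With the default penalty='l2',
# this search tunes the strength of an L2 penalty; a Lasso fit would need penalty='l1'.
logistic_regression_l1 = SGDClassifier(loss = 'log_loss', l1_ratio= 1, n_jobs= jobs, random_state= 0, early_stopping= True)
grid_search = GridSearchCV(logistic_regression_l1, param_grid, cv = 7, scoring = 'roc_auc')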
grid_search.fit(X,y)
grid_search.best_params_

{'alpha': 0.0001}

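The best alpha is 0.0001, which is both the smallest value in the grid and the SGDClassifier default, and corresponds to very weak regularization. Note also that l1_ratio is only used when penalty='elasticnet'; with the default penalty='l2', the search above tuned an L2 penalty, which cannot set coefficients exactly to zero. We therefore do not remove any features and keep the original logistic regression as our best model of this type.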

### Second Model: **Support Vector Machines**

In [7]:
from sklearn.linear_model import SGDClassifier

In [8]:
svc = SGDClassifier(loss = 'hinge', random_state = 0, max_iter = 10000, early_stopping = True)
param_grid = [{'alpha': [3,3.5,4]}]
grid_search = GridSearchCV(svc, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
grid_search.fit(X,y)
grid_search.best_params_

{'alpha': 3.5}

In [9]:
start = time.time()
linear_svc = SGDClassifier(alpha = 3.5, loss = 'hinge', random_state = 0, max_iter = 10000, early_stopping = True)
results_svc = cross_val_score(linear_svc, X,y, cv = 5, n_jobs = jobs, scoring='roc_auc')
end = time.time()
performance_metrics['svc'] = {'cv_roc_auc': np.mean(results_svc), 'runtime': end-start}
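A linear SVM trained with hinge loss does not produce probabilities, so its ROC AUC is computed from the signed distances returned by `decision_function`. The sketch below illustrates this on synthetic data; the dataset and the names `svm_demo`, `margins` and `svm_auc` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

svm_demo = SGDClassifier(loss='hinge', random_state=0).fit(X_tr, y_tr)

# Signed distances to the separating hyperplane: unbounded, not probabilities.
margins = svm_demo.decision_function(X_te)

# ROC AUC only needs a ranking of the samples, so raw margins are sufficient.
svm_auc = roc_auc_score(y_te, margins)
print(hasattr(svm_demo, 'predict_proba'), round(svm_auc, 3))
```

Because only the ranking matters, the hinge-loss model can be compared with the probabilistic models on the same metric without any calibration step.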

### Third Model: **Decision Tree**

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [11]:
clf = DecisionTreeClassifier(random_state = 0)
param_grid = [{'max_depth': [8,10,12], 'min_samples_leaf': [500,800,1000]}]
grid_search = GridSearchCV(clf, param_grid, n_jobs = jobs, cv = 5, scoring = 'roc_auc')
grid_search.fit(X,y)
grid_search.best_params_

{'max_depth': 12, 'min_samples_leaf': 800}
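The selected `max_depth` of 12 is the largest value in the grid, so it can be worth looking at the scores of every combination, not just the winner, before deciding whether to extend the grid. A minimal sketch of how `cv_results_` can be tabulated, using synthetic data and a deliberately small hypothetical grid (`demo_grid`) so that it runs quickly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Illustrative grid, much smaller than the one used on the real data.
demo_grid = {'max_depth': [2, 4], 'min_samples_leaf': [20, 50]}
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, cv=3, scoring='roc_auc'
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
columns = ['param_max_depth', 'param_min_samples_leaf',
           'mean_test_score', 'std_test_score', 'rank_test_score']
cv_table = pd.DataFrame(demo_search.cv_results_)[columns].sort_values('rank_test_score')
print(cv_table)
```

If the mean scores are still rising at the edge of the grid by more than the fold-to-fold standard deviation, a wider grid is probably warranted.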

In [12]:
start = time.time()
dec_tree = DecisionTreeClassifier(max_depth = 12, min_samples_leaf = 800, random_state = 0)
results_decision_tree = cross_val_score(dec_tree, X,y, cv = 5, n_jobs = jobs, scoring='roc_auc')
end = time.time()
performance_metrics['decision_tree'] = {'cv_roc_auc' : np.mean(results_decision_tree),'runtime': end-start}

### Fourth Model: **Random Forest**

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
clf = RandomForestClassifier(n_jobs = jobs,random_state= 0, oob_score= True)
param_grid = [{'max_depth': [10,12], 'min_samples_leaf': [500,750]}]
grid_search = GridSearchCV(clf, param_grid, cv = 5, scoring = 'roc_auc')
grid_search.fit(X,y)
grid_search.best_params_

{'max_depth': 12, 'min_samples_leaf': 500}

In [15]:
start = time.time()
# n_estimators is raised to 300 here; the grid search above used the default of 100 trees.
clf = RandomForestClassifier(n_estimators = 300, min_samples_leaf = 500, max_depth = 12, n_jobs = jobs,random_state= 0, oob_score= True)
random_forest = cross_val_score(clf, X,y, scoring = 'roc_auc', cv = 5)
end = time.time()
performance_metrics['random_forest'] = {'cv_roc_auc': np.mean(random_forest), 'runtime': end - start}
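The forests above are built with `oob_score=True`, but the out-of-bag estimate is never read. As a hedged illustration of what that option provides, the sketch below fits a small forest on synthetic data and inspects the out-of-bag results; the dataset, the hyperparameters and the names `rf_demo` and `oob_auc` are assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=5, oob_score=True, random_state=0
)
rf_demo.fit(X_demo, y_demo)

# oob_score_ is accuracy measured on the samples each tree did not see
# in its bootstrap sample.
print(round(rf_demo.oob_score_, 3))

# The out-of-bag class probabilities allow an AUC estimate without a separate CV loop.
oob_auc = roc_auc_score(y_demo, rf_demo.oob_decision_function_[:, 1])
print(round(oob_auc, 3))
```

The out-of-bag estimate comes from a single fit, so it can serve as a cheap sanity check next to the slower 5-fold cross-validation.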

### Fifth Model: **XGBoost**

In [16]:
from xgboost import XGBClassifier

In [17]:
start = time.time()
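xgbclf = XGBClassifier(
    learning_rate=0.05,
    n_estimators=500,
    max_depth=12,
    min_child_weight=3,
    gamma=0.2,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    n_jobs=jobs,        # scikit-learn wrapper name for the deprecated nthread
    scale_pos_weight=1,
    random_state=0)     # scikit-learn wrapper name for the deprecated seed
# cv=5 is the scikit-learn default; stated explicitly to match the other models.
res = cross_val_score(xgbclf, X, y, cv = 5, scoring = 'roc_auc')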
end = time.time()
performance_metrics['xgboost'] = {'cv_roc_auc': np.mean(res), 'runtime': end-start}

### Comparing model performance with runtime:

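All of the ROC AUC scores come from 5-fold cross-validation, so each is measured on data the model was not fitted on and is a more reliable estimate of generalization than a training-set score. Because the hyperparameters were tuned on the same data, the scores may still be slightly optimistic.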

In [18]:
results = pd.DataFrame(performance_metrics).transpose()
results.sort_values(by = ['cv_roc_auc','runtime'])

model,cv_roc_auc,runtime (s)
svc,0.69168,2.032032
decision_tree,0.722859,8.337421
random_forest,0.735669,172.843819
logistic_regression_full,0.737341,18.54125
xgboost,0.745122,1522.068018
