by Stuart Miller, Paul Adams, and Justin Howard

# **Introduction**

Intro

# **Methods**

## **Data**

The case study uses banking data; no further information about the data or its features was available to us.

## **Models**

Description of the models used

### **Random Forest**

Random Forest stuff!

### **XGBoost**

[XGB](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58)

### **SVM**

## **Hyperparameter Tuning**

Hyperparameters were selected using a randomized search with 5-fold internal cross-validation.
We used a randomized search rather than an exhaustive search (often called grid search)
 because randomized searches have been shown to achieve similar results
 with significantly lower run times.
Whereas an exhaustive search validates every combination of tuning parameters,
 a randomized search runs for a specified number of iterations and
 validates one randomly drawn set of parameters on each iteration.
In this application of randomized search, 5-fold cross-validation is performed at each iteration.
The best parameters from the search are selected based on the mean cross-validated log loss.
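
The procedure described above can be sketched on a small scale as follows. This is a minimal illustration, not the case-study code: the data are synthetic (`make_classification`), the parameter ranges are placeholders, and the built-in `"neg_log_loss"` scorer (which scores predicted probabilities) is an assumption made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption; not the case-study data)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for each tuning parameter (illustrative ranges only)
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,                # number of random parameter sets to try
    cv=5,                    # 5-fold cross-validation per parameter set
    scoring="neg_log_loss",  # log loss on predicted probabilities, negated
    random_state=0,
)
search.fit(X, y)

# best_score_ is the mean negated log loss, so flip the sign to report it
best_log_loss = -search.best_score_
print(search.best_params_, best_log_loss)
```

Only 5 of the 27 possible parameter combinations are validated here, which is where the run-time saving over an exhaustive search comes from.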

We ran the random forest and XGBoost hyperparameter searches for 100 iterations.
We only ran the SVM hyperparameter search for 10 iterations due to the high run time of fitting the model.
The tuning parameters searched for each model in the case study are shown in Tables X-Z.


**Table X. Random Forest Tuning Parameters**

| Parameter           | Search Range                    |
|---------------------|:-------------------------------:|
| `n_estimators`      | 10:150                          |
| `criterion`         | One of `'gini', 'entropy'`      |
| `max_depth`         | 10:100                          | 
| `min_samples_split` | 2:100                           |
| `min_samples_leaf`  | 2:100                           |
| `max_features`      | One of `'auto', 'sqrt', 'log2'` |

**Table Y. XGBoost Tuning Parameters**

| Parameter           | Search Range                    |
|---------------------|:-------------------------------:|
| `A`      |                        |
| `B`         | One of       |


**Table Z. SVM Tuning Parameters**

| Parameter | Search Range                                |
|-----------|:-------------------------------------------:|
| `C`       | 0.001:10\*                                  |
| `kernel`  | One of `'linear', 'poly', 'rbf', 'sigmoid'` |
| `gamma`   | One of `'scale', 'auto'`                    | 

\*values distributed on a log scale
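
One way to draw `C` on a log scale is with a log-uniform distribution. The sketch below assumes `scipy.stats.loguniform` over the range in Table Z; the source does not state how the values were actually generated, so treat this as one possible setup.

```python
from scipy.stats import loguniform

# Log-uniform distribution over [0.001, 10]: each decade is equally likely
c_dist = loguniform(1e-3, 10)
samples = c_dist.rvs(size=1000, random_state=0)

# Search space in the form accepted by RandomizedSearchCV, which calls
# .rvs() on distribution objects and samples uniformly from lists
svm_params = {
    "C": c_dist,
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["scale", "auto"],
}

# Roughly half of the draws fall below the geometric midpoint (0.1)
print((samples < 0.1).mean())
```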

# **Results**

* Hyperparameter tuning tables
* Validation results for RF, XGB, SVM <- let's do a test set here
* Scaling times for SVM (1k, 2k, 5k, 10k)
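
A minimal sketch of how the SVM scaling times could be measured, assuming synthetic data and a default `SVC`. The helper name `time_svm_fits` and its arguments are hypothetical; the case-study data would replace the generated data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC


def time_svm_fits(sizes=(1000, 2000, 5000, 10000), n_features=20, seed=0):
    """Return {n_samples: fit time in seconds} for a default SVC."""
    timings = {}
    for n in sizes:
        # Synthetic stand-in data (assumption; not the case-study data)
        X, y = make_classification(
            n_samples=n, n_features=n_features, random_state=seed
        )
        clf = SVC()
        start = time.perf_counter()
        clf.fit(X, y)
        timings[n] = time.perf_counter() - start
    return timings
```

Calling `time_svm_fits()` with the default sizes returns one timing per sample size, which can then be tabulated or plotted.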

# **Conclusion**

the conclusion

# **Appendix A**

In [None]:
import time
import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    log_loss,
    accuracy_score,
    make_scorer)
from sklearn.model_selection import (
    train_test_split, 
    RandomizedSearchCV, 
    cross_validate)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

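# make_scorer with default settings scores the hard labels from `predict`,
# so this log loss is computed on 0/1 predictions, not probabilities;
# greater_is_better=False negates the value so that larger is better
log_loss_scorer = make_scorer(log_loss, greater_is_better=False)
accuracy_scorer = make_scorer(accuracy_score)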

random_state = np.random.RandomState(42)

In [120]:


# get the data
data = pd.read_csv('./data/case_8.csv')
# put the target in another variable
target = data.target
# drop off ID and target
data = data.drop(['ID', 'target'], axis=1)
# get train and test sets
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    test_size=0.33,
                                                    random_state=random_state)

In [121]:
# label-encode the object (string) columns
# encoders are fit on the full data so every category in train and test is known
obj_columns = list(data.select_dtypes(include='object'))
obj_col_encoders = {col: LabelEncoder() for col in obj_columns}

for col in obj_col_encoders.keys():
    obj_col_encoders[col].fit(data[col])

# work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()
for col in obj_col_encoders.keys():
    X_train[col] = obj_col_encoders[col].transform(X_train[col])
    X_test[col] = obj_col_encoders[col].transform(X_test[col])


In [81]:
# random forest

rf_clf = RandomForestClassifier(random_state=random_state)
rf_params = {
    'n_estimators': np.linspace(10, 150, dtype='int'),
    'criterion':['gini', 'entropy'],
    'max_depth': np.linspace(10, 100, dtype='int'),
    'min_samples_split': np.linspace(2, 100, 50, dtype='int'),
    'min_samples_leaf': np.linspace(2, 100, 50, dtype='int'),
    'max_features': ['auto', 'sqrt', 'log2']
}

search_iters = 100

rf_RSCV_start_time = time.time()
# setup search (cv=5 makes the 5-fold cross-validation explicit)
rf_RSCV = RandomizedSearchCV(rf_clf, rf_params, scoring=log_loss_scorer, cv=5,
                             n_iter=search_iters, random_state=random_state)
# search
rf_RSCV.fit(X_train, y_train)

rf_RSCV_end_time = time.time()
duration = rf_RSCV_end_time-rf_RSCV_start_time

print(f'Randomized CV search done. {search_iters} iterations took \
{int(duration // 3600):02d}::{int((duration % 3600)//60):02d}::{int((duration % 3600) % 60):02d}')

Randomized CV search done. 10 iterations took 00::25::42


In [127]:
# print the best parameters chosen by CV
pprint.pprint(rf_RSCV.best_params_)

{'criterion': 'gini',
 'max_depth': 70,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 68,
 'n_estimators': 44}


In [101]:
# get 5-fold CV results with best parameters
rf_clf.set_params(**rf_RSCV.best_params_)
rf_cv = cross_validate(rf_clf, X_train, y_train, cv=5,
                       scoring={
                           'log_loss': log_loss_scorer,
                           'accuracy': accuracy_scorer
                       })

In [114]:
print('RF 5-fold Validation Performance')
# note test_log_loss is negated due to how scorers work 
# in parameter searches in sklearn
print('Mean Log Loss\t{}'.format(np.mean(-rf_cv['test_log_loss'])))
print('Mean Accuracy\t{}'.format(np.mean(rf_cv['test_accuracy'])))

5-fold Validation Performance
Mean Log Loss	7.735369922195444
Mean Accuracy	0.7760428226385534


In [128]:
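# get performance on test set
rf_clf.fit(X_train, y_train)
# predict returns hard class labels, so the log loss below is computed on 0/1 predictions
rf_y_test_pred = rf_clf.predict(X_test)

print('RF Test Set Performance')
# sklearn metrics take the true labels first, then the predictions
print('Test Log Loss\t{}'.format(log_loss(y_test, rf_y_test_pred)))
print('Test Accuracy\t{}'.format(accuracy_score(y_test, rf_y_test_pred)))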

RF Test Set Performance
Test Log Loss	7.651903959831961
Test Accuracy	0.7784551768011451
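
The log loss values above are computed from the hard class labels returned by `predict`, which is why they sit near 7.7 rather than below 1. Log loss is normally evaluated on predicted probabilities. The sketch below contrasts the two on synthetic data; the data set and model settings are assumptions for illustration and are not the case-study results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the case-study data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Hard 0/1 labels: every misclassification is penalized as a fully
# confident wrong prediction
hard_loss = log_loss(y_te, clf.predict(X_te))

# Class probabilities: the penalty reflects how confident the model was
prob_loss = log_loss(y_te, clf.predict_proba(X_te))

print(f"log loss on hard labels:   {hard_loss:.3f}")
print(f"log loss on probabilities: {prob_loss:.3f}")
```

In a parameter search, the built-in `scoring='neg_log_loss'` option scores predicted probabilities in the same way.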
