In [9]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});



by Stuart Miller, Paul Adams, and Justin Howard

# **Introduction**

Intro

# **Methods**

## **Data**

Banking data, we don't know anything about the data!

## **Models**

In this case study, we use random forest, XGBoost, and support vector machines to model the data.
In the following sections, we describe how these models work.

### **Random Forest**

A random forest is an ensemble model created from a collection of decision trees using bootstrap aggregation (bagging).
The following steps are used to create bagged trees (James et al., 2013):

  * Bootstrap sample the dataset (repeated sampling with replacement) to create $B$ separate datasets.
  * Fit a model $f^b(x)$ on each of the $B$ datasets.

Then the bagged model is given by

$$
f_{bag}(x) = \frac{1}{B} \sum_{b=1}^B f^b (x)
$$

In the context of classification, the *majority vote* of the classifiers is taken as the class prediction.
This is called a bagged decision tree model.
The aggregation of these high-variance decision trees substantially reduces the overall model variance (James et al., 2013).
In general, a large number of decision trees should be used in the ensemble.
We treat the number of decision trees as a hyperparameter and tune it with cross-validation.
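
The bagging steps above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the code used in this case study: the dataset, the choice of $B = 51$ (odd, to avoid tied votes), and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

rng = np.random.RandomState(0)
B = 51  # number of bootstrap datasets (and trees); odd to avoid tied votes

trees = []
for b in range(B):
    # bootstrap sample: draw n row indices with replacement
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=b)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# each tree votes; rows are trees, columns are test records
votes = np.array([tree.predict(X_te) for tree in trees])
# majority vote for 0/1 labels: predict 1 when more than half the trees say 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_te).mean()
single_acc = (trees[0].predict(X_te) == y_te).mean()
print(f'single tree accuracy:  {single_acc:.3f}')
print(f'bagged trees accuracy: {bagged_acc:.3f}')
```

Comparing the two printed accuracies gives a rough sense of how much the vote over many high-variance trees helps on a given dataset.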

One additional tweak is added to a bagged decision tree model to make it a random forest.
When each decision tree is built, only a random subset of the available predictors is considered at each split (James et al., 2013).
This improves on the bagged model by decorrelating the individual trees used in the ensemble.
Two common choices for the number of predictors to consider at a split are the square root of the number of available features
 and the base-2 logarithm of the number of available features.
We treat the number of features the model considers at each split as a hyperparameter and tune it with cross-validation.
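
To make the two rules concrete, the sketch below computes how many predictors each rule would consider for a few feature counts. The helper `n_split_features` is hypothetical and written only for this illustration; it rounds down and never returns fewer than one feature, which we believe mirrors how scikit-learn interprets `'sqrt'` and `'log2'`.

```python
import math


def n_split_features(n_features, rule):
    """Number of predictors considered at each split (hypothetical helper)."""
    if rule == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if rule == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError(f'unknown rule: {rule}')


# compare the two rules for a few feature counts
for p in [16, 64, 128]:
    print(p, n_split_features(p, 'sqrt'), n_split_features(p, 'log2'))
```

The two rules agree for small feature counts and diverge as the number of features grows, with `'log2'` giving the smaller (more strongly decorrelating) subset.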




### **XGBoost**

[XGB](https://towardsdatascience.com/from-zero-to-hero-in-xgboost-tuning-e48b59bfaf58)

### **SVM**

## **Hyperparameter Tuning**

Hyperparameters were selected with a randomized search with 5-fold internal cross-validation.
We used a randomized search rather than an exhaustive search (sometimes called grid search)
 because randomized searches have been shown to achieve similar results
 with significantly lower run times.
Unlike an exhaustive search, where all possible combinations of tuning parameters are validated,
 a randomized search runs for a specified number of iterations and
 validates a randomly sampled set of parameters on each iteration.
In this application of randomized search, 5-fold cross-validation is performed at each iteration.
The best parameters from the search are selected based on the mean cross-validated log loss.
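
The difference in the number of candidates can be illustrated with scikit-learn's `ParameterGrid` and `ParameterSampler`. The search space below is a small assumed example, not the one used in this study.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# a small illustrative search space (not the one used in this study)
space = {
    'n_estimators': list(range(10, 151, 10)),   # 15 values
    'criterion': ['gini', 'entropy'],           # 2 values
    'max_features': ['sqrt', 'log2'],           # 2 values
}

# exhaustive search validates every combination
grid = list(ParameterGrid(space))
# randomized search validates a fixed number of random draws
sampled = list(ParameterSampler(space, n_iter=10, random_state=42))

print(f'exhaustive search candidates: {len(grid)}')
print(f'randomized search candidates: {len(sampled)}')
```

With 5-fold cross-validation, each candidate costs five model fits, so the number of candidates directly drives the run time of the search.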

We ran the random forest and XGBoost hyperparameter searches for 100 iterations.
We only ran the SVM hyperparameter search for 10 iterations due to the high run time of fitting the model.
The tuning parameters for each model used in the case study are shown in tables X-Z.


**Table X. Random Forest Tuning Parameters**

| Parameter           | Search Range                    | Description |
|---------------------|:-------------------------------:|---------------------|
| `n_estimators`      | 10:150                          | Number of decision trees to use in the random forest |
| `criterion`         | One of `'gini', 'entropy'`      | The method for determining best split in decision trees |
| `max_depth`         | 10:100                          | The maximum depth to which decision trees can grow |
| `min_samples_split` | 2:100                           | The minimum number of samples required to make a split |
| `min_samples_leaf`  | 2:100                           | The minimum number of samples required to make a leaf node |
| `max_features`      | One of `'auto', 'sqrt', 'log2'` | The maximum number of features considered when making a split in a tree |

**Table Y. XGBoost Tuning Parameters**

| Parameter           | Search Range                    | Description |
|---------------------|:-------------------------------:|---------------------|
| `A`      |                        |  |
| `B`         | One of       |  |


**Table Z. SVM Tuning Parameters**

| Parameter | Search Range                                | Description |
|-----------|:-------------------------------------------:|---------------------|
| `C`       | 0.001:10\*                                  |  |
| `kernel`  | One of `'linear', 'poly', 'rbf', 'sigmoid'` |  |
| `gamma`   | One of `'scale', 'auto'`                    |  |

\*value distribution on a log scale
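
One way to draw `C` on a log scale is a log-uniform distribution, sketched below. This is an assumption about how such a search space could be written, since the SVM search code is not shown here; `scipy.stats.loguniform` and the name `svm_params` are choices made for the example.

```python
import numpy as np
from scipy.stats import loguniform

# log-uniform distribution over [0.001, 10] (assumed form of the C range)
C_dist = loguniform(1e-3, 1e1)
samples = C_dist.rvs(size=10000, random_state=42)

# on a log scale, each decade should receive roughly a quarter of the draws
for lo, hi in [(1e-3, 1e-2), (1e-2, 1e-1), (1e-1, 1e0), (1e0, 1e1)]:
    share = np.mean((samples >= lo) & (samples < hi))
    print(f'[{lo:g}, {hi:g}): {share:.2f}')

# hypothetical SVM search space built from the ranges in Table Z
svm_params = {
    'C': C_dist,
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
}
```

A dictionary like `svm_params` can be passed to `RandomizedSearchCV`, which accepts both lists and distributions with an `rvs` method.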

# **Results**

* Hyperparameter tuning tables
* Validation results for RF, XGB, SVM <- let's do a test set here
* Scaling times for SVM (1k, 2k, 5k, 10k)

# **Conclusion**

the conclusion

## References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.

# Appendix A

In [1]:
import time
import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    log_loss,
    accuracy_score,
    make_scorer)
from sklearn.model_selection import (
    train_test_split, 
    RandomizedSearchCV, 
    cross_validate)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

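# note: as configured, this scorer passes the hard class labels from
# predict() to log_loss rather than predicted probabilities
log_loss_scorer = make_scorer(log_loss, greater_is_better=False)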
accuracy_scorer = make_scorer(accuracy_score)

random_state = np.random.RandomState(42)

In [2]:


# get the data
data = pd.read_csv('./data/case_8.csv')
# put the target in another variable
target = data.target
# drop off ID and target
data = data.drop(['ID', 'target'], axis=1)
# get train and test sets
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    target,
                                                    test_size=0.33,
                                                    random_state=random_state)

In [3]:
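# label-encode the categorical (object dtype) columns
obj_columns = list(data.select_dtypes(include='object'))
obj_col_encoders = {col: LabelEncoder() for col in obj_columns}

# fit each encoder on the full dataset so that every category
# appearing in either split has a code
for col, encoder in obj_col_encoders.items():
    encoder.fit(data[col])

# work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
for col, encoder in obj_col_encoders.items():
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])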


In [4]:
# random forest

rf_clf = RandomForestClassifier(random_state=random_state)
rf_params = {
    'n_estimators': np.linspace(10, 150, dtype='int'),
    'criterion':['gini', 'entropy'],
    'max_depth': np.linspace(10, 100, dtype='int'),
    'min_samples_split': np.linspace(2, 100, 50, dtype='int'),
    'min_samples_leaf': np.linspace(2, 100, 50, dtype='int'),
    'max_features': ['auto', 'sqrt', 'log2']
}

search_iters = 100

rf_RSCV_start_time = time.time()
# setup search
rf_RSCV = RandomizedSearchCV(rf_clf, rf_params, scoring=log_loss_scorer,
                                 n_iter=search_iters, random_state=random_state)
# search
rf_RSCV.fit(X_train, y_train)

rf_RSCV_end_time = time.time()
duration = rf_RSCV_end_time-rf_RSCV_start_time

print(f'Randomized CV search done. {search_iters} iterations took \
{int(duration // 3600):02d}::{int((duration % 3600)//60):02d}::{int((duration % 3600) % 60):02d}')

Randomized CV search done. 100 iterations took 03::53::19


In [5]:
# print the best parameters chosen by CV
pprint.pprint(rf_RSCV.best_params_)

{'criterion': 'gini',
 'max_depth': 83,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 48,
 'n_estimators': 132}


In [6]:
# get CV results with best parameters
rf_clf.set_params(**rf_RSCV.best_params_)
rf_cv = cross_validate(rf_clf, X_train, y_train, 
                       scoring={
                           'log_loss':log_loss_scorer,
                           'accuracy':accuracy_scorer
                       })

In [7]:
print('RF 5-fold Validation Performance')
# note test_log_loss is negated due to how scorers work 
# in parameter searches in sklearn
print('Mean Log Loss\t{}'.format(np.mean(-rf_cv['test_log_loss'])))
print('Mean Accuracy\t{}'.format(np.mean(rf_cv['test_accuracy'])))

RF 5-fold Validation Performance
Mean Log Loss	7.697942288341889
Mean Accuracy	0.777126444284875


In [8]:
# get performance on test set
rf_clf.fit(X_train, y_train)
# hard class predictions (not probabilities)
rf_y_test_pred = rf_clf.predict(X_test)

print('RF Test Set Performance')
# sklearn metrics take the true labels first, then the predictions
print('Test Log Loss\t{}'.format(log_loss(y_test, rf_y_test_pred)))
print('Test Accuracy\t{}'.format(accuracy_score(y_test, rf_y_test_pred)))

RF Test Set Performance
Test Log Loss	7.640917295253621
Test Accuracy	0.7787732598208132
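
The log loss values above are computed from hard class labels rather than predicted probabilities. In that case every misclassified record contributes the clipping penalty (about $-\ln(10^{-15}) \approx 34.5$, assuming the clipping constant used by scikit-learn versions of that time), so the log loss is approximately $(1 - \text{accuracy}) \times 34.5$; for the test set, $(1 - 0.7788) \times 34.5 \approx 7.64$. The sketch below contrasts the two ways of computing log loss on synthetic data. The dataset, model settings, and variable names are assumptions for illustration only, not part of the case study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# noisy synthetic data (illustrative only, not the case study data)
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                             random_state=0).fit(X_tr, y_tr)

hard = clf.predict(X_te)                # class labels (0 or 1)
proba = clf.predict_proba(X_te)[:, 1]   # estimated P(y = 1)

acc = accuracy_score(y_te, hard)
# log loss on hard labels: each error costs the full clipping penalty
hard_loss = log_loss(y_te, hard.astype(float))
# log loss on probabilities: the usual definition of the metric
proba_loss = log_loss(y_te, proba)

print(f'accuracy:                  {acc:.3f}')
print(f'log loss on hard labels:   {hard_loss:.3f}')
print(f'log loss on probabilities: {proba_loss:.3f}')
```

When log loss is the selection metric, scoring on `predict_proba` output lets the search distinguish confident from marginal predictions, which hard labels cannot do.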


In [23]:
# do some fit times for comparison with the SVM
for size in [1000, 2000, 5000, 10000]:
    sample = random_state.choice(np.arange(len(X_train)), size=size, replace=False)
    X_train_sub = X_train.iloc[sample, :]
    y_train_sub = y_train.iloc[sample]
    start_time = time.time()
    rf_clf.fit(X_train_sub, y_train_sub)
    end_time = time.time()
    duration = end_time - start_time
    print(f'RF fit on {size} records took {duration}')

RF fit on 1000 records took 0.4633328914642334
RF fit on 2000 records took 0.9529027938842773
RF fit on 5000 records took 2.6475236415863037
RF fit on 10000 records took 5.744798898696899
