## Random Forest

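A random forest merges a collection of independent decision trees to get a more accurate and stable prediction.

It is a type of ensemble method: it combines several machine learning models to reduce prediction error. Random forests do this through bagging (each tree is trained on a bootstrap sample of the data), which mainly reduces variance rather than bias.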
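
To make the "merging" concrete, here is a minimal sketch of how scikit-learn's classifier combines its trees: the forest's class probabilities are the mean of the per-tree probabilities. The synthetic data and the names `X`, `y`, `forest`, and `manual_average` are assumptions for illustration only, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the notebook's dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect the class probabilities predicted by each individual tree,
# then average them across trees.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_average = tree_probas.mean(axis=0)

# The forest's own probabilities should match the manual average.
matches = np.allclose(manual_average, forest.predict_proba(X))
print(matches)
```

Averaging many trees trained on different bootstrap samples is what smooths out the instability of any single tree.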

When to use it?
- You have a categorical or continuous target variable
- You are interested in the importance of predictors
- You need a quick benchmark model
- You have messy data, such as missing values or outliers
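
As a rough illustration of the predictor-importance point, a fitted forest exposes impurity-based scores through `feature_importances_`. The sketch below uses synthetic data and hypothetical feature names (`feature_0`, ...). Note that these scores rank features by how much they reduce impurity; they are not statistical significance tests.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the notebook's dataset).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```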

When not to use it?
- If you are solving a very complex, novel problem
- Transparency is important (you need to inspect the details within the model)
- Prediction time is important


In [2]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

In [4]:
print(RandomForestClassifier().get_params())
print(RandomForestRegressor().get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


### Read in Data

In [6]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [7]:
tr_features = pd.read_csv('data/train_features.csv')
tr_labels = pd.read_csv('data/train_labels.csv')

### Hyperparameter tuning

![RF](image/rf.png)

In [8]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [16]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 100, 250],
    'max_depth': [2, 4, 8, 16, 42, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
print_results(cv)

BEST PARAMS: {'max_depth': 4, 'n_estimators': 50}

0.79 (+/-0.107) for {'max_depth': 2, 'n_estimators': 5}
0.805 (+/-0.104) for {'max_depth': 2, 'n_estimators': 50}
0.798 (+/-0.13) for {'max_depth': 2, 'n_estimators': 100}
0.794 (+/-0.121) for {'max_depth': 2, 'n_estimators': 250}
0.809 (+/-0.09) for {'max_depth': 4, 'n_estimators': 5}
0.826 (+/-0.121) for {'max_depth': 4, 'n_estimators': 50}
0.824 (+/-0.121) for {'max_depth': 4, 'n_estimators': 100}
0.824 (+/-0.108) for {'max_depth': 4, 'n_estimators': 250}
0.818 (+/-0.066) for {'max_depth': 8, 'n_estimators': 5}
0.817 (+/-0.065) for {'max_depth': 8, 'n_estimators': 50}
0.822 (+/-0.066) for {'max_depth': 8, 'n_estimators': 100}
0.817 (+/-0.079) for {'max_depth': 8, 'n_estimators': 250}
0.792 (+/-0.055) for {'max_depth': 16, 'n_estimators': 5}
0.811 (+/-0.036) for {'max_depth': 16, 'n_estimators': 50}
0.811 (+/-0.018) for {'max_depth': 16, 'n_estimators': 100}
0.807 (+/-0.022) for {'max_depth': 16, 'n_estimators': 250}
0.802 (+/-0.047)
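
The printed list above can be hard to scan. One possible alternative, sketched here on synthetic data with a deliberately small grid, is to load `cv_results_` into a DataFrame and sort by rank. The names `small_grid`, `search`, and `results_table` are assumptions for illustration; the column names are the ones `GridSearchCV` produces for the parameters being tuned.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, used only to keep the sketch fast.
X, y = make_classification(n_samples=150, n_features=6, random_state=0)
small_grid = {'n_estimators': [5, 20], 'max_depth': [2, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X, y)

# One row per parameter combination, best-ranked combination first.
columns = ['param_max_depth', 'param_n_estimators',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results_table = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(results_table.to_string(index=False))
```

Seeing the mean and standard deviation side by side makes it easier to judge whether the top-ranked combination is clearly better or within noise of its neighbours.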

In [17]:
cv.best_estimator_

RandomForestClassifier(max_depth=4, n_estimators=50)

### Write out pickled model

In [18]:
joblib.dump(cv.best_estimator_, 'model/RF_model.pkl')

['RF_model.pkl']
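
To use the saved model later, it can be read back with `joblib.load`. The sketch below is a self-contained round trip on synthetic data, writing to a temporary directory rather than the `model/` folder used above; the names `model`, `restored`, and `model_path` are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the notebook's dataset).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0).fit(X, y)

# Write the fitted model to disk, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'RF_model.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.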