# Random forests

Random forest is a type of ensemble method.
An ensemble combines several machine learning models to reduce prediction error (bias, variance, or both); a random forest mainly reduces variance.
The overall idea is that multiple decision trees are fit, each on a different sample of the training data, and their predictions are combined: a majority vote for classification, an average for regression.
In training, the algorithm takes N samples from the training data (for illustration, say four or five), so we now have N subsets of our overall data. Each subset contains a subset of the rows and also a subset of the columns.
We then have N trees that are all built independently on different subsets of the data.
The trees are kept independent because you want each decision tree to key in on different relationships in the data.
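
The sampling step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `draw_subsets` and `majority_vote` are hypothetical, rows are drawn with replacement (a bootstrap sample), and columns are drawn once per tree as in the description above. scikit-learn itself draws the candidate features at each split rather than once per tree.

```python
import random
from collections import Counter


def draw_subsets(n_rows, n_cols, n_trees, cols_per_tree, seed=0):
    """Return one (row_indices, col_indices) pair per tree (hypothetical helper)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_trees):
        # Rows: sampled with replacement, same size as the original data
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Columns: sampled without replacement
        cols = sorted(rng.sample(range(n_cols), cols_per_tree))
        subsets.append((rows, cols))
    return subsets


def majority_vote(predictions):
    """Combine per-tree class predictions into a single forest prediction."""
    return Counter(predictions).most_common(1)[0][0]


subsets = draw_subsets(n_rows=10, n_cols=6, n_trees=4, cols_per_tree=3)
for rows, cols in subsets:
    print("rows:", rows, "cols:", cols)

# Three of the four trees vote for class 1, so the forest predicts class 1
print(majority_vote([1, 0, 1, 1]))
```

Each tree would be trained only on its own rows and columns, which is what pushes the trees toward different relationships in the data.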

# Importing libraries

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Exploring Random forest hyperparameters

In [2]:
print(RandomForestClassifier())
print(RandomForestRegressor())

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      rand

Here we are mostly interested in the parameters max_depth (the maximum depth each individual tree can grow to) and n_estimators (the number of trees in the forest).
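
As a rough check of what these two parameters control, the sketch below fits a small forest and inspects it. It uses synthetic data from `make_classification`, not this notebook's dataset, and the values 10 and 3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (not the notebook's dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
clf.fit(X, y)

# n_estimators controls how many trees are built
n_trees = len(clf.estimators_)
# max_depth caps how deep each individual tree may grow
depths = [tree.get_depth() for tree in clf.estimators_]

print("number of trees:", n_trees)
print("deepest tree:", max(depths))
```

With max_depth=None, the default shown in the output above, the trees keep growing until the other stopping rules (such as min_samples_split) apply.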

# Loading data

In [3]:
tr_features = pd.read_csv('./dataset/train_features.csv')
tr_labels = pd.read_csv('./dataset/train_labels.csv')

# Hyperparameter tuning

We will compare cross-validated accuracy for different values of the hyperparameters max_depth and n_estimators.

Helper function to print the grid search results in a readable format:

In [4]:
def print_results(results):
    """Print the best parameters, then the mean CV score for every combination."""
    print('Best params: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    # One line per combination: mean score (+/- two standard deviations across folds)
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
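
To see the output format without running a full search, the function can be called on a stand-in object. In this sketch `fake_results` is a hypothetical substitute for a fitted GridSearchCV, and its scores are made up purely to show the formatting.

```python
from types import SimpleNamespace


def print_results(results):
    print('Best params: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))


# Hypothetical stand-in for a fitted GridSearchCV; the numbers are invented
fake_results = SimpleNamespace(
    best_params_={'max_depth': 4},
    cv_results_={
        'mean_test_score': [0.8, 0.825],
        'std_test_score': [0.05, 0.025],
        'params': [{'max_depth': 2}, {'max_depth': 4}],
    },
)

print_results(fake_results)
```

The figure in parentheses is twice the standard deviation of the score across folds, so a standard deviation of 0.05 is printed as +/-0.1.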

Train the model with GridSearchCV (5-fold cross-validation) and find the best hyperparameters:

In [5]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

Best params: {'max_depth': 4, 'n_estimators': 250}

0.805 (+/-0.146) for {'max_depth': 2, 'n_estimators': 5}
0.811 (+/-0.082) for {'max_depth': 2, 'n_estimators': 50}
0.802 (+/-0.131) for {'max_depth': 2, 'n_estimators': 250}
0.805 (+/-0.069) for {'max_depth': 4, 'n_estimators': 5}
0.817 (+/-0.083) for {'max_depth': 4, 'n_estimators': 50}
0.826 (+/-0.104) for {'max_depth': 4, 'n_estimators': 250}
0.82 (+/-0.044) for {'max_depth': 8, 'n_estimators': 5}
0.817 (+/-0.058) for {'max_depth': 8, 'n_estimators': 50}
0.818 (+/-0.065) for {'max_depth': 8, 'n_estimators': 250}
0.805 (+/-0.044) for {'max_depth': 16, 'n_estimators': 5}
0.813 (+/-0.016) for {'max_depth': 16, 'n_estimators': 50}
0.809 (+/-0.034) for {'max_depth': 16, 'n_estimators': 250}
0.8 (+/-0.093) for {'max_depth': 32, 'n_estimators': 5}
0.815 (+/-0.046) for {'max_depth': 32, 'n_estimators': 50}
0.811 (+/-0.021) for {'max_depth': 32, 'n_estimators': 250}
0.815 (+/-0.037) for {'max_depth': None, 'n_estimators': 5}
0.815 (+/-0.038
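
For a sense of how much work the search above does, the grid can be enumerated by hand. This is a minimal sketch using only the standard library; it mirrors the parameters dictionary and cv=5 from the cell above, and the variable names `combinations` and `n_folds` are chosen here for illustration.

```python
from itertools import product

parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None],
}
n_folds = 5  # matches cv=5 in the search above

# Build every combination of the listed values, one dict per combination
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combinations))            # 3 * 6 = 18 parameter combinations
print(len(combinations) * n_folds)  # 18 * 5 = 90 cross-validation fits
```

With the default refit=True, GridSearchCV also fits the best combination once more on the full training set, which is the model exposed as best_estimator_.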

In [6]:
cv.best_estimator_

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=4, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=250,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

# Write out the best model as a pickle

In [7]:
joblib.dump(cv.best_estimator_, './models/Random_forest_model.pkl')

['./models/Random_forest_model.pkl']
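
A saved model can be read back with joblib.load. The sketch below shows the round trip on a stand-in object in a temporary directory, so it runs without the notebook's files; the dictionary `model_stand_in` and the file name `example_model.pkl` are hypothetical.

```python
import os
import tempfile

import joblib

# Stand-in object; in the notebook this would be cv.best_estimator_
model_stand_in = {'max_depth': 4, 'n_estimators': 250}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.pkl')  # hypothetical file name
    joblib.dump(model_stand_in, path)   # write the object to disk
    restored = joblib.load(path)        # read it back

print(restored == model_stand_in)
```

In the notebook itself the equivalent call would be joblib.load('./models/Random_forest_model.pkl'), which returns the fitted classifier ready for predict.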