## Bagging: Implement a bagging model

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Random Forest model.

### Read in Data

_In this final video in the bagging chapter we're going to try to build the best Random Forest model we can on this Titanic dataset by using the same process we did for Gradient Boosting - we will search for the best hyperparameter settings for the Random Forest model using `GridSearchCV`._

_Let's start by importing the same packages we imported last chapter - so that is `joblib` to save out our model, `pandas` to read in our data, and then our classifier and `GridSearchCV` from `sklearn`. And then we will read in the training features and the training labels._

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv')

### Hyperparameter tuning

_Recall this helper function we used in the last chapter to help us print out the average accuracy score and the standard deviation of that accuracy score (across the 5 folds built into our Cross Validation) for each hyperparameter combination._

In [2]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

_So let's walk through this code again, this should look familiar as it's basically the same as what we ran for Gradient Boosting. We have the `RandomForestClassifier` object and we've stored it as `rf`._

_Then we define our hyperparameter dictionary. One more reminder that the keys in this dictionary align with the names of the hyperparameters that would be passed into `RandomForestClassifier`._

_And then we just need to define the list of settings we want to test for each hyperparameter. So for number of estimators we will test out building 5 decision trees, 50 decision trees, 250 decision trees, and 500 decision trees._

_And then for max depth we test 4, 8, 16, 32, and None. None will just let each tree grow until every leaf is pure or holds fewer samples than the `min_samples_split` setting of `RandomForestClassifier`._
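_To make the effect of `max_depth` a little more concrete, here is a minimal sketch that compares the depth of the individual trees in a capped and an uncapped forest. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic features), and the sample sizes and `random_state` values are arbitrary choices for illustration._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic training set)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One forest with a depth cap, one that is allowed to grow freely
capped = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0).fit(X, y)
uncapped = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0).fit(X, y)

# Each fitted tree in the forest reports its own depth
capped_depths = [tree.get_depth() for tree in capped.estimators_]
uncapped_depths = [tree.get_depth() for tree in uncapped.estimators_]

print('deepest tree with max_depth=4:   ', max(capped_depths))
print('deepest tree with max_depth=None:', max(uncapped_depths))
```

_The capped forest should never report a tree deeper than 4, while the uncapped trees are free to keep splitting._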

_Again, we are testing deeper trees here than we did for Gradient Boosting and that's expected based on the way we know these two algorithms optimize that bias/variance tradeoff. Random Forest starts with deep trees that have high variance and low bias._

_So now we call `GridSearchCV`, pass in our model object (`rf`), the hyperparameter dictionary, and tell it we want to do 5-fold Cross Validation._

_We just call `.fit()` with our training features and labels (and convert the labels from a column vector to a 1-D array). And then `GridSearchCV` will run 5-fold Cross Validation for each hyperparameter setting combination._

_And then we will just print out our results using the `print_results()` function._

_So let's go ahead and run this._
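_As a quick illustration of that label conversion, here is a small sketch of what `.values.ravel()` does to a single-column DataFrame. The column name `Survived` and the four label values are hypothetical and only there for the example._

```python
import pandas as pd

# Hypothetical single-column label frame (column name and values are assumptions)
labels = pd.DataFrame({'Survived': [0, 1, 1, 0]})

# .values gives a 2-D column vector: one row per passenger, one column
print(labels.values.shape)  # (4, 1)

# .ravel() flattens it into the 1-D array that .fit() expects for labels
flat = labels.values.ravel()
print(flat.shape)  # (4,)
```

_Passing the 2-D column vector directly would typically still work, but `sklearn` warns about it, so flattening first keeps the output clean._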

In [3]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [4, 8, 16, 32, None]
}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'max_depth': 8, 'n_estimators': 250}

0.809 (+/-0.098) for {'max_depth': 4, 'n_estimators': 5}
0.813 (+/-0.106) for {'max_depth': 4, 'n_estimators': 50}
0.824 (+/-0.108) for {'max_depth': 4, 'n_estimators': 250}
0.826 (+/-0.106) for {'max_depth': 4, 'n_estimators': 500}
0.813 (+/-0.073) for {'max_depth': 8, 'n_estimators': 5}
0.818 (+/-0.076) for {'max_depth': 8, 'n_estimators': 50}
0.828 (+/-0.067) for {'max_depth': 8, 'n_estimators': 250}
0.818 (+/-0.07) for {'max_depth': 8, 'n_estimators': 500}
0.792 (+/-0.029) for {'max_depth': 16, 'n_estimators': 5}
0.811 (+/-0.029) for {'max_depth': 16, 'n_estimators': 50}
0.811 (+/-0.029) for {'max_depth': 16, 'n_estimators': 250}
0.805 (+/-0.021) for {'max_depth': 16, 'n_estimators': 500}
0.79 (+/-0.039) for {'max_depth': 32, 'n_estimators': 5}
0.803 (+/-0.036) for {'max_depth': 32, 'n_estimators': 50}
0.807 (+/-0.032) for {'max_depth': 32, 'n_estimators': 250}
0.817 (+/-0.03) for {'max_depth': 32, 'n_estimators': 500}
0.805 (+/-0

_Before we dig into the results, I want to call out that with `RandomForest`, even if I ran this exact cell again on the SAME exact training set - I would get different results. That's because each time you run `RandomForest` it is randomly sampling rows and columns internally (like we discussed earlier in this chapter) to build each decision tree. So you'll get different results each time you run `RandomForest`._
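_If you do want repeatable results, one common option is to fix the `random_state` argument of `RandomForestClassifier`. The sketch below is only an illustration on synthetic data (an assumption, not the Titanic set) and the seed values are arbitrary; the notebook itself does not set a seed._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic training set)
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# Two separate fits that share the same seed
first = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)
second = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# With a fixed seed the row and column sampling is repeated exactly,
# so both forests produce the same predicted probabilities
same = bool((first.predict_proba(X) == second.predict_proba(X)).all())
print('Identical predicted probabilities:', same)
```

_The same idea should carry over to the grid search by creating the model as `RandomForestClassifier(random_state=...)` before passing it to `GridSearchCV`._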

_Also - remember that these results are on unseen data thanks to the way the Cross Validation built into GridSearchCV splits up the data._

_Now, looking at the results. The best result uses 250 estimators with a max depth of 8, which gives an accuracy of 82.8%, so this is the best Cross-Validation performance that we've seen thus far. Feel free to dig through these results but there are two things I want to call out quickly:_
1. _It's clear that 5 estimators isn't quite enough as for every combination, the ones with only 5 estimators generate the worst results._
2. _As you scroll through the results you can see that the combinations with a `max_depth` of 4 or 8 do quite well, but accuracy starts to fall off at 16 and 32, which suggests that some of these deeper trees might be overfitting just a little bit._
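
_If scrolling through the printed lines gets tedious, one alternative is to load `cv_results_` into a `pandas` DataFrame and sort it by rank. This is a minimal sketch on synthetic data with a deliberately tiny grid (both are assumptions made to keep it fast), not a rerun of the search above._

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small illustrative grid (both assumptions)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
parameters = {'n_estimators': [5, 20], 'max_depth': [2, None]}

cv = GridSearchCV(RandomForestClassifier(random_state=0), parameters, cv=3)
cv.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so it maps directly onto a DataFrame
results = pd.DataFrame(cv.cv_results_)
cols = ['param_max_depth', 'param_n_estimators',
        'mean_test_score', 'std_test_score', 'rank_test_score']

# Best combination first
table = results[cols].sort_values('rank_test_score')
print(table.to_string(index=False))
```

_Sorting by `rank_test_score` puts the combination that `GridSearchCV` considers best in the first row, which makes patterns across `max_depth` and `n_estimators` easier to spot._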

_Then let's ask the `GridSearchCV` object to print out the best model based on performance on unseen data._

In [4]:
cv.best_estimator_

RandomForestClassifier(max_depth=8, n_estimators=250)

### Write out pickled model

_Lastly, we'll just write out the fitted model using `joblib.dump()` - we pass in the model object and tell it to write to the same location where we stored our Gradient Boosting model._

In [5]:
joblib.dump(cv.best_estimator_, '../models/RF_model.pkl')

['../models/RF_model.pkl']
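
_For completeness, here is a minimal sketch of the round trip: writing a fitted model with `joblib.dump()` and reading it back with `joblib.load()`. It uses synthetic data and a temporary directory (both assumptions) so that it does not touch the `../models/` folder used above._

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model (assumptions for illustration)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Write the model to a temporary location, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'RF_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should make the same predictions as the original
round_trip_ok = bool((model.predict(X) == restored.predict(X)).all())
print('Predictions match after reload:', round_trip_ok)
```

_Loading the pickled model this way is how it would typically be picked up later for comparison against the other saved models._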

_In the next chapter, we will cover the last of our ensemble techniques...stacking._