## Random Forest: Fit and evaluate a model

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit a Random Forest model and tune its hyperparameters with cross-validated grid search.

### Read in Data

_Welcome back to the final lesson in the Random Forest chapter. We're going to use `GridSearchCV` to run a grid search with k-fold cross-validation and find the hyperparameter settings that give the best-performing Random Forest model._

_Let's start by importing the packages we will need for this lesson, then read in the training features and the training labels._

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv', header=None)
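Because the labels file is read with `header=None`, `tr_labels` comes back as a one-column DataFrame rather than a flat array, and scikit-learn estimators expect a 1-D `y`. A minimal sketch of the difference, using a small in-memory CSV as a stand-in for `train_labels.csv` (the values are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for train_labels.csv: one label per line, no header row
fake_csv = io.StringIO("0\n1\n1\n0\n")
labels = pd.read_csv(fake_csv, header=None)

print(labels.shape)            # 2-D: four rows, one column

# Flatten the single column into a 1-D NumPy array
flat = labels.values.ravel()
print(flat.shape)              # 1-D: four elements
```

Passing the flattened array to `.fit()` avoids the column-vector warning that scikit-learn raises for a 2-D `y`.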

### Hyperparameter tuning

![RF](img/rf.png)

_A quick reminder of what the two hyperparameters we will be tuning represent. The number of estimators (`n_estimators`) is how many individual decision trees to build, and max depth (`max_depth`) dictates how deep each of those trees can go._
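To make the two hyperparameters concrete, here is a small sketch that fits a forest on synthetic data (a stand-in for the Titanic features, not the real dataset) and inspects the trees it built. The values `n_estimators=5` and `max_depth=3` are chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only as a stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X, y)

# n_estimators controls how many trees are built
print(len(forest.estimators_))

# max_depth caps how deep each individual tree may grow
print([tree.get_depth() for tree in forest.estimators_])
```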

In [2]:
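def print_results(results):
    """Print the best hyperparameters and the CV score of every combination."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    # Mean and standard deviation of the test score across the CV folds
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        # The +/- value is two standard deviations of the fold scores
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))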
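As an alternative view of the same information, `cv_results_` can be loaded into a pandas DataFrame and sorted by rank. This is only a sketch on synthetic data with a deliberately tiny grid; the variable names `search` and `results_table` are illustrative and not part of the lesson.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, purely for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 20], 'max_depth': [1, 3]},
    cv=5,
)
search.fit(X, y)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results_table = pd.DataFrame(search.cv_results_)[columns]
results_table = results_table.sort_values('rank_test_score')
print(results_table)
```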

_So we instantiate `RandomForestClassifier` and store it as `rf`. We're not hardcoding any hyperparameter values, so we leave the parentheses empty, as we have in each chapter._

_Let's define our hyperparameter dictionary. The keys in this dictionary match the names of the hyperparameters that would be passed into `RandomForestClassifier`, and then we just need to define the list of settings we want to test for each one._

_So for number of estimators we will test out building 5 decision trees, 50 decision trees, and 250 decision trees._

_And then for max depth we will start with 1 (what's called a decision stump) and then we'll also test out 5, 9, 13, 17, and None. With None, each tree keeps splitting until every leaf is pure or holds fewer samples than `min_samples_split`._
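The size of this search is easy to work out: 3 values for `n_estimators` times 6 values for `max_depth`. The sketch below counts the combinations with scikit-learn's `ParameterGrid`, and assumes the 5-fold cross-validation used in this lesson to estimate the number of model fits.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [1, 5, 9, 13, 17, None],
}

# ParameterGrid enumerates every combination of the listed settings
grid = ParameterGrid(parameters)
n_combinations = len(grid)      # 3 values x 6 values
n_folds = 5                     # assuming 5-fold cross-validation

print(n_combinations)           # 18
print(n_combinations * n_folds) # 90 fits during cross-validation
```

On top of those 90 fits, `GridSearchCV` refits the best combination once on the full training set by default.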

_Now we call `GridSearchCV`, passing in our model object (`rf`) and the hyperparameter dictionary, and tell it we want 5-fold cross-validation._

_We just call `.fit()` with our training features and labels, like we have in every other chapter, and that will run 5-fold cross-validation for each combination of hyperparameter settings._

_So let's look at the results._
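To see what the search does under the hood, here is a rough hand-written equivalent: loop over every combination, score each one with 5-fold cross-validation, and keep the best. It is a simplified sketch on synthetic data with a small grid, not how `GridSearchCV` is actually implemented; the real class also handles scoring options, parallelism, and result bookkeeping.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data and a small grid, purely for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {'n_estimators': [5, 20], 'max_depth': [1, 3]}

best_score, best_params = -np.inf, None
for params in ParameterGrid(param_grid):
    model = RandomForestClassifier(random_state=0, **params)
    # 5-fold cross-validation: five accuracy scores, one per held-out fold
    fold_scores = cross_val_score(model, X, y, cv=5)
    if fold_scores.mean() > best_score:
        best_score, best_params = fold_scores.mean(), params

# Refit the winning combination on all of the training data
best_model = RandomForestClassifier(random_state=0, **best_params).fit(X, y)
print(best_params, round(best_score, 3))
```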

In [3]:
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [1, 5, 9, 13, 17, None]
}
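cv = GridSearchCV(rf, parameters, cv=5)
# .values.ravel() flattens the single-column labels DataFrame into the 1-D array that .fit() expects
cv.fit(tr_features, tr_labels.values.ravel())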

print_results(cv)


BEST PARAMS: {'max_depth': 5, 'n_estimators': 50}

0.79 (+/-0.057) for {'max_depth': 1, 'n_estimators': 5}
0.783 (+/-0.115) for {'max_depth': 1, 'n_estimators': 50}
0.779 (+/-0.106) for {'max_depth': 1, 'n_estimators': 250}
0.801 (+/-0.102) for {'max_depth': 5, 'n_estimators': 5}
0.826 (+/-0.119) for {'max_depth': 5, 'n_estimators': 50}
0.824 (+/-0.084) for {'max_depth': 5, 'n_estimators': 250}
0.818 (+/-0.075) for {'max_depth': 9, 'n_estimators': 5}
0.824 (+/-0.075) for {'max_depth': 9, 'n_estimators': 50}
0.822 (+/-0.057) for {'max_depth': 9, 'n_estimators': 250}
0.792 (+/-0.045) for {'max_depth': 13, 'n_estimators': 5}
0.813 (+/-0.049) for {'max_depth': 13, 'n_estimators': 50}
0.816 (+/-0.04) for {'max_depth': 13, 'n_estimators': 250}
0.794 (+/-0.044) for {'max_depth': 17, 'n_estimators': 5}
0.807 (+/-0.025) for {'max_depth': 17, 'n_estimators': 50}
0.811 (+/-0.03) for {'max_depth': 17, 'n_estimators': 250}
0.8 (+/-0.065) for {'max_depth': None, 'n_estimators': 5}
0.816 (+/-0.039) f



_Let's take a quick look at the best estimator, chosen by mean cross-validated test score._

In [4]:
cv.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
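The relationship between the search object and `best_estimator_` can be checked directly. The sketch below uses synthetic data and a small grid (both assumptions for illustration) to show that the refit model carries the winning hyperparameters and that predicting through the search object gives the same result as predicting with `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, purely for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 20], 'max_depth': [1, 3]},
    cv=5,
)
search.fit(X, y)

best = search.best_estimator_

# The refit model carries the winning hyperparameter values
print(best.n_estimators, best.max_depth, search.best_params_)

# Mean cross-validated accuracy of the winning combination
print(round(search.best_score_, 3))

# Calling predict on the search object delegates to best_estimator_
print((search.predict(X) == best.predict(X)).all())
```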

### Write out pickled model

_Lastly, let's write out this model so we can compare it to the other models a little later on. We just call `joblib.dump()`, pass in the model object, and tell it to write to the same location where we stored our other models: `../RF_model.pkl`._

In [5]:
joblib.dump(cv.best_estimator_, '../RF_model.pkl')

['../RF_model.pkl']
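For reference, reading a saved model back is the mirror image of writing it: `joblib.load()` returns the fitted estimator. The sketch below does a round trip in a temporary directory with a small model trained on synthetic data, so it does not touch the `../RF_model.pkl` file written above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small model, purely for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'RF_model.pkl')
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back as a fitted estimator

# The restored model makes the same predictions as the original
same_predictions = (restored.predict(X) == model.predict(X)).all()
print(same_predictions)
```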