## Boosting: Implement a boosting model

This section uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Gradient Boosting model.

### Read in Data

_In this FINAL video of this chapter, we'll fit the best Gradient Boosting model we can by using `GridSearchCV` to tune three key hyperparameters. If you want to learn more about `GridSearchCV`, you should take my Applied Machine Learning Foundations course, which talks more about the proper framework to fit and evaluate models. In short, `GridSearchCV` allows us to easily search through a number of different hyperparameter combinations to find the one that generates the best performance on unseen data._

_So we will find the best boosted model we can in this video and we will save that fit model. We will do the same in the bagging and stacking sections. Then in the last chapter of this course we will compare the best boosting, bagging, and stacking models against one another on the validation set to see which model performs best._

_Let's start by importing a few packages:_
* _`joblib` will help us save our fit model at the end of this lesson_
* _`pandas` to read our data into a dataframe_
* _and then `GradientBoostingClassifier` and `GridSearchCV` from `sklearn`_

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv')

### Hyperparameter tuning

_A quick reminder of what the three hyperparameters we will be tuning represent:_
- _Number of estimators simply represents how many individual decision trees to build_
- _Max depth dictates how deep each of those trees can go_
- _Learning rate controls how quickly the algorithm tries to find the optimal model: too large and it can overshoot and never find the optimal solution; too small and it may also fail to find it, and even if it does, it will take a long time to get there_
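
_As a minimal sketch of where these three settings live, the snippet below sets them explicitly on a `GradientBoostingClassifier` and fits it on a small synthetic dataset. The data from `make_classification`, the `_demo` names, and the specific values are assumptions for illustration only; this is not the Titanic training set._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The three hyperparameters discussed above, set explicitly
gb_demo = GradientBoostingClassifier(
    n_estimators=50,    # how many trees (boosting stages) to build
    max_depth=3,        # how deep each individual tree may grow
    learning_rate=0.1,  # how strongly each tree's contribution is weighted
    random_state=0,
)
gb_demo.fit(X_demo, y_demo)

# For a binary target there is one fitted tree per boosting stage
print(gb_demo.estimators_.shape)
```

_Printing the shape of `estimators_` is just a quick way to confirm that `n_estimators` really does control how many trees get built._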

_Now, the `GridSearchCV` object stores a LOT of information about model performance, but it can be kind of difficult to pick through to find what you need. So I wrote a quick little function here for us to use to print the results a little more cleanly. I'm not going to go through it in detail, but in essence, for every hyperparameter combination it prints the average accuracy score and a band of plus or minus two standard deviations of that score (across the 5 folds built into our Cross Validation). This will give us the information we need to select the optimal hyperparameter settings._

In [2]:
def print_results(results):
    """Print the best parameters, then the mean score (+/- 2 std) for each combination."""
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        # std * 2 reports a band of two standard deviations across the folds
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

_Ok - let's get into the actual process of searching for the best hyperparameter settings using `GridSearchCV`. We start by instantiating `GradientBoostingClassifier` and storing it as `gb`. If we wanted to hardcode any hyperparameter values, we would enter them in these parentheses. Otherwise, it will just use the defaults, which we saw in the last lesson. We don't want to hardcode any values at this moment because we want to use `GridSearchCV` to test different hyperparameter settings._

_The first thing we need to do for `GridSearchCV` is define our hyperparameter dictionary. The hyperparameters we want to tune are `n_estimators`, `max_depth`, and `learning_rate` (all of the rest will be left at their default values)._

_So for number of estimators we will test out building 5, 50, 250, and 500 decision trees._

_And then for max depth we will start with 1 (what's called a decision stump) and then we'll also test out 3, 5, 7, and 9._

_For learning rate we will test 0.01, 0.1, 1, 10, and 100._

_So now call `GridSearchCV`, pass in our model object (`gb`), the hyperparameter dictionary, and tell it we want to do 5-fold Cross Validation._

_We just call `.fit()` with our training features and labels, and that will fit a model with each hyperparameter combination and evaluate them to see which is the best one. One note here - `tr_labels` is stored as a single-column DataFrame (a column vector), but what `sklearn` really wants is a one-dimensional array - so we will just convert it from a column vector to an array using `.values.ravel()`._
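
_To make the `.values.ravel()` step concrete, here is a small sketch of the shape change it performs. The `labels_demo` frame and its `Survived` column name are hypothetical stand-ins for what `pd.read_csv` gives us for the labels file._

```python
import pandas as pd

# Hypothetical one-column label frame, shaped like a labels CSV read by pandas
labels_demo = pd.DataFrame({'Survived': [0, 1, 1, 0]})

# A single-column DataFrame is two-dimensional: (rows, 1)
print(labels_demo.shape)

# .values gives the underlying NumPy array; .ravel() flattens it to one dimension
flat = labels_demo.values.ravel()
print(flat.shape)
```

_The flattened array has one entry per row, which is the label shape `sklearn` expects when fitting._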

_Lastly, let's call our `print_results` function to show us how each model performed._

In [3]:
gb = GradientBoostingClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100]
}
cv = GridSearchCV(gb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}

0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5}
0.796 (+/-0.115) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50}
0.796 (+/-0.115) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 250}
0.811 (+/-0.117) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500}
0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 5}
0.811 (+/-0.069) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
0.83 (+/-0.074) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 250}
0.841 (+/-0.077) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}
0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 5}
0.82 (+/-0.051) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50}
0.82 (+/-0.037) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 250}
0.82 (+/-0.036) for {'learning_rate

_I want to note two things here:_
1. _These results are still on unseen data. The Cross Validation built into `GridSearchCV` splits the training data into k parts, trains the model on k-1 parts, evaluates it on the remaining part, and repeats this so that each part is held out once_
2. _If you're running this along with me at home, you might have a different training set than I do and there is some randomization built into some of these algorithms. So it's very possible that you may get different results than I do. In fact, I could even run this cell again and get slightly different results._
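
_The splitting described in the first point can be sketched with `KFold` on a toy array. This is an illustration under assumptions: the 10-row `X_demo` array is made up, and plain `KFold` is used to keep things simple, whereas for a classifier with an integer `cv`, `GridSearchCV` uses a stratified variant that also balances the classes in each fold._

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy rows with 2 features each (an assumption for illustration)
X_demo = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
fold_sizes = []
for train_idx, test_idx in kf.split(X_demo):
    # Each round trains on 4 of the 5 parts and scores on the held-out part
    fold_sizes.append((len(train_idx), len(test_idx)))
    print('train rows:', train_idx, 'held-out rows:', test_idx)
```

_On the second point, passing a fixed `random_state` to `GradientBoostingClassifier` is one way to make repeated runs on the same data reproducible._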

_There are A LOT of hyperparameter combinations here. Remember, we tested 4 levels of `n_estimators`, 5 levels of `max_depth`, and 5 levels of `learning_rate`, so that makes for 100 TOTAL hyperparameter combinations, each of which is fit and scored across the 5 folds._
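
_As a quick sanity check on that count, `ParameterGrid` from `sklearn` can enumerate the combinations in the same dictionary. The `n_combinations` and `n_folds` names are just illustrative._

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100],
}

n_combinations = len(ParameterGrid(parameters))  # 4 * 5 * 5 combinations
n_folds = 5

print(n_combinations)            # number of hyperparameter combinations
print(n_combinations * n_folds)  # number of cross-validation fits
```

_This is why the grid search takes a while: every combination is trained once per fold._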

_We can see that the best model has a learning rate of 0.01, max depth of 3, and 500 total estimators. That combination is generating an accuracy of 84.1%._

_Two things I'll point out in the results:_
1. _High learning rate is generating really poor results across the board, indicating that it's jumping across that loss curve too quickly and not finding the optimal model_
2. _Models with 5 estimators are pretty consistently the worst, but depending on the other settings, we have 50, 250, and 500 estimators all generating fairly good models._

_Feel free to dig through these results yourselves and see what other insights you may be able to pull out that will help build your intuition for future model builds._
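
_One way to dig through the results, sketched here under assumptions, is to load `cv_results_` into a `pandas` DataFrame and sort by rank. To keep it quick and self-contained, this uses synthetic data and a deliberately tiny grid (`small_grid`, `cv_demo`, and the `_demo` data are all hypothetical names), not the search we just ran._

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=150, n_features=5, random_state=0)

# A deliberately small grid so the sketch runs in seconds
small_grid = {'n_estimators': [5, 20], 'learning_rate': [0.1, 1]}
cv_demo = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
cv_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it converts cleanly to a DataFrame
results_df = pd.DataFrame(cv_demo.cv_results_)
cols = ['param_n_estimators', 'param_learning_rate',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df[cols].sort_values('rank_test_score'))
```

_With the real search, the same pattern applied to `cv.cv_results_` would let you filter or group by any one hyperparameter to see its effect._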

_Next, let's access the `best_estimator_` attribute of this `GridSearchCV` object, and it will return the fit model that performed best on unseen data._

In [4]:
cv.best_estimator_

GradientBoostingClassifier(learning_rate=0.01, n_estimators=500)

### Write out pickled model

_Lastly, let's write out this model so we can compare it to the other models in the last chapter of this course. We just call `joblib.dump()`, pass in the model object, and tell it to write to our models folder. It's important to remember that this is saving your model that has been fit on the training data. So once it's saved, we can read it back into Python and start making predictions on data it has never seen before._

In [5]:
joblib.dump(cv.best_estimator_, '../models/GB_model.pkl')

['../models/GB_model.pkl']
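
_Reading a saved model back in can be sketched as a round trip with `joblib.dump()` and `joblib.load()`. To stay self-contained, this uses a small model fit on synthetic data and a temporary directory; the `GB_demo.pkl` file name and the `_demo` data are hypothetical, not the files from this course._

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data and a small fitted model (assumptions for illustration)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
model = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'GB_demo.pkl')  # hypothetical file name
    joblib.dump(model, path)       # write the fitted model to disk
    reloaded = joblib.load(path)   # read it back as a ready-to-use model

# The reloaded model should predict exactly as the original did
same_predictions = (reloaded.predict(X_demo) == model.predict(X_demo)).all()
print(same_predictions)
```

_The reloaded object is already fit, so it can call `.predict()` right away without retraining._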

_Hopefully this chapter has given you a pretty good grasp of boosting and how to implement it in Python. In the next chapter, we are going to take a look at a different type of ensemble learning called bagging._