## Stacking: Implement a stacking model

This section uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will fit and evaluate a simple stacked model.

### Read in Data

_In this final video in the stacking chapter, we're going to try to build the best stacked model we can, using the same process we went through in prior chapters._

_Let's start by importing our packages. We have the same `joblib`, `pandas` to read in our data, and `GridSearchCV` from `sklearn` that we have used in prior chapters. We are also going to import `StackingClassifier` from `sklearn`, along with the objects we will use to fit our base models: `GradientBoostingClassifier` and `RandomForestClassifier`. We will stick with those as our base models since we are pretty familiar with them from earlier in the course. Finally, we import `LogisticRegression` from `sklearn`, which we will use for our meta model. If you want to learn more about logistic regression and its key hyperparameters, take a look at my algorithms course in this Applied ML series._

_Lastly, we will read in the training features and the training labels._

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

tr_features = pd.read_csv('../train_features.csv')
tr_labels = pd.read_csv('../train_labels.csv')

### Hyperparameter tuning

_We will run the cell for our helper function, which prints the mean accuracy score and twice the standard deviation of that score across the 5 cross-validation folds for each hyperparameter combination._

In [2]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
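
_The fields that `print_results()` reads can also be viewed as a table. This is a minimal sketch, assuming synthetic data from `make_classification` and a small logistic regression grid rather than the Titanic features and the stacked model; the names `search` and `summary` are illustrative only._

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic features)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small illustrative grid over the regularization strength C
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1, 10]}, cv=5)
search.fit(X, y)

# cv_results_ holds one entry per hyperparameter combination
summary = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

_Sorting by `rank_test_score` puts the best combination first, which can be easier to scan than the printed list when the grid is large._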

_OK, so we have gone through this `GridSearchCV` step already with gradient boosting and random forest. You will remember that we instantiate the model object and create a dictionary of hyperparameter settings; `GridSearchCV` then loops through the different settings, fits a model for each, and finds the best one._

_We will be doing the same thing here, with slight tweaks only because `StackingClassifier` has slightly different requirements. Let's start by creating our `StackingClassifier` the same way we did in the last video. We will define our estimators, one random forest model and one gradient boosting model, and leave the parentheses empty because we will set their parameters in our parameters dictionary. So this is a very simple stacking model with only two base models. Now let's create our `StackingClassifier`; remember that it requires us to pass the list of estimators to its `estimators` parameter._

_Then let's call the `get_params()` method to refresh our memory on the names of the parameters we want to tweak. In the interest of time, we are going to keep it quite simple. There are a number of parameters we could tweak for gradient boosting and random forest, but we are going to focus only on `n_estimators`, the number of trees each one uses. Those parameters are represented by `gb__n_estimators` and `rf__n_estimators`, so let's copy them directly into our parameters dictionary._
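
_As a quick illustration of where those names come from, the sketch below filters the keys returned by `get_params()`. The variable names `param_names` and `tree_count_params` are illustrative only._

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)

estimators = [('rf', RandomForestClassifier()),
              ('gb', GradientBoostingClassifier())]
sc = StackingClassifier(estimators=estimators)

# Nested parameters are addressed as '<estimator name>__<parameter name>'
param_names = sorted(sc.get_params().keys())
tree_count_params = [name for name in param_names
                     if name.endswith('__n_estimators')]
print(tree_count_params)  # ['gb__n_estimators', 'rf__n_estimators']
```

_The prefix before the double underscore is the label given to each base model in the `estimators` list, so renaming `'rf'` or `'gb'` would change the keys the grid has to use._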

_The next parameter we want to set is `final_estimator`. That will be our meta model. Let's use logistic regression. The primary hyperparameter to tune there is `C`, which controls the amount of regularization, or how closely the model fits the training data (smaller values of `C` mean stronger regularization). Again, if you want to learn more about logistic regression, look into my algorithms course. We will try `C` values of 0.1, 1, and 10._

_Lastly, we highlighted the `passthrough` parameter in the last video. It controls whether the meta model is fit only on the output of the base models or also on the original training data. We will test both `True` (include the training data) and `False` (use only the output of the two base models)._
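
_To make the effect of `passthrough` concrete, here is a minimal sketch that compares how many columns the meta model receives in each case. It assumes synthetic data from `make_classification` and small tree counts to keep it fast; `meta_input_width` is a hypothetical helper, not part of the course code._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)


def meta_input_width(passthrough):
    """Fit a small stacked model and return the number of meta-model input columns."""
    estimators = [
        ('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
        ('gb', GradientBoostingClassifier(n_estimators=10, random_state=0)),
    ]
    model = StackingClassifier(estimators=estimators,
                               final_estimator=LogisticRegression(max_iter=1000),
                               passthrough=passthrough, cv=3)
    model.fit(X, y)
    # transform() returns the matrix handed to the meta model at prediction time
    return model.transform(X).shape[1]


width_without = meta_input_width(False)
width_with = meta_input_width(True)
print(width_without, width_with)
```

_With `passthrough=True`, the extra columns are the original features appended to the base-model outputs, so the difference between the two widths equals the number of input features._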

_The rest looks exactly the same: `GridSearchCV` using our `StackingClassifier` object with 5-fold cross-validation. Then we call `.fit()` with our training features and labels, converting the label column vector to a one-dimensional array._
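
_Two details of this step can be checked without fitting anything: how many combinations the grid contains, and what `.values.ravel()` does to the labels. This is a small sketch; the `labels` frame and its `Survived` column are hypothetical stand-ins for `tr_labels`._

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

parameters = {
    'gb__n_estimators': [50, 100],
    'rf__n_estimators': [50, 100],
    'final_estimator': [LogisticRegression(C=0.1),
                        LogisticRegression(C=1),
                        LogisticRegression(C=10)],
    'passthrough': [True, False]
}

# ParameterGrid enumerates every combination GridSearchCV will try
n_combinations = len(ParameterGrid(parameters))  # 2 * 2 * 3 * 2
n_fits = n_combinations * 5                      # one fit per fold, before the final refit
print(n_combinations, n_fits)

# Hypothetical single-column label frame standing in for tr_labels
labels = pd.DataFrame({'Survived': [0, 1, 1, 0]})
print(labels.values.shape)          # two-dimensional column vector: (4, 1)
print(labels.values.ravel().shape)  # flattened one-dimensional array: (4,)
```

_Each of those fits trains a full stacked model, which itself cross-validates the base models internally, so even this small grid takes noticeably longer than the earlier single-model searches._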

_And then we will just print out our results using the `print_results()` function._

_So let's go ahead and run this._

In [3]:
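# Base models; their hyperparameters are set through the grid below
estimators = [('rf', RandomForestClassifier()),
              ('gb', GradientBoostingClassifier())]

sc = StackingClassifier(estimators=estimators)
sc.get_params()  # parameter names such as 'rf__n_estimators' (displayed only when run as the last line of a cell)

# 2 x 2 x 3 x 2 = 24 hyperparameter combinations
parameters = {
    'gb__n_estimators': [50, 100],
    'rf__n_estimators': [50, 100],
    'final_estimator': [LogisticRegression(C=0.1),
                        LogisticRegression(C=1),
                        LogisticRegression(C=10)],
    'passthrough': [True, False]
}
cv = GridSearchCV(sc, parameters, cv=5)  # 5-fold cross-validation
cv.fit(tr_features, tr_labels.values.ravel())  # ravel() flattens the label column to 1-D

print_results(cv)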

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

BEST PARAMS: {'final_estimator': LogisticRegression(C=10), 'gb__n_estimators': 50, 'passthrough': False, 'rf__n_estimators': 50}

0.835 (+/-0.113) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': True, 'rf__n_estimators': 50}
0.83 (+/-0.111) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': True, 'rf__n_estimators': 100}
0.835 (+/-0.034) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': False, 'rf__n_estimators': 50}
0.837 (+/-0.049) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 50, 'passthrough': False, 'rf__n_estimators': 100}
0.828 (+/-0.101) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 100, 'passthrough': True, 'rf__n_estimators': 50}
0.832 (+/-0.101) for {'final_estimator': LogisticRegression(C=0.1), 'gb__n_estimators': 100, 'passthrough': True, 'rf__n_estimators': 100}
0.83 (+/-0.048) for {'final_estimator': Lo

_First of all, we see a warning that the logistic regression meta model did not converge within the allowed number of iterations, and it suggests increasing `max_iter` or scaling our data. This only pertains to the cases where we include the original training data, since those are the cases where the meta model sees the unscaled raw features. Logistic regression does not strictly require scaled data, but it tends to converge more easily and perform a bit better when the data is scaled. Feel free to experiment with scaling the data and see whether it improves performance. In the interest of time, we will move forward with the unscaled data. If you want a refresher on how to scale your training data, check out my Foundations course in the Applied ML series._
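
_One possible way to act on that warning is to wrap the meta model in a pipeline that scales its inputs and to raise `max_iter`. This is a sketch under stated assumptions, not what the notebook ran: it uses synthetic data with one deliberately large-scale feature, small tree counts, and illustrative names (`scaled_meta`, `sc_scaled`)._

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one deliberately large-scale feature (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[:, 0] *= 1000

estimators = [('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
              ('gb', GradientBoostingClassifier(n_estimators=10, random_state=0))]

# Scale whatever the meta model receives, then fit logistic regression on it
scaled_meta = make_pipeline(StandardScaler(),
                            LogisticRegression(C=1, max_iter=1000))

sc_scaled = StackingClassifier(estimators=estimators,
                               final_estimator=scaled_meta,
                               passthrough=True, cv=3)
sc_scaled.fit(X, y)
preds = sc_scaled.predict(X)

# Inside a grid search, C would now be addressed through the pipeline step name
c_param = 'final_estimator__logisticregression__C'
print(c_param in sc_scaled.get_params())
```

_Because the scaler sits inside the meta model, the tree-based base models still see the raw features, which they handle without scaling._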

_Now, looking at the results. The best results are using XXXXXXX, that generates an accuracy of XX.X% so this is the best Cross-Validation performance that we've seen thus far. Feel free to dig through these results but there are two things I want to call out quickly:_
1. __
2. __

_Then let's ask the `GridSearchCV` object for the best model, which it selects based on cross-validated performance, that is, performance on data held out from each training fold._

In [4]:
cv.best_estimator_

StackingClassifier(estimators=[('rf', RandomForestClassifier(n_estimators=50)),
                               ('gb',
                                GradientBoostingClassifier(n_estimators=50))],
                   final_estimator=LogisticRegression(C=10))

### Write out pickled model

_Lastly, we'll write out the fitted model using `joblib.dump()`. We pass in the model object and tell it to write to the same location where we stored our prior models._

In [5]:
joblib.dump(cv.best_estimator_, '../models/stacked_model.pkl')

['../models/stacked_model.pkl']
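
_For reference, a saved model can be read back with `joblib.load()`. The sketch below shows the round trip, assuming a small logistic regression on synthetic data and a temporary file named `example_model.pkl` rather than the stacked model and the `../models/` path used above._

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (assumptions; not the stacked Titanic model)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.pkl')
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a new object

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```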

_In the final chapter, we will review everything we have learned and compare the best model from each of our ensemble techniques on the validation data._