# Gradient Boosting

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)  # seed NumPy's global random number generator for reproducibility

> Read in the data that was prepared for modeling. In the notebook Feature Engineering and Pre-processing, we ensured that our X variable and the test data have the same features, so that any trained model can predict the probability of West Nile virus in our testing data.

In [2]:
X = pd.read_csv('../../Data/X.csv',index_col=[0])
y = pd.read_csv('../../Data/y.csv',index_col=[0])

### Train Test Split

> Before modeling, we split the data into training and testing sets so that we can later check that the model predicts accurately on unseen data. In this case, we train the model on 70 percent of the data and test it on the remaining 30 percent. Stratifying on our y variable keeps the proportion of cases where West Nile virus is present the same in both the training and testing datasets.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
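
> As a rough illustration of what stratifying does, the sketch below splits a small synthetic dataset in which 5% of the labels are positive. The names `X_demo` and `y_demo` and the 5% rate are assumptions made up for this example; they are not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 positives and 950 negatives (5% positive)
rng = np.random.RandomState(0)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, both splits keep roughly the same share of positives
print('positive rate in train:', y_tr.mean())
print('positive rate in test: ', y_te.mean())
```

> Without `stratify`, the positive rate in a 30 percent test split can drift away from the overall rate purely by chance, which matters when positives are rare.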

### Modeling

>Pipelines streamline routine modeling steps by chaining them into a single estimator, making it easier to model our data. A pipeline exposes the usual fit/transform/predict interface, so calling fit on the training data fits every step in order, and calling predict or score on the test data applies the already-fitted transformations before the final estimator, without having to run each step individually. The purpose of the pipeline is to assemble several steps that can be cross-validated together while their parameters are varied. To support this, a parameter of any step is addressed by the step name and the parameter name joined by a double underscore (`__`). In the model below, we pass Standard Scaler and Gradient Boosting as the `steps` of the pipeline to tell it which estimators we want to fit. Wrapped in a grid search, both steps are fit and the best hyperparameters are found in a single call.
>>Step 1: Standardization of datasets is a common requirement for many machine learning estimators, which can behave poorly when individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). We can use Standard Scaler, from the preprocessing module, to standardize each feature by subtracting its mean and dividing by its standard deviation. It computes the mean and standard deviation on the training set so that the same transformation can later be reapplied to the testing set. By using Standard Scaler, we put every feature on the same scale. Tree-based models such as Gradient Boosting split on feature thresholds and are largely insensitive to this kind of rescaling, so the step is not expected to change the results much here.
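
> A minimal sketch of the fit-on-train, transform-on-test pattern described in Step 1. The two tiny arrays are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_train_demo)                       # learns mean_ and scale_ from the training data only
train_scaled = scaler.transform(X_train_demo)
test_scaled = scaler.transform(X_test_demo)    # reuses the training mean and standard deviation

print('training means:', scaler.mean_)
print('scaled train mean/std:', train_scaled.mean(axis=0), train_scaled.std(axis=0))
print('scaled test row:', test_scaled)
```

> Inside a `Pipeline`, this same sequence happens automatically: the scaler is fit during `fit` and only applied during `predict` or `score`.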

>> Step 2: Gradient Boosting is an ensemble model that combines many "weak" models, each fit to the errors of the ones before it, into a more robust model. It starts from a simple initial prediction and fits a shallow decision tree, which on its own is likely to underfit the training data. The model then looks at how wrong it was and fits another decision tree to the residuals (for classification, the gradient of the loss), repeating this over and over and combining what each tree has learned.
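
> The residual-fitting idea in Step 2 can be sketched by hand for the simpler case of squared-error regression. This is an illustration of the general technique on made-up data, not the internals of `GradientBoostingClassifier`, which works with the gradient of a classification loss rather than raw residuals. The data, the learning rate of 0.1, and the 50 rounds are all assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression data: a noisy sine curve
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant guess
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y_demo - prediction                # how wrong the ensemble is so far
    tree = DecisionTreeRegressor(max_depth=2)      # a shallow "weak" learner
    tree.fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # add a damped correction
    trees.append(tree)

final_mse = np.mean((y_demo - prediction) ** 2)
print('MSE before boosting:', initial_mse)
print('MSE after 50 trees: ', final_mse)
```

> Each tree only corrects a fraction of the remaining error, which is why `n_estimators` and the depth of the trees interact when tuning.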

In [4]:
pipe = Pipeline([
       ('ss',StandardScaler()),
       ('GBC',GradientBoostingClassifier())
])
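
> To make the double-underscore naming concrete, the sketch below builds a separate pipeline with the same step names and sets two parameters of the `GBC` step through the pipeline. The name `demo_pipe` and the values 150 and 5 are chosen only for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('GBC', GradientBoostingClassifier()),
])

# '<step name>__<parameter name>' reaches into the named step
demo_pipe.set_params(GBC__n_estimators=150, GBC__max_depth=5)

print(demo_pipe.named_steps['GBC'].n_estimators)
print(demo_pipe.get_params()['GBC__max_depth'])
```

> This is the same mechanism a grid search uses when it is given keys such as `GBC__n_estimators`.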

> This cell specifies the hyperparameter values that we want to test to see which combination performs best. These hyperparameters are:
>> n_estimators: the number of decision trees (boosting stages) that are fit in our boosting model

>> min_samples_split: the minimum number of samples a node must contain before it can be split

>> min_samples_leaf: the minimum number of samples that must remain in each leaf after a split

>> max_depth: the maximum depth of each individual decision tree, which limits how many successive splits it can make

In [5]:
params = {
    'GBC__n_estimators':[100,150,200],
    'GBC__min_samples_split': [2, 3],
    'GBC__min_samples_leaf': [3, 5],
    'GBC__max_depth':[3,5,7]
}
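
> One way to see how many candidate models a grid like this produces is `ParameterGrid`, which enumerates every combination. The sketch below copies the grid into a separate variable, `demo_params`, a name used only for this example.

```python
from sklearn.model_selection import ParameterGrid

demo_params = {
    'GBC__n_estimators': [100, 150, 200],
    'GBC__min_samples_split': [2, 3],
    'GBC__min_samples_leaf': [3, 5],
    'GBC__max_depth': [3, 5, 7],
}

grid = ParameterGrid(demo_params)
print('number of combinations:', len(grid))   # 3 * 2 * 2 * 3

# Each element is one dictionary of settings that would be tried
for candidate in list(grid)[:3]:
    print(candidate)
```

> Every combination is fit once per cross-validation fold, so the total number of fits grows quickly as values are added to the grid.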

>Grid Search performs hyperparameter tuning, the process of selecting the hyperparameter values that maximize a chosen score. It does this by exhaustively generating candidates from a grid of parameter values specified with the `param_grid` parameter. When Grid Search is fit on a dataset, every combination of parameter values is evaluated with cross-validation, and the best combination is chosen. In our case, we perform Grid Search in order to find the parameters that give us the highest cross-validated score.
>>Instead of scoring the model with accuracy, we scored the model with `roc_auc`, the area under the receiver operating characteristic curve. This is used to evaluate the performance of a binary classification system by measuring how well it ranks positive cases above negative ones. We chose it because our classes are unbalanced: we could predict zero for every data point and still have a 95% accuracy score, because 95% of the data is in our negative class. `roc_auc` instead summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds.
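
> The gap between accuracy and `roc_auc` on unbalanced classes can be shown with a toy example. The 95/5 split below mirrors the proportion mentioned above, but the labels themselves are synthetic and the "model" is a hypothetical one that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that predicts the negative class for every row
predicted_labels = np.zeros(100, dtype=int)
predicted_scores = np.zeros(100)   # the same constant score for every row

acc = accuracy_score(y_true, predicted_labels)
auc = roc_auc_score(y_true, predicted_scores)

print('accuracy:', acc)   # looks strong despite finding no positives
print('roc_auc: ', auc)   # no better than random ranking
```

> A constant prediction cannot rank any positive above any negative, so its `roc_auc` sits at chance level even though its accuracy looks high.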

In [6]:
GradBoost = GridSearchCV(pipe, param_grid=params, scoring='roc_auc')
GradBoost.fit(X_train, y_train['WnvPresent'])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('ss', StandardScaler(copy=True, with_mean=True, with_std=True)), ('GBC', GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0....      presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'GBC__n_estimators': [100, 150, 200], 'GBC__min_samples_split': [2, 3], 'GBC__min_samples_leaf': [3, 5], 'GBC__max_depth': [3, 5, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [7]:
# Use a separate loop variable so the `params` grid defined earlier is not overwritten
for name, value in GradBoost.best_params_.items():
    print('The value for', name, 'that had the highest roc_auc score is', value)

The value for GBC__max_depth that had the highest roc_auc score is 3
The value for GBC__min_samples_leaf that had the highest roc_auc score is 5
The value for GBC__min_samples_split that had the highest roc_auc score is 2
The value for GBC__n_estimators that had the highest roc_auc score is 100


In [8]:
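# score() uses the metric given to GridSearchCV (roc_auc) with the refit best estimator
print('The training score of the model is', GradBoost.score(X_train, y_train['WnvPresent']))
print('The testing score of the model is', GradBoost.score(X_test, y_test['WnvPresent']))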

The training score of the model is 0.8853349955087059
The testing score of the model is 0.87556380679916


### Saving the model for Evaluation

In [9]:
with open('../../Assets/GradBoost.pkl', 'wb') as f:  # write-only binary mode is enough for pickle.dump
    pickle.dump(GradBoost,f)
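
> For the evaluation step, a pickled model can be read back with `pickle.load`. The sketch below is a self-contained round trip on a small synthetic model written to a temporary directory; the data, the file name `demo_model.pkl`, and the model settings are assumptions for the example and do not touch the project files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical small model standing in for the fitted grid search
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print('predictions match after reload:', same)
```

> Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and only from trusted sources.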