# Adaptive Boosting

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
# Seed NumPy's global RNG; np.random.RandomState(42) on its own only builds an unused generator
np.random.seed(42)

> We read in the data that was prepared for modeling. In the Feature Engineering and Pre-processing notebook, we ensured that our X variable and the test data had the same features, so that any trained model can predict the probability of West Nile virus in our testing data.

In [2]:
X = pd.read_csv('../../Data/X.csv',index_col=[0])
y = pd.read_csv('../../Data/y.csv',index_col=[0])

### Train Test Split

> Before modeling, it is important to split the data into training and testing sets. This lets us check later that our model predicts accurately on unseen data. In this specific case, we train our model on 70 percent of our data and test it on the remaining 30 percent. By stratifying on our y variable, we keep the proportion of cases where West Nile virus is present the same in both the training and testing datasets.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y["WnvPresent"], test_size=0.3, random_state=42, stratify=y["WnvPresent"])
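
> To illustrate what stratification does, here is a minimal sketch on synthetic data. The 1,000 rows and the 5% positive rate are assumptions chosen for the demo, not the real dataset, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5% positives (assumed numbers).
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify set, both splits should keep the overall positive rate.
print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_tr_demo.mean())
print("test positive rate:   ", y_te_demo.mean())
```

> Without `stratify`, a rare positive class can end up over- or under-represented in the smaller split purely by chance.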

### Modeling

>Pipelines streamline routine processing by chaining several steps behind a single estimator, which makes it easier to model our data. A pipeline exposes the usual fit/transform/predict interface: fitting the pipeline fits every step on the training data, and predicting on the test data applies the already-fitted transformations before the final estimator predicts, so no step has to be run by hand. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To that end, a parameter of any step is addressed by the step's name and the parameter's name joined by a double underscore (`__`), for example `ADA__n_estimators`. In the model below, we pass Standard Scaler and Adaptive Boost as the pipeline's `steps` to tell it which estimators we want to fit. Wrapped in a grid search, both steps are fit and the optimal parameters are found in a single call.
>>Step 1: Standardization of datasets is a common requirement for many machine learning estimators, which can behave poorly when individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). We use `StandardScaler`, from scikit-learn's preprocessing module, to standardize each feature by removing its mean and dividing by its standard deviation (constant features are left unscaled). It computes the mean and standard deviation on the training set so that the same transformation can later be reapplied to the testing set. By using Standard Scaler, every feature is put on the same scale. Decision trees, the base learner used here, split on thresholds and are largely insensitive to feature scale, so this step matters more for scale-sensitive estimators than for this model.

>> Step 2: Adaptive Boosting (AdaBoost) is a meta-estimator that improves the performance of a base learner. By default, scikit-learn uses a depth-1 Decision Tree Classifier (a decision stump) as the base learner, but other models can be boosted as well. AdaBoost was originally formulated for binary classification, which is our setting. It works sequentially: it fits a single decision tree to our data and looks at the predictions that tree got wrong. It then increases the weights of the incorrectly predicted observations so that the next decision tree concentrates on them, and the final prediction combines the weighted votes of all the trees.
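
> The reweighting described in Step 2 can be sketched for a single boosting round as follows. This is the textbook discrete AdaBoost update for two classes, shown on made-up labels and predictions; it is an illustration only, and scikit-learn's internal implementation differs in its details.

```python
import math

# Toy labels and one weak learner's predictions (made-up values for illustration).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # wrong at positions 3 and 8

n = len(y_true)
weights = [1.0 / n] * n  # every observation starts with the same weight
learning_rate = 1.0

missed = [t != p for t, p in zip(y_true, y_pred)]

# Weighted error rate of this weak learner.
err = sum(w for w, m in zip(weights, missed) if m)

# The learner's say in the final vote; larger when the error is smaller.
alpha = learning_rate * math.log((1.0 - err) / err)

# Up-weight the misclassified observations, then renormalize so weights sum to 1.
weights = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
total = sum(weights)
weights = [w / total for w in weights]

print("weighted error:", err)
print("alpha:", alpha)
print("new weights:", weights)
```

> After this round, the two misclassified observations carry more weight than the eight correct ones, so the next tree is pushed to get them right.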

In [4]:
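pipe = Pipeline([
    # Step names ('SS', 'ADA') are the prefixes used later in param_grid
    ('SS', StandardScaler()),
    # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`
    ('ADA', AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced')))
])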
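
> As a rough sketch of the `__` naming convention and of how the fitted scaler is reused on new data, the example below builds a similar pipeline on synthetic data. The names `demo_pipe`, `X_tr_demo` and `X_te_demo` are hypothetical, and the default base learner is used instead of the balanced tree above to keep the example independent of the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ('SS', StandardScaler()),
    ('ADA', AdaBoostClassifier()),
])

# '<step name>__<parameter name>' reaches into a single step of the pipeline.
demo_pipe.set_params(ADA__n_estimators=30, ADA__learning_rate=0.5)

# Synthetic data, assumed purely for illustration.
rng = np.random.RandomState(0)
X_tr_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y_tr_demo = (X_tr_demo[:, 0] > 10.0).astype(int)
X_te_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 2))

# fit() learns the scaling statistics from the training data only ...
demo_pipe.fit(X_tr_demo, y_tr_demo)
scaler = demo_pipe.named_steps['SS']
print("scaler mean:  ", scaler.mean_)
print("training mean:", X_tr_demo.mean(axis=0))

# ... and predict() reapplies those same statistics to the unseen rows.
preds = demo_pipe.predict(X_te_demo)
print("predictions shape:", preds.shape)
```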

### Parameters

> This cell specifies the hyperparameters that we want to test to see which combination performs best. These hyperparameters are:
>> n_estimators - number of estimators: the maximum number of models that are fit in sequence, each one learning from the errors of the ones before it

>> learning_rate: the weight applied to each model's contribution at each boosting iteration; a lower learning rate shrinks each model's contribution, so more estimators are usually needed to compensate

In [5]:
param_grid = {
    'ADA__n_estimators': [30, 50, 70, 100, 150],
    'ADA__learning_rate': [.05, .1, .5, .7, 1.0],
}
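
> To get a feel for the size of this search, the grid can be expanded with scikit-learn's `ParameterGrid`. This is only a sketch for inspection; `candidates` is a hypothetical name, and the grid search itself does this expansion internally.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'ADA__n_estimators': [30, 50, 70, 100, 150],
    'ADA__learning_rate': [.05, .1, .5, .7, 1.0],
}

# Every combination of the two lists becomes one candidate setting.
candidates = list(ParameterGrid(param_grid))
print("number of candidates:", len(candidates))
print("first candidate:", candidates[0])
```

> Each candidate is refit once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds.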

### Grid Searching

>Grid Search (`GridSearchCV`) performs hyperparameter tuning, the process of selecting the hyperparameter values that maximize a chosen score for the model. It does this by exhaustively generating candidates from a grid of values specified with the `param_grid` parameter. When fitting Grid Search on a dataset, every possible combination of parameter values is evaluated with cross-validation, and the best combination is chosen. In our case, we perform Grid Search to find the parameters that give us the highest score.
>>Instead of scoring the model with accuracy, we scored the model with `roc_auc`, the area under the receiver operating characteristic curve. It evaluates how well a binary classifier ranks positive cases above negative ones, and we use it because our classes are unbalanced. We could predict zero for every data point and still get about 95% accuracy, because about 95% of the data is in our negative class. `roc_auc` instead summarizes the trade-off between the true positive rate and the false positive rate across all decision thresholds, so a model that never identifies a positive case scores only 0.5.

In [6]:
AdaBoost = GridSearchCV(pipe, param_grid=param_grid, scoring='roc_auc')
AdaBoost.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('SS', StandardScaler(copy=True, with_mean=True, with_std=True)), ('ADA', AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impu...ne,
            splitter='best'),
          learning_rate=1.0, n_estimators=50, random_state=None))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'ADA__n_estimators': [30, 50, 70, 100, 150], 'ADA__learning_rate': [0.05, 0.1, 0.5, 0.7, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)
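
> The difference between accuracy and `roc_auc` on unbalanced classes can be sketched with a baseline that always predicts the negative class. The 5%/95% split below is an assumed round number for the demo, not a figure computed from the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 100 observations, 5 positives and 95 negatives (assumed for illustration).
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that predicts the negative class for every observation.
always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_zero)
auc = roc_auc_score(y_true, always_zero.astype(float))

print("accuracy:", acc)  # high, despite finding no positive cases
print("roc_auc: ", auc)  # no better than chance
```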

In [7]:
for param in AdaBoost.best_params_:
    print('The values for',param,'That gave us the highest roc_auc score is',AdaBoost.best_params_[param])

The values for ADA__learning_rate That gave us the highest roc_auc score is 0.7
The values for ADA__n_estimators That gave us the highest roc_auc score is 50


In [8]:
# With scoring='roc_auc', score() reports the ROC AUC of the refit best model
print('The training score of the model is',AdaBoost.score(X_train,y_train))
print('The testing score of the model is',AdaBoost.score(X_test,y_test))

The training score of the model is 0.9958461107178337
The testing score of the model is 0.8010500045652373


## Saving the model for Evaluation

In [9]:
with open('../../Assets/AdaBoost.pkl', 'wb') as f:
    pickle.dump(AdaBoost, f)
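
> For the evaluation step, the saved model would be read back with `pickle.load`. The sketch below shows the round trip on a small stand-in estimator written to a temporary file; the tiny dataset, `demo_model` and `demo_path` are hypothetical and are used only so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# A tiny fitted estimator standing in for the grid-searched AdaBoost model.
demo_model = DecisionTreeClassifier(max_depth=1, random_state=0)
demo_model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')

# Save in binary write mode ...
with open(demo_path, 'wb') as f:
    pickle.dump(demo_model, f)

# ... and restore in binary read mode.
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)

# The restored estimator should predict exactly like the original.
same = list(restored.predict([[0], [3]])) == list(demo_model.predict([[0], [3]]))
print("predictions match:", same)
```

> Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and only from files you trust.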