## Boosting and Pipelines

### Objectives

- Understand how to chain preprocessing and modeling steps into a single workflow (pipelines).

- Understand why pipelines are useful in practice.

- Learn boosting methods, specifically gradient boosting and AdaBoost.

- Implement a GradientBoostingClassifier and fine-tune it with a grid search.


### Pipelines

__Q:__ What is a pipeline?

[sklearn - documentation](https://scikit-learn.org/stable/modules/compose.html#pipeline)

> Transformers (scaling, preprocessing, feature selection etc.) are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline.

__Q:__ Why should we use pipelines?

 - Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
 - Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.
 - Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
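
The three points above can be illustrated with a minimal sketch. It assumes synthetic data from `make_classification` rather than the diabetes dataset used below, and the step names `scaler` and `clf` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical step names chosen for this sketch
demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('clf', LogisticRegression(max_iter=1000))])

# Convenience and safety: in every fold the scaler is fit on the
# training part only, then applied to the held-out part.
demo_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring='recall')
print(demo_scores.mean())

# Joint parameter selection: any step's parameter is addressed as
# <step name>__<parameter name>, the same naming a grid search uses.
demo_pipe.set_params(clf__C=0.1)
print(demo_pipe.get_params()['clf__C'])
```

Scaling the full dataset before cross-validation would instead let the held-out folds influence the scaler's mean and variance, which is the leak the pipeline avoids.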
    
    


In [None]:
import numpy as np
np.random.seed(0)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

In [None]:
## load the dataset 
## Source: https://www.kaggle.com/uciml/pima-indians-diabetes-database/download
df = pd.read_csv('data/diabetes.csv')
df.head()

In [None]:
## Let's use describe method to see if there is anything suspicious
df.describe().T

In [None]:
## Now let's use info method
df.info()

In [None]:
## separate target variable from features
target = df.Outcome
df.drop('Outcome', axis=1, inplace=True)

In [None]:
## Let's see the distribution of 1's and 0's
np.unique(target, return_counts= True)

In [None]:
## Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.20)

Note that in this problem it makes sense to focus on the recall score, because the mistake we most want to avoid is a false negative: classifying a patient who has diabetes as healthy.

Recall score = $\frac{tp}{(tp + fn)}$
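
As a small worked example of the formula, the sketch below computes recall by hand from a confusion matrix and compares it with `recall_score`. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): 1 = has diabetes, 0 = does not
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true_demo, y_pred_demo)

print(tp, fn, recall_manual, recall_sklearn)
```

Here two of the four true positives are found, so both computations give a recall of 0.5.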


In [None]:
## First let's fit a logistic regression model to see the baseline
## we will also use pipelines
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## we will apply standard scaling to Logistic regression
## because we might want to use regularization

from sklearn.preprocessing import StandardScaler

In [None]:
## All estimators in a pipeline, except the last one,
## must be transformers (i.e. must have a transform method). 
## The last estimator may be any type (transformer, classifier, etc.)

pipe = Pipeline([('ss', StandardScaler()),
                 ('log_reg', LogisticRegression(random_state=123,
                                                max_iter = 500, 
                                                solver = 'saga'))])

In [None]:
## we can access a particular step in the pipeline

print(pipe.steps[0])

print(pipe['log_reg'])

In [None]:
## a pipeline can be fit directly,
## or passed to a grid search as the estimator

## let's use the fit method of the pipeline
pipe.fit(X_train, y_train)

## Predict through the whole pipeline so that the scaler is applied first.
## Calling pipe['log_reg'].predict(X_train) directly would skip the scaling step.
pipe.predict(X_train)

In [None]:
## to find the best value of C
## let's use GridSearchCV

from sklearn.model_selection import GridSearchCV
grid = [{'log_reg__C': np.logspace(-2, 2, 10, base=10.0),
         'log_reg__penalty': ['l1', 'l2']}]

gridsearch = GridSearchCV(estimator=pipe,
                  param_grid=grid,
                  scoring='recall',
                  cv=5, verbose=1, n_jobs=-1)

gridsearch.fit(X_train, y_train)

In [None]:
# Best cross-validated recall (the grid search uses scoring='recall')
print('Best recall: %.3f' % gridsearch.best_score_)

# Best params
print('\nBest params:\n', gridsearch.best_params_)

### Boosting Algorithms

__Q:__ What is boosting?

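 - Recall that the random forest algorithm uses bootstrap aggregation (bagging) to decrease the variance of individual trees.
 - Boosting vs. bagging
      - Bagging: trees are grown independently, so they can be built in parallel
      - Boosting: trees are grown sequentially, each one using information from the trees before it
 - The idea is to learn slowly: many small trees are added one at a time, each making a modest improvement.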

Recall that bagging fits each tree on a bootstrap sample. Boosting does not bootstrap; instead, each tree is fit on a modified version of the original dataset (for example, reweighted samples or the current residuals).
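
The sequential idea can be sketched from scratch for the simplest case: regression with squared error, where "modifying the dataset" means replacing the target with the current residuals. This is an illustrative sketch on made-up data, not the implementation used by sklearn's boosting classifiers; `n_trees` and `shrinkage` are names chosen for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression data for illustration
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, shrinkage = 100, 0.1       # hypothetical settings
prediction = np.zeros_like(y_toy)   # start from the zero model
residual = y_toy.copy()             # what is still unexplained
trees = []

for _ in range(n_trees):
    # Fit a small tree to the current residuals, not to y itself
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(X_toy, residual)
    # Add a shrunken version of the new tree to the model
    update = shrinkage * stump.predict(X_toy)
    prediction += update
    # Update the residuals: this is the "modified dataset" for the next tree
    residual -= update
    trees.append(stump)

mse_start = np.mean(y_toy ** 2)
mse_end = np.mean((y_toy - prediction) ** 2)
print(mse_start, mse_end)
```

No bootstrap sample is drawn anywhere in the loop; every tree sees all rows, and only the target changes from step to step.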

__Important parameters__ (with sklearn notation)

__n_estimators:__ The number of trees to use in the procedure (called B in the quoted text below)


__learning_rate:__ (Shrinkage parameter)

> The shrinkage parameter $\lambda$, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small $\lambda$ can require using a very large value of B in order to achieve good performance.
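
The trade-off between the learning rate and the number of trees can be seen with a small sketch. It assumes synthetic data and compares the final training loss stored in `train_score_` for two learning rates with the same number of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Same number of trees, two different learning rates
slow = GradientBoostingClassifier(learning_rate=0.01, n_estimators=100,
                                  random_state=0).fit(X_demo, y_demo)
fast = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100,
                                  random_state=0).fit(X_demo, y_demo)

# train_score_[i] is the training loss after stage i (lower is better)
print(slow.train_score_[-1], fast.train_score_[-1])
```

With the smaller learning rate the training loss after 100 trees is still higher, which is why a small `learning_rate` is usually paired with a larger `n_estimators`.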


<img src="img/boosting_algorithm.png" width=450 height=450>

__max_depth, max_leaf_nodes, etc.:__ (These control the number of splits in each tree)

> Often d = 1 works well, in which case each tree is called a _stump_, consisting of a single split. In this case, the boosted ensemble is fitting an additive model, since each term involves only a single variable. More generally d is the interaction depth, and controls the interaction order of the boosted model, since d splits can involve at most d variables.
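
To see what "additive" means in practice, the sketch below uses a made-up XOR-style dataset whose label depends only on the interaction of two features. Note that sklearn's `max_depth` limits tree depth rather than the number of splits d from the quote; `max_depth=1` still corresponds to a stump.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Made-up data: the label is 1 when the two features have the same sign,
# so neither feature is informative on its own
rng = np.random.RandomState(0)
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_xor, y_xor,
                                          test_size=0.3, random_state=0)

# Stumps: each tree uses one variable, so the ensemble is additive
stumps = GradientBoostingClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
# Depth 2: a single tree can combine both variables
deeper = GradientBoostingClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

acc_stumps = stumps.score(X_te, y_te)
acc_deeper = deeper.score(X_te, y_te)
print(acc_stumps, acc_deeper)
```

On data like this the stump ensemble has no way to represent the interaction, while the deeper trees can. On many real datasets the gap is much smaller, which is why stumps are often a reasonable starting point.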

In [None]:
## Now let's investigate the performance of Adaboost and GradientBoost
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
## let's see some of the parameters of the Gradient Boosting
?GradientBoostingClassifier

In [None]:
gbc = GradientBoostingClassifier(random_state= 103019,
                                 validation_fraction=0.1, 
                                 n_iter_no_change= 5, 
                                 tol = 0.005)

In [None]:
## Let's use a gridsearch to find best parameters for GradientBoost

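## max_features: a float is a fraction of the features (0.5 = half),
## an int is an absolute count (1 = a single feature per split)
params = {'n_estimators': [100, 200, 300],
          'learning_rate': np.logspace(-3, -1, 5),
          'max_leaf_nodes': [3, 5, 7, 9],
          'subsample': [0.2, 0.5, 0.7, 0.9],
          'max_features': [0.5, 1]}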

gs = GridSearchCV(estimator = gbc, 
                  param_grid = params,
                  cv = 5, 
                  scoring= 'recall',
                  verbose = 1,
                  n_jobs= -1)

gs.fit(X_train, y_train)

### Some practical tips for Gradient Boost

- According to the sklearn documentation, max_leaf_nodes = k gives results comparable to max_depth = k-1 but trains faster. So you might want to use max_leaf_nodes for bigger projects.

- Again according to the sklearn documentation, a smaller learning rate tends to give better test scores, but it needs more estimators to reach the same training error.

- As mentioned above, a small learning rate usually calls for more estimators. To avoid unnecessary computation, we can stop early with n_iter_no_change, which works together with validation_fraction and tol. Note that min_impurity_decrease is different: it limits how far each individual tree is grown, not how many trees are added.

- Subsampling combined with shrinkage (learning rate) might give better results. When subsample is less than 1, out-of-bag estimates also become available through the oob_improvement_ attribute of the fitted model.

- Using a small max_features value can significantly decrease the runtime.

For more: 
[sklearn documentation - gradientboost](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting)
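
The early stopping and out-of-bag tips can be sketched as follows. The data is synthetic and the parameter values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

gbc_demo = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    subsample=0.7,            # < 1.0 enables out-of-bag estimates
    validation_fraction=0.1,  # held out internally to monitor the loss
    n_iter_no_change=5,       # stop after 5 stages without improvement
    tol=0.005,
    random_state=0)
gbc_demo.fit(X_demo, y_demo)

# Number of trees actually fitted before early stopping kicked in
print(gbc_demo.n_estimators_)

# Out-of-bag improvement in loss for the first few stages
print(gbc_demo.oob_improvement_[:5])
```

`n_estimators_` (with a trailing underscore) reports how many stages were fitted, so a value well below 500 means early stopping saved most of the work.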

In [None]:
print(gs.best_score_)
print(gs.best_estimator_)

In [None]:
## let's see the best_estimator's test performance
best_estimator = gs.best_estimator_
y_pred = best_estimator.predict(X_test)


## import recall_score from sklearn
from sklearn.metrics import recall_score

print(recall_score(y_test, y_pred))

## similarly log_reg predictor would give

log_reg_best = gridsearch.best_estimator_
y_pred_log = log_reg_best.predict(X_test)

print(recall_score(y_test, y_pred_log))

In [None]:
y_train_pred = best_estimator.predict(X_train)

print(recall_score(y_train, y_train_pred))

## try the same thing with log_reg_best: do you expect a better score?

In [None]:
## Try the AdaBoost algorithm and XGBoost here
## Use GridSearchCV or RandomizedSearchCV to fine-tune parameters.
