# Boosting

In [None]:
from sklearn.ensemble import (AdaBoostClassifier, AdaBoostRegressor,
                              GradientBoostingClassifier)
import xgboost  # You may need to install this!
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot  # make sure the mpl.pyplot submodule is loaded
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix
%matplotlib inline

## Agenda

SWBAT (students will be able to):

- describe boosting algorithms;
- implement boosting models with `sklearn` and with `XGBoost`.

## Intro

One of the problems with using single decision trees and random forests is that, once I make a split, I can't go back and consider how another feature varies across the whole dataset. But suppose I were to consider **my tree's errors**. The fundamental idea of ***boosting*** is to start with a weak learner and then to use information about its errors to build a new model that can supplement the original model.

## Two Types

The two main types of boosting available in Scikit-Learn are adaptive boosting (`AdaBoostClassifier`, `AdaBoostRegressor`) and gradient boosting (`GradientBoostingClassifier`, `GradientBoostingRegressor`).

Again, the fundamental idea of boosting is to use a sequence of **weak** learners to build a model. Though the individual learners are weak, the idea is to train iteratively in order to produce a better predictor. More specifically, the first learner will be trained on the data as it stands, but future learners will be trained on modified versions of the data. The point of the modifications is to highlight the "hard-to-predict-accurately" portions of the data.

- **AdaBoost** works by iteratively adapting two related series of weights, one attached to the datapoints and the other attached to the learners themselves. Datapoints that are incorrectly classified receive greater weights for the next learner in the sequence. That way, future learners will be more likely to focus on those datapoints. At the end of the sequence, the learners that make better predictions, especially on the datapoints that are more resistant to correct classification, receive more weight in the final "vote" that determines the ensemble's prediction. <br/> Suppose we have a binary classification problem and we represent the two classes with 1 and -1. (This is standard for describing the algorithm of AdaBoost.) <br/>
Then, in a nutshell: <br/>
    1. Train a weak learner. <br/>
    2. Calculate its weighted error rate $\epsilon$ (the total weight of the datapoints it misclassifies). <br/>
    3. Use that error to compute a weight for the classifier: $\theta = \frac{1}{2}\ln\left(\frac{1-\epsilon}{\epsilon}\right)$. <br/>
    Note that $\theta$ CAN be negative (when $\epsilon > 0.5$). This represents a classifier whose accuracy is _worse_ than chance. <br/>
    4. Use _that_ to adjust the data points' weights: $w_{n+1} = w_n\left(\frac{e^{\pm\theta}}{\text{scaler}}\right)$, where the scaler renormalizes the weights so that they sum to 1. Use $+\theta$ for incorrect predictions, $-\theta$ for correct predictions. <br/>  $\rightarrow$ For more detail on AdaBoost, see [here](https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c).
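
The four steps above can be sketched by hand. This is only a rough illustration, not how `AdaBoostClassifier` is implemented internally: the synthetic dataset, the use of depth-1 trees ("stumps") as weak learners, the number of rounds, and all variable names are my own assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (an assumption for illustration); relabel classes as -1 and +1.
X, y01 = make_classification(n_samples=300, n_features=5, random_state=42)
y = np.where(y01 == 1, 1, -1)

n_rounds = 10                                # illustrative choice
weights = np.full(len(y), 1 / len(y))        # start with uniform datapoint weights
stumps, thetas = [], []

for _ in range(n_rounds):
    # 1. Train a weak learner on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # 2. Weighted error, clipped away from 0 and 1 so the log stays finite.
    eps = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)

    # 3. Weight on the classifier.
    theta = 0.5 * np.log((1 - eps) / eps)

    # 4. y * pred is -1 for mistakes and +1 for hits, so this multiplies by
    #    exp(+theta) for incorrect and exp(-theta) for correct predictions.
    weights = weights * np.exp(-theta * y * pred)
    weights = weights / weights.sum()        # the "scaler": renormalize to sum to 1

    stumps.append(stump)
    thetas.append(theta)

# Final "vote": sign of the theta-weighted sum of the learners' predictions.
scores = sum(t * s.predict(X) for t, s in zip(thetas, stumps))
train_accuracy = (np.sign(scores) == y).mean()
print("first stump accuracy:", (stumps[0].predict(X) == y).mean())
print("ensemble accuracy:   ", train_accuracy)
```

Comparing the two printed accuracies gives a feel for how much the weighted vote adds over a single stump on the training data.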

- **Gradient Boosting** works instead by training each new learner on the residuals of the model built from the learners constructed so far. That is, Model $n+1$ (with $n+1$ learners) will focus on the predictions of Model $n$ (with only $n$ learners) that were **most off the mark**. As the training process repeats, the residuals get smaller. The sequence looks like this: <br/> Model 0 is very simple. Perhaps it merely predicts the mean: <br/>
$\hat{y}_0 = \bar{y}$; <br/>
Model 1's predictions would then be the sum of (i) Model 0's predictions and (ii) the predictions of the model fitted to Model 0's residuals: <br/> $\hat{y}_1 = \hat{y}_0 + \widehat{(y - \hat{y}_0)}$; <br/>
Now iterate: Model 2's predictions will be the sum of (i) Model 0's predictions, (ii) the predictions of the model fitted to Model 0's residuals, and (iii) the predictions of the model fitted to Model 1's residuals: <br/> $\hat{y}_2 = \hat{y}_0 + \widehat{(y - \hat{y}_0)} + \widehat{(y - \hat{y}_1)}$<br/>
Etc.
<br/>
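
The Model 0, Model 1, Model 2 sequence can be sketched by hand for a regression problem. The noisy sine data, the shallow trees, and the variable names below are my own assumptions for illustration. Note also that Scikit-Learn's gradient boosting estimators shrink each learner's contribution by a `learning_rate`, which this sketch leaves out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Model 0: predict the mean of the target everywhere.
y_hat = np.full_like(y, y.mean())
mse_history = [np.mean((y - y_hat) ** 2)]

for _ in range(3):
    # Fit the next weak learner to the current model's residuals...
    residuals = y - y_hat
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)

    # ...and add its predictions to the running total.
    y_hat = y_hat + tree.predict(X)
    mse_history.append(np.mean((y - y_hat) ** 2))

# Training error after Model 0, Model 1, Model 2, Model 3.
print(mse_history)
```

Each tree predicts the mean residual within its leaves, so the training error cannot go up from one model to the next.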

$\rightarrow$ How does gradient boosting work for a classification problem? How do we even make sense of the notion of a gradient in that context? The short answer is that we appeal to the probabilities associated with the predictions for the various classes. See more on this topic [here](https://sefiks.com/2018/10/29/a-step-by-step-gradient-boosting-example-for-classification/). <br/> $\rightarrow$ Why is this called "_gradient_ boosting"? The short answer is that fitting a learner to a model's residuals comes to the same thing as fitting it to the derivative of that model's loss function. See more on this topic [here](https://www.ritchievink.com/blog/2018/11/19/algorithm-breakdown-why-do-we-call-it-gradient-boosting/).
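
To make the "gradient" connection concrete for the regression case, here is the one-line calculation. It assumes a squared-error loss with the conventional factor of $\frac{1}{2}$; both the loss and the factor are my assumptions, since nothing above fixes a particular loss.

$$L(y, \hat{y}) = \frac{1}{2}\left(y - \hat{y}\right)^2 \quad\Longrightarrow\quad -\frac{\partial L}{\partial \hat{y}} = y - \hat{y}$$

Under this loss the residual is exactly the negative gradient of the loss with respect to the current prediction, so fitting the next learner to the residuals amounts to taking a gradient descent step on the predictions.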

Let's illustrate boosting now, starting with AdaBoost!

## AdaBoost in Scikit-Learn

In [None]:
galaxies = pd.read_csv('COMBO17.csv')
galaxies.head()

This is a dataset about galaxies. The Mcz and MCzml columns are measures of redshift, which is our target. Mcz is usually understood to be a better measure, so that will be our target column. Many of the other columns have to do with various measures of galaxies' magnitudes. For more on the dataset, see [here](https://astrostatistics.psu.edu/datasets/COMBO17.html).

In [None]:
galaxies.columns

In [None]:
galaxies.isnull().sum().sum()

In [None]:
galaxies.info()

In [None]:
galaxies = galaxies.dropna()

Let's collect together the columns that have high correlation with Mcz, our target:

In [None]:
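# Compute the correlations once, then keep the columns whose absolute
# correlation with Mcz exceeds 0.5 (Mcz itself is included).
mcz_corr = galaxies.corr()['Mcz']
preds = mcz_corr[mcz_corr.abs() > 0.5].index.tolist()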

In [None]:
galaxies[preds].corr()

These various magnitude columns all have high correlations **with one another**! Let's try a simple model with just the S280MAG column, since it has the highest correlation with Mcz.

In [None]:
x = galaxies['S280MAG']
y = galaxies['Mcz']

Since we only have one predictor, we can plot it directly against the target! First, though, we'll reshape it into a two-dimensional array with a single column, since Scikit-Learn estimators expect a 2-D feature array.

In [None]:
x_rev = x.values.reshape(-1, 1)

In [None]:
mpl.pyplot.scatter(x_rev, y);

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_rev, y, random_state=42)

In [None]:
abr = AdaBoostRegressor(random_state=42)

abr.fit(x_train, y_train)

In [None]:
# Cross-validate on the training data; the test set stays held out for a final check.
cross_val_score(abr, x_train, y_train, cv=5)

## Hyperparameter Tuning

Let's see if we can do better by trying different hyperparameter values:

In [None]:
gs = GridSearchCV(estimator=abr,
                 param_grid={
                     'n_estimators': [25, 50, 100],
                     'loss': ['linear', 'square']
                 }, cv=5)

In [None]:
gs.fit(x_train, y_train)

In [None]:
gs.best_params_
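
Beyond `best_params_`, a fitted grid search also exposes the cross-validated score of the winning combination and can score the refit best model on held-out data. Here is a self-contained sketch on synthetic data. The data, the smaller grid, and the names `demo_gs`, `X_tr`, `X_te`, etc. are my own assumptions, chosen to keep the run fast.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 1-D regression data (an assumption for illustration).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_gs = GridSearchCV(
    estimator=AdaBoostRegressor(random_state=42),
    param_grid={'n_estimators': [10, 25], 'loss': ['linear', 'square']},
    cv=3,
)
demo_gs.fit(X_tr, y_tr)

print(demo_gs.best_params_)        # winning hyperparameter combination
print(demo_gs.best_score_)         # its mean cross-validated R^2
print(demo_gs.score(X_te, y_te))   # R^2 of the refit best model on held-out data
```

The same three attributes and methods are available on `gs` above once it has been fitted.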

## XGBoost

In [None]:
grad_boost = xgboost.XGBRegressor(random_state=42, objective='reg:squarederror')

grad_boost.fit(x_train, y_train)

In [None]:
# Cross-validate on the training data; the test set stays held out for a final check.
cross_val_score(grad_boost, x_train, y_train, cv=5)

## Regression or Classification?

What does my target look like?

In [None]:
galaxies['Mcz'].hist();

There seems to be a bit of a bimodal shape here. We might therefore try predicting whether the redshift factor is likely to be greater or less than 0.5:

In [None]:
galaxies['bool'] = galaxies['Mcz'] > 0.5

In [None]:
galaxies.tail()

In [None]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_rev, galaxies['bool'], random_state=42)

### AdaBoost

In [None]:
abc = AdaBoostClassifier(random_state=42)

abc.fit(x_train2, y_train2)

In [None]:
abc.score(x_test2, y_test2)

In [None]:
precision_score(y_test2, abc.predict(x_test2))

In [None]:
recall_score(y_test2, abc.predict(x_test2))

### GradientBoosting

In [None]:
gbc = GradientBoostingClassifier(random_state=42)

gbc.fit(x_train2, y_train2)

In [None]:
gbc.score(x_test2, y_test2)

In [None]:
precision_score(y_test2, gbc.predict(x_test2))

In [None]:
recall_score(y_test2, gbc.predict(x_test2))

In [None]:
confusion_matrix(y_test2, gbc.predict(x_test2))
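
As a reminder of how these metrics relate to one another, here is a tiny worked example on hand-made labels (the two arrays are invented for illustration and have nothing to do with the galaxy data). Precision and recall can be read straight off the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels, purely for illustration.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right?
recall = tp / (tp + fn)      # of the actual positives, how many did we find?

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Each printed pair should agree, since `precision_score` and `recall_score` compute the same ratios.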

### XGBoost

In [None]:
grad_boost_class = xgboost.XGBClassifier(random_state=42, objective='binary:logistic')

grad_boost_class.fit(x_train2, y_train2)

In [None]:
grad_boost_class.score(x_test2, y_test2)