# Boosting - Gradient Boosting

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

**Note**: If you have completed the Decision Tree or Random Forest notebooks already, those preprocessing steps are the same. Feel free to copy paste answers from the previous notebook or the solutions and jump straight to the Gradient Boosting part.

### The Dataset

The dataset can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It consists of data from marketing campaigns of a Portuguese bank. We will try to build a classifier that can predict whether or not the client targeted by the campaign ended up subscribing to a term deposit (column `y`).

Load the file `data/bank-marketing.zip` with pandas and check the distribution of the target `y`. Here the separator is `';'` instead of a comma.

The dataset is imbalanced, so we will need to keep that in mind when building our models!
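As a minimal sketch of this step, the snippet below reads a `;`-separated file and prints the class balance of `y`. The inline CSV text is a hypothetical stand-in used only for illustration; with the real data you would pass the path `data/bank-marketing.zip` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real file (hypothetical rows, for illustration only).
csv_text = "age;job;y\n30;admin.;no\n45;technician;no\n52;retired;yes\n28;student;no\n"

# With the real data: pd.read_csv("data/bank-marketing.zip", sep=";")
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Relative frequency of each class in the target column.
target_distribution = df["y"].value_counts(normalize=True)
print(target_distribution)
```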

Now split the data into the feature matrix `X` (all features except `y`) and the target vector `y` making sure that you convert `yes` to `1` and `no` to `0`.
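One possible way to do this split is sketched below on a hypothetical three-row frame (the column values are made up for illustration). It uses `DataFrame.drop` for the features and `Series.map` for the `yes`/`no` conversion.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the loaded dataset.
df = pd.DataFrame(
    {"age": [30, 45, 52], "job": ["admin.", "technician", "retired"], "y": ["no", "no", "yes"]}
)

# Feature matrix: every column except the target.
X = df.drop(columns="y")

# Target vector: map the string labels to integers.
y = df["y"].map({"yes": 1, "no": 0})
print(y.tolist())
```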

In [None]:
# Get X, y


Here is the list of features in our X matrix:

```
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
8. contact: contact communication type (categorical: 'cellular','telephone') 
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric) 
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
```

Note the comment about the `duration` feature. We will exclude it from our analysis.

Drop `duration` from X:

Now we can check the types of all our features. We see that some seem to be categorical whilst others are numerical. We will keep two lists, one for each type, so we can preprocess them differently.

In [None]:
X.dtypes

In [None]:
# Some yes/no features (default, housing, loan) also have a third class "unknown",
# so we process them as non-binary categorical features
num_features = ["age", "campaign", "pdays", "previous", "emp.var.rate", 
                "cons.price.idx", "cons.conf.idx","euribor3m", "nr.employed"]

cat_features = ["job", "marital", "education","default", "housing", "loan",
                "contact", "month", "day_of_week", "poutcome"]

### Visualise the numerical features

* show a boxplot of the numerical features

The features aren't on the same scale. That is fine for tree-based methods, as we've seen in the course, so we do not need to do any scaling here!

### One Hot Encoding on Categorical Features

In order to make sure our dataset contains only numbers, we need to transform our categorical features into one hot encoded features. To do so, first use `pd.get_dummies` on your dataframe (select only the categorical features) to generate the new columns. Assign the new dataframe to a variable `X_categorical`.

Now we can create `X_processed` using `pd.concat` (check the documentation, you will need to specify the right axis). Here we want to concatenate a dataframe containing only our numerical features with the `X_categorical` we created above:
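The two encoding steps could look like the sketch below. The small `X` frame and the shortened `num_features` / `cat_features` lists are hypothetical stand-ins so that the example runs on its own.

```python
import pandas as pd

# Hypothetical miniature feature matrix with two numerical and two categorical columns.
X = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "campaign": [1, 3, 2],
        "marital": ["single", "married", "married"],
        "contact": ["cellular", "telephone", "cellular"],
    }
)
num_features = ["age", "campaign"]
cat_features = ["marital", "contact"]

# One column per category value, for the categorical features only.
X_categorical = pd.get_dummies(X[cat_features])

# axis=1 places the two frames side by side (column-wise concatenation).
X_processed = pd.concat([X[num_features], X_categorical], axis=1)
print(X_processed.columns.tolist())
```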

### Split data into training set and test set

Split your data (use `X_processed`) into a training set and a test set. Here we are dealing with an imbalanced dataset, so it is important to enforce stratification. We will use the argument `stratify` from `train_test_split` to do so (check the documentation).
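A sketch of a stratified split follows. The random features, the 10% positive rate, `test_size=0.25` and `random_state=0` are all assumptions made for the example, not values prescribed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 10% positives.
rng = np.random.default_rng(0)
X_processed = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the class proportions (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer positives than the training set.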

Great, we're ready to start training our Gradient Boosting algorithms!

##  Gradient Boosting

Gradient boosting is a bit different from the other algorithms we have seen so far, for which sklearn offers a standard implementation that people usually stick to. Many optimisations can make gradient boosting perform better, both as a learning algorithm and as software, so several good implementations exist. The most popular boosting libraries are `xgboost`, `lightgbm` and `catboost`. It is unclear which one is the best, but they all offer an API that follows the sklearn API, which means you can swap one library for another by simply importing the right class.

In this notebook we will focus on `xgboost` and `lightgbm`. 

Note: `sklearn` has its own implementation of gradient boosting, `GradientBoostingClassifier`, which you can import from `sklearn.ensemble`. It is usually slower to train and not as performant as the other libraries, but it builds on sklearn's own decision trees and gives us access to them, so it can be a good choice if you want to debug your models using the methods available on decision trees.

## XGBoost

XGBoost has an implementation of gradient boosting that follows the sklearn API, so we can later use it inside `GridSearchCV` as we have done before with sklearn algorithms. To do so, import `XGBClassifier` from `xgboost.sklearn`.

Here are the main parameters we will want to tune for XGBoost, with the values we will start with:

- max_depth=15
- min_child_weight=1
- n_estimators=20


- subsample=1.
- colsample_bytree=1.
- learning_rate=1.

XGBoost has a way to give more weight to the minority class as well. It does not automatically compute the right weight though, so we need to pass it explicitly. Check the documentation for the parameter `scale_pos_weight` and assign it the right value:
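A common starting point, suggested in the XGBoost documentation, is the ratio of negative to positive training examples. The sketch below computes it on a hypothetical label vector with 90 negatives and 10 positives.

```python
import numpy as np

# Hypothetical training labels: 90 negatives, 10 positives.
y_train = np.array([0] * 90 + [1] * 10)

n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())

# Ratio of negatives to positives: each positive example counts this many times.
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 9.0
```

The resulting number would then be passed as the `scale_pos_weight` argument when creating the `XGBClassifier`.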

Train your model on the training set:

Check the accuracy of your model both on the training set and test set. Also check the classification report on the test set. 

You can import those functions from `sklearn.metrics`:
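As an illustration of the two functions, the sketch below scores a hand-written prediction vector. The labels are made up; with a fitted model you would obtain `y_pred` from `model.predict(X_test)` instead.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true labels and predictions, for illustration only.
y_test = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Fraction of predictions that match the true label.
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)

# Per-class precision, recall and F1, which matter more than accuracy on imbalanced data.
report = classification_report(y_test, y_pred)
print(report)
```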

What do you observe?

Here our model seems to overfit a lot. That's because our trees are too complex **AND/OR** we are building too many of them for the learning rate we have (remember that the model is built sequentially: every new tree corrects the error of the previous ones, and the amplitude of the correction is given by the learning rate).
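To make the role of the learning rate concrete, here is a deliberately simplified boosting loop. It is a sketch for squared-error regression built on sklearn's `DecisionTreeRegressor`, not XGBoost's actual algorithm; the `boost` helper, the synthetic sine data and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)


def boost(X, y, n_estimators, learning_rate, max_depth=2):
    """Fit trees sequentially on the residuals; return the training MSE after each tree."""
    prediction = np.full_like(y, y.mean())  # start from a constant model
    errors = []
    for _ in range(n_estimators):
        residuals = y - prediction  # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        # The learning rate scales how much of the new tree's correction is applied.
        prediction = prediction + learning_rate * tree.predict(X)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return errors


errors_fast = boost(X, y, n_estimators=20, learning_rate=1.0)
errors_slow = boost(X, y, n_estimators=20, learning_rate=0.1)
print(errors_fast[-1], errors_slow[-1])
```

The training error keeps shrinking as trees are added, which is why deeper trees, more trees, or a larger learning rate all push the model towards fitting noise in the training set.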

Let's add more constraints on our tree and see what it changes.

Set `max_depth`=7 and `min_child_weight`=5. You can use `set_params` on your model to overwrite parameters.

Re-train your model and check accuracy on both train and test set as well as the classification report on the test set:

Right, it looks like for those simple trees and at the given learning rate, we do not overfit anymore. Let's try to add more trees and see what happens.

Set `n_estimators` to 100 (yes, that's probably too much):

Re-train and measure accuracy and classification report again:

Looks like we are overfitting again... Now our trees are not that complicated (not too much variance), but by adding too many of them our model still starts to overfit. It's an important thing to keep in mind: for a given learning rate, we need to find the best constraints on the trees and the best number of trees **together**, since the best number of trees depends on how complex our trees are.

We can "easily" do so by using grid search.

Create a new `GridSearchCV` object that finds the best combination of `max_depth`, `min_child_weight` and `n_estimators`:
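The shape of such a search is sketched below. To keep the example dependent on sklearn only, it uses `GradientBoostingClassifier` as a stand-in, with `min_samples_leaf` playing a role roughly comparable to `min_child_weight`. The synthetic data, the grid values, the `f1` scoring and `cv=3` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives) standing in for X_train, y_train.
X_train, y_train = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Tree constraints and number of trees are searched jointly.
param_grid = {
    "max_depth": [2, 4],
    "min_samples_leaf": [1, 5],  # stand-in for XGBoost's min_child_weight
    "n_estimators": [10, 30],
}

grid = GridSearchCV(
    GradientBoostingClassifier(learning_rate=1.0, random_state=0),
    param_grid,
    scoring="f1",  # assumption: F1 on the positive class, because of the imbalance
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```

With `XGBClassifier` as the estimator, the grid keys would be `max_depth`, `min_child_weight` and `n_estimators`, and the rest of the code would stay the same.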

What are your best parameters:

Re-train your model with the given parameters and check accuracy and classification report:

There are two other important parameters that we haven't tuned so far:

- `subsample` that defines the ratio of rows to use for each different tree
- `colsample_bytree` that defines the ratio of columns to use for each different tree

Those two parameters allow us to force our trees to learn different things in the data and thus can provide a boost in accuracy. Ideally we would tune them together with the other parameters, but that would require much more computation, so here we will tune them given the best parameters we already found.

Create a new grid search that will find the best combination of `subsample` and `colsample_bytree`. Both are expressed as ratios (between 0 and 1).

What are your best parameters:

Re-train your model with the given parameters and check accuracy and classification report:

Great. The last parameter that we haven't changed is the learning rate. Here a lower learning rate will give us more granularity whilst correcting the error, meaning that it will take more trees but we hope to get a better accuracy before we start to overfit. 

Ideally when decreasing the learning rate we would re-tune all parameters, but here we are taking shortcuts and assume the best parameters for our trees are still valid and focus only on getting the new number of trees right:

Set `learning_rate` to .1 and run a grid search with different values of `n_estimators` to find the new best number of trees:

What is your best `n_estimators`?:

Re-train your model with the given parameters and check accuracy and classification report:

### Plot feature importance

XGBoost comes with a built-in function to plot the importance of your features. For boosting, however, this is a less stable metric, since we cannot simply average feature importance over trees (only the first tree is actually meant to model the initial data, whereas later trees correct the errors of the previous ones). So take it with a pinch of salt!

Import `plot_importance` from `xgboost` and pass your `XGBClassifier` model to it:

## Using LightGBM

LightGBM is a more recent boosting library, released by Microsoft. It provides a sklearn API that allows us to easily use it within a `Pipeline` object.

Note: LightGBM can handle categorical features natively, which has the added advantage that one hot encoding is not needed. This is available through its more complex training API, which isn't compatible with sklearn, and in recent versions also through the sklearn API (for example via the `categorical_feature` argument of `fit`). Here we simply reuse the one hot encoded features prepared above.

Create a new `LGBMClassifier`. You can import it directly from `lightgbm`. Keep the default parameters for now, apart from:
- `class_weight` which allows us to take the imbalance into account, set it to `balanced`
- `subsample` set it to .8
- `colsample_bytree` set it to .8


Train it and check accuracy and classification report:

Nice, let's try to tune it now.

Create a new grid search to find the best combination of `max_depth`, `n_estimators`, `min_child_samples`:

What are your best parameters?

Re-train your model with the given parameters and check accuracy and classification report:

Here again we could tune `subsample`, `colsample_bytree` and decrease the `learning_rate`, but for the purpose of this notebook we can stop here. A last thing we can do though, is checking the feature importance:

In [None]:
import lightgbm

plt.figure(figsize=(20, 10))
lightgbm.plot_importance(lgb, ax=plt.gca())