# Bagging - Random Forest

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

**Note**: If you have already completed the Decision Tree notebook, the preprocessing steps here are the same. Feel free to copy and paste your answers from the previous notebook (or from the solutions) and jump straight to the Random Forest part.

### The Dataset

The dataset can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It consists of data from marketing campaigns of a Portuguese bank. We will try to build a classifier that can predict whether or not the client targeted by the campaign ended up subscribing to a term deposit (column `y`).

Load the file `data/bank-marketing.zip` with pandas and check the distribution of the target `y`. Here the separator is `';'` instead of a comma.

The dataset is imbalanced, so we will need to keep that in mind when building our models!

Now split the data into the feature matrix `X` (all features except `y`) and the target vector `y` making sure that you convert `yes` to `1` and `no` to `0`.
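As a rough illustration of the loading and splitting steps above, here is a minimal sketch on a tiny in-memory table. The rows and the names `toy`, `X_toy` and `y_toy` are made up for the example; with the real file you would pass its path to `pd.read_csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the real file; note the ';' separator
raw = io.StringIO(
    "age;job;duration;y\n"
    "30;admin.;120;no\n"
    "45;technician;300;yes\n"
    "52;retired;80;no\n"
    "38;services;45;no\n"
)
toy = pd.read_csv(raw, sep=";")

# Distribution of the target, as proportions of each class
print(toy["y"].value_counts(normalize=True))

# Feature matrix: every column except the target
X_toy = toy.drop(columns="y")

# Target vector: 'yes' becomes 1 and 'no' becomes 0
y_toy = toy["y"].map({"yes": 1, "no": 0})
```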

In [None]:
# Get X, y


Here is the list of features in our X matrix:

```
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
8. contact: contact communication type (categorical: 'cellular','telephone') 
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric) 
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
```

Note the comment about the `duration` feature. We will exclude it from our analysis.

Drop `duration` from X:

Now we can check the types of all our features. Some are categorical whilst others are numerical. We will keep two lists, one for each type, so we can preprocess them differently.

In [None]:
X.dtypes

In [None]:
# "default", "housing" and "loan" have a third class "unknown", so we treat them as non-binary categorical features
num_features = ["age", "campaign", "pdays", "previous", "emp.var.rate", 
                "cons.price.idx", "cons.conf.idx","euribor3m", "nr.employed"]

cat_features = ["job", "marital", "education","default", "housing", "loan",
                "contact", "month", "day_of_week", "poutcome"]

### Visualise the numerical features

* show a boxplot of the numerical features

The features are not on the same scale. That is fine for tree-based methods, as we've seen in the course, so we do not need to do any scaling here!

### One Hot Encoding on Categorical Features

In order to make sure our dataset contains only numbers, we will need to transform our categorical features into one-hot encoded features. To do so, first use `pd.get_dummies` on your dataframe (select only the categorical features) to generate the new columns. Assign the new dataframe to a variable `X_categorical`.

Great, now we can create `X_processed` using `pd.concat` (check the documentation, you will need to specify the right axis). Here we want to concatenate a dataframe containing only our numerical features with the `X_categorical` we created above:
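The two steps above can be sketched on a toy dataframe. The data and the names `toy`, `toy_num`, `toy_cat`, `toy_categorical` and `toy_processed` are hypothetical and only mirror the structure of the exercise.

```python
import pandas as pd

# Hypothetical mini-dataset with one numerical and two categorical columns
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "married"],
    "contact": ["cellular", "telephone", "cellular"],
})
toy_num = ["age"]
toy_cat = ["marital", "contact"]

# One indicator column per category value
toy_categorical = pd.get_dummies(toy[toy_cat])

# axis=1 concatenates column-wise, so the rows stay aligned on the index
toy_processed = pd.concat([toy[toy_num], toy_categorical], axis=1)
print(toy_processed.columns.tolist())
```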

### Split data into training set and test set

Split your data (use `X_processed`) into a training set and a test set. Here we are dealing with an imbalanced dataset, so it is important to enforce stratification. We will use the `stratify` argument of `train_test_split` to do so (check the documentation).
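To see what stratification does, here is a small sketch on synthetic data with 10% positives. The arrays and variable names are assumptions made for the illustration, not the bank dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 20 positives out of 200 rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array([1] * 20 + [0] * 180)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```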

Great, we're ready to start training our random forest!

# Random Forest

Now that we have preprocessed our data, we can train a Random Forest on it. For that, we will import the `RandomForestClassifier` from `sklearn.ensemble`

For now we will make our Random Forest bad on purpose by deactivating some important parameters to better see their impact.

Create a new Random Forest where the `RandomForestClassifier` has the following parameters:
- `max_depth=3`
- `min_samples_split=0.1`
- `n_estimators=15`
- `max_features=None`
- `bootstrap=False`

We will explain the role of those parameters step by step in this notebook.
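For reference, a forest with the parameters listed above could be built and fitted roughly as follows. This is only a sketch on synthetic data from `make_classification`; `rf_demo`, `X_demo`, `y_demo` and the fixed `random_state` are assumptions added for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the preprocessed training set
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Deliberately handicapped forest: shallow trees, no bootstrap,
# and every feature considered at every split
rf_demo = RandomForestClassifier(
    max_depth=3,
    min_samples_split=0.1,  # a float is read as a fraction of the samples
    n_estimators=15,
    max_features=None,
    bootstrap=False,
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)
print(len(rf_demo.estimators_))
```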

Train your model on the training set:

Let's check the performance of our newly trained model on the test set. Compute the accuracy score and display the classification report. Both can be found in the `sklearn.metrics` module.
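A possible way to compute both metrics is sketched below on synthetic data. All variable names are assumptions, and `zero_division=0` is only there to silence warnings in case a class is never predicted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=0)
rf_demo.fit(X_tr, y_tr)
y_pred = rf_demo.predict(X_te)

# Overall share of correct predictions
acc = accuracy_score(y_te, y_pred)
print(acc)

# Per-class precision, recall and F1
print(classification_report(y_te, y_pred, zero_division=0))
```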

It's actually not that bad. First thing we notice is that we have much lower performance scores on the class `1`. That's because we do not have many observations in that class so the model focuses on class `0`. We can fix that by using the parameter `class_weight="balanced"`. Use `set_params` to set the class weight of our model to balanced:
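As a small sketch of how `set_params` behaves (the name `rf_demo` is hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier

rf_demo = RandomForestClassifier(max_depth=3, n_estimators=15, random_state=0)

# set_params updates the configuration in place and returns the estimator
rf_demo.set_params(class_weight="balanced")
print(rf_demo.get_params()["class_weight"])
```

Note that `set_params` only changes the configuration. The model has to be fitted again before the new setting has any effect on predictions.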

Train your model again and verify that it has improved the performance on class `1` by printing the classification report:

Better. But let's take a closer look at our ensemble to see if we're doing things right. First thing we can do is access single trees in our forest and take a look at their individual performances.

You can access the list of trees in your ensemble with the attribute `.estimators_` on the random forest.

Extract the first two trees in your ensemble and call them `dt1` and `dt2`:

Now print the classification report for each of the two trees:

Note: You can call `.predict` on a single tree to generate a prediction.
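Besides comparing the two reports, one quick check is to measure how often the two trees agree. The sketch below does this on synthetic data; `tree_a`, `tree_b` and `agreement` are names chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

rf_demo = RandomForestClassifier(
    max_depth=3, n_estimators=15, max_features=None, bootstrap=False, random_state=0
).fit(X_demo, y_demo)

# Each element of estimators_ is a fitted decision tree
tree_a, tree_b = rf_demo.estimators_[:2]

# The sub-trees return class indices as floats; with 0/1 labels these
# coincide with the labels themselves, so a cast to int is enough
pred_a = tree_a.predict(X_demo).astype(int)
pred_b = tree_b.predict(X_demo).astype(int)

# Fraction of samples on which the two trees give the same prediction
agreement = (pred_a == pred_b).mean()
print(agreement)
```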

The two reports seem **extremely** similar. Something does not look right. Let's plot the two trees to debug a bit further. Execute the two cells below to display them. What can you see?

In [None]:
from IPython.display import Image  
from sklearn import tree
import pydotplus

dot_data = tree.export_graphviz(dt1, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                feature_names=X_train.columns,
                                special_characters=True)  

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

In [None]:
dot_data = tree.export_graphviz(dt2, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                feature_names=X_train.columns,
                                special_characters=True)  

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

It looks like our trees are exactly the same. There are three reasons for that:

- We have not used any bootstrap (`bootstrap=False`)
- We have not used any feature subsampling (`max_features=None`)
- We have built trees that are too simple (`max_depth` and `min_samples_split`)

By doing so, we are not benefiting at all from the boost in performance that bagging should bring.

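Now use `set_params` to set `bootstrap=True` and `max_features="sqrt"` (older scikit-learn versions also accepted the equivalent `"auto"` for classifiers, which has since been removed):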
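A rough sketch of this step on synthetic data follows. It uses `max_features="sqrt"`, which is what `"auto"` meant for classifiers before recent scikit-learn releases dropped that value. Listing the feature used at the root of each tree is one quick way to see whether the trees now differ; `root_features` is a name introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

rf_demo = RandomForestClassifier(
    max_depth=3, n_estimators=15, max_features=None, bootstrap=False, random_state=0
)

# Switch on bootstrap sampling and per-split feature subsampling, then refit
rf_demo.set_params(bootstrap=True, max_features="sqrt")
rf_demo.fit(X_demo, y_demo)

# Index of the feature used at the root node of each tree
root_features = [int(t.tree_.feature[0]) for t in rf_demo.estimators_]
print(root_features)
```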

Train your model again:

Print the accuracy and classification report on the test set:

Interesting, our performance has actually decreased here. That's because we have introduced more noise by adding bootstrapping and feature subsampling, so individual trees are more varied... but by construction they are weaker than the optimal tree we built earlier. In general this wouldn't be an issue if the trees were complex enough to compensate for each other's errors. Here our trees are too constrained.

First, let's visualise our first two trees as we did before. Execute the cells below to do so (you might have to change the names of variables if you have used different ones). Are our trees different?

In [None]:
dt1 = rf.estimators_[0]
dt2 = rf.estimators_[1]

print(classification_report(y_test, dt1.predict(X_test)))
print(classification_report(y_test, dt2.predict(X_test)))

In [None]:
dot_data = tree.export_graphviz(dt1, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                feature_names=X_train.columns,
                                special_characters=True)  

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

In [None]:
dot_data = tree.export_graphviz(dt2, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                feature_names=X_train.columns,
                                proportion=True,
                                special_characters=True)  

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

The next step is to find the best trade-off for `max_depth` and `min_samples_split`, so that our trees perform well enough but are also varied enough. For this we will use grid search:

Define a new grid search object checking a few sensible values for `max_depth` and `min_samples_split` and train it on the training set.

Note: Keep in mind here that you want your trees to overfit a little, so do not constrain them too much
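One possible shape for such a search is sketched below. The grid values, `cv=3`, the synthetic data and the names `parameters_demo` and `grid_demo` are illustrative assumptions, not recommended settings for the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Fairly permissive values so that individual trees can overfit a little
parameters_demo = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 10, 50],
}

grid_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=20, class_weight="balanced", random_state=0),
    parameters_demo,
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# Best combination found by cross-validation
print(grid_demo.best_params_)
```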

What are your best parameters?

Now re-train your model using those parameters:

Print accuracy and classification report:

That's getting better! Now let's increase the number of trees by setting `n_estimators`. Try increasing it until you no longer get a performance boost.
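A simple way to explore this is to loop over a few values and record the test accuracy for each. The sketch below uses synthetic data; the candidate values and the `scores` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Test accuracy for an increasing number of trees
scores = {}
for n in [10, 50, 100, 200]:
    model = RandomForestClassifier(
        n_estimators=n, class_weight="balanced", random_state=0
    )
    model.fit(X_tr, y_tr)
    scores[n] = accuracy_score(y_te, model.predict(X_te))

print(scores)
```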

What's your final accuracy?

### Plot feature importance

Since Random Forest relies on Decision Trees, we can access feature importances as well. Here the feature importances are simply averaged over our trees.

With sklearn, you can access it with the attribute `feature_importances_`.

Great, now create a new dataframe where the data is the feature importances you saved above, and the index will be the list of columns from X_train
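As an illustration, the importances can be put into a dataframe like this. The sketch uses synthetic data and made-up column names (`feature_0`, `feature_1`, ...); in the exercise the index would be the columns of `X_train`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
cols = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One row per feature, sorted from most to least important
importances = pd.DataFrame(
    {"importance": rf_demo.feature_importances_}, index=cols
).sort_values("importance", ascending=False)

print(importances.head())
```

Calling `importances.plot.bar()` on the result is one way to get the bar plot.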

Plot it as a bar plot:

What can you observe? What are your main features? 

Compare these with the feature importances of the decision tree alone.

## Optional: how would we optimise for recall?

Grid Search by default will return the parameters that give the best accuracy. But what if we cared more about the recall?

We can override the metric that the grid search uses when comparing two models. The code below converts the `recall_score` function into a scorer object using `make_scorer`. The resulting object can be passed as the `scoring` argument to the grid search.

In [None]:
from sklearn.metrics import recall_score, make_scorer

gridCV = GridSearchCV(rf, parameters, cv=10, n_jobs=-1, scoring=make_scorer(recall_score))

gridCV.fit(X_train, y_train)
gridCV.best_params_

In [None]:
rf.set_params(**gridCV.best_params_)
rf.fit(X_train,y_train)

print(accuracy_score(y_test, rf.predict(X_test)))
print(classification_report(y_test, rf.predict(X_test)))

We've decreased in overall accuracy, but managed to increase the recall (for class 1 by default) a bit more!