# Decision trees

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### The Dataset

The dataset can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing). It consists of data from marketing campaigns of a Portuguese bank. We will try to build a classifier that can predict whether or not the client targeted by the campaign ended up subscribing to a term deposit (column `y`).

Load the file `data/bank-marketing.zip` with pandas and check the distribution of the target `y`. Here the separator is `';'` instead of a comma.

In [None]:
df = pd.read_csv("data/bank-marketing.zip", sep=";")
df.y.value_counts()


The dataset is imbalanced, so we will need to keep that in mind when building our models!
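To quantify the imbalance, it can help to look at class proportions rather than raw counts. The sketch below uses a small made-up Series (`toy_y` is a hypothetical stand-in, not the bank data); on the real data the equivalent call would be `df.y.value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical target with a 9:1 imbalance (not the real bank data)
toy_y = pd.Series(["no"] * 9 + ["yes"] * 1)

# normalize=True returns class proportions instead of raw counts
proportions = toy_y.value_counts(normalize=True)
print(proportions)
```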

Now split the data into the feature matrix `X` (all features except `y`) and the target vector `y`, making sure that you convert `yes` to `1` and `no` to `0`.

In [None]:
# Get X, y
y = df["y"].map({"no":0, "yes":1})
X = df.drop("y", axis=1)


Here is the list of features in our X matrix:

```
1. age (numeric)
2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3. marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4. education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5. default: has credit in default? (categorical: 'no','yes','unknown')
6. housing: has housing loan? (categorical: 'no','yes','unknown')
7. loan: has personal loan? (categorical: 'no','yes','unknown')
8. contact: contact communication type (categorical: 'cellular','telephone') 
9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric) 
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
```

Note the comment about the `duration` feature. We will exclude it from our analysis.

Drop `duration` from X:

In [None]:
X.drop("duration", inplace=True, axis=1)


Now we can check the types of all our features. Some are categorical whilst others are numerical. We will keep two lists, one for each type, so we can preprocess them differently.

In [None]:
X.dtypes

In [None]:
# "default", "housing" and "loan" have a third class, "unknown", so we treat them as non-binary categorical features
num_features = ["age", "campaign", "pdays", "previous", "emp.var.rate", 
                "cons.price.idx", "cons.conf.idx","euribor3m", "nr.employed"]

cat_features = ["job", "marital", "education","default", "housing", "loan",
                "contact", "month", "day_of_week", "poutcome"]

### Visualise the numerical features

* show a boxplot of the numerical features

In [None]:
plt.figure(figsize=(20, 10))
sns.boxplot(data=X[num_features], ax=plt.gca())


The features aren't on the same scale. That is fine for tree-based methods, as we've seen in the course: each split compares a single feature against a threshold, so we do not need to do any scaling here!
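As a quick sanity check of that claim, the sketch below trains two trees on synthetic data, one on the raw features and one on rescaled features, and compares their predictions. The data, the names (`X_toy`, `scale`) and the scaling factors are all hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Hypothetical integer-valued features and a label that depends on two of them
X_toy = rng.randint(0, 50, size=(300, 3)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 50).astype(int)
X_new = rng.randint(0, 50, size=(100, 3)).astype(float)

# Rescale each column by a different positive factor
# (powers of two keep the floating-point arithmetic exact)
scale = np.array([1.0, 1024.0, 8.0])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_toy * scale, y_toy)

# Fraction of new points on which the two trees make the same prediction
agreement = np.mean(tree_raw.predict(X_new) == tree_scaled.predict(X_new * scale))
print(agreement)
```

Because rescaling a feature by a positive factor preserves the ordering of its values, the candidate splits partition the samples in the same way, so the two trees should agree.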

### One Hot Encoding on Categorical Features

To make sure our dataset contains only numbers, we need to transform our categorical features into one-hot encoded features. To do so, first use `pd.get_dummies` on your dataframe (select only the categorical features) to generate the new columns. Assign the new dataframe to a variable `X_categorical`.

In [None]:
X_categorical = pd.get_dummies(X[cat_features])
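To see what `pd.get_dummies` produces, here is a minimal sketch on a tiny hand-made dataframe. The three rows are invented; only the column names and category labels are borrowed from the feature list above.

```python
import pandas as pd

# Tiny hypothetical dataframe with two categorical columns
toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["failure", "success", "nonexistent"],
})

# One new indicator column per (column, category) pair, named "<column>_<category>"
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

Each original column is replaced by as many indicator columns as it has distinct categories, so every row has exactly one active indicator per original column.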


Great, now we can create `X_processed` using `pd.concat` (check the documentation, you will need to specify the right axis). Here we want to concatenate a dataframe containing only our numerical features with the `X_categorical` dataframe we created above:

In [None]:
X_processed = pd.concat([X[num_features], X_categorical], axis=1)


### Split data into training set and test set

Split your data (use `X_processed`) into a training set and a test set. Here we are dealing with an imbalanced dataset, so it is important to enforce stratification. We will use the argument `stratify` from `train_test_split` to do so (check the documentation).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=.3, random_state=42, stratify=y)
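The effect of `stratify` can be checked on a toy example. The sketch below uses a made-up target with 10% positives (`X_toy` and `y_toy` are hypothetical) and compares the share of positives in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original share of positives
print(y_tr.mean(), y_te.mean())
```

On the notebook's data, the same check would be `y_train.mean()` against `y_test.mean()`.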


## Train a Decision Tree

Now that our data is preprocessed and split, we can train a decision tree on it. For that we can use a `DecisionTreeClassifier` from `sklearn.tree`.

For now we will keep our tree unconstrained with:
- `max_depth=None`
- `min_samples_split=2`


Create a new decision tree:

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth=None, min_samples_split=2)


Now fit your model on the training set:

In [None]:
dtc.fit(X_train, y_train)


Execute the cell below to display your tree in the notebook. What do you observe?

Note: if you get an error about `pydotplus` or `graphviz`, try to run the following code in your terminal:

```
conda install python-graphviz
conda install -c conda-forge pydotplus
```

In [None]:
from IPython.display import Image  
from sklearn import tree
import pydotplus

dot_data = tree.export_graphviz(dtc, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                special_characters=True, feature_names=X_train.columns)

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())
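If `graphviz` is not available, a plain-text rendering is one possible fallback. The sketch below uses `sklearn.tree.export_text` on synthetic data; the data and the feature names `f0` to `f3` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# export_text returns the tree's decision rules as an indented string
rules = export_text(toy_tree, feature_names=feature_names)
print(rules)
```

For the tree in this notebook, a call along the lines of `export_text(dtc, feature_names=list(X_train.columns), max_depth=3)` should print the first few levels.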

Compute the accuracy of your model on the training data and then on the test data. What can you tell?

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))
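To make the train/test comparison concrete, here is a minimal sketch on synthetic noisy data (not the bank data; all names are hypothetical). An unconstrained tree can memorise the training set, noise included, so its training accuracy is typically much higher than its test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where 20% of the labels are assigned at random (label noise)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Unconstrained tree: grows until every training leaf is pure
toy_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, toy_tree.predict(X_tr))
test_acc = accuracy_score(y_te, toy_tree.predict(X_te))
print(train_acc, test_acc)
```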


Now let's investigate a bit more by looking at the `classification_report` (you can import it from `sklearn.metrics`). That will provide us with more information about precision and recall on both our classes.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, dtc.predict(X_test)))


It looks like our model is predicting the majority class `0` (no) really well, which leads to a high accuracy, but we're really bad at predicting the class `1`, which corresponds to successful campaigns, and is of interest here!
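The gap between accuracy and minority-class recall can be illustrated with a trivial baseline that always predicts `0`. This is a sketch on a made-up target with 10% positives, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced target: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# Baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
rec = recall_score(y_true, y_pred)    # share of actual positives that were found
print(acc, rec)
```

The baseline is right on every negative example yet finds none of the positives, which is why accuracy alone is a poor guide on imbalanced data.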

At this stage we've found two major issues with our model:

- It largely overfits
- It focuses on the majority class

With our decision tree, we can address both. 

- For the first one we will need to tune `max_depth` and `min_samples_split`. 
- For the second one, we will set `class_weight='balanced'` so that it automatically gives more weight to our minority class as a way to compensate.
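To get a feel for what `class_weight='balanced'` does, the sketch below computes the weights on a made-up target (`y_toy` is hypothetical). scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`; the helper `compute_class_weight` applies the same rule.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# Documented heuristic: n_samples / (n_classes * np.bincount(y))
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Same computation through scikit-learn's helper
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
print(dict(zip(classes, weights)))
```

The rarer a class is, the larger its weight, so errors on the minority class cost more when the tree chooses its splits.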

Let's use more constraining values for `max_depth` and `min_samples_split`, say `6` and `20` respectively. To change the parameters of your tree, you can call `set_params` on it with the names and values you want to update (for example `max_depth=6`).

Don't forget to re-train the tree after updating its parameters.

In [None]:
dtc.set_params(max_depth=6, min_samples_split=20)
dtc.fit(X_train, y_train)
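The sketch below illustrates why the re-fit matters, using synthetic data and a hypothetical `toy_tree`. `set_params` only changes the estimator's configuration; the fitted tree is rebuilt on the next call to `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (not the bank data)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
depth_before = toy_tree.get_depth()

# Update the configuration, then re-fit so the constraints take effect
toy_tree.set_params(max_depth=6, min_samples_split=20)
toy_tree.fit(X_toy, y_toy)
depth_after = toy_tree.get_depth()

print(depth_before, depth_after)
```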


Right, now that the tree is re-trained, let's check the accuracy first (on both the train and test sets). Is it better?

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))


We can also visualise our tree:

In [None]:
from IPython.display import Image  
from sklearn import tree
import pydotplus

dot_data = tree.export_graphviz(dtc, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                special_characters=True, feature_names=X_train.columns)  

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

That's a simpler tree!

Let's take a look at the classification report now:

In [None]:
print(classification_report(y_test, dtc.predict(X_test)))


Still doing really badly on the class `1`. Try to set the parameter `class_weight` to `balanced` and retrain your tree:

In [None]:
dtc.set_params(class_weight="balanced")
dtc.fit(X_train, y_train)


Check the classification report again:

In [None]:
print(classification_report(y_test, dtc.predict(X_test)))


That's much better!

### Use Grid Search to find the optimal parameters

Now that we've observed the impact of various parameters, we can trigger a grid search to find the optimal ones.

Define a new `parameters` dictionary that contains all the values you want to try for `max_depth` and `min_samples_split`. Then define a new `GridSearchCV` object and find the best parameters.

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = [{"max_depth": [3, 4, 7], "min_samples_split": [5, 10, 20]}]

gridCV = GridSearchCV(dtc, parameters, cv=10, n_jobs=-1)

gridCV.fit(X_train, y_train)
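By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for a classifier is accuracy. On an imbalanced dataset, one option (an assumption on our part, not something this exercise requires) is to pass a `scoring` argument such as `"f1"`. The sketch below does this on synthetic data; `toy_tree`, `param_grid` and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

toy_tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)
param_grid = {"max_depth": [3, 4, 7], "min_samples_split": [5, 10, 20]}

# scoring="f1" ranks candidates by the F1 score of class 1 instead of accuracy
search = GridSearchCV(toy_tree, param_grid, cv=5, scoring="f1")
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

With the default `refit=True`, the best model is also re-trained on the full training data and exposed as `search.best_estimator_`.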


What are your best parameters?

In [None]:
gridCV.best_params_


Great, now we can re-train our model using those parameters. Set the parameters of your tree to be the best ones given by the grid search, and train your model again:

In [None]:
dtc.set_params(**gridCV.best_params_)
dtc.fit(X_train, y_train)


Display your final tree:

In [None]:
from IPython.display import Image  
from sklearn import tree
import pydotplus

dot_data = tree.export_graphviz(dtc, 
                                out_file=None, 
                                filled=True, 
                                rounded=True,
                                max_depth=6,
                                proportion=True,
                                special_characters=True, feature_names=X_train.columns)

graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

Compute its accuracy on the train and test sets:

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_train, dtc.predict(X_train)))
print(accuracy_score(y_test, dtc.predict(X_test)))


Finally check the classification report:

In [None]:
print(classification_report(y_test, dtc.predict(X_test)))


### Plot feature importance

Decision Trees have the advantage of providing a feature importance, a score allowing you to rank all features by their importance for your model when predicting the outcome. With sklearn, you can access it with the attribute `feature_importances_`.

Take a look at the `feature_importances_` attribute:

In [None]:
dtc.feature_importances_


That's hard to read. The array gives one number for each column in our training set, in the same order. A better way to visualise it is to put it in a table, so let's do that.

Create a new dataframe where the data is the feature importances shown above, and the index is the list of columns from our training data:

In [None]:
importances_df = pd.DataFrame(dtc.feature_importances_, columns=["importance"], index=X_train.columns)
importances_df.sort_values("importance", ascending=False).head()
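The sketch below pairs importances with feature names on synthetic data (the names `f0` to `f4` are hypothetical). In scikit-learn, a feature's importance is the normalised total reduction of the split criterion brought by that feature, so the values are non-negative and sum to 1 for a tree with at least one split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical feature names
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Features the tree never splits on get an importance of exactly 0.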


Plot it as a bar plot:

In [None]:
importances_df.sort_values("importance", ascending=False).plot(kind="bar", figsize=(20,7))


What's your most important feature?