# Decision Trees and Random Forest


## Exercise: Decision Trees

We use the breast cancer dataset from scikit-learn. The goal is to classify each sample as malignant or benign (a binary classification task) based on features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.


### Load the libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier, plot_tree
%matplotlib inline
np.random.seed(1)
plt.figure(figsize=(30,30))


### Load the data

In [None]:
# Load data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
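
Before fitting anything, it can help to look at what was loaded. The following is a minimal sketch, not part of the exercise solution; it only uses attributes of the object returned by `load_breast_cancer`. The position of each name in `target_names` is the integer code used in `y`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Number of samples and number of features
print(X.shape)
# Class labels; the index of each label is its integer code in y
print(cancer.target_names)
# Number of samples per class, in the order of target_names
print(np.bincount(y))
# A few of the feature names
print(cancer.feature_names[:5])
```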


### 1. Model fitting

In this exercise you need to do the following:
- Split the data into a training and a test set, holding out 30% of the full dataset as the test set.

- Fit a decision tree classifier to the training data and visualize it.

- Make predictions for the test set.

- Evaluate the model's performance by computing the accuracy score and plotting the confusion matrix. 

#### Hints: 
Decision Trees: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Tree Plot: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

Confusion matrix plot: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay
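
The workflow listed above (split, fit, predict, evaluate) can be sketched on a small synthetic dataset so that the exercise itself stays open. Everything named `*_demo` and the choice of `make_classification`, `max_depth=3`, and `random_state=0` are assumptions made for illustration only. The sketch prints a text view of the tree with `export_text` instead of plotting it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hold out 30% of all samples as the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit a shallow tree on the training data only
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_demo.fit(X_tr, y_tr)

# Predict on the held-out data and evaluate
y_pred = tree_demo.predict(X_te)
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print(export_text(tree_demo))
print(f"test accuracy: {acc:.3f}")
print(cm)
```

In the exercise, `plot_tree` and `ConfusionMatrixDisplay` from the hints above give graphical versions of the last two outputs.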


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Apply a decision tree classifier to the data and visualize your decision tree
#### START YOUR SOLUTION HERE ####
# Split the data into training and test set

# fit model 

# Plot the fitted tree

# compute predictions for test set

# Compute the accuracy score

# Compute the confusion matrix

# Plot the confusion matrix

#### END YOUR SOLUTION HERE ####

### Tuning tree depth with grid search CV
Tune the tree depth parameter (`max_depth`) using grid search cross-validation. Try depth values from 1 to 10.
- What is the optimal tree depth and its corresponding test accuracy score?

- Plot the tree with the optimal depth parameter.

- What is the CV accuracy for the best parameter (tree depth)?
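
As a rough illustration of where these three answers come from, here is a sketch of `GridSearchCV` on synthetic data. The data, the three-value grid, and the `*_demo` names are assumptions chosen to keep the example small; the exercise asks for depths 1 to 10 on the breast cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Small hypothetical grid of candidate depths
param_grid = {"max_depth": [2, 4, 6]}

# 5-fold cross-validation on the training data for every candidate depth
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# Best depth, its mean CV accuracy, and the test accuracy of the refitted model
print(search.best_params_)
print(f"CV accuracy:   {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_te, y_te):.3f}")
```

With the default `refit=True`, `search.best_estimator_` is the tree retrained on the whole training set with the best depth, which is the object to pass to `plot_tree`.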

In [None]:
# Grid Search - tuning tree depth
from sklearn.model_selection import GridSearchCV

#### START YOUR SOLUTION HERE ####
# Define grid for the parameter to test - max_depth from 1 to 10

# Define and fit model using grid search CV with 5-fold cross validation

# Plot the fitted tree

# Print results

#### END YOUR SOLUTION HERE ####

## Exercise: Random Forest
Now we train a random forest model on the same dataset (for the same task), using the same train/test split.
- Apply a random forest classifier with 100 trees to the data.
- Compute and print the training and test accuracies and compare them to the out-of-bag (OOB) score (hint: set `oob_score=True` in the classifier).

#### Hints:
Random Forest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

OOB: https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
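
The OOB score evaluates each training sample using only the trees whose bootstrap sample did not contain it, so it estimates generalization accuracy without touching the test set. A minimal sketch of the three scores, again on synthetic data (the data, `n_estimators=50`, and the `*_demo` names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# oob_score=True makes the forest compute the out-of-bag accuracy during fit
rf_demo = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
rf_demo.fit(X_tr, y_tr)

train_acc = rf_demo.score(X_tr, y_tr)  # accuracy on data the trees have seen
test_acc = rf_demo.score(X_te, y_te)   # accuracy on held-out data
oob_acc = rf_demo.oob_score_           # accuracy on out-of-bag training samples

print(f"train: {train_acc:.3f}  test: {test_acc:.3f}  OOB: {oob_acc:.3f}")
```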


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

#### START YOUR SOLUTION HERE ####
# fit model 

# compute predictions for the training and test sets

# compute the accuracy scores (test, training and OOB)


# print the computed scores

# Compute the confusion matrix

# Plot the confusion matrix 

#### END YOUR SOLUTION HERE ####





### Tune the number of trees parameter using grid search

Use grid search CV (5 folds) to find the best number of trees (`n_estimators`) using a grid from 100 to 1000 with a step of 100. Print the best number of trees, its cross-validation accuracy score, and the corresponding test accuracy score.

In [None]:
#### START YOUR SOLUTION HERE ####
# Define the grid for the number of trees

# Do a grid search to find the optimal number of trees

# print the best hyperparameter

# print the cross-validation accuracy score of the best model

# print the test accuracy score

#### END YOUR SOLUTION HERE ####

### Importance plot
Use permutation importance to compute the feature importances for the best model from the grid search CV.

#### Hints:
Forest importances: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
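
Permutation importance shuffles one feature at a time and records how much the model's score drops. The call pattern can be sketched as follows, on synthetic data and with a small forest; the data, `n_estimators=50`, `n_repeats=10`, and the `*_demo` names are assumptions for illustration rather than the settings the exercise requires.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out data and measure the score drop
result = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=10, random_state=0
)

# Sort features from most to least important
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(
        f"feature {i}: {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}"
    )
```

`result.importances_std` can serve as error bars when the sorted means are drawn as a bar plot.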


In [None]:
# retrieve the relative importance of each variable and visualize the importance plot
from sklearn.inspection import permutation_importance

#### START YOUR SOLUTION HERE ####
# get the best model from the grid search CV

# compute the feature importances using permutation test

# sort them

# plot the importances

#### END YOUR SOLUTION HERE ####

Below we use the `feature_importances_` attribute of the random forest model selected by the grid search, which quantifies feature importance as the mean decrease in impurity (MDI). These scores, however, can be misleading for continuous and other high-cardinality features.

In [None]:
# best_rf_model is the best estimator from the grid search (defined in the solution cell above)
# get the feature importances from the fitted model
importances = best_rf_model.feature_importances_
# get the standard deviations across the individual trees
std = np.std([tree.feature_importances_ for tree in best_rf_model.estimators_], axis=0)
# sort in descending order; the same order is applied to std so the error bars stay aligned
order = np.argsort(importances)[::-1]
# put them in a pandas series
forest_importances = pd.Series(importances[order], index=cancer.feature_names[order])

# plot them
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std[order], ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
