***
**Introduction to Machine Learning** <br>
__[https://slds-lmu.github.io/i2ml/](https://slds-lmu.github.io/i2ml/)__
***

# Exercise sheet 9: Random Forests

In [None]:
#| label: import
# Consider the following libraries for this exercise sheet:

# general
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.sparse import dok_matrix
# plots
import matplotlib.pyplot as plt
# sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

## Exercise 2: Classifying `spam`

> a) Take a look at the spam dataset and briefly describe what kind of classification problem this is.

<div class="alert alert-block alert-info">
<b>Hint:</b> Read <a href="https://github.com/slds-lmu/lecture_i2ml/blob/master/exercises/data/spam.csv"><code>spam.csv</code></a>
</div>

In [None]:
# Enter your code here:

> b) Use a decision tree to predict `spam`. Re-fit the tree using two random subsets of the data (each comprising 60% of observations). How stable are the trees?

<div class="alert alert-block alert-info">
<b>Hint:</b> Use <code>from sklearn.tree import plot_tree</code> to visualize the trees.
</div>
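
One way to set up such a stability check is sketched below. It uses a synthetic data set from `make_classification` as a stand-in for `spam`; the data, the helper name `fit_on_subset`, and all parameter values are assumptions made for illustration. The sketch compares the root split of two trees fitted on different 60% subsets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam data (assumption, not the real data set)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

def fit_on_subset(X, y, frac, seed):
    """Fit a shallow tree on a random subset containing `frac` of the rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_sub = int(frac * X.shape[0])
    idx = rng.choice(X.shape[0], size=n_sub, replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[idx], y[idx])
    return tree, idx

tree_a, idx_a = fit_on_subset(X, y, frac=0.6, seed=1)
tree_b, idx_b = fit_on_subset(X, y, frac=0.6, seed=2)

# tree_.feature[0] and tree_.threshold[0] describe the split at the root node
root_a = (tree_a.tree_.feature[0], tree_a.tree_.threshold[0])
root_b = (tree_b.tree_.feature[0], tree_b.tree_.threshold[0])
print("root split, subset A:", root_a)
print("root split, subset B:", root_b)
```

If the two trees already disagree on the root feature or threshold, everything below the root differs as well, which is one simple way to argue about (in)stability. Plotting both trees with `plot_tree` gives the full picture.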

In [None]:
# Enter your code here:

> c) Forests come with a built-in estimate of their generalization ability via the out-of-bag (OOB) error. <br>
 >> i) Show that the probability for an observation to be OOB in an arbitrary bootstrap sample converges to $\frac{1}{e}$. <br>
 >> ii) Use the random forest learner (`RandomForestClassifier()`) to fit the model and state the out-of-bag (OOB) error.
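
The limit in i) can be checked numerically before proving it. The sketch below assumes a bootstrap sample of size $n$ drawn with replacement from $n$ observations, so that a fixed observation is missed in all $n$ draws with probability $(1 - \frac{1}{n})^n$. The function name `oob_probability` is made up for illustration.

```python
import math

def oob_probability(n):
    """Probability that a fixed observation is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

# The values approach 1/e from below as n grows
for n in [10, 100, 1000, 100000]:
    print(n, oob_probability(n))

print("1/e =", 1 / math.e)
```

This is only a sanity check, not a proof; the exercise still asks for the limit argument.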


i) Enter your answer here:


In [None]:
# ii) Enter your code here:

> d) You are interested in which variables have the greatest influence on the prediction quality. Explain how to determine this with a permutation-based approach and compute the importance scores for the `spam` data.

<div class="alert alert-block alert-info">
<b>Hint:</b> Choose an adequate importance measure as described in <br>
    <a href="https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html"><code>https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html</code></a>.
</div>
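
As a rough illustration of the permutation idea, the sketch below shuffles one feature at a time on held-out data and records the resulting drop in accuracy via `permutation_importance`. The data set is a synthetic stand-in from `make_classification`, and the split ratio, forest size, and number of repeats are arbitrary assumptions, not values prescribed by the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption, not the real data set)
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the drop in accuracy
result = permutation_importance(rf_demo, X_test, y_test, n_repeats=5, random_state=0)

# Features sorted from most to least important
order = np.argsort(result.importances_mean)[::-1]
for j in order:
    print(f"feature {j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")
```

Evaluating on held-out data rather than on the training data is a design choice here: it measures how much each feature contributes to generalization rather than to the fit.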

In [None]:
# Enter your code here:

## Exercise 3: Proximities


You solve the `wine` task, predicting the `type` of a wine (one of $3$ classes) from a number of covariates. After
training, you wish to determine how similar your observations are in terms of proximities. <br>
Consider the following subset of the training data and the random forest model given below:

In [None]:
wine_data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header = None)
wine_data = wine_data.iloc[:, [0,4,1,7,11,2,6]]
wine_data.columns = ["type", "alcalinity", "alcohol", "flavanoids", "hue", "malic", "phenols"]
wine_data_sub = wine_data.iloc[[13, 169, 49],:]

wine_data_sub.head()

| | type | alcalinity | alcohol | flavanoids | hue | malic | phenols |
|---|---|---|---|---|---|---|---|
| 13 | 1 | 11.4 | 14.75 | 3.69 | 1.25 | 1.73 | 3.1 |
| 169 | 3 | 25.0 | 13.4 | 0.96 | 0.67 | 4.6 | 1.98 |
| 49 | 1 | 17.4 | 13.94 | 3.54 | 1.12 | 1.73 | 2.88 |


In [None]:
X_wine = wine_data.copy()  # copy first: pop() below modifies the DataFrame in place
y_wine = X_wine.pop("type")
X_wine_sub = wine_data_sub.copy()
y_wine_sub = X_wine_sub.pop("type")

# Train a Random Forest classifier
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=14)
rf.fit(X_wine, y_wine)

leaf_indices = rf.apply(X_wine_sub)
print(leaf_indices)

[[6 4 5]
 [6 1 6]
 [6 4 5]]


Using the `rf.apply(X_wine_sub)` output, we can see which terminal node (leaf) each sample ends up in for each tree.
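
To show how leaf indices of this kind can be turned into proximities, here is a minimal sketch on a made-up `leaves` array (four observations, two trees), not on the wine output above. It assumes the usual definition of proximity as the share of trees in which two observations land in the same terminal node.

```python
import numpy as np

# Hypothetical output of rf.apply(): rows = observations, columns = trees
leaves = np.array([[2, 5],
                   [2, 3],
                   [4, 5],
                   [2, 5]])

# Compare every pair of rows tree by tree via broadcasting:
# same_leaf[i, j, t] is True if observations i and j share a leaf in tree t
same_leaf = leaves[:, None, :] == leaves[None, :, :]

# Proximity = share of trees in which the two observations land in the same leaf
proximity = same_leaf.mean(axis=2)
print(proximity)
```

The result is a symmetric matrix with ones on the diagonal, since every observation shares all of its leaves with itself.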

> a) Find the terminal node of each tree in which the observations are placed:

In [None]:
# Enter your code here:

> b) Compute the observations’ pairwise proximities:

Enter your answer here:

> c) Construct a similarity matrix from these proximities:

In [None]:
# Enter your code here: