# Random Forest Classification

In this notebook, we discuss how to fit and evaluate a random forest classification model in Python.

We will continue using the Breast Cancer Wisconsin Diagnostic dataset, which was downloaded from the UCI Machine Learning Repository.

In [1]:
# Import data
import pandas as pd

df = pd.read_csv('wdbc.csv')

The target variable is called `target`, and we want to use all remaining variables to try to predict it.

In [2]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X = df.drop('target', axis = 1)
y = df.target

# We specify the random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1,
                                                    random_state = 1234)
(X_train.shape, X_test.shape)

((512, 30), (57, 30))

Recall the decision tree classification model we fitted in the previous notebook.

In [3]:
# Fit Decision Tree
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Get predicted values
y_pred = model.predict(X_test)

# Compute F-score
import sklearn.metrics

sklearn.metrics.f1_score(y_test, y_pred, pos_label = 'M')

0.8936170212765957

## Ensemble Learning

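Growing a tree is (almost) deterministic: at every node, we search for the best split. To add some randomness (and increase the number of potential trees), we can randomize how the predicate at each node is constructed. In `scikit-learn`, this is achieved by setting `splitter = 'random'` in the constructor: instead of searching for the best possible threshold, the tree draws a random threshold for each candidate feature and keeps the best of these random splits.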

In [4]:
model_ens = DecisionTreeClassifier(splitter = 'random')
model_ens.fit(X_train, y_train)

# Get predicted values
y_pred_ens = model_ens.predict(X_test)

# Compute F-score
sklearn.metrics.f1_score(y_test, y_pred_ens, pos_label = 'M')

0.8800000000000001

By repeating this procedure multiple times, we get different trees, some of which perform better than others.
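
To make this variation concrete, the sketch below grows a few random-split trees with different seeds and prints their size, then checks that fixing the seed reproduces the same tree. It is only an illustration: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`, which codes the target as 0/1 rather than 'M'/'B') in place of `wdbc.csv`, and the `random_state` values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled copy of the dataset, not wdbc.csv
X, y = load_breast_cancer(return_X_y=True)

# Different seeds generally lead to trees of different size and depth
for seed in range(3):
    tree = DecisionTreeClassifier(splitter='random', random_state=seed)
    tree.fit(X, y)
    print(seed, tree.tree_.node_count, tree.get_depth())

# Fixing the seed reproduces the same tree
a = DecisionTreeClassifier(splitter='random', random_state=0).fit(X, y)
b = DecisionTreeClassifier(splitter='random', random_state=0).fit(X, y)
same = bool((a.predict(X) == b.predict(X)).all())
```

The models in this notebook are fitted without `random_state`, so the F-scores shown may change from one run to the next.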

### Exercise

Fit the model 5 times. Store the predictions in the array below. How much overlap is there between the predictions?

In [5]:
import numpy as np

y_pred_array = np.empty((5, len(y_test)), dtype = object)

# Write your code below
for i in range(5):
    # Without a fixed random_state, each refit grows a new random tree
    model_ens.fit(X_train, y_train)
    y_pred_array[i] = model_ens.predict(X_test)


# Print results
y_pred_array

array([['M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
        'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M',
        'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B',
        'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'M',
        'M', 'M', 'M', 'B', 'M'],
       ['M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
        'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M',
        'M', 'B', 'B', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
        'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B',
        'M', 'M', 'M', 'B', 'M'],
       ['M', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'M', 'B',
        'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'M', 'M', 'M',
        'B', 'B', 'B', 'B', 'M', 'B', 'M', 'M', 'B', 'B', 'B', 'B', 'B',
        'B', 'M', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'B', 'B',
        'M', 'M', 'M', 'B', 'M'],
       ['M', 'B', 'B',
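
One way to quantify the overlap asked about in the exercise is to compute the share of test samples on which all trees agree, and then to combine the trees by majority vote. The sketch below uses a small hypothetical array of predictions (5 trees, 6 samples) rather than the notebook's actual output, so the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per test sample
preds = np.array([
    ['M', 'B', 'B', 'M', 'B', 'M'],
    ['M', 'B', 'M', 'M', 'B', 'B'],
    ['M', 'B', 'B', 'M', 'B', 'B'],
    ['M', 'B', 'B', 'M', 'M', 'M'],
    ['M', 'B', 'B', 'M', 'B', 'M'],
])

# A sample is unanimous when every tree agrees with the first tree
unanimous = (preds == preds[0]).all(axis=0)
share_unanimous = unanimous.mean()

# Majority vote: count the 'M' votes per sample (5 trees, so no ties)
m_votes = (preds == 'M').sum(axis=0)
majority = np.where(m_votes > preds.shape[0] / 2, 'M', 'B')

print(share_unanimous)  # 0.5
print(majority)         # ['M' 'B' 'B' 'M' 'B' 'M']
```

The same two computations can be applied to `y_pred_array` directly, since it has the same layout (one row per fitted tree).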

## Random Forests

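A random forest combines many randomized trees, each grown on a bootstrap sample of the training data, and aggregates their predictions. We can fit random forest models using `scikit-learn`, as follows: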

In [7]:
# Fit Random Forest
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

# Get predicted values
y_pred_rf = model_rf.predict(X_test)

# Compute F-score
sklearn.metrics.f1_score(y_test, y_pred_rf, pos_label = 'M')

0.9333333333333332
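
To see roughly what the forest does under the hood, here is a minimal hand-rolled sketch, not scikit-learn's actual implementation: each tree is fitted on a bootstrap sample of the training rows and considers a random subset of features at each split, and the trees then vote. Assumptions: the bundled `load_breast_cancer` data (target coded 0 for malignant, 1 for benign) stands in for `wdbc.csv`, the number of trees and the seeds are arbitrary, and a hard majority vote is used, whereas `RandomForestClassifier` averages the trees' predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): target is 0 (malignant) / 1 (benign)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1234)

rng = np.random.default_rng(1234)
n_trees = 25  # odd, so a majority vote cannot tie
votes = np.empty((n_trees, len(y_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap: draw training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Each split considers only a random subset of the features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Majority vote across trees for each test sample
y_pred_manual = (votes.mean(axis=0) > 0.5).astype(int)

# F-score with the malignant class (coded 0 here) as the positive label
score = f1_score(y_test, y_pred_manual, pos_label=0)
```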

### Exercise

By default, the number of trees in the forest is set to 100. Increase it to 500.

Also change `max_features` to `'log2'` (and look at the documentation to understand what this does!). Compute the F-score and compare it to that of the model above. Which model performs better?

In [14]:
# Write your code below
model_rf2 = RandomForestClassifier(n_estimators = 500,
                                   max_features = 'log2')
model_rf2.fit(X_train, y_train)

y_pred_rf2 = model_rf2.predict(X_test)

sklearn.metrics.f1_score(y_test, y_pred_rf2, pos_label = 'M')

0.9333333333333332
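
As a hint for the documentation lookup, the arithmetic below shows how many features each split would consider under the two settings for our 30 predictors. It assumes the string options are resolved as `max(1, int(sqrt(n_features)))` and `max(1, int(log2(n_features)))`, which is how we read the scikit-learn documentation, and that `'sqrt'` is the classifier's default; check both against your installed version.

```python
import math

n_features = 30  # number of predictors, per the shape of X_train shown earlier

# Assumed to mirror how scikit-learn resolves the string options
n_sqrt = max(1, int(math.sqrt(n_features)))
n_log2 = max(1, int(math.log2(n_features)))

print(n_sqrt, n_log2)  # prints: 5 4
```

Under these assumptions the two forests differ by only one candidate feature per split, so it is plausible for them to reach the same F-score on a test set of 57 samples.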