# Random Forest - Study of Randomness and Extracted Features


In this notebook, we study two bagging-based techniques for increasing randomness in the Random Forest model. We also investigate the performance of a Random Forest model trained on extracted features. We perform three tasks on a large and complex classification dataset.

- Task 1: Train a Random Forest model by using the bagging method on raw features.
- Task 2: Train a Random Forest model by using the Extra-Trees method on raw features.
- Task 3: Train a Random Forest model by using the bagging method on **extracted features**.


## Bagging & Extra-Trees Method

For tasks 1 and 2, we use raw features to train the Random Forest model, first with the bagging method and then with its augmented version, Extra-Trees.

Bagging stands for bootstrap aggregation. Using bagging, we train many decision trees on different random subsets of the training set. The sampling is performed with replacement. A Random Forest adds further randomness on top of bagging: at each split, a tree considers only a random subset of the features.
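
The sampling step can be sketched with NumPy alone. This is a minimal illustration, not the code Scikit-Learn runs internally; the sample count and the names `bootstrap_idx` and `oob_fraction` are assumptions made for the example. It shows that a bootstrap sample leaves out roughly a third of the training instances, which is what makes the out-of-bag score used later in this notebook possible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # hypothetical training-set size

# Draw a bootstrap sample: n_samples indices, sampled with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Instances that were never drawn are "out-of-bag" for this tree.
in_bag = np.unique(bootstrap_idx)
oob_fraction = 1 - in_bag.size / n_samples

print("unique in-bag fraction: {:.3f}".format(in_bag.size / n_samples))
print("out-of-bag fraction:    {:.3f}".format(oob_fraction))  # theory: about 1/e
```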

The extremely randomized trees or **Extra-Trees** method is created by augmenting the bagging method. The Extra-Trees method increases randomness by using random split thresholds for each feature instead of searching for the best possible threshold.

An Extra-Trees model is generally faster to train than a Random Forest based on vanilla bagging. This is because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
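
The difference between the two split strategies can be sketched for a single feature. The helper functions below (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) are hypothetical names written for this illustration and are not Scikit-Learn functions; the data is synthetic. The exhaustive search evaluates every candidate threshold, while the Extra-Trees style draw evaluates only one.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x <= threshold."""
    left = y[x <= threshold]
    right = y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

def best_threshold(x, y):
    """Exhaustive search over midpoints between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """Extra-Trees style: a single uniform draw between min and max."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(0)
x = rng.normal(size=500)          # one synthetic feature
y = (x > 0.3).astype(int)         # labels separable on that feature

t_best = best_threshold(x, y)     # evaluates ~500 candidate splits
t_rand = random_threshold(x, rng) # evaluates none before choosing

print("best-threshold impurity:  ", split_impurity(x, y, t_best))
print("random-threshold impurity:", split_impurity(x, y, t_rand))
```

In Scikit-Learn's implementation, a random threshold is drawn for each candidate feature and the best of those random splits is kept, so the search over thresholds is skipped but the choice among features is not.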


## Random Forest Trained with Extracted Features

For task 3, we investigate whether the Random Forest model can be trained effectively using extracted features. We use a dimensionality reduction technique, Principal Component Analysis (PCA), for feature extraction.

## Dataset: MNIST

We use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.

Each image is 28x28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).

Thus, each image has 784 features. 
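
The relationship between the flat 784-feature row and the 28x28 image can be sketched as follows. The array `flat_image` is a stand-in filled with consecutive numbers, not a real MNIST digit.

```python
import numpy as np

# Stand-in for one row of the data matrix; real MNIST rows have the same length.
flat_image = np.arange(28 * 28, dtype=np.float64)

# Reshaping recovers the 2-D pixel grid without copying or changing values.
image = flat_image.reshape(28, 28)
print(flat_image.shape, "->", image.shape)

# Pixel (row r, column c) lives at flat index r * 28 + c.
r, c = 3, 5
print(image[r, c] == flat_image[r * 28 + c])
```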

In [1]:
import time
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

## Load Data and Create Data Matrix (X) and the Label Vector (y)

We load the data from OpenML using Scikit-Learn's fetch_openml function. Note that we don't scale the data: tree-based models split each feature on a threshold, so the Random Forest model doesn't require standardized data.
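
The claim about scaling can be checked on a small synthetic problem. This is a sketch under stated assumptions: the data is random integers in the pixel range, a single DecisionTreeClassifier stands in for the forest, and the scale factor is a power of two so that the rescaling is exact in floating point. All variable names are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(300, 5)).astype(np.float64)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 255).astype(int)
X_new = rng.integers(0, 256, size=(100, 5)).astype(np.float64)

scale = 4.0  # power of two, so multiplying introduces no rounding error

# Same seed, same data, only the feature scale differs.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_demo * scale, y_demo)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new * scale)
print("identical predictions:", np.array_equal(pred_raw, pred_scaled))
```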

In [2]:
# Load data using Scikit-Learn
mnist = fetch_openml('mnist_784', cache=False)

X = mnist["data"].astype('float64')
y = mnist["target"].astype('int64')

print(X.shape)
print(y.shape)

(70000, 784)
(70000,)


## Split Data Into Training and Test Subsets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Task 1: Train a Random Forest by using the Bagging Method on Raw Features

We did not run a systematic hyperparameter search in this notebook.

Instead, we experimented separately with several hyperparameter values and settled on a near-optimal configuration, which is trained below. Ideally, the hyperparameters should be tuned with a grid search.
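
A grid search of the kind mentioned above could look like the following sketch. It is not the search used to pick the hyperparameters in this notebook: the grid values, the small forest size, and the use of Scikit-Learn's bundled `load_digits` dataset (8x8 digits, chosen so the example runs in seconds) are all assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small bundled digits dataset, used here instead of MNIST to keep the run short.
X_small, y_small = load_digits(return_X_y=True)

# Hypothetical grid; a real search would cover more values.
param_grid = {
    "max_depth": [8, 16],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_small, y_small)

print("best parameters:", search.best_params_)
print("best CV accuracy: {:.4f}".format(search.best_score_))
```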

In [4]:
%%time

t0 = time.time()
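# max_features="sqrt" is what the older "auto" setting meant for classifiers;
# recent Scikit-Learn releases no longer accept "auto".
forest_clf = RandomForestClassifier(n_estimators=1000, criterion="gini", max_features="sqrt", 
                                    max_depth=32, class_weight="balanced", oob_score=True, 
                                    verbose=1, n_jobs=-1)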

forest_clf.fit(X_train, y_train)
t1 = time.time()

training_forest_clf = t1 - t0

print("Random Forest Training took {:.2f}s".format(training_forest_clf))

y_test_predicted = forest_clf.predict(X_test)
accuracy_forest_clf = accuracy_score(y_test, y_test_predicted)
print("\nTest Accuracy: ", accuracy_forest_clf)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

print("\nScore of the training dataset obtained using an out-of-bag estimate: ", forest_clf.oob_score_)
print("\n")

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   18.7s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   43.9s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.7min finished


Random Forest Training took 136.03s


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    0.9s



Test Accuracy:  0.9692142857142857

Test Confusion Matrix:
[[1370    0    1    0    2    3    6    0    5    0]
 [   0 1563    5    4    3    0    1    1    2    1]
 [   3    2 1406    6    3    1    5    8    8    1]
 [   2    1   22 1360    0   16    0   11   19    4]
 [   1    0    1    0 1314    0    5    4    3   22]
 [   1    2    3   12    2 1186   12    0    8    5]
 [   6    2    1    0    3    8 1363    0    4    0]
 [   3    5   22    1    7    0    0 1399    3   18]
 [   0    9    3    5    5    5    3    0 1318   20]
 [   4    0    5   19   14    5    2   14    8 1290]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1387
           1       0.99      0.99      0.99      1580
           2       0.96      0.97      0.97      1443
           3       0.97      0.95      0.96      1435
           4       0.97      0.97      0.97      1350
           5       0.97      0.96      0.97      1231
      

[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    1.2s finished


## Task 2: Train a Random Forest by using the Extra-Trees Method on Raw Features

We use Scikit-Learn’s ExtraTreesClassifier class to create an Extra-Trees model. Its API is identical to that of the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.

### Note:
By default, the ExtraTreesClassifier "bootstrap" hyperparameter is set to False. As a consequence, the whole training set is used to build each tree. To use the bagging method, we need to explicitly set it to True.
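
The defaults can be inspected directly. The sketch below also shows why `bootstrap=True` is needed in the next cell: requesting an out-of-bag score without bootstrapping raises a ValueError at fit time. The bundled `load_digits` dataset and the small forest size are assumptions made to keep the example quick.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

print("RandomForestClassifier bootstrap default:", RandomForestClassifier().bootstrap)
print("ExtraTreesClassifier bootstrap default:  ", ExtraTreesClassifier().bootstrap)

X_small, y_small = load_digits(return_X_y=True)

# Without bootstrapping there are no out-of-bag instances to score on.
oob_error = None
try:
    ExtraTreesClassifier(n_estimators=10, oob_score=True).fit(X_small, y_small)
except ValueError as exc:
    oob_error = exc
    print("ValueError:", exc)
```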

In [5]:
%%time

t0 = time.time()
# max_features="sqrt" replaces the older "auto" value, which meant the same for classifiers.
extra_trees_clf = ExtraTreesClassifier(n_estimators=1000, criterion="gini", max_features="sqrt", 
                                       max_depth=32, class_weight="balanced", oob_score=True, 
                                       bootstrap=True, verbose=1, n_jobs=-1)
extra_trees_clf.fit(X_train, y_train)
t1 = time.time()

training_extra_tree = t1 - t0

print("Extra Tree Training took {:.2f}s".format(training_extra_tree))


y_test_predicted = extra_trees_clf.predict(X_test)
accuracy_extra_trees = accuracy_score(y_test, y_test_predicted)
print("\nTest Accuracy: ", accuracy_extra_trees)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

print("\nScore of the training dataset obtained using an out-of-bag estimate: ", extra_trees_clf.oob_score_)
print("\n")

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   12.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   29.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   52.7s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.1min finished


Extra Tree Training took 104.80s


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    1.0s



Test Accuracy:  0.9675714285714285

Test Confusion Matrix:
[[1369    0    1    0    2    2    6    0    7    0]
 [   0 1562    6    3    3    1    1    1    2    1]
 [   3    3 1401    7    3    1    5    8   11    1]
 [   2    1   22 1358    0   17    2   10   18    5]
 [   2    0    1    0 1314    0    4    4    2   23]
 [   1    1    3   10    2 1188   12    1    8    5]
 [   8    2    1    0    3    7 1364    0    2    0]
 [   3    7   22    0    7    0    0 1389    6   24]
 [   0   10    4    9    6    6    3    0 1310   20]
 [   5    2    4   19   10    6    2   14    8 1291]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1387
           1       0.98      0.99      0.99      1580
           2       0.96      0.97      0.96      1443
           3       0.97      0.95      0.96      1435
           4       0.97      0.97      0.97      1350
           5       0.97      0.97      0.97      1231
      

[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    1.2s finished


## Task 1 & 2: Summary of Results 

In [6]:
data = [["Random Forest (1000 trees)", accuracy_forest_clf, training_forest_clf], 
        ["Extra-Trees (1000 trees)", accuracy_extra_trees, training_extra_tree]]

pd.DataFrame(data, columns=["Classifier", "Accuracy", "Running-Time"])

| | Classifier | Accuracy | Running-Time |
|---|---|---|---|
| 0 | Random Forest (1000 trees) | 0.969214 | 136.028079 |
| 1 | Extra-Trees (1000 trees) | 0.967571 | 104.795368 |


## Task 1 & 2: Comparative Understanding

From the above results, we see that the performance of the Extra-Trees and Random Forest models is comparable.

- The Extra-Trees model is **faster at the cost of slightly higher bias**, reflected here in a slightly lower test accuracy. The training speed-up (104.80s vs. 136.03s) comes from using random thresholds rather than searching for the optimal threshold.
- The accuracy difference is small: 0.9676 vs. 0.9692, i.e., about 0.16 percentage points on a single train/test split.
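
As a rough check on how small the accuracy gap is, the reported accuracies can be compared with the sampling noise of a 14,000-image test set. This is a back-of-the-envelope sketch: it uses the binomial standard error of a single accuracy estimate and ignores that both models are scored on the same test images, so it is not a formal significance test.

```python
import math

n_test = 14_000      # 20% of the 70,000 images
acc_rf = 0.969214    # Random Forest test accuracy reported above
acc_et = 0.967571    # Extra-Trees test accuracy reported above

diff = acc_rf - acc_et

# Binomial standard error of one accuracy estimate on n_test samples.
se_rf = math.sqrt(acc_rf * (1 - acc_rf) / n_test)

errors_rf = round((1 - acc_rf) * n_test)
errors_et = round((1 - acc_et) * n_test)

print("accuracy difference: {:.4f}".format(diff))
print("standard error:      {:.4f}".format(se_rf))
print("misclassified images (RF vs. ET): {} vs. {}".format(errors_rf, errors_et))
```

The gap corresponds to a couple of dozen images out of 14,000 and is on the order of one standard error.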


### RandomForestClassifier vs. ExtraTreesClassifier

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. 

Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search).
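
Such a comparison could be sketched as follows. The bundled `load_digits` dataset, the forest size, and the 5-fold setting are assumptions chosen so that the example runs quickly; the same pattern applies to MNIST at a higher computational cost.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_small, y_small = load_digits(return_X_y=True)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra-Trees": ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=0),
}

# Score both models on the same 5 folds and report mean and spread.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_small, y_small, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print("{:<14s} mean={:.4f} std={:.4f}".format(name, scores.mean(), scores.std()))
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a gap between the two models is larger than the fold-to-fold variation.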

## Task 3: Train a Random Forest model by Using the Bagging Method on Extracted Features

The goal of this task is to determine whether a Random Forest model trained on extracted features exhibits better performance. In other words, we want to see whether the Random Forest model learns effectively from extracted features.
 
We use the **Principal Component Analysis (PCA)** dimensionality reduction technique to extract low-dimensional features. Then, we train the Random Forest model using the extracted features.

We apply PCA to project the MNIST dataset (784 features) onto a lower-dimensional space that retains 95% of the variance.

The PCA **extracts 154 features**, which we use to train a Random Forest classifier.

In [7]:
pca = PCA(n_components=0.95)
pca.fit(X_train)

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print("Number of Principal Components (Extracted Features): ", pca.n_components_)

Number of Principal Components (Extracted Features):  154
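
What a fractional `n_components` does can be reproduced by hand from the cumulative explained-variance ratio. The sketch below uses the bundled `load_digits` dataset (64 features) rather than MNIST so that it runs instantly; the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_small, _ = load_digits(return_X_y=True)

# Fit a full PCA, then find the smallest k whose cumulative
# explained-variance ratio exceeds 95%.
full_pca = PCA().fit(X_small)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks Scikit-Learn to make that choice itself.
auto_pca = PCA(n_components=0.95).fit(X_small)

print("components chosen by hand:             ", k_manual)
print("components chosen by n_components=0.95:", auto_pca.n_components_)
```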


In [8]:
%%time

t0 = time.time()

# max_features="sqrt" replaces the older "auto" value, which meant the same for classifiers.
forest_clf_pca = RandomForestClassifier(n_estimators=1000, criterion="gini", max_features="sqrt", 
                                        max_depth=32, class_weight="balanced", oob_score=True, 
                                        verbose=1, n_jobs=-1)

forest_clf_pca.fit(X_train_pca, y_train)

t1 = time.time()

training_forest_clf_pca = t1 - t0

print("Random Forest (PCA) Training took {:.2f}s".format(training_forest_clf_pca))

y_test_predicted_pca = forest_clf_pca.predict(X_test_pca)
accuracy_forest_clf_pca = accuracy_score(y_test, y_test_predicted_pca)
print("\nTest Accuracy (PCA): ", accuracy_forest_clf_pca)

print("\nTest Confusion Matrix (PCA):")
print(confusion_matrix(y_test, y_test_predicted_pca))

print("\nClassification Report (PCA):")
print(classification_report(y_test, y_test_predicted_pca))

print("\nScore of the training (PCA) dataset obtained using an out-of-bag estimate: ", forest_clf_pca.oob_score_)
print("\n")

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   39.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  3.5min finished


Random Forest (PCA) Training took 223.10s


[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:    0.5s
[Parallel(n_jobs=8)]: Done 784 tasks      | elapsed:    0.8s



Test Accuracy (PCA):  0.9492857142857143

Test Confusion Matrix (PCA):
[[1360    0    4    1    2    5   10    1    4    0]
 [   0 1551   10    6    2    0    4    2    5    0]
 [   8    4 1361   14   10    2    4    7   30    3]
 [   2    0   22 1323    1   23    7   14   33   10]
 [   1    1    6    1 1299    0    8    1    6   27]
 [   5    2    6   26    9 1152   13    3    4   11]
 [   9    1    5    1    6   11 1351    0    3    0]
 [   2    6   20    2   14    1    0 1387    8   18]
 [   2   11   11   31   10   21    5    2 1259   16]
 [   8    1    4   25   34    8    2   25    7 1247]]

Classification Report (PCA):
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1387
           1       0.98      0.98      0.98      1580
           2       0.94      0.94      0.94      1443
           3       0.93      0.92      0.92      1435
           4       0.94      0.96      0.95      1350
           5       0.94      0.94      0.9

[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed:    1.0s finished


## Task 1 & 3: Summary of Results 

In [9]:
data = [["Random Forest (784 raw features)", accuracy_forest_clf, training_forest_clf], 
        ["Random Forest (154 extracted features)", accuracy_forest_clf_pca, training_forest_clf_pca]]

pd.DataFrame(data, columns=["Classifier", "Accuracy", "Running-Time"])

| | Classifier | Accuracy | Running-Time |
|---|---|---|---|
| 0 | Random Forest (784 raw features) | 0.969214 | 136.028079 |
| 1 | Random Forest (154 extracted features) | 0.949286 | 223.095504 |


## Task 1 & 3 Observation: Random Forest with Extracted Features

We observe that the features extracted with PCA **lower the performance** of the Random Forest classifier: test accuracy drops from 0.9692 to 0.9493, and training time rises from 136.03s to 223.10s even though there are fewer features. A likely explanation is that the extracted features don't combine meaningfully to produce effective decision rules.

Machine Learning algorithms that create decision rules by composing knowledge (e.g., a decision tree uses a combination of the features to construct a decision rule) tend not to benefit from dimensionality reduction.

Thus, **we shouldn't use PCA with a Decision Tree or Random Forest**.
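
The raw-versus-extracted comparison from this task can be repeated quickly on a smaller dataset. The sketch below is an assumption-laden miniature of the experiment: it uses the bundled `load_digits` dataset, a 100-tree forest, and a pipeline so that PCA is fitted on the training split only. It makes no claim about which variant wins on that dataset; the point is the pattern for checking.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=0)

# Same forest settings for both variants; only the input features differ.
raw_model = RandomForestClassifier(n_estimators=100, random_state=0)
pca_model = make_pipeline(
    PCA(n_components=0.95),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

raw_acc = raw_model.fit(Xs_train, ys_train).score(Xs_test, ys_test)
pca_acc = pca_model.fit(Xs_train, ys_train).score(Xs_test, ys_test)

print("raw features accuracy: {:.4f}".format(raw_acc))
print("PCA features accuracy: {:.4f}".format(pca_acc))
```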