**Decision Trees and Random Forests**

**Overview**
1.  **Decision Tree**
   *  Get data
   *  Split into train and test sets
   *  Grid search with cross-validation to find good hyperparameters
   *  Train the model on the full train set and measure performance on the test set
2.  **Random Forest**
   *  Generate 1,000 subsets of the training set, each with 100 random instances
   *  Train a Decision Tree on each subset with the best hyperparameters found
   *  Evaluate each of the 1,000 Decision Trees on the test set
   *  For each test set instance, collect the predictions of the 1,000 trees and keep the most frequent one (majority vote)
   *  Evaluate the majority-vote predictions on the test set

In [231]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import clone

import numpy as np

from scipy import stats



# Decision Tree

**Step 1: Get data**

In [232]:
X, y = make_moons(n_samples=10000, noise=0.4)
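
The call above leaves `random_state` unset, so every run draws a different dataset and the figures printed further down will vary slightly. A minimal sketch of a reproducible variant follows; the seed value and the names `X_demo` and `y_demo` are assumptions added for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_moons

# random_state=42 is an assumed seed; the original call does not set one
X_demo, y_demo = make_moons(n_samples=10000, noise=0.4, random_state=42)

print(X_demo.shape)         # (instances, features)
print(np.bincount(y_demo))  # number of instances per class
```

`make_moons` splits the requested instances evenly between the two moons, so the classes are balanced and plain accuracy is a reasonable metric here.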

**Step 2: Train and Test sets**

In [233]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Step 3: Grid search**

In [234]:
parameters = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4],
    'max_depth': [2, 3, 4, 5, 10, 12, 15, 30],
}

tree = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(tree, parameters, cv=3)
grid_search.fit(X_train, y_train)

print(grid_search.best_estimator_)

DecisionTreeClassifier(max_depth=10, max_leaf_nodes=25, random_state=42)
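
The grid above has 98 × 3 × 8 = 2,352 candidates, each fitted on 3 folds, so it helps to look at more than just `best_estimator_`. The sketch below shows the attributes that could be inspected after a search. It uses a much smaller, hypothetical grid and dataset (`small_grid`, `X_demo`, and the seeds are assumptions) so that it runs quickly on its own.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller, seeded dataset purely for illustration
X_demo, y_demo = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical reduced grid: 4 * 3 = 12 candidates
small_grid = {'max_leaf_nodes': [4, 8, 16, 32], 'max_depth': [3, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f'Mean CV accuracy of best candidate: {search.best_score_:.4f}')
print(f'Candidates evaluated: {len(search.cv_results_["params"])}')

# With the default refit=True, best_estimator_ is already fitted on all of X_tr
print(f'Test accuracy: {search.best_estimator_.score(X_te, y_te):.4f}')
```

Comparing `best_score_` (cross-validated) with the test accuracy gives a rough sense of whether the search has overfitted the validation folds.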


**Step 4: Train model with best hyperparameters**

In [235]:
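# GridSearchCV refits the best estimator on the whole training set by default
# (refit=True), so the explicit fit() below is redundant but harmless
tree = grid_search.best_estimator_

tree.fit(X_train, y_train)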

# some sample predictions
print(f'Predicted:\t{tree.predict(X_test[:10])}')
print(f'Actual:\t\t{y_test[:10]}')

predictions = tree.predict(X_test)
score = accuracy_score(y_test, predictions)
print(f'Accuracy score: {score*100}%')

Predicted:	[1 0 1 1 0 1 1 1 1 1]
Actual:		[1 0 1 1 0 1 1 1 1 1]
Accuracy score: 87.0%


# Random Forest

**Step 5: Generate subsets**

In [236]:
num_trees = 1000
num_instances = 100

mini_sets = []

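# Everything not assigned to the "test" side ends up in the "train" side,
# so each split yields exactly num_instances training indices
rs = ShuffleSplit(n_splits=num_trees, test_size=len(X_train) - num_instances, random_state=42)

# Only the small "train" side of each split is used; the rest is discarded
for train_index, _ in rs.split(X_train):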
    mini_X = X_train[train_index]
    mini_y = y_train[train_index]
    mini_sets.append((mini_X, mini_y))
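
`ShuffleSplit` is used here in a slightly unusual way: the large "test" side is thrown away and only the small "train" side is kept. The following sketch checks what that produces, using illustrative sizes and names (`n_total`, `n_subsets`, `subset_size` are assumptions, not values from the notebook): each subset has the requested size and contains no repeated index, while different subsets may still overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_total, n_subsets, subset_size = 8000, 5, 100  # illustrative sizes

splitter = ShuffleSplit(n_splits=n_subsets,
                        test_size=n_total - subset_size,
                        random_state=42)

# split() only needs the number of rows, so a dummy array is enough
dummy = np.arange(n_total).reshape(-1, 1)
subsets = [train_idx for train_idx, _ in splitter.split(dummy)]

for train_idx in subsets:
    # sampled without replacement inside a single subset
    assert len(train_idx) == subset_size
    assert len(np.unique(train_idx)) == subset_size

# independent draws, so two subsets can share instances
overlap = np.intersect1d(subsets[0], subsets[1])
print(f'Shared indices between the first two subsets: {len(overlap)}')
```

Sampling without replacement within each subset is what distinguishes this from bootstrap sampling, where an instance can appear several times in the same subset.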


**Step 6: Train best Decision Tree on each subset**

In [237]:
forest = [clone(grid_search.best_estimator_) for _ in range(num_trees)]

accuracy_scores = []

for tree, (mini_X, mini_y) in zip(forest, mini_sets):
    tree.fit(mini_X, mini_y)
    predictions = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, predictions))

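# This is the average accuracy of the trees taken one at a time, not the accuracy of the ensemble
print(f'Mean accuracy of the individual trees: {np.mean(accuracy_scores)*100}%')


Mean accuracy of the individual trees: 79.7495%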


**Step 7: Majority vote**

In [238]:
Y_pred = np.zeros((num_trees, len(X_test)))

for idx, tree in enumerate(forest):
    Y_pred[idx] = tree.predict(X_test)

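# stats.mode returns the most frequent value in each column (one column per test instance).
# From SciPy 1.11 the result is 1-D unless keepdims=True is passed; the output below is the older 2-D form.
majority_votes, count = stats.mode(Y_pred, axis=0)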
print(majority_votes)
print(count)


[[1. 0. 1. ... 0. 0. 0.]]
[[852 984 838 ... 902 946 896]]
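
Because the labels here are binary (0 or 1), the majority vote can also be computed without SciPy by averaging each column: the mean is the share of trees voting for class 1. The toy matrix below is hypothetical and only illustrates the idea.

```python
import numpy as np

# rows = trees, columns = test instances (hypothetical toy predictions)
votes = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0]])

# share of trees that voted for class 1, per instance
share_of_ones = votes.mean(axis=0)

# class 1 wins only with a strict majority; a tie falls back to class 0
majority = (share_of_ones > 0.5).astype(int)
print(majority)  # [1 0 1 0]
```

The strict `>` means an exact tie goes to class 0, which matches `stats.mode` returning the smallest value among tied modes. This shortcut only applies to 0/1 labels; with more classes, a mode or per-class count is still needed.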


In [239]:
forest_accuracy = accuracy_score(y_test, majority_votes.reshape([-1]))

print(f'Forest accuracy with majority vote: {forest_accuracy*100}%')

Forest accuracy with majority vote: 87.75%
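
For comparison, scikit-learn's `BaggingClassifier` can build a similar ensemble in a few lines. The sketch below is not the method used in this notebook, and it differs in one respect: when the base estimator provides `predict_proba`, `BaggingClassifier` averages class probabilities instead of counting hard votes. The dataset seed, the reduced number of trees, and the variable names are assumptions made to keep the example self-contained and quick.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# random_state on make_moons is an assumption; the notebook leaves it unset
X_demo, y_demo = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# hyperparameters copied from the grid search output shown earlier
base_tree = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=25, random_state=42)

bagging = BaggingClassifier(
    base_tree,
    n_estimators=200,   # fewer than 1,000 to keep the sketch quick
    max_samples=100,    # 100 instances per tree, as in Step 5
    bootstrap=False,    # sample without replacement, like ShuffleSplit
    random_state=42,
)
bagging.fit(X_tr, y_tr)

bag_accuracy = accuracy_score(y_te, bagging.predict(X_te))
print(f'Bagging accuracy: {bag_accuracy*100:.2f}%')
```

The base estimator is passed positionally because its keyword name changed between scikit-learn releases (`base_estimator` before 1.2, `estimator` afterwards).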
