**Chapter 6 – Decision Trees**

# Exercise solutions

## 7.

Train and fine-tune a Decision Tree binary classifier for the moons dataset. You should get test accuracy between 85% and 87%.

In [69]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import clone
from scipy.stats import mode
import numpy as np
import pandas as pd


In [8]:
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
# Split first so that X_train exists before its shape is printed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('X_train shape:', X_train.shape)
print('y shape:', y.shape)

X_train shape: (8000, 2)
y shape: (10000,)


In [48]:
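tree_clf = DecisionTreeClassifier()
# Search over tree depths 2 through 15
param_grid = {'max_depth': list(range(2, 16))}
# n_jobs=-1 runs the cross-validation fits on all available cores
grid_search = GridSearchCV(tree_clf, param_grid, n_jobs=-1)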

In [49]:
grid_search.fit(X_train, y_train)
print('tree_clf best average accuracy:', grid_search.best_score_, '\n')
print('tree_clf best parameters:', grid_search.best_params_)

tree_clf best average accuracy: 0.8525 

tree_clf best parameters: {'max_depth': 2}


In [50]:
print('tree_clf test accuracy:', accuracy_score(y_test, grid_search.best_estimator_.predict(X_test)))

tree_clf test accuracy: 0.863
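
The search above tunes only `max_depth`. As a minimal sketch of how the fine-tuning could be widened, the cell below also searches over `max_leaf_nodes`. The candidate values, the smaller sample size, `cv=3`, and the fixed `random_state` are assumptions chosen for a quick run, not settings from this notebook, so the scores will differ from the ones printed above.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the notebook (assumption: keeps the sketch fast)
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical grid: both value lists are illustrative choices
param_grid = {
    "max_depth": list(range(2, 8)),
    "max_leaf_nodes": [4, 8, 16, 32],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set by default
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, search.best_score_, test_accuracy)
```

Both parameters cap tree complexity, so the grid mostly checks whether limiting the number of leaves helps beyond limiting the depth.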


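## 8.

Grow a forest: train many trees with the best hyperparameters found above, each on a small random subset of the training set, then combine their predictions by majority vote.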

In [77]:
# Grow 1000 trees, each trained on a random subset of 100 training observations
rs = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)
forest = []
all_tree_preds = []
scores = []
for train_ix, _ in rs.split(X_train):
    # Clone the tuned tree so every fit uses the same hyperparameters
    tree = clone(grid_search.best_estimator_)
    tree.fit(X_train[train_ix], y_train[train_ix])
    # Score each individual tree on the full test set
    tree_preds = tree.predict(X_test)
    score = accuracy_score(y_test, tree_preds)
    forest.append(tree)
    scores.append(score)
    all_tree_preds.append(tree_preds)

print('Description of test accuracies for 1000 decision trees:\n', pd.DataFrame(scores).describe(), sep='')

Description of test accuracies for 1000 decision trees:
                 0
count  1000.000000
mean      0.835481
std       0.031183
min       0.677500
25%       0.822500
50%       0.844500
75%       0.858000
max       0.873000
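
To make the sampling step concrete, here is a numpy-only sketch of the idea behind the subsets used above: each tree gets 100 row indices drawn without replacement from the 8000 training rows. The helper `random_subsets` is hypothetical and is not how `ShuffleSplit` is implemented internally; it only mimics the shape of the training indices it yields.

```python
import numpy as np

def random_subsets(n_samples, n_subsets, subset_size, seed=42):
    """Yield n_subsets index arrays, each sampled without replacement."""
    rng = np.random.RandomState(seed)
    for _ in range(n_subsets):
        # A fresh permutation per subset: indices are unique within a subset
        # but may repeat across subsets
        yield rng.permutation(n_samples)[:subset_size]

# 5 subsets instead of 1000, just to inspect them
subsets = list(random_subsets(n_samples=8000, n_subsets=5, subset_size=100))
for ix in subsets:
    print(ix.shape, len(np.unique(ix)))
```

Because each subset holds only 100 of the 8000 rows, individual trees see very different data, which helps explain the wide spread of single-tree accuracies in the table above.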


In [79]:
# For each test set observation, predict the majority vote across all the trees
# Rows are trees, columns are test observations
all_tree_preds_matrix = np.array(all_tree_preds)
# mode(..., axis=0) returns the most common label in each column
majority_vote_preds = np.reshape(mode(all_tree_preds_matrix, axis=0).mode, -1)

print('Accuracy of DIY random forest on test set:', accuracy_score(y_test, majority_vote_preds))

Accuracy of DIY random forest on test set: 0.8705
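
The majority vote can also be checked by hand on a toy example. The sketch below assumes binary 0/1 labels, as in the moons dataset, and counts the votes for class 1 in each column. The helper `majority_vote` and the toy matrix are illustrative and not part of the notebook. Ties go to class 0, which matches `scipy.stats.mode` returning the smallest of the most common values.

```python
import numpy as np

def majority_vote(pred_matrix):
    """Majority label per column of an (n_trees, n_samples) array of 0/1 predictions."""
    pred_matrix = np.asarray(pred_matrix)
    n_trees = pred_matrix.shape[0]
    votes_for_one = pred_matrix.sum(axis=0)
    # Class 1 wins only with a strict majority; a tie falls back to class 0
    return (2 * votes_for_one > n_trees).astype(int)

# Three hypothetical trees voting on four test observations
toy_preds = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])
print(majority_vote(toy_preds))  # [0 1 1 0]
```

Each tree here is wrong or in the minority on at least one observation, but the vote only follows a label when most trees agree on it.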
