## Chapter 6: Decision Trees

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

In [2]:
# train decision tree

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

DecisionTreeClassifier(max_depth=2)
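Once the tree is fitted, it can estimate class probabilities for a new flower from the leaf that flower falls into. The sketch below is a minimal, self-contained illustration: the sample (petals 5 cm long and 1.5 cm wide) is a hypothetical input, and `random_state=42` is an assumption added only so that repeated runs agree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

# Same model as above; random_state is an added assumption for reproducibility
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

sample = [[5.0, 1.5]]  # hypothetical flower: petal length 5 cm, width 1.5 cm

# One row of class probabilities per sample (fraction of each class in the leaf)
proba = tree_clf.predict_proba(sample)
# predict() returns the class with the highest estimated probability
pred = tree_clf.predict(sample)

print(proba, iris.target_names[pred[0]])
```

The probabilities are simply the class proportions among the training instances in the matching leaf, so every flower that lands in the same leaf gets the same estimate.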

In [3]:
# export decision tree

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
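The `.dot` file needs the Graphviz tools to be rendered as an image (for example with the `dot` command-line program). If Graphviz is not installed, a plain-text view of the same tree may be enough. The sketch below uses `export_text` from `sklearn.tree`; the model is retrained here so the block is self-contained, and `random_state=42` is an assumption added for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Render the split rules as indented text, one line per node
rules = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```

Each indented line shows a threshold test on one feature, and the leaf lines show the predicted class index.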

In [12]:
# Train a regression decision tree

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

DecisionTreeRegressor(max_depth=2)
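Note that the cell above fits the regressor to the iris class labels, which runs but treats the labels 0, 1, 2 as numeric targets. A regression tree is easier to interpret on a continuous target. The following is a minimal sketch on a synthetic noisy quadratic dataset; the data, the names `X_quad` and `y_quad`, and the seed are all assumptions made for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): y = x^2 plus a little Gaussian noise
rng = np.random.default_rng(42)
X_quad = rng.random((200, 1)) - 0.5
y_quad = X_quad[:, 0] ** 2 + 0.025 * rng.standard_normal(200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

# Each prediction is the mean target value of the leaf the input falls into
preds = tree_reg.predict([[0.0], [0.4]])
n_leaves = tree_reg.get_n_leaves()
print(preds, n_leaves)
```

With `max_depth=2` the tree has at most four leaves, so the prediction is a piecewise-constant approximation of the curve with at most four distinct values.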

#### Exercises

#### 7. Train and fine-tune a Decision Tree for the moons dataset by following these steps:
 1. Use ```make_moons(n_samples=10000, noise=0.4)``` to generate a moons dataset.
 2. Use ```train_test_split()``` to split the dataset into a training set and a test set.
 3. Use grid search with cross-validation to find good hyperparameter values for a ```DecisionTreeClassifier```.
 4. Train it on the full training set using these hyperparameters and measure your model's performance on the test set.

In [17]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4)

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [20]:
from sklearn.model_selection import GridSearchCV

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...],
                         'min_samples_split': [2, 3, 4]},
             verbose=1)

In [21]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(max_leaf_nodes=4)

In [22]:
from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8525
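No separate retraining step is needed for step 4: `GridSearchCV` refits the best model on the whole training set by default (`refit=True`), which is why `grid_search_cv.predict()` can be called directly. The sketch below shows how the winning hyperparameters and their cross-validation score can be inspected. It is a scaled-down version of the search above: the smaller dataset, the coarser grid, and the fixed `random_state` values are assumptions chosen to keep it fast and repeatable.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset and fixed seeds (assumptions) so the sketch runs quickly
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Coarser grid than in the notebook, for speed
params = {"max_leaf_nodes": list(range(2, 20, 2)), "min_samples_split": [2, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_  # dict of the winning hyperparameter values
cv_score = search.best_score_      # mean cross-validated accuracy of that combination
test_acc = accuracy_score(y_test, search.predict(X_test))
print(best_params, round(cv_score, 3), round(test_acc, 3))
```

Comparing `best_score_` with the test accuracy gives a quick check that the search did not overfit the validation folds.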

#### 8. Grow a forest by following these steps:
 1. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly.
 2. Train one Decision Tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.
 3. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction. This approach gives you *majority-vote predictions* over the test set.
 4. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model.

In [24]:
from sklearn.model_selection import ShuffleSplit

small_sets = []

rs = ShuffleSplit(n_splits=1000, test_size=len(X_train) - 100)
for small_train_index, small_test_index in rs.split(X_train):
    X_small_train = X_train[small_train_index]
    y_small_train = y_train[small_train_index]
    small_sets.append((X_small_train, y_small_train))

In [26]:
from sklearn.base import clone
import numpy as np

forest = [clone(grid_search_cv.best_estimator_) for _ in range(1000)]

accuracy_scores = []

for tree, (X_small_train, y_small_train) in zip(forest, small_sets):
    tree.fit(X_small_train, y_small_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.82005

In [28]:
Y_pred = np.empty([1000, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

In [30]:
from scipy.stats import mode

y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [31]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.8585
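Because the moons labels are binary, the majority vote can also be computed with NumPy alone by counting how many trees voted for class 1. The sketch below illustrates this on a tiny hand-made prediction matrix (five hypothetical trees, four test instances); it is an alternative to `scipy.stats.mode` under the assumption of 0/1 labels, not the method used in the cells above.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per test instance
Y_pred = np.array(
    [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0],
    ],
    dtype=np.uint8,
)

# Number of trees that predicted class 1 for each instance
votes_for_one = Y_pred.sum(axis=0)

# Class 1 wins only with a strict majority; an exact tie falls back to class 0
y_majority = (votes_for_one > Y_pred.shape[0] / 2).astype(np.uint8)
print(y_majority)  # → [0 1 1 0]
```

With an even number of trees, such as the 1,000 used here, exact ties are possible, so the tie-breaking rule is worth stating explicitly.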