# 8.3 Lab: Decision Trees

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.tree import DecisionTreeClassifier, export_graphviz, DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
from sklearn import tree
import graphviz
import matplotlib.pyplot as plt

%matplotlib inline

## 8.3.1 Fitting Classification Trees

The sklearn library has many useful tools for trees. We first use classification trees to analyze the Carseats data set. In these data, Sales is a continuous variable, so we begin by recoding it as a binary variable. We use the map() function to create a variable called High, which takes the value 'Y' if Sales exceeds 8 and 'N' otherwise. In Python, we also need to encode the categorical variables (ShelveLoc, Urban, US) as numbers before fitting a tree.

In [None]:
carseats = pd.read_csv('./data/Carseats.csv')
carseats['High'] = carseats.Sales.map(lambda x: 'Y' if x>8 else 'N')
carseats.ShelveLoc = pd.factorize(carseats.ShelveLoc)[0]
carseats.Urban = carseats.Urban.map({'No':0, 'Yes':1})
carseats.US = carseats.US.map({'No':0, 'Yes':1})
carseats.info()
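The cell above encodes ShelveLoc with integer codes. As a hedged aside, the sketch below contrasts that with one indicator (dummy) column per level on a small made-up frame; `toy` and its values are illustrative assumptions, not the Carseats data.

```python
import pandas as pd

# Toy stand-in for a categorical column (values are made up for illustration).
toy = pd.DataFrame({'ShelveLoc': ['Bad', 'Good', 'Medium', 'Good']})

# Integer codes: numbered in order of first appearance, so the numeric
# order carries no meaning of its own.
codes, levels = pd.factorize(toy.ShelveLoc)

# Dummy coding: one 0/1 column per level, which does not imply any ordering.
dummies = pd.get_dummies(toy.ShelveLoc, prefix='ShelveLoc')

print(list(codes))
print(list(levels))
print(list(dummies.columns))
```

Because a tree splits on thresholds, integer codes are treated as if they were ordered; dummy columns avoid that at the cost of more features.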


We first split the dataset into training (200 samples) and test sets.

In [None]:
X = carseats.drop(['Sales', 'High'], axis=1)
y = carseats.High
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=200, random_state=0)

To build a tree, we can use either 'gini' or 'entropy' as the split criterion at each node. Here is an example using 'gini'. The score printed below is the training accuracy, and it changes noticeably when the hyperparameters (max_depth, min_samples_leaf) are varied.

In [None]:
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
                               max_depth=6, min_samples_leaf=4)
clf_gini.fit(X_train, y_train)
print(clf_gini.score(X_train, y_train))  # training accuracy
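To make the two split criteria concrete, here is a minimal sketch that computes the Gini index and the entropy of a node from its class proportions. The helper names `gini` and `entropy` are hypothetical, and using the base-2 logarithm is a choice made for this illustration.

```python
import numpy as np

def gini(p):
    """Gini index of a node with class proportions p: 1 - sum(p_k^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy of a node: -sum(p_k * log2(p_k)), skipping empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

balanced = [0.5, 0.5]  # maximally impure two-class node
pure = [1.0, 0.0]      # node containing a single class

print(gini(balanced), entropy(balanced))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for a balanced one, which is why they tend to choose similar splits.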

The most attractive feature of a tree is that it can be visualized. Here we first save the model to a .dot file and then use graphviz.Source to display it.

In [None]:
# Pass the feature names as a plain list
export_graphviz(clf_gini, out_file="mytree.dot", feature_names=list(X_train.columns))
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
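If Graphviz is not installed, a plain-text view of the splits can serve as a fallback. The sketch below uses export_text from sklearn.tree on the bundled iris data; the data set and the name `small_tree` are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small illustrative tree on a bundled data set (not the Carseats data).
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the fitted splits as indented text rules.
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```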

In [None]:
y_pred = clf_gini.predict(X_test)
# After the transpose, rows are predicted classes and columns are true classes.
cm = pd.DataFrame(confusion_matrix(y_test, y_pred).T, index=['No', 'Yes'], columns=['No', 'Yes'])
print(cm)
print("Accuracy is ", accuracy_score(y_test, y_pred)*100)

The test accuracy of our model is significantly lower than the training accuracy, which may indicate overfitting. We can go back and change the hyperparameters (for example, a smaller max_depth or a larger min_samples_leaf) to reduce the complexity of the tree.
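One way to choose those hyperparameters without touching the test set is cross-validation on the training data. The following is a sketch under stated assumptions: the data are synthetic (make_classification), and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; sizes loosely mirror the 200-sample training split.
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=200, random_state=0)

# Candidate tree sizes (illustrative values only).
param_grid = {'max_depth': [2, 3, 4, 6, None],
              'min_samples_leaf': [1, 4, 10]}

# 5-fold cross-validation on the training set for every combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

The cross-validated accuracy is usually a more honest guide than the training score when comparing tree sizes.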

## 8.3.2 Fitting Regression Trees

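Here we fit a regression tree to the Boston data set. First, we create a training set and fit the tree to the training data. Instead of pruning the tree, we simply limit its size by setting max_depth to 2. (Older versions of scikit-learn had no pruning support; cost-complexity pruning via the ccp_alpha parameter was added in version 0.22.)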

In [None]:
boston = pd.read_csv('./data/Boston.csv')
X = boston.drop('medv', axis=1)
y = boston.medv
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=0)
regr_tree = DecisionTreeRegressor(max_depth=2)
regr_tree.fit(X_train, y_train)

In [None]:
# Pass the feature names as a plain list
export_graphviz(regr_tree, out_file="mytree.dot", feature_names=list(X_train.columns))
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

In [None]:
y_pred = regr_tree.predict(X_test)
mean_squared_error(y_test, y_pred)
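As an alternative to fixing max_depth by hand, recent scikit-learn versions (0.22 and later) offer cost-complexity pruning through ccp_alpha. The sketch below picks ccp_alpha by cross-validation on synthetic data; the data, the thinning of the alpha grid, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Boston data).
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 n_informative=4, noise=20.0, random_state=0)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
# Clip tiny negative values from round-off, then thin the grid for speed.
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))[::10]

# Mean 5-fold cross-validated (negative) MSE for each alpha.
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                          X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error').mean()
          for a in alphas]

best_alpha = alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X_demo, y_demo)
print(best_alpha, pruned.get_n_leaves())
```

Larger values of ccp_alpha prune more aggressively, so the selected tree typically has far fewer leaves than the unpruned one.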

## 8.3.3 Bagging and Random Forests

Here we apply bagging and random forests to the Boston data, using RandomForestRegressor from sklearn.ensemble. The exact results obtained in this section may depend on the versions of Python and scikit-learn installed on your computer. Recall that bagging is simply a special case of a random forest with m = p, where m is the number of predictors considered at each split and p is the total number of predictors. Therefore, RandomForestRegressor can be used to perform both random forests and bagging. We perform bagging by setting max_features to the number of predictors:

In [None]:
all_features = X_train.shape[1]
regr_bagging = RandomForestRegressor(max_features=all_features, random_state=1)
regr_bagging.fit(X_train, y_train)

In [None]:
y_pred = regr_bagging.predict(X_test)
mean_squared_error(y_test, y_pred)

We can grow a random forest in exactly the same way, except that we use a smaller value of the max_features argument. Here we use max_features=3 (close to the square root of 13).

In [None]:
regr_rf = RandomForestRegressor(max_features=3, random_state=1)
regr_rf.fit(X_train, y_train)

y_pred = regr_rf.predict(X_test)
mean_squared_error(y_test, y_pred)

The test set MSE is even lower; this indicates that random forests yielded an improvement over bagging in this case.
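To see how the choice of max_features affects the test error more generally, one can loop over several values. This is only a sketch on synthetic data with 13 features; the data, the candidate values, and n_estimators=100 are assumptions, and the resulting numbers say nothing about the Boston results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 13 predictors, split in half as in the lab.
X_demo, y_demo = make_regression(n_samples=400, n_features=13,
                                 n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.5, random_state=0)

p = X_tr.shape[1]
mse_by_m = {}
for m in (1, 3, 6, p):  # m = p corresponds to bagging
    forest = RandomForestRegressor(n_estimators=100, max_features=m, random_state=1)
    forest.fit(X_tr, y_tr)
    mse_by_m[m] = mean_squared_error(y_te, forest.predict(X_te))

for m, mse in mse_by_m.items():
    print(m, round(mse, 2))
```

Which value of m works best depends on the data, so it is worth treating max_features as a tuning parameter rather than a fixed rule.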

In [None]:
importance = pd.DataFrame({'Importance': regr_rf.feature_importances_*100}, index=X_train.columns)
# Horizontal bar chart sorted by importance; the legend is suppressed.
importance.sort_values(by='Importance', ascending=True).plot(kind='barh', color='r', legend=False)
plt.xlabel('Variable Importance')

## 8.3.4 Boosting

Here we use GradientBoostingRegressor from sklearn.ensemble. The argument n_estimators=500 indicates that we want 500 trees, max_depth=4 limits the depth of each tree, and learning_rate=0.02 is the shrinkage applied to each tree's contribution.

In [None]:
regr_boost = GradientBoostingRegressor(n_estimators=500, learning_rate=0.02, max_depth=4, random_state=1)
regr_boost.fit(X_train, y_train)

Let us check the feature importance and MSE.

In [None]:
feature_importance = regr_boost.feature_importances_*100
# Sort ascending so the bars are ordered by importance.
rel_imp = pd.Series(feature_importance, index=X_train.columns).sort_values()
rel_imp.plot(kind='barh', color='r')
plt.xlabel('Variable Importance')

In [None]:
y_pred = regr_boost.predict(X_test)
mean_squared_error(y_test,y_pred)
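The number of trees interacts with the learning rate, so it can help to look at the test MSE after each boosting stage. The sketch below uses staged_predict on synthetic data; the data and the smaller n_estimators=200 are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Boston data).
X_demo, y_demo = make_regression(n_samples=400, n_features=13,
                                 n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.5, random_state=0)

boost = GradientBoostingRegressor(n_estimators=200, learning_rate=0.02,
                                  max_depth=4, random_state=1)
boost.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
test_mse = [mean_squared_error(y_te, pred) for pred in boost.staged_predict(X_te)]
best_n = int(np.argmin(test_mse)) + 1  # number of trees with the lowest test MSE

print(best_n, test_mse[best_n - 1])
```

In practice the stage would be chosen on a validation set or by cross-validation rather than on the test set, which is used here only to keep the sketch short.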