### Experiments with entropy, information gain, and decision trees.

Iris fact of the day: Iris setosa's root contains a toxin that was used by the Aleut tribe in Alaska to make poisonous arrowheads.

This version of the notebook focuses only on the sklearn implementation.

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# For producing decision tree diagrams.
# (sklearn.externals.six has been removed from sklearn; StringIO lives in the standard library.)
from IPython.display import Image, display
from io import StringIO
import pydot

Couldn't import dot_parser, loading of dot files will not be possible.


In [2]:
# Load the data, which is included in sklearn.
iris = load_iris()
print('Iris target names:', iris.target_names)
print('Iris feature names:', iris.feature_names)
X, Y = iris.data, iris.target

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# Split into train and test.
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]

Iris target names: ['setosa' 'versicolor' 'virginica']
Iris feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
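
The manual shuffle-and-slice above could also be sketched with sklearn's train_test_split, which permutes features and labels together in one call. This is an alternative, not what the notebook uses: it will not reproduce the permutation produced by np.random.seed(0), so downstream numbers would differ. The alt_ variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Shuffle and split in one call; rows of the data and their labels stay aligned.
# train_size=100 mirrors the 100/50 split used in the notebook.
alt_train_data, alt_test_data, alt_train_labels, alt_test_labels = train_test_split(
    iris.data, iris.target, train_size=100, random_state=0)

print(alt_train_data.shape, alt_test_data.shape)
```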


For the following questions, you might find this function useful for printing out the tree. If you want to try a graphical approach, look into sklearn's export_graphviz function:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

The function below prints a 'pseudocode' version of the tree as nested if-else statements.

In [3]:
def get_code(decision_tree, feature_names):
    """Print a fitted decision tree as nested if-else pseudocode."""
    # Unpack the low-level arrays that describe the fitted tree.
    left = decision_tree.tree_.children_left
    right = decision_tree.tree_.children_right
    threshold = decision_tree.tree_.threshold
    # Leaf nodes have a negative feature index, so only look up names for split nodes.
    features = [feature_names[i] if i >= 0 else None for i in decision_tree.tree_.feature]
    value = decision_tree.tree_.value

    def recurse(node):
        # sklearn marks a leaf by setting its child indices to -1.
        if left[node] != -1:
            print("if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            recurse(left[node])
            print("} else {")
            recurse(right[node])
            print("}")
        else:
            print("return " + str(value[node]))

    recurse(0)

# example call:
# get_code(dt, iris.feature_names)
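
For the graphical route mentioned above, here is a minimal sketch of export_graphviz. It assumes a reasonably recent sklearn, where out_file=None returns the DOT source as a string. The names demo_dt and dot_source are illustrative, and the tree is fit on the full dataset only to keep the block self-contained. Rendering the DOT source to an image additionally requires Graphviz, which is not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# Fit a throwaway tree so there is something to export.
demo_dt = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# With out_file=None the DOT description is returned as a string
# instead of being written to a file.
dot_source = export_graphviz(
    demo_dt,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Show the beginning of the DOT text.
print(dot_source[:200])
```

The resulting string can be handed to Graphviz tooling (for example, the pydot package imported at the top of the notebook) to produce an image.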

Let's explore what sklearn's DecisionTreeClassifier can do. Be sure to consult the documentation at: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Here is the basic code to fit a decision tree and evaluate its accuracy:

In [5]:
dt = DecisionTreeClassifier()
dt.fit(train_data, train_labels)

print('Accuracy:', dt.score(test_data, test_labels))

get_code(dt, iris.feature_names)

Accuracy: 0.96
if ( petal width (cm) <= 0.800000011921 ) {
return [[ 31.   0.   0.]]
} else {
if ( petal width (cm) <= 1.65000009537 ) {
if ( petal length (cm) <= 4.94999980927 ) {
return [[  0.  32.   0.]]
} else {
if ( sepal length (cm) <= 6.05000019073 ) {
if ( petal length (cm) <= 5.05000019073 ) {
return [[ 0.  0.  1.]]
} else {
return [[ 0.  1.  0.]]
}
} else {
return [[ 0.  0.  3.]]
}
}
} else {
return [[  0.   0.  32.]]
}
}
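
To connect the printed tree back to entropy and information gain, the sketch below computes the information gain of the root split (petal width <= 0.8) from the class counts shown in the output above. The left branch holds [31, 0, 0]; summing the leaves under the right branch gives [0, 33, 36]. Note that the tree itself was fit with the default Gini criterion, so this only evaluates the same split under entropy. The helper names entropy and information_gain are hypothetical, not sklearn functions.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy, in bits, of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    # Drop empty classes: their contribution p * log2(1/p) is taken to be 0.
    probs = counts[counts > 0] / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = float(np.sum(parent))
    weighted = sum(np.sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Class counts (setosa, versicolor, virginica) read off the printed tree.
left = [31, 0, 0]     # petal width <= 0.8
right = [0, 33, 36]   # petal width > 0.8: sum of all leaves on that side
parent = [31, 33, 36]

root_gain = information_gain(parent, [left, right])
print("Parent entropy: %.3f bits" % entropy(parent))
print("Information gain of the root split: %.3f bits" % root_gain)
```

The left child is pure, so its entropy is zero and all remaining uncertainty sits in the right child, which is why later splits only refine the versicolor/virginica boundary.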


Here are some things to try:

1. For the basic tree above, which features did it use, and in what order? Where did it split each feature? What is the tree's depth?
2. Try changing the split criterion to 'entropy'. Does that affect the accuracy? Which features are used? Where are the features split?
3. Try making a decision tree of depth 1. How does its performance compare to the basic one? On which feature does it split?
4. Look through the other parameters in the sklearn documentation and try one out. What does it do? If it is a numeric parameter, try some extreme values. What happens?
5. Try adding some random features to the feature set. Does the tree ignore them? How could you prevent overfitting?
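
As a possible starting point for items 2, 3, and 5, the sketch below recreates the notebook's shuffle and split, then fits trees with different criterion and max_depth settings and with extra pure-noise columns. It is one way to set up the experiments, not a reference answer. The number of noise columns (10), the random seeds, and names such as settings, results, and noisy_dt are arbitrary assumptions, and get_depth assumes a reasonably recent sklearn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Recreate the notebook's shuffle and 100/50 split so this block runs on its own.
iris = load_iris()
X, Y = iris.data, iris.target
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]

# Items 2 and 3: vary the split criterion and the maximum depth.
settings = [
    {"criterion": "gini"},
    {"criterion": "entropy"},
    {"criterion": "gini", "max_depth": 1},
]
results = []
for params in settings:
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(train_data, train_labels)
    accuracy = clf.score(test_data, test_labels)
    results.append((params, clf.get_depth(), accuracy))
    print(params, "depth:", clf.get_depth(), "accuracy:", accuracy)

# Item 5: append pure-noise columns and check how much the tree relies on them.
rng = np.random.RandomState(1)
n_noise = 10  # arbitrary choice for illustration
noisy_train = np.hstack([train_data, rng.randn(len(train_data), n_noise)])
noisy_test = np.hstack([test_data, rng.randn(len(test_data), n_noise)])

noisy_dt = DecisionTreeClassifier(random_state=0).fit(noisy_train, train_labels)
# Columns 0-3 are the real iris features; the remaining columns are noise.
noise_importance = noisy_dt.feature_importances_[4:].sum()
print("Accuracy with noise columns:", noisy_dt.score(noisy_test, test_labels))
print("Total importance assigned to noise columns:", noise_importance)
```

Parameters such as max_depth and min_samples_leaf limit how far the tree can grow, which is one common lever against overfitting to noise columns; comparing noise_importance across such settings is a natural follow-up.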