### Intro

In the [previous module](../vid1_hello_world/machine_learning_hello_world.ipynb) I used a decision tree as my classifier. In this section I'm going to visualize it. 

Decision trees are interpretable and easy to understand (i.e., you know exactly why the classifier makes a decision).



### Goals: 

1. Import dataset.
2. Train a classifier.
3. Predict label for new flower.
4. Visualize the tree.

# Import Dataset

In [1]:
# import iris into scikit-learn
from sklearn.datasets import load_iris

# load iris... this includes the table found on the wiki and some metadata
iris = load_iris()

# the metadata tells you the names of the features and the names of the different types of flowers
print(iris.feature_names)
print(iris.target_names)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']


In [2]:
# the features and examples are contained in the data variable
# the following code displays the measurements of the first flower in the list
print(iris.data[0])

[5.1 3.5 1.4 0.2]


In [3]:
# the target variable contains the labels, which are indexes into target_names
print(iris.target[0])

0
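
Because each label is an index into `target_names`, a short lookup turns the number back into a flower name. This is only a minimal sketch, assuming scikit-learn is installed; `first_label` and `first_name` are names I made up for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# the label of the first flower is an integer index
first_label = iris.target[0]

# use that index to look up the human-readable name
first_name = iris.target_names[first_label]
print(first_label, first_name)
```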


In [4]:
# both the data and target variables contain 150 entries, and you can iterate over them using the following code
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))

Example 0: label 0, features [5.1 3.5 1.4 0.2]
Example 1: label 0, features [4.9 3.  1.4 0.2]
Example 2: label 0, features [4.7 3.2 1.3 0.2]
Example 3: label 0, features [4.6 3.1 1.5 0.2]
Example 4: label 0, features [5.  3.6 1.4 0.2]
Example 5: label 0, features [5.4 3.9 1.7 0.4]
Example 6: label 0, features [4.6 3.4 1.4 0.3]
Example 7: label 0, features [5.  3.4 1.5 0.2]
Example 8: label 0, features [4.4 2.9 1.4 0.2]
Example 9: label 0, features [4.9 3.1 1.5 0.1]
Example 10: label 0, features [5.4 3.7 1.5 0.2]
Example 11: label 0, features [4.8 3.4 1.6 0.2]
Example 12: label 0, features [4.8 3.  1.4 0.1]
Example 13: label 0, features [4.3 3.  1.1 0.1]
Example 14: label 0, features [5.8 4.  1.2 0.2]
Example 15: label 0, features [5.7 4.4 1.5 0.4]
Example 16: label 0, features [5.4 3.9 1.3 0.4]
Example 17: label 0, features [5.1 3.5 1.4 0.3]
Example 18: label 0, features [5.7 3.8 1.7 0.3]
Example 19: label 0, features [5.1 3.8 1.5 0.3]
Example 20: label 0, features [5.4 3.4 1.7 0.2]
...

# Train a classifier

* Before you can train the classifier, you need to split up the data.
* This will allow you to test the classifier's accuracy.
* Do this by removing several examples from the data set and putting them aside for later. This will be called your testing set.
* Testing data is kept separate from training data.
* Training data will be used to train the classifier.
* You will use your testing examples later to test how accurate your classifier is on data it has never seen before.

In [5]:
# for this example I'll remove one of each type of flower
# the first setosa is at index 0
# the first versicolor is at index 50
# the first virginica is at index 100
test_idx = [0, 50, 100]

# import numpy library
import numpy as np



# training data... simply removing 3 entries from data and target variables
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# testing data... you now have two sets of variables: one for training and one for testing
# the testing variables contain only the 3 examples that were removed above
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]
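
A quick sanity check on the split can catch mistakes early, for example forgetting `axis=0`, in which case `np.delete` flattens the array before deleting. The sketch below repeats the split so it runs on its own and simply prints the shapes. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]

# axis=0 removes whole rows (flowers); without it np.delete flattens the array first
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

# the three removed flowers become the testing set
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

# 150 flowers in total: 147 for training, 3 for testing, 4 features each
print(train_data.shape, train_target.shape)
print(test_data.shape, test_target.shape)
```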


In [6]:
# as before you can create a decision tree classifier 
from sklearn import tree
clf = tree.DecisionTreeClassifier()

# now train it on the training data
clf.fit(train_data, train_target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

# Predict Label for New Flower

In [7]:
# Now use the tree to classify your testing data 
# even though you only have one flower of each type you can print out the labels you expect
print(test_target)

[0 1 2]


In [8]:
# now take a look at what the tree predicts
# give it the features of your testing data
# it will return labels
print(clf.predict(test_data))

[0 1 2]


Notice that the predicted labels match the labels in my testing data, which indicates that the classifier's predictions were all correct. Note that this was a very simple test; I'll do more detailed examples in future modules.
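
One way to put a number on that comparison is to compute the fraction of correct predictions. The sketch below uses `accuracy_score` from `sklearn.metrics`, which the original notebook does not use, so treat it as one possible approach. With only three testing flowers the result is a rough check and not a real evaluation.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()
test_idx = [0, 50, 100]

# same split as above: hold out one flower of each type
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# fraction of testing flowers whose predicted label equals the expected label
predictions = clf.predict(test_data)
accuracy = accuracy_score(test_target, predictions)
print(accuracy)
```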



# Visualize the Tree

Now let's visualize the tree to see how the classifier works.

In [9]:
# viz code
# StringIO comes from the standard library (sklearn.externals.six was removed from newer scikit-learn releases)
from io import StringIO
import pydot
dot_data = StringIO()
tree.export_graphviz(clf,
                     out_file=dot_data,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True,
                     rounded=True,
                     impurity=False
                    )

graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph[0].write_pdf("iris.pdf")

To view the PDF:
* open the terminal
* navigate to this project's folder
* run the following command (macOS only)
    * open -a Preview iris.pdf
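
If opening the PDF is not an option (the `open` command is specific to macOS, and writing the PDF needs Graphviz installed), a plain-text rendering is a possible alternative. This sketch uses `tree.export_text`, which I'm assuming is available in your scikit-learn version (it was added in 0.21). It is not part of the original walkthrough.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]

# same training data as above
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# each indented line is one question (feature <= threshold) or a leaf with its class
text_view = tree.export_text(clf, feature_names=iris.feature_names)
print(text_view)
```

The leaves in this view show class indexes (0, 1, 2), which map to `iris.target_names` in the same way the labels do.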

In [10]:
# now use the tree diagram to classify an example from my testing data
# the tree diagram works the same way it would in code
# since you know this flower is a setosa, use its features to navigate through the diagram and determine what the tree predicts
# here are the features and label from the first testing flower
print(test_data[0], test_target[0])

[5.1 3.5 1.4 0.2] 0
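
The same walk through the diagram can be sketched in code by following the fitted tree one question at a time. This is only an illustration that reads the `tree_` attribute of the classifier. The variable names (`flower`, `fitted`, `walked_label`) are mine, and the exact questions printed depend on the tree that gets trained.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

flower = test_data[0]  # the first testing flower (a setosa)
fitted = clf.tree_     # low-level arrays describing the trained tree
node = 0               # start at the root

# a leaf has no children; scikit-learn marks that with -1
while fitted.children_left[node] != -1:
    feature = fitted.feature[node]
    threshold = fitted.threshold[node]
    answer = flower[feature] <= threshold
    print("%s <= %.2f? %s" % (iris.feature_names[feature], threshold, answer))
    # True goes to the left child, False goes to the right child
    node = fitted.children_left[node] if answer else fitted.children_right[node]

# the leaf stores per-class counts (or proportions, depending on the scikit-learn version)
# the largest entry is the predicted class
walked_label = clf.classes_[np.argmax(fitted.value[node])]
print(iris.target_names[walked_label])
```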


# Wrapup

Every question the decision tree asks must be about one of the data set features. Therefore, the better the features are, the better the tree you can build.
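
To get a rough sense of which features the tree's questions rely on, you can look at the `feature_importances_` attribute of the trained classifier. The original module doesn't cover this attribute, so take this as a sketch and not a full analysis. The values can change between runs because I'm not fixing `random_state`.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# one score per feature; the scores sum to 1, and 0 means the tree never asked about that feature
importances = clf.feature_importances_
for name, importance in zip(iris.feature_names, importances):
    print("%s: %.3f" % (name, importance))
```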

# Resource

[Visualizing a Decision Tree](https://www.youtube.com/watch?v=tNa99PG8hR8&list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal&t=7s&index=3)