# Decision Trees with Sklearn

This notebook trains and evaluates a Decision Tree classification model in Sklearn.

* Method: [Decision Tree](http://scikit-learn.org/stable/modules/tree.html)
* Dataset: Iris


## Imports

In [None]:
from os import environ
# Optional: uncomment and adjust if the Graphviz `dot` binary is not on the PATH
# environ["GRAPHVIZ_DOT"] = "/home/students/anaconda/bin/dot"

import numpy as np

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
import graphviz

%matplotlib inline

## Load and Prepare the Data

In [None]:
# Load the dataset
data = load_iris()

In [None]:
# Get information on the dataset
print(data.DESCR)

In [None]:
# Split the data into labels (targets) and features
label_names = data['target_names']
labels = data['target']

feature_names = data['feature_names']
features = data['data']

# View the data
print(label_names)
print(labels[0])
print("")
print(feature_names)
print(features[0])

In [None]:
# Create test and training sets
X_train, X_test, Y_train, Y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)
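As a quick sanity check on the split, the sketch below repeats it and prints the resulting shapes and per-class counts. The `stratify=labels` argument is not used in this notebook; it is added here as an assumption, only to illustrate one way of keeping the class proportions similar in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
features, labels = data["data"], data["target"]

# Same split as in the notebook, plus stratification (an assumption of this sketch)
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)

# Number of rows and columns in each set
print(X_train.shape, X_test.shape)

# Number of samples per class in each set
print(np.bincount(Y_train), np.bincount(Y_test))
```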

## Fit a Decision Tree Model

In [None]:
# Create an instance of the Decision Tree classifier
model = tree.DecisionTreeClassifier()

# Train the model
model.fit(X_train, Y_train)
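With default settings, `DecisionTreeClassifier` keeps splitting until the leaves are pure or too small to split further, and ties between equally good splits can be broken differently from run to run. The sketch below is one possible way to make a fit reproducible and to inspect the size of the fitted tree. The values `random_state=0` and `max_depth=2` are illustrative assumptions, not settings used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fixing random_state makes repeated fits produce the same tree
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(full_tree.get_depth(), full_tree.get_n_leaves())

# Limiting max_depth gives a smaller tree that is easier to read
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(shallow_tree.get_depth(), shallow_tree.get_n_leaves())
```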

## Create Predictions

In [None]:
# Create predictions
predictions = model.predict(X_test)
print(predictions)

In [None]:
# Predict the probability of each class
pred_probs = model.predict_proba(X_test)
print(pred_probs[0])
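For a decision tree, the probability of a class is the fraction of training samples of that class in the leaf a sample falls into, so the predicted class is simply the one with the highest probability. The sketch below checks this relationship on a freshly fitted model; `random_state=0` is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)
pred_probs = model.predict_proba(X_test)

# Each row holds one probability per class and sums to 1
print(pred_probs[0], pred_probs[0].sum())

# Taking the most probable class per row reproduces model.predict
from_probs = model.classes_[np.argmax(pred_probs, axis=1)]
print(np.array_equal(from_probs, model.predict(X_test)))
```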

In [None]:
# Create a plot to compare actual class (Y_test) and the predicted class (predictions)
fig = plt.figure(figsize=(20,10))
plt.scatter(Y_test, predictions)
plt.xlabel("Actual Class: $Y_i$")
plt.ylabel(r"Predicted Class: $\hat{Y}_i$")
plt.title(r"Actual vs. Predicted Class: $Y_i$ vs. $\hat{Y}_i$")
plt.show()
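Because class labels are discrete, many points in such a scatter plot land on top of each other, so it does not show how many samples fall into each actual/predicted pair. A confusion matrix counts those pairs directly. The sketch below uses small made-up label arrays (not results from this notebook) to show the idea with `sklearn.metrics.confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In the notebook itself, the same call could be applied to `Y_test` and `predictions`.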

## Model Evaluation

### Accuracy

The accuracy score is the fraction of correctly classified samples (the default) or, with `normalize=False`, the count of correctly classified samples.

In [None]:
print("Accuracy Score: %.2f" % accuracy_score(Y_test, predictions))
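To make the two modes concrete, the sketch below scores a small set of made-up labels (not results from this notebook) both ways. Four of the five predictions match, so the fraction is 4/5 and the count is 4.

```python
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]

# Default: fraction of correct predictions
fraction = accuracy_score(y_true, y_pred)

# normalize=False: number of correct predictions
count = accuracy_score(y_true, y_pred, normalize=False)

print(fraction, count)
```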

### K-Fold Cross Validation

This estimates the accuracy of a Decision Tree model by splitting the data into 5 folds and, for each fold in turn, fitting a model on the other 4 folds and scoring it on the held-out fold. The result is an array of the 5 scores, one per fold.

In [None]:
# Get scores for 5 folds over the data
clf = tree.DecisionTreeClassifier()
scores = cross_val_score(clf, data.data, data.target, cv=5)

# Print the scores and mean score
print("Scores: {}".format(scores))
print("Mean Score: %0.2f" % np.mean(scores))
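The loop that `cross_val_score` runs internally can be written out by hand. The sketch below is a rough equivalent, assuming stratified 5-fold splitting (which scikit-learn uses for classifiers when `cv` is an integer) and a fixed `random_state=0`, which is an addition made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh model on 4 folds, then score it on the held-out fold
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

print("Scores: {}".format(np.round(fold_scores, 2)))
print("Mean Score: %0.2f" % np.mean(fold_scores))
```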

### View the Tree

In [None]:
# Export the fitted tree in Graphviz DOT format and render it inline
dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=data.feature_names,
                                class_names=data.target_names,
                                filled=True,
                                rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
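If Graphviz is not installed, the tree can also be printed as plain text. The sketch below uses `sklearn.tree.export_text` as an alternative to the rendering above. It fits its own small model, and `max_depth=2` and `random_state=0` are assumptions chosen only to keep the output short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(data.data, data.target)

# One line per split or leaf, indented by depth
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
```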