# Decision Trees

In this tutorial, we will use the decision tree tools that are available in scikit-learn. In particular, we will apply a simple decision tree, using the Gini index, to the Iris dataset, followed by a more complex dataset. Both are classification problems.

In [None]:
# preamble
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

import graphviz 

# Helper function to plot the decision tree. This uses the graphviz library.
def plot_tree(model, feature_names=None, class_names=None):
    '''
    Takes a fitted DecisionTreeClassifier object, along with optional lists of feature names and class names,
    and returns a graphviz.Source object, which Jupyter renders as a tree diagram.
    The feature names and class names can be left empty; they are only used for labelling.
    '''
    # Export the fitted tree as a string in Graphviz DOT format.
    dot_data = export_graphviz(model, out_file=None,
                      feature_names=feature_names,
                      class_names=class_names,
                      filled=True, rounded=True,
                      special_characters=True)

    graph = graphviz.Source(dot_data)

    return graph


## Iris Dataset

In [None]:
# Load the iris dataset using the helper function provided by scikit-learn
from sklearn.datasets import load_iris

iris = load_iris()

# Note the prefix 'iris' here. This is so that we don't get confused with the data for the second dataset later on.
iris_X = iris.data
iris_y = iris.target

As an example, we can create a DecisionTreeClassifier object from scikit-learn. This has several options that we can change, but for now, let's see what the default tree looks like for the full iris dataset.

In [None]:
# B1) We initialise the DecisionTreeClassifier object. We can set our hyperparameters here if necessary; we'll look into this in a bit.
base_model = DecisionTreeClassifier()

# B2) DecisionTreeClassifier has a fit method, which takes the train X and train y to learn a tree. The necessary optimisation is done here. 
fitted_base_model = base_model.fit(iris_X, iris_y)

# B3) Use the helper function defined above to plot the learned model.
plot_tree(fitted_base_model, iris.feature_names, iris.target_names)

Note that, as per the [documentation](https://scikit-learn.org/stable/modules/tree.html), `DecisionTreeClassifier` implements an optimised version of the CART algorithm. See the documentation for details of the particular algorithm scikit-learn implements.

Since it implements CART, `DecisionTreeClassifier` uses the Gini index by default to measure the impurity of a node. You can use entropy (information gain) instead by setting `criterion='entropy'` when initialising the object.
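
To make the two impurity measures concrete, here is a minimal sketch that computes them by hand from per-class sample counts. The helper names `gini` and `entropy` are made up for illustration and are not part of scikit-learn; entropy is measured in bits here.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node, given the number of samples in each class."""
    total = sum(counts)
    # Empty classes contribute nothing, so they are skipped to avoid log(0).
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

root = [50, 50, 50]  # the full Iris dataset: 50 samples in each of 3 classes
pure = [50, 0, 0]    # a leaf that contains a single class

print(f"root: gini={gini(root):.3f}, entropy={entropy(root):.3f}")
print(f"pure: gini={gini(pure):.3f}, entropy={entropy(pure):.3f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed. The Gini value for `root` should agree with the value shown in the root node of the tree plotted above.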

Other options are listed in the [API reference](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). A key option is `max_depth`, which sets the maximum depth of the tree. If it isn't set, as above, the algorithm keeps splitting until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples; `min_samples_split` is another option we can set.

In other words, `max_depth` stops the tree from growing once its longest path reaches `max_depth`, rather than growing the full tree seen above. Limiting the depth in this way acts as a form of pruning: it simplifies the model, makes it more interpretable to humans, and helps reduce overfitting.
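
As a small illustration of the effect of `max_depth`, the sketch below fits three trees to the full Iris dataset and reports their size using the `get_depth()` and `get_n_leaves()` methods of the fitted classifier. The depth values tried and `random_state=0` are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree_sizes = {}
for depth in [None, 3, 1]:  # None means "no limit", which is the default
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    # Record the actual depth and number of leaves of the fitted tree.
    tree_sizes[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(f"max_depth={depth}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

The unrestricted tree should be the deepest one, and each restricted tree never exceeds the depth it was given.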

Let's look at how `max_depth` affects the accuracy of the model.

In [None]:
# Create a test-train data split. 
from sklearn.model_selection import train_test_split

# We have set the random seed to be 2, by setting the random_state parameter. 
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.3, random_state=2)

# Initialise a new model that uses the default `max_depth`. The code pattern is the same as above.

# Fill in the blanks below.
# B4) Create a DecisionTreeClassifier object with the correct hyperparameters. This is your model.


# B5) Use the DecisionTreeClassifier.fit(X, y) method to optimise the model.


# B6) Use DecisionTreeClassifier.predict(X) to make predictions.


# B7) Use metrics.accuracy_score(y_test, y_predictions) to compute accuracy scores. Print it.
full_model_accuracy = 

print(f'Accuracy: {full_model_accuracy}')

# B8) Plot the tree using plot_tree(model, feature_names, class_names)


In [None]:
# Initialise another model that sets `max_depth=3`. The code pattern is the same as above.

# B4)

# B5)

# B6)

# B7)
small_model_accuracy = 

print(f'Accuracy: {small_model_accuracy}')
# B8)


We can see here that our accuracy has improved, albeit slightly.
Note, however, that this result depends on the random seed we used, since we only ran the experiment once.
Try changing the `random_state` parameter in the call to `train_test_split` above, and see whether the result changes.
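
One way to see how much the comparison depends on the split is to repeat it for several seeds. The sketch below is only an illustration: the five seeds are an arbitrary choice, and the classifier's own `random_state` is fixed to 0 so that only the train/test split varies.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

all_scores = []
for seed in range(5):
    # A different seed gives a different 70/30 train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    scores = {}
    for depth in (None, 3):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        scores[depth] = accuracy_score(y_te, clf.predict(X_te))
    all_scores.append(scores)
    print(f"seed={seed}: full tree={scores[None]:.3f}, max_depth=3={scores[3]:.3f}")
```

If the ranking of the two models flips between seeds, a single split is not enough evidence to prefer one depth over the other.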

We can now run a more extensive comparison by using a grid search over several `max_depth` values.
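
As a rough sketch of what a grid search with cross-validation computes, the loop below scores each candidate `max_depth` with `cross_val_score` and keeps the one with the highest mean accuracy. Using `cv=5` is an assumption made to keep the example quick, and the names `mean_scores` and `best_depth` are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

mean_scores = {}
for depth in [1, 2, 3, 4, 5, 6]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=2)
    # cross_val_score returns one accuracy per fold; average them.
    mean_scores[depth] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean accuracy={mean_scores[depth]:.3f}")

# The "best" depth is the one with the highest mean cross-validated accuracy.
best_depth = max(mean_scores, key=mean_scores.get)
print(f"Best max_depth: {best_depth}")
```

`GridSearchCV` wraps this kind of loop and, by default, also refits the best model on the full dataset.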

In [None]:
# Try using the GridSearchCV from sklearn.model_selection. 
# You can specify the set of `max_depth` values that you want to try by setting `param_grid={'max_depth':[1, 2, 3, 4, 5, 6]}`.
# Documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

from sklearn.model_selection import GridSearchCV

# The code pattern here is similar to the previous sections. 
# G1) Initialise a GridSearchCV object with the correct model, param_grid, and cv; `cv=k` does a k-fold cross-validation.
grid_search_model = GridSearchCV(DecisionTreeClassifier(random_state=2), {'max_depth':[1, 2, 3, 4, 5, 6]}, cv=15)

# G2) Use the GridSearchCV.fit(X, y) method to run the grid search with cv.
fitted_grid_search_model = grid_search_model.fit(iris_X, iris_y)

In [None]:
# Print the mean accuracy scores. 
# The fitted GridSearchCV object has an attribute model.cv_results_ (note the underscore) that gives us a dict object with several results from the cross-validation.
# In particular, one of the key-value pairs available is 'mean_test_score', which returns the average score for each parameter value over the cv folds.

accuracy_scores = fitted_grid_search_model.cv_results_['mean_test_score']
print(f"Mean accuracy scores: {accuracy_scores}")

# Plot the best estimator you found
# GridSearchCV.best_estimator_ (again, the underscore) returns the model that performed the best. This behaves the same as the model objects from before, so we can plot it.
# G3) Get the best model
best_tree_model = fitted_grid_search_model.best_estimator_

# G4) Plot the best model
plot_tree(best_tree_model, iris.feature_names, iris.target_names)

## Pima Indians Diabetes Dataset
Now let's do the same analysis for a much more complicated dataset. Details about this dataset can be found [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database). In summary, it contains 768 data points with 8 features and a single binary target variable.

In [None]:
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

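# Load the dataset. Here we are loading into pandas, since the data is available as a CSV file.
# The DecisionTreeClassifier object can accept DataFrames when fitting.
# header=0 assumes diabetes.csv is the Kaggle file, whose first row holds column names; those names are replaced by col_names.
# If your copy of the file has no header row, use header=None instead.
pima = pd.read_csv("diabetes.csv", header=0, names=col_names)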

print(pima.head())

pima_feature_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
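# Class names must be given in ascending label order: label 0 (no diabetes) first, then label 1 (diabetes).
pima_class_names = ['False', 'True']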

pima_X = pima[pima_feature_names]
pima_y = pima.label

In [None]:
# As before, let's look at what the tree looks like with default settings.
# Follow the example in Cell 3 above, where we did something similar on the Iris Dataset.

# B1)

# B2)

# B3)

We can see here that this tree is quite unwieldy.

Let's jump straight into the grid search and see if we can do better with a smaller tree.

Note from the previous cell that the depth of this tree is 14, so the largest `max_depth` we need to consider is 14.
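
Rather than reading the depth off the plot, you can query the fitted tree and build the grid from it. The sketch below assumes a stand-in dataset (`load_breast_cancer`, which ships with scikit-learn) because `diabetes.csv` is a local file; the same two calls apply to a tree fitted on the Pima data, and `random_state=0` is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification dataset, used only so that this example runs on its own.
X, y = load_breast_cancer(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# get_depth() returns the depth of the fitted tree.
full_depth = full_tree.get_depth()

# Candidate depths from 1 up to the depth of the unrestricted tree.
param_grid = {'max_depth': list(range(1, full_depth + 1))}
print(f"Full tree depth: {full_depth}")
print(param_grid)
```

Searching beyond the depth of the unrestricted tree would not change the fitted model, so this range covers every distinct tree size for a given `random_state`.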

In [None]:
# Follow code in a similar pattern as when we did grid search on the iris dataset in Cell 6. 

# G1)

# G2)

In [None]:
# Plot the mean accuracy scores.
# As before we can use GridSearchCV.cv_results_['mean_test_score'] for this.
# Remember to run %matplotlib inline if you are using matplotlib.

In [None]:
# Plot the best estimator you found
# G3)

# G4)

# As we can see, this is much more interpretable (at least compared to the previous result), and also has a better accuracy score.