# Decision Trees

In this Jupyter notebook, we'll explore building decision tree models using two approaches:
1. the ``scikit-learn`` Python machine learning toolkit _(it's pronounced sy-kit)_
2. by hand, motivated by Chapter 17 in the Grus textbook.

We'll see that while ``scikit-learn`` is very popular and is being actively developed, as of version 0.20 there are still a number of nice data science features missing from the module (such as the pruning of trees to avoid overfitting).

While we're able to view the source of ``scikit-learn`` packages, quickly and correctly understanding the code is a different story. We'll run into errors if we provide ill-formed function calls to the scikit API, and debugging will be a challenge.

There's a tradeoff between using the scikit toolkit and writing your own code.

Much of the following code utilizes the ``pandas`` library.

## Links

* http://scikit-learn.org/ The current stable version is 0.20. I had to update my Anaconda from 0.17 to 0.20 -- there are some differences in the decision tree API between these two versions. Further, when googling after running into problems, you'll have to navigate between not only python2 and python3 code, but also the version of scikit. Ugh! Real programmer problems!
* http://pandas.pydata.org/ Pandas module documentation.

## Loading Data: Iris

We'll use the iris data set again in this notebook.

Even though we saw that the iris dataset has an extra blank line at the end, pandas opens it just fine. I specify the column names because the data file doesn't have a header row. 

In [None]:
import pandas as pd
print("Pandas version: ", pd.__version__)

In [None]:
url="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Column order in iris.data: sepal length, sepal width, petal length, petal width, class
df = pd.read_csv(url, names=['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species'])
print("Dimensions of dataframe: ", df.shape)
print("Number of rows: ", len(df))
print("Number of columns: ", len(df.columns))

### Data Exploration

Displaying the beginning and ending rows of the dataset to verify that everything seems to be correctly loaded.

In [None]:
df.head(10)

In [None]:
df.tail()

#### Class Value Counts

What are the different class values? How can we automatically get this information using pandas?

In [None]:
df['species'].value_counts()

In [None]:
df['species'].unique()

## Decision Tree Experiments

Some questions that we want to explore:
* What decision tree induction algorithm is used in ``scikit-learn``: ID3, C4.5, CART, something else?
* How are the features converted from continuous to categorical -- binning? some other way?
* Are there different "hyperparameters" we can try?
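
On the second question: the scikit-learn user guide describes its trees as an optimized version of CART, and CART-style learners generally split a continuous feature on a threshold rather than binning it first. The following is a minimal pure-Python sketch of that idea, not scikit-learn's actual implementation. The helper names `gini` and `best_threshold` and the toy values are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best split 'value <= threshold'."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        # Weight each side's impurity by the share of rows it receives.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Toy petal lengths (hypothetical values, not read from the data file).
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)
```

A full tree learner would repeat this search over every feature at every node and recurse on the two resulting subsets.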

### scikit-learn

`sklearn` is the name of the scikit-learn package in Python.

It should already be installed with the anaconda environment (but you want to make sure it's updated).

In [None]:
# !pip3 install scikit-learn --upgrade

In [None]:
import sklearn
print("scikit-learn version: ", sklearn.__version__)

### Exporting to a feature matrix and target vector

scikit-learn requires data to be properly formatted, in the layout of a:
1. feature matrix "X" ( _n_ records x _m_ features, one row per record )
2. target vector "y" ( _n_ class values)

Below, we extract the feature matrix and target array from the `pandas` dataframe, using `pandas` DataFrame operations:

In [None]:
X = df.drop('species', axis=1)
y = df['species']

Verifying the shape dimensions of the extracted feature matrix and target vector:

In [None]:
print(X.shape)
print(y.shape)

Indexing into the feature matrix and target vector: 

In [None]:
X.head()

In [None]:
X['sepal-width'][3]

In [None]:
y.head()

In [None]:
print(type(df))
print(type(X))
print(type(y))

### Decision Tree Induction

API: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

Note there are lots of parameters that we can specify.

In [None]:
import sklearn.tree

model1 = sklearn.tree.DecisionTreeClassifier()
model1 = model1.fit(X, y)   # naming the decision tree 'model1', for lack of a better name
model1

### Visualizing the learned tree model

_(Ugly! two installations needed)_

http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz

This took me some time to get working on my home Mac laptop. 

1. https://pypi.org/project/graphviz/

In [None]:
#!pip3 install graphviz

2. As specified on the pypi graphviz documentation: "To render the generated DOT source code, you also need to install Graphviz. Make sure that the directory containing the dot executable is on your systems’ path."
https://www.graphviz.org/

In [None]:
#!brew install graphviz

In [None]:
import graphviz

In [None]:
print(X.columns)
print(y.unique())

In [None]:
dot_data = sklearn.tree.export_graphviz(model1, out_file=None,
                         feature_names=list(X.columns),       # plain list of column names
                         class_names=list(model1.classes_),   # class labels in the order the model uses
                         filled=True, rounded=True,  
                         special_characters=True)

In [None]:
graph = graphviz.Source(dot_data)  
graph

### Running an instance through the learned decision tree model

#### Get an instance

In [None]:
lastrow = X.iloc[149]

print(lastrow)          # the last row of the feature matrix (minus the target class) of the dataset
type(lastrow)           # it's a Series object

`sklearn`'s `predict` expects a 2D array-like with one row per instance. A single pandas Series or a 1D numpy array won't work, so we convert the row into a 2D numpy array first.

`reshape` will convert data into a 2D numpy array.
* `.reshape(1,-1)` is used for a single row dataframe (an instance/observation)
* `.reshape(-1,1)` is used for a single column dataframe (an attribute/feature)
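
A small sketch of the two reshape calls listed above, using an illustrative four-value array in place of a row from the dataset. The variable names are made up for this example.

```python
import numpy as np

row = np.array([5.9, 3.0, 5.1, 1.8])    # one instance with four feature values (1D)

as_single_row = row.reshape(1, -1)       # 1 row, as many columns as needed
as_single_column = row.reshape(-1, 1)    # as many rows as needed, 1 column

print(row.shape)                # (4,)
print(as_single_row.shape)      # (1, 4)
print(as_single_column.shape)   # (4, 1)
```

The `-1` tells numpy to infer that dimension from the number of elements.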

In [None]:
print(lastrow)
print(type(lastrow))          # Series
      
print("------------")
      
print(lastrow.values)
print(type(lastrow.values))   # 1D numpy array

print("------------")

print(lastrow.values.reshape(1,-1))    # 2D array
print(type(lastrow.values.reshape(1,-1)))

### Making a Prediction on the Test Instance

We can pass a row (one that is represented as a 2D array) to the model for classification.

In [None]:
predict = model1.predict(lastrow.values.reshape(1,-1))
predict

#### Decision Path

There is a `decision_path` method (new in version 0.18) that outputs how the testing instance traveled from the root node to a leaf.

There's limited documentation on it?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.decision_path

In [None]:
print(model1.decision_path(lastrow.values.reshape(1,-1)))
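
The printed sparse matrix is hard to read. Here is a hedged sketch of one way to turn it into a list of visited nodes. It assumes the bundled `load_iris` copy of the data (so it runs offline) and a separately fitted tree named `clf`, so the node ids need not match `model1` above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                        # bundled copy of iris, used so the sketch runs offline
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

sample = iris.data[-1].reshape(1, -1)     # last instance as a 2D array
indicator = clf.decision_path(sample)     # sparse matrix, shape (1, number of nodes)

# For row 0, the nonzero column indices are the ids of the visited nodes.
visited = indicator.indices[indicator.indptr[0]:indicator.indptr[1]]
leaf_id = clf.apply(sample)[0]            # id of the leaf the sample ends in

for node_id in visited:
    if node_id == leaf_id:
        print(f"node {node_id}: leaf")
    else:
        feature = iris.feature_names[clf.tree_.feature[node_id]]
        threshold = clf.tree_.threshold[node_id]
        print(f"node {node_id}: test {feature} <= {threshold:.2f}")
```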

## Splitting Data into a Training Set and Testing Set

Note that this code is also different between <= v0.17 and the current stable version.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

There are four arguments that we'll specify, as commented below.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,    # predictive features
                                                    y,      # target column
                                                    test_size=0.33,    # 33% of dataset will be set aside for test set
                                                    random_state=42)   # for reproducibility
print("Size of training: ", len(X_train))
print(X_train)
print("Size of testing: ", len(X_test))
print(X_test)

It's interesting to compare the method call using the ``scikit-learn`` module against the Grus text which codes this by hand.

In [None]:
# Grus code

# TO INSERT....

# READ Ch. 17!
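
Until the textbook code is filled in, here is a minimal by-hand sketch of a train/test split for comparison. It is not the Grus code. The function name `split_data` and its parameters are assumptions made for illustration, and it rounds the test size up.

```python
import math
import random

def split_data(rows, test_fraction, seed=None):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)         # local generator so a seed gives a repeatable split
    shuffled = list(rows)             # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(150))               # stand-in for 150 dataset rows
train, test = split_data(rows, test_fraction=0.33, seed=42)
print(len(train), len(test))
```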

### Running our Model Against the Test Set

Use the `.score` method. 
We're going to pass to it the testing rows, which are divided into:
* ``X_test``: the testing features (everything except the target)
* ``y_test``: the testing answers (the actual target we hope we predict)
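
For a classifier, `.score` returns the mean accuracy: the fraction of instances whose predicted class matches the true class. A by-hand sketch of that calculation, with a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

predicted = ["setosa", "virginica", "versicolor", "virginica"]
actual = ["setosa", "virginica", "virginica", "virginica"]
print(accuracy(predicted, actual))    # 3 of 4 correct -> 0.75
```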

Running the 50 testing instances through ``model1`` (which was trained on all 150 instances), and comparing the model's prediction to the actual correct classification:

In [None]:
model1.score(X_test, y_test)

100% accuracy on the _test set_? This seems way too good. This is because `model1` was trained on _all_ data; we really only want it trained on the _training set_.

Trying again with a new model: `model2`.

In [None]:
model2 = sklearn.tree.DecisionTreeClassifier()
model2 = model2.fit(X_train, y_train)

In [None]:
print("Training accuracy rate:", model2.score(X_train, y_train))
print("Generalized testing accuracy rate:", model2.score(X_test, y_test))
model2.score(X_test, y_test)

This is fabulous! For academic purposes, let's build a model that is not 100% perfect, by adjusting the training/testing set splits:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,    # predictive features
                                                    y,    # target column
                                                    test_size=0.80,    # 80% of dataset will be set aside for test set
                                                    random_state=42)   # for reproducibility
print("Size of training: ", len(X_train))
print("Size of testing: ", len(X_test))

In [None]:
model3 = sklearn.tree.DecisionTreeClassifier()
model3 = model3.fit(X_train, y_train)

In [None]:
print("Training accuracy rate:", model3.score(X_train, y_train))
print("Generalized testing accuracy rate:", model3.score(X_test, y_test))

In [None]:
dot_data = sklearn.tree.export_graphviz(model3, out_file=None,
                         feature_names=list(X.columns),       # plain list of column names
                         class_names=list(model3.classes_),   # class labels in the order the model uses
                         filled=True, rounded=True,  
                         special_characters=True)
graph = graphviz.Source(dot_data)  
graph

### Other things to explore

There are other interesting attributes and methods worth exploring in the `sklearn.tree` module.

Examine the documentation. Also try using tab autocomplete.

In [None]:
print(model2.feature_importances_)
print(model2.tree_.node_count)
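
The raw `feature_importances_` array is easier to read when paired with feature names. A small sketch, assuming the bundled `load_iris` copy of the data and a separately fitted tree named `clf` rather than `model2`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Pair each importance with its feature name and list the largest first.
importances = dict(zip(iris.feature_names, clf.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} {value:.3f}")
```

The importances are normalized, so they sum to 1.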

### Pruning

Questions to answer:
* Does pruning help on the iris tree?
* Is pruning automatically performed in the sklearn implementation?

From the documentation:
"Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem."

http://scikit-learn.org/stable/modules/tree.html
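
The quoted passage names two alternatives to pruning: a maximum depth and a minimum number of samples per leaf. Here is a hedged sketch of both, using the constructor parameters `max_depth` and `min_samples_leaf`. The specific values (2 and 5) and the bundled `load_iris` data are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                          test_size=0.33, random_state=42)

# An unrestricted tree versus one limited in depth and leaf size.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
limited = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5,
                                 random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full), ("limited", limited)]:
    print(name,
          "nodes:", tree.tree_.node_count,
          "depth:", tree.tree_.max_depth,
          "test accuracy:", tree.score(X_te, y_te))
```

Comparing the node counts and test accuracies gives a first answer to whether a smaller tree helps on iris.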

### Ensemble Methods: Boosting, Bagging, Random Forests

See: http://scikit-learn.org/stable/modules/ensemble.html

In [None]:
import sklearn.ensemble

#### Boosting

Now supported in latest version! https://scikit-learn.org/stable/modules/ensemble.html#adaboost

In [None]:
model4 = sklearn.ensemble.AdaBoostClassifier(n_estimators=10)
model4 = model4.fit(X_train, y_train)
model4.score(X_test, y_test)

#### Bagging

See: http://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator

Note that because there is a random aspect to bagging (unless a random seed is used), the bagging evaluation may produce a different result every time it is run, because the decision trees will be different every time.

In [None]:
model5 = sklearn.ensemble.BaggingClassifier(sklearn.tree.DecisionTreeClassifier(),
                            max_samples=0.5, max_features=0.5)
model5 = model5.fit(X_train, y_train)
model5.score(X_test, y_test)
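
The randomness in bagging comes from how the training rows are drawn for each tree. In classic bagging, each tree gets a sample drawn with replacement. The pure-Python sketch below shows that draw and how a seed makes it repeatable. The helper `bootstrap_sample` is hypothetical. In scikit-learn the corresponding knob is the `random_state` parameter of `BaggingClassifier`.

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]

rows = list(range(10))                    # stand-in for ten training rows
print(bootstrap_sample(rows))             # varies from run to run
print(bootstrap_sample(rows, seed=0))     # repeatable
print(bootstrap_sample(rows, seed=0))     # same sample as the line above
```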

#### Random Forests

See: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees

In [None]:
model6 = sklearn.ensemble.RandomForestClassifier(n_estimators=10)
model6 = model6.fit(X_train, y_train)
model6.score(X_test, y_test)

### Examining the Misclassifications: What the Model Got Wrong

We have our testing set, represented by the dataframe `X_test` and the series `y_test`.

The below code demos how to:
1. index the `y_test` series using a row's id
2. loop/iterate through the testing data.

In [None]:
type(y_test)

In [None]:
y_test[109]

In [None]:
for index, instance in X_test.iterrows():               # get every row from the test set
    # print(index, instance['sepal-width'])             # print the index number and sepal-width attribute
    predict = model3.predict(instance.values.reshape(1,-1))    # convert the row (a Series) into a 2D numpy array, send it through the model
    if predict[0] != y_test.loc[index]:                 # compare the single prediction to the true species, looked up by row label
        print("WRONG: ", index)
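
The same check can be done without a Python loop by predicting the whole test set at once and comparing with a boolean mask. This is a sketch under assumptions: it rebuilds the data from the bundled `load_iris` copy and fits its own tree named `clf`, so the misclassified row ids may differ from those found with `model3`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target_names[iris.target], name="species")

X_tr, X_te, y_tr, y_te = train_test_split(features, target,
                                          test_size=0.80, random_state=42)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Keep the test rows' index so each prediction lines up with its true label.
predictions = pd.Series(clf.predict(X_te), index=X_te.index, name="predicted")
wrong = predictions != y_te                       # boolean mask, True where the model erred
report = pd.DataFrame({"actual": y_te[wrong], "predicted": predictions[wrong]})
print(report)
```

The index of `report` holds the row ids of the misclassified instances, which can then be inspected with `.loc`.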

In [None]:
X_test.loc[56]

## Regression Trees

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
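
A minimal sketch of the regressor's API, which mirrors the classifier's `fit`/`predict` calls. The toy data (squares of 0 through 9) and the `max_depth=2` setting are assumptions made for illustration.

```python
from sklearn.tree import DecisionTreeRegressor

# Toy one-feature data (hypothetical): y is the square of x.
X_toy = [[x] for x in range(10)]
y_toy = [x ** 2 for x in range(10)]

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# With the default criterion, each leaf predicts the mean target of the
# training rows that reach it, so the prediction is a step function of x.
print(reg.predict([[2.5], [7.5]]))

n_leaves = int((reg.tree_.children_left == -1).sum())
print("leaves:", n_leaves)
```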