Demo: Decision trees and ensembles
==================================

*This notebook is based on the course by [Fraida Fund](https://colab.research.google.com/github/ffund/ml-notebooks/blob/master/notebooks/1-colab-tour.ipynb) at the NYU Tandon School of Engineering.*

This demo notebook trains a decision tree classifier and three ensembles of decision trees (bagging, random forest, and AdaBoost) on a small dataset.

**Attribution**: Parts of this notebook are slightly modified from [this tutorial from “Intro to Data Mining”](http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial6/tutorial6.html).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

In [None]:
df = pd.read_csv('http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial6/vertebrate.csv')
df

We’ll make it a binary classification problem by merging every non-mammal class into a single `non-mammals` class:

In [None]:
df['Class'] = df['Class'].replace(['fishes','birds','amphibians','reptiles'],'non-mammals')
df

Decision tree
-------------

In [None]:
y = df['Class']
X = df.drop(['Name','Class'],axis=1)

clf_dt = DecisionTreeClassifier(criterion='entropy')
clf_dt = clf_dt.fit(X, y)

In [None]:
plt.figure(figsize=(10,10))
# Pass feature names as a plain list; class_names must follow the sorted order of clf_dt.classes_
sklearn.tree.plot_tree(clf_dt,
                    feature_names = list(X.columns),
                    class_names = ["mammals", "non-mammals"],
                    filled=True, rounded=True);
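
The tree above was fit with `criterion='entropy'`, which scores a candidate split by how much it reduces the entropy of the class labels. The following is a minimal sketch of that calculation on hypothetical toy labels, not the vertebrate data. The helper names `entropy` and `information_gain` are illustrative and are not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# Hypothetical node with an even class mix (assumption, for illustration only)
parent = ['mammals'] * 4 + ['non-mammals'] * 4

# A split that separates the classes perfectly
gain_perfect = information_gain(parent, ['mammals'] * 4, ['non-mammals'] * 4)

# A split that leaves both children as mixed as the parent
mixed = ['mammals'] * 2 + ['non-mammals'] * 2
gain_useless = information_gain(parent, mixed, mixed)

print(entropy(parent), gain_perfect, gain_useless)
```

At each node, the tree prefers the candidate split with the largest gain, so the perfect split here would be chosen over the uninformative one.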

### Feature importance

In [None]:
df_importance = pd.DataFrame({'feature': df.columns.drop(['Name', 'Class']),
                              'importance': clf_dt.feature_importances_})
df_importance

Bagged tree
-----------

In [None]:
n_tree = 3
clf_bag = BaggingClassifier(n_estimators=n_tree)
clf_bag = clf_bag.fit(X, y)

In [None]:
plt.figure(figsize=(n_tree*8, 10))
# One subplot per tree in the bagged ensemble
for idx, clf_t in enumerate(clf_bag.estimators_):
  plt.subplot(1, n_tree,idx+1)
  sklearn.tree.plot_tree(clf_t,
                      feature_names = list(X.columns),
                      class_names = ["mammals", "non-mammals"],
                      filled=True, rounded=True)

Notice the similarities! Each tree is trained on a bootstrap sample of the same data and may split on any feature, so the bagged trees are highly correlated.

Let’s look at the bootstrap sets each tree was trained on:

In [None]:
for samples in clf_bag.estimators_samples_:
  print(df.iloc[samples])
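
A bootstrap set is built by drawing row indices with replacement, so some rows repeat and others are left out. The following is a minimal sketch of that sampling step using only the standard library. The sample count of 15 and the helper name `bootstrap_indices` are assumptions for illustration, and scikit-learn's internal sampling code differs in its details.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_samples = 15           # hypothetical dataset size

for t in range(3):
    idx = bootstrap_indices(n_samples, rng)
    # Rows that were never drawn are "out of bag" for this tree
    oob = sorted(set(range(n_samples)) - set(idx))
    print(f"tree {t}: {len(set(idx))} distinct rows, out-of-bag rows: {oob}")
```

For large datasets, roughly 63% of the distinct rows end up in each bootstrap set, which is why bagged trees see overlapping but not identical data.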

Random forest
-------------

In [None]:
n_tree = 3
clf_rf = RandomForestClassifier(n_estimators=n_tree)
clf_rf = clf_rf.fit(X, y)

In [None]:
plt.figure(figsize=(n_tree*8, 10))
# One subplot per tree in the random forest
for idx, clf_t in enumerate(clf_rf.estimators_):
  plt.subplot(1, n_tree,idx+1)
  sklearn.tree.plot_tree(clf_t,
                      feature_names = list(X.columns),
                      class_names = ["mammals", "non-mammals"],
                      filled=True, rounded=True)

These trees are much less correlated, because a random forest considers only a random subset of the features at each split.
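
One way to see the difference is to compare which feature each tree uses at its root. The sketch below does this on synthetic data from `make_classification` rather than the vertebrate data. The names `X_toy`, `y_toy`, `bag`, and `rf` are illustrative, and the exact result depends on the random seed and the scikit-learn version, so a more varied set of root features in the forest is likely but not guaranteed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)

n_tree = 5
bag = BaggingClassifier(n_estimators=n_tree, random_state=0).fit(X_toy, y_toy)
rf = RandomForestClassifier(n_estimators=n_tree, random_state=0).fit(X_toy, y_toy)

# tree_.feature[0] is the index of the feature tested at the root node
bag_roots = [int(t.tree_.feature[0]) for t in bag.estimators_]
rf_roots = [int(t.tree_.feature[0]) for t in rf.estimators_]

print("bagging root features:      ", bag_roots)
print("random forest root features:", rf_roots)
```

In scikit-learn, the size of the feature subset considered at each split is controlled by the `max_features` parameter of `RandomForestClassifier`.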

AdaBoost
--------

In [None]:
n_tree = 3
clf_ab = AdaBoostClassifier(n_estimators=n_tree)
clf_ab = clf_ab.fit(X, y)

In [None]:
plt.figure(figsize=(n_tree*8, 10))
# One subplot per weak learner in the AdaBoost ensemble
for idx, clf_t in enumerate(clf_ab.estimators_):
  plt.subplot(1, n_tree,idx+1)
  sklearn.tree.plot_tree(clf_t,
                      feature_names = list(X.columns),
                      class_names = ["mammals", "non-mammals"],
                      filled=True, rounded=True)

The ensemble's output combines the predictions of all three trees in a weighted vote.
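
The weighting can be sketched by hand for the binary case of discrete AdaBoost, where a tree with a lower weighted error receives a larger weight. The error rates and votes below are hypothetical numbers chosen for illustration, not values from `clf_ab`, and the exact weighting scikit-learn applies may differ depending on its version and settings.

```python
import math

# Hypothetical weighted training error of each of three weak learners
errors = [0.1, 0.3, 0.4]

# Hypothetical votes for one sample: +1 = "mammals", -1 = "non-mammals"
votes = [+1, -1, -1]

# Discrete AdaBoost, binary case: lower error gives a larger weight
alphas = [math.log((1 - e) / e) for e in errors]

# The sign of the weighted sum of votes is the ensemble's prediction
score = sum(a * v for a, v in zip(alphas, votes))
prediction = 1 if score > 0 else -1

print([round(a, 3) for a in alphas], round(score, 3), prediction)
```

Here the single most accurate learner outweighs the two weaker learners that disagree with it, so the ensemble predicts +1 even though a simple majority vote would predict -1.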

As we add more trees, the ensemble's accuracy on the training data typically increases:

In [None]:
for p in clf_ab.staged_predict(X):
  print(np.mean(p==y))