# Decision Trees

### Supervised Machine Learning Algorithm

This notebook covers three tree-based models:

- Decision Tree
- Random Forest
- Gradient Boosted Decision Trees


Decision trees can be used for both classification and regression problems.

A decision tree essentially learns a hierarchy of 'if/else' questions that leads to a prediction.
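To make the 'if/else' hierarchy concrete, the following is a minimal sketch that fits a deliberately shallow tree and prints its learned rules with scikit-learn's `export_text`. It assumes the breast cancer dataset used later in this notebook; the name `small_tree` and the depth of 2 are illustrative choices only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A deliberately shallow tree (assumed depth of 2) so the printed rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# export_text renders the learned tests as nested if/else-style rules
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented `<=` / `>` pair in the output is one question, and each `class:` line is a leaf.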

### Drawback to a single Decision Tree

A single decision tree is very prone to overfitting the training data. This can be addressed with a Random Forest or Gradient Boosting, as we will see later.

## Internal Algorithm Working

To build a tree, the algorithm searches over all candidate tests (a feature and a threshold value) in the data set and picks the split that is most informative about the target variable.

The search is repeated recursively: each time, the algorithm finds the best split for the data remaining in each branch.
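The split search described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes Gini impurity as the measure of how informative a split is, and the helper names `gini` and `best_split` as well as the toy data are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature_index, threshold, weighted_impurity) of the best single split."""
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            # Impurity of the two children, weighted by their sizes
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best

# Tiny made-up data set: feature 1 separates the classes, feature 0 does not
X_toy = np.array([[1.0, 0.1], [2.0, 0.2], [1.5, 0.8], [2.5, 0.9]])
y_toy = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)
```

A full tree builder would call `best_split` again on the left and right subsets until the leaves are pure or a stopping rule is reached.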


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mglearn

![Raw Decision Tree Data](images/DecisionTreeData.png)

![Decision Tree Depth](images/DecisionTreeDepth.png)

## Controlling complexity of Decision Trees

Allowing a decision tree to recurse to its maximum depth will lead to a tree with just *pure leaves*.  When this happens the model is overfitting to the training data and will **NOT** generalize well to unseen data.

The key to controlling complexity, and therefore overfitting, is to limit how far the tree is allowed to grow, most commonly by restricting its depth.

### Two Strategies to prevent overfitting
- pre-pruning: stopping the growth of the tree early, for example by specifying a maximum depth
- post-pruning: building the full tree and then collapsing nodes that contain little information

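Scikit-learn supports pre-pruning through parameters such as `max_depth`, `max_leaf_nodes`, and `min_samples_leaf`. Since version 0.22 it also supports post-pruning via minimal cost-complexity pruning (the `ccp_alpha` parameter).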
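As a rough illustration of pre-pruning, the sketch below fits the same classifier with several stopping rules and reports the resulting tree size. The parameter values (4, 8, 10) are arbitrary choices for demonstration, and `settings` and `pruned_trees` are hypothetical names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Each setting stops tree growth early in a different way (values are illustrative)
settings = {
    "unrestricted": {},
    "max_depth=4": {"max_depth": 4},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

pruned_trees = {}
for name, params in settings.items():
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)
    pruned_trees[name] = clf
    print(f"{name}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}, "
          f"test accuracy={clf.score(X_test, y_test):.3f}")
```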


## Decision Tree with Cancer dataset and maximum depth

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


cancer = load_breast_cancer()
cancer.data.shape

(569, 30)

569 Observations and 30 features

In [3]:
cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

### Train DecisionTreeClassifier with unbounded depth

In [4]:
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
training_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)

print(f"Accuracy on training set: {training_score}")
print(f"Accuracy on test set: {test_score}")

Accuracy on training set: 1.0
Accuracy on test set: 0.9370629370629371


Because we did not restrict the depth, the tree grew until every leaf was pure, so the training accuracy is 1.0 (perfect).

The test accuracy of 93.7% is not quite as good as what other algorithms achieve on this dataset.

**REMEMBER:** We only have a single train/test split here, and the accuracy will change with a different split.
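To see how much the score depends on the split, here is a small sketch that repeats the experiment with a few different `random_state` values for `train_test_split`. The five seeds are arbitrary, and `split_scores` is a name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(5):  # five arbitrary seeds, each giving a different split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(np.round(split_scores, 3))
print(f"mean={np.mean(split_scores):.3f}, std={np.std(split_scores):.3f}")
```

The spread between the individual scores gives a feel for how far a single split can be trusted.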

### Train DecisionTreeClassifier with a max depth of 4

In [5]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
training_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)

print(f"Accuracy on training set: {training_score}")
print(f"Accuracy on test set: {test_score}")

Accuracy on training set: 0.9882629107981221
Accuracy on test set: 0.951048951048951
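The depth of 4 used above is only one possible choice. A hedged sketch of how one might compare several depths is shown below; it uses `cross_val_score`, which this notebook introduces further down, and the range of depths (1 to 8) and the 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

depth_scores = {}
for depth in range(1, 9):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 cross-validation folds for this depth
    depth_scores[depth] = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"max_depth={depth}: mean CV accuracy={depth_scores[depth]:.3f}")

best_depth = max(depth_scores, key=depth_scores.get)
print(f"best max_depth in this sweep: {best_depth}")
```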


### Feature importance in Trees

Recall there are 30 features. Each feature is assigned an importance between 0 and 1, and the importances of all features sum to 1. A value of 0 means the feature is not used by the tree at all, and a value of 1 means the feature alone perfectly predicts the target.

In [6]:
print(f"Feature importances:\n{tree.feature_importances_}")

Feature importances:
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.01019737 0.04839825
 0.         0.         0.0024156  0.         0.         0.
 0.         0.         0.72682851 0.0458159  0.         0.
 0.0141577  0.         0.018188   0.1221132  0.01188548 0.        ]
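The properties of the importance values can be checked directly. The sketch below refits the depth-4 tree on the same split so that it runs on its own; `importances` and `top_feature` are names introduced just for this check.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

importances = tree.feature_importances_
print(f"sum of importances: {importances.sum():.6f}")
print(f"features actually used: {np.count_nonzero(importances)} of {len(importances)}")

# Name of the single most important feature
top_feature = cancer.feature_names[np.argmax(importances)]
print(f"most important feature: {top_feature}")
```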


In [7]:
# create a dataframe to see feature name with importance
feature_importance_df = pd.DataFrame(tree.feature_importances_, cancer.feature_names,columns=['Value'])
feature_importance_df.head(30).sort_values(['Value'], ascending=False)

| Feature | Value |
|---|---|
| worst radius | 0.726829 |
| worst concave points | 0.122113 |
| texture error | 0.048398 |
| worst texture | 0.045816 |
| worst concavity | 0.018188 |
| worst smoothness | 0.014158 |
| worst symmetry | 0.011885 |
| radius error | 0.010197 |
| smoothness error | 0.002416 |
| mean radius | 0.0 |


## Random Forest - Ensemble of Decision Trees

An ensemble combines multiple machine learning models to create a more powerful overall model.

A Random Forest is a collection of Decision Trees, each slightly different from the others. Each tree may overfit part of the data, but because the trees overfit in different ways, averaging their predictions reduces the overall amount of overfitting.


In [8]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

training_score = forest.score(X_train, y_train)
test_score = forest.score(X_test, y_test)

print(f"Accuracy on training set: {training_score}")
print(f"Accuracy on test set: {test_score}")

Accuracy on training set: 1.0
Accuracy on test set: 0.958041958041958
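The averaging behind the forest can be inspected through its `estimators_` attribute. The sketch below, which refits the same forest so it runs standalone, averages the class probabilities of the individual trees by hand and compares the result with the forest's own prediction; `tree_probas`, `avg_proba`, `manual_pred`, and `tree_acc` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# Average over trees and pick the most probable class
avg_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[np.argmax(avg_proba, axis=1)]

# Accuracy of each single tree compared with the averaged ensemble
tree_acc = [(forest.classes_[np.argmax(p, axis=1)] == y_test).mean() for p in tree_probas]
print(f"mean single-tree accuracy: {np.mean(tree_acc):.3f}")
print(f"forest accuracy:           {(manual_pred == y_test).mean():.3f}")
```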


In [9]:
# create a dataframe to see feature name with importance
feature_importance_df = pd.DataFrame(forest.feature_importances_, cancer.feature_names,columns=['Value'])
feature_importance_df.head(30).sort_values(['Value'], ascending=False)

| Feature | Value |
|---|---|
| worst perimeter | 0.149129 |
| worst concave points | 0.13283 |
| worst radius | 0.112002 |
| mean concave points | 0.104655 |
| mean perimeter | 0.077421 |
| worst area | 0.071164 |
| mean concavity | 0.057732 |
| mean area | 0.044627 |
| mean radius | 0.034281 |
| area error | 0.028493 |


## Cross-validation performance of Decision Tree vs Random Forest

#### RandomForestClassifier

In [10]:
from sklearn.model_selection import cross_val_score
forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=10, scoring='accuracy')

print(scores.mean())

0.9649997839426151


#### DecisionTreeClassifier

In [11]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, X, y, cv=10, scoring='accuracy')

print(scores.mean())

0.9228577910292973


You can see that the RandomForestClassifier outperformed the DecisionTreeClassifier, with a mean cross-validated accuracy of about 0.965 versus 0.923.

## Decision Tree & RandomForest Summary

Random Forests for regression and classification are currently among the most widely used machine learning methods.

These algorithms do not usually require a great deal of tuning and they **do not** require scaling of the features.
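The claim about scaling can be checked with a quick experiment. The sketch below trains the same forest on raw and on standardized features; `StandardScaler` is used here only as an example of a scaling step, and `raw_forest` and `scaled_forest` are hypothetical names. The two scores are expected to be the same or very close, since tree splits depend on the ordering of feature values rather than their magnitude.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler().fit(X_train)

raw_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
scaled_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    scaler.transform(X_train), y_train)

raw_score = raw_forest.score(X_test, y_test)
scaled_score = scaled_forest.score(scaler.transform(X_test), y_test)
print(f"raw features:    {raw_score:.3f}")
print(f"scaled features: {scaled_score:.3f}")
```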

## Gradient Boosting

Gradient boosting works by building trees in a serial manner, where each new tree tries to correct the mistakes of the trees built before it.


In [12]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gb = GradientBoostingClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(gb, X, y, cv=10, scoring='accuracy')

print(scores.mean())

0.9614910120127906
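The serial nature of gradient boosting can be observed with `staged_predict`, which yields the ensemble's predictions after each additional tree. The sketch below uses a single train/test split rather than cross-validation to keep it short; `staged_accuracy` and the checkpoints (1, 10, 50, 100 trees) are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n trees
staged_accuracy = [np.mean(pred == y_test) for pred in gb.staged_predict(X_test)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees: test accuracy={staged_accuracy[n_trees - 1]:.3f}")
```

Comparing the early and late entries of `staged_accuracy` shows how much the later trees contribute on this split.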
