This notebook contains examples of how to use decision trees.

The examples use the Breast Cancer dataset that ships with scikit-learn and are based on content from the book:

[Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413/ref=sr_1_1?ie=UTF8&qid=1519586427&sr=8-1&keywords=introduction+to+machine+learning+with+python&dpID=51ZPksI0E9L&preST=_SX218_BO1,204,203,200_QL40_&dpSrc=srch)

by Andreas Müller and Sarah Guido.


### Decision Tree Classifier

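The book describes scikit-learn's decision trees as implementing only *pre-pruning*, meaning that tree growth is stopped early. Depth is the most common control (max_depth), but the number of leaves (max_leaf_nodes) and the minimum number of samples per leaf (min_samples_leaf) can be limited as well. Since version 0.22, scikit-learn also supports post-pruning through the ccp_alpha parameter; this notebook only covers pre-pruning.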

By default, scikit-learn's decision trees keep splitting the training data until all of the leaves are *pure*, meaning that each leaf contains samples from only a single class.  This usually leads to overfitting, so the model does not generalize well.

By pre-pruning, we can create a decision tree that generalizes better to data the model has not seen. In doing so, we will see the training accuracy go down. This is expected, because we no longer allow the decision tree to grow until every leaf is pure.
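The effect of pre-pruning can be sketched by fitting the same classifier with different growth limits and comparing the resulting tree sizes and scores. The specific limits below (4 levels, 8 leaves, 10 samples per leaf) are illustrative assumptions, not recommended values, and the variable names are chosen only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Each entry is one way of limiting tree growth; the values are illustrative.
settings = {
    "unpruned": {},
    "max_depth=4": {"max_depth": 4},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
}

results = {}
for name, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    results[name] = {
        "depth": model.get_depth(),
        "leaves": model.get_n_leaves(),
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }
    print(name, results[name])
```

The unpruned tree should reach a training accuracy of 1.0, while the limited trees are smaller and fit the training data less exactly. Whether the testing accuracy improves depends on the split.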

Decision trees have the advantage of being invariant to the scaling of the data, because each split compares a single feature against a threshold.  They work well when you have features that are on completely different scales, or a mix of binary and continuous features.  For example, suppose you have a dataset with a categorical feature, such as an airline carrier code, and a continuous feature, such as minutes of arrival delay.  You can convert the carrier code using dummy variables (one-hot encoding), and then use a tree for the model without rescaling the delay.
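The airline example can be sketched as follows. The flight records, the column names, and the "late" target are all invented for illustration; only pd.get_dummies and DecisionTreeClassifier are real library calls. The second half re-expresses the delay in seconds to show that rescaling a feature does not change the tree's predictions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical flight records, invented purely for illustration.
flights = pd.DataFrame({
    "carrier": ["AA", "DL", "UA", "AA", "DL", "UA", "AA", "DL"],
    "arrival_delay_min": [5, 42, 0, 61, 3, 75, 12, 90],
})
# Hypothetical target: 1 if the flight counted as "late", else 0.
is_late = [0, 1, 0, 1, 0, 1, 0, 1]

# One-hot encode the carrier code; the numeric delay column is left as-is.
X_flights = pd.get_dummies(flights, columns=["carrier"], dtype=int)
print(list(X_flights.columns))

flight_tree = DecisionTreeClassifier(random_state=0).fit(X_flights, is_late)

# Re-express the delay in seconds and refit. The predictions should match,
# because splits depend only on the ordering of values within a feature.
X_seconds = X_flights.assign(
    arrival_delay_min=X_flights["arrival_delay_min"] * 60
)
seconds_tree = DecisionTreeClassifier(random_state=0).fit(X_seconds, is_late)
same = (flight_tree.predict(X_flights) == seconds_tree.predict(X_seconds)).all()
print("Same predictions after rescaling:", same)
```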

In [86]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [87]:
cancer = load_breast_cancer()
print(cancer.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

Use pandas to create a dataframe of the cancer data.

In [88]:
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [89]:
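# In this dataset the target is encoded as 0 = malignant, 1 = benign,
# so the column is named for what a value of 1 means.
cancer_target_df = pd.DataFrame(cancer.target, columns=['isBenign'])
cancer_target_df.head()

Unnamed: 0,isBenign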
0,0
1,0
2,0
3,0
4,0


In [90]:
# by convention:
# X - features
# y - targets
X = cancer.data
y = cancer.target

Use train_test_split to create a training dataset and a testing dataset.  

The stratify option is used so that we keep the same class proportions in the target when we create the training and testing sets.  We do this so that neither the training nor the testing dataset has a bias in the target values.
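A quick way to check what stratification does is to compare the class fractions in the full target with those in each split. This is a minimal sketch; the helper class_fractions and the short variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

def class_fractions(labels):
    """Return the fraction of samples belonging to each class label."""
    counts = np.bincount(labels)
    return counts / counts.sum()

print("full :", class_fractions(y))
print("train:", class_fractions(y_tr))
print("test :", class_fractions(y_te))
```

With stratify=y the three rows should be nearly identical; without it, the fractions in a small test set can drift away from those of the full dataset.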

Note that train_test_split takes a single random sample, so the results will vary depending upon the value of random_state.  A more robust approach is cross-validation, for example with cross_val_score, which averages the score over several splits.
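As a rough sketch of the cross-validation alternative, the snippet below scores a depth-limited tree on five stratified folds. The choice of five folds and max_depth=4 is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Five stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=cv
)

print("Accuracy per fold:", scores)
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

The spread between folds gives a sense of how much a single train/test split could mislead.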

In [91]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.75, test_size=0.25, random_state=42)

In [92]:
# We explicitly set max_depth to None, which is the default. The classifier then expands nodes until every
# leaf is pure (or holds fewer than min_samples_split samples). The tree size is unbounded, which we
# generally do not want because it tends to overfit.
tree = DecisionTreeClassifier(max_depth=None, random_state=0)

In [93]:
tree.fit(X_train, y_train)
training_score = tree.score(X_train, y_train)
testing_score = tree.score(X_test, y_test)
print(f"Training Accuracy Score: {training_score}")
print(f"Testing Accuracy Score: {testing_score}")

Training Accuracy Score: 1.0
Testing Accuracy Score: 0.9370629370629371


Notice that the training score is a perfect 1.0. This is expected, because we let the decision tree expand its nodes until every leaf was pure, so it classifies every training sample correctly.

The testing data, which the model has not seen, scores well at about 0.937, but let's see what happens when we pre-prune the tree.

In [94]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

In [95]:
tree.fit(X_train, y_train)
training_score = tree.score(X_train, y_train)
testing_score = tree.score(X_test, y_test)
print(f"Training Accuracy Score: {training_score}")
print(f"Testing Accuracy Score: {testing_score}")

Training Accuracy Score: 0.9882629107981221
Testing Accuracy Score: 0.951048951048951


Notice that while the training accuracy goes down, in this case the testing accuracy goes up, meaning the pruned tree generalized better.

All of this should be taken with a grain of salt.  The result depends on the random sampling done by train_test_split and on the sizes of the training and testing datasets.

Instead of 75% of the data going to the training set, let's change that to 70%.

In [96]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.7, test_size=0.3, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
training_score = tree.score(X_train, y_train)
testing_score = tree.score(X_test, y_test)
print(f"Training Accuracy Score: {training_score}")
print(f"Testing Accuracy Score: {testing_score}")

Training Accuracy Score: 0.992462311557789
Testing Accuracy Score: 0.9239766081871345


Notice how the accuracy changes: with the same model settings, the testing accuracy drops from about 0.951 to about 0.924.  This is why, to fully understand model performance, K-fold cross-validation (for example with GridSearchCV) is the better approach.
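A minimal sketch of that approach uses scikit-learn's GridSearchCV class to pick max_depth by 5-fold cross-validation on the training data, then checks the chosen model once on the held-out test set. The candidate depths in param_grid are an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Candidate depths to compare; None means "grow until the leaves are pure".
param_grid = {"max_depth": [2, 3, 4, 5, 6, None]}

# For classifiers, cv=5 uses stratified 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

Because the test set plays no part in choosing max_depth, the final score is a fairer estimate of how the tuned model behaves on unseen data.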

### Feature Importance in Trees

Fitted decision tree models expose an attribute called feature_importances_.  We can use it to see which features from the dataset the tree relied on most.

In [97]:
important_features = pd.DataFrame(list(zip(cancer.feature_names, tree.feature_importances_)), columns=['Feature', 'Importance']).sort_values('Importance', ascending=False)
important_features.head(25)

Unnamed: 0,Feature,Importance
20,worst radius,0.733864
27,worst concave points,0.132649
21,worst texture,0.04999
11,texture error,0.03098
26,worst concavity,0.019168
9,mean fractal dimension,0.010499
24,worst smoothness,0.009499
25,worst compactness,0.007388
14,smoothness error,0.003377
13,area error,0.002586


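From the importance table above, you can see that *worst radius* is by far the most important feature for this tree, with an importance of about 0.73.  Note that an importance value only tells us how heavily the tree relied on a feature, not whether high values of that feature point to a malignant or a benign tumor.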
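Two properties of the importance values are worth checking: they sum to 1, and any feature the tree never splits on gets an importance of exactly 0. The sketch below refits the depth-4 tree on the 70/30 split so that it is self-contained; the names model, importances, and used are chosen only for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target,
    train_size=0.7, test_size=0.3, random_state=42,
)
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Label each importance with its feature name.
importances = pd.Series(model.feature_importances_, index=cancer.feature_names)

# Keep only the features the tree actually split on.
used = importances[importances > 0].sort_values(ascending=False)
print(used)
print("Sum of importances:", importances.sum())
print("Features never used:", int((importances == 0).sum()))
```

A tree of depth 4 has at most 15 internal nodes, so most of the 30 features end up with zero importance. That does not mean those features are uninformative, only that this particular tree did not pick them.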

In [98]:
y_predicted = tree.predict(X_test)
confusion_matrix(y_test, y_predicted) 

array([[ 57,   7],
       [  6, 101]])

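In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes.  In this dataset the target is encoded as 0 = malignant and 1 = benign, so the first row and the first column refer to malignant tumors.  Treating malignant (cancer) as the positive class, the confusion matrix indicates:

True Positive - 57:  Tumor was malignant and the model predicted malignant.

False Negative - 7:  Patient was falsely diagnosed with no cancer (benign), but in fact the tumor was malignant.

False Positive - 6:  Patient was falsely diagnosed as having cancer (malignant), but in fact the tumor was benign.

True Negative - 101:  Tumor was benign and the model predicted benign.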
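Because the label encoding is easy to misread, it can help to attach the class names to the matrix. The sketch below refits the depth-4 tree on the 70/30 split so that it is self-contained; in load_breast_cancer, label 0 is malignant and label 1 is benign. The row and column captions are chosen only for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
print(cancer.target_names)  # label 0 is 'malignant', label 1 is 'benign'

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target,
    train_size=0.7, test_size=0.3, random_state=42,
)
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order 0, 1.
cm = confusion_matrix(y_test, y_predicted, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["true malignant", "true benign"],
    columns=["predicted malignant", "predicted benign"],
)
print(cm_df)

# Per-class precision and recall, reported with the class names.
print(classification_report(y_test, y_predicted, target_names=cancer.target_names))
```

For a screening task, the recall of the malignant class is usually the number to watch, since it measures how many cancers the model misses.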