## Decision trees

Decision trees are widely used for classification. They ask a series of if/else questions and, based on the answer to each question, proceed either to another question or to an output (the prediction).

In [1]:
import mglearn
from sklearn.datasets import make_moons

### ID3 algorithm

A decision tree can be built using several methods, one of which is the ID3 algorithm.
The Iterative Dichotomiser 3 (ID3) algorithm uses entropy and information gain to find the best feature on which to split at each node of a decision tree. For each feature, the information gain is calculated, and the feature with the maximum information gain is chosen as the root node. The same procedure is then repeated on each resulting subset of the data to choose the features for the child nodes.
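
As a rough illustration of the selection step described above, the following sketch computes entropy and information gain for two categorical features. The toy data (`outlook`, `windy`, `play`) and the helper names `entropy` and `information_gain` are hypothetical and are not part of the notebook; this is a minimal sketch of the idea, not a full ID3 implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    # Group the labels by the value the feature takes for each sample.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four samples, two candidate features, one target.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)

print("Information gain for outlook: {:.2f}".format(gain_outlook))
print("Information gain for windy: {:.2f}".format(gain_windy))
```

In this toy data, `outlook` separates the classes perfectly while `windy` tells us nothing about them, so ID3 would pick `outlook` for the root node.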

Typically, decision trees overfit the training data: the splits are derived from the training data only, so a fully grown tree can generalize poorly to new, unseen data. To prevent this overfitting, two strategies are used: *pre-pruning*, which stops the growth of the tree early, and *post-pruning*, which trims the tree after it has been fully built.
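
Post-pruning can be sketched with scikit-learn's cost-complexity pruning, which is one particular form of post-pruning and is controlled by the `ccp_alpha` parameter of `DecisionTreeClassifier`. The value `ccp_alpha=0.01` below is an arbitrary assumption chosen for illustration, not a tuned setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fully grown tree, kept only as a baseline for comparison.
full_tree = DecisionTreeClassifier(random_state=0)
full_tree.fit(x_train, y_train)

# Post-pruned tree: the tree is grown and then trimmed by cost-complexity pruning.
# ccp_alpha=0.01 is an illustrative value (assumption); larger values prune more.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned_tree.fit(x_train, y_train)

print("Leaves in full tree: {}".format(full_tree.get_n_leaves()))
print("Leaves in pruned tree: {}".format(pruned_tree.get_n_leaves()))
print("Test accuracy of pruned tree: {:.2f}".format(pruned_tree.score(x_test, y_test)))
```

In practice, `ccp_alpha` would usually be chosen by cross-validation rather than fixed by hand.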

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_train, y_train)

print("\nAccuracy on training set {:.2f}".format(tree.score(x_train, y_train)))
print("\nAccuracy on testing set {:.2f}".format(tree.score(x_test, y_test)))


Accuracy on training set 1.00

Accuracy on testing set 0.94


As we can see, the training accuracy is 100% because the tree is grown until all leaf nodes are pure, so it fits the training data exactly. However, the testing accuracy is lower (lower than what we have seen for other classifiers). Let us now apply pre-pruning by restricting the maximum depth of the tree to 4, so that at most four consecutive questions can be asked.

In [3]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(x_train, y_train)

print("\nAccuracy on training set {:.2f}".format(tree.score(x_train, y_train)))
print("\nAccuracy on testing set {:.2f}".format(tree.score(x_test, y_test)))


Accuracy on training set 0.99

Accuracy on testing set 0.95
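
To see how the depth limit trades training accuracy against testing accuracy, one option is to sweep over several values of `max_depth`. The list of depths below is an arbitrary assumption for illustration; `None` stands for the unrestricted tree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# None means no depth limit; the other values are illustrative choices.
depths = [None, 1, 2, 4, 8]
results = []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(x_train, y_train)
    # Record the requested limit, the depth actually reached, and both accuracies.
    results.append((depth, model.get_depth(),
                    model.score(x_train, y_train),
                    model.score(x_test, y_test)))

for depth, actual_depth, train_acc, test_acc in results:
    print("max_depth={} depth reached={} train={:.2f} test={:.2f}".format(
        depth, actual_depth, train_acc, test_acc))
```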


In [4]:
from sklearn.tree import export_graphviz

export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

In [7]:
import graphviz

with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

AttributeError: module 'graphviz.backend' has no attribute 'ENCODING'
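
If the `graphviz` package fails as in the error above, one possible alternative is scikit-learn's `plot_tree`, which draws the tree with matplotlib and does not need the `graphviz` package. The sketch below retrains the depth-4 tree so that it is self-contained; the figure size and the output file name `tree.png` are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(x_train, y_train)

# Draw the tree with matplotlib; the figure size is an arbitrary choice.
fig, ax = plt.subplots(figsize=(20, 10))
annotations = plot_tree(tree, feature_names=list(cancer.feature_names),
                        class_names=["malignant", "benign"],
                        impurity=False, filled=True, ax=ax)

# Save to a file (hypothetical name); in a notebook the figure also displays inline.
fig.savefig("tree.png")
```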