## Decision trees

Decision trees are widely used models for classification and regression tasks. They work by asking a series of questions about the input features; the answers lead step by step to a particular decision.
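To make the "series of questions" idea concrete, here is a minimal hand-written sketch. The function `classify_animal`, its parameters, and the four animals are hypothetical and chosen only for illustration; a learned tree has the same shape, but the questions are picked automatically from the training data.

```python
def classify_animal(has_feathers: bool, can_fly: bool, has_fins: bool) -> str:
    """A hand-written decision tree: each `if` asks one yes/no question."""
    if has_feathers:
        # Second question, asked only for animals with feathers.
        if can_fly:
            return "hawk"
        return "penguin"
    # Second question for animals without feathers.
    if has_fins:
        return "dolphin"
    return "bear"


print(classify_animal(has_feathers=True, can_fly=False, has_fins=False))
```

Each path from the first question to a `return` corresponds to one leaf of the tree.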

#### Applying a decision tree to the breast cancer dataset

In [1]:
import mglearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [2]:
cancer = load_breast_cancer()

# Stratified split keeps the class proportions equal in both sets
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_train, y_train)

print("Accuracy on training set: {:.3f}".format(tree.score(x_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(x_test, y_test)))

Accuracy on training set: 1.000
Accuracy on test set: 0.937


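The unrestricted tree fits the training set perfectly, which is a sign of overfitting. Pre-pruning by restricting the depth to 4 levels (at most four consecutive questions):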

In [5]:
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(x_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(x_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(x_test, y_test)))

Accuracy on training set: 0.988
Accuracy on test set: 0.951


Restricting the growth of the tree lowered the training accuracy (1.000 to 0.988) but raised the test accuracy (0.937 to 0.951): the shallower tree overfits the training data less and generalizes better to unseen data.
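The same comparison can be repeated for several depths to see where the trade-off sits. This is a minimal sketch, not part of the original notebook: the list of depths and the names `results` and `model` are assumptions made for illustration, and `max_depth=None` means the tree grows without a depth limit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fit one tree per depth limit and record (depth, train accuracy, test accuracy).
results = []
for depth in [1, 2, 3, 4, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(x_train, y_train)
    results.append(
        (depth, model.score(x_train, y_train), model.score(x_test, y_test)))

for depth, train_acc, test_acc in results:
    print("max_depth={!s:>4}  train: {:.3f}  test: {:.3f}".format(
        depth, train_acc, test_acc))
```

A depth chosen by comparing test scores this way is tuned to the test set; a separate validation set or cross-validation would give a less optimistic estimate.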

### Visualizing the decision tree using graphviz

In [6]:
from sklearn.tree import export_graphviz

# class_names follow the label order: 0 = malignant, 1 = benign
export_graphviz(tree, out_file="tree.dot", class_names=["Malignant", "Benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

In [8]:
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

AttributeError: module 'graphviz' has no attribute 'Source'
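The error above suggests that the name `graphviz` did not resolve to the Graphviz Python bindings in this environment. One possible cause (an assumption, not confirmed here) is a local file or folder named `graphviz` shadowing the installed package. Until that is sorted out, the tree can still be inspected without Graphviz. The sketch below uses scikit-learn's `export_text`, which is available from scikit-learn 0.21 onward, to print the same depth-4 tree as indented text; the variable name `tree_text` is chosen only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(x_train, y_train)

# Render the fitted tree as plain text; no Graphviz installation is needed.
tree_text = export_text(tree, feature_names=list(cancer.feature_names))
print(tree_text)
```

Each indented line is one test on a feature, and the `class:` lines are the leaves (0 = malignant, 1 = benign).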