# Main Content
- Train, visualize, and make predictions with Decision Trees.
- CART training algorithm.
- Regularize trees and use them for regression tasks.
- Limitations of Decision Trees.

# Training and Visualizing a Decision Tree
First, build a Decision Tree and take a look at how it makes predictions.

In [1]:
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:,2:] # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

To visualize the trained Decision Tree, use the `export_graphviz()` function to write a graph definition file called `iris_tree.dot`, then convert this file to a format such as PDF or PNG with the `dot` command-line tool from the `graphviz` package, for example:
`dot -Tpng iris_tree.dot -o iris_tree.png`. **You first need to install the graphviz package and add it to the system path.**

In [2]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file = 'output/chapter5/iris_tree.dot',
        feature_names=iris.feature_names[2:],
        rounded=True,
        filled=True
)

![iris_tree](output/chapter5/iris_tree.png)
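If Graphviz is not available, a plain-text dump of the same tree can serve as a quick check. The sketch below assumes scikit-learn 0.21 or later, where `sklearn.tree.export_text` is available. `random_state=42` is an arbitrary value added here for reproducibility and is not part of the original cell.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is an assumption added for reproducibility
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# One line per node: a split condition for internal nodes, "class: ..." for leaves
report = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(report)
```

The printed report lists the same splits as the rendered image, only without the per-node sample counts and impurity values.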

# Make Predictions
The prediction process itself needs little explanation. Here are the meanings of a node's attributes.
- samples: the number of training instances the node applies to.
- value: the number of training instances of each class that the node applies to.
- gini: the node's impurity. If all the instances in a node belong to the same class, its gini equals 0.

*Equation 6-1. Gini impurity*
$$G_i=1-\sum_{k=1}^np_{i,k}^2$$
- $p_{i,k}$ is the ratio of class k instances among the training instances in the $i^{th}$ node.
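As a quick check of Equation 6-1, the impurity can be computed by hand from a node's `value` attribute. This is a minimal sketch: the helper name `gini` and the class counts are illustrative assumptions, not values read from the figure.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical node holding instances of a single class: perfectly pure
pure_node = gini([50, 0, 0])

# Hypothetical node with 49 instances of one class and 5 of another
mixed_node = gini([0, 49, 5])

print(pure_node, round(mixed_node, 3))
```

The second value comes from $1-(0/54)^2-(49/54)^2-(5/54)^2\approx 0.168$.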

**One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don't require feature scaling or centering at all.**

![fig 6-2](images/6-2.png)

# Estimating Class Probabilities
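A Decision Tree can also estimate the probability that an instance belongs to a particular class k. First it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node.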

In [3]:
tree_clf.predict_proba([[5,1.5]])

array([[ 0.        ,  0.90740741,  0.09259259]])

In [4]:
tree_clf.predict([[5,1.5]])

array([1])
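The probabilities above are just class ratios inside one leaf. The sketch below assumes that leaf holds 54 training instances with class counts `[0, 49, 5]`; these counts are an inference from the printed output (49/54 ≈ 0.907), not something read from the fitted tree.

```python
# Assumed class counts in the leaf reached by the instance [5, 1.5]
leaf_counts = [0, 49, 5]

total = sum(leaf_counts)
probabilities = [count / total for count in leaf_counts]

# predict() returns the class with the highest estimated probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(probabilities, predicted_class)
```

Every instance that lands in the same leaf gets the same probabilities, however close it is to the decision boundary.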

# The CART Training Algorithm
The idea is simple: the algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$. It searches for the pair $(k,t_k)$ that produces the purest subsets (weighted by their size). The cost function it tries to minimize is given by Equation 6-2.

*Equation 6-2. CART cost function for classification*
$$J(k,t_k)=\frac{m_{left}}{m}G_{left}+\frac{m_{right}}{m}G_{right}$$
- $G_{left/right}$ measures the impurity of the left/right subset.
- $m_{left/right}$ is the number of instances in the left/right subset.

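Once it has split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum depth, or if it cannot find a split that will reduce impurity. CART is therefore a greedy algorithm: it searches for the best split at the current level without checking whether that split leads to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal one.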
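The search and the stopping rules can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: candidate thresholds are midpoints between consecutive distinct feature values, leaves store the majority class, and the helper names (`gini`, `best_split`, `build_tree`) and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (cost, k, t) minimizing Equation 6-2, or None if no split exists."""
    m = len(y)
    best = None
    for k in range(len(X[0])):
        values = sorted(set(row[k] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if best is None or cost < best[0]:
                best = (cost, k, t)
    return best

def build_tree(X, y, depth=0, max_depth=2):
    """Greedy recursion: pick the best split here, then recurse on each side."""
    split = best_split(X, y)
    # Stop at the maximum depth, or when no split reduces impurity
    if depth >= max_depth or split is None or split[0] >= gini(y):
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    _, k, t = split
    left = [(row, label) for row, label in zip(X, y) if row[k] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[k] > t]
    return {
        "feature": k,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

# Toy data (hypothetical): only the second feature separates the two classes
X_toy = [[1.0, 0.2], [5.0, 0.3], [2.0, 1.5], [6.0, 1.8]]
y_toy = [0, 0, 1, 1]
tree = build_tree(X_toy, y_toy)
print(tree)
```

Each call only looks at the split that is best right now, which is exactly what makes the procedure greedy.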

# Gini Impurity or Entropy?
**The concept of entropy originated in thermodynamics as a measure of molecular disorder: entropy approaches zero when molecules are still and well ordered. It later spread to a wide variety of domains, including Shannon's information theory, where it measures the average information content of a message: entropy is zero when all messages are identical.** Equation 6-3 shows the definition of the entropy of the $i^{th}$ node.

*Equation 6-3. Entropy*
$$H_i=-\sum_{\substack{k=1 \\ p_{i,k}\neq 0}}^{n} p_{i,k}\log(p_{i,k})$$

Most of the time the two measures lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
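Equation 6-3 can be evaluated the same way as the Gini impurity. The sketch below uses a base-2 logarithm, which is an assumption since the equation leaves the base unspecified, and skips classes with $p_{i,k}=0$ as the equation requires. The helper name `entropy` and the class counts are illustrative.

```python
import math

def entropy(class_counts):
    """Entropy of a node, given the number of instances per class (log base 2 assumed)."""
    total = sum(class_counts)
    ratios = [count / total for count in class_counts if count > 0]  # skip p = 0 terms
    return -sum(p * math.log2(p) for p in ratios)

print(entropy([50, 0, 0]))           # a single class: no disorder
print(entropy([25, 25]))             # two equally likely classes
print(round(entropy([0, 49, 5]), 3)) # mostly one class
```

To train a tree with this measure in scikit-learn, pass `criterion="entropy"` to `DecisionTreeClassifier`.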

# Regularization Hyperparameters
Decision Trees make very few assumptions about the training data. If unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a **nonparametric model**, not because it has no parameters but because the number of parameters is not determined prior to training. In contrast, a **parametric model** such as a linear model has a predetermined number of parameters, so its degrees of freedom are limited, which reduces the risk of overfitting.

**In scikit-learn, reducing the `max_*` hyperparameters or increasing the `min_*` hyperparameters regularizes the model and thus reduces the risk of overfitting. Another way to avoid overfitting is to prune unnecessary nodes (pruning).**

![fig 6-3](images/6-3.png)
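The effect of one `min_*` hyperparameter can be checked on the iris petal features used earlier. This is a minimal sketch: `min_samples_leaf=10` and `random_state=42` are arbitrary choices, and the helper `leaf_sizes` is hypothetical. It relies on the fitted `tree_` attribute, where leaves are marked by a child index of `-1`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# Unconstrained tree versus a tree whose leaves must hold at least 10 instances
tree_full = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_reg = DecisionTreeClassifier(min_samples_leaf=10, random_state=42).fit(X, y)

def leaf_sizes(clf):
    """Number of training instances in each leaf (leaves have children_left == -1)."""
    tree = clf.tree_
    return tree.n_node_samples[tree.children_left == -1]

acc_full = tree_full.score(X, y)
acc_reg = tree_reg.score(X, y)

print("nodes:", tree_full.tree_.node_count, "vs", tree_reg.tree_.node_count)
print("smallest leaf:", leaf_sizes(tree_full).min(), "vs", leaf_sizes(tree_reg).min())
print("training accuracy:", acc_full, "vs", acc_reg)
```

The unconstrained tree fits the training set at least as closely as the regularized one, which is the behavior that tends to overfit. Judging which tree generalizes better would require a held-out set, which this sketch leaves out.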

# Regression

In [5]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

![fig 6-4](images/6-4.png)

![fig 6-5](images/6-5.png)

The CART algorithm works mostly the same way as earlier. Instead of splitting the training set in a way that minimizes impurity, it now tries to split it in a way that minimizes the MSE.

![fig e6-4](images/e6-4.png)
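A one-feature version of this search fits in a few lines. The sketch assumes the regression cost mirrors Equation 6-2 with each subset's MSE in place of its Gini impurity, and that each side predicts the mean target of its instances. The function names and the toy data are hypothetical.

```python
def mse(values):
    """Mean squared error of predicting the mean of `values` for every instance."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_regression_split(x, y):
    """Return (cost, threshold) minimizing the size-weighted MSE of the two subsets."""
    m = len(y)
    best = None
    xs = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        cost = len(left) / m * mse(left) + len(right) / m * mse(right)
        if best is None or cost < best[0]:
            best = (cost, t)
    return best

# Toy data (hypothetical): targets jump from about 1 to about 3 between x = 0.2 and x = 0.8
x_toy = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
y_toy = [1.0, 1.2, 0.8, 3.1, 2.9, 3.0]
cost, threshold = best_regression_split(x_toy, y_toy)
print(cost, threshold)
```

The chosen threshold falls in the gap where the targets jump, since that split leaves the least variance on each side.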

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks.

![fig 6-6](images/6-6.png)

# Instability
1. Decision Trees love orthogonal decision boundaries (**all splits are perpendicular to an axis**), which makes them sensitive to training set rotation. One way to limit this problem is to use PCA, which often results in a better orientation of the training data.

![fig 6-7](images/6-7.png)
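The rotation sensitivity can be reproduced on synthetic data. In this sketch the dataset (an elongated rectangle split into two classes at `x0 = 0`), the 45-degree angle, and `random_state=42` are all assumptions made for illustration. The PCA pipeline is one possible way to re-orient the data, and how much it helps depends on the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Hypothetical data: points in an elongated rectangle, labeled by the sign of x0
X = np.column_stack([rng.uniform(-2, 2, 500), rng.uniform(-0.5, 0.5, 500)])
y = (X[:, 0] > 0).astype(int)

# Rotate every point by 45 degrees, so the class boundary becomes diagonal
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle), np.cos(angle)]])
X_rot = X @ R.T

tree_plain = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y)

# PCA re-aligns the axes with the directions of largest variance before training
pca_tree = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(random_state=42))
pca_tree.fit(X_rot, y)

n_plain = tree_plain.tree_.node_count
n_rot = tree_rot.tree_.node_count
n_pca = pca_tree.named_steps["decisiontreeclassifier"].tree_.node_count
print("nodes:", n_plain, "(axis-aligned)", n_rot, "(rotated)", n_pca, "(rotated + PCA)")
```

On the axis-aligned data a single split separates the classes, while the rotated copy forces the tree to approximate a diagonal boundary with a staircase of axis-parallel splits.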

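2. The main issue with Decision Trees is that they are sensitive to small variations in the training data. **Pay attention to the instance with petals 4.8 cm long and 1.8 cm wide**: removing it from the training set and retraining can produce a very different tree, as a comparison of the two figures below suggests.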

![fig 6-2](images/6-2.png)

![fig 6-8](images/6-8.png)
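A rough way to try this yourself is to drop that single instance and retrain. The sketch assumes the instance meant is the Iris-Versicolor one (class 1), since other classes also contain flowers with these petal measurements. `random_state=42` is an arbitrary choice, and the resulting splits may differ from the figures depending on the seed and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# Assumed target: the versicolor instance with petals 4.8 cm long and 1.8 cm wide
is_target = (y == 1) & np.isclose(X[:, 0], 4.8) & np.isclose(X[:, 1], 1.8)
X_tweaked, y_tweaked = X[~is_target], y[~is_target]

tree_full = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
tree_tweaked = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_tweaked, y_tweaked)

# Per-node split feature and threshold; leaves are reported as -2
for name, clf in [("full", tree_full), ("tweaked", tree_tweaked)]:
    print(name, clf.tree_.feature, clf.tree_.threshold)
```

Comparing the two printed lines shows whether removing one flower changed the feature or threshold chosen at any node.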

**Random Forests can limit this instability by averaging predictions over many trees.**