# training and visualizing a decision tree

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

In [21]:
iris = load_iris()
iris['DESCR']



In [9]:
X = iris.data[:, 2:] # petal length and width
y = iris.target

In [10]:
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

**visualize the trained decision tree**

In [17]:
from sklearn.tree import export_graphviz
export_graphviz(
    tree_clf,
    out_file='iris_tree.dot', 
    feature_names=iris.feature_names[2:],
    rounded=True,
    filled=True
)
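
The `.dot` file needs Graphviz to be rendered. For a quick look without it, the sketch below prints the same kind of tree as plain text with `export_text`, assuming a scikit-learn version that provides this function (0.21 or later). The `random_state=42` argument is an addition here, only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is an assumption added for repeatability
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# one line per node: split rules for internal nodes, predicted class for leaves
tree_text = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(tree_text)
```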

# making predictions

Decision trees need very little data preparation. In particular, they don't require feature scaling or centering at all.

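**gini impurity**
$$G_{i} = 1-\sum_{k=1}^{n} p_{i, k}^2$$

where $p_{i, k}$ is the ratio of class $k$ instances among the training instances in the $i$-th node.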
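
As a small worked example of the Gini impurity formula, the sketch below computes it from per-class instance counts. The function name `gini_impurity` and the counts `[0, 49, 5]` are illustrative choices, not part of scikit-learn.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of training instances per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()  # class ratios p_{i,k}
    return 1.0 - np.sum(p ** 2)

# hypothetical node holding 0, 49 and 5 instances of the three classes
node_gini = gini_impurity([0, 49, 5])
# a pure node (a single class) has zero impurity
pure_gini = gini_impurity([50, 0, 0])

print(round(node_gini, 3))  # 0.168
print(pure_gini)            # 0.0
```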

scikit-learn uses the CART algorithm, which produces only binary trees: every non-leaf node has exactly two children.

decision trees are fairly intuitive and their decisions are easy to interpret

# estimate class probabilities

1. traverses the tree to find the leaf node for this instance
2. returns the ratio of training instances of class k in this node

In [19]:
tree_clf.predict_proba([[5, 1.5]])

array([[0.        , 0.90740741, 0.09259259]])

In [23]:
tree_clf.predict([[5, 1.5]])

array([1])
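
The two steps listed above can be reproduced by hand. This is a minimal sketch that relies on the fitted tree's `apply` method and its `tree_.value` array; the variable names and `random_state=42` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = [[5, 1.5]]

# step 1: index of the leaf node this instance ends up in
leaf_id = tree_clf.apply(sample)[0]

# step 2: class ratios among the training instances in that leaf
# (tree_.value holds per-class counts or fractions depending on the version,
# so normalizing by the row sum works in both cases)
leaf_value = tree_clf.tree_.value[leaf_id][0]
manual_proba = leaf_value / leaf_value.sum()

print(manual_proba)
print(tree_clf.predict_proba(sample)[0])
```

The predicted class is simply the one with the highest ratio in the leaf.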

# CART training algorithm

CART (classification and regression tree) algorithm

**How to choose feature k and threshold $t_{k}$** <br>
CART searches for the pair $(k, t_{k})$ that produces the smallest value of the cost function

$$J(k, t_{k}) = \dfrac{m_{left}}{m} G_{left} + \dfrac{m_{right}}{m} G_{right}$$

- $m_{left/right}$ is the number of instances in the left/right subset.
- $G_{left/right}$ measures the impurity of the left/right subset.
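
To make the cost function concrete, here is a brute-force sketch of the search for a single split. It is not scikit-learn's implementation: the helper names (`gini`, `split_cost`, `best_split`) are hypothetical, and trying the midpoints between consecutive feature values as thresholds is an assumption of this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(feature_values, labels, threshold):
    """J(k, t_k): impurity of both subsets, weighted by their sizes."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    m = len(labels)
    return len(left) / m * gini(left) + len(right) / m * gini(right)

def best_split(X, y):
    """Try every feature k and every midpoint threshold, keep the cheapest."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        values = np.unique(X[:, k])  # sorted unique feature values
        thresholds = (values[:-1] + values[1:]) / 2
        for t in thresholds:
            cost = split_cost(X[:, k], y, t)
            if cost < best[2]:
                best = (k, t, cost)
    return best

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

k, t, cost = best_split(X, y)
print(k, t, cost)
```

On the iris petal features the cheapest split isolates one class completely, so the left subset has zero impurity and the cost is about 0.333.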

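**greedy algorithm**: CART searches for the best split at the current level and then repeats the process on each subset. It does not check whether that split leads to the lowest possible impurity several levels down.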

**Entropy**
$$H_{i} = -\sum_{\substack{k=1 \\ p_{i, k}\neq 0}}^{n} p_{i, k}\log_2(p_{i, k})$$

entropy is zero when all messages are identical. For a tree node, this means entropy is zero when the node contains instances of only one class.

most of the time, the two measures don't make a big difference. Gini impurity is slightly faster to compute.
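
The sketch below evaluates both impurity measures on the same node so they can be compared side by side. It assumes a base-2 logarithm, so entropy is measured in bits; the function names and the class counts `[0, 49, 5]` are illustrative.

```python
import numpy as np

def entropy(class_counts):
    """Entropy (in bits) of a node, given instance counts per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # skip empty classes, matching the p != 0 condition in the sum
    return -np.sum(p * np.log2(p))

def gini_impurity(class_counts):
    """Gini impurity of a node, given instance counts per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

counts = [0, 49, 5]  # hypothetical node
node_entropy = entropy(counts)
node_gini = gini_impurity(counts)

print(round(node_entropy, 3))  # 0.445
print(round(node_gini, 3))     # 0.168
```

In scikit-learn the measure is chosen with the `criterion` hyperparameter of `DecisionTreeClassifier` (`'gini'` or `'entropy'`).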

# regularization hyperparameters

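- parametric model: has a predetermined number of parameters, so its degrees of freedom are limited (less likely to overfit, more likely to underfit)
- nonparametric model: the number of parameters is not fixed before training, so the model can adapt closely to the data (likely to overfit); decision trees belong here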

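**regularization**: restrict the model's freedom during training through hyperparameters such as `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_leaf_nodes` and `max_features`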
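
As a rough illustration of such a restriction, the sketch below trains one unrestricted tree and one with `min_samples_leaf=5`, then inspects the leaf sizes through the fitted `tree_` attribute. The helper `leaf_sizes`, the value 5, and `random_state=42` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# no restriction: the tree keeps splitting as long as it can reduce impurity
free_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
# regularized: every leaf must hold at least 5 training instances
reg_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X, y)

def leaf_sizes(clf):
    """Number of training instances in each leaf (hypothetical helper)."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    return tree.n_node_samples[is_leaf]

for name, clf in [("unrestricted", free_tree), ("min_samples_leaf=5", reg_tree)]:
    sizes = leaf_sizes(clf)
    print(name, "leaves:", len(sizes), "smallest leaf:", sizes.min())
```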

post pruning:
    Standard statistical tests, such as the $\chi^2$ test, are used to estimate the probability that the improvement is purely the result of chance (which is called the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.

# regression

In [25]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor(max_depth=2)
# note: y still holds the iris class labels here, so this only demonstrates the API
tree_reg.fit(X, y)

in a regression problem, CART splits the training set to minimize the MSE instead of the impurity
$$J(k, t_{k})=\dfrac{m_{left}}{m}MSE_{left} + \dfrac{m_{right}}{m}MSE_{right}$$
where 

$$MSE_{node}=\sum_{i\in node}(\hat{y}_{node}-y^{(i)})^2$$
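
In the formula above, $\hat{y}_{node}$ is the value a node predicts, which is the mean target of the training instances in that node. The sketch below checks this on synthetic data; the noisy quadratic dataset, the variable names, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical dataset: a noisy quadratic curve
rng = np.random.RandomState(42)
X_quad = rng.rand(200, 1)
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + 0.1 * rng.randn(200)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

leaf_ids = tree_reg.apply(X_quad)   # leaf reached by each training instance
pred = tree_reg.predict(X_quad)

leaf_means_match = True
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean = y_quad[in_leaf].mean()                 # y_hat for this leaf
    squared_errors = (leaf_mean - y_quad[in_leaf]) ** 2
    print(leaf, round(leaf_mean, 3), round(squared_errors.sum(), 3))
    # every instance in the leaf gets the leaf mean as its prediction
    leaf_means_match &= bool(np.allclose(pred[in_leaf], leaf_mean))

print(leaf_means_match)
```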


# instability

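- decision trees prefer orthogonal decision boundaries (every split is perpendicular to an axis), which makes them sensitive to rotation of the training set
- sensitive to small variations in the training data
    - you may even get very different models from the same training data, because the training algorithm scikit-learn uses is stochastic unless `random_state` is set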
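
The preference for orthogonal boundaries can be seen on a toy dataset. In this sketch (synthetic data, arbitrary seeds, hypothetical variable names) the classes are separated by a vertical line, which one split handles. After rotating the same points by 45 degrees the boundary is diagonal, and the tree needs several axis-aligned splits to approximate it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# hypothetical dataset: points in a square, class decided by the sign of x0
rng = np.random.RandomState(6)
X_sq = rng.rand(100, 2) - 0.5
y_sq = (X_sq[:, 0] > 0).astype(int)

# rotate every point by 45 degrees; the labels stay the same
angle = np.pi / 4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rot = X_sq.dot(rotation)

tree_plain = DecisionTreeClassifier(random_state=42).fit(X_sq, y_sq)
tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y_sq)

# depth of each fitted tree: original data first, rotated data second
print(tree_plain.get_depth(), tree_rot.get_depth())
```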