
## Hands-On Machine Learning with Scikit-Learn & TensorFlow

## CHAPTER 6: Decision Trees

## Decision Trees:
-  performs ...
    -  classification => predicts a class and aims to minimize impurity
    -  regression => predicts a value (average) and aims to minimize MSE
-  ***root node*** => highest node in tree
-  ***child node*** => node directly below another node (its parent); in a CART tree every nonleaf node has exactly 2 children
-  ***leaf node*** => node that does not have any children; it makes the final prediction (a class for classification, a value for regression)
-  ***samples*** => counts how many training instances the node applies to
-  ***gini*** => measures _impurity_ (***node is "pure" [gini=0] if all training instances it applies to belong to the same class***)
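-  ***class probabilities*** => probability that an instance belongs to a particular class (k), estimated as the ratio of class k training instances in the leaf node the instance falls into
-  ***pruning*** => deleting unnecessary nodes after training: a node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant; a statistical test (ex: chi-squared) estimates the probability [p-value] that the improvement is due to chance, and if the p-value is > 5% threshold then the node is considered unnecessary and its children are deleted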
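
The gini and class probability definitions above can be checked by hand. The sketch below is plain Python (the helper names `gini` and `class_probabilities` are made up for illustration) and uses the per-class counts that the exported tree in the exercises reports in its `value` field:

```python
def gini(counts):
    """Gini impurity of a node, given its per-class training-instance counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # 1 minus the sum of squared class ratios
    return 1.0 - sum((c / total) ** 2 for c in counts)


def class_probabilities(counts):
    """Estimated class probabilities: the ratio of each class in the node."""
    total = sum(counts)
    return [c / total for c in counts]


root = [50, 50, 50]      # perfectly mixed node
pure_leaf = [50, 0, 0]   # all instances belong to the same class
mixed_leaf = [0, 49, 5]  # mostly one class

print(round(gini(root), 3))        # 0.667
print(round(gini(pure_leaf), 3))   # 0.0
print(round(gini(mixed_leaf), 3))  # 0.168
print(class_probabilities(mixed_leaf))
```

The last line gives roughly 0.907 for the second class, which is the same ratio (49/54) that `predict_proba` returns for an instance landing in that leaf.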

## Decision Tree Boundaries:
-  thick line represents the decision boundary of the root node
-  dotted line(s) represent decision boundaries of the child node(s) 

## Scikit-Learn CART [Classification and Regression Trees]:
-  produces binary trees => nonleaf nodes always have 2 children (yes/no answers)
-  algorithm splits training set into 2 subsets using a ***single feature*** 'k' (ex: petal length) and a ***threshold*** 'tk' (ex: petal length <= 2.45cm)
-  searches for the pair (k, tk) that produces the purest subsets (weighted by their size), then splits the subsets recursively and stops once it reaches the maximum depth (max_depth hyperparameter) or if no split can be found that reduces impurity
-  ***gini impurity*** and ***entropy*** => https://en.wikipedia.org/wiki/Decision_tree_learning:
    -  gini impurity and entropy usually produce similar trees; gini impurity is slightly faster to compute, but it tends to isolate the most frequent class in its own branch of the tree, whereas entropy tends to produce slightly more balanced trees
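
The search for the pair (k, tk) can be illustrated with a brute-force sketch for a single node. This is not Scikit-Learn's implementation: the function names, the choice of midpoints between consecutive feature values as candidate thresholds, and the toy data are all assumptions made for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    m = len(labels)
    if m == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((n / m) ** 2 for n in counts.values())


def best_split(X, y):
    """Return (cost, k, t_k) minimizing the size-weighted impurity of the 2 subsets."""
    m = len(y)
    best = None
    for k in range(len(X[0])):  # try every feature
        values = sorted({row[k] for row in X})
        # candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t_k = (lo + hi) / 2
            left = [y[i] for i in range(m) if X[i][k] <= t_k]
            right = [y[i] for i in range(m) if X[i][k] > t_k]
            # impurity of each subset, weighted by its share of the instances
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if best is None or cost < best[0]:
                best = (cost, k, t_k)
    return best


# Toy data (made up): feature 0 separates the classes cleanly, feature 1 does not
X = [[1.4, 0.2], [1.3, 0.3], [1.5, 0.2], [4.7, 0.2], [4.5, 0.3], [5.1, 0.2]]
y = [0, 0, 0, 1, 1, 1]

cost, k, t_k = best_split(X, y)
print(cost, k, t_k)  # 0.0 0 3.0
```

A full tree would apply `best_split` again to each subset until max_depth is reached or no split reduces impurity.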

## Regularization Hyperparameters:
-  nonparametric model:
    -  number of parameters is not determined prior to training thus:
        -  DTs make very few assumptions about the training data thus:
            -  adapts tightly to training data fitting it very closely and likely leading to overfitting
-  parametric model:
    -  has predetermined number of parameters thus:
        -  degrees of freedom are limited thus:
            -  reducing the risk of overfitting however ... increasing risk of underfitting
-  ***avoid overfitting the training data by restricting the DT's freedom aka "regularization" during training***:
    -  controlled by the **max_depth** hyperparameter
    -  reduce **max_depth** to regularize the model thus reduce risk of overfitting

-  additional hyperparameters:
    -  min_samples_split => (minimum number of samples a node must have before it can be split)
    -  min_samples_leaf => (minimum number of samples a leaf node must have)
    -  min_weight_fraction_leaf => (same as min_samples_leaf but expressed as fraction of total number of weighted instances)
    -  max_leaf_nodes => (maximum number of leaf nodes)
    -  max_features => (maximum number of features that are evaluated for splitting at each node)
    -  ***increasing min hyperparameters or reducing max hyperparameters will regularize the model***
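
The effect of a min hyperparameter can be checked on a small noisy dataset. The sketch below assumes a `make_moons` dataset with noise=0.4 and min_samples_leaf=10; both values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy 2-class dataset (assumed here only for illustration)
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same algorithm, with and without a restriction on leaf size
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=10, random_state=42).fit(X_train, y_train)

for name, clf in [("unrestricted", unrestricted), ("min_samples_leaf=10", regularized)]:
    print(name,
          "nodes:", clf.tree_.node_count,
          "train acc:", clf.score(X_train, y_train),
          "test acc:", clf.score(X_test, y_test))
```

The unrestricted tree grows until it fits the training set perfectly, while the regularized tree is much smaller. On noisy data like this the smaller tree typically generalizes better, but that should be confirmed on the test set rather than assumed.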

## _Exercises_

In [1]:
import sklearn
print(sklearn.__version__)

0.20.0


### train, visualize, and predict classification decision tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

In [3]:
iris["data"][:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [4]:
iris["target"]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

'digraph Tree {\nnode [shape=box, style="filled, rounded", color="black", fontname=helvetica] ;\nedge [fontname=helvetica] ;\n0 [label="petal length (cm) <= 2.45\\ngini = 0.667\\nsamples = 150\\nvalue = [50, 50, 50]\\nclass = setosa", fillcolor="#e5813900"] ;\n1 [label="gini = 0.0\\nsamples = 50\\nvalue = [50, 0, 0]\\nclass = setosa", fillcolor="#e58139ff"] ;\n0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;\n2 [label="petal width (cm) <= 1.75\\ngini = 0.5\\nsamples = 100\\nvalue = [0, 50, 50]\\nclass = versicolor", fillcolor="#39e58100"] ;\n0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;\n3 [label="gini = 0.168\\nsamples = 54\\nvalue = [0, 49, 5]\\nclass = versicolor", fillcolor="#39e581e5"] ;\n2 -> 3 ;\n4 [label="gini = 0.043\\nsamples = 46\\nvalue = [0, 1, 45]\\nclass = virginica", fillcolor="#8139e5f9"] ;\n2 -> 4 ;\n}'

In [6]:
tree_clf.predict_proba([[5, 1.5]]) # highest probability is 0.90740741 [class label 1: iris-versicolor]

array([[0.        , 0.90740741, 0.09259259]])

In [7]:
tree_clf.predict([[5, 1.5]])

array([1])

### train regression decision tree

In [8]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
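
To see that a regression tree predicts the average target value of the training instances in each leaf, the predictions can be compared with per-leaf means. The sketch below refits the same regressor on the same petal features and class labels as above, and uses `apply()` to find the leaf each instance falls into:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor

iris = load_iris()
X = iris.data[:, 2:]            # petal length and width
y = iris.target.astype(float)   # class labels treated as numeric targets, as in the cell above

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

leaf_ids = tree_reg.apply(X)        # index of the leaf each training instance falls into
predictions = tree_reg.predict(X)

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_mean = y[in_leaf].mean()                       # value predicted by the leaf
    leaf_mse = ((y[in_leaf] - leaf_mean) ** 2).mean()   # MSE within the leaf
    print(f"leaf {leaf}: samples={in_leaf.sum()}, mean={leaf_mean:.3f}, mse={leaf_mse:.3f}")
```

Every instance in a leaf receives the same prediction, equal to that leaf's mean target.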

### additional exercises:

https://github.com/ageron/handson-ml/blob/master/06_decision_trees.ipynb

1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?
2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
6. If your training set contains 100,000 instances, will setting presort=True speed up training?
7. Train and fine-tune a Decision Tree for the moons dataset.
    -  a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).
    -  b. Split it into a training set and a test set using train_test_split().
    -  c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.
    -  d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.
8. Grow a forest.
    -  a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.
    -  b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.
    -  c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This gives you majority-vote predictions over the test set.
    -  d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!
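
A possible starting point for exercise 7 is sketched below. The grid values, the 80/20 split, cv=3 and random_state=42 are assumptions chosen to keep the search small; they are not the reference solution from the linked notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# a. generate the moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# b. split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# c. grid search with cross-validation over a small, hand-picked grid
param_grid = {
    "max_leaf_nodes": [2, 4, 8, 16, 32, 64],
    "min_samples_split": [2, 3, 4],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

# d. GridSearchCV refits the best model on the full training set by default
test_accuracy = grid_search.best_estimator_.score(X_test, y_test)
print(test_accuracy)
```

A wider or finer grid for max_leaf_nodes may find better values; the exact accuracy depends on the random seed and the split.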
