# DecisionTreeClassifier

-> Decision trees can perform both classification and regression tasks, and even multioutput tasks.

-> It makes very few assumptions about the training data. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely and most likely overfitting. It is a nonparametric model: the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. To avoid overfitting the training data, we need to restrict the decision tree's freedom during training (regularization).

-> A parametric model, such as a linear model, has a predetermined number of parameters, so its degrees of freedom are limited. This reduces the risk of overfitting but increases the risk of underfitting.

-> Root node, Split node, Leaf node

-> Node attributes:

samples - Counts how many training instances the node applies to. e.g. 50

value - Tells us how many training instances of each class the node applies to. e.g. [0,50,50]

gini - Measures the node's Gini impurity. A node is pure, i.e. gini=0, if all training instances it applies to belong to the same class. e.g. value=[50,0,0]

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$

$p_{i,k}$ = ratio of class k instances among the training instances in the ith node.
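As a quick check of the Gini formula, here is a minimal sketch that computes the impurity from a node's `value` list. The helper name `gini` is made up for illustration and is not part of Scikit-Learn.

```python
def gini(value):
    """Gini impurity of a node, given its per-class training instance counts."""
    total = sum(value)
    if total == 0:
        raise ValueError("node has no training instances")
    # 1 minus the sum of squared class ratios
    return 1.0 - sum((count / total) ** 2 for count in value)


print(gini([50, 0, 0]))   # pure node -> 0.0
print(gini([0, 50, 50]))  # two classes, evenly mixed -> 0.5
```

The two calls reuse the example `value` lists from the notes above.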

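-> Scikit-Learn uses the CART (Classification and Regression Tree) algorithm, which produces only binary trees: every non-leaf node has exactly two children.

-> A pure node is not split further. Otherwise, splitting continues until max_depth is reached or another stopping condition is met.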

-> Estimating class probability - A decision tree can estimate the probability that an instance belongs to a particular class k. First it traverses the tree to find the leaf node for this instance and then it returns the ratio of training instances of class k in this node.
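The probability estimate described above can be checked against the leaf's contents. This is a small sketch, assuming the iris petal features and a made-up sample with 5 cm long, 1.5 cm wide petals. It normalizes the leaf's `tree_.value` row by hand, because that row holds counts or fractions depending on the Scikit-Learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width (cm)
y = iris.target

clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

sample = [[5.0, 1.5]]  # hypothetical flower, chosen for illustration
proba = clf.predict_proba(sample)     # one row per instance, one column per class
leaf_id = clf.apply(sample)[0]        # index of the leaf the sample lands in
leaf_value = clf.tree_.value[leaf_id][0]  # per-class counts (or fractions) in that leaf

# The estimated probabilities are the class ratios of the leaf.
print(proba[0])
print(leaf_value / leaf_value.sum())
print(clf.predict(sample))            # class with the highest estimated probability
```

Every instance that ends up in the same leaf gets the same probability estimates, however close it is to the decision boundary.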

## CART Training Algorithm

It first splits the training set into two subsets using a single feature k and a threshold tk (e.g. petal length <= 2.45 cm). To select k and tk, it searches for the pair (k, tk) that produces the purest subsets, weighted by their size. It then splits the subsets recursively in the same way. It stops once it reaches the maximum depth or cannot find a split that reduces impurity. This is a greedy algorithm: it looks for the best split at the current level and does not check whether that split leads to the lowest possible impurity several levels down. The cost function it minimizes is:

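$$J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}$$

Here $G_{\text{left/right}}$ is the impurity of the left/right subset, $m_{\text{left/right}}$ is the number of instances in the left/right subset, and $m$ is the number of instances in the node being split.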
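The threshold search for one feature can be sketched in plain Python. This is an illustration under simplifying assumptions, not Scikit-Learn's implementation: it handles a single feature, tries midpoints between consecutive distinct values as candidate thresholds, and uses made-up toy data. The names `split_cost` and `best_threshold` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / m) ** 2 for c in set(labels))


def split_cost(xs, ys, threshold):
    """Size-weighted impurity of the two subsets produced by x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    m = len(ys)
    return len(left) / m * gini(left) + len(right) / m * gini(right)


def best_threshold(xs, ys):
    """Return the candidate threshold with the lowest cost, or None if no split exists."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_cost(xs, ys, t))


# Toy data, made up for illustration: one feature, three classes.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = [0, 0, 0, 1, 1, 2]

t = best_threshold(petal_length, species)
print(t, split_cost(petal_length, species, t))
```

A full CART step would repeat this search over every feature and keep the best (k, tk) pair, then recurse on the two subsets.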

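The training algorithm compares all features on all samples at each node, which gives a training complexity of $O(n \times m \log_2 m)$, where n is the number of features and m is the number of training instances. Making a prediction requires traversing the decision tree from root to leaf, which takes roughly $O(\log_2 m)$ steps for an approximately balanced tree, regardless of the number of features. Predictions are therefore fast even with large training sets.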

Entropy impurity is an alternative to Gini impurity. A set's entropy is 0 if it contains instances of only one class. Entropy tends to produce slightly more balanced trees, while Gini impurity tends to isolate the most frequent class in its own branch of the tree.

Regularization hyperparameters: max_depth, max_features, max_leaf_nodes, min_samples_split, min_samples_leaf, min_weight_fraction_leaf. Increasing the min_* hyperparameters or decreasing the max_* hyperparameters will regularize the model.
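For comparison with Gini, here is a minimal sketch of entropy computed from a node's per-class counts, using the standard definition (the negative sum of p·log2(p) over classes with non-zero ratio). The helper name `entropy` is made up for illustration. In Scikit-Learn the measure is selected with `criterion="entropy"`.

```python
from math import log2


def entropy(value):
    """Entropy (in bits) of a node, given its per-class training instance counts."""
    total = sum(value)
    # Classes with zero instances contribute nothing and are skipped.
    return sum(-(c / total) * log2(c / total) for c in value if c > 0)


print(entropy([50, 0, 0]))   # single class -> 0.0
print(entropy([0, 50, 50]))  # two classes, evenly mixed -> 1.0
```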

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X_iris = iris.data[["petal length (cm)", "petal width (cm)"]].values
y_iris = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)

In [3]:
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=150, noise=0.2, random_state=42)

tree_clf1 = DecisionTreeClassifier(random_state=42)
tree_clf2 = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
tree_clf1.fit(X_moons, y_moons)
tree_clf2.fit(X_moons, y_moons)

In [4]:

X_moons_test, y_moons_test = make_moons(n_samples=1000, noise=0.2,
                                        random_state=43)
tree_clf1.score(X_moons_test, y_moons_test)

0.898

In [5]:
tree_clf2.score(X_moons_test, y_moons_test)

0.92

# DecisionTreeRegressor

Instead of predicting a class in each node, it predicts a value. The predicted value is the average target value of the training instances in that leaf node.

In classification, the CART algorithm tries to split the training set in a way that minimizes impurity. In regression, it tries to split the training set in a way that minimizes the MSE.

Cost function:

$$J(k, t_k) = \frac{m_{\text{left}}}{m} \text{MSE}_{\text{left}} + \frac{m_{\text{right}}}{m} \text{MSE}_{\text{right}}$$
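The regression cost can be worked through by hand on a tiny example. This sketch uses made-up toy data and hypothetical helper names (`mse`, `regression_split_cost`). Each subset's prediction is its mean target, and the cost is the size-weighted MSE around those means.

```python
def mse(targets):
    """Mean squared error of the targets around their own mean (the node's prediction)."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


def regression_split_cost(xs, ys, threshold):
    """Size-weighted MSE of the two subsets produced by x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    m = len(ys)
    return len(left) / m * mse(left) + len(right) / m * mse(right)


# Toy data, made up for illustration: targets jump between x = 0.3 and x = 0.7.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]

cost_good = regression_split_cost(xs, ys, 0.5)   # separates the two groups
cost_poor = regression_split_cost(xs, ys, 0.25)  # cuts through the first group
print(cost_good, cost_poor)
```

The threshold at 0.5 gives a much lower cost because each subset's targets sit close to their mean, so CART would prefer it.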

## Limitations

-> All splits are perpendicular to an axis (i.e. orthogonal decision boundaries), which makes decision trees sensitive to the data's orientation. One way to limit this is to scale the data and then apply a PCA transformation. PCA rotates the data in a way that reduces the correlation between the features, which often (but not always) makes things easier for trees.

-> Decision trees have high variance: small changes to the hyperparameters or to the data may produce very different models. By averaging predictions over many trees, it is possible to reduce variance. Such an ensemble of trees is called a random forest.
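As a rough illustration of the variance point, the sketch below fits one unconstrained tree and one random forest on the same moons data. The dataset sizes and seeds are assumptions chosen for illustration. Note that `RandomForestClassifier` also resamples the training set and considers random feature subsets at each split, so it does more than plain averaging.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_moons(n_samples=150, noise=0.2, random_state=42)
X_test, y_test = make_moons(n_samples=1000, noise=0.2, random_state=43)

# A single unconstrained tree versus an ensemble of 100 trees.
single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

tree_score = single_tree.score(X_test, y_test)
forest_score = forest.score(X_test, y_test)
print("single tree:", tree_score)
print("forest:     ", forest_score)
```

The exact scores depend on the Scikit-Learn version and the seeds, so they are printed rather than stated here.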

In [7]:
from sklearn.tree import DecisionTreeRegressor
import numpy as np
np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5  # a single random input feature
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)

In [8]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_pipeline = make_pipeline(StandardScaler(), PCA())
X_iris_rotated = pca_pipeline.fit_transform(X_iris)
tree_clf_pca = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf_pca.fit(X_iris_rotated, y_iris)