# **Decision Trees**


Decision trees are a common type of machine learning model used for binary classification tasks. The natural structure of a binary tree lends itself well to predicting a “yes” or “no” target: the tree is traversed from the root by evaluating the truth of each logical statement in turn until a final prediction is reached. Some examples of classification tasks that can use decision trees are predicting whether a student will pass or fail an exam, whether an email is spam or not, and whether a transaction is fraudulent or legitimate.

Decision trees can also be used for regression tasks. As with other scikit-learn models, only numeric data can be used (categorical variables and nulls must be handled prior to model fitting). In this case, our categorical features have already been transformed and no missing values are present in the data set. The syntax is identical to that of the decision tree classifier, except that the target, y, must be real-valued and the model used must be `DecisionTreeRegressor()`. Almost all of the model hyperparameters are the same, except for the split criterion, which now needs to be suitable for a regression task. The default for regression is mean squared error (MSE), named `criterion="squared_error"` in recent scikit-learn releases.

## **Model Fitting**

### **sklearn.tree.DecisionTreeClassifier**

```python
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(max_depth=8, ccp_alpha=0.01, criterion="gini")

dtree.fit(X_train, y_train)
y_predicted = dtree.predict(X_test)

dtree.feature_importances_
```

## **Visualizing Decision Trees**

Two methods are available to visualize the tree within the `tree` module. The first uses `plot_tree` to graphically represent the decision tree.

```python
import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(20, 12))
tree.plot_tree(
    dtree,
    feature_names=list(X_train.columns),  # a plain list is accepted across scikit-learn versions
    max_depth=5,
    class_names=["Drowned", "Survived"],
    label="all",
    filled=True,
)
plt.tight_layout()
plt.show()
```

The second uses `export_text` to list the rules behind the splits in the decision tree.

```python
# Pass a plain list: some scikit-learn versions reject a pandas Index here.
print(tree.export_text(dtree, feature_names=list(X_train.columns)))
```

### **sklearn.tree.DecisionTreeRegressor**

```python
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=3, ccp_alpha=0.001)
dt.fit(X_iris_train, y_iris_train)
```

Similarly, the tree can be visualized using `tree.plot_tree`, keeping in mind that the splitting criterion is MSE and that the value shown in each leaf is the average target value of all samples in that leaf.

```python
plt.figure(figsize=(20, 12))
tree.plot_tree(dt, feature_names=list(X_iris_train.columns), max_depth=4, filled=True)
plt.tight_layout()
plt.show()
```
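To make the `value` and MSE figures in each node concrete, here is a minimal sketch in plain Python. The helper names `leaf_value` and `node_mse` and the target values are made up for illustration; they are not part of scikit-learn.

```python
def leaf_value(targets):
    """Prediction of a regression leaf: the mean target of its samples."""
    return sum(targets) / len(targets)


def node_mse(targets):
    """Mean squared error of the samples in a node around the node's mean."""
    mean = leaf_value(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


# Hypothetical targets of the samples that ended up in one leaf.
targets = [1.0, 2.0, 3.0, 6.0]
value = leaf_value(targets)
mse = node_mse(targets)
print(value, mse)
```

A split is attractive for regression when the child nodes have a lower MSE than the parent, in the same way that a classification split is attractive when it lowers impurity.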

Decision trees are easy to understand, fully explainable, and have a natural way to visualize the decision-making process. In addition, the data often needs little modification prior to modeling (such as scaling, normalization, or removing outliers), and decision trees are relatively quick to train and predict. However, let’s now talk about some of their limitations.

One problem with the way we’re currently making our decision trees is that our trees aren’t always globally optimal. This means that there might be a better tree out there somewhere that produces better results. But wait, why did we go through all that work of finding information gain if it’s not producing the best possible tree?

Our current strategy of creating trees is greedy. We assume that the best way to create a tree is to find the feature that will result in the largest information gain right now and split on that feature. We never consider the ramifications of that split further down the tree. It’s possible that if we split on a suboptimal feature right now, we would find even better splits later on. Unfortunately, finding a globally optimal tree is an extremely difficult task, and finding a tree using our greedy approach is a reasonable substitute.
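The greedy step can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how scikit-learn implements it: features are boolean, the impurity measure is Gini, the impurity after the split is weighted by subset size, and the names `gini`, `gain`, `greedy_best_feature` and the toy data are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gain(rows, labels, feature):
    """Impurity decrease from splitting this node on one boolean feature."""
    left = [y for row, y in zip(rows, labels) if row[feature]]
    right = [y for row, y in zip(rows, labels) if not row[feature]]
    n = len(labels)
    after = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - after


def greedy_best_feature(rows, labels):
    """Pick the feature with the largest gain at this node only (no look-ahead)."""
    return max(rows[0].keys(), key=lambda feature: gain(rows, labels, feature))


# Made-up toy data: did the student study, did they sleep well, did they pass?
rows = [
    {"studied": True, "slept": True},
    {"studied": True, "slept": False},
    {"studied": True, "slept": True},
    {"studied": False, "slept": True},
    {"studied": False, "slept": False},
    {"studied": False, "slept": True},
]
labels = ["pass", "pass", "pass", "fail", "fail", "pass"]

best = greedy_best_feature(rows, labels)
print(best, gain(rows, labels, "studied"), gain(rows, labels, "slept"))
```

Because `greedy_best_feature` only scores the immediate split, it never evaluates what the subtrees below each candidate would look like, which is exactly why the resulting tree is not guaranteed to be globally optimal.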

Another problem with our trees is that they are prone to overfitting the data. This means that the structure of the tree is too dependent on the training data and may not generalize well to new data. In general, larger trees tend to overfit the data more. As the tree gets bigger, it becomes more tuned to the training data and loses a more generalized understanding of the real-world data.
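One way to see this effect is to compare an unrestricted tree with a depth-limited one on noisy data. The sketch below uses a synthetic data set from `make_classification` (an assumption made for illustration, not the data used elsewhere in this article) and simply prints depth, training accuracy, and test accuracy for each tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One tree grown without a depth limit, one restricted to three levels.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(
        name,
        "depth:", model.get_depth(),
        "train accuracy:", round(model.score(X_tr, y_tr), 3),
        "test accuracy:", round(model.score(X_te, y_te), 3),
    )
```

The unrestricted tree keeps splitting until every training sample is classified correctly, noise included, so a large gap between its training and test accuracy is the usual sign of overfitting. Hyperparameters such as `max_depth` and `ccp_alpha`, used in the model fitting code above, are ways to limit this.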

## **Gini Impurity**

The root node is the top of the tree. It is annotated with the number of samples, and the number in each class (i.e. True vs. False), that were used to build the tree. At each split, samples for which the condition is True go to the left and samples for which it is False go to the right. Note that the right split is a leaf node, i.e. there are no more branches. Any decision ending here results in the majority class. (The majority class here is False.)

This idea can be quantified by calculating the Gini impurity of a set of data points. For two classes (1 and 2) with probabilities $p_1$ and $p_2$ respectively, the Gini impurity is:

$1-(p_1^2 + p_2^2)$

The goal of a decision tree model is to separate the classes as well as possible, i.e. to minimize the impurity (or maximize the purity). Notice that if $p_1$ is 0 or 1, the Gini impurity is 0, which means there is only one class, so there is perfect separation. From the graph, the Gini impurity is at its maximum at $p_1=0.5$, which means the two classes are equally balanced, so this is perfectly impure!

In general, the Gini impurity for C classes is defined as:

$1-\sum_{i=1}^{C} p_i^2$
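As a quick check of the formula, here is a minimal sketch in plain Python. The function name `gini_impurity` and the example labels are made up for illustration, and treating an empty set as pure is an assumed convention.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention: an empty set counts as pure
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["yes", "yes", "yes", "yes"])  # only one class present
balanced = gini_impurity(["yes", "yes", "no", "no"])  # two classes, p_1 = 0.5
three_way = gini_impurity(["a", "b", "c"])  # three equally likely classes
print(pure, balanced, three_way)
```

The single-class set gives 0 and the evenly balanced two-class set gives 0.5, matching the two-class discussion above. With three equally likely classes the impurity rises to 2/3, since the maximum for C classes is $1 - 1/C$.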

## **Information Gain**

For a classification task, the default split criterion is Gini impurity, which gives us a measure of how “impure” the groups are. At the root node, the first split is then chosen as the one that maximizes the information gain, i.e. decreases the Gini impurity the most.

We know that we want to end up with leaves with a low Gini Impurity, but we still need to figure out which features to split on in order to achieve this. To answer this question, we can calculate the information gain of splitting the data on a certain feature. Information gain measures the difference in the impurity of the data before and after the split.
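The calculation can be sketched as follows. This sketch assumes the common convention that the impurity after the split is the average of the child impurities weighted by subset size; the function names and the pass/fail labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * gini_impurity(child) for child in children)
    return gini_impurity(parent) - weighted


parent = ["pass", "pass", "pass", "fail", "fail", "fail"]

# A split that separates the classes completely.
perfect_split = [["pass", "pass", "pass"], ["fail", "fail", "fail"]]
# A split that leaves both children almost as mixed as the parent.
poor_split = [["pass", "pass", "fail"], ["pass", "fail", "fail"]]

gain_perfect = information_gain(parent, perfect_split)
gain_poor = information_gain(parent, poor_split)
print(gain_perfect, gain_poor)
```

The perfect split removes all of the parent's impurity (a gain of 0.5 here), while the poor split recovers only a small fraction of it, so a tree builder comparing the two would choose the first.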