In [None]:
import pandas as pd
import numpy as np
import math
import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (14,14)

# Regression Trees

Trees can perform regression as well as classification. Using a regression tree model is straightforward and very similar to using any other model, such as linear regression. The regression tree itself is mostly the same as the decision tree; the primary difference is that both the outcomes and the error metrics are adapted to numerical values. 

<b>A Regression Tree:</b>

![Regression Tree](images/regression_tree.webp "Regression Tree" )

We can start by creating and looking at a regression tree. As always, the mechanics of making and training the model are the same as we are used to. 

In [None]:
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
df = sklearn_to_df(sklearn.datasets.load_boston())

df.head()

In [None]:
#Generate Model
df2 = pd.get_dummies(df, drop_first=True)
y = np.array(df2["target"]).reshape(-1,1)
X = np.array(df2.drop(columns=["target"]))

X_train, X_test, y_train, y_test = train_test_split(X, y)

#Fit a tree with no limits on depth or leaf size, so it grows as far as the training data allows
clf = DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
print(clf.get_depth())
#For a regressor, score() returns R^2
print(clf.score(X_test, y_test))
print(X.shape, y.shape)
plot_tree(clf)

In [None]:
# Generate a better image
from sklearn.tree import export_graphviz
export_graphviz(clf,
                out_file="output/reg_tree_1.dot",
                feature_names=list(df.drop(columns=["target"]).columns),
                filled=True)

## Regression Tree Decision Making

The regression tree works very similarly to the decision tree. The key differences are:
<ul>
<li> <b>Predictions:</b> Instead of producing a classification at the end, it produces an average of all the values in that group. That average is the prediction for anything that falls into that leaf on the tree. 
<li> <b>Split Decisions: </b>Instead of using the information gain concept that classification trees use, a regression tree tries to minimize error when splitting, normally MSE. So the algorithm seeks the split with the lowest average error between the predictions and the actual values.
    <ul>
    <li> As a note, this should be familiar from the idea of a cost function. We want the model to minimize the error, how we define error can change, but the process of finding the optimal choice is the same. 
    <li> Rather than measures of set purity, like gini or entropy, the model uses the error as the metric to measure which split generates the "best" fitting tree. 
    </ul>
</ul>
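
The split search described above can be sketched by hand for a single feature. This is only a minimal illustration, not how sklearn implements it internally; the function name `best_split_mse` and the toy arrays are made up for the example. Each candidate threshold is scored by the size-weighted MSE of the two groups, where each group predicts its own mean.

```python
import numpy as np

def best_split_mse(x, y):
    """Return (threshold, weighted_mse) for the best single split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical feature values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Each side predicts its own mean, so its MSE is its (population) variance.
        score = (left.size * left.var() + right.size * right.var()) / y_sorted.size
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (hypothetical): two clearly separated groups of target values.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, score = best_split_mse(x, y)
print("Best threshold:", threshold)
print("Weighted MSE after split:", score)
print("MSE with no split:", y.var())
```

A full tree repeats this search over every feature, picks the best one, and then recurses into each side of the split.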

Just like decision trees, there are a few options that we can set as hyperparameters, one of those being the error metric. We can choose absolute error or a couple of others for the error metric; we can also limit growth with settings like max_depth and min_samples_split. 

The more we allow the tree to expand, the more potential predictions we can make, but the more likely we are to overfit. Limiting the tree size means each terminal leaf will represent more records with its prediction, and the tree will be less likely to overfit.
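
As a rough sketch of this tradeoff, we can fit trees of different depths and compare leaf counts and train/test scores. The data here is synthetic (from `make_regression`), and the `_demo` names and the depths tried are arbitrary choices for illustration, so the exact numbers will not match the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so this example runs on its own.
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for depth in [2, 4, 8, None]:  # None means no depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (number of leaves, train R^2, test R^2) for each depth.
    results[depth] = (tree.get_n_leaves(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print("max_depth:", depth, "leaves/train/test:", results[depth])
```

The unrestricted tree fits the training data almost perfectly; the gap between its train and test scores is the sign of overfitting to look for.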

#### Regression Tree Limitations

One specific weakness of regression trees is that they don't "extend" like a linear regression; they're bounded by the data they were trained on. If the largest leaf prediction produced in training is 50, the tree will never predict above 50, no matter what future inputs look like. We can see this if we chart an example: instead of a nice smooth prediction line like a linear regression, we get blocky steps.

![Regression Tree](images/regtree2.png "Regression Tree" )
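
A small sketch can confirm the bound on predictions. The data is a made-up straight line (y = 3x + 2 on x from 0 to 10), and the `_demo` variable names are just for this example. We then ask both a tree and a linear regression to predict far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: a straight line on x in [0, 10], so y ranges from 2 to 32.
x_train_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_train_demo = 3.0 * x_train_demo.ravel() + 2.0

tree_demo = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_train_demo, y_train_demo)
line_demo = LinearRegression().fit(x_train_demo, y_train_demo)

# Predict far outside the range seen in training.
x_new = np.array([[100.0]])
tree_pred = tree_demo.predict(x_new)[0]
line_pred = line_demo.predict(x_new)[0]

print("Largest training target:", y_train_demo.max())
print("Tree prediction at x=100:", tree_pred)
print("Linear prediction at x=100:", line_pred)
```

The tree's answer is the mean of its right-most leaf, so it can never exceed the largest training target, while the linear model keeps following the slope.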

### Use a Grid Search to Improve

We can also use a grid search to do some hyperparameter tuning. Along with some other options, we can try different error metrics. We can set a list of values for any hyperparameter that we want to include in the grid search, and every combination will be executed and evaluated. 

Note that the names for absolute and squared error changed in scikit-learn 1.0: the old labels mae/mse became absolute_error/squared_error, and the old labels were removed in version 1.2. Depending on the version of sklearn you have installed you might need one set or the other; the meaning is the same, the labels were just made more descriptive. 
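
If the same notebook has to run on different sklearn versions, one option is to pick the labels based on the installed version. This is only a sketch: the `criteria` variable is a name invented here, and the cutoff at version 1.0 is an assumption based on when the new labels were introduced.

```python
import re

import sklearn
from sklearn.tree import DecisionTreeRegressor

# Pull the major and minor version numbers out of a string like "1.3.2".
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

if (major, minor) >= (1, 0):
    criteria = ["squared_error", "absolute_error"]
else:
    criteria = ["mse", "mae"]

# Quick check that the installed version accepts each label.
for name in criteria:
    DecisionTreeRegressor(criterion=name).fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

print(sklearn.__version__, criteria)
```

The resulting list can then be used as the `'criterion'` entry in a parameter grid.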

In [None]:
tree_para = {'min_samples_split':[2,3,4,5,6,7,8,9,10],
            'max_depth':[7,8,9,10,11,12,13,14,15,16], 
            'criterion':["friedman_mse", "poisson", "squared_error", "absolute_error"]}

clfCV = GridSearchCV(estimator=DecisionTreeRegressor(random_state=0), param_grid=tree_para, cv=10) #cv=10 runs 10-fold cross-validation for each combination
clfCV.fit(X_train, y_train)
clfCV.best_estimator_

Use the optimal combo from above and create a new model. We could have also grabbed the best model directly from above and saved it in a variable. 
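
Grabbing the best model directly might look like the sketch below. It uses synthetic data and a deliberately small grid so it runs quickly on its own; the names `search` and `best_tree` are made up for the example. With the default `refit=True`, `best_estimator_` has already been refit on the full training set passed to `fit`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so this example runs on its own.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, 7], "min_samples_split": [2, 5, 10]},
    cv=5,
)
search.fit(X_tr, y_tr)

# The winning model, ready to use without re-typing its hyperparameters.
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.score(X_te, y_te))
```

This avoids copying hyperparameter values by hand, which is easy to get wrong when the grid results change between runs.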

In [None]:
clf2 = DecisionTreeRegressor(max_depth=8, min_samples_split=8, random_state=0)
clf2 = clf2.fit(X_train, y_train)
print(clf2.score(X_test, y_test))
plot_tree(clf2)

In [None]:
# Generate a better image
export_graphviz(clf2,
                out_file="output/reg_tree_2.dot",
                feature_names=list(df.drop(columns=["target"]).columns),
                filled=True)

### Regression Predictions

We can look at the predictions made by the tree (limiting the tree size makes the chart above and the results easier to read). Each prediction is the value of one of the terminal leaves; we don't get a continuous line like a linear regression. 

This is one reason that regression trees aren't all that common on their own: the number of distinct values that can be predicted is limited by the number of leaves in the tree. If we count the number of distinct predictions and compare it to the total number of predictions made, we can see that only a small number of distinct values are being predicted. 
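
The link between leaves and distinct predictions can be checked directly. This sketch uses synthetic data and a shallow tree; `small_tree` and the other names are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only so this example runs on its own.
X_demo, y_demo = make_regression(n_samples=400, n_features=4, noise=5.0, random_state=0)

# A depth-3 tree can have at most 2**3 = 8 leaves.
small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

demo_preds = small_tree.predict(X_demo)
n_unique = len(np.unique(demo_preds))
n_leaves = small_tree.get_n_leaves()

print("Predictions made:", len(demo_preds))
print("Unique predictions:", n_unique)
print("Leaves in the tree:", n_leaves)
```

However many rows we predict on, the number of unique predictions can never exceed the number of leaves.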

In [None]:
preds = clf2.predict(X_test)
sns.histplot(preds, binwidth=1)

print("Number of predictions made:", len(X_test))
print("Unique predictions:", len(np.unique(preds)))


## Exercise - Predict the Target (BodyFat)

In [None]:
#Load data
df_ = pd.read_csv("data/bodyfat.csv")

#Change BodyFat to be named target, to make code reuse easier
df_.rename(columns={"BodyFat":"target"}, inplace=True)

df_.head()

In [None]:
#Generate Model

## Trees Please

Trees are one of the most common machine learning algorithms, and have several advantages:
<ul>
<li> They show how decisions are made. A human can follow a decision tree and see exactly what happens on the way to a prediction. 
<li> They can be quite fast. 
<li> They are more flexible than other algorithms in dealing with categorical data, as a tree can natively handle a categorical value. <b>Note:</b> this is true for a tree in theory; in practice, specific implementations may still require numerical inputs. 
<li> They work well in ensembles; in particular, many of the best non-neural network algorithms are based on boosting ensembles of trees. We'll look at these later. 
<li> They are resistant to outliers.
<li> Trees illustrate some of the internal processes of machine learning, as we can follow the actions of the algorithm and see how it makes decisions.
</ul>

There are also some downsides:
<ul>
<li> Regression trees are limited, and they can't extrapolate. 
<li> Forest ensembles generally get better results, but don't maintain the same level of understandability. 
<li> Overfitting is a concern, and we need to be careful to limit the growth of the tree.
</ul>

In practice, trees form the foundation model for several of the best and most recently developed non-neural network algorithms, like xgboost. We'll look at this later when we examine boosted ensemble models.