# Decision trees

In the previous chapter, we created a predictive model based on linear regression. But is it possible to use another model?

The answer is yes: the *scikit-learn* library offers many models to choose from. Let's try one called a "decision tree". Decision trees are intuitive, easy to understand and analyze, and useful in many cases.

<div>
<img src="files/decision_tree_1.png" alt="Decision tree with a single split on the number of bedrooms" width="50%" align='center'/> </div>

This model divides houses into two categories: those with 2 bedrooms or fewer and those with more than 2 bedrooms. Its prediction for a house is the average price of the group the house falls into.

The model uses the dataset to decide how to allocate houses into these two groups, and then again to predict the price within each group. The step of setting a model's parameters from data is called training or fitting. The data used to set up this model is called training data.
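To make the idea of fitting concrete, here is a minimal sketch of the one-split tree described above. The data is invented for illustration, and the threshold of 2 bedrooms is fixed by hand, whereas a real training procedure would search for the best split itself.

```python
# Toy data, invented for illustration: (bedrooms, sale price in dollars)
houses = [(1, 150_000), (2, 170_000), (2, 190_000), (3, 250_000), (4, 310_000)]

def fit_stump(rows, threshold=2):
    """'Train' a one-split tree: compute the mean price on each side of the split."""
    left = [price for bedrooms, price in rows if bedrooms <= threshold]
    right = [price for bedrooms, price in rows if bedrooms > threshold]
    return {"left": sum(left) / len(left), "right": sum(right) / len(right)}

def predict_stump(stump, bedrooms, threshold=2):
    """Predict the mean price of the group the house falls into."""
    return stump["left"] if bedrooms <= threshold else stump["right"]

stump = fit_stump(houses)
print(stump)                    # mean price of each group
print(predict_stump(stump, 3))  # prediction for a 3-bedroom house
```

The two group means are the only parameters of this tiny model, and they come entirely from the training data.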

The details of how the model is trained (e.g., how to split the data) are quite complex, and we will not discuss this topic in these notebooks. Once the model is fitted, we can apply it to new data to predict the price of a home.

We can consider more factors by using a tree with more "splits," meaning it is "deeper."
A decision tree that also takes into account the size of each house's land might look like this.

<div>
<img src="files/decision_tree_2.png" alt="Deeper decision tree with additional splits on lot size" width="60%" align='center'/> </div>

To predict the price of a house, we go through the decision tree, always choosing the path that corresponds to the features of that house. The predicted price for the house is found at the bottom of the tree, and this point is called a leaf.
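The walk from the root down to a leaf can be sketched in a few lines. The tree below is hypothetical: its thresholds and leaf prices are made up and are not read from the figure or learned from the dataset.

```python
# Hypothetical tree: internal nodes hold a feature and a threshold,
# leaves hold the predicted price under the key "value".
tree_sketch = {
    "feature": "bedrooms", "threshold": 2,
    "left": {"feature": "lot_area", "threshold": 8_000,
             "left": {"value": 140_000}, "right": {"value": 180_000}},
    "right": {"feature": "lot_area", "threshold": 11_000,
              "left": {"value": 230_000}, "right": {"value": 300_000}},
}

def predict_one(node, house):
    """Follow the branches that match the house until a leaf is reached."""
    while "value" not in node:
        goes_left = house[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["value"]

house = {"bedrooms": 3, "lot_area": 9_500}
print(predict_one(tree_sketch, house))
```

A 3-bedroom house takes the right branch at the root, then the left branch because its lot is below 11,000 square feet, and ends in the leaf holding 230,000.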

<div>
<img src="files/classifier_tree_meme.webp" alt="Decision tree meme" width="50%" align='center' /> </div>

In [None]:
import pandas as pd
df = pd.read_csv("data/iowa_housing.csv")
df.shape

In [None]:
df.head(2)

# Multiple Explanatory Variables

Last time we only used one feature. This time, let's build our model using several features.

In [None]:
feature_names = [
'LotArea', # Total lot area of a property, measured in square feet.
'YearBuilt', # Year when the house was constructed or built.
'1stFlrSF', # Total square footage of the first (ground) floor of the house.
'2ndFlrSF', # Total square footage of the second floor of the house.
'BedroomAbvGr', # Number of bedrooms located above the ground level.
'TotRmsAbvGrd', # Total number of rooms (excluding bathrooms) above ground level.
'GrLivArea', # Above ground living area (square feet)
]

In [None]:
X = df[feature_names]
y = df['SalePrice'] # Same as before

# Modeling

### Model Selection

We will choose a decision tree for regression, implemented in scikit-learn as *DecisionTreeRegressor*, and store it in a variable named `model`.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# random_state will allow model reproducibility
model = DecisionTreeRegressor(random_state=42)

### Model Fitting

In [None]:
model.fit(X,y)

### Visualization

Once our model is created, we can visualize it in various ways.

In [None]:
from sklearn import tree
# print(tree.export_text(model))
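The commented-out `export_text` call prints the tree as indented text rules. On the full model the output is very long, so here is a small sketch on invented data, with the depth capped at 1 so that the result stays readable. The toy features and prices are assumptions made for illustration only.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data, invented for illustration: [bedrooms, lot area] -> sale price
X_toy = [[1, 5_000], [2, 6_000], [3, 9_000], [4, 12_000]]
y_toy = [150_000, 170_000, 250_000, 310_000]

# max_depth=1 limits the tree to a single split
toy_model = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_toy, y_toy)

# export_text returns the split rules and leaf values as a string
rules = export_text(toy_model, feature_names=["bedrooms", "lot_area"])
print(rules)
```

Each indented line is either a test on a feature or a leaf with its predicted value.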

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(25,20))
tree.plot_tree(model,
               feature_names=list(X.columns),  # a plain list works across scikit-learn versions
               max_depth=2,
               filled=True)

plt.show()

### Predictions

In [None]:
y_pred = model.predict(X)
print(y_pred)

Let's compare the first 5 predictions with the first 5 values of y.

In [None]:
list(y_pred[:5])

In [None]:
list(y[:5])

This is strange. We would expect our model to be a bit off, but it seems to predict our variable y down to the dollar! Let's verify this with some Pandas.

In [None]:
res = pd.DataFrame({'y':y,'y_pred':y_pred})
res['y_pred'] = res['y_pred'].round().astype(int)  # round before casting: astype(int) alone truncates
res['diff'] = res['y'] - res['y_pred']
res

In [None]:
res.loc[res['diff'] != 0]

Out of the 1760 predictions made, only 26 are incorrect, and even those are not very far from the expected results. Have we created the best possible model?

# Model Validation

Each model, once trained, must be evaluated using different metrics.

## MAE (*Mean Absolute Error*)

The MAE (*Mean Absolute Error*) is a good metric for evaluating how accurate the predictions of a continuous value are. For each property, we calculate the absolute difference between the actual value and the value predicted by the model:

```
error = |actual value - predicted value|
```

If we then take the average of these errors, we get the MAE. It tells us how far, on average, a prediction is from the actual value.

One of its advantages is that it is expressed in the same unit as our prediction, here in dollars ($), which is useful for making comparisons.

<div>
<img src="files/mae_formula.svg" alt="MAE" width="50%" align='center'/> </div>
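As a quick check of the definition, the MAE can be computed by hand. The prices below are invented for illustration.

```python
# Toy values, invented for illustration (prices in dollars)
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 145_000, 305_000]

# Absolute error for each property
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# The MAE is the mean of the absolute errors
mae = sum(abs_errors) / len(abs_errors)
print(abs_errors, mae)
```

The three errors are 10,000, 5,000 and 15,000 dollars, so the MAE is 10,000 dollars.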

In [None]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, y_pred) # Too good to be true?

In [None]:
model.score(X, y) # R²
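For a regressor, `model.score` returns the coefficient of determination R²: one minus the ratio between the squared errors of the model and the squared errors of a baseline that always predicts the mean. Here is a minimal sketch on invented values, not on our dataset.

```python
# Toy values, invented for illustration (prices in thousands of dollars)
actual = [100, 200, 300]
predicted = [110, 190, 300]

mean_actual = sum(actual) / len(actual)

# Squared errors of the model
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Squared errors of a baseline that always predicts the mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

A value of 1 means the predictions match the actual values exactly, and a value of 0 means the model does no better than always predicting the mean.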

**What did we do wrong?**

# Let's go deeper

### A little trick: column selection based on the rate of missing values

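Applying `df.dropna()` to all columns would delete all rows in our dataframe, because every row has a missing value in at least one column. Instead, we can first eliminate the columns with a completion rate lower than 94% (for example). Every remaining column is then at least 94% complete, so dropping the incomplete rows afterwards removes far fewer of them.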
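The effect is easiest to see on a tiny dataframe. The sketch below uses invented data and, because the frame has only five rows, a 50% threshold instead of 6%.

```python
import pandas as pd

# Toy dataframe, invented for illustration
toy = pd.DataFrame({
    "area": [50, 60, 70, 80, 90],
    "garage": ["attached", None, "detached", "attached", "attached"],
    "pool": [None, None, None, None, "yes"],  # mostly missing
})

# Share of missing values per column
missing_rate = toy.isna().mean()

# Keep only the columns that are missing in fewer than half of the rows
cols = missing_rate[missing_rate < 0.5].index

print(toy.dropna().shape)        # dropping rows directly
print(toy[cols].dropna().shape)  # dropping sparse columns first
```

Dropping rows directly leaves a single row, while removing the sparse `pool` column first leaves four.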

In [None]:
# Example 1
(df['GarageType'].isna().sum() / df.shape[0]) < 0.06  # column names are case-sensitive

In [None]:
# Example 2
(df['MiscFeature'].isna().sum() / df.shape[0]) < 0.06  # column names are case-sensitive

In [None]:
cols_to_keep = [col for col in df.columns if (df[col].isna().sum() / df.shape[0]) < 0.06]

In [None]:
# Compare the shape of the dataframe:
# 1. without dropping any rows or columns,
# 2. after dropping all rows that have at least one missing value,
# 3. after keeping only our selected columns, then dropping incomplete rows.

df.shape, df.dropna().shape, df[cols_to_keep].dropna().shape

In [None]:
new_df = df[cols_to_keep].dropna()

In [None]:
(new_df.isna().sum() > 0).sum()