# The Holdout Method

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("data/iowa_housing.csv")

feature_names = [
'LotArea', # Total lot area of a property, measured in square feet.
'YearBuilt', # Year when the house was constructed or built.
'1stFlrSF', # Total square footage of the first (ground) floor of the house.
'2ndFlrSF', # Total square footage of the second floor of the house.
'BedroomAbvGr', # Number of bedrooms located above the ground level.
'TotRmsAbvGrd', # Total number of rooms (excluding bathrooms) above ground level.
'GrLivArea', # Above ground living area (square feet)
]

X = df[feature_names]
y = df['SalePrice'] # Same as before

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X,y)
y_pred = tree_model.predict(X)

In [None]:
tree_model.score(X, y) # R²

In [None]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y, y_pred) # Too good to be true

## Train / test split

### Overfitting

The score is excellent, but only because our evaluation is flawed. We trained and tested the model on the same data, so it is no surprise that almost every prediction is correct: the tree has simply memorized its training set. This is a classic symptom of **overfitting**.

### Let's divide the data

To evaluate the robustness of a model, we will test it on data it has never seen before. To do this, we will split our data into two groups: one will be used for training (*train*), and the other to test the model (*test*).

The `train_size` parameter determines the proportion of our data used for training. A value of 0.8 means that we reserve 80% of our data for training and 20% for testing.
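
As a quick sanity check of what `train_size=0.8` does, here is a minimal sketch on toy data. The arrays `X_demo` and `y_demo` are made up for illustration and are not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y: 100 rows, 3 features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# 80% of the rows go to training, the remaining 20% to testing
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

With 100 rows, we get 80 training rows and 20 test rows, and the features and targets stay aligned.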

In [None]:
from sklearn.model_selection import train_test_split
# Now our data are split in 4 different parts
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.8)

tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)
print(mean_absolute_error(y_test, y_pred_tree))

The MAE is significantly higher, approximately 350 times higher! Since the average price of a house is around $180,000, this means our model is off by about 1/9 of the price on average. There are, of course, many ways to get a better score.

## Let's compare with the linear regression

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train,y_train)

y_pred_lr = lr_model.predict(X_test)
print(mean_absolute_error(y_test, y_pred_lr))

(pd.DataFrame(zip(X.columns, lr_model.coef_), columns=['variable', 'coefficient'])
              .sort_values(by='coefficient', ascending=False)
              .reset_index(drop=True))

## Limits of the "Holdout" Method: Cross Validation

The split between "train" and "test" has two flaws: the model cannot be trained on the entire dataset, and the score depends on how the data happened to be split at random. Don't we risk withholding important data from our model?
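
To see this sensitivity in practice, here is a minimal sketch that repeats the same split with different seeds. It uses synthetic data from `make_regression` (an assumption made so the example runs on its own), not the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

scores = []
for seed in range(5):
    # Same proportions every time, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=0.8, random_state=seed
    )
    model = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # R² on the held-out part

print([round(s, 3) for s in scores])
print("spread:", round(max(scores) - min(scores), 3))
```

The model and the data are identical in every run, yet the R² changes with the split. A single holdout score should therefore be read with some caution.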

## The *K-Fold Cross Validation*

To solve this problem, we can use Cross Validation, along with one of its variants: K-Fold Cross Validation.

We will divide the dataset into $K$ equal parts, reserve the first part for testing, and train the model on the rest. Then repeat the same operation, selecting the second part as the test set, and so on.

In general, we then calculate the average of all the different scores obtained, which gives us a "cross-validated" score, i.e., a hypothetical score that could be achieved if the model were trained on the entire dataset.
Note that cross-validation does not produce a trained model; this method only provides a series of scores.
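
The procedure described above can be written by hand in a few lines. This is a sketch on synthetic data (`make_regression` is an assumption, used only to keep the example self-contained). It also checks that the manual loop matches scikit-learn's `cross_val_score` shortcut.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

kf = KFold(n_splits=5)  # K = 5 consecutive folds, no shuffling
fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Train on K-1 folds, score on the remaining fold
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print("manual mean R²:", np.mean(fold_scores))

# Same computation with the built-in helper
auto_scores = cross_val_score(
    DecisionTreeRegressor(random_state=42), X_demo, y_demo, cv=5
)
print("cross_val_score mean R²:", auto_scores.mean())
```

Each of the 5 models is discarded after scoring, which is why cross-validation returns scores rather than a trained model.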

## Choosing the Number $K$

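We often choose a number between 3 and 10. The higher the number, the larger the share of the data each model is trained on and the more scores we average, so the more representative the final score will be. However, this requires more computation time, since $K$ models must be trained.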
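
To make the trade-off concrete, this small sketch counts, for a hypothetical dataset of 100 rows, how many models are trained and how many rows each one sees for a few values of $K$.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((100, 1))  # only the number of rows matters here

summary = {}
for k in (3, 5, 10):
    splits = list(KFold(n_splits=k).split(X_demo))
    train_idx, test_idx = splits[0]  # look at the first fold
    # (number of models to fit, rows used for training, rows used for testing)
    summary[k] = (len(splits), len(train_idx), len(test_idx))
    print(f"K={k}: {summary[k]}")
```

A larger $K$ means more fits, but each fit uses a larger part of the data for training.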



## Cross validation (K-Fold)
<div>
<img src="files/cross_validation.jpg" alt="cross_validation" width="70%" align='center' source="https://www.50a.fr/img/upload/machine%20learning..jpg" /> </div>

In [None]:
from sklearn.model_selection import cross_validate

cross_validate(tree_model, X, y,
               #verbose=2,
               cv=10) # Default Score is R² for DecisionTreeRegressor

### DecisionTreeRegressor

In [None]:
# Our original score with the Tree model based on our initial split
tree_model.score(X_test, y_test) # R²

In [None]:
# Average score with cross validation
cross_validate(tree_model, X, y, cv=10)['test_score'].mean()

### LinearRegression

In [None]:
# Our original score with Linear Regression model based on our initial split
lr_model.score(X_test, y_test) # score is R²

In [None]:
# Average score with cross validation
cross_validate(lr_model, X, y, cv=10)['test_score'].mean()

# Model parameters

A decision tree can be configured in many different ways, as you can see by examining the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for our type of model.

### Difference between parameters and hyperparameters

In machine learning, hyperparameters are the parameters that govern the process of generating the internal parameters of the model.

For example, in our model, its parameters include all the branches that lead to the leaves of our tree, among other things. These parameters were determined by the model during its training and changed significantly between the start of the fitting process and when it finished.

Hyperparameters, on the other hand, are parameters that are often set by a human and are not modified during training. They represent high-level directions or settings.
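
The distinction can be seen directly on a scikit-learn estimator. In this sketch (synthetic data, chosen only so the example runs on its own), `get_params()` returns the hyperparameters we set, while the fitted `tree_` attribute holds what was learned.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=20.0, random_state=0
)

model = DecisionTreeRegressor(max_depth=3, random_state=42)

# Hyperparameter: set by us, available before any training
print("max_depth hyperparameter:", model.get_params()["max_depth"])

model.fit(X_demo, y_demo)

# Parameters: the structure learned during training
print("nodes learned:", model.tree_.node_count)
print("depth reached:", model.get_depth())
print("leaves:", model.get_n_leaves())
```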

One of the most important hyperparameters for this model is the depth of the tree. So far we have not set it, so scikit-learn left the depth unlimited (`max_depth=None`) and the tree kept growing during training. Let's examine the depth it reached:

In [None]:
tree_model.tree_.max_depth

Our tree reaches a maximum depth of 25 levels. Each time we add a level of depth, we increase (at most double) the maximum number of leaves and therefore the tree's precision. However, the number of houses in each leaf shrinks, which makes predictions less reliable. So we need to find a balance between precision and reliability.
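
Here is a small sketch of this effect on synthetic data (an assumption, not the housing data): as `max_depth` grows, the number of leaves rises (bounded by $2^{depth}$) and the average number of samples per leaf falls.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0
)

leaves_by_depth = {}
for depth in (2, 4, 6, 8):
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_demo, y_demo)
    n_leaves = model.get_n_leaves()
    leaves_by_depth[depth] = n_leaves
    print(
        f"depth={depth}  max leaves={2 ** depth}  "
        f"actual leaves={n_leaves}  "
        f"avg samples per leaf={len(X_demo) / n_leaves:.1f}"
    )
```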

### Overfitting and Underfitting

The phenomena of overfitting and underfitting are central concepts in machine learning.

- Overfitting occurs when the model's results closely match the data it was trained on but make significant errors when applied to unknown data. This happens when our decision tree is too deep.

- Underfitting occurs when the model fails to distinguish essential features in our data. It will have a poor score on both the training data and the test data. This occurs when our model is not deep enough.

<div>
<img src="files/underfitting_and_overfitting.png" width="65%" align='center'/> </div>

The graph above shows the variation in MAE based on the depth of a decision tree. The term "validation" here refers to the "test" dataset. Here are some observations about the graph:

- On average, the model will always have a better score when predicting data from its training set rather than unknown data from the test set.

- Increasing the depth initially improves the model on both the training and test sets.

- There comes a point where increasing the depth improves precision on the training set, but the MAE starts to increase on the test set. This is the phenomenon of overfitting.

The goal of hyperparameter tuning is to find this balance point, which should allow us to maximize the model's performance on unseen data.

In [None]:
def get_mae(max_depth, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test):
    tree_model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    tree_model.fit(X_train, y_train)
    y_pred_train = tree_model.predict(X_train)
    y_pred_test = tree_model.predict(X_test)
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_validation = mean_absolute_error(y_test, y_pred_test)
    
    return mae_validation, mae_train

In [None]:
# Example
get_mae(10)

In [None]:
d = {}
for max_depth in range(2, 20):
    mae, mae_train = get_mae(max_depth)
    d[max_depth] = [mae, mae_train]
    print(f"Max Depth: {max_depth} \t\t Mean Absolute Error: {mae, mae_train}")

In [None]:
(pd.DataFrame.from_dict(d, orient='index')
             .rename(columns={0 : 'MAE Validation', 1 : 'MAE Train'})

).plot();

In [None]:
# Zoom on MAE Validation

(pd.DataFrame.from_dict(d, orient='index')
             .rename(columns={0 : 'MAE Validation', 1 : 'MAE Train'})['MAE Validation']

).plot();

## The Bias / Variance tradeoff

<div>
<img src="files/overfitting_underfitting_good_balance.png" width="80%" align='center' source='https://medium.com/analytics-vidhya/5-ways-to-achieve-right-balance-of-bias-and-variance-in-ml-model-f734fea6116'/> </div>

- **Variance** is the variability of model predictions for a given input, indicating how much the model's predictions would differ if trained on different subsets of the data.

    **High variance** models are more complex and flexible, often capturing noise and fluctuations in the training data. They may perform well on the training data but may not generalize well to new data or the test dataset (**overfitting**).

- **Bias** is the error introduced by approximating a complex problem by a much simpler model. It represents the difference between the average prediction of the model and the true value.

    **High bias** models tend to be too simplistic and may **underfit** the data. They may not capture the underlying patterns and relationships in the data.

    However, their score on the test dataset is usually close to their score on the training data, because they are too simple to memorize noise.
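
The two definitions above can be illustrated with a minimal sketch on a noisy sine curve (synthetic data, an assumption made for the example). A very shallow tree stands in for a high-bias model, a fully grown tree for a high-variance one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

results = {}
for depth in (1, 4, None):  # None lets the tree grow without limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    # (R² on training data, R² on test data)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

The fully grown tree fits its training data perfectly but loses part of that score on the test set, which is the signature of high variance. The depth-1 tree scores poorly even on the data it was trained on, which is the signature of high bias.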


## No Free Lunch Theorem

The term was popularized by David H. Wolpert and William Macready in the late 1990s.

The No Free Lunch theorem states that there is no algorithm that performs optimally on all possible problems or datasets.
In other words, no algorithm is universally superior, and the performance of an algorithm is highly dependent on the specific characteristics of the problem it is applied to.
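
As an informal illustration (not a proof of the theorem), the sketch below compares the two models used in this notebook on two made-up problems: one with a linear relationship and one with a sine-shaped relationship. All data here is synthetic and chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(1000, 1))

# Two hypothetical problems sharing the same inputs
problems = {
    "linear": 3 * x[:, 0] + rng.normal(scale=3.0, size=1000),
    "sine": 10 * np.sin(x[:, 0]) + rng.normal(scale=1.0, size=1000),
}

r2 = {}
for name, target in problems.items():
    x_tr, x_te, t_tr, t_te = train_test_split(
        x, target, train_size=0.8, random_state=42
    )
    for model in (LinearRegression(), DecisionTreeRegressor(random_state=42)):
        model.fit(x_tr, t_tr)
        r2[(name, type(model).__name__)] = model.score(x_te, t_te)

for key, value in r2.items():
    print(key, round(value, 3))
```

Linear regression wins on the linear problem and the decision tree wins on the sine problem: neither model is better everywhere.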

<div>
<img src="files/no_free_lunch.png" width="75%" align='center' source='https://community.alteryx.com/t5/Data-Science/There-is-No-Free-Lunch-in-Data-Science/ba-p/347402'/> </div>

## What is the difference between test and validation sets?

- The **validation set** is used during the training process to tune hyperparameters and assess the performance of different models. It helps to prevent overfitting by providing an independent dataset that the model has not seen during training.

- The **test set** is used to evaluate the final performance of the selected model. It provides an unbiased estimate of how well the model will generalize to unseen data. The test set should only be used once, after the model has been trained and tuned using the validation set.

Depending on what you are doing and the models you're using, you don't necessarily need to create a validation set. We'll get back to that later.
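
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below uses toy arrays and a 60% / 20% / 20% split. Both the data and the proportions are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (hypothetical)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First split: keep 20% aside as the final test set
X_temp, X_te, y_temp, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# Second split: 75% of the remaining 80 rows for training, 25% for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_temp, y_temp, train_size=0.75, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))
```

Hyperparameters are tuned by comparing scores on `X_val`, and `X_te` is only touched once at the very end.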

## Learning Curves and train/test split size

**Do we overfit? Do we underfit? Do we have enough data?**

To answer these questions, let's monitor how the score evolves as we change the size of the training set, computing the score ($R^2$) for each training run.

<div>
<img src="files/learning_curves.jpg" width="75%" align='center' source='https://www.dataquest.io/blog/learning-curves-machine-learning/'/> </div>

## Understanding the curves

### A good model

As the training size increases:

 - The training score will decrease.
 - The test score will increase.
 - Most of the time the curves demonstrate convergence.


<div>
<img src="files/learning_curves_score.png" width="75%" align='center' source='https://www.dataquest.io/blog/learning-curves-machine-learning/'/> </div>

If the curves converge, it means that your model performs as well on the training set as it does on the test set. It does not perform worse on the test set, so there is no overfitting.

But other scenarios can occur.

### High bias (underfitting)


<div>
<img src="files/high_biais.png" width="75%" align='center' source='https://www.dataquest.io/blog/learning-curves-machine-learning/'/> </div>

Here the curves converge, so, just as before, there is no overfitting, but the model performs very poorly overall. We tried to use linear regression where it is clearly not a suitable model. The model is underfitted: it has learned almost nothing. There is therefore a strong bias, and we make a lot of errors.

### High variance (overfitting)

<div>
<img src="files/high_variance.png" width="75%" align='center' source='https://www.dataquest.io/blog/learning-curves-machine-learning/'/> </div>

In this other scenario, we observe that the training data has been perfectly predicted, but the score is low on the test data. There is therefore a significant gap between the two curves. We're dealing with an overfitting case.

### Ideal curves

<div>
<img src="files/learning_curves_ideal_curves.png" width="75%" align='center' source='https://www.dataquest.io/blog/learning-curves-machine-learning/'/> </div>

Ideal curves are therefore quite high (good score on both curves so no underfitting) and converge (no overfitting).

### Too much data?

These curves also indicate when it is not necessary to continue providing data to the model. By providing only what is strictly necessary, we shorten the duration of each model training, which allows us to save time and resources.

## Learning curves on our data

## With linear regression

In [None]:
from sklearn.model_selection import learning_curve

# Absolute numbers of training samples (each must fit within one CV training split)
train_sizes = [25,50,75,100,250,500,750,1000,1150, 1400]

# Get train scores (R2), train sizes, and validation scores using `learning_curve`
train_sizes, train_scores, test_scores = learning_curve(estimator=LinearRegression(),
                                                        X=X,
                                                        y=y,
                                                        train_sizes=train_sizes,
                                                        cv=5)

# Take the mean of cross-validated train scores and validation scores
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_scores_mean, label='Training score')
plt.plot(train_sizes, test_scores_mean, label='Test score')
plt.ylabel('r2 score', fontsize=14)
plt.xlabel('Training set size', fontsize=14)
plt.title('Learning curves', fontsize=18, y=1.03)
plt.legend();

This graph shows that our curves have converged (the gap between them isn't that large) and that the test score has plateaued. Adding more data is therefore unlikely to increase the score.

Also the score is below 0.7, so we can probably improve it.

## With DecisionTreeRegressor

In [None]:
from sklearn.model_selection import learning_curve


train_sizes = [25,50,75,100,250,500,750,1000,1150, 1400]

# Get train scores (R2), train sizes, and validation scores using `learning_curve`
train_sizes, train_scores, test_scores = learning_curve(
    estimator=DecisionTreeRegressor(random_state=42),  # fixed seed, as in the earlier cells
    X=X,
    y=y,
    train_sizes=train_sizes,
    cv=5)

# Take the mean of cross-validated train scores and validation scores
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_scores_mean, label='Training score')
plt.plot(train_sizes, test_scores_mean, label='Test score')
plt.ylabel('r2 score', fontsize=14)
plt.xlabel('Training set size', fontsize=14)
plt.title('Learning curves', fontsize=18, y=1.03)
plt.legend();

Here the gap between the curves is bigger: the model performs very well on the training set but much worse on the test set. We're dealing with overfitting.

The test score curve keeps going up, meaning that if we had more data, our results would probably be better.

### What happens next?

The holdout method helps us improve our model. Once we think we have done everything we can (best split, best hyperparameters...), we can retrain the model, this time on the entire dataset, to further improve accuracy. Then we can test it on real data from a different dataset to see how it performs.

Another possibility would be to use a different model and see if it performs better or worse.
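
The retraining step mentioned above could look like the following sketch. It uses synthetic data and a small list of candidate depths, both assumptions made for the example: we pick the depth with the best cross-validated score, then fit a final model on all the data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical stand-in for the housing data)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=20.0, random_state=0
)

candidate_depths = [2, 4, 6, 8, 10]  # hypothetical search grid

# Mean cross-validated R² for each candidate depth
cv_scores = {
    depth: cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=42),
        X_demo, y_demo, cv=5,
    ).mean()
    for depth in candidate_depths
}
best_depth = max(cv_scores, key=cv_scores.get)
print("best depth:", best_depth)

# Final model: same hyperparameter, trained on the entire dataset
final_model = DecisionTreeRegressor(max_depth=best_depth, random_state=42)
final_model.fit(X_demo, y_demo)
```

Once the model has been refit on everything, there is no held-out data left to score it on, so its quality has to be checked on new data.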