# Random Forests

Decision trees force us to make a difficult trade-off: a **deep tree** with many leaves will **overfit**, because each prediction is computed from only a small number of houses. A **shallow tree** with few leaves, on the other hand, will **underfit**, because it fails to capture the importance of the different features in our dataset.

Even the most sophisticated modeling techniques today face this trade-off between underfitting and overfitting. However, many models rely on clever ideas to reach better performance, and the ***random forest*** algorithm is one of them.

A ***random forest*** generates many different decision trees and makes a prediction by **averaging the predictions of each tree**. Its predictions are usually much better than those of a single decision tree, and default parameters are often sufficient.
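
The averaging step can be checked directly. The sketch below is only an illustration: it assumes synthetic data from `make_regression` (not the Iowa dataset) and uses hypothetical names such as `forest` and `per_tree_preds`. It compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only for illustration (not the Iowa dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is stored in forest.estimators_
per_tree_preds = np.array([tree.predict(X_demo[:3]) for tree in forest.estimators_])

# Averaging the trees' predictions should reproduce the forest's own prediction
manual_average = per_tree_preds.mean(axis=0)
forest_preds = forest.predict(X_demo[:3])
print(np.allclose(manual_average, forest_preds))
```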

Disadvantage: the algorithm acts like a **black box**, and it's difficult to explain its choices.

This algorithm comes in two forms. It can be used to predict a **continuous value** (*regressor*) or to predict a **class** (*classifier*).
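
In scikit-learn both variants share the same `fit` / `predict` interface. The following sketch assumes small synthetic datasets (`make_regression`, `make_classification`) and hypothetical variable names; it is only meant to show what each variant returns.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regressor: the target is a number, and so are the predictions
X_reg, y_reg = make_regression(n_samples=100, n_features=4, random_state=0)
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_reg, y_reg)
reg_pred = reg.predict(X_reg[:3])  # floating-point values

# Classifier: the target is a class label (0 or 1 in this toy dataset)
X_clf, y_clf = make_classification(n_samples=100, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_clf, y_clf)
clf_pred = clf.predict(X_clf[:3])  # class labels

print(reg_pred)
print(clf_pred)
```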

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

from pprint import pprint

df = pd.read_csv("data/iowa_housing.csv")

In [None]:
y = df['saleprice']

feature_names = ['lotarea',
                 'yearbuilt',
                 '1stflrsf',
                 '2ndflrsf',
                 'fullbath',
                 'bedroomabvgr',
                 'totrmsabvgrd',]

X = df[feature_names]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

iowa_rf_model = RandomForestRegressor(random_state=42)
iowa_rf_model.fit(X_train, y_train)
y_pred = iowa_rf_model.predict(X_test)
mean_absolute_error(y_test, y_pred)

# More metrics

So far we have used the MAE as our main metric, but we could use other metrics to evaluate our predictions.

### MSE (*Mean Squared Error*)

The MSE (*Mean Squared Error*) quantifies the average squared difference between the predicted values and the actual values in a dataset. It penalizes larger prediction errors more heavily because of the squaring operation. On the other hand, its unit is the square of the target's unit (for example, squared dollars if prices are in dollars), so it cannot be compared directly with the original values.

<div>
<img src="files/mse_formula.svg" alt="MSE" width="20%" align='center'/> </div>
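
As a small sketch of the computation, the snippet below evaluates the MSE by hand on two toy values (assumed for illustration, not taken from the Iowa data). It also takes the square root, usually called the RMSE, which brings the error back to the unit of the target.

```python
import math

# Toy values, assumed for illustration only
actual_values = [10, 15]
predicted_values = [12, 13]

# Square each error, then average
squared_errors = [(a - p) ** 2 for a, p in zip(actual_values, predicted_values)]
mse = sum(squared_errors) / len(squared_errors)

# The square root of the MSE is expressed in the same unit as the target
rmse = math.sqrt(mse)

print(mse, rmse)
```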

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)

### R², coefficient of determination

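R² is the proportion of the variance in the dependent variable (y) that is explained by the independent variables (X) in the regression model.

R² is usually a value between 0 and 1 (it can even be negative on new data, when the model does worse than always predicting the mean), where: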

- **R² = 0** indicates that the model does not explain any of the variance in the dependent variable, and it's essentially a **poor fit**.
- **R² = 1** indicates that the model perfectly explains all the variance in the dependent variable, and it's an **excellent fit**.


The formula for calculating R² is as follows:
```
R² = 1 - (MSE_model / MSE_baseline)
```
Let's say we have only two values in our dataset, and we made the following predictions:

**- Actual values: (10, 15)**

**- Predicted values: (12, 13)**

First, let's calculate the **mean** of the actual values:
```
Mean of actual values = (10 + 15) / 2
                      = 12.5
```
Second, let's compute the **Mean Squared Error (MSE)** of our model:

```
MSE_model = ((10 - 12)² + (15 - 13)²) / 2
          = (4 + 4) / 2
          = 4
```
Now, let's calculate the MSE of a simple baseline model. This is done by using the mean of the actual values instead of making different predictions for all our data points. In this case, the mean of actual values is 12.5, so:

```
MSE_baseline = ((10 - 12.5)² + (15 - 12.5)²) / 2
             = (6.25 + 6.25) / 2
             = 6.25
```
We've got everything we need, so let's use the formula to calculate R²:

```
R² = 1 - (MSE_model / MSE_baseline)
   = 1 - (4 / 6.25)
   = 1 - 0.64
   = 0.36
```

The R² value is 0.36, meaning our model explains 36% of the variance in the target variable.
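
The worked example can be reproduced with a few lines of plain Python. The helper `r2_from_mse` below is a hypothetical name introduced for this sketch; it simply applies the formula `R² = 1 - (MSE_model / MSE_baseline)`, where the baseline always predicts the mean of the actual values.

```python
def r2_from_mse(actual_values, predicted_values):
    """Apply R² = 1 - MSE_model / MSE_baseline (baseline = mean of the actual values)."""
    n = len(actual_values)
    mean_actual = sum(actual_values) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual_values, predicted_values)) / n
    mse_baseline = sum((a - mean_actual) ** 2 for a in actual_values) / n
    return 1 - mse_model / mse_baseline

# Same two data points as in the worked example
r2_example = r2_from_mse([10, 15], [12, 13])
print(round(r2_example, 2))
```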

In [None]:
from sklearn.metrics import r2_score

# r2_score([10, 15], [12, 13]) # Our example earlier

r2_score(y_test, y_pred)

In [None]:
# The model's .score() method already uses R² as its metric.
iowa_rf_model.score(X_test, y_test)

## Hyperparameter optimisation

In [None]:
# Let's display the hyperparameters of our model

iowa_rf_model.get_params()

### Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = [1, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]  # None (no depth limit) could be added as well
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,}

pprint(random_grid)

In [None]:
# Let's compute how many possibilities we have...
possibilities = 1
for param_list in random_grid.values():
    possibilities *= len(param_list)
possibilities

In [None]:
# Random search of parameters, using 4-fold cross validation (cv=4),
# search across 10 different combinations (out of the 3960), and use all available cores
rf_random = RandomizedSearchCV(estimator=iowa_rf_model,
                               param_distributions=random_grid,
                               n_iter=10,
                               cv=4,
                               verbose=2,
                               random_state=42,
                               n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)
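
To get a feel for what "10 combinations out of the grid" means, the sketch below draws a few random combinations with scikit-learn's `ParameterSampler`. The reduced grid `demo_grid` is a hypothetical example, much smaller than `random_grid`, so that the output stays readable.

```python
from sklearn.model_selection import ParameterSampler

# A reduced, hypothetical grid (3 * 3 * 2 = 18 possible combinations)
demo_grid = {
    'n_estimators': [200, 400, 600],
    'max_depth': [10, 20, 30],
    'bootstrap': [True, False],
}

# Draw 5 of the 18 combinations at random; when every entry is a list,
# the combinations are sampled without replacement
sampled = list(ParameterSampler(demo_grid, n_iter=5, random_state=42))
for params in sampled:
    print(params)
```

Each dictionary printed here corresponds to one candidate model that a random search would train and evaluate with cross-validation.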

## Cross validation (K-Fold)
<div>
<img src="files/cross_validation.jpg" alt="cross_validation" width="70%" align='center' source="https://www.50a.fr/img/upload/machine%20learning..jpg" /> </div>

Cross-validation is a resampling method that uses different portions of the data to train and test a model over several iterations. With K folds, the model is trained K times: each time it is fitted on K-1 folds and evaluated on the remaining one, so every sample is used for validation exactly once. The scores of the iterations are then averaged. This gives a more reliable estimate of the model's performance, especially in cases where data is limited.
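
The mechanics can be sketched with scikit-learn's `KFold`. The example assumes synthetic data from `make_regression` (not the Iowa dataset) and hypothetical names such as `fold_scores`; it trains one small forest per fold and averages the validation scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic data, used here only for illustration
X_cv, y_cv = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kfold.split(X_cv):
    # A fresh model is trained on 3 folds and scored (R²) on the remaining fold
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))

print(fold_scores)
print(np.mean(fold_scores))  # the cross-validated score is the average over the folds
```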

## Random search results

In [None]:
# The best mean cross-validated score (R²) found by the random search

rf_random.best_score_

In [None]:
# The parameters that achieved it

rf_random.best_params_

In [None]:
# The model refitted on the whole training set with those parameters
rf_random.best_estimator_

In [None]:
# The test-set score (R²) of that best model

rf_random.best_estimator_.score(X_test, y_test)

### *Fine Tuning*: Grid Search CV

In [None]:
# These are good parameters found thanks to the random search

{'n_estimators': 1400,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 30,
 'bootstrap': True}

In [None]:
from sklearn.model_selection import GridSearchCV

# Let's create the parameter grid based on the results of random search
# (results may vary, because it's a random search!)
param_grid = {
    'bootstrap': [True],
    'max_depth': [20, 30, 40],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [1200, 1400, 1600]
}

# Instantiate the grid search model
grid_search_rf = GridSearchCV(estimator=iowa_rf_model,
                              param_grid=param_grid, 
                              cv=2,
                              n_jobs=-1,
                              verbose=1)

In [None]:
grid_search_rf.fit(X_train, y_train)

In [None]:
grid_search_rf.best_params_

In [None]:
grid_search_rf.score(X_test, y_test)

In [None]:
# Same result: GridSearchCV.score() delegates to the refitted best_estimator_
grid_search_rf.best_estimator_.score(X_test, y_test)

### Final model

In [None]:
from joblib import dump, load

# Keep the best model found by the grid search and save it to disk
final_rf = grid_search_rf.best_estimator_
dump(final_rf, 'best_iowa_rf_model.joblib')
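
A saved model can later be restored with `load`. The sketch below is a self-contained round trip under stated assumptions: it trains a small model on synthetic data and writes it to a temporary directory under a hypothetical file name, rather than reusing `best_iowa_rf_model.joblib`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small model on synthetic data, used here only for illustration
X_small, y_small = make_regression(n_samples=50, n_features=3, random_state=0)
small_model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name inside a temporary directory
    model_path = os.path.join(tmp_dir, 'demo_rf_model.joblib')
    dump(small_model, model_path)
    reloaded_model = load(model_path)

# The reloaded model should give the same predictions as the original one
same_predictions = (reloaded_model.predict(X_small) == small_model.predict(X_small)).all()
print(same_predictions)
```

Note that a joblib file should only be loaded from a trusted source, and ideally with the same scikit-learn version that was used to save it.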