## 4.Model_optimization

Authors : Haddam Yacine, Ka Alioune, Renaud Adrien

<p align="center">
  <a>
    <img src="../src/figures/logo-hi-paris-retina.png" alt="Logo" width="280" height="180">
  </a>

  <h3 align="center">Data Science Bootcamp</h3>
</p>

In this lab, we introduce some useful tools for improving a machine learning model. Sometimes a model just doesn't work as well as expected. Faced with this situation, many people try different fixes more or less at random, or follow their gut. Common options include:

- **Adding more data**
- **Improving data quality**
- **Selecting features**
- **Searching for the best hyperparameters**
- **Trying a new model (or averaging the outputs of several models)**
- **Tweaking some variables**

### Data Path

`data_dir` is the path to the data folder.

In [None]:
data_dir = "/home/jovyan/personal_workspace/bootcamp/data"

## Libraries 

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
import os

import pandas as pd
import warnings

from sklearn.metrics import mean_squared_error, max_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn import tree

sys.path.append('../src/notebooks')
from utils.utils_optimization import (
    select_best_features,
    variance_threshold_selector,
    feature_importance_selector,
    recursive_selection,
    F_test_selector
)


pd.set_option('display.max_columns', 500)

warnings.simplefilter(action='ignore', category=FutureWarning) 
warnings.simplefilter(action='ignore', category=UserWarning)

## Load data 

In [None]:
train = pd.read_feather(os.path.join(data_dir, 'model/train.feather'))
val = pd.read_feather(os.path.join(data_dir, 'model/val.feather'))
test = pd.read_feather(os.path.join(data_dir, 'model/test.feather'))

In [None]:
features = [
    # 'building_id',
    'lat',
    'lng',
    'square_feet',
    'air_temperature',
    'dew_temperature',
    'precip_depth_1_hr',
    'wind_speed',
    'sea_level_pressure',
    'wind_direction',
    'hour',
    'weekday',
    'month',
    'meter_name_chilledwater',
    'meter_name_electricity',
    'meter_name_hotwater',
    'meter_name_steam',
    'primary_use_Education',
    'primary_use_Entertainment/public assembly',
    'primary_use_Healthcare',
    'primary_use_Industry',
    'primary_use_Lodging/residential',
    'primary_use_Office',
    'primary_use_Other',
    'primary_use_Parking',
    'primary_use_Public services',
    'primary_use_Services',
    'zone_geo_EUROPE',
    'zone_geo_US',
    'site_id_0',
    'site_id_1',
    'site_id_2',
    'site_id_3',
    'site_id_4',
    'site_id_5',
    'site_id_6',
    'site_id_7',
    'site_id_9',
    'site_id_11',
    'site_id_12',
    'site_id_13',
    'site_id_15',
]

target = "meter_reading"

##  Selecting best features

Feature selection is the process of selecting, automatically or manually, the features that contribute most to the prediction variable or output you are interested in.

Irrelevant features in your data can decrease the accuracy of a model, because the model may learn from them instead of from the informative ones.

$\textbf{Benefits of performing feature selection}$

- Reduces overfitting: less redundant data means fewer opportunities to make decisions based on noise.
- Improves accuracy: less misleading data means modeling accuracy improves.
- Reduces training time: fewer features reduce algorithm complexity, so algorithms train faster.


$\textbf{Methods}$:

- **Linear selection**: based on the [Fisher test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) (or the Chi-squared test). It consists of fitting a separate linear regression for each variable in the dataset, one at a time, and then keeping the variables with a low [p-value](https://quantifyinghealth.com/p-value-explanation/).


- **Removing features with low variance** ([VarianceThreshold](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html)).


- **Feature importance**: delete all variables with low importance according to a model ([SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html)).


- **Recursive Feature Elimination ([RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html))**: start with all features in the training dataset, fit the chosen machine learning algorithm, rank the features by importance, discard the least important ones, and re-fit the model. This process is repeated until a specified number of features remains.



- **Go further**
    - Stepwise
    - LASSO
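
As a rough illustration of the four methods listed above, here is a minimal sketch that calls scikit-learn directly on a small synthetic dataset. The names `X_demo`, `y_demo` and `model_demo`, the synthetic columns and the thresholds are assumptions made for this example only; the bootcamp helpers in `utils_optimization` may be implemented differently.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
    RFE, SelectFromModel, VarianceThreshold, f_regression
)

# Synthetic data (illustration only): the target depends on "a" and "b" only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "a": rng.normal(size=300),
    "b": rng.normal(size=300),
    "noise": rng.normal(size=300),
    "constant": np.ones(300),
})
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=300)

# 1. Low variance: drop features whose variance does not exceed the threshold
vt = VarianceThreshold(threshold=0.15).fit(X_demo)
kept_by_variance = list(X_demo.columns[vt.get_support()])

# 2. Univariate F-test: keep features whose p-value is below 0.05
#    (the constant column is left out, its F-statistic is undefined)
X_var = X_demo[kept_by_variance]
_, p_values = f_regression(X_var, y_demo)
kept_by_f_test = list(X_var.columns[p_values < 0.05])

# 3. Feature importance: keep features whose importance reaches the threshold
model_demo = RandomForestRegressor(n_estimators=20, random_state=0)
sfm = SelectFromModel(model_demo, threshold=0.01).fit(X_demo, y_demo)
kept_by_importance = list(X_demo.columns[sfm.get_support()])

# 4. RFE: repeatedly drop the least important feature until two remain
rfe = RFE(model_demo, n_features_to_select=2).fit(X_demo, y_demo)
kept_by_rfe = list(X_demo.columns[rfe.get_support()])

print("variance  :", kept_by_variance)
print("F-test    :", kept_by_f_test)
print("importance:", kept_by_importance)
print("RFE       :", kept_by_rfe)
```

The methods do not have to agree with each other, which is one reason to compare their results side by side before dropping a feature.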

We can use each of the methods independently with something like:
```python
variance_threshold_selector(X_train)
F_test_selector(X_train, y_train, 0.05)
feature_importance_selector(model, X_train, y_train, threshold=0.01)
```

We can also apply them all and gather the results in a single table:

In [None]:
# define some thresholds that will be applied to select the features
p_value = .05
var_threshold = .15
feat_impor_threshold = .01

# define a model that we can use to select features
model = RandomForestRegressor(
    n_estimators=5,
    max_depth=16,
    random_state=0,
    n_jobs=-1
)

# apply all the methods
selected_features_df = select_best_features(
    train[features], train[target],
    p_value,
    var_threshold,
    model,
    feat_impor_threshold,
)

The table indicates, for each method, whether the feature was selected. We can analyze the table to select the best features.

In [None]:
selected_features_df
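
One possible way to analyze such a table is a simple majority vote across methods. The sketch below uses a made-up table (`demo_table`, with hypothetical feature and method names and invented values); the actual layout of `selected_features_df` may differ, so the column handling would need adapting.

```python
import pandas as pd

# Hypothetical table: one row per feature, one boolean column per method.
# Names and values are invented for illustration.
demo_table = pd.DataFrame(
    {
        "variance": [True, True, False],
        "f_test": [True, False, False],
        "importance": [True, True, False],
    },
    index=["feature_a", "feature_b", "feature_c"],
)

# Count how many methods selected each feature
votes = demo_table.sum(axis=1)

# Keep the features selected by at least two of the three methods
majority = votes[votes >= 2].index.tolist()

print(votes.sort_values(ascending=False))
print(majority)
```

A stricter rule (selected by every method) or a looser one (selected by at least one) is just a different cutoff on `votes`.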

## Searching the best hyper-parameters

An optimization procedure also involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of each dimension is the set of values that the hyperparameter may take, such as real-valued, integer-valued, or categorical.

$\textbf{Search Space:}$ Volume to be searched where each dimension represents a hyperparameter and each point represents one model configuration.
A point in the search space is a vector with a specific value for each hyperparameter. The goal of the optimization procedure is to find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

$\textbf{Random Search :}$ Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.

$\textbf{Grid Search : }$ Define a search space as a grid of hyperparameter values and evaluate every position in the grid.

Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.
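
To make the difference concrete, the following sketch enumerates candidate points with scikit-learn's `ParameterGrid` and `ParameterSampler`. These utilities only generate the candidates, they do not fit any model. The value ranges are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values is a candidate
grid_points = list(ParameterGrid({
    "n_estimators": [2, 10],
    "max_depth": [8, 16],
}))

# Random search: candidates are drawn at random from a bounded domain
# (randint(low, high) samples integers with low <= value < high)
random_points = list(ParameterSampler(
    {"n_estimators": randint(2, 50), "max_depth": randint(2, 20)},
    n_iter=4,
    random_state=0,
))

print(len(grid_points), "grid candidates")
for point in grid_points:
    print("grid  :", point)
for point in random_points:
    print("random:", point)
```

Each printed dictionary is one point of the search space, that is, one model configuration to evaluate.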

We install an additional package for GridSearch. Just execute the following cell.

In [None]:
!{sys.executable} -m pip install hypopt

In [None]:
from hypopt import GridSearch

In [None]:
# Define the model that we will use in the grid search
model = RandomForestRegressor(random_state=0, n_jobs=-1, max_depth=16)

# Define the grid of parameters that will be searched
param_grid = {
    'n_estimators': [2, 10],
    'max_depth': [8, 16]
}

# Create the GridSearch object with the model and the grid of parameters
grid_search = GridSearch(model=model, param_grid=param_grid, seed=0, parallelize=False)

Let's fit the GridSearch.
For each set of parameters, we:
- fit the model on the train set
- evaluate the performances on the validation set

In [None]:
_ = grid_search.fit(train[features], train[target], val[features], val[target], verbose=True)
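
Conceptually, a grid search with a separate validation set amounts to a loop like the one sketched below. This is not hypopt's actual implementation, only an illustration of the two steps (fit on train, score on validation), run on synthetic arrays (`X_tr`, `y_tr`, `X_va`, `y_va`) that stand in for the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid

# Synthetic stand-ins for the train and validation sets (illustration only)
rng = np.random.default_rng(0)
X_tr, X_va = rng.normal(size=(200, 3)), rng.normal(size=(100, 3))
y_tr = 2 * X_tr[:, 0] + rng.normal(scale=0.1, size=200)
y_va = 2 * X_va[:, 0] + rng.normal(scale=0.1, size=100)

best_score, best_params, best_model = -np.inf, None, None
for params in ParameterGrid({"n_estimators": [2, 10], "max_depth": [2, 8]}):
    candidate = RandomForestRegressor(random_state=0, **params)
    candidate.fit(X_tr, y_tr)            # fit the model on the train set
    score = candidate.score(X_va, y_va)  # R^2 on the validation set
    if score > best_score:
        best_score, best_params, best_model = score, params, candidate

print(best_params, round(best_score, 3))
```

Because the validation set is used to pick the winner, the score it reports is optimistic; the test set remains the fair estimate of performance.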

We can access the performance of the model for each set of parameters. Here, the metric used is $R^2$.

In [None]:
grid_search.get_param_scores()
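
As a reminder, the coefficient of determination is commonly defined as follows, where $y_i$ are the observed values, $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the observed values (this notation is introduced here for illustration):

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

A value of 1 means perfect predictions, 0 means the model does no better than always predicting $\bar{y}$, and negative values mean it does worse.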

Finally, we compute our MAE and MSE metrics on the validation set for the best model found with the grid search.  
The best model can be accessed with `grid_search.best_estimator_`

In [None]:
grid_search.best_estimator_

In [None]:
# Compute our MAE and MSE metrics on the validation set
def print_errors(model, X, y):
    y_pred = model.predict(X)

    print(f'MAE : {mean_absolute_error(y, y_pred):.0f}')
    print(f'MSE : {mean_squared_error(y, y_pred):.0f}')
    print(f'MAX : {max_error(y, y_pred):.0f}')


print('Errors on Validation set : ')
print_errors(grid_search.best_estimator_, val[features], val[target])

In [None]:
# Compute our MAE and MSE metrics on the test set
print('Errors on Test set : ')
print_errors(grid_search.best_estimator_, test[features], test[target])
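
To see what these three metrics measure, here is a small worked example on made-up numbers (`y_true` and `y_hat` are invented for illustration), computed by hand with NumPy and checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import max_error, mean_absolute_error, mean_squared_error

# Made-up observations and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 30.0, 50.0])
residuals = y_true - y_hat           # [-2, 2, 0, -10]

mae = np.mean(np.abs(residuals))     # (2 + 2 + 0 + 10) / 4 = 3.5
mse = np.mean(residuals ** 2)        # (4 + 4 + 0 + 100) / 4 = 27.0
worst = np.max(np.abs(residuals))    # 10.0

# The hand-computed values match scikit-learn's functions
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(worst, max_error(y_true, y_hat))

print(mae, mse, worst)
```

The single large error contributes 100 of the 108 squared units, which shows why MSE is much more sensitive to outliers than MAE.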

# It's your turn

Using what we learned so far, try to:
- improve your model
- create some insightful visualizations
- better understand your model's performance

$\textbf{Go further ! }$
- Hastie T., Tibshirani R., Friedman J., « [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) - Data Mining, Inference and Prediction », Springer, 2009.