# Time to train models :-)

I've chosen to train four different types of Machine Learning models and compare their results:

- Gradient Descent based: [Ridge Regressor](31.Gradient%20Descent%20Based%20-%20Ridge%20Regressor.ipynb)

- Distance based: [KNeighborsRegressor](32.Distance%20Based%20-%20KNeighborsRegressor.ipynb)

- Category based: [RandomForestRegressor](33.Category%20Based%20-%20RandomForestRegressor.ipynb)

- Neural network: [MLPRegressor](34.Neural%20Network%20-%20MLPRegressor.ipynb)


Because the dataset is quite large, I will first run a [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) step over wide parameter ranges to identify the most promising intervals for each model's hyperparameters, then refine the search over narrower ranges with a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). To apply preprocessing such as scaling with [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) and/or dimensionality reduction with [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), I will combine the search with a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
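
The two-stage search described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (built with *make_regression*), the estimator is a *Ridge*, and the parameter ranges, step names and number of iterations are arbitrary choices, not the ones used in the training Notebooks.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption: the real Notebooks load their own dataset)
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Preprocessing and model chained in a single estimator
pipe = Pipeline([
    ("scaler", RobustScaler()),
    ("pca", PCA()),
    ("ridge", Ridge()),
])

# Stage 1: random search over wide ranges
coarse = RandomizedSearchCV(
    pipe,
    param_distributions={
        "pca__n_components": [2, 5, 8, 10],
        "ridge__alpha": loguniform(1e-4, 1e4),
    },
    n_iter=20,
    cv=3,
    random_state=0,
)
coarse.fit(X, y)
best_alpha = coarse.best_params_["ridge__alpha"]
best_n = coarse.best_params_["pca__n_components"]

# Stage 2: exhaustive search on a narrow grid, one decade on each side of the best alpha
fine = GridSearchCV(
    pipe,
    param_grid={
        "pca__n_components": [best_n],
        "ridge__alpha": np.logspace(np.log10(best_alpha) - 1, np.log10(best_alpha) + 1, 9),
    },
    cv=3,
)
fine.fit(X, y)
print(fine.best_params_)
```

Pipeline parameters are addressed with the `<step name>__<parameter>` convention, which is what lets a single search tune preprocessing and model together.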

The combination of [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is an approach that I've already used in my [course #3 project](https://github.com/epfl-extension-school/project-adsml19-c3-s9-3871-2111/blob/master/house-prices/house-prices-solution-2-of-2.ipynb), inspired by this article: [SKlearn: Pipeline & GridSearchCV](https://medium.com/@cmukesh8688/sklearn-pipeline-gridsearchcv-54f5552bbf4e).

Before training the models, let me present a few tricks and functions that will be used throughout the training Notebooks.

In [1]:
# Load my_utils.ipynb in Notebook
from ipynb.fs.full.my_utils import *

Opening connection to database
Add pythagore() function to SQLite engine
Fraction of the dataset used to train models: 10.00%
my_utils library loaded :-)


# Work on a fraction of the dataset

The *full* dataset I've built to train models contains about 1.5 million rows and 50 features.

To speed up model training, I will work on a fraction of this dataset. To keep this fraction parameter identical across all the Notebooks, I've coded it as a *CONSTANT* in the [my_utils](my_utils.ipynb) library.

The current value is:

In [2]:
print("Fraction of the dataset used to train models: {:.2f}%".format(FRAC_VALUE_FOR_ML*100))

Fraction of the dataset used to train models: 10.00%
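
As an illustration of what working on a fraction means, here is a minimal sketch using *pandas.DataFrame.sample()*. It is an assumption about how such a subset could be drawn, not the actual implementation of *load_Xy()*; the dataframe, the column names and the `FRAC` constant are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

FRAC = 0.10  # hypothetical stand-in for FRAC_VALUE_FOR_ML

# Toy dataframe standing in for the full dataset
df = pd.DataFrame({
    "feature": np.arange(1000),
    "km_per_hour": np.linspace(1, 50, 1000),
})

# Draw a reproducible random subset containing FRAC of the rows
subset = df.sample(frac=FRAC, random_state=0)
print("{} rows kept out of {}".format(len(subset), len(df)))
```

Fixing *random_state* keeps the subset identical from one run to the next, which makes results comparable across Notebooks.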


# *Mean Absolute Error* and *Mean Absolute Percent Error* to evaluate model performance

I've chosen the [mean_absolute_error()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) function from *sklearn.metrics* to evaluate my models' performance. It measures the *average* error made by a model on its predictions.

In order to simplify the use of this [mean_absolute_error()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) in the next Notebooks, I've coded two functions: *mae()* and *mape()*

## mae(): Mean Absolute Error

This function returns the *Mean Absolute Error* of a prediction in km/h. It accounts for the fact that the target vector of the *Full* dataset, *km_per_hour*, has been log-transformed.

    def mae(y_pred, y) -> float:
        """
        Returns 10^mean_absolute_error() between the two result
        vectors passed as parameters.

        Returns:
        --------
        10^(mean absolute error)

        """

        return 10**mean_absolute_error(y_pred, y)
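
To see what *mae()* computes on log-transformed values, here is a small worked example. It assumes the transform is a base-10 logarithm (which the `10**` suggests), and the speeds are made-up numbers. Since the error is averaged on the log scale, raising 10 to that average gives the geometric mean of the ratios between predicted and actual speeds.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def mae(y_pred, y) -> float:
    # Same formula as the function above
    return 10 ** mean_absolute_error(y_pred, y)

# Made-up speeds, for illustration only
speeds_true = np.array([10.0, 20.0, 40.0])
speeds_pred = np.array([20.0, 20.0, 20.0])

# The metric is applied to the log10-transformed vectors
score = mae(np.log10(speeds_pred), np.log10(speeds_true))

# Same quantity computed directly: geometric mean of the ratios (each taken >= 1)
ratios = np.maximum(speeds_pred / speeds_true, speeds_true / speeds_pred)
geo_mean = np.exp(np.mean(np.log(ratios)))

print("{:.4f} {:.4f}".format(score, geo_mean))  # 1.5874 1.5874
```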

## mape(): Mean Absolute Percent Error

This function returns the *Mean Absolute Error* expressed as a percentage score, where 100% represents a perfect prediction (no error at all).

Expressing the prediction score as a percentage has the advantage of putting the error in context. Saying that a model has a MAPE of 88% is more informative about its performance than saying that its MAE is 1.5 km/h.

    def mape(y_pred, y) -> float:
        """
        Define a performance metric in percentage

        Returns:
        --------
        mean absolute percentage score

        """

        # Return percentage value
        return 100 - np.mean(100 * (mean_absolute_error(y_pred, y) / y))
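
As a quick check of the formula, the sketch below applies *mape()* to two hand-picked values. The numbers are illustrative and not taken from the dataset. Note that *mean_absolute_error()* returns a single scalar, which is then divided by each element of *y* before averaging.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def mape(y_pred, y) -> float:
    # Same formula as the function above
    return 100 - np.mean(100 * (mean_absolute_error(y_pred, y) / y))

y_demo = np.array([1.0, 2.0])       # illustrative targets
y_pred_demo = np.array([1.1, 1.9])  # illustrative predictions

# MAE = 0.1, so the per-target percentages are 10% and 5%; their mean is 7.5%
demo_score = mape(y_pred_demo, y_demo)
print("{:.2f} %".format(demo_score))  # 92.50 %

# A perfect prediction gives a MAE of 0, hence a score of 100%
perfect_score = mape(y_demo, y_demo)
```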




# Define a baseline

A simple and direct way to evaluate our models is to compare them to a baseline built with *sklearn.dummy.DummyRegressor*.

As I've decided to evaluate my models with the *mean absolute error*, let's use this *DummyRegressor* with the parameter *strategy='mean'*.

In [3]:
from sklearn.dummy import DummyRegressor

# Load X and y dataset
X_tr, y_tr, X_va, y_va=load_Xy(frac=FRAC_VALUE_FOR_ML)

dummy = DummyRegressor(strategy='mean')
dummy.fit(X_tr, y_tr)

y_pred=dummy.predict(X_va)

print("Dummy classifier accuracy in km/h       : {:.2f} km/h".format(mae(y_pred, y_va)))
print("Dummy classifier accuracy in percentage : {:.2f} %".format(mape(y_pred, y_va)))

Dummy classifier accuracy in km/h       : 1.48 km/h
Dummy classifier accuracy in percentage : 84.02 %


In [4]:
# Save model for later use
save_model(model=dummy, name='dummy')


Saving model dummy to ./data/model-dummy.sav using 'pickle' library


# GridSearchCV scoring function

GridSearchCV uses an internal scoring function to measure prediction performance for each parameter combination tested, and ranks the combinations to determine the best one.

I've decided not to use the default scoring function (which for regressors is the estimator's *R2* score) and to define my own using [sklearn.metrics.make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) and the *mape()* function defined above.

Using this scoring function will let me draw more understandable training graphs, as the results on the Y-axis will be expressed as a *percentage of performance*.

The two following lines will be added to the [my_utils](my_utils.ipynb) library to make the scorer available in the ML training Notebooks.

    from sklearn.metrics import make_scorer
    custom_scorer = make_scorer(mape, greater_is_better=True)

> Note: Setting the *greater_is_better* parameter to *True* tells the GridSearchCV object that the best result is the one with the highest *MAPE* value.
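
Here is a minimal sketch of how such a scorer can be plugged into a search. The data is synthetic, and the *Ridge* estimator and *alpha* grid are arbitrary choices for illustration; only *mape()* and *custom_scorer* come from the text above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

def mape(y_pred, y) -> float:
    # Same formula as the mape() function defined earlier
    return 100 - np.mean(100 * (mean_absolute_error(y_pred, y) / y))

custom_scorer = make_scorer(mape, greater_is_better=True)

# Synthetic data with strictly positive targets (assumption, for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 1.0 + X @ np.array([0.5, 0.2, 0.1]) + rng.normal(0, 0.01, size=200)

# scikit-learn invokes the wrapped function as score_func(y_true, y_pred),
# so the first parameter of mape() receives the true values here.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    scoring=custom_scorer,
    cv=3,
    return_train_score=True,  # needed to get mean_train_score in cv_results_
)
search.fit(X, y)
print(search.best_params_, "{:.2f} %".format(search.best_score_))
```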


# GridSearchCV results plotting function

As I've decided to use *RandomizedSearchCV* and *GridSearchCV* to tune some hyperparameters, I've reimplemented a function initially coded during my [course #4 project](https://github.com/epfl-extension-school/project-adsml19-c4-s11-3871-2111/blob/master/mylib.py) to draw the results of the search (using the *cv_results_* attribute of the fitted search object).

This will help to visually look for the best hyperparameters.

Header of the function is copied below, implementation is available in [my_utils](my_utils.ipynb) library.

    def plot_grid_search_results(results_df,
                                 x_param,
                                 y_param=['mean_test_score', 'mean_train_score'],
                                 semilogx=True,
                                 xlabel='',
                                 ylabel='Score (%)',
                                 title='GridSearch results',
                                 figsize=(15,10),
                                 std_params={'mean_test_score': 'std_test_score'},
                                 std_factor=1,
                                 show_best_result=['mean_test_score'],
                                 greater_is_best=False
                                ) -> None:
        """
        Function to graph data points from GridSearchCV results, used to graph
        the mean test score of a fitted GridSearchCV object.

        Mandatory parameters are:
            results_df: A dataframe built from the GridSearchCV.cv_results_ attribute
            x_param: The column name of the results_df dataframe to be used as X axis
            y_param: A list of columns to be plotted on the Y axis.

        Optional parameters:
            semilogx: If True, the X data points are plotted using a log10 scale
            xlabel: Label of the X axis
            ylabel: Label of the Y axis
            title: Title of the graph
            figsize: Size of the graph
            std_params: A dict with key=y_param element and value the corresponding
                        standard deviation column name.
                        This parameter is used to draw the standard deviation of
                        the y_param columns as a filled area around the data plot
            std_factor: This parameter is used to amplify the standard deviation
                        when building the std dev filled area. Default value is 1;
                        increasing it makes 'small' standard deviation
                        behaviours visible.
                        Be warned that when changing this parameter to a value other
                        than 1, the filled area does not represent absolute values
                        but a trend.

        The function will also determine, for each of the y_param to be plotted,
        which data point has the highest y_param value, and use its coordinates
        to draw a red cross on the plotted line, along with horizontal and vertical
        lines to the X and Y axis.

        For that purpose, the function first sorts the results_df dataframe using
        the x_param column in ascending order.

        Returns:
        --------
        None

        """
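
The *results_df* argument expected by this function can be prepared as sketched below. The search itself is a toy example on synthetic data (the *Ridge* estimator and *alpha* grid are assumptions for illustration). The final call is left as a comment because the implementation lives in *my_utils*.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy search on synthetic data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-2, 3, 6)},
    cv=3,
    return_train_score=True,  # provides the mean_train_score column
)
search.fit(X, y)

# One row per parameter combination, one column per metric
results_df = pd.DataFrame(search.cv_results_)
results_df["param_alpha"] = results_df["param_alpha"].astype(float)
results_df = results_df.sort_values("param_alpha").reset_index(drop=True)

# Row with the highest mean test score (the point the plot would highlight)
best_row = results_df.loc[results_df["mean_test_score"].idxmax()]
print(best_row[["param_alpha", "mean_test_score", "std_test_score"]])

# plot_grid_search_results(results_df, x_param="param_alpha")  # from my_utils, not defined here
```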

# Functions to save and load fitted models

As in my [course #4 project](https://github.com/epfl-extension-school/project-adsml19-c4-s11-3871-2111), I will save the trained models to disk using the *pickle.dump()* function.

To do so, I've coded three functions in [my_utils](my_utils.ipynb): *get_model_filename()*, *save_model()* and *load_model()*.

## Function headers

### get_model_filename()

    def get_model_filename(model_name) -> str:
        """
        Basic function that will return the filename used to store on disk
        the model passed as parameter

        Returns:
        --------
        str

        """

### save_model()

    def save_model(model, name) -> None:
        """
        Function that saves to disk the fitted model passed as first
        parameter, using the pickle or keras library depending on the model type.
        It uses the function get_model_filename() with the 'name'
        parameter to get the filename where to save the model.

        Returns:
        --------
        None

        """
        
### load_model()

    def load_model(name):
        """
        Function that loads from disk the model whose name is passed
        as first parameter. It uses the function get_model_filename() with
        the 'name' parameter to get the filename from which to load the model.

        Returns:
        --------
        Fitted model

        """

## How to use these functions?

Saving a trained model is as easy as:

    save_model(<fitted_model>, 'model_name')

Loading it will be:

    model=load_model('model_name')
    
Models are saved on disk, in the *data* directory, with the following filename pattern:

    model-<model_name>.sav
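
For reference, a pickle-only version of these three functions could look like the sketch below. It is not the actual *my_utils* implementation: the keras branch is left out, and the *data_dir* parameter is a hypothetical addition so that the example can write to a temporary directory.

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyRegressor

def get_model_filename(model_name, data_dir="./data") -> str:
    # Build the path following the model-<model_name>.sav pattern
    return os.path.join(data_dir, "model-{}.sav".format(model_name))

def save_model(model, name, data_dir="./data") -> None:
    # Serialize the fitted model with pickle
    os.makedirs(data_dir, exist_ok=True)
    with open(get_model_filename(name, data_dir), "wb") as f:
        pickle.dump(model, f)

def load_model(name, data_dir="./data"):
    # Read the model back from disk
    with open(get_model_filename(name, data_dir), "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory, with a tiny made-up training set
with tempfile.TemporaryDirectory() as tmp:
    dummy = DummyRegressor(strategy="mean").fit([[0], [1], [2]], [1.0, 2.0, 3.0])
    save_model(dummy, "dummy", data_dir=tmp)
    restored = load_model("dummy", data_dir=tmp)
    restored_prediction = restored.predict([[5]])[0]

print(restored_prediction)  # 2.0
```

Pickled models should only be loaded from trusted files, and preferably with the same scikit-learn version that was used to save them.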

# Time to go...

To the first model: [Gradient Descent Based - Ridge Regressor](31.Gradient%20Descent%20Based%20-%20Ridge%20Regressor.ipynb)