# MLflow tracking

In this notebook, we implement the processing chain of the *food forecasting* problem using the [tracking](https://www.mlflow.org/docs/latest/tracking.html) and [flavours](https://www.mlflow.org/docs/latest/models.html#model-customization) APIs of [MLflow](https://mlflow.org/).

<img src="/mlflow_training/images/mlflow_tracking.jpg" style="width: 600px;"/>

This notebook is organized in two parts:
* [MLflow tracking: tutorial](#part1)
    * [Generate logs](#spart11)
    * [Organize logs into runs](#spart12)
    * [Organizing runs into experiments](#spart13)
* [Foodcast processing chain with MLflow](#part2)
    * [Loading](#spart21)
    * [Offline feature engineering](#spart22)
    * [Validating](#spart23)
    * [Training](#spart24)
    * [Online feature engineering](#spart25)
    * [Predicting](#spart26)

## Setup

In [None]:
!wget https://storage.googleapis.com/mlflow-formation/requirements.txt
!wget https://storage.googleapis.com/mlflow-formation/mlflow_training.zip
!unzip -qq /content/mlflow_training.zip
!pip install -r requirements.txt --quiet
!rm -rf mlflow_training.zip requirements.txt sample_data __MACOSX

### Now let's restart the kernel so that the installed libraries get loaded!
To do so, click on Runtime --> Restart runtime (labeled "Restart session" in recent versions of Colab).

___
# MLflow tracking: tutorial

<a class='anchor' id='part1'></a>

In this section, we will discover the basics of [MLflow tracking](https://www.mlflow.org/docs/latest/tracking.html).

In [None]:
import os
import sys
sys.path.append('/content/mlflow_training/')
import yaml
import logging
import logging.config
from foodcast.domain.transform import etl
from foodcast.domain.multi_model import MultiModel
from foodcast.application.mlflow_utils import mlflow_log_pandas, mlflow_log_plotly
from sklearn.ensemble import RandomForestRegressor
import foodcast.settings as settings
import mlflow
import mlflow.sklearn
import mlflow.pyfunc

with open(settings.LOGGING_CONFIGURATION_FILE, 'r') as f:
    logging.config.dictConfig(yaml.safe_load(f.read()))

%load_ext autoreload
%autoreload 2

The following cell starts the MLflow UI on port 5000 of the Google Colab machine and exposes it through a publicly accessible ngrok URL. <br>
ngrok requires an auth token: its website will prompt a login page, where you can create an account with either real or fake information ;)<br>
You can then copy the auth token into the `NGROK_AUTH_TOKEN` variable below.

In [None]:
from pyngrok import ngrok

get_ipython().system_raw("mlflow ui --port 5000 &")

ngrok.kill()

NGROK_AUTH_TOKEN = ""
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

## Generate logs

<a class='anchor' id='spart11'></a>

The general idea is to save information in files. This saving process is called logging.

### Logging parameters

The simplest information to log is the parameter. A parameter is a key-value pair: the key is a name (a string), and the value is a basic Python object (`float`, `str`, etc.).

**Exercise:** log an `age` parameter, containing your age (in years).

**Hint:** we will use the method [mlflow.log_param](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_param).

In [None]:
pass

**Question:** A new directory has just been created: where? what is it called? where is the information logged?

### Navigate the graphical interface

<a class='anchor' id='ui'></a>

Throughout the following, [MLflow](https://mlflow.org/) runs will be viewable via a built-in GUI.

This GUI is simply a utility that reads the `mlruns` directory created by [MLflow](https://mlflow.org/).

**Exercise:** find the logged information by navigating the [MLflow](https://mlflow.org/) GUI.

In [None]:
# Follow the steps above to achieve this exercise

**Exercise:** log two parameters in one line of code: `age` (your age) and `neighbor_age` (the age of your left neighbor).

**Hint:** we will use the method [mlflow.log_params](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_params).

In [None]:
pass

### Logging a standard model

Beyond parameters, [MLflow](https://mlflow.org/) provides a convention for storing predictive models.

**Exercise:** log a `RandomForestRegressor` of any kind into a directory called `my_random_forest`.

**Hint:** we can base this on [mlflow.sklearn.log_model](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.log_model).

In [None]:
pass

**Question:** What is a model in the [MLflow](https://mlflow.org/) convention?

### Logging a custom model

If you want to think outside the box and log a home-made model, it must inherit from the `PythonModel` class of [MLflow](https://mlflow.org/) (this is the case, for example, of our `MultiModel`). In addition, to log this model (and later deploy it), we need to provide extra information, namely:
* the code that defines the model's API and allows it to be deserialized
* the model's dependencies, in the form of a virtual deployment environment.

In [None]:
mlflow.pyfunc.log_model(
    python_model=MultiModel(RandomForestRegressor()),
    artifact_path='my_multi_model',
    code_path=[os.path.join('..', 'foodcast', 'domain', 'multi_model.py')],
    conda_env={
        'channels': ['defaults', 'conda-forge'],
        'dependencies': [
            'mlflow=1.8.0',
            'numpy=1.17.4',
            'python=3.7.6',
            'scikit-learn=0.21.3',
            'cloudpickle==1.3.0'
        ],
        'name': 'multi-model-env'
    }
)


**Question:** What differs from the previous example in the `mlruns` directory?

### Logging files

Finally, for everything else, [MLflow](https://mlflow.org/) lets you log files. As an illustration, we use a dataframe, which is neither a parameter nor a model.

In [None]:
data = etl(settings.DATA_DIR, 199, 200)

Since [MLflow](https://mlflow.org/) doesn't specifically provide a `log` method for dataframes, you have to save it to a file first and then log the file.

**Exercise:** save `data` to a `data.csv` file.

**Hint:** we will not save the index.

In [None]:
pass

**Exercise:** log the file `data.csv` in a directory `data`.

**Hint:** we will use the method [mlflow.log_artifact](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_artifact).

In [None]:
pass

**Exercise:** delete the `data.csv` file that is floating around in your `notebooks` directory.

**Hint:** you can run a terminal command directly in a jupyter cell, prefixing it with a `!`.

In [None]:
pass

**Question:** Is the logged file `data.csv` still in `mlruns`?

### Factorization

One pattern keeps coming back with [MLflow](https://mlflow.org/):
* save data locally
* log the local data in a run, via `log_artifact`.

The drawback is that it clutters the working directory with intermediate files. To avoid this, you can use the (home-made) `mlflow_utils` module.

The functions `mlflow_log_pandas` and `mlflow_log_plotly` have the same arguments as `mlflow.log_artifact`, but do not pollute the current directory. Instead, the intermediate data (before being logged) is stored in your `/tmp` directory, which is emptied every time you restart your computer.

In [None]:
mlflow_log_pandas??

**Exercise:** log the dataframe `data` in `artifacts/data/data.csv` using `mlflow_log_pandas`.

In [None]:
pass

## Organize logs into runs

<a class='anchor' id='spart12'></a>

For now, all of our previous operations have been logged in the same place. If we rerun the previous cells, all the information will be overwritten. This is because we have been working in a single run until now.

Runs are the basic structure of [MLflow tracking](https://www.mlflow.org/docs/latest/tracking.html): they separate the logged information into different directories.

### Encapsulate the logs
In order to get a real history of the actions, we have to encapsulate the logs in runs.

First of all, the current run must be terminated.

In [None]:
mlflow.end_run()

Then, we can use runs as context managers, with the `with` keyword.

In [None]:
with mlflow.start_run():
    mlflow_log_pandas(data, 'data', 'data.csv')

**Exercise:** run the previous cell twice.

In [None]:
# Come on, this is an easy one.

**Question:** While browsing the GUI, what do you notice about the contents of the `mlruns` directory?

### Examining a run
By combining `with` with `as`, we can save the run in a `my_run` variable. We can even give it a `run_name`.

In [None]:
with mlflow.start_run(run_name='my_run_name') as my_run:
    mlflow_log_pandas(data, 'data', 'data.csv')

**Exercise:** find the id of the previous run without using the graphical interface.

**Hint:** we can use the [run info](https://www.mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Run).

In [None]:
pass

**Exercise:** find the complete path where the run saves the artifacts, without using the GUI.

**Hint:** we can use the [run info](https://www.mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Run).

In [None]:
pass

**Exercise:** find the name of the run, without using the graphical interface.

**Hint:** we can use the [run tags](https://www.mlflow.org/docs/latest/python_api/mlflow.entities.html#mlflow.entities.Run).

In [None]:
pass

## Organizing runs into experiments

<a class='anchor' id='spart13'></a>

Rather than putting all the runs in one place, we can arrange them in specific directories, called experiments. The default experiment is called `Default` in the GUI, and corresponds to the subdirectory `0` in `mlruns`.

**Tip:** You can think of experiments as *feature* branches in git: as soon as you want to develop a new feature, you create a corresponding experiment, so that you don't mix runs that have nothing to do with each other.

### Create an experiment

You can create an experiment in python or in command line.

**Exercise:** create an experiment called `my_experiment`.

**Hint:** the function [set_experiment](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_experiment) makes the given experiment the active one, and creates it first if it does not exist.

In [None]:
pass

### Choosing an experiment

The experiment can be chosen when a run is started, via the `experiment_id` argument of `start_run`. However, now that `my_experiment` is the active experiment, there is no need to even bother!

**Exercise:** take the last run you did and run it again in `my_experiment`.

In [None]:
# Come on, the code is elsewhere

**Question:** What happened in the UI?

### Delete an experiment

The safest and most permanent way to delete an experiment is to delete the corresponding directory in `mlruns` and the contents of `mlruns/.trash`.

**Exercise:** delete the experiment `my_experiment`.

**Hint**: You can run a unix command in a jupyter cell with the prefix `!`.

In [None]:
pass

**Question:** Does the experiment still appear in the UI?

___
# Foodcast processing chain with MLflow

<a class='anchor' id='part2'></a>

As a reminder, the scenario is a weekly sales forecast. The training set is a sliding window in one-week steps, and the prediction set is always the following week.

In [None]:
import os
import sys
sys.path.append('/content/mlflow_training/')
import yaml
import logging
import logging.config
from foodcast.domain.transform import etl
from foodcast.domain.feature_engineering import features_offline, features_online
from foodcast.domain.forecast import span_future, cross_validate, plotly_predictions
from foodcast.domain.multi_model import MultiModel
from foodcast.application.mlflow_utils import mlflow_log_pandas, mlflow_log_plotly
from sklearn.ensemble import RandomForestRegressor
import foodcast.settings as settings
import mlflow
import mlflow.sklearn
import mlflow.pyfunc

with open(settings.LOGGING_CONFIGURATION_FILE, 'r') as f:
    logging.config.dictConfig(yaml.safe_load(f.read()))

%load_ext autoreload
%autoreload 2

If this notebook were "put into production", it would have to be rerun each week with the following parameters updated:
* `start_week`: the number of the first week of training,
* `end_week`: the number of the last week of training,
* `next_week`: the number of the prediction week.

In [None]:
start_week = 180
end_week = 200
next_week = 201

## Creating and selecting a dedicated experiment
Before starting, we propose to create a dedicated foodcast experiment.

In [None]:
mlflow.set_experiment('foodcast')

## Tip

In the following, make sure that the first lines of code after the `with mlflow.start_run()` are the ones that log the run parameters. In case of a crash, you will have logged the information and can investigate more easily.

## Loading

<a class='anchor' id='spart21'></a>

Loads and cleans up the training data on the previously defined sliding window. The data is the sales history of the two restaurants under consideration over the sliding training window defined at the beginning of this notebook.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `load`
* logs the `start_week` and `end_week` parameters
* loads and cleans up the input data `data` via the `etl` function
* logs `data` in `data/data.csv`.

In [None]:
pass

## Offline feature engineering

<a class='anchor' id='spart22'></a>

Feature engineering and train/test separation.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `features`
* logs the parameters `start_week` and `end_week`
* performs feature engineering via the `features_offline` function
* performs the feature/target separation `x_train` / `y_train`
* logs the resulting dataframes in `training_set/x_train.csv` and `training_set/y_train.csv`
* sets the event date as the index of `x_train` and `y_train`

In [None]:
with mlflow.start_run(run_name='features'):
    # TODO: log parameters
    train = None
    x_train, y_train = None, None
    # TODO: log x_train
    # TODO: log y_train
    # x_train = x_train.set_index('order_date')
    # y_train = y_train.set_index('order_date')['cash_in']

## Validating

<a class='anchor' id='spart23'></a>

Model instantiation and chronological cross validation.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `validate`
* logs the parameters `start_week`, `end_week`, `n_fold`, `n_estimators`, `n_models`
* instantiates a `MultiModel` random forest (`n_estimators=10`, `n_models=10`)
* validates the model by 10-fold cross-validation via the `cross_validate` function
* logs the resulting predictions in `cross_validation/predictions.csv` (don't forget to `reset_index()`)
* logs the figure obtained via `plotly_predictions` in `plots/validation.html`
* logs the minimum and maximum MAE metrics for each validation step
* logs the MAE metric of each estimator of the multi-model for each validation step

In [None]:
with mlflow.start_run(run_name='validate'):
    n_fold = 10
    n_estimators = 10
    n_models = 10
    # TODO: log parameters
    model = None
    maes, preds_train = None, None
    # TODO: log preds_train.reset_index()
    fig = None
    # TODO: log fig with mlflow_log_plotly
    # for i, mae in enumerate(maes):
    #     mlflow.log_metric('MAE_MIN', mae.min(), step=i)
    #     mlflow.log_metric('MAE_MAX', mae.max(), step=i)
    #     for j, result in enumerate(mae):
    #         mlflow.log_metric('MAE{}'.format(j), result, step=i)

**Exercise:** visualize the range of possible MAEs in the graphical interface.

In [None]:
fig

## Training

<a class='anchor' id='spart24'></a>

Training the model on the entire training set.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `train`
* logs the parameters `start_week`, `end_week`, `n_estimators`, `n_models`
* trains the multi-model on the training set `x_train` / `y_train`
* logs its `single_estimator` attribute, which is a standard scikit-learn model, in the `simple_model` directory
* logs the complete model, which is a custom model, in the `multi_model` directory

In [None]:
pass

## Online feature engineering

<a class='anchor' id='spart25'></a>

Feature engineering and building the prediction set.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `future`
* logs the `next_week` parameter
* loads and cleans up a recent past week `past` to calculate turnover lags, via the `etl` function
* generates a prediction set `x_pred` via the `span_future` function
* performs online feature engineering via the `features_online` function
* logs the resulting prediction set in `prediction_set/x_pred.csv`

In [None]:
with mlflow.start_run(run_name='future'):
    # TODO: log parameters
    past = etl(settings.DATA_DIR, next_week - 1, next_week - 1)
    x_pred = span_future(past['order_date'].max())
    x_pred = None
    # TODO: log x_pred

## Predicting

<a class='anchor' id='spart26'></a>

Predicting with the trained model over the next week.

**Exercise:** create an [MLflow](https://mlflow.org/) run that:
* is called `predict`
* logs the parameters `next_week`, `start_week`, `end_week`, `n_estimators`, `n_models`
* sets the date as the index of `x_pred`
* predicts the turnover `y_pred` on `x_pred`
* logs the predictions in `predictions/y_pred.csv` (don't forget to `reset_index()`)
* logs the figure obtained via `plotly_predictions` in `plots/predictions.html`

In [None]:
with mlflow.start_run(run_name='predict'):
    # TODO: log parameters
    # x_pred = x_pred.set_index('order_date')
    y_pred = None
    # TODO: log y_pred.reset_index()
    fig = None
    # TODO: log fig with mlflow_log_plotly

In [None]:
fig

# Congrats!

You now master the [tracking](https://www.mlflow.org/docs/latest/tracking.html) and [model flavours](https://www.mlflow.org/docs/latest/models.html#model-customization) parts of MLflow!