# Foodcast - food forecasting
In this notebook, we explore the building blocks of a weekly sales forecasting problem. [MLflow](https://mlflow.org/) will be introduced later.

The dataset is [this one](https://www.kaggle.com/henslersoftware/19560-indian-takeaway-orders) and is located in the `data` directory of the project. The `data/raw` directory contains the data as is.

The scenario is as follows: a restaurant chain has several restaurants in a given city. Each restaurant records its total sales. The chain wants to be able to predict its sales volume from one week to the next, for all locations.

The idea is to reproduce production conditions, with a weekly prediction rhythm. Each week, a processing and forecasting pipeline must be run. To emulate this kind of environment, we split the data into weekly batches in `data/batches` using the `reformatting.py` script. For simplicity, each week is identified by an integer.

<img src="https://mlflow-training.s3.eu-west-3.amazonaws.com/data.png" style="width: 400px;"/>

In this notebook, we discuss the following pipeline steps:
* [Importing libraries](#part0)
* [Loading and cleaning data](#part1)
* [Feature engineering on the training set (offline)](#part2)
* [Training a predictive model](#part3)
* [Feature engineering on the prediction set (online)](#part4)
* [Prediction and visualization](#part5)
* [Modeling uncertainties](#part6)

**Note:** all elementary functions that implement these steps are **already coded**. In this notebook, it is just a matter of getting familiar with them.

## Setup

Here, we download the project code from remote cloud storage and install the requirements.

In [None]:
!wget https://mlflow-training.s3.eu-west-3.amazonaws.com/requirements.txt
!wget https://mlflow-training.s3.eu-west-3.amazonaws.com/mlflow_training.zip
!unzip -qq /content/mlflow_training.zip
!pip install -r requirements.txt --quiet
!rm -rf mlflow_training.zip requirements.txt sample_data __MACOSX

### Now let's restart the kernel so that the installed libraries get loaded!
To do so, click on Runtime --> Restart runtime (labelled "Restart session" in recent Colab versions).

# Library imports

<a class='anchor' id='part0'></a>

The project's structure is as follows :

<img src="https://mlflow-training.s3.eu-west-3.amazonaws.com/tree.png" style="width: 300px;"/>

In [None]:
import sys
sys.path.append('/content/mlflow_training/')
import yaml
import logging
import logging.config
import pandas as pd
pd.set_option('display.min_rows', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
pd.set_option('max_colwidth', 400)
from foodcast.domain.transform import etl
from foodcast.domain.feature_engineering import features_offline, features_online
from foodcast.domain.forecast import span_future, cross_validate, plotly_predictions
from foodcast.domain.multi_model import MultiModel
from sklearn.ensemble import RandomForestRegressor
import foodcast.settings as settings
import plotly.graph_objects as go

with open(settings.LOGGING_CONFIGURATION_FILE, 'r') as f:
    logging.config.dictConfig(yaml.safe_load(f.read()))

%load_ext autoreload
%autoreload 2

# Data loading and cleaning

<a class='anchor' id='part1'></a>

In this section, we'll focus on data preprocessing.

### Elementary functions

The following pre-processing takes place in four steps, each encoded in an elementary function:
* `extract`: loads the data from a restaurant in a user-defined time interval
* `clean`: cleans the corresponding dataset:
    * homogenization of column names
    * type corrections
    * aggregation of the amount at the transaction level
    * deletion of useless columns
    * chronological sorting
* `merge`: merges the data from each restaurant into a single dataframe representing the restaurant chain
* `resample`: resamples the dataset at the hour level
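
To make the last two steps concrete, here is a minimal sketch of a merge followed by an hourly resampling on toy data. It is not the code behind `etl`: the two restaurant dataframes and their values are made up, and only the column names `order_date` and `cash_in` come from this notebook.

```python
import pandas as pd

# Toy transaction-level data for two hypothetical restaurants (values are made up).
restaurant_a = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-11-05 18:05', '2018-11-05 18:40', '2018-11-05 21:15']),
    'cash_in': [12.5, 30.0, 8.0],
})
restaurant_b = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-11-05 18:20', '2018-11-05 19:10']),
    'cash_in': [20.0, 15.5],
})

# Merge: stack the restaurants and sort chronologically.
chain = (
    pd.concat([restaurant_a, restaurant_b], ignore_index=True)
    .sort_values('order_date')
)

# Resample: sum the revenue within each hour; hours without any order get 0.
hourly = (
    chain.set_index('order_date')['cash_in']
    .resample('h')
    .sum()
    .reset_index()
)
print(hourly)
```

Note that the 20:00 slot has no order in the toy data, so the resampled series contains an explicit zero there, which keeps the time grid regular for the lag features computed later.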


### Exercises

These four functions are encapsulated in a single master function, called `etl`, which is the subject of the next exercise.

In [None]:
etl??

**Exercise :** extract a fully pre-processed dataset for weeks 197 to 200.

**Hint :** the name of the directory where the data is located is stored in `settings.DATA_DIR`.

In [None]:
df = None

**Exercise :** plot the revenue against time with [plotly](https://plotly.com/python/line-charts/#line-plot-with-goscatter).

In [None]:
fig = go.Figure()
pass
fig.update_layout(
    title='Cash-in',
    xaxis_title='date',
    yaxis_title='dollars',
    font=dict(
        family='Computer Modern',
        size=18,
        color='#7f7f7f'
    )
)

# Feature engineering on the training set (offline)

<a class='anchor' id='part2'></a>

In this section, we focus on the feature engineering and the creation of the training set.

### Elementary functions
The following feature engineering is done in three steps, each already implemented in one of these functions:
* `dummy_day` : encodes the day of the week into 6 binary features.
* `hour_cos_sin` : encodes the hour of the day into 2 continuous features.
* `lag_offline` : retrieves the revenue from exactly one week earlier.

Implementation-wise, `lag_offline` is just a `shift` of the target variable in the training set.
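
The three transformations can be sketched on synthetic hourly data as follows. This is an illustration under assumptions, not the project's implementation: the column names `hour_cos` and `hour_sin`, the `day_` prefix and the final `dropna` are choices made for this example, while `order_date`, `cash_in` and `lag_1W` are the names used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy hourly revenue over two weeks (values are synthetic).
dates = pd.date_range('2018-10-22', periods=2 * 7 * 24, freq='h')
df = pd.DataFrame({'order_date': dates, 'cash_in': np.arange(len(dates), dtype=float)})

# Day of the week as 6 binary columns: one of the 7 levels is dropped
# because it is implied by the other six.
day_names = df['order_date'].dt.day_name()
dummies = pd.get_dummies(day_names, prefix='day', drop_first=True).astype(int)

# Hour of the day mapped onto a circle, so that 23:00 and 00:00 end up close together.
angle = 2 * np.pi * df['order_date'].dt.hour / 24
df['hour_cos'] = np.cos(angle)
df['hour_sin'] = np.sin(angle)

# Lag of exactly one week: with hourly rows, that is a shift of 7 * 24 positions.
df['lag_1W'] = df['cash_in'].shift(7 * 24)

# The first week has no lag available, so those rows are dropped here (an assumption).
df = pd.concat([df, dummies], axis=1).dropna(subset=['lag_1W'])
```

The shift only gives a true one-week lag if the data has exactly one row per hour with no gaps, which is what the hourly resampling guarantees.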

### Exercises

These three functions are encapsulated in a single master function, called `features_offline`, which is the subject of the following exercise.

In [None]:
features_offline??

**Exercise :** Perform the feature engineering on the training set previously loaded.

In [None]:
df = None

**Exercise :** check by hand on one or two lines the validity of the variable `lag_1W` just created.

In [None]:
pass

### Features / target split

We now split the dataset into feature variables and target variable, making sure to keep the date information as the index.

In [None]:
# Uncomment to complete the exercise

# x_train = df.drop(columns=['cash_in'])
# y_train = df[['order_date', 'cash_in']]
# x_train = x_train.set_index('order_date')
# y_train = y_train.set_index('order_date')['cash_in']

# Training of a predictive model

<a class='anchor' id='part3'></a>

In this section, we focus on the training of a predictive model and its validation.

### A first basic model
To begin with, we introduce a random forest model with 10 trees. 

**Exercise :** create a `RandomForestRegressor` made of 10 trees, with a fixed random seed (of your choice).

In [None]:
simple_model = None

### Temporal cross-validation on the training set

Temporal cross-validation is natural for a forecasting problem. It is in fact natural for any model life cycle subject to data drift.

<img src="https://mlflow-training.s3.eu-west-3.amazonaws.com/timeseriessplit.png" style="width: 300px;"/>

The `cross_validate` function implements it, based on [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) from scikit-learn.
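
For readers unfamiliar with `TimeSeriesSplit`, the sketch below shows the general pattern on synthetic data: each fold trains on the past only and is scored with the MAE on the block that follows. It does not reproduce the signature or the return values of `cross_validate`; the arrays `x` and `y` are made-up stand-ins for a chronologically ordered training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data, assumed to be sorted chronologically.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=10, random_state=0)
fold_maes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(x):
    # Training indices always precede test indices: no information leaks from the future.
    assert train_idx.max() < test_idx.min()
    model.fit(x[train_idx], y[train_idx])
    fold_maes.append(mean_absolute_error(y[test_idx], model.predict(x[test_idx])))

print(fold_maes)
```

Unlike a shuffled K-fold, the training window grows from one fold to the next, which mimics a model retrained every week on all the data seen so far.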

In [None]:
cross_validate??

**Exercise :** Validate the model on the training set 3 times (3 folds).

In [None]:
maes, preds = None, None

**Question :** In what unit is the MAE expressed? Given the consumption patterns, is this really a relevant performance indicator?

### Plot predictions

We can plot the predictions with the `plotly_predictions` function.

In [None]:
plotly_predictions??

**Exercise :** Plot predictions obtained with cross-validation, comparing to real targets.

In [None]:
pass

### Training on whole data set

The `fit` method of `RandomForestRegressor` is what we need here.

**Exercise :** train the model on the whole training set.

**Hint :** We can use the `x_train` and `y_train` previously obtained.

In [None]:
pass

# Feature engineering on the prediction data (online)

<a class='anchor' id='part4'></a>

In this section, we'll focus on the creation of the prediction dataset and its feature engineering.

### Elementary functions
The feature engineering that follows takes place in four steps, each encoded in an elementary function:
* `span_future`: generates the prediction dates in the future.
* `dummy_day`: encodes the day of the week into 6 binary variables.
* `hour_cos_sin` : encodes the time of day in 2 continuous variables.
* `lag_online` : retrieves the revenue of a week in the past.

### Why lag offline and lag online?

Compared to the training set, it is more difficult to compute a lag of revenue on the prediction set because the latter is by definition in the future, and contains no past information.

Two methods are possible:

* **the RAM-expensive method:** concatenate `train` and `future`, then perform a `shift`. If `train` is large, it takes up a lot of memory even though only a small part of it is needed.
* **the recommended method:** load only the observations of the past week, `past`, concatenate them with `future`, then perform a `shift`. This uses very little memory (only one week of data).
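
The recommended method can be sketched as follows on synthetic data. This is only an illustration of the idea, not the body of `features_online`: the dates and revenue values are made up, and both weeks are assumed to be hourly with no gaps.

```python
import numpy as np
import pandas as pd

# `past`: the last observed week, hourly, with known revenue (synthetic values).
past = pd.DataFrame({
    'order_date': pd.date_range('2018-10-29', periods=7 * 24, freq='h'),
    'cash_in': np.arange(7 * 24, dtype=float),
})
# `future`: the week to predict, where the revenue is unknown.
future = pd.DataFrame({
    'order_date': pd.date_range('2018-11-05', periods=7 * 24, freq='h'),
})

# Concatenate one week of history with the future dates, then shift by one week.
both = pd.concat([past, future], ignore_index=True)
both['lag_1W'] = both['cash_in'].shift(7 * 24)

# Keep only the future rows: each one now carries the revenue observed a week earlier.
first_future_date = future['order_date'].min()
future = (
    both.loc[both['order_date'] >= first_future_date, ['order_date', 'lag_1W']]
    .reset_index(drop=True)
)
```

Only two weeks of rows are ever held in memory, regardless of how long the training history is.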

### Exercises

These four steps are exposed through two functions, `span_future` and `features_online`, which are the subject of the next exercises.

First, we generate a `past` set that corresponds to the week just before the prediction week.

**Exercise:** create a cleaned `past` dataset describing week 200.

**Hint:** we can reuse the `etl` function.

In [None]:
past = None

Then, we need to generate the prediction data. The `span_future` function takes care of this.

In [None]:
span_future??

**Exercise :** generate a dataframe of dates to predict in the future (compared to the training data).

**Hint :** we can use `past['order_date'].max()` as a starting point to generate our data.

In [None]:
future = None

All the steps of online feature engineering are gathered in the `features_online` function.

In [None]:
features_online??

**Exercise :** create a future prediction set, using the recommended lag online method (see above).

In [None]:
future = None

**Exercise :** check by hand on one or two lines the validity of the created variable.

**Hint :** we can, for example, look in detail at the variables calculated on November 5, 2018 at 6pm.

In [None]:
pass

We keep the date information on the index of the prediction data.

In [None]:
# Uncomment to complete the exercise

# future = future.set_index('order_date')

# Prediction and visualization

<a class='anchor' id='part5'></a>

In this section, we focus on predicting the revenue for the period that follows the training set.

### Revenue prediction

It is the `predict` method of `RandomForestRegressor` that comes into play.

**Exercise :** predict the revenue on the prediction dataset. 

**Hint :** we'll gather the predictions into a `DataFrame` with the same index as `future` and only one column called `y_pred_simple`.

In [None]:
y_pred = None

### Predictions plot

**Exercise :** plot the predictions with the `plotly_predictions` function.

In [None]:
pass

# Modeling uncertainties (Optional)

<a class='anchor' id='part6'></a>

In this section, we add uncertainty to our predictions. A simple way to obtain uncertainty in the results is to perturb both the dataset and the model, as shown below.

<img src="https://mlflow-training.s3.eu-west-3.amazonaws.com/multimodel.png" style="width: 500px;"/>

It is the `MultiModel` class that implements this schema: 
* bootstrapping on the data
* variation of the random seed of the model (if it exists).
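
The scheme can be sketched with scikit-learn alone. The code below is a hedged illustration of bootstrapping plus seed variation, not the `MultiModel` class itself: the data is synthetic, and using the minimum and maximum across replicas as a prediction range is a choice made for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 3))
y = x[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)

base_model = RandomForestRegressor(n_estimators=10, random_state=0)
replicas = []
for seed in range(10):
    # Bootstrap: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(x), size=len(x))
    # Same hyperparameters, but a different random seed for each replica.
    replica = clone(base_model).set_params(random_state=seed)
    replicas.append(replica.fit(x[idx], y[idx]))

# One column of predictions per replica; the spread across columns reflects the uncertainty.
preds = np.column_stack([m.predict(x[:5]) for m in replicas])
lower, upper = preds.min(axis=1), preds.max(axis=1)
```

Each row of `preds` holds ten predictions for the same observation, so a narrow `[lower, upper]` interval means the replicas agree despite the perturbations.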

In [None]:
MultiModel?

**Exercise :** Implement a `MultiModel` made of 10 `simple_model` replicas.

In [None]:
multi_model = None

### Temporal cross-validation on the training set

**Exercise:** validate the model on the training set with three repetitions (folds).

**Hint:** the syntax is identical to that used for `simple_model`.

In [None]:
maes, preds = None, None

**Question:** what are the mean and standard deviation of the MAEs on each fold of the cross-validation?

**Hint:** `axis=1`

In [None]:
pass

### Plot of predictions

We can plot the predictions obtained via the `plotly_predictions` function. This function handles the predictions of a multi-model well. In particular, it does not plot a prediction curve but a *range* of predictions.

**Exercise:** plot the predictions obtained by cross-validation against the expected truth.

**Hint:** the syntax is identical to the one used for `simple_model`.

In [None]:
pass

### Training on the whole dataset

It is the `fit` method from `MultiModel` at play here.

In [None]:
MultiModel.fit??

**Exercise:** train the model on the whole training set.

**Hint:** we will use the `x_train` and `y_train` dataframes obtained previously.

In [None]:
pass

### Sales prediction

The `predict` method of `MultiModel` comes into play here.

In [None]:
MultiModel.predict??

**Exercise:** predict the revenue on the prediction set.

**Hint:** Be aware of the non-standard API of the `predict` method: it takes an additional argument, the `context`. This extra argument is needed for compatibility with [MLflow](https://mlflow.org/), and will become invisible later on.
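
To illustrate what such a signature looks like, here is a toy wrapper whose `predict` takes a leading `context` argument, in the `(context, model_input)` form used by MLflow's custom Python models. `ToyContextModel` is a hypothetical class written for this example; whether `MultiModel.predict` accepts `None` as its context is an assumption to check against `MultiModel.predict??`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


class ToyContextModel:
    """Hypothetical wrapper whose `predict` takes a leading `context` argument."""

    def __init__(self, model):
        self.model = model

    def fit(self, x, y):
        self.model.fit(x, y)
        return self

    def predict(self, context, model_input):
        # `context` is unused here; it only keeps the (context, model_input) signature.
        return self.model.predict(model_input)


# Synthetic data for the demonstration.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = x[:, 0] + x[:, 1]

toy = ToyContextModel(RandomForestRegressor(n_estimators=10, random_state=0)).fit(x, y)
toy_pred = toy.predict(None, x[:3])  # no context available outside MLflow, so pass None
```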

In [None]:
y_pred = None

### Visualization of predictions with uncertainty

The predictions obtained can be plotted using the `plotly_predictions` function.

**Exercise:** plot the sales forecast on the prediction set. 

**Hint:** the syntax is identical to that used for `simple_model`.

In [None]:
pass

# Congrats!

You now master the food forecasting use case, which is fully compatible with MLflow, as we'll see in the next lab.

### To go further

In the following lab, we'll see different MLflow functionalities:
* tracking and reproducibility
* model packaging
* visualization in the UI