# M5 using mlforecast

[mlforecast](https://nixtla.github.io/mlforecast/) is a framework to perform time series forecasting using machine learning models. It abstracts away most of the details and tries to mimic the scikit-learn API.

This notebook is inspired by https://www.kaggle.com/kneroma/m5-first-public-notebook-under-0-50.

# Environment setup

# Install distributed dependencies

In [None]:
%%capture
!pip install coiled dask==2021.04.1 distributed==2021.04.1 mlforecast[distributed]

# Build lightgbm from source
This is needed because there was a [bug](https://github.com/microsoft/LightGBM/issues/4026) in distributed training.

In [None]:
%%capture
%%bash
git clone --recursive https://github.com/microsoft/LightGBM.git /kaggle/tmp/LightGBM
cd /kaggle/tmp/LightGBM/python-package
python setup.py install

# Libraries

In [None]:
from functools import partial
from pathlib import Path

import coiled
import dask.dataframe as dd
import lightgbm as lgb
import numpy as np
import pandas as pd
from dask.distributed import Client
from mlforecast.core import TimeSeries
from mlforecast.distributed.forecast import DistributedForecast
from mlforecast.distributed.models.lgb import LGBMForecast
from window_ops.rolling import rolling_mean

from kaggle_secrets import UserSecretsClient

In [None]:
from packaging.version import Version
assert Version(lgb.__version__) > Version('3.2.1')  # compare as versions, not as strings

# Cluster setup

In [None]:
%%time
user_secrets = UserSecretsClient()
COILED_TOKEN = user_secrets.get_secret('COILED_TOKEN')  # fill this in Add-ons -> Secrets

cloud = coiled.Cloud(
    user='jose-moralez',  # your coiled user here
    token=COILED_TOKEN,
)
cluster = coiled.Cluster(
    name='m5-mlforecast',
    software='jose-moralez/mlforecast',
    n_workers=4,
    worker_cpu=4,
    worker_memory='8 GiB',
    scheduler_cpu=1,
    scheduler_memory='8 GiB',
    cloud=cloud,
    backend_options=dict(region='us-east-2'),
    shutdown_on_close=True,
)

In [None]:
client = Client(cluster)
client.wait_for_workers(4)
client

# Data loading

In [None]:
input_path = Path('../input/m5-preprocess/processed/')

data = pd.read_parquet(input_path/'sales.parquet')
data

mlforecast requires a dataframe with an index named **unique_id** that identifies each time series, a column **ds** containing the datestamps, and a column **y** with the series values.
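
As a minimal sketch of that layout, the toy frame below holds two hypothetical series (`item_A` and `item_B`, with made-up values) and applies the same renaming and indexing that this notebook applies to the M5 data. Only pandas is assumed.

```python
import pandas as pd

# Hypothetical toy data: two series with three daily observations each.
toy = pd.DataFrame({
    'id': ['item_A', 'item_A', 'item_A', 'item_B', 'item_B', 'item_B'],
    'date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03'] * 2),
    'y': [3, 0, 1, 5, 2, 4],
})

# Rename to the expected column names and move the series id into the index.
toy = toy.rename(columns={'id': 'unique_id', 'date': 'ds'}).set_index('unique_id')
print(toy)
```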

In [None]:
data = data.rename(columns={'id': 'unique_id', 'date': 'ds'})
data = data.set_index('unique_id')
data

Send this data to the cluster. This isn't the best way to do it: saving the data to remote storage such as S3 and reading it from there would be better. Since this notebook reads the data from its own inputs, however, this is probably the easiest way.

We set the number of partitions equal to the number of workers. The partitions are made along the series ids, which ensures that every series lives in exactly one partition.
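
The property can be illustrated on a toy frame. The sketch below does not use dask and is not how dask computes its divisions; it only mimics the idea of cutting a sorted index at series-id boundaries with plain pandas. The helper `split_by_series` and the ids are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_series(df, npartitions):
    """Split a frame indexed by series id into chunks that never share an id."""
    df = df.sort_index()
    # Split the unique ids (not the rows) into contiguous groups.
    id_chunks = np.array_split(df.index.unique().to_numpy(), npartitions)
    return [df[df.index.isin(chunk)] for chunk in id_chunks]

# Four hypothetical series with three observations each.
toy = pd.DataFrame(
    {'y': np.arange(12)},
    index=pd.Index(['a', 'b', 'c', 'd'] * 3, name='unique_id'),
)
parts = split_by_series(toy, npartitions=2)
ids_per_part = [set(part.index) for part in parts]
print(ids_per_part)
```

Each id ends up in a single chunk, so features that depend on a series' full history can be computed inside one partition.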

In [None]:
remote_data = dd.from_pandas(data, npartitions=4).persist()
remote_data

Metadata for predictions

In [None]:
prices = pd.read_parquet(input_path/'prices.parquet')
prices

In [None]:
cal = pd.read_parquet(input_path/'calendar.parquet')
cal = cal.rename(columns={'date': 'ds'})
cal.head()

## Forecast setup

There are two inputs needed: a regressor that follows the scikit-learn API and a time series object which defines the features to be computed.

### Model

In [None]:
lgb_params = {
    'objective': 'poisson',
    'metric': 'rmse',
    'force_row_wise': True,
    'learning_rate': 0.075,
    'bagging_freq': 1,
    'bagging_fraction': 0.75,
    'lambda_l2': 0.1,
    'n_estimators': 1200,
    'num_leaves': 128,
    'min_data_in_leaf': 100,
}

model = LGBMForecast(**lgb_params)
model

### TimeSeries
This is where we define the features. A brief description of each argument:

* **freq**: frequency of our time series. This is a pandas abbreviation and is used to get the next dates when computing the predictions.
* **lags**: lags that we want to use as features.
* **lag_transforms**: dictionary where the keys are the lags that we want to use and the values are a list of transformations to apply to them. The transformations are defined as `numba` jitted functions. If the function takes more arguments than the input array, these are passed as a tuple `(func, arg1, arg2, ...)`.
* **date_features**: date attributes to use for training. These are computed from the `ds` column and are updated in each timestep.
* **num_threads**: number of threads to use in preprocessing and updates, defaults to all cpus. Since the transformations are `numba` jitted functions, we can use multithreading to compute our features.
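
To make the lag and lag-transform features concrete, here is a rough pandas analogue for a single series. It assumes that `rolling_mean` applied to a lag behaves like a pandas rolling mean over the shifted series and yields nulls until a full window is available; the column names are illustrative, not the ones mlforecast generates.

```python
import numpy as np
import pandas as pd

# Hypothetical single series with 40 daily values: 0, 1, ..., 39.
y = pd.Series(np.arange(40, dtype=float))

# lags=[7]: the value observed 7 steps earlier.
lag7 = y.shift(7)

# lag_transforms={7: [(rolling_mean, 7)]}: mean over a 7-step window
# of the lag-7 series.
rolling_mean_lag7_window7 = lag7.rolling(window=7).mean()

features = pd.DataFrame({
    'y': y,
    'lag7': lag7,
    'rolling_mean_lag7_window7': rolling_mean_lag7_window7,
})
print(features.tail(3))
```

The first rows of both feature columns are null because there is not enough history to fill the lag and the window.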

In [None]:
ts = TimeSeries(
    freq='D',
    lags=[7, 28],
    lag_transforms = {
        7:  [(rolling_mean, 7), (rolling_mean, 28)],
        28: [(rolling_mean, 7), (rolling_mean, 28)],
    },
    date_features=['year', 'month', 'day', 'dayofweek', 'quarter', 'week'],
)
ts

### Define forecaster
Once we have our model and time series, we instantiate a `DistributedForecast` object with them.

In [None]:
fcst = DistributedForecast(model, ts)

## Training

At the time of making this notebook, LightGBM doesn't yet support evaluation sets in distributed training (follow [this PR](https://github.com/microsoft/LightGBM/pull/4101) if you're interested). We'll therefore just call `DistributedForecast.fit` on our data, which performs the preprocessing and training on all available data.

`DistributedForecast.fit` takes the following additional arguments:

* **dropna**: whether or not to drop rows with null values after building all the features. Using lags and transformations on the lags generates many rows with `np.nan`s, this is a flag to indicate whether we want to drop them when we're done.
* **keep_last_n**: keep only the last `n` samples from each time series after computing the features. The updates are performed by applying the transformations on the series again and taking only the last value. This can save memory if you have very long series and your transformations only use a small window, like in this case where we have series with thousands of data points and our transformations require only 28 (lag) + 27 (window) samples.
* **static_features**: define which features are static. By default all extra columns (other than **ds** and **y**) are considered static and are replicated when building the features for the next timestep, setting this overrides that and repeats only the ones defined here.
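
The arithmetic behind `keep_last_n=28+27` can be checked with a small numpy sketch. It assumes the feature for the next timestep is the mean of the `window` values ending `lag` steps before that timestep; `next_step_feature` is a hypothetical helper, not part of mlforecast, and the series is random toy data.

```python
import numpy as np

def next_step_feature(history, lag, window):
    """Mean of the `window` values ending `lag` steps before the next timestep."""
    end = len(history) - lag + 1  # one past the value `lag` steps before the next step
    start = end - window
    if start < 0:
        raise ValueError('history is too short for this lag and window')
    return history[start:end].mean()

rng = np.random.default_rng(0)
full = rng.poisson(3, size=1000).astype(float)  # hypothetical long series

lag, window = 28, 28
needed = lag + window - 1  # 28 + 27 = 55 samples

# The truncated history gives the same feature value as the full one.
print(next_step_feature(full, lag, window))
print(next_step_feature(full[-needed:], lag, window))
```

Keeping one sample fewer than `needed` would make the window run past the start of the stored history, which is why 55 is the minimum for the largest lag and window used here.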

In [None]:
%%time
fcst.fit(
    remote_data,
    dropna=True,
    keep_last_n=28+27,
    static_features=['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
)

## Predictions

By default the predictions are computed repeating the static features and updating the transformations and the date features. If you want to do something different you can define your own predict function as explained [here](https://nixtla.github.io/mlforecast/forecast.html#Custom-predictions).

In [None]:
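def my_predict_fn(model, new_x, features_order, cal, prices, alpha):
    new_x = new_x.reset_index()  # expose unique_id as a column for sorting later
    new_x = new_x.merge(cal)  # add calendar columns (joins on the shared columns)
    new_x = new_x.merge(prices)  # add price columns (joins on the shared columns)
    new_x = new_x.sort_values('unique_id')  # sort the rows by series id
    new_x = new_x[features_order]  # select the features in the expected order
    predictions = model.predict(new_x)
    return alpha * predictions  # scale the predictions by alpha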
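
The two `merge` calls pass no `on` argument, so pandas joins on whatever columns the frames share, using an inner join. The toy frames below sketch that behaviour; their column names and values are hypothetical and much smaller than the real calendar and prices tables.

```python
import pandas as pd

# Hypothetical next-step features for two series, deliberately out of order.
new_x = pd.DataFrame({
    'unique_id': ['item_B', 'item_A'],
    'ds': pd.to_datetime(['2016-05-23', '2016-05-23']),
    'item_id': ['B', 'A'],
    'lag7': [2.0, 1.0],
})
cal = pd.DataFrame({
    'ds': pd.to_datetime(['2016-05-23', '2016-05-24']),
    'event': [0, 1],
})
prices = pd.DataFrame({
    'item_id': ['A', 'A', 'B', 'B'],
    'ds': pd.to_datetime(['2016-05-23', '2016-05-24'] * 2),
    'sell_price': [1.5, 1.5, 3.0, 3.2],
})

# cal is joined on 'ds'; prices is joined on 'ds' and 'item_id'.
merged = new_x.merge(cal).merge(prices).sort_values('unique_id')
features_order = ['lag7', 'event', 'sell_price']
print(merged[features_order])
```

Because the join is an inner join, a row with no match in `cal` or `prices` would be dropped silently, so it is worth checking that the merged frame has one row per series.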

Calling `DistributedForecast.predict(horizon)` computes the predictions for the next `horizon` steps. We can also provide a custom `predict_fn`, as we do here with the `my_predict_fn` defined above. This step uses multithreading if `num_threads` was set to a value greater than 1, or was left unset and the machine has more than one CPU (here we have 4).

We'll send the calendar and prices to each worker so the prediction function uses the local dataframes instead of serializing them along with the function.

In [None]:
cal_future = client.scatter(cal, broadcast=True)
prices_future = client.scatter(prices, broadcast=True)

In [None]:
%%time
alphas = [1.028, 1.023, 1.018]
preds = None
for alpha in alphas:
    alpha_preds = fcst.predict(
        horizon=28,
        predict_fn=my_predict_fn,
        cal=cal_future,
        prices=prices_future,
        alpha=alpha
    ).compute()
    alpha_preds = alpha_preds.set_index('ds', append=True)
    if preds is None:
        preds = 1 / 3 * alpha_preds
    else:
        preds += 1 / 3 * alpha_preds
preds

## Shutdown cluster

In [None]:
client.close()  # disconnect the client before shutting the cluster down
cluster.close()

## Submission

In [None]:
wide = preds.reset_index().pivot_table(index='unique_id', columns='ds')
wide.columns = [f'F{i+1}' for i in range(28)]
wide.columns.name = None
wide.index.name = 'id'
wide

In [None]:
sample_sub = pd.read_csv('../input/m5-forecasting-accuracy/sample_submission.csv', index_col='id')
sample_sub.update(wide)
np.testing.assert_allclose(sample_sub.sum().sum(), preds['y_pred'].sum())
sample_sub.to_csv('submission.csv')