# 3 Benchmarking with `sktime`

This notebook shows how to set up reproducible forecasting benchmarking experiments with `sktime`.

A benchmarking experiment is specified by:

* one or multiple models, possibly highly composite pipelines as before!
* evaluation metrics, e.g., MAPE, CRPS
* data sets, e.g., M5 collection
* re-sampling setup, e.g., expanding window splitter with certain parameters
* fit/update specification
* possibly, post-hoc analyses on results of the above

for *reproducible* benchmarking, all of the above information needs to be passed on

`sktime` makes this easy via:

* persisting blueprints of composites, metrics, re-sampling set-ups
* persisting fitted estimators if required
* standard data access interfaces for common benchmark data
* few-line set-up of benchmarking experiment

explained below!

to pass this information on, there are two options.

option 1:

* python environment versions
* jupyter notebook with experiment
* any code for custom estimator classes

option 2:

* python environment versions
* list of persisted object blueprints - estimators, metrics, cv
* benchmark experiment setup params
* any code for custom estimator classes

## 3.1 Persisting models and objects


for reproducibility, one may like to share:

* model blueprint specs, e.g., equivalent of spec `Pipeline([("foo", Foo()), ("bar", Bar(42))])`
* fitted models, e.g., state of `my_pipe.fit(y)` after the `fit` - specific to data!

### 3.1.1 Persisting model blueprints

blueprint specs can be serialized using simple string print - this contains all information!

In [None]:
# let's define an example pipeline
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.impute import Imputer

pipe = TransformedTargetForecaster(
    steps=[
        ("imputer", Imputer()),
        ("forecaster", NaiveForecaster()),
    ]
)

In [None]:
# serialize the pipeline to a string
# this is useful for logging and sharing
# pipe_str can be saved to a file, database, or shared over the internet
pipe_str = str(pipe)
pipe_str

for pseudo-random determinism, set any `random_state` parameters in the estimators

to deserialize, use `registry.craft` in the same python environment

for python environment, e.g., use `pip freeze`

In [None]:
from sktime.registry import craft

pipe_new = craft(pipe_str)
pipe_new

this is the same estimator blueprint as `pipe`!

To compare blueprints, simply use the `==` operator (this is a `scikit-base` feature)

In [None]:
pipe_new == pipe

share complex pipelines like this with your researcher friends (or in the appendix of your publications)!

I.e., process as follows:

* publishing researcher shares `pipe_str = str(pipe)` or `str(my_estimator)` and `pip freeze > requirements.txt` output
* reproducing researcher installs env from `requirements.txt` and runs `craft(pipe_str)` in that env

For custom estimators, in addition, the custom module needs to be shared.
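The sharing step above could be scripted roughly as follows. This is a minimal sketch, not part of `sktime`: the helper name `export_blueprint` and the file names are hypothetical, and the package list built from `importlib.metadata` only approximates `pip freeze` output (e.g., editable installs are listed differently).

```python
from importlib import metadata
from pathlib import Path


def export_blueprint(blueprint_str, out_dir):
    """Write a blueprint string and a pip-freeze-like package list to out_dir.

    Hypothetical helper for illustration; not an sktime function.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # the blueprint is plain text, e.g., the result of str(pipe)
    (out_dir / "blueprint.txt").write_text(blueprint_str, encoding="utf-8")

    # one "name==version" line per installed distribution, similar to pip freeze
    pins = sorted(
        {
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
            if dist.metadata["Name"]
        }
    )
    (out_dir / "requirements.txt").write_text(
        "\n".join(pins) + "\n", encoding="utf-8"
    )
    return out_dir
```

On the publishing side, this would be called as `export_blueprint(str(pipe), "shared_experiment")`; the reproducing side reads `blueprint.txt` back and passes its content to `craft`.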

Highly complex estimators can consist of multiple definition blocks - this is also supported by `craft` as follows.

Instead of the string conversion `str(pipe)`, we can also write out a multi-line specification:

In [None]:
# pipe_spec is a string representation of the pipeline
# it can be stored in a file or a database like this
# the "return" statement indicates which object we store
# temporary variables like pipe, cv can be defined
pipe_spec = """
pipe = TransformedTargetForecaster(steps=[
    ("imputer", Imputer()),
    ("forecaster", NaiveForecaster())])
cv = ExpandingWindowSplitter(
    initial_window=24,
    step_length=12,
    fh=[1, 2, 3])

return ForecastingGridSearchCV(
    forecaster=pipe,
    param_grid=[{
        "forecaster": [NaiveForecaster(sp=12)],
        "forecaster__strategy": ["drift", "last", "mean"],
    },
    {
        "imputer__method": ["mean", "drift"],
        "forecaster": [ThetaForecaster(sp=12)],
    },
    {
        "imputer__method": ["mean", "median"],
        "forecaster": [ExponentialSmoothing(sp=12)],
        "forecaster__trend": ["add", "mul"],
    },
    ],
    cv=cv,
    n_jobs=-1)
"""

In [None]:
craft(pipe_spec)

sometimes, estimators require soft dependencies to be installed, and complain at construction (or `craft`) if these are missing

for this, required dependencies can be queried *before* construction:

In [None]:
from sktime.registry import deps

deps(pipe_spec)

... although this should not be necessary if `pip freeze` output is available
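If only a dependency list is at hand, a quick check of what is missing in the current environment might look as follows. This is a minimal sketch, assuming the input is a list of requirement strings in the style that `deps` is expected to return; `missing_requirements` is a hypothetical helper, and version constraints are deliberately ignored.

```python
import re
from importlib import metadata


def missing_requirements(requirements):
    """Return the requirement strings whose package is not installed.

    Hypothetical helper: only the package name is checked,
    version constraints such as ">=1.0" are ignored.
    """
    missing = []
    for req in requirements:
        # leading package name, e.g., "statsmodels" from "statsmodels>=0.12"
        match = re.match(r"[A-Za-z0-9._-]+", req.strip())
        if match is None:
            missing.append(req)
            continue
        try:
            metadata.version(match.group(0))
        except metadata.PackageNotFoundError:
            missing.append(req)
    return missing


# made-up requirement string, in the format assumed for the output of deps
print(missing_requirements(["surely-not-installed-pkg-0xyz>=1.0"]))
```

With real input, this would be called as `missing_requirements(deps(pipe_spec))`.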

another useful convenience: `imports` can be used to print a full import block:

In [None]:
from sktime.registry import imports

imports(pipe_spec)  # the result can be copied above the spec into a jupyter cell

### 3.1.2 Persisting fitted models

persisting fitted models can be useful to share in a reproducibility setting

(note: data source plus blueprint with `random_state` may be easier to share)

to persist a fitted model:

In [None]:
from sktime.datasets import load_airline

y = load_airline()

In [None]:
# example pipeline
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.naive import NaiveForecaster
from sktime.transformations.series.impute import Imputer

pipe = TransformedTargetForecaster(
    steps=[
        ("imputer", Imputer()),
        ("forecaster", NaiveForecaster()),
    ]
)

pipe.fit(y, fh=[1, 2, 3])

to serialize fitted objects, use `save` - default is `pkl`, but may differ for deep learning

* no args produces in-memory object
* `str` or `Path` arg will serialize to file

In [None]:
pipe_mem = pipe.save()
# pipe_mem is a pickle

to deserialize, use the `load` function from `sktime.base` on the in-memory object, or on a `str` or `Path` pointing to the file:

In [None]:
from sktime.base import load

pipe_new = load(pipe_mem)

the loaded object can be used for prediction now.

In [None]:
pipe_new.predict()

## 3.2 Forecast evaluation metrics



### 3.2.1 Metrics for Point Forecasts

### 3.2.2 Metrics for Probabilistic Forecasts

## 3.3 Benchmarking - comparing estimator performance

The `benchmarking` module allows you to easily orchestrate benchmarking experiments in which you want to
compare the performance of one or more algorithms over one or more datasets and benchmark configurations.

Benchmarking as an endeavour is very easy to get wrong, which leads to false conclusions about estimator
performance. See this [2022 research from Princeton](https://reproducible.cs.princeton.edu/)
for numerous examples of such mistakes in peer-reviewed academic papers.

`sktime`'s `benchmarking` module is designed to provide benchmarking functionality while enforcing best
practices and structure, to help users avoid mistakes (such as data leakage) that invalidate
their results. The module is designed with ease of use in mind: it interfaces
directly with `sktime` objects and classes, so previously developed estimators should be usable as they are, without
alterations.

This notebook demonstrates usage of the `benchmarking` module.

In [None]:
from sktime.benchmarking.forecasting import ForecastingBenchmark
from sktime.datasets import load_airline
# in older sktime versions, splitters were imported from sktime.forecasting.model_selection
from sktime.split import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanSquaredPercentageError

### Instantiate an instance of a benchmark class
In this example we are comparing forecasting estimators.

In [None]:
benchmark = ForecastingBenchmark()

### Add competing estimators
We add different competing estimators to the benchmark instance. All added estimators will
automatically be run through each added benchmark task, and their results compiled.

In [None]:
benchmark.add_estimator(
    estimator=NaiveForecaster(strategy="mean", sp=12),
    estimator_id="NaiveForecaster-mean-v1",
)
benchmark.add_estimator(
    estimator=NaiveForecaster(strategy="last", sp=12),
    estimator_id="NaiveForecaster-last-v1",
)

### Add benchmarking tasks
These are the prediction/validation tasks over which every estimator will be tested and their results compiled.

The exact arguments for a benchmarking task depend on whether the objective is forecasting, classification, etc.,
but generally they are similar. The following are the required arguments for defining a forecasting benchmark task.

#### Specify cross-validation split regime(s)
Define cross-validation split regimes, using standard `sktime` objects. 

In [None]:
cv_splitter = ExpandingWindowSplitter(
    initial_window=24,
    step_length=12,
    fh=12,
)
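To make the split regime concrete, the following conceptual sketch enumerates the integer positions such an expanding window would produce. It is a plain-Python illustration under assumptions, not `sktime`'s implementation: the function name `expanding_window_splits` is hypothetical, and the logic is only assumed to mirror the splitter's behaviour for these parameters.

```python
def expanding_window_splits(n_timepoints, initial_window, step_length, fh):
    """Yield (train_positions, test_positions) pairs of integer positions.

    Conceptual sketch only: the training window always starts at position 0
    and grows by step_length per split; test positions lie h steps after
    the last training position, for each h in fh.
    """
    train_end = initial_window  # exclusive end of the training window
    while train_end - 1 + max(fh) < n_timepoints:
        train = list(range(train_end))
        test = [train_end - 1 + h for h in fh]
        yield train, test
        train_end += step_length


# 144 monthly observations, as in the airline data; fh=12 is written as [12]
splits = list(
    expanding_window_splits(144, initial_window=24, step_length=12, fh=[12])
)
print(len(splits))  # 10
print(splits[0][0][-1], splits[0][1])  # 23 [35]
print(splits[-1][0][-1], splits[-1][1])  # 131 [143]
```

Each split trains on everything up to a cutoff and tests on the single point twelve steps ahead, so no test observation is ever part of its own training window.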

#### Specify performance metric(s)
Define performance metrics on which to compare estimators, using standard `sktime` objects.

In [None]:
scorers = [MeanSquaredPercentageError()]
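For orientation, the quantity behind a mean squared percentage error can be sketched in plain Python. This shows the textbook definition only; the function name `mspe_sketch` is hypothetical, and the `sktime` metric class may differ in details such as safeguards against division by zero.

```python
def mspe_sketch(y_true, y_pred):
    """Mean of squared relative errors; assumes no true value is zero."""
    squared_relative_errors = [
        ((true - pred) / true) ** 2 for true, pred in zip(y_true, y_pred)
    ]
    return sum(squared_relative_errors) / len(squared_relative_errors)


# both forecasts are off by 10 percent, so the result is 0.1 ** 2 = 0.01
print(round(mspe_sketch([100.0, 200.0], [110.0, 180.0]), 6))  # 0.01
```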

#### Specify dataset loaders
Define dataset loaders: callables (functions) that return a dataset when called. Generally,
a loader returns a dataframe containing the entire dataset. One can use
the datasets shipped with `sktime`, or define their own. Something as simple as the following
example will suffice:
```python
import pandas as pd


def my_dataset_loader():
    return pd.read_csv("path/to/data.csv")
```
The datasets are loaded when the benchmarking tasks are run, passed through the cross-validation
regime(s), and the estimators are then tested over the resulting dataset splits.

In [None]:
dataset_loaders = [load_airline]

#### Add tasks to the benchmark instance
Use the previously defined objects to add tasks to the benchmark instance.
Optionally use loops etc. to easily set up multiple benchmark tasks reusing arguments.

In [None]:
for dataset_loader in dataset_loaders:
    benchmark.add_task(
        dataset_loader,
        cv_splitter,
        scorers,
    )

### Run all task-estimator combinations and store results

Note that `run` won't rerun tasks it already has results for, so adding a new
estimator and running `run` again will only run tasks for that new estimator.

In [None]:
results_df = benchmark.run("./forecasting_results.csv")
results_df.T
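The skip-if-already-computed behaviour of `run` noted above can be pictured with a small conceptual sketch. This is plain Python for illustration, not `sktime`'s implementation; `run_pending`, `fake_evaluate`, and the identifiers used are hypothetical.

```python
def run_pending(estimator_ids, task_ids, results, evaluate):
    """Evaluate only the (estimator, task) pairs missing from results."""
    for estimator_id in estimator_ids:
        for task_id in task_ids:
            key = (estimator_id, task_id)
            if key in results:
                continue  # already computed in an earlier run
            results[key] = evaluate(estimator_id, task_id)
    return results


calls = []


def fake_evaluate(estimator_id, task_id):
    # stand-in for an expensive backtest; records which pairs were evaluated
    calls.append((estimator_id, task_id))
    return 0.0


# first run: one estimator; second run: a new estimator is added
results = run_pending(["naive-mean"], ["airline"], {}, fake_evaluate)
results = run_pending(
    ["naive-mean", "naive-last"], ["airline"], results, fake_evaluate
)
print(calls)  # [('naive-mean', 'airline'), ('naive-last', 'airline')]
```

The second call evaluates only the newly added estimator, which is the behaviour described for re-running a benchmark after adding estimators.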

### Credits: notebook 3 - Metrics and Evaluation

notebook creation:

forecaster pipelines:
transformer pipelines & compositors:
dunder interface:

tuning, autoML:
CV and splitters:
forecasting metrics:
backtesting, evaluation: