
[ENH] Enhancements to kotsu Benchmarking Framework #4873

Open · 5 of 9 tasks
hazrulakmal opened this issue Jul 12, 2023 · 6 comments

Labels: API design (API design & software architecture), enhancement (Adding new functionality), module:metrics&benchmarking (metrics and benchmarking modules)

@hazrulakmal (Collaborator) commented Jul 12, 2023

A great initial effort was made by @DBCerigo, @TNTran92, and @alex-hh to create a user-friendly sktime benchmarking interface for quick univariate forecasting benchmarks. The decision to include kotsu as part of the benchmarking module was made in issue #2777.

After the first kotsu integration PR was merged, the work was put on hold, but there are still future tasks to complete, as suggested by the original author in issue #2804. I plan to continue the excellent work that has been done, and this issue is intended to keep track of what needs to be done.

Additionally, there were some issues and suggestions for improving the current implementation, which @fkiraly, @marrov, and I discussed previously here. Below is a summary of the discussion - please add anything I have missed.

  1. ForecastingBenchmark force-writes results to the hard drive, but it should allow users to save results to other storage locations (e.g. cloud) or formats (e.g. Parquet).
  2. ForecastingBenchmark returns results as aggregated fold metrics, without providing intermediate results. Users have no control over further evaluation processing steps, such as adding new metrics after the benchmark has run, visualising predictions, or computing different aggregations such as the median.
  3. The ForecastingBenchmark dataset loader assumes a callable from sktime.datasets. Does this mean it only accepts datasets that exist in sktime.datasets? Is this intended? What about users' external datasets?
  4. The estimator_id is restricted to certain strings without clear user signposting (see the sketch after this list). Should all strings be allowed?
  5. Currently, the add_estimator method only takes a single estimator per call. It could be modified to accept a list of estimators.
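
For illustration of point 4, this is roughly the kind of ID restriction involved. The regex below is only a sketch of the `[username/](entity-name)-v(major).(minor)` convention discussed further down; it is not kotsu's actual validation code.

```python
import re

# Illustration only: a pattern matching the documented
# [username/](entity-name)-v(major).(minor) convention.
# This is NOT kotsu's actual validation code, just a sketch of the restriction.
ID_PATTERN = re.compile(r"^(?:[\w.-]+/)?[\w.-]+-v\d+(?:\.\d+)?$")

for candidate in ["hazrulakmal/airline-v1.2", "airline-v1", "my estimator!"]:
    print(candidate, bool(ID_PATTERN.match(candidate)))
# the last ID is rejected - this is the "restricted to certain strings" problem
```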

My initial thoughts on the problems, in the same order:

  1. The kotsu framework itself is responsible for results storage and forces results to be saved as CSV; there is no other option, as far as I know. Making it more flexible should be done within the kotsu framework rather than in sktime.
  2. I believe that evaluate itself has an option to return raw predictions, so we may need to make some tweaks to the current forecasting benchmarking to allow for this option (see the sketch after this list).
  3. Should we instead accept sktime in-memory data formats (scitype, mtype)?
  4. The estimator_id naming convention must follow the format [username/](entity-name)-v(major).(minor) (e.g. hazrulakmal/airline-v1.2 or airline-v1), as required by kotsu. Allowing all kinds of strings would have to be done in kotsu.
  5. I agree with this part!
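
Regarding point 2, a minimal sketch of how raw per-fold predictions can already be obtained from sktime's evaluate via return_data=True; the benchmark wrapper would only need to pass this option through. The dataset, splitter, and metric below are just example choices.

```python
# Sketch: evaluate(return_data=True) adds per-fold y_train, y_test, y_pred
# columns to the results, which is the intermediate data the benchmark
# currently discards.
from sktime.datasets import load_airline
from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError

y = load_airline()
cv = ExpandingWindowSplitter(initial_window=24, step_length=12, fh=[1, 2, 3])

results = evaluate(
    forecaster=NaiveForecaster(strategy="mean"),
    y=y,
    cv=cv,
    scoring=MeanAbsolutePercentageError(symmetric=True),
    return_data=True,  # keep raw per-fold predictions alongside the scores
)
print(results.columns)
```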

Moving forward, this is what I want to do:

hazrulakmal added the enhancement (Adding new functionality) label on Jul 12, 2023
@fkiraly (Collaborator) commented Jul 12, 2023

Just dumping the HackMD notebook here for reference, as those notebooks have a tendency to degrade and/or disappear.


Design notes for benchmarking framework

current benchmark vignette

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark

benchmark = ForecastingBenchmark()

# 2
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.ets import AutoETS

benchmark.add_estimator(
    estimator=NaiveForecaster(strategy="mean", window_length=3),
    estimator_id="Naive-mean-3-v1",
)
benchmark.add_estimator(estimator=AutoARIMA(), estimator_id="AutoARIMA-v1")
benchmark.add_estimator(estimator=AutoETS(auto=True), estimator_id="AutoETS-v1")
benchmark.add_estimator(
    # forecaster_with_differencer is assumed to be defined elsewhere
    estimator=forecaster_with_differencer.clone(), estimator_id="LightGBM-v1"
)

# 3
from sktime.datasets import load_shampoo_sales
from sktime.forecasting.model_selection import ExpandingWindowSplitter

# fh and smape (a metric) are assumed to be defined elsewhere
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]

benchmark.add_task(
    load_shampoo_sales,
    cv,
    scorers,
)

# 4
results_df = benchmark.run(output_file="results.csv")
results_df.set_index("model_id").iloc[:, -2:].style.format("{:.1%}")

comments MR

  • new feature - model registry to enable:
    • adding multiple estimators (as opposed to adding a single estimator)
    • storage/loading of the model registry (persistence)
      • of fitted models, after the experiment is run
      • model name, model ID, run ID
    • registry-wide modifiers (e.g. reconcilers)
  • problem: forced to CSV storage
    • idea: decouple the storage approach (strategy pattern) - see the sketch after this list
  • problem: returns per model are stats of aggregated fold metrics, e.g. MAPE_mean. Users have no access to more granular options (scores per time index within a fold, scores per level of hierarchy, etc.)
  • problem: no intermediate results, so you need to be 100% sure which metrics you want before running. If you need a different metric after the run, it currently requires a re-run
  • problem: for scaled metrics, the train data (or cutoff per fold) needs to be stored, so this should also be a stored intermediate result
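
A minimal sketch of the storage strategy-pattern idea above; all class names are hypothetical and none of this exists in sktime or kotsu.

```python
# Hypothetical sketch of decoupling results storage via a strategy pattern.
from abc import ABC, abstractmethod

import pandas as pd


class ResultsStore(ABC):
    """Strategy interface: how/where benchmark results are persisted."""

    @abstractmethod
    def write(self, results: pd.DataFrame) -> None: ...


class CSVStore(ResultsStore):
    def __init__(self, path: str):
        self.path = path

    def write(self, results: pd.DataFrame) -> None:
        results.to_csv(self.path, index=False)


class ParquetStore(ResultsStore):
    def __init__(self, path: str):
        self.path = path

    def write(self, results: pd.DataFrame) -> None:
        # parquet preserves dtypes and handles richer outputs (e.g. raw predictions)
        results.to_parquet(self.path, index=False)


# the benchmark would then take a store instead of hard-coding CSV, e.g.
# benchmark.run(store=ParquetStore("results.parquet"))
```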

comments FK (based on discussion with MR earlier)

1

  • benchmark class could be polymorphic, i.e., dispatch to classifier, forecaster benchmark etc.
    • maybe sth for later once forecasting works well

3

  • problem: estimator_id is restricted to certain strings without clear user signposting

    • all strings should be allowed (?)
    • estimator_id to be auto-generated (as unique string, like in dunders) if not passed
  • problem: add_estimator interface is unnecessarily complex

    • shorten add_estimator to add? Argument is already called estimator
      • could be + dunder too
    • should clone automatically happen?
    • allow fitted models or not? What happens if fitted model is passed?

4

  • problem: data loader interface is very restrictive, and requirements on the "loader" are undocumented

    • allow in-memory data containers
    • change the "loader" to BaseDataset or similar with unified interface
  • problem: run requires csv file

    • should be able to run locally
    • should be file backend flexible if possible
    • None - should default to completely in-memory?
  • problem: aggregation mode is fixed, not flexible

    • should be flexible, with different options to aggregate or not (see the sketch after this list):
      • over folds
      • over hierarchy levels
      • over test data indices or fh index
    • should this be part of the task, or part of the run interface?
  • problem: does not produce intermediate results

    • disaggregated forecasts should be accessible
      • if they are computed anyway, option to store or return them?
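
A sketch of what flexible aggregation could look like if the benchmark returned raw per-fold scores; the DataFrame and its columns are purely illustrative, not an existing output format.

```python
# Hypothetical example: one row per (model, fold) with the raw score,
# so the aggregation statistic can be chosen after the run.
import pandas as pd

raw = pd.DataFrame(
    {
        "model_id": ["naive", "naive", "arima", "arima"],
        "fold": [0, 1, 0, 1],
        "smape": [0.12, 0.15, 0.08, 0.11],
    }
)

# aggregate over folds, with the statistic chosen by the user, not the framework
print(raw.groupby("model_id")["smape"].agg(["mean", "median", "std"]))
```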

FK speculative interface design

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark

benchmark = ForecastingBenchmark()
# alternative, can load existing benchmark
# benchmark = ForecastingBenchmark.load(serialization_ref)

# 2
from sktime.forecasting.naive import NaiveForecaster

benchmark.add(estimator=NaiveForecaster(strategy="mean", window_length=3))
# this stores the NaiveForecaster in the registry of ForecastingBenchmark
# returns reference to self (benchmark)
# pretty prints list() - default list() is list("all") which returns multiple DataFrame and prints all

benchmark.add(estimator=AutoARIMA(), name="supercalifragilistic")
# stores the AutoARIMA model under the name - can be any

benchmark.add(estimator=AutoETS(auto=True), name="supercalifragilistic")
# raises a warning and stores under supercalifragilistic_2

benchmark + Prophet()  # or += ? does the same as benchmark.add

benchmark.list("estimator")
# returns a pd.DataFrame with columns
# type (str), name (str), UID (str)
# lists estimators that have been added


# 3
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]

benchmark.add(  # polymorphic, detects type(?)
    # perhaps we want internal dispatch to add_task, add_estimator etc
    load_shampoo_sales,  # data loader function
    cv,  # sktime splitter
    scorers,
)

y = load_airline()

benchmark.add(
    y,  # pandas Series or anything in sktime data format
    # can omit cv, default is ExpandingWindowSplitter
    scorers,
)

benchmark.add(
    MonashLoad("dataname"),  # speculative BaseData class
    cv,
    # can also omit scorers, in that case no scoring takes place
)

benchmark.add(
    M5task()  # speculative BaseTask class
)

benchmark.list("task")
# returns a pd.DataFrame with columns
# type (str), name (str), UID (str), maybe metadata? dataset, cv, etc
# lists tasks that have been added

# 4 - wip
results_df = benchmark.run(
    output="results.csv",
)

HA comments

  • add seems potentially useful, but will require dispatch
    • so we want to have add_estimator etc in the first place before dispatch
    • might be unclear to the user if the information is not explicit
  • likes dunder + (but might have the same problem as add)
  • likes defaults (but add should maybe be split up?)
  • add_task has advantage over add_cv that cv can be specific to dataset (more easily)
  • datasets assume a callable - should we rather accept sktime in-memory data formats?

MR interface proposal

# Actual imports
from sktime.forecasting.naive import NaiveForecaster

# Fake imports
from sktime.benchmarking import ModelRegistry, ForecastingBenchmark

# Model registry concept: decouple this from benchmarking & make persistable
# Option 1)
# Create an empty registry and populate one by one
model_registry = ModelRegistry()
model_registry.add_model(model=NaiveForecaster(), name='Naive') # any name
# Resolves name clashes by providing a uuid for each model
model_registry.add_model(model=NaiveForecaster('drift'), name='Naive')
model_registry # Returns a df with: str(model), name, uuid 
model_registry.save('my_model_registry')

# Option 2)
# Load a saved registry
models = ModelRegistry().load("my_model_registry")

# Option 3)
# Create a registry from an iterable of models
list_models = [NaiveForecaster(), NaiveForecaster(strategy='drift')]
dict_models = {'naive': NaiveForecaster(), 'naive drift': NaiveForecaster(strategy='drift')}

model_registry = ModelRegistry().add_models(list_models) # Auto names
model_registry = ModelRegistry().add_models(dict_models) # Predefined names

# It should be able to apply common modifiers to all models in the registry
# Note: very unsure how to define this interface in a good way
model_registry.set({'reconciler', 'bu'})

# Backtest runner concept: only for one dataset and cv
backtest = ForecastingBenchmark()
backtest.models = model_registry

comments

  1. no support for other tasks yet (e.g. annotation)

FK comments

  • adds significant feature - model registry
    • perhaps not the 1st thing to change-refactor
    • but useful feature
    • great idea to develop separately from benchmark
      • conceptual difference: fitted models vs blueprints (here)
      • fitted models need advanced serialization
        • blueprints just need to serialize the specification
    • should we call this "collection"?
      • sktime users might find some collections interesting

Hazrul interface proposal

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark

benchmark = ForecastingBenchmark()

# 2
from sktime.forecasting.naive import NaiveForecaster

benchmark.add_estimator(
    estimator=NaiveForecaster(strategy="mean", window_length=3),
    estimator_id="Naive-mean-3-v1",
)
benchmark.add_estimator(estimator=AutoARIMA(), estimator_id="AutoARIMA-v1")
benchmark.add_estimator(estimator=AutoETS(auto=True), estimator_id="AutoETS-v1")
benchmark.add_estimator(
    estimator=forecaster_with_differencer.clone(), estimator_id="LightGBM-v1"
)

# 3
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]


benchmark.add_dataset(dataset)  # callable, sktime in-memory container, or pandas
# add a dataset to the benchmark

benchmark.add_cv(cv)
# add a cross-validation splitter

benchmark.add_scores(scores)  # list of callables or BaseMetric instances


# 4
results_df = benchmark.run(in_memory=True,
                           output_file="file_name")
# parallelise the run behind the scenes
# allow outputting different file formats
# should also output the estimator predictions and ground-truth values


benchmark.visualise(option)  # option: str, name of the graph to visualise
# some graphs in mind:
# 1. point prediction performance of many models on a radar plot
# 2. boxplots - cv score range and mean for each model

benchmark.test(option)  # option: str, e.g. rank, t-test, etc.
# return the test result for each model


benchmark.return_estimator(estimator_id)  # estimator_id: str
# return the chosen fitted model instance

results_df.set_index("model_id").iloc[:, -2:].style.format("{:.1%}")

FK comments

  • useful idea: visualization, testing

    • should this be from benchmark?
    • how does visualize, test know where the results are?
      • otherwise, we re-run the experiment 3 times (not good?)
      • we could remember a pointer?
        • to a BenchmarkResults object?
      • state/attribute design - important to make this explicit
  • return_estimator (could be some variant of list, or list fitted?)

  • possibly contentious - splitting up task into add_cv etc

    • this will not be the same for other learning tasks

@fkiraly (Collaborator) commented Jul 13, 2023

Some reflections upon your points, @hazrulakmal - I think these are indeed the distillation of some of the most pertinent design points here, in the context of the existing framework!

Having said that, the design discussion notebook contains some additional ideas which might also be worth considering afterwards.

Reflections below on:

  • high-level design and parameters of the rework
  • thoughts about the specific issues identified

high-level design, rework

At a high level, I would start from the premise that we do not assume a long-term dependency on kotsu, and therefore not on its problematic idiosyncrasies above. I'd also say mid-term considerations depend a lot on its maintenance model, (re-)engagement of @DBCerigo, and so on. Having said that, we should also not assume that we drop it; considerations should primarily depend on design, usability, and maintainability.

Regarding the short-term changes, I would suggest:

  • most of the above can (and probably should, at first) be applied as targeted patches in sktime.
  • I can see no viable alternative to making changes primarily in sktime, as we need full control of deprecation and the test suite to avoid breakage on the user side.
  • mid-term, we can decide whether the sum of changes suggests moving them upstream to kotsu, removing the kotsu interface dependencies, or a mixture thereof.

@fkiraly (Collaborator) commented Jul 13, 2023

addressing the 5 issues listed by @hazrulakmal

  1. I agree that kotsu is the natural end locus for extending the framework subject to the current high-level design.
    But, we should not assume kotsu (or assume not kotsu). As far as I currently see it, we're wrapping evaluate mainly, and the data layer handling is hard-coded in kotsu. So, why not consider removing the kotsu logic from the return? That might be simplest, and still allow preserving the user facing interface.

  2. indeed, evaluate has the feature, so the options are either a pass-through of that option, or removal of the intermediate kotsu layer that hard-codes the aggregation.

  3. yes - short-term, I think we definitely should allow the user to pass sktime mtypes (primarily pd.DataFrame based ones, but convert_to allows any). Mid-term, my preference would be data type polymorphism, i.e., allowing in-memory containers as well as pytorch-like loader classes. The current sktime loaders are odd creatures: they are not clearly designed and exist mainly for use in notebooks and docstrings rather than as clean interfaces, but it is perhaps good to support them too, temporarily.

  4. I disagree that this has to be done in kotsu. It could easily be done in sktime, by two mechanisms: bypassing kotsu entirely in handling names; or converting arbitrary names into kotsu-compatible names ("translating" the interface by adding or removing the dashes etc. between sktime and kotsu) - see the sketch after this list.

  5. yes, agreed - this is an easy fix.
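
To make the second mechanism in point 4 concrete, here is a minimal sketch of such a name translation; the to_kotsu_id helper is hypothetical and not part of sktime or kotsu.

```python
# Hypothetical sketch of the "translate" option: coerce arbitrary user-chosen
# names into the [entity-name]-v[major] shape that kotsu accepts.
import re


def to_kotsu_id(name: str, version: str = "v1") -> str:
    """Map an arbitrary estimator name to a kotsu-compatible ID."""
    entity = re.sub(r"[^\w.-]+", "-", name).strip("-")
    return f"{entity}-{version}"


print(to_kotsu_id("AutoETS (auto=True)"))  # -> "AutoETS-auto-True-v1"
```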

fkiraly added this to In progress in Workstream: benchmarking on Aug 4, 2023
hazrulakmal added and then removed the bugfix (Fixes a known bug or removes unintended behavior) label on Aug 4, 2023
fkiraly pushed a commit that referenced this issue Aug 15, 2023
…pt multiple estimators (#4877)

See #4873 

The current ForecastingBenchmark interface requires manually looping
over multiple estimators to benchmark multiple models, as the
add_estimator interface only accepts one estimator per call. This PR
provides an option for users to add multiple models with a single
add_estimator call while maintaining the existing interface. The
interface for multiple models looks as follows:

```python
from sktime.benchmarking.forecasting import ForecastingBenchmark
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.arima import AutoARIMA

benchmark = ForecastingBenchmark()

# for auto naming
models = [NaiveForecaster(), AutoARIMA()]
benchmark.add_estimator(models)

# for custom naming
models ={"Arima-v1":AutoARIMA(),
        "Naive-v1":NaiveForecaster()}
benchmark.add_estimator(models)

# existing interface still works
benchmark.add_estimator(NaiveForecaster())
benchmark.add_estimator(AutoARIMA())
```
@hazrulakmal (Collaborator, Author) commented Sep 6, 2023

After working on this enhancement for quite some time, I believe kotsu is not desirable in the long run, for a few reasons.

  1. Kotsu is limited in many ways. It is a good package for quickly validating metrics, but if we want to do more than that, its limitations become apparent. For example, kotsu does not support serialising/saving an estimator for later use, saving results is limited to the CSV format, which restricts what can be saved (e.g. dataset splits and estimator predictions can't be saved in this format), and it was not built with parallelization in mind. While we could improve kotsu to solve these issues, it would require a hacky interface in sktime that overrides and changes kotsu's internal code. This would make long-term maintenance difficult if upstream kotsu changes do not take our changes into account.
  2. The original kotsu authors/community are no longer actively developing kotsu or contributing to sktime, which could cause communication bottlenecks.

Personally, moving forward, I would want to suggest a rework of the benchmarking module from scratch; it has to meet the following high-level requirements:

  1. Reproducibility (low-hanging fruit)
    1. support for model experiment tracking loggers (MLflow, W&B, Comet, etc.) to handle versioned storage of data, estimators, hyperparameters, and experiment results. Focus on the abstract class and make MLflow work first (see the sketch after this list).
    2. able to save and load fitted estimators (including their fitted state).
  2. Automation (low-hanging fruit)
    1. workflow automation for conducting forecasting, classification & annotation experiments efficiently.
  3. Performance (high-hanging fruit)
    1. support for out-of-the-box parallel processing or distributed computing for faster experimentation on large datasets. I suppose we can use Fugue (dask, spark and ray) as the backend.
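
A minimal sketch of requirement 1.1. Only the mlflow calls are real API; the surrounding benchmarking integration does not exist yet and is purely illustrative.

```python
# Illustrative sketch: logging one benchmark run to MLflow.
# The integration with the benchmarking module is an assumption, not existing code.
import mlflow

from sktime.forecasting.naive import NaiveForecaster

mlflow.set_experiment("forecasting-benchmark")

forecaster = NaiveForecaster(strategy="mean", window_length=3)
with mlflow.start_run(run_name="Naive-mean-3"):
    # log the estimator hyperparameters for reproducibility
    mlflow.log_params(forecaster.get_params())
    # ... fit/evaluate here, then log the fold-aggregated score ...
    mlflow.log_metric("smape_mean", 0.123)  # placeholder value
```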

Before I embark on this journey, I think we need to do some rearrangement of the benchmarking module for a better developer experience. There currently seem to be a legacy benchmarking module, the kotsu benchmarking module, and this new upcoming rework, so we need to handle this migration. I suppose we can isolate them in separate directories and interface the imports through the __init__ file so that the status quo can be maintained. In the long run, once one of the modules matures, we can deprecate the rest so that maintenance becomes more manageable.

Moving forward, I will focus on the rework, especially the low-hanging fruit first, but I will also keep the kotsu enhancements in mind. However, I will prioritize the rework over kotsu.

@hazrulakmal (Collaborator, Author) commented:

So, why not consider removing the kotsu logic from the return? That might be simplest, and still allow preserving the user facing interface.

and

indeed, evaluate has the feature, so the options are either a pass-through of that option, or removal of the intermediate kotsu layer that hard-codes the aggregation.

In order to do both, saving the intermediate results (fitted estimators & predictions) requires a different file format (not CSV as per the status quo). I considered rewriting kotsu.run to get more control over storage handling, but this would mean a direct override of kotsu's internal code (not a very clean approach), as discussed in PR 5055.

Points 4 and 5 are completed in PR 5130 and PR 4877, respectively.

@fkiraly (Collaborator) commented Sep 6, 2023

Thanks for your detailed and well measured analysis, @hazrulakmal!

I'm sorry for the back and forth here; unfortunately, that's a side effect of working with a pre-existing code base.

  2. The original kotsu authors/community are no longer actively developing kotsu or contributing to sktime, which could cause communication bottlenecks.

For me, this point 2 of yours would already be strong enough reason to step away from it - we have ended up taking a dependency that is now basically a liability. Even if we commit to maintaining it, it is not clear whether we could obtain the maintenance credentials to do that. Similarly, working towards overrides looks increasingly like a time-sink risk from the same perspective.

Personally, moving forward, I would want to suggest a rework of the benchmarking module from scratch; it has to meet the following high-level requirements:

Agreed! The basic requirements seem like the most important ones to me. For reproducibility, I assume you also meant the raw predictions etc.

Before I embark on this journey, I think we need to do some rearrangement of the benchmarking module for a better developer experience. There currently seem to be a legacy benchmarking module, the kotsu benchmarking module, and this new upcoming rework, so we need to handle this migration. I suppose we can isolate them in separate directories and interface the imports through the __init__ file so that the status quo can be maintained. In the long run, once one of the modules matures, we can deprecate the rest so that maintenance becomes more manageable.

Yes, that seems like a sensible strategy - though with a minor risk that we end up with three benchmarking frameworks if we don't manage the transition right.
We could lessen the risk by refactoring step-wise, subject to a near-consolidated user-facing interface, although depending on how strict one is, that cautious approach might be more hassle than it is worth.

Moving forward, I will focus on the rework, especially the low-hanging fruit first, but I will also keep the kotsu enhancements in mind. However, I will prioritize the rework over kotsu.

Imo absolutely sensible.

FYI @benHeid, @tcuongd.

fkiraly pushed a commit that referenced this issue Sep 7, 2023
Towards #4873 

The current state of kotsu benchmarking is that it restricts users to
input only IDs of a certain format, i.e. IDs that follow
`[username/](entity-name)-v(major).(minor)`.

We should be flexible and allow users to decide whether they want to use this
feature rather than enforcing it upon them. Therefore, this PR aims to:
1. Switch off validation of the input ID format by default.
2. Allow users to define their own ID format using a regex if they want to.

This is also a design for how to override, patch, and gain more control
over the kotsu framework within sktime. I started with a small change
like this one first to get a green light before proceeding with others.

1. Some of the existing tests are changed to reflect the default ID
format, which no longer contains "-v1" at the end.
2. Added a new test to ensure an error message is raised if an estimator ID
does not match a user-specified ID format.
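
A sketch of the behaviour this commit describes; the id_format constructor argument is an assumption about the merged interface, and the exact parameter name or placement may differ.

```python
# Sketch only: default accepts any ID string, opt-in regex validation.
# The `id_format` argument name is assumed, not confirmed from the merged PR.
from sktime.benchmarking.forecasting import ForecastingBenchmark
from sktime.forecasting.naive import NaiveForecaster

# default: no ID format is enforced, so any string is accepted
benchmark = ForecastingBenchmark()
benchmark.add_estimator(NaiveForecaster(), estimator_id="my naive baseline")

# opt-in: users can enforce their own regex-based ID convention
strict = ForecastingBenchmark(id_format=r"[\w-]+-v\d+")
strict.add_estimator(NaiveForecaster(), estimator_id="naive-v1")  # ok
# strict.add_estimator(NaiveForecaster(), estimator_id="naive")   # would raise
```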