[ENH] Enhancements to kotsu Benchmarking Framework #4873
just dumping the hackmd notebook here for reference, as those notebooks have a tendency to degrade and/or disappear.

Design notes for benchmarking framework

current benchmark vignette

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark
benchmark = ForecastingBenchmark()
# 2
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.arima import AutoARIMA  # added: used below
from sktime.forecasting.ets import AutoETS  # added: used below
benchmark.add_estimator(
    estimator=NaiveForecaster(strategy="mean", window_length=3),
    estimator_id="Naive-mean-3-v1",
)
benchmark.add_estimator(estimator=AutoARIMA(), estimator_id="AutoARIMA-v1")
benchmark.add_estimator(estimator=AutoETS(auto=True), estimator_id="AutoETS-v1")
benchmark.add_estimator(
    estimator=forecaster_with_differencer.clone(),  # assumed defined earlier in the notes
    estimator_id="LightGBM-v1",
)
# 3
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]
benchmark.add_task(
    load_shampoo_sales,
    cv,
    scorers,
)
# 4
results_df = benchmark.run(output_file="results.csv")
results_df.set_index("model_id").iloc[:, -2:].style.format("{:.1%}")

comments MR
comments FK (based on discussion with MR earlier)
FK speculative interface design

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark
benchmark = ForecastingBenchmark()
# alternative, can load existing benchmark
# benchmark = ForecastingBenchmark.load(serialization_ref)
# 2
from sktime.forecasting.naive import NaiveForecaster
benchmark.add(estimator=NaiveForecaster(strategy="mean", window_length=3))
# this stores the NaiveForecaster in the registry of ForecastingBenchmark
# returns reference to self (benchmark)
# pretty-prints list(); the default list() is list("all"), which returns multiple DataFrames and prints all
benchmark.add(estimator=AutoARIMA(), name="supercalifragilistic")
# stores the AutoARIMA model under the given name - can be any string
benchmark.add(estimator=AutoETS(auto=True), name="supercalifragilistic")
# raises a warning and stores under supercalifragilistic_2
benchmark + Prophet()  # or +=? does the same as benchmark.add (see the sketch after this block)
benchmark.list("estimator")
# returns a pd.DataFrame with columns
# type (str), name (str), UID (str)
# lists estimators that have been added
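A hedged sketch of the `+` / `+=` idea above - the dunder methods and class below are design assumptions, not existing sktime API:

```python
class _OperatorBenchmark:
    """Toy illustration of ``benchmark + estimator`` - hypothetical, not sktime code."""

    def __init__(self):
        self.estimators = []

    def add(self, estimator, name=None):
        # default name from the class, as in the speculative design above
        self.estimators.append((name or type(estimator).__name__, estimator))
        return self  # returning self enables chaining

    def __add__(self, estimator):
        # ``benchmark + est`` delegates to .add
        return self.add(estimator)

    __iadd__ = __add__  # so ``benchmark += Prophet()`` also works
```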
# 3
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]
benchmark.add(  # polymorphic, detects type(?)
    # perhaps we want internal dispatch to add_task, add_estimator etc (sketched below)
    load_shampoo_sales,  # data loader function
    cv,  # sktime splitter
    scorers,
)
y = load_airline()
benchmark.add(
    y,  # pandas Series or anything in sktime data format
    # can omit cv, default is ExpandingWindowSplitter
    scorers,
)
benchmark.add(
    MonashLoad("dataname"),  # speculative BaseData class
    cv,
    # can also omit scorers, in that case no scoring takes place
)
benchmark.add(
    M5task(),  # speculative BaseTask class
)
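A rough sketch of how the polymorphic `add` could dispatch internally - the mixin and checks are illustrative assumptions, not existing sktime code; it presumes `add_estimator` and `add_task` exist on the concrete class:

```python
from sktime.base import BaseEstimator

class _PolymorphicAddMixin:
    """Hypothetical dispatch for a polymorphic ``add``."""

    def add(self, obj, *args, **kwargs):
        if isinstance(obj, BaseEstimator):
            # forecasters etc. all inherit from sktime's BaseEstimator
            return self.add_estimator(obj, *args, **kwargs)
        if callable(obj):
            # dataset loader function such as load_shampoo_sales
            return self.add_task(obj, *args, **kwargs)
        raise TypeError(f"cannot add object of type {type(obj).__name__}")
```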
benchmark.list("task")
# returns a pd.DataFrame with columns
# type (str), name (str), UID (str), maybe metadata? dataset, cv, etc
# lists tasks that have been added
# 4 - wip
results_df = benchmark.run(
    output="results.csv",
)

HA comments
MR interface proposal

# Actual imports
from sktime.forecasting.naive import NaiveForecaster
# Fake imports
from sktime.benchmarking import ModelRegistry, ForecastingBenchmark
# Model registry concept: decouple this from benchmarking & make persistable
# Option 1)
# Create an empty registry and populate one by one
model_registry = ModelRegistry()
model_registry.add_model(model=NaiveForecaster(), name='Naive') # any name
# Resolves name clashes by providing a uuid for each model
model_registry.add_model(model=NaiveForecaster('drift'), name='Naive')
model_registry # Returns a df with: str(model), name, uuid
model_registry.save('my_model_registry')
# Option 2)
# Load a saved registry
models = ModelRegistry().load("my_model_registry")
# Option 3)
# Create a registry from an iterable of models
list_models = [NaiveForecaster(), NaiveForecaster(strategy='drift')]
dict_models = {'naive': NaiveForecaster(), 'naive drift': NaiveForecaster(strategy='drift')}
model_registry = ModelRegistry().add_models(list_models) # Auto names
model_registry = ModelRegistry().add_models(dict_models) # Predefined names
# It should be able to apply common modifiers to all models in the registry
# Note: very unsure how to define this interface in a good way
model_registry.set({'reconciler': 'bu'})
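A minimal sketch of the ModelRegistry concept above, assuming UUID-based name-clash resolution; the class and method names are hypothetical, not sktime code:

```python
import uuid

import pandas as pd

class ModelRegistry:
    """Hypothetical registry sketch - clashing names are tolerated
    because every entry gets a UUID that disambiguates it."""

    def __init__(self):
        self._entries = []  # (uuid, name, model) triples

    def add_model(self, model, name=None):
        name = name or type(model).__name__
        self._entries.append((str(uuid.uuid4()), name, model))
        return self

    def to_frame(self):
        # matches the "df with: str(model), name, uuid" idea above
        return pd.DataFrame(
            [(str(m), n, u) for u, n, m in self._entries],
            columns=["model", "name", "uuid"],
        )
```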
# Backtest runner concept: only for one dataset and cv
backtest = ForecastingBenchmark()
backtest.models = model_registry

comments
FK comments
Hazrul interface proposal

# 1
from sktime.benchmarking.forecasting import ForecastingBenchmark
benchmark = ForecastingBenchmark()
# 2
from sktime.forecasting.naive import NaiveForecaster
benchmark.add_estimator(
estimator=NaiveForecaster(strategy="mean", window_length=3),
estimator_id="Naive-mean-3-v1",
)
benchmark.add_estimator(estimator=AutoARIMA(), estimator_id="AutoARIMA-v1")
benchmark.add_estimator(estimator=AutoETS(auto=True), estimator_id="AutoETS-v1")
benchmark.add_estimator(
estimator=forecaster_with_differencer.clone(), estimator_id="LightGBM-v1"
)
# 3
cv = ExpandingWindowSplitter(initial_window=24, fh=fh, step_length=2)
scorers = [smape]
benchmark.add_dataset(dataset)  # dataset: callable, in-memory object, or pandas DataFrame
benchmark.add_cv(cv)  # add an sktime splitter
benchmark.add_scores(scores)  # scores: list[callable | BaseMetric]
# 4
results_df = benchmark.run(
    in_memory=True,
    output_file="file_name",
)
# parallelise the run behind the scenes
# allow outputting different file formats
# should also output the estimator predictions and ground-truth values
benchmark.visualise(options)  # options: str, name of the graph to visualise; returns a plot
# some graphs in mind:
# 1. point-prediction performance of many models on a radar plot
# 2. boxplots - per-fold CV score range and mean for each model (sketched below)
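A minimal sketch of the boxplot idea (graph 2.) - the `model_id` / `fold_*` column layout is an assumption, not the actual benchmark output schema:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_cv_scores(results: pd.DataFrame):
    """Boxplot of per-fold CV scores, one box per model."""
    fold_cols = [c for c in results.columns if c.startswith("fold_")]
    # transpose so each model becomes one column, i.e. one box
    ax = results.set_index("model_id")[fold_cols].T.boxplot()
    ax.set_ylabel("score (e.g. sMAPE)")
    plt.tight_layout()
    return ax
```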
benchmark.test(options)  # options: str, e.g. "rank", "t-test"
# returns the test result for each model
benchmark.return_estimator(estimator_id)  # estimator_id: str
# returns the chosen fitted model instance
results_df.set_index("model_id").iloc[:, -2:].style.format("{:.1%}")

FK comments
Some reflections upon your points, @hazrulakmal - I think these are indeed the distillation of some of the most pertinent design points here, in the context of the existing framework! That said, the design discussion notebook contains some additional ideas which may also be worth considering thereafter. Reflections below on:
high-level design, rework

On a high level, I would start from the premise that we do not assume a long-term dependency on kotsu. Regarding the short-term changes, I would suggest:
addressing the 5 issues listed by @hazrulakmal
…pt multiple estimators (#4877)

See #4873. The current ForecastingBenchmark interface requires manually looping over multiple estimators to benchmark multiple models, as the add_estimator interface only accepts one estimator per call. This PR provides an option for users to add multiple models with a single add_estimator call while maintaining the existing interface. The interface for multiple models looks as follows:

```python
from sktime.benchmarking.forecasting import ForecastingBenchmark
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.arima import AutoARIMA

benchmark = ForecastingBenchmark()

# for auto naming
models = [NaiveForecaster(), AutoARIMA()]
benchmark.add_estimator(models)

# for custom naming
models = {"Arima-v1": AutoARIMA(), "Naive-v1": NaiveForecaster()}
benchmark.add_estimator(models)

# existing interface still works
benchmark.add_estimator(NaiveForecaster())
benchmark.add_estimator(AutoARIMA())
```
After working on this enhancement for quite some time, I believe kotsu is not desirable in the long run, for a few reasons.

Personally, moving forward, I would suggest a rework of the benchmarking module from scratch; it has to meet the following high-level requirements:

Before I embark on this journey, I think we need to do some rearrangement of the benchmarking module for a better developer experience. There seems to be a legacy benchmarking module, the kotsu benchmarking module, and this new upcoming rework, so we need to handle this migration. I suggest we isolate them in separate directories and interface the imports through the `__init__` file so that the status quo is maintained. In the long run, once one of the modules matures, we can deprecate the rest so that maintenance becomes more manageable. Moving forward, I will focus on the rework, especially the low-hanging fruit first, but I will also keep kotsu enhancements in mind. However, I will prioritize the rework over kotsu.
In order to do both, saving the intermediate results (fitted estimators & predictions) requires a different file format (not CSV, as per the status quo). I consider the approach of rewriting the … Points 4 and 5 are completed in PR #5130 and PR #4877 respectively.
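As a hedged illustration of the file-format point: per-fold raw predictions could be kept losslessly in Parquet, which CSV handles poorly. The file name and columns below are made up:

```python
import pandas as pd

# hypothetical intermediate results: one row per (model, fold)
intermediate = pd.DataFrame(
    {
        "model_id": ["Naive-v1", "Naive-v1"],
        "fold": [0, 1],
        "y_pred": [[101.2, 99.8], [100.4, 98.9]],  # raw predictions per fold
        "smape": [0.12, 0.10],
    }
)
intermediate.to_parquet("results.parquet")  # needs pyarrow or fastparquet
restored = pd.read_parquet("results.parquet")
```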
Thanks for your detailed and well measured analysis, @hazrulakmal! I'm sorry for the back/forth here; unfortunately that's a side effect of working with a pre-existing code base.
For me, this point 2 of yours would already be reason enough to step away from it, as we have ended up taking a dependency that is now basically a liability. Even if we commit to maintaining it, it is not clear whether we can obtain the maintenance credentials to do so. Similarly, work towards overrides looks increasingly like a time-sink risk from the same perspective.
Agreed! The basic requirements seem like the most important ones to me. For reproducibility, I assume you also meant the raw predictions etc.
Yes, that seems like a sensible strategy - though with a minor risk that we end up with three benchmarking frameworks if we don't manage the transition right.
Imo absolutely sensible.
Towards #4873. The current state of kotsu benchmarking is that it restricts users to IDs of a certain format, i.e. IDs that follow `[username/](entity-name)-v(major).(minor)`. We should be flexible and allow users to decide whether they want to use this feature, rather than enforcing it on them. Therefore, this PR aims to: 1. Switch off the input ID format check by default. 2. Allow users to define their own ID format using regex if they want to. This is also a design exercise in how to override, patch, and gain more control over the kotsu framework within sktime. I started with a small change like this one first to get a green light before proceeding with others. 1. Some of the existing tests are changed to reflect the default ID format, which no longer requires "-v1" at the end. 2. Added a new test to ensure an error message is raised if a user-specified ID format is not matched.
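A sketch of the opt-in validation this PR describes - the pattern and helper below are illustrative, not kotsu's actual implementation:

```python
import re

# rough approximation of "[username/](entity-name)-v(major).(minor)"
DEFAULT_ID_PATTERN = r"^(?:[\w-]+/)?[\w.-]+-v\d+(?:\.\d+)?$"

def check_id(entity_id, id_format=None):
    """Raise if entity_id violates a user-supplied regex; no-op by default."""
    if id_format is None:  # point 1 above: validation is off by default
        return
    if not re.fullmatch(id_format, entity_id):
        raise ValueError(
            f"ID {entity_id!r} does not match the required format {id_format!r}"
        )

check_id("any string works")  # fine: no format enforced
check_id("hazrulakmal/airline-v1.2", DEFAULT_ID_PATTERN)  # fine
```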
A great initial effort was made by @DBCerigo @TNTran92 @alex-hh to create a user-friendly sktime benchmarking interface for quick univariate forecasting benchmarks. The decision to include Kotsu as part of the benchmarking module was made in issue #2777.
After the first kotsu integration PR was merged, the work was then put on hold, but there are still future tasks to complete, as suggested by the original author in issue #2804. I plan to continue the excellent work that has been done, and this issue is intended to keep track of what needs to be done.
Additionally, there were some issues and suggestions for improving the current implementation, which were discussed by @fkiraly, @marrov, and myself previously here. Below is a summary of the discussion - please do add anything if I missed something out.
1. `ForecastingBenchmark` force-writes results to the hard drive, but it should allow users to save results to different storage (e.g. cloud) or formats (e.g. Parquet).
2. `ForecastingBenchmark` returns results as aggregated fold metrics, without providing intermediate results. Users have no control over further evaluation processing steps, such as adding new metrics to the benchmark after it has run, visualizing predictions, or computing different aggregations like the median.
3. The `ForecastingBenchmark` dataset loader assumes a callable from `sktime.datasets`. Does this mean it only accepts datasets that exist in `sktime.datasets`? Is this intended? What about external datasets from users?
4. `estimator_id` is restricted to certain strings without clear user signposting. Should all strings be allowed?
5. The `add_estimator` method only takes a single estimator per call. It could be modified to accept a list of estimators.

My initial thoughts on the problems, in chronological order:
1. `evaluate` itself has an option to return raw predictions, so we may need to make some tweaks to the current forecasting benchmarking to allow for this option (see the sketch after this list).
2. The `estimator_id` naming convention should follow the format `[username/](entity-name)-v(major).(minor)` (e.g. hazrulakmal/airline-v1.2 or airline-v1), as required by kotsu. To allow all kinds of strings, this has to be changed in kotsu itself.
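For reference, a minimal sketch of `evaluate` with raw predictions - `return_data` exists in `sktime.forecasting.model_evaluation.evaluate`, though the exact output column names may vary by version:

```python
from sktime.datasets import load_airline
from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.model_selection import ExpandingWindowSplitter
from sktime.forecasting.naive import NaiveForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError

y = load_airline()
cv = ExpandingWindowSplitter(initial_window=24, fh=[1, 2, 3], step_length=12)
results = evaluate(
    forecaster=NaiveForecaster(strategy="mean"),
    y=y,
    cv=cv,
    scoring=MeanAbsolutePercentageError(symmetric=True),  # sMAPE
    return_data=True,  # keeps per-fold y_train / y_test / y_pred columns
)
```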
Moving forwards, this is what I want to do:

1. Rework `BaseBenchmark`: allow `add_estimator` to accept multiple estimators (#4877)
2. Allow unrestricted `estimator_id` strings: [ENH] Allow unrestricted ID string for BaseBenchmarking #5130