This notebook covers the following topics:
1. Defining a time series forecasting `Task`
2. Multivariate and univariate forecasting tasks
3. Backtesting / evaluation using multiple cutoff dates
4. Evaluation on a `Benchmark` consisting of multiple tasks
5. Aggregating benchmark results

In [1]:
import warnings
from pathlib import Path

import datasets
from tqdm.auto import tqdm

import fev

warnings.simplefilter("ignore")
datasets.disable_progress_bars()

## Evaluation on a single Task
A `fev.Task` object contains all information that uniquely identifies a time series forecasting task.

### Data sources
Dataset stored on Hugging Face Hub: https://huggingface.co/datasets/autogluon/chronos_datasets

In [2]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="monash_cif_2016",
    horizon=12,
)

Dataset stored on S3

In [3]:
# Dataset consisting of a single parquet / arrow file
task = fev.Task(
    dataset_path="s3://autogluon/datasets/timeseries/m1_monthly/data.parquet",
    horizon=12,
)
# Dataset consisting of multiple parquet / arrow files
task = fev.Task(
    dataset_path="s3://autogluon/datasets/timeseries/m1_monthly/*.parquet",
    horizon=12,
)

Dataset stored locally

In [4]:
# Download dataset from HF Hub and save it locally
ds = datasets.load_dataset("autogluon/chronos_datasets", name="m4_hourly", split="train")
local_path = "/tmp/m4_hourly/data.parquet"
ds.to_parquet(local_path)

task = fev.Task(
    dataset_path=local_path,
    horizon=48,
)

### Covariates
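By default, all columns of type `Sequence` other than the timestamp and the target are interpreted as known dynamic covariates, which are available in both the past and the future data. All remaining columns are interpreted as static covariates.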

In [5]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
)
past_data, future_data = task.get_input_data(trust_remote_code=True)
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL'],
    num_rows: 2
})
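
The default rule can be pictured with plain Python records. The sketch below only illustrates the rule stated above and is not `fev`'s implementation; the helper name `split_covariate_columns`, the toy record, and the choice to treat `id`, `timestamp`, and `target` as reserved columns are all hypothetical.

```python
def split_covariate_columns(record, reserved=("id", "timestamp", "target")):
    """Group non-reserved columns into sequence-valued and scalar-valued ones."""
    dynamic, static = [], []
    for name, value in record.items():
        if name in reserved:
            continue
        # Sequence-valued columns vary over time; scalar columns do not.
        if isinstance(value, (list, tuple)):
            dynamic.append(name)
        else:
            static.append(name)
    return dynamic, static


# Toy record with one time-varying covariate and one static covariate
record = {
    "id": "A",
    "timestamp": ["2020-01-01", "2020-01-02"],
    "target": [1.0, 2.0],
    "price": [9.9, 8.5],
    "store_type": "small",
}
dynamic_columns, static_columns = split_covariate_columns(record)
print(dynamic_columns, static_columns)
```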


We can configure how the covariates are used as part of the task definition.

For example, here we specify that:
- columns `HUFL` and `HULL` are known only in the past
- columns `MUFL` and `MULL` are excluded from the dataset

In [6]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
    past_dynamic_columns=["HUFL", "HULL"],
    excluded_columns=["MUFL", "MULL"],
)

past_data, future_data = task.get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'LUFL', 'LULL'],
    num_rows: 2
})


### Predictions format
Each task expects predictions to follow a certain format that is specified by `task.predictions_schema`.

For point forecasting tasks (i.e., if `quantile_levels=None`), predictions must contain a single array of length `horizon` for each time series.

In [7]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="m4_hourly",
    horizon=48,
    eval_metric="MASE",
    seasonality=24,
)

In [8]:
task.predictions_schema

{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=48, id=None)}

For probabilistic forecasting tasks (i.e., if `quantile_levels` is provided), predictions must additionally contain a prediction for each quantile level.

In [9]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets",
    dataset_config="m4_hourly",
    horizon=48,
    seasonality=24,
    quantile_levels=[0.1, 0.5, 0.9],
    eval_metric="WQL",
)

In [10]:
task.predictions_schema

{'predictions': Sequence(feature=Value(dtype='float64', id=None), length=48, id=None),
 '0.1': Sequence(feature=Value(dtype='float64', id=None), length=48, id=None),
 '0.5': Sequence(feature=Value(dtype='float64', id=None), length=48, id=None),
 '0.9': Sequence(feature=Value(dtype='float64', id=None), length=48, id=None)}
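
To make the layout concrete, here is a minimal sketch of a single prediction record for a probabilistic task, built with plain Python. It assumes that the quantile keys are the string form of each level, as in the schema printed above. The helper name `naive_quantile_prediction` is hypothetical, and using the last observed value for every quantile is a placeholder to show the structure, not a sensible probabilistic forecast.

```python
def naive_quantile_prediction(history, horizon, quantile_levels):
    """Build one prediction record: a point forecast plus one array per quantile."""
    last_value = float(history[-1])
    record = {"predictions": [last_value] * horizon}
    for q in quantile_levels:
        # Keys such as "0.1" mirror the names in the printed schema
        record[str(q)] = [last_value] * horizon
    return record


record = naive_quantile_prediction([3.0, 5.0, 4.0], horizon=2, quantile_levels=[0.1, 0.5, 0.9])
print(record)
```

Every array in the record has length `horizon`, which is what the schema requires for each time series.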

## Multivariate and univariate forecasting
In all previous examples we considered univariate forecasting tasks, where the goal was to predict a single `target_column` into the future. 

`fev` also supports multivariate tasks, where the goal is to simultaneously predict multiple target columns. 

### "Real" multivariate tasks
We can define multivariate forecasting tasks by setting the `target_column` attribute to a `list` of column names.


In [11]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=3,
    target_column=["OT", "LUFL", "LULL"],
)

The input data created by the task in this case is identical to what would happen if we used `["OT", "LUFL", "LULL"]` as `past_dynamic_columns`.
That is, the target columns `["OT", "LUFL", "LULL"]` are available in `past_data` but not in `future_data`.

In [12]:
past_data, future_data = task.get_input_data()
print(past_data)
print(future_data)

Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT'],
    num_rows: 2
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL'],
    num_rows: 2
})


The only difference in a multivariate task is that the predictions must be formatted as a `datasets.DatasetDict` where
- each key corresponds to the name of the target column
- each value is a `datasets.Dataset` containing the predictions for this column in a format compatible with `task.predictions_schema`

In [13]:
def naive_forecast_multivariate(task: fev.Task) -> datasets.DatasetDict:
    """Predicts the last observed value in each multivariate column."""
    past_data, future_data = task.get_input_data()
    predictions = datasets.DatasetDict()
    for col in task.target_columns_list:
        predictions_for_column = []
        for ts in past_data:
            predictions_for_column.append({"predictions": [ts[col][-1] for _ in range(task.horizon)]})
        predictions[col] = datasets.Dataset.from_list(predictions_for_column)
    return predictions

In [14]:
predictions = naive_forecast_multivariate(task).cast(task.predictions_schema)
predictions

DatasetDict({
    OT: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
    LUFL: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
    LULL: Dataset({
        features: ['predictions'],
        num_rows: 2
    })
})

We can also look at the individual values in the `Dataset` objects

In [15]:
for col in task.target_columns_list:
    print(f"Predictions for column '{col}'")
    print(f"\t{predictions[col].to_list()}")

Predictions for column 'OT'
	[{'predictions': [11.043999671936035, 11.043999671936035, 11.043999671936035]}, {'predictions': [48.18349838256836, 48.18349838256836, 48.18349838256836]}]
Predictions for column 'LUFL'
	[{'predictions': [3.5329999923706055, 3.5329999923706055, 3.5329999923706055]}, {'predictions': [-10.331000328063965, -10.331000328063965, -10.331000328063965]}]
Predictions for column 'LULL'
	[{'predictions': [1.6749999523162842, 1.6749999523162842, 1.6749999523162842]}, {'predictions': [-1.2899999618530273, -1.2899999618530273, -1.2899999618530273]}]


The rest of the code can stay the same.

In [16]:
task.compute_metrics(predictions)

{'MASE': 1.1921320632260508}

### Converting multivariate tasks into univariate tasks
Alternatively, we can convert a multivariate task into a univariate one by creating multiple univariate time series from each multivariate time series.

The original `ETTh` dataset contains two multivariate time series with the following ids:

In [17]:
past_data["id"]

array(['ETTh1', 'ETTh2'], dtype='<U5')

If we set `generate_univariate_targets_from=["OT", "LUFL", "LULL"]`, `fev` will create 3 univariate time series from each time series in the original dataset.

In [18]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=3,
    generate_univariate_targets_from=["OT", "LUFL", "LULL"],
)

In [19]:
past_data, future_data = task.get_input_data()
print(past_data)
print(future_data)

num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.


Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL', 'target'],
    num_rows: 6
})
Dataset({
    features: ['id', 'timestamp', 'HUFL', 'HULL', 'MUFL', 'MULL'],
    num_rows: 6
})


The new dataset contains 6 items (2 original ids $\times$ 3 target columns).

In [20]:
past_data["id"]

array(['ETTh1_LUFL', 'ETTh1_LULL', 'ETTh1_OT', 'ETTh2_LUFL', 'ETTh2_LULL',
       'ETTh2_OT'], dtype='<U10')
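
The transformation can be pictured as follows. This is a sketch in plain Python, not `fev`'s implementation: the helper name `expand_to_univariate` and the toy values are hypothetical, and the `<id>_<column>` naming and alphabetical ordering are inferred from the ids printed above.

```python
def expand_to_univariate(records, target_columns):
    """Create one univariate record per (original record, target column) pair."""
    expanded = []
    for record in records:
        # Columns that are neither the id nor a target are copied unchanged
        shared = {k: v for k, v in record.items() if k not in target_columns and k != "id"}
        for col in sorted(target_columns):
            expanded.append({"id": f"{record['id']}_{col}", **shared, "target": record[col]})
    return expanded


# Toy values, not taken from the ETTh dataset
records = [
    {"id": "ETTh1", "OT": [1.0, 2.0], "LUFL": [3.0, 4.0], "HUFL": [5.0, 6.0]},
    {"id": "ETTh2", "OT": [7.0, 8.0], "LUFL": [9.0, 10.0], "HUFL": [11.0, 12.0]},
]
expanded = expand_to_univariate(records, ["OT", "LUFL"])
ids = [r["id"] for r in expanded]
print(ids)
```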

We can confirm that the naive forecast achieves the same MASE score (up to floating-point rounding) on this equivalent representation of the multivariate task.

In [21]:
def naive_forecast_univariate(task: fev.Task) -> list[dict]:
    """Predicts the last observed value."""
    past_data, future_data = task.get_input_data()
    predictions = []
    for ts in past_data:
        predictions.append({"predictions": [ts[task.target_column][-1] for _ in range(task.horizon)]})
    return predictions

In [22]:
task.compute_metrics(naive_forecast_univariate(task))

{'MASE': 1.1921320632260506}
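
For readers unfamiliar with the metric, MASE scales the mean absolute error of the forecast by the in-sample mean absolute error of a seasonal naive forecast. The single-series sketch below uses toy numbers and the textbook definition; how `fev` aggregates the score across series or handles edge cases such as a zero scale is not covered here and may differ.

```python
def mase(y_train, y_test, y_pred, seasonality=1):
    """Mean absolute error scaled by the in-sample seasonal naive error."""
    forecast_error = sum(abs(a - p) for a, p in zip(y_test, y_pred)) / len(y_test)
    naive_errors = [
        abs(y_train[t] - y_train[t - seasonality]) for t in range(seasonality, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return forecast_error / scale


y_train = [1.0, 2.0, 4.0, 7.0]
y_test = [8.0, 10.0]
y_pred = [y_train[-1]] * len(y_test)  # naive forecast: repeat the last observed value
score = mase(y_train, y_test, y_pred)
print(score)
```

A score below 1 means the forecast beats the in-sample seasonal naive error on average; a score above 1 means it does worse.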

## Backtesting & custom cutoffs
By default, the train/test split is generated as follows:
- test set contains the last `horizon` time steps of each time series
- train set contains everything up to the last `horizon` time steps of each time series

We can create the train/test splits at custom points in the time series using the `cutoff` argument.

The default behavior corresponds to setting `cutoff = -horizon`:

In [23]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
    cutoff=-24,
)

We can set `cutoff` to a positive or negative integer. In this case, the training data will correspond to `y[:cutoff]` and the test set will contain the `horizon` time steps that follow, i.e., `y[cutoff : cutoff + horizon]` (or `y[cutoff:]` when `cutoff = -horizon`).
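
The integer case can be sketched with a plain Python list. The helper name `split_at_cutoff` is hypothetical and this is not `fev`'s code. Because a literal `y[-horizon:0]` is empty in Python, the sketch first converts a negative cutoff to a non-negative index.

```python
def split_at_cutoff(y, cutoff, horizon):
    """Split a sequence into (train, test) at an integer cutoff."""
    # Convert a negative cutoff (counted from the end) to a non-negative index
    start = cutoff if cutoff >= 0 else len(y) + cutoff
    return y[:start], y[start : start + horizon]


y = list(range(10))
train_default, test_default = split_at_cutoff(y, cutoff=-3, horizon=3)
train_custom, test_custom = split_at_cutoff(y, cutoff=4, horizon=3)
print(train_default, test_default)
print(train_custom, test_custom)
```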

We can also set `cutoff` to a datetime-like string. In this case, `cutoff` will be the last timestamp in the training data.

In [24]:
task = fev.Task(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
    cutoff="2017-01-01",
)
past_data, future_data = task.get_input_data()
print(f"Last train timestamp: {past_data[0]['timestamp'][-1]}")
print(f"First test timestamp: {future_data[0]['timestamp'][0]}")

Last train timestamp: 2017-01-01T00:00:00.000000000
First test timestamp: 2017-01-01T01:00:00.000000000


We can create tasks corresponding to multiple backtests by providing different values for the `cutoff`:

In [25]:
tasks = [
    fev.Task(
        dataset_path="autogluon/chronos_datasets_extra",
        dataset_config="ETTh",
        horizon=24,
        target_column="OT",
        cutoff="2017-01-01",
    ),
    fev.Task(
        dataset_path="autogluon/chronos_datasets_extra",
        dataset_config="ETTh",
        horizon=24,
        target_column="OT",
        cutoff="2017-02-07",
    ),
    fev.Task(
        dataset_path="autogluon/chronos_datasets_extra",
        dataset_config="ETTh",
        horizon=24,
        target_column="OT",
        cutoff="2017-06-03",
    ),
]

The `fev.TaskGenerator` class provides a more concise way to create multiple related configurations, e.g., for backtesting:

In [26]:
task_generator = fev.TaskGenerator(
    dataset_path="autogluon/chronos_datasets_extra",
    dataset_config="ETTh",
    horizon=24,
    target_column="OT",
    variants=[
        {"cutoff": "2017-01-01"},
        {"cutoff": "2017-02-07"},
        {"cutoff": "2017-06-03"},
    ],
)
tasks = task_generator.generate_tasks()
for i, task in enumerate(tasks):
    print(f"Task {i}")
    past_data, future_data = task.get_input_data()
    print(f"\tLast train timestamp: {past_data[0]['timestamp'][-1]}")
    print(f"\tFirst test timestamp: {future_data[0]['timestamp'][0]}")

Task 0
	Last train timestamp: 2017-01-01T00:00:00.000000000
	First test timestamp: 2017-01-01T01:00:00.000000000
Task 1
	Last train timestamp: 2017-02-07T00:00:00.000000000
	First test timestamp: 2017-02-07T01:00:00.000000000
Task 2
	Last train timestamp: 2017-06-03T00:00:00.000000000
	First test timestamp: 2017-06-03T01:00:00.000000000


If we don't specify `variants`, then `TaskGenerator.generate_tasks()` will produce a single `Task`.

In [27]:
task_generator = fev.TaskGenerator(
    dataset_path="my_dataset",
    dataset_config="my_config",
    horizon=12,
)
task_generator.generate_tasks()

[Task(dataset_path='my_dataset', dataset_config='my_config', horizon=12, cutoff=-12, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[])]

If we do specify `variants`, then `TaskGenerator.generate_tasks()` will produce one `Task` for each entry in `variants`.

The values in each variant dict override the corresponding parameters of the `TaskGenerator` for that task.

In [28]:
task_generator = fev.TaskGenerator(
    dataset_path="my_dataset",
    dataset_config="my_config",
    variants=[
        {"horizon": 12},
        {"horizon": 24},
    ],
)
task_generator.generate_tasks()

[Task(dataset_path='my_dataset', dataset_config='my_config', horizon=12, cutoff=-12, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff=-24, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[])]
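
Conceptually, each variant behaves like a dictionary merge in which the variant's keys take precedence over the shared parameters. The sketch below illustrates that idea with plain dicts. The helper name `expand_variants` is hypothetical, and it does not reproduce `fev` behavior such as deriving the default `cutoff` from `horizon`.

```python
def expand_variants(base, variants=None):
    """Return one config per variant, with variant keys overriding the base."""
    if not variants:
        return [dict(base)]
    return [{**base, **variant} for variant in variants]


base = {"dataset_path": "my_dataset", "dataset_config": "my_config", "horizon": 12}
configs = expand_variants(base, [{"horizon": 24}, {"cutoff": "2017-01-01"}])
for config in configs:
    print(config)
```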

Alternatively, we can use the keywords `num_rolling_windows`, `initial_cutoff` and `rolling_step_size` to create multiple rolling evaluation tasks from a single `TaskGenerator`.

We can use integer-based cutoffs:

In [29]:
task_generator = fev.TaskGenerator(
    dataset_path="my_dataset",
    dataset_config="my_config",
    horizon=24,
    num_rolling_windows=3,
    initial_cutoff=-96,
    rolling_step_size=None,  # defaults to `horizon`
)
task_generator.generate_tasks()

[Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff=-96, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff=-72, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff=-48, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_co

Or timestamp-based cutoffs:

In [30]:
task_generator = fev.TaskGenerator(
    dataset_path="my_dataset",
    dataset_config="my_config",
    horizon=24,
    num_rolling_windows=3,
    initial_cutoff="2024-01-04",
    rolling_step_size="12h",  # required if `initial_cutoff` is a string
)
task_generator.generate_tasks()

[Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff='2024-01-04T00:00:00', lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff='2024-01-04T12:00:00', lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='my_dataset', dataset_config='my_config', horizon=24, cutoff='2024-01-05T00:00:00', lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=
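
The cutoffs in the two outputs above follow a simple pattern: window `i` uses `initial_cutoff + i * rolling_step_size`. The sketch below reproduces those values with the standard library. The helper name `rolling_cutoffs` is hypothetical, and the `"12h"` step is written directly as a `timedelta` instead of being parsed from a string.

```python
from datetime import datetime, timedelta


def rolling_cutoffs(initial_cutoff, step, num_windows):
    """Cutoff for window i is initial_cutoff + i * step."""
    return [initial_cutoff + i * step for i in range(num_windows)]


# Integer cutoffs: the step defaults to the horizon (24) in the example above
int_cutoffs = rolling_cutoffs(-96, 24, 3)

# Timestamp cutoffs with a 12-hour step
ts_cutoffs = [
    c.isoformat() for c in rolling_cutoffs(datetime(2024, 1, 4), timedelta(hours=12), 3)
]
print(int_cutoffs)
print(ts_cutoffs)
```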

## Evaluation on a Benchmark consisting of multiple tasks
A `fev.Benchmark` object is essentially a collection of `Task`s.

We can create a benchmark from a list of dictionaries. Each dictionary is interpreted as a `fev.TaskGenerator`.

In [31]:
task_generators = [
    {
        "dataset_path": "autogluon/chronos_datasets",
        "dataset_config": "monash_m3_monthly",
        "horizon": 18,
        "seasonality": 12,
        "eval_metric": "MASE",
    },
    {
        "dataset_path": "autogluon/chronos_datasets",
        "dataset_config": "monash_electricity_weekly",
        "horizon": 8,
        "quantile_levels": [0.1, 0.5, 0.9],
        "eval_metric": "WQL",
        "variants": [
            {"cutoff": "2013-01-01"},
            {"cutoff": "2014-01-01"},
        ],
    },
]
benchmark = fev.Benchmark.from_list(task_generators)

Alternatively, we can load a benchmark from a YAML file:

In [32]:
benchmark_path = Path(fev.__file__).parents[2] / "benchmarks" / "example" / "tasks.yaml"
# Show contents of the benchmark YAML file
!cat {benchmark_path}

tasks:
- dataset_path: autogluon/chronos_datasets
  dataset_config: monash_m1_yearly
  horizon: 8
- dataset_path: autogluon/chronos_datasets
  dataset_config: monash_electricity_weekly
  horizon: 8
  seasonality: 1
  variants:
  - cutoff: "2013-01-01"
  - cutoff: "2014-01-01"


In [33]:
benchmark = fev.Benchmark.from_yaml(benchmark_path)

In [34]:
benchmark.tasks

[Task(dataset_path='autogluon/chronos_datasets', dataset_config='monash_m1_yearly', horizon=8, cutoff=-8, lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='autogluon/chronos_datasets', dataset_config='monash_electricity_weekly', horizon=8, cutoff='2013-01-01T00:00:00', lead_time=1, min_context_length=1, max_context_length=None, seasonality=1, eval_metric='MASE', extra_metrics=[], quantile_levels=None, id_column='id', timestamp_column='timestamp', target_column='target', generate_univariate_targets_from=None, past_dynamic_columns=[], excluded_columns=[]),
 Task(dataset_path='autogluon/chronos_datasets', dataset_config='monash_electricity_weekly', horizon=8, cutoff='2014-01-01T00:00:00', lead_time=1, min_context_length=1, max_conte

Now let's evaluate some simple forecasting models on this toy benchmark.

In [35]:
!pip install -q statsforecast

In [36]:
from statsforecast.models import ARIMA, SeasonalNaive, Theta


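def predict_with_model(task: fev.Task, model_name: str = "seasonal_naive") -> list[dict]:
    """Generate point forecasts for every time series in the task with a statsforecast model."""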
    past_data, future_data = task.get_input_data()
    if model_name == "seasonal_naive":
        model = SeasonalNaive(season_length=task.seasonality)
    elif model_name == "theta":
        model = Theta(season_length=task.seasonality)
    elif model_name == "arima":
        model = ARIMA(season_length=task.seasonality)
    else:
        raise ValueError(f"Unknown model_name: {model_name}")

    predictions = []
    for ts in past_data:
        predictions.append({"predictions": model.forecast(y=ts[task.target_column], h=task.horizon)["mean"]})
    return predictions

In [37]:
import time

summaries = []
for task in tqdm(benchmark.tasks, desc="Tasks completed"):
    for model_name in ["seasonal_naive", "arima", "theta"]:
        start_time = time.time()
        predictions = predict_with_model(task, model_name=model_name)
        infer_time_s = time.time() - start_time
        eval_summary = task.evaluation_summary(
            predictions, model_name=model_name, inference_time_s=infer_time_s, training_time_s=0.0
        )

        summaries.append(eval_summary)


In [38]:
fev.leaderboard(summaries, baseline_model="seasonal_naive")

| model_name | gmean_relative_error | avg_rank | avg_inference_time_s | median_inference_time_s | avg_training_time_s | median_training_time_s | training_corpus_overlap | num_failures |
|---|---|---|---|---|---|---|---|---|
| theta | 0.914107 | 1.0 | 8.867179 | 1.160022 | 0.0 | 0.0 | 0.0 | 0 |
| seasonal_naive | 1.0 | 2.0 | 4.310597 | 4.382275 | 0.0 | 0.0 | 0.0 | 0 |
| arima | 1.870027 | 3.0 | 0.361961 | 0.39433 | 0.0 | 0.0 | 0.0 | 0 |


The `leaderboard` function aggregates each model's performance across all tasks into a single score.

We can investigate the performance on individual tasks using the `pivot_table` function:

In [39]:
fev.pivot_table(summaries)

| dataset_name | arima | seasonal_naive | theta |
|---|---|---|---|
| chronos_datasets_monash_electricity_weekly | 3.05693 | 1.573758 | 1.497915 |
| chronos_datasets_monash_m1_yearly | 10.236634 | 5.89889 | 4.988582 |


Recall that our benchmark definition contains two tasks for `monash_electricity_weekly` with different cutoff dates. The above cell averaged the results across both cutoff dates.

We can have a look at the results for individual cutoffs as follows.

In [40]:
# You can group the results by any task properties such as `eval_metric`, `horizon`, etc.
fev.pivot_table(summaries, task_columns=["dataset_name", "cutoff"])

| dataset_name | cutoff | arima | seasonal_naive | theta |
|---|---|---|---|---|
| chronos_datasets_monash_electricity_weekly | 2013-01-01T00:00:00 | 2.907207 | 1.520114 | 1.401086 |
| chronos_datasets_monash_electricity_weekly | 2014-01-01T00:00:00 | 3.206652 | 1.627403 | 1.594743 |
| chronos_datasets_monash_m1_yearly | -8 | 10.236634 | 5.89889 | 4.988582 |
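
The `gmean_relative_error` values reported by `fev.leaderboard` earlier are consistent with a geometric mean, taken over the three tasks, of each model's error divided by the baseline's error. The sketch below reconstructs that calculation from the per-task scores in the table above. It is inferred from the printed numbers, not taken from `fev`'s source, and the helper name `gmean_relative_error` is only borrowed from the column name.

```python
import math

# Per-task scores copied from the table above
# (electricity_weekly @ 2013, electricity_weekly @ 2014, m1_yearly)
scores = {
    "arima": [2.907207, 3.206652, 10.236634],
    "seasonal_naive": [1.520114, 1.627403, 5.89889],
    "theta": [1.401086, 1.594743, 4.988582],
}


def gmean_relative_error(model_scores, baseline_scores):
    """Geometric mean of per-task error ratios relative to the baseline."""
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


relative = {
    name: gmean_relative_error(values, scores["seasonal_naive"])
    for name, values in scores.items()
}
for name, value in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.4f}")
```

A value below 1 indicates that the model is, on average, more accurate than the baseline. The geometric mean keeps a single task with a large error scale (here `monash_m1_yearly`) from dominating the aggregate.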


Both the `leaderboard()` and `pivot_table()` functions accept one or more evaluation summaries in any of the following formats:
- `pandas.DataFrame`
- list of dictionaries
- paths to JSONL (`orient="records"`) or CSV files

Here is an example of how we can work with URLs of CSV files:

In [41]:
summaries = [
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/auto_arima.csv",
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/auto_theta.csv",
    "https://raw.githubusercontent.com/autogluon/fev/refs/heads/main/benchmarks/chronos_zeroshot/results/seasonal_naive.csv",
]
fev.leaderboard(summaries, metric_column="MASE")

| model_name | gmean_relative_error | avg_rank | avg_inference_time_s | median_inference_time_s | avg_training_time_s | median_training_time_s | training_corpus_overlap | num_failures |
|---|---|---|---|---|---|---|---|---|
| auto_theta | 0.858722 | 1.703704 | 286.465526 | 23.892088 |  |  | 0.0 | 0 |
| auto_arima | 0.869449 | 1.703704 | 1674.733082 | 75.8837 |  |  | 0.0 | 0 |
| seasonal_naive | 1.0 | 2.592593 | 2.41595 | 0.096449 |  |  | 0.0 | 0 |


In [42]:
fev.pivot_table(summaries, task_columns="dataset_config", metric_column="WQL")

| dataset_config | auto_arima | auto_theta | seasonal_naive |
|---|---|---|---|
| ETTh | 0.089012 | 0.132979 | 0.12209 |
| ETTm | 0.10499 | 0.078587 | 0.141348 |
| dominick | 0.484773 | 0.485493 | 0.452916 |
| ercot | 0.041214 | 0.041004 | 0.036604 |
| exchange_rate | 0.010667 | 0.009714 | 0.012984 |
| m4_quarterly | 0.079384 | 0.079077 | 0.118648 |
| m4_yearly | 0.125041 | 0.11464 | 0.161439 |
| m5 | 0.61652 | 0.636228 | 1.024088 |
| monash_australian_electricity | 0.066902 | 0.054564 | 0.083695 |
| monash_car_parts | 1.333026 | 1.336601 | 1.599952 |
