Evaluating trained model on new dataset efficiently #1531

Closed
eivindeb opened this issue Jan 31, 2023 · 5 comments
Labels
question Further information is requested

Comments

@eivindeb

Hi,

I was wondering what the intended interface is for taking a trained model and efficiently using it on a new dataset to evaluate metrics and/or obtain predictions.

It seems that this is what the .backtest and .historical_forecasts methods are for; however, they run very slowly. This appears to be because they construct a dataset containing a single sample, predict on it, then construct a new dataset for the next sample, and so on.

For torch models I am able to emulate backtest using trainer.validate and the ._build_train_dataset function, as in the attached code example. On my laptop the backtesting method takes ~10x longer than trainer.validate (129.6s vs 15.2s), and the gap increases rapidly with larger amounts of data. The datasets that I want to use darts on contain millions of data samples, so the backtest and historical_forecast methods are too slow to be usable. Is there something I have missed?

The same question goes for obtaining predictions for large amounts of data. I was not able to use the same method (i.e. trainer.predict) to obtain predictions as the PLForecastingModule's predict_step does not seem compatible with the trainer.predict loop logic.

if __name__ == "__main__":
    import darts.datasets
    import darts.models
    import darts.utils.missing_values
    import darts.metrics
    import torch.utils.data
    import time
    import math

    dataset = darts.datasets.TemperatureDataset().load()
    train, val = dataset.split_after(0.5)
    train = darts.utils.missing_values.fill_missing_values(train)
    val = darts.utils.missing_values.fill_missing_values(val)

    nn_model = darts.models.BlockRNNModel(30, 5, n_epochs=1)
    nn_model.fit(train)

    backtesting_val_start = time.time()
    backtest_metrics = nn_model.backtest(series=val,
                                         retrain=False,
                                         metric=darts.metrics.metrics.mse,
                                         forecast_horizon=nn_model.output_chunk_length,
                                         verbose=True)
    print(f"Backtesting validate execution time: {time.time() - backtesting_val_start:.1f}s")

    # Build the same type of dataset used during training, but from the validation
    # series, and wrap it in a standard DataLoader
    val_dataset = nn_model._build_train_dataset(
        target=val,
        past_covariates=None,
        future_covariates=None,
        max_samples_per_ts=None,
    )
    val_dataloader = torch.utils.data.DataLoader(
        val_dataset,
        batch_size=nn_model.batch_size,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
        drop_last=False,
        collate_fn=nn_model._batch_collate_fn,
    )
    trainer_val_start = time.time()
    val_metrics = nn_model.trainer.validate(model=nn_model.model, dataloaders=val_dataloader)
    print(f"PL Trainer validate execution time: {time.time() - trainer_val_start:.1f}s")
    assert math.isclose(backtest_metrics, val_metrics[0]["val_loss"])
@solalatus
Contributor

I tried different approaches but came to the same conclusion: as of now, backtesting is too slow. My guess is that a new "session" is created for every timestep, which causes the overhead, but that is just a guess on my side.

@dennisbader
Collaborator

We are aware that there is potential for improvement in historical_forecasts, especially for our RegressionModels and TorchForecastingModels.

This is high on our to-do list.

  • regression models: build the tabularization once at the beginning (if possible, meaning non-autoregressive predictions and retrain=False)
  • TorchForecastingModels: build the inference dataset once (if possible, same conditions as above)

These two points should drastically reduce processing time.
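
For reference, the calling code itself should not need to change; a minimal sketch of the call that would benefit, reusing the fitted nn_model and val series from the example above and aggregating the metric manually (the exact optimization details may differ):

from darts.metrics import mse

# A single historical_forecasts call over the whole validation series.
# With retrain=False, the planned optimization would build the inference
# data once instead of once per forecast point.
forecasts = nn_model.historical_forecasts(
    series=val,
    retrain=False,
    forecast_horizon=nn_model.output_chunk_length,
    stride=1,
    last_points_only=False,  # keep every full horizon as its own TimeSeries
    verbose=True,
)

# forecasts is a list of TimeSeries; average the MSE over all of them,
# which is essentially what backtest() does internally.
scores = [mse(val, f) for f in forecasts]
print(sum(scores) / len(scores))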

@chengyineng38

chengyineng38 commented Mar 8, 2023

I'm also running into this issue, but my confusion is around historical_forecasts, backtest, and predict.

My setup is to train NBeatsModel on 2017-2021 multivariate data with past covariates, but generate metrics on 2022 data (also with covariates). The data has weekly intervals.

It looks like historical_forecasts and backtest are very similar. The latter additionally computes metrics, but otherwise they are the same -- is that the correct understanding? If I were to use these two methods, should I have passed the entire 2017-2022 data for training? I'm also confused about the start argument here. If I set start to 2022-01-01 and retrain=False on an already trained model, the model definitely won't be re-trained, right? With forecast_horizon=7 and stride=1, what is the date range of the input sequence used each time to generate predictions?
I also saw start=some float, e.g. 0.6. What does 60% of a series mean? If the series starts in January, is 60% July?

predict looks like it's generating future predictions. Is this the correct method for me to use?

@madtoinou madtoinou added the question Further information is requested label Oct 30, 2023
@madtoinou
Collaborator

Hi @eivindeb,

historical_forecasts() with retrain=False was optimized for RegressionModel (release 0.25.0) and torch-based models (upcoming release 0.27.0); this should solve part of the problem.

I am not sure I understand the compatibility issue you are describing between Trainer.predict() and PLForecastingModule.predict_step(); could you please elaborate?

@chengyineng38 your understanding is correct: backtest() relies heavily on historical_forecasts(). The series passed to these methods ultimately depends on what you want to do:

  • assess the performance of a model trained on a historic (old) dataset on more recent data: concatenate the old dataset with the new one (if there is no gap between the time indexes) or use just the recent dataset (some dates will be missing from the forecasts, as some values are needed to generate the first forecast), pass it as series, and use start='first date in the recent dataset' and retrain=False.
  • assess the performance of a model on a new dataset: pass the new dataset as series, leave start empty, set retrain=False (see the sketch below).
  • assess how a model would have performed on a given dataset with regular retraining: use a single dataset, retrain=True and start='date where the model is expected to be functional'.

retrain indicates whether the model should be retrained before predicting each forecast_horizon; if set to True, the first training round uses the whole series up to start. If retrain=False, the model won't be retrained at all.
start indicates when the first forecast horizon should begin, which is particularly useful if retrain=True and you want the model to be trained on a large portion of the series. start=0.6 means that 60% of the dataset will be used to train the model, i.e. the timestep at series[int(len(series)*0.6)] will be used as the start value.
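
For instance, for the second scenario above (new dataset, no retraining), a minimal sketch; model, new_series and the argument values are placeholders:

from darts.metrics import mse

# Evaluate an already-fitted model on a new dataset without any retraining.
score = model.backtest(
    series=new_series,
    retrain=False,        # never refit, only predict
    forecast_horizon=7,
    stride=1,
    metric=mse,
)

# start can be a fraction of the series or an explicit timestamp, e.g.
# start=pd.Timestamp("2022-01-01"). A float is resolved by index position,
# not by calendar date, so 0.6 means "60% of the way through the series".
forecasts = model.historical_forecasts(
    series=new_series,
    start=0.6,
    retrain=False,
    forecast_horizon=7,
)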

predict() is used to generate a single forecast over the requested horizon, based on the input series and covariates (without any additional logic).
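
And for a genuine out-of-sample forecast, a minimal predict() sketch (same placeholder model and series; pass covariates only if the model was trained with them):

# Forecast the 7 steps following the end of new_series.
future = model.predict(
    n=7,
    series=new_series,
    past_covariates=new_past_covariates,  # placeholder; omit if not used during training
)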

@dennisbader
Collaborator

Closing as historical forecasting/backtesting was optimized for torch models in darts version 0.27.0
