Evaluating trained model on new dataset efficiently #1531

Closed
eivindeb opened this issue Jan 31, 2023 · 5 comments
Labels
question Further information is requested

Comments

@eivindeb

Hi,

I was wondering what the intended interface is for taking a trained model and efficiently using it on a new dataset to evaluate metrics and/or obtain predictions.

It seems that this is what the .backtest and .historical_forecasts methods are for; however, they run very slowly. This appears to be because they construct a dataset containing a single sample, predict on it, then construct a new dataset for the next sample, and so on.

For torch models I am able to emulate backtest using trainer.validate and the ._build_train_dataset function, as in the attached code example. On my laptop the backtesting method takes ~10x longer than trainer.validate (129.6s vs 15.2s), and the gap increases rapidly with larger amounts of data. The datasets that I want to use darts on contain millions of data samples, so the backtest and historical_forecast methods are too slow to be usable. Is there something I have missed?

The same question goes for obtaining predictions for large amounts of data. I was not able to use the same method (i.e. trainer.predict) to obtain predictions as the PLForecastingModule's predict_step does not seem compatible with the trainer.predict loop logic.

if __name__ == "__main__":
    import darts.datasets
    import darts.models
    import darts.utils.missing_values
    import darts.metrics
    import torch.utils.data
    import time
    import math

    dataset = darts.datasets.TemperatureDataset().load()
    train, val = dataset.split_after(0.5)
    train = darts.utils.missing_values.fill_missing_values(train)
    val = darts.utils.missing_values.fill_missing_values(val)

    nn_model = darts.models.BlockRNNModel(30, 5, n_epochs=1)
    nn_model.fit(train)

    backtesting_val_start = time.time()
    backtest_metrics = nn_model.backtest(series=val,
                                         retrain=False,
                                         metric=darts.metrics.metrics.mse,
                                         forecast_horizon=nn_model.output_chunk_length,
                                         verbose=True)
    print(f"Backtesting validate execution time: {time.time() - backtesting_val_start:.1f}s")

    # Build the same type of dataset used during training, but from the validation
    # series, and wrap it in a standard DataLoader
    val_dataset = nn_model._build_train_dataset(
        target=val,
        past_covariates=None,
        future_covariates=None,
        max_samples_per_ts=None,
    )
    val_dataloader = torch.utils.data.DataLoader(
        val_dataset,
        batch_size=nn_model.batch_size,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
        drop_last=False,
        collate_fn=nn_model._batch_collate_fn,
    )
    trainer_val_start = time.time()
    val_metrics = nn_model.trainer.validate(model=nn_model.model, dataloaders=val_dataloader)
    print(f"PL Trainer validate execution time: {time.time() - trainer_val_start:.1f}s")
    assert math.isclose(backtest_metrics, val_metrics[0]["val_loss"])
@solalatus
Contributor

I tried different approaches but came to the same conclusion: as of now, backtesting is too slow. My guess is that a new "session" is created for every timestep, which causes the overhead, but that is just a guess on my side.

@dennisbader
Collaborator

We are aware that there is potential for improvement in historical_forecasts, especially for our RegressionModels and TorchForecastingModels.

This is high on our to-do list.

  • regression models: build the tabularization once at the beginning (if possible, meaning non-autoregressive predictions and retrain=False)
  • TorchForecastingModels: build the inference dataset once (if possible, same conditions as above)

These two points should drastically reduce processing time.
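
For reference, the calling code itself should not need to change; a minimal sketch of the call that would benefit, reusing the fitted nn_model and val series from the example above and aggregating the metric manually (the exact optimization details may differ):

from darts.metrics import mse

# A single historical_forecasts call over the whole validation series.
# With retrain=False, the planned optimization would build the inference
# data once instead of once per forecast point.
forecasts = nn_model.historical_forecasts(
    series=val,
    retrain=False,
    forecast_horizon=nn_model.output_chunk_length,
    stride=1,
    last_points_only=False,  # keep every full horizon as its own TimeSeries
    verbose=True,
)

# forecasts is a list of TimeSeries; average the MSE over all of them,
# which is essentially what backtest() does internally.
scores = [mse(val, f) for f in forecasts]
print(sum(scores) / len(scores))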

@chengyineng38

chengyineng38 commented Mar 8, 2023

I'm also running into this issue, but my confusion is around historical_forecasts, backtest, and predict.

My setup is to train NBeatsModel on 2017-2021 multivariate data with past covariates, but generate metrics on 2022 data (also with covariates). The data has weekly intervals.

It looks like historical_forecasts and backtest are very similar. The latter additionally computes metrics, but otherwise they are the same -- is that the correct understanding? If I were to use these two methods, should I have passed the entire 2017-2022 data for training? I'm also confused about the start argument here. If I set start to 2022-01-01 and retrain=False on an already trained model, the model definitely won't be re-trained, right? With forecast_horizon=7 and stride=1, what is the date range of the input sequence used each time to generate predictions?
I also saw start=some float, e.g. 0.6. What does 60% of a series mean? If the series starts in January, is 60% July?

predict looks like it's generating future predictions. Is this the correct method for me to use?

@madtoinou madtoinou added the question Further information is requested label Oct 30, 2023
@madtoinou
Collaborator

Hi @eivindeb,

historical_forecasts() with retrain=False was optimized for RegressionModel (release 0.25.0) and torch-based models (upcoming release 0.27.0); this should solve part of the problem.

I am not sure I understand the compatibility issue you are describing between Trainer.predict() and PLForecastingModule.predict_step(); could you please elaborate?

@chengyineng38 your understanding is correct: backtest() relies heavily on historical_forecasts(). The series passed to these methods ultimately depends on what you want to do:

  • assess the performance of a model trained on a historic (old) dataset on more recent data: concatenate the old dataset with the new one (if there is no gap between the time indexes) or use just the recent dataset (some dates will be missing from the forecasts, as some values are needed to generate the first forecast), pass it as series, and use start='first date in the recent dataset' and retrain=False.
  • assess the performance of a model on a new dataset: pass the new dataset as series, leave start empty, set retrain=False (see the sketch below).
  • assess how a model would have performed on a given dataset with regular retraining: use a single dataset, retrain=True and start='date where the model is expected to be functional'.

retrain indicates whether the model should be retrained before predicting each forecast_horizon; if set to True, the first training round uses the whole series up to start. If retrain=False, the model won't be retrained at all.
start indicates when the first forecast horizon should begin, which is particularly useful if retrain=True and you want the model to be trained on a large portion of the series. start=0.6 means that 60% of the dataset will be used to train the model, i.e. the timestep at series[int(len(series)*0.6)] will be used as the start value.
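
For instance, for the second scenario above (new dataset, no retraining), a minimal sketch; model, new_series and the argument values are placeholders:

from darts.metrics import mse

# Evaluate an already-fitted model on a new dataset without any retraining.
score = model.backtest(
    series=new_series,
    retrain=False,        # never refit, only predict
    forecast_horizon=7,
    stride=1,
    metric=mse,
)

# start can be a fraction of the series or an explicit timestamp, e.g.
# start=pd.Timestamp("2022-01-01"). A float is resolved by index position,
# not by calendar date, so 0.6 means "60% of the way through the series".
forecasts = model.historical_forecasts(
    series=new_series,
    start=0.6,
    retrain=False,
    forecast_horizon=7,
)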

predict() is used to generate a single forecast over the requested horizon, based on the input series and covariates (without any additional logic).
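
And for a genuine out-of-sample forecast, a minimal predict() sketch (same placeholder model and series; pass covariates only if the model was trained with them):

# Forecast the 7 steps following the end of new_series.
future = model.predict(
    n=7,
    series=new_series,
    past_covariates=new_past_covariates,  # placeholder; omit if not used during training
)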

@dennisbader
Collaborator

Closing as historical forecasting/backtesting was optimized for torch models in darts version 0.27.0
