As far as I understand, reductions are an important part of sktime, but this is my first time seeing this, so I'm not particularly sure why this approach works. I watched some sktime talks (like https://youtu.be/Wf2naBHRo8Q?t=1880) and read the sktime papers [1, 2]. I'm also familiar with basic forecasting techniques like ARIMA-GARCH and exponential smoothing. What these papers say is that "a common example of reduction is to solve classical forecasting through time series regression via a sliding window approach and iteration over the forecasting horizon" [2] and "we first split the training series into fixed-length windows and stack them on top of each other. This gives us a matrix of lagged values in a tabular format, and thus allows us to apply any tabular regression algorithm" [1].

When discussing reduction and sliding windows, both papers cite [3], which in turn cites papers from the 90s and earlier. Are there any more recent treatments of this "reduction + sliding windows" approach? I found another 2022 talk whose conclusion is that "forecasting can be treated as a tabular ML task [reduction?] and can compete with statistical models" (https://youtu.be/9QtL7m3YS9I?t=2044), but I don't think this talk really explained why such an approach works. In particular, since the windows are constructed from dependent data, overlapping windows share observations, so the resulting (window, target) pairs are clearly not independent, which seems to conflict with the usual i.i.d. assumptions of supervised learning.

Could someone please recommend some papers or maybe textbooks that explain what reduction is, why it works, and how to use it in more detail? Sorry if such requests aren't allowed.
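To make sure I understand the sliding window tabulation quoted from [1] correctly, here is a small numpy sketch of how I picture it (the series, window length, and variable names are just my own illustration, not anything from the papers):

```python
import numpy as np

# toy series standing in for the training series
y = np.arange(10, dtype=float)  # [0., 1., ..., 9.]

window_length = 3  # hypothetical choice

# "split the training series into fixed-length windows and stack them":
# each row of X holds the window_length values preceding the target
X = np.column_stack(
    [y[i : len(y) - window_length + i] for i in range(window_length)]
)
y_target = y[window_length:]

# X[0] == [0., 1., 2.] with target y_target[0] == 3., and so on;
# any tabular regressor can now be fit on (X, y_target)
```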
I think the reference for "reduction" in the context of forecasting and ML, specifically the sliding window tabulation strategies, is indeed the Bontempi paper. In sktime, we use it slightly more generally, in that it is a composition where one estimator of a certain type is used to solve a problem of a different type, e.g., a supervised regressor is used for forecasting as part of an algorithm that can call the regressor interface. For this more general concept, you can have a look at our 2019 and 2020 papers, or also "Designing Machine Learning Toolboxes: Concepts, Principles and Patterns" (2021), https://arxiv.org/abs/2101.04938.
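For concreteness, this is roughly what that composition looks like through the sktime interface. A minimal sketch, assuming a reasonably recent sktime version; the choice of regressor, window length, and strategy are purely illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sktime.datasets import load_airline
from sktime.forecasting.compose import make_reduction

y = load_airline()

# wrap a tabular regressor as a forecaster via sliding-window reduction;
# window_length and strategy are illustrative choices, not recommendations
regressor = GradientBoostingRegressor()
forecaster = make_reduction(regressor, window_length=12, strategy="recursive")

forecaster.fit(y)
y_pred = forecaster.predict(fh=[1, 2, 3])  # forecast the next three periods
```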
Perhaps the most prominent recent discussion is around the M4 competition, and the M5 competition, where the ML approaches reappear in the accuracy/probabilistic context. A good starting point is "The M4 Competition: Results, findings, conclusion and way forward".
Regarding the independence question: a common assumption for ML methods on tabular data, i.e., in the supervised learning space, is indeed i.i.d. data/label pairs. However, this is neither sufficient for a method to work - imagine a bad algorithm - nor necessary (see below). It's just a common assumption in the analysis of algorithms.

Why it works for time series can be gleaned from an example where the data are stationary. Assume a situation where we have a stationary time series and take only non-overlapping windows that are far apart: by stationarity, the resulting (window, target) pairs all follow the same distribution, and if the autocorrelation decays, pairs far apart are approximately independent, so the usual supervised learning picture roughly applies. If you now add in more overlapping blocks, the sliding window reduction method just gets "more information", and it's similar enough that it usually won't make things worse. More quantitative statements can be derived, along the lines of using an "effective sample size", which, due to auto-correlation, is smaller than the nominal number of windows.

Now, I cannot point to a paper that carries out the full formal argument to show that this is indeed the case for abstract ML models, but it also shouldn't be too hard. (Note: a key assumption in this discussion was stationarity, or, more weakly, certain properties of the autocorrelation.)
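To make the effective-sample-size point a bit more tangible, here is a rough back-of-envelope sketch (my own illustration, not taken from any of the papers above), using the standard approximation n_eff ≈ n / (1 + 2 Σ_k ρ(k)) on a simulated AR(1) series:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate a stationary AR(1) series: y_t = phi * y_{t-1} + eps_t
phi, n = 0.8, 1000
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

# sample autocorrelations up to a truncation lag
def acf(x, lag):
    x = x - x.mean()
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / np.dot(x, x) for k in range(1, lag + 1)]
    )

rho = acf(y, lag=50)

# effective sample size: n / (1 + 2 * sum of autocorrelations);
# for AR(1) with phi = 0.8 this is roughly n * (1 - phi) / (1 + phi), i.e. about n / 9
n_eff = n / (1 + 2 * rho.sum())
print(f"nominal n = {n}, effective n ~ {n_eff:.0f}")
```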