
ENH added RollingWindowCV to sklearn.model_selection #24589

Closed

Conversation


@MSchmidt99 MSchmidt99 commented Oct 5, 2022

Reference Issues/PRs

Addresses #22523, similar to #23780.
Addresses #24243.
Addresses homogeneous variant of #6322.

What does this implement/fix? Explain your changes.

Implementation of rolling window cross validation with support for longitudinal time series data.

A variant of TimeSeriesSplit that yields equally sized rolling windows, allowing
for more consistent parameter tuning.
If a time column is passed, the windows are sized according to the given time
steps without blending (useful for longitudinal data).
Parameters
----------
n_splits : int, default=5
    Number of splits.
time_column : Iterable, default=None
    Column of the dataset containing dates. Will function identically with `None`
    when observations are not longitudinal. If observations are longitudinal then
    will facilitate splitting train and validation without date bleeding.
train_prop : float, default=0.8
    Proportion of each window which should be allocated to training. If
    `buffer_prop` is given then the true training proportion will be
    `train_prop - buffer_prop`.
    The validation proportion will always be `1 - train_prop`.
buffer_prop : float, default=0.0
    The proportion of each window which should be allocated to nothing. Cuts into
    `train_prop`.
slide : float, default=0.0
    `slide + 1` is the number of validation lengths to step by when generating
    windows. A value between -1.0 and 0.0 will create nearly stationary windows,
    and should be avoided unless specifically needed.
bias : {'left', 'right'}, default='right'
    A 'left' `bias` will yield indices beginning at 0 but not necessarily ending
    at N. A 'right' `bias` will yield indices not necessarily beginning at 0 but
    always ending at N.
max_long_samples : int, default=None
    If the data is longitudinal and this variable is given, the number of
    observations at each time step will be limited to the first `max_long_samples`
    samples.
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import RollingWindowCV
>>> X = np.random.randn(20, 2)
>>> y = np.random.randint(0, 2, 20)
>>> rwcv = RollingWindowCV(n_splits=5)
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2 3 4 5 6 7 8 9] TEST: [10 11]
TRAIN: [ 3  4  5  6  7  8  9 10 11] TEST: [12 13]
TRAIN: [ 5  6  7  8  9 10 11 12 13] TEST: [14 15]
TRAIN: [ 7  8  9 10 11 12 13 14 15] TEST: [16 17]
TRAIN: [ 9 10 11 12 13 14 15 16 17] TEST: [18 19]
>>> # Use a time column with longitudinal data and reduce train proportion
>>> time_col = np.tile(np.arange(16), 2)
>>> X = np.arange(64).reshape(32, 2)
>>> y = np.arange(32)
>>> rwcv = RollingWindowCV(
...     time_column=time_col, train_prop=0.5, n_splits=5, bias='right'
... )
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [ 1 17  2 18  3 19  4 20  5 21] TEST: [ 6 22  7 23]
TRAIN: [ 3 19  4 20  5 21  6 22  7 23] TEST: [ 8 24  9 25]
TRAIN: [ 5 21  6 22  7 23  8 24  9 25] TEST: [10 26 11 27]
TRAIN: [ 7 23  8 24  9 25 10 26 11 27] TEST: [12 28 13 29]
TRAIN: [ 9 25 10 26 11 27 12 28 13 29] TEST: [14 30 15 31]
>>> # Bias the indices to the start of the time column
>>> rwcv = RollingWindowCV(
...     time_column=time_col, train_prop=0.5, n_splits=5, bias='left'
... )
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [ 0 16  1 17  2 18  3 19  4 20] TEST: [ 5 21  6 22]
TRAIN: [ 2 18  3 19  4 20  5 21  6 22] TEST: [ 7 23  8 24]
TRAIN: [ 4 20  5 21  6 22  7 23  8 24] TEST: [ 9 25 10 26]
TRAIN: [ 6 22  7 23  8 24  9 25 10 26] TEST: [11 27 12 28]
TRAIN: [ 8 24  9 25 10 26 11 27 12 28] TEST: [13 29 14 30]
>>> # Introduce a buffer zone between train and validation, and slide window
>>> # by an additional validation size between windows.
>>> X = np.arange(25)
>>> y = np.arange(25)[::-1]
>>> rwcv = RollingWindowCV(train_prop=0.6, n_splits=2, buffer_prop=0.2, slide=1.0)
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...
TRAIN: [2 3 4 5 6 7] TEST: [10 11 12 13 14]
TRAIN: [12 13 14 15 16 17] TEST: [20 21 22 23 24]
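For comparison, the fixed-size-window behaviour of the first example above can already be approximated with the existing `TimeSeriesSplit` by capping `max_train_size` and fixing `test_size` (a sketch of the workaround, not of the proposed class; it does not cover the longitudinal `time_column`, `buffer_prop`, or `slide` features):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.randn(20, 2)

# Cap the training window at 9 samples and validate on 2 samples per
# split, which mirrors the rolling windows of the first example above.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=9, test_size=2)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

On 20 samples this yields the same splits as the first `RollingWindowCV` example (TRAIN `[1 ... 9]` / TEST `[10 11]` through TRAIN `[9 ... 17]` / TEST `[18 19]`), since the training side is always truncated to the last `max_train_size` indices.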

Any other comments?

Wrote this about 2 years ago and finally got around to making a PR.

@MSchmidt99 (Author)

Visual example: [image attached to the PR]

@MSchmidt99 MSchmidt99 changed the title ENH: added RollingWindowCV to sklearn.model_selection ENH added RollingWindowCV to sklearn.model_selection Oct 5, 2022
@MSchmidt99 MSchmidt99 marked this pull request as draft October 7, 2022 16:53
@MSchmidt99 MSchmidt99 marked this pull request as ready for review October 7, 2022 18:37
@MichaelKarpe

Given @glemaitre's comment here, I am not sure more time series cross-validators (TSCVs) will be added to scikit-learn in the short term; however, this remains a simpler TSCV than CPCV, and it would be great to see it in scikit-learn already.

Maybe RollingWindowCV should be renamed RollingTimeSeriesSplit (and thus rwcv renamed rtscv in your PR) so that the name is more in line with scikit-learn terminology?


MSchmidt99 commented Oct 7, 2022

If merging is decided against in favor of having fewer TSCVs, as @glemaitre has stated at the link, I would be a bit bummed; however, I have also discussed the steps required for running the class as a standalone module here (below the image), for those who prefer the appeal of simply passing the number of splits and the training ratio of each window.

If we are in favor of merging after a few naming or other changes, though, I would be happy to cooperate.

@MSchmidt99 MSchmidt99 changed the title ENH added RollingWindowCV to sklearn.model_selection [MRG] ENH added RollingWindowCV to sklearn.model_selection Oct 11, 2022
@glemaitre glemaitre added the Needs Decision Requires decision label Oct 17, 2022
@glemaitre glemaitre changed the title [MRG] ENH added RollingWindowCV to sklearn.model_selection ENH added RollingWindowCV to sklearn.model_selection Oct 17, 2022
@glemaitre (Member)

As stated in the original issue, I would prefer to have n_splits="walk_forward" (#22523 (comment)), which would set n_splits to get exactly the behaviour that you have with the proposed class.

I remember having started to review #23780. I did not get the chance to go back to it, but from my guess we only needed a couple of unit tests to ensure that we get the expected behaviour.

I will try to give a review and finish up this PR.

@glemaitre (Member)

Closing in favor of #23780.
