
ENH added RollingWindowCV to sklearn.model_selection #24589

Closed

Conversation


@MSchmidt99 MSchmidt99 commented Oct 5, 2022

Reference Issues/PRs

Addresses #22523, similar to #23780.
Addresses #24243.
Addresses homogeneous variant of #6322.

What does this implement/fix? Explain your changes.

Implementation of rolling window cross validation with support for longitudinal time series data.

A variant of TimeSeriesSplit that yields equally sized rolling windows, allowing
for more consistent parameter tuning.
If a time column is passed, the windows are sized according to the given time
steps without blending (useful for longitudinal data).
Parameters
----------
n_splits : int, default=5
    Number of splits.
time_column : Iterable, default=None
    Column of the dataset containing dates. Will function identically with `None`
    when observations are not longitudinal. If observations are longitudinal then
    will facilitate splitting train and validation without date bleeding.
train_prop : float, default=0.8
    Proportion of each window which should be allocated to training. If
    `buffer_prop` is given then the true training proportion will be
    `train_prop - buffer_prop`.
    The validation proportion will always be `1 - train_prop`.
buffer_prop : float, default=0.0
    The proportion of each window which should be allocated to nothing. Cuts into
    `train_prop`.
slide : float, default=0.0
    `slide + 1` is the number of validation lengths to step by when generating
    windows. A value between -1.0 and 0.0 will create nearly stationary windows,
    and should be avoided unless specifically needed.
bias : {'left', 'right'}, default='right'
    A 'left' `bias` will yield indices beginning at 0 but not necessarily ending
    at N. A 'right' `bias` will yield indices not necessarily beginning at 0 but
    always ending at N.
max_long_samples : int, default=None
    If the data is longitudinal and this variable is given, the number of
    observations at each time step will be limited to the first `max_long_samples`
    samples.
Examples
--------
>>> import numpy as np
>>> from sklearn.model_selection import RollingWindowCV
>>> X = np.random.randn(20, 2)
>>> y = np.random.randint(0, 2, 20)
>>> rwcv = RollingWindowCV(n_splits=5)
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2 3 4 5 6 7 8 9] TEST: [10 11]
TRAIN: [ 3  4  5  6  7  8  9 10 11] TEST: [12 13]
TRAIN: [ 5  6  7  8  9 10 11 12 13] TEST: [14 15]
TRAIN: [ 7  8  9 10 11 12 13 14 15] TEST: [16 17]
TRAIN: [ 9 10 11 12 13 14 15 16 17] TEST: [18 19]
>>> # Use a time column with longitudinal data and reduce train proportion
>>> time_col = np.tile(np.arange(16), 2)
>>> X = np.arange(64).reshape(32, 2)
>>> y = np.arange(32)
>>> rwcv = RollingWindowCV(
...     time_column=time_col, train_prop=0.5, n_splits=5, bias='right'
... )
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [ 1 17  2 18  3 19  4 20  5 21] TEST: [ 6 22  7 23]
TRAIN: [ 3 19  4 20  5 21  6 22  7 23] TEST: [ 8 24  9 25]
TRAIN: [ 5 21  6 22  7 23  8 24  9 25] TEST: [10 26 11 27]
TRAIN: [ 7 23  8 24  9 25 10 26 11 27] TEST: [12 28 13 29]
TRAIN: [ 9 25 10 26 11 27 12 28 13 29] TEST: [14 30 15 31]
>>> # Bias the indices to the start of the time column
>>> rwcv = RollingWindowCV(
...     time_column=time_col, train_prop=0.5, n_splits=5, bias='left'
... )
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [ 0 16  1 17  2 18  3 19  4 20] TEST: [ 5 21  6 22]
TRAIN: [ 2 18  3 19  4 20  5 21  6 22] TEST: [ 7 23  8 24]
TRAIN: [ 4 20  5 21  6 22  7 23  8 24] TEST: [ 9 25 10 26]
TRAIN: [ 6 22  7 23  8 24  9 25 10 26] TEST: [11 27 12 28]
TRAIN: [ 8 24  9 25 10 26 11 27 12 28] TEST: [13 29 14 30]
>>> # Introduce a buffer zone between train and validation, and slide window
>>> # by an additional validation size between windows.
>>> X = np.arange(25)
>>> y = np.arange(25)[::-1]
>>> rwcv = RollingWindowCV(train_prop=0.6, n_splits=2, buffer_prop=0.2, slide=1.0)
>>> for train_index, test_index in rwcv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...
TRAIN: [2 3 4 5 6 7] TEST: [10 11 12 13 14]
TRAIN: [12 13 14 15 16 17] TEST: [20 21 22 23 24]
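For comparison, the fixed-size-window behaviour of the first example above can already be approximated with the existing `TimeSeriesSplit` by capping `max_train_size` and fixing `test_size` (a sketch of the workaround, not of the proposed class; it does not cover the longitudinal `time_column`, `buffer_prop`, or `slide` features):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.randn(20, 2)

# Cap the training window at 9 samples and validate on 2 samples per
# split, which mirrors the rolling windows of the first example above.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=9, test_size=2)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

On 20 samples this yields the same splits as the first `RollingWindowCV` example (TRAIN `[1 ... 9]` / TEST `[10 11]` through TRAIN `[9 ... 17]` / TEST `[18 19]`), since the training side is always truncated to the last `max_train_size` indices.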

Any other comments?

Wrote this about 2 years ago and finally got around to making a PR.

@MSchmidt99 (Author)

Visual example: [image attached to the PR]

@MSchmidt99 MSchmidt99 changed the title ENH: added RollingWindowCV to sklearn.model_selection ENH added RollingWindowCV to sklearn.model_selection Oct 5, 2022
@MSchmidt99 MSchmidt99 marked this pull request as draft October 7, 2022 16:53
@MSchmidt99 MSchmidt99 marked this pull request as ready for review October 7, 2022 18:37
@MichaelKarpe

Given @glemaitre's comment here, I am not sure more time series cross-validators (TSCVs) will be added to scikit-learn in the short term; however, this remains a simpler TSCV than CPCV, and it would be great to see it in scikit-learn already.

Maybe RollingWindowCV should be renamed RollingTimeSeriesSplit (and thus rwcv renamed rtscv in your PR) so that the name is more in line with scikit-learn terminology?


MSchmidt99 commented Oct 7, 2022

If merging is decided against in favor of having fewer TSCVs, as @glemaitre has stated at the link, I would be a bit bummed; however, I have also discussed the steps required for running the class as a standalone module here (below the image), for those who prefer the appeal of simply passing the number of splits and the training ratio of each window.

If we are in favor of merging after a few naming or other changes, though, I would be happy to cooperate.

@MSchmidt99 MSchmidt99 changed the title ENH added RollingWindowCV to sklearn.model_selection [MRG] ENH added RollingWindowCV to sklearn.model_selection Oct 11, 2022
@glemaitre glemaitre added the Needs Decision Requires decision label Oct 17, 2022
@glemaitre glemaitre changed the title [MRG] ENH added RollingWindowCV to sklearn.model_selection ENH added RollingWindowCV to sklearn.model_selection Oct 17, 2022
@glemaitre (Member)

As stated in the original issue, I would prefer to have n_splits="walk_forward" (#22523 (comment)), which would set n_splits to get exactly the behaviour that you have with the proposed class.

I remember having started to review #23780. I did not get the chance to go back to it, but from my guess we only needed a couple of unit tests to ensure that we get the expected behaviour.

I will try to give a review and finish up this PR.

@glemaitre (Member)

Closing in favor of #23780.
