
[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761

Open · wants to merge 17 commits into main
Conversation

@WenjieZ

@WenjieZ WenjieZ commented May 1, 2019

Time series exhibit temporal dependence, which can cause information leakage during cross-validation.
One way to mitigate this risk is to introduce gaps between the training set and the test set.
This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
As for the walk-forward scheme, @kykosic is implementing a similar feature, among others, for the TimeSeriesSplit class in #13204. I find his implementation promising, so I refrain from reinventing the wheel.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API. GapCrossValidator replaces BaseCrossValidator and becomes the base class from which GapLeavePOut and GapKFold derive. Although not tested, I believe all the other subclasses can also derive from the new GapCrossValidator. I put quotation marks around the word refactor because I didn't actually touch the original code: my code currently coexists with it.

Classes and functions added:

  • GapCrossValidator
    • GapLeavePOut
    • GapKFold
  • gap_train_test_split
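The gap idea behind these classes can be sketched in pure NumPy. The helper below is illustrative only (its name and signature mirror the PR's GapKFold parameters gap_before and gap_after but are not the PR's actual code): for each test fold, it drops the neighboring samples on both sides before forming the training set.

```python
import numpy as np

def gap_kfold_indices(n_samples, n_splits, gap_before, gap_after):
    """Yield (train, test) index arrays for a K-fold split that drops
    `gap_before` samples before each test fold and `gap_after` samples
    after it (hypothetical sketch of the GapKFold idea)."""
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    fold_sizes[: n_samples % n_splits] += 1  # spread the remainder
    stop = 0
    for size in fold_sizes:
        start, stop = stop, stop + size
        test = np.arange(start, stop)
        keep = np.ones(n_samples, dtype=bool)
        # Exclude the test fold plus the gaps on both sides.
        keep[max(0, start - gap_before): min(n_samples, stop + gap_after)] = False
        yield np.flatnonzero(keep), test

for train, test in gap_kfold_indices(10, 5, gap_before=2, gap_after=2):
    print("TRAIN:", train, "TEST:", test)
# TRAIN: [4 5 6 7 8 9] TEST: [0 1]
# TRAIN: [6 7 8 9] TEST: [2 3]
# TRAIN: [0 1 8 9] TEST: [4 5]
# TRAIN: [0 1 2 3] TEST: [6 7]
# TRAIN: [0 1 2 3 4 5] TEST: [8 9]
```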

Related issues and PRs

#6322, #13204

Related users

@kykosic, @amueller, @jnothman, @cbrummitt

@WenjieZ WenjieZ changed the title Feature: Cross-validation for time series (inserting gaps between the training and the testing) [WIP] Cross-validation for time series (inserting gaps between the training and the testing) May 1, 2019
@cbrummitt
Contributor

@cbrummitt cbrummitt commented May 2, 2019

In your examples in the docstrings, why do the training sets sometimes contain larger indices than the test sets? That would mean training a model on the future and predicting data from the past.

>>> for train_index, test_index in kf.split(np.arange(10)):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [6 7 8 9] TEST: [0 1]
TRAIN: [8 9] TEST: [2 3]
TRAIN: [0] TEST: [4 5]
TRAIN: [0 1 2] TEST: [6 7]
TRAIN: [0 1 2 3 4] TEST: [8 9]

Notice how for TimeSeriesSplit all the training indices precede the test indices:

>>> for train_index, test_index in tscv.split(X):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
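The expanding-window behavior above can be reproduced in a few lines of NumPy, extended with the gap this PR discusses (the helper below is illustrative, not code from this PR or from #13204): each training window ends `gap` samples before its test fold starts.

```python
import numpy as np

def walk_forward_with_gap(n_samples, n_splits, gap=0):
    """Yield expanding-window (train, test) index arrays where the
    training set stops `gap` samples before the test fold
    (hypothetical sketch of a walk-forward split with a gap)."""
    test_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        test_start = n_samples - (n_splits + 1 - i) * test_size
        test = np.arange(test_start, test_start + test_size)
        train = np.arange(0, max(0, test_start - gap))
        yield train, test

# With gap=0 this reproduces the TimeSeriesSplit output above.
for train, test in walk_forward_with_gap(6, 5, gap=0):
    print("TRAIN:", train, "TEST:", test)
```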

@WenjieZ
Author

@WenjieZ WenjieZ commented May 2, 2019

Yes, it is entirely possible to train models on future data and validate them on past data.
This is theoretically valid for, say, stationary time series.

@WenjieZ
Author

@WenjieZ WenjieZ commented May 7, 2019

Help needed: the tests pass in my local build but fail in some of the CI builds. What could be the cause?

@jnothman
Member

@jnothman jnothman commented May 8, 2019

@WenjieZ
Author

@WenjieZ WenjieZ commented May 10, 2019

I finally found the cause: the different interpretations of

a[[False, True, True, False, True]]

where a is a NumPy ndarray.

Linux pylatest_conda interprets it as

a[[1, 2, 4]]

Linux py35_conda_openblas and Linux py35_np_atlas interpret it as

a[[0, 1, 1, 0, 1]]

According to the NumPy manual, the first one (boolean mask indexing) is the correct interpretation, even for NumPy v1.11.
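The two interpretations can be made explicit by casting the mask to the corresponding dtype; roughly, older builds cast a plain list of booleans to integers before indexing, while modern NumPy treats it as a boolean mask. Building the mask as an explicit boolean array avoids the ambiguity entirely:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
mask = np.array([False, True, True, False, True])

# Boolean mask indexing: select elements where the mask is True,
# i.e. the elements at indices 1, 2, and 4.
print(a[mask])              # [20 30 50]

# The other interpretation casts the booleans to integers first,
# i.e. fancy indexing with a[[0, 1, 1, 0, 1]].
print(a[mask.astype(int)])  # [10 20 20 10 20]
```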

@WenjieZ WenjieZ changed the title [WIP] Cross-validation for time series (inserting gaps between the training and the testing) [MRG] Cross-validation for time series (inserting gaps between the training set and the test set) May 10, 2019
@jnothman
Member

@jnothman jnothman commented May 29, 2019

  • What benefit does gap_after provide?
  • Why can't you implement GapKFold as a small change to TimeSeriesSplit?

@WenjieZ
Author

@WenjieZ WenjieZ commented May 29, 2019

gap_before inserts a gap before the test set, and gap_after inserts a gap after it. The subset after this second gap is part of the training set.

ooooooooooooooo|||||||||||||xxxxxxxxxxxxx|||||||||||||||||||||||oooooooooooooooooooooo
----training set---------gap-------test set------------gap-----------------training set

See here for more explanation.
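The diagram above amounts to a small index computation (a hypothetical helper, not the PR's actual code): given the test block and the two gap widths, the training set is everything to the left of the first gap plus everything to the right of the second.

```python
import numpy as np

def train_segments(n_samples, test_start, test_stop, gap_before, gap_after):
    """Return the two training segments surrounding a test block
    [test_start, test_stop), with gaps on both sides (illustrative)."""
    # Samples before the first gap.
    left = np.arange(0, max(0, test_start - gap_before))
    # Samples after the second gap.
    right = np.arange(min(n_samples, test_stop + gap_after), n_samples)
    return left, right

left, right = train_segments(20, 8, 12, gap_before=3, gap_after=3)
print(left, right)  # [0 1 2 3 4] [15 16 17 18 19]
```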

@amueller
Member

@amueller amueller commented May 29, 2019

btw people have told me we should just implement https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series

though I don't think it has gaps?

@jnothman
Member

@jnothman jnothman commented May 29, 2019

@WenjieZ
Author

@WenjieZ WenjieZ commented May 30, 2019

though I don't think it has gaps?

No, it doesn't. Another R package, blockCV, provides this functionality. It targets spatial data, but the idea also applies to time series (a time series can be treated as a 1-D spatial series).
