
[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761

Open · wants to merge 17 commits into master · 3 participants

Conversation
@WenjieZ commented May 1, 2019

Time series exhibit temporal dependence, which can cause information leakage during cross-validation.
One way to mitigate this risk is to introduce gaps between the training set and the test set.
This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
As for the walk-forward scheme, @kykosic is implementing a similar feature (among others) for the TimeSeriesSplit class in #13204. I find his implementation promising, so I refrain from reinventing the wheel there.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API. GapCrossValidator takes the place of BaseCrossValidator and becomes the base class from which GapLeavePOut and GapKFold derive. Although not tested, I believe all the other subclasses could also derive from the new GapCrossValidator. I put quotation marks around "refactored" because I didn't actually touch the original code: my code currently coexists with it.

Classes and functions added:

  • GapCrossValidator
    • GapLeavePOut
    • GapKFold
  • gap_train_test_split
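
To make the gap idea concrete, here is a minimal NumPy sketch of a gapped K-fold split. This is an illustration only, not the code in this PR, and the gap_before/gap_after parameter names are my own:

import numpy as np

def gap_kfold_indices(n_samples, n_splits=5, gap_before=0, gap_after=0):
    # Sketch only, not the PR's implementation.  Split range(n_samples)
    # into n_splits contiguous test folds; for each fold, drop gap_before
    # samples before it and gap_after samples after it from the training
    # set, so temporally adjacent samples never sit on both sides.
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    fold_sizes[:n_samples % n_splits] += 1
    stop = 0
    for fold_size in fold_sizes:
        start, stop = stop, stop + fold_size
        test = np.arange(start, stop)
        keep = np.ones(n_samples, dtype=bool)
        keep[max(0, start - gap_before):min(n_samples, stop + gap_after)] = False
        yield np.flatnonzero(keep), test

for train, test in gap_kfold_indices(10, n_splits=5, gap_before=3, gap_after=4):
    print("TRAIN:", train, "TEST:", test)

With these settings the sketch reproduces the fold pattern quoted from the docstring in the discussion below, e.g. TRAIN: [6 7 8 9] TEST: [0 1] for the first fold.

A similarly hedged sketch of a gapped train-test split (again an illustration; test_size and gap_size are assumed parameter names): hold out the tail of the series for testing and drop the gap_size samples just before it from the training set.

def gap_train_test_split_indices(n_samples, test_size=0.25, gap_size=0):
    # Sketch only.  The last ceil(test_size * n_samples) samples form the
    # test set; the gap_size samples immediately before them belong to
    # neither set.
    n_test = int(np.ceil(test_size * n_samples))
    test_start = n_samples - n_test
    train = np.arange(max(0, test_start - gap_size))
    test = np.arange(test_start, n_samples)
    return train, test

train, test = gap_train_test_split_indices(10, test_size=0.2, gap_size=2)
print("TRAIN:", train, "TEST:", test)  # TRAIN: [0 1 2 3 4 5] TEST: [8 9]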

Related issues and PRs

#6322, #13204

Related users

@kykosic, @amueller, @jnothman, @cbrummitt

@WenjieZ changed the title from "Feature: Cross-validation for time series (inserting gaps between the training and the testing)" to "[WIP] Cross-validation for time series (inserting gaps between the training and the testing)" on May 1, 2019

@cbrummitt (Contributor) commented May 2, 2019

In your docstring examples, why do the training sets sometimes contain larger indices than the test sets? That would mean training a model on the future and predicting data from the past.

>>> for train_index, test_index in kf.split(np.arange(10)):
...     print("TRAIN:", train_index, "TEST:", test_index)
TRAIN: [6 7 8 9] TEST: [0 1]
TRAIN: [8 9] TEST: [2 3]
TRAIN: [0] TEST: [4 5]
TRAIN: [0 1 2] TEST: [6 7]
TRAIN: [0 1 2 3 4] TEST: [8 9]

Notice how for TimeSeriesSplit all the training indices precede the test indices (setup taken from the TimeSeriesSplit docstring):

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=5)
>>> for train_index, test_index in tscv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
@WenjieZ (Author) commented May 2, 2019

Yes, it is entirely possible to train a model on future data and validate it on past data.
This is theoretically valid for, say, stationary time series.

@WenjieZ (Author) commented May 7, 2019

Help needed: the tests pass locally in my build but fail in some of the CI builds. What could be the cause?

@jnothman (Member) commented May 8, 2019 (comment minimized)
@WenjieZ (Author) commented May 10, 2019

I finally found the cause: the builds interpret

a[[False, True, True, False, True]]

differently, where a is a numpy ndarray.

Linux pylatest_conda interprets the list as a boolean mask, i.e. as

a[[1, 2, 4]]

Linux py35_conda_openblas and Linux py35_np_atlas coerce the booleans to integers, i.e. to

a[[0, 1, 1, 0, 1]]

According to the numpy manual, the first interpretation is the correct one, even for numpy v1.11.
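
For illustration, a minimal sketch of the ambiguity, and of how an explicit boolean array sidesteps it (the behavior of the plain-list form depends on the NumPy version):

import numpy as np

a = np.array([10, 20, 30, 40, 50])
mask = [False, True, True, False, True]

# Recent NumPy treats a list of Python bools as a boolean mask,
# equivalent to a[[1, 2, 4]]:
print(a[mask])                          # [20 30 50]

# Old releases coerced the bools to the integers 0 and 1 instead,
# equivalent to:
print(a[[0, 1, 1, 0, 1]])               # [10 20 20 10 20]

# Passing an explicit boolean ndarray avoids the ambiguity entirely:
print(a[np.asarray(mask, dtype=bool)])  # [20 30 50]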

@WenjieZ changed the title from "[WIP] Cross-validation for time series (inserting gaps between the training and the testing)" to "[MRG] Cross-validation for time series (inserting gaps between the training set and the test set)" on May 10, 2019
