
ENH Add option n_splits='walk_forward' in TimeSeriesSplit #23780

Open · wants to merge 23 commits into base: main
Conversation

@ShehanAT (Contributor) commented Jun 28, 2022

Reference Issues/PRs

Fixes #22523

What does this implement/fix? Explain your changes.

This pull request adds the feature requested in issue #22523: "walk_forward" for the TimeSeriesSplit class.

Based on what I understand about this feature, here are some important details:

  • Walking forward means the first element of the first train set starts at 0, and the subsequent train and test indices advance through the series while adhering to the max_train_size and test_size values. For example:

    # Walk Forward Example where max_train_size = 10, test_size = 2, x.shape[0] = 15 
    # n_splits is automatically computed to 3 in this case 
    TRAIN:  [0 1 2 3 4 5 6 7 8] TEST:  [ 9 10]
    TRAIN:  [ 1  2  3  4  5  6  7  8  9 10] TEST:  [11 12]
    TRAIN:  [ 3  4  5  6  7  8  9 10 11 12] TEST:  [13 14]
    
  • To make walking forward possible, the n_splits value is set to "walk_forward":
    cv = TimeSeriesSplit(n_splits="walk_forward", x_shape=x.shape[0], max_train_size=10, test_size=2)
    However, since this parameter was originally designed to accept only integer values, I had to modify the constructor of the _BaseKFold() super class. The x.shape[0] value from x = np.arange(15) also needs to be passed into the constructor so that the n_splits value can be computed automatically.

  • Once n_splits is set to "walk_forward", the find_walk_forward_n_splits_value() method is called to calculate an appropriate integer value for n_splits so that walking forward is possible. If there are multiple candidate n_splits values, the first element of the array containing them is returned:

    if len(n_splits_arrays_first_element_zero) > 0:
        return n_splits_arrays_first_element_zero[0]
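
For reference, the windowing in the example above can be reproduced with a small pure-Python sketch of TimeSeriesSplit's index arithmetic (time_series_splits is a hypothetical helper, not part of this PR, and n_splits=3 is passed explicitly here rather than derived automatically):

```python
def time_series_splits(n_samples, n_splits, max_train_size=None, test_size=None, gap=0):
    """Yield (train_indices, test_indices) pairs, mimicking TimeSeriesSplit.split."""
    n_folds = n_splits + 1
    test_size = test_size if test_size is not None else n_samples // n_folds
    # test folds tile the tail of the series, one fold per split
    for test_start in range(n_samples - n_splits * test_size, n_samples, test_size):
        train_end = test_start - gap
        if max_train_size is not None and max_train_size < train_end:
            train = list(range(train_end - max_train_size, train_end))
        else:
            train = list(range(train_end))
        yield train, list(range(test_start, test_start + test_size))

for train, test in time_series_splits(15, 3, max_train_size=10, test_size=2):
    print("TRAIN:", train, "TEST:", test)
```

With these inputs the sketch yields the three folds shown above: trains starting at 0, 1, and 3, with tests [9, 10], [11, 12], and [13, 14].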
    

Any other comments?

I have manually regression-tested the original functionality and it seems to work fine.
There are two tests in the sklearn/model_selection/_split.py file that are failing (I'm looking into why now).

Let me know if this potential solution works, needs modifications or if you have any feedback.

n_splits = self.n_splits
n_folds = n_splits + 1
gap = self.gap
test_size = (
    self.test_size if self.test_size is not None else n_samples // n_folds
)

Member:

no need for this change

max_train_size=None,
test_size=None,
):
if isinstance(n_splits, int):
Member:

Thinking about it, I think that we can keep the base class as is and make the base class accept a str without raising a warning.

Then, we can just specialize the get_n_splits for the TimeSeriesSplit that should not return directly self.n_splits but instead make the computation of n_splits if a string is provided.

Therefore, we only specialize the TimeSeriesSplit class.

@ShehanAT (Contributor Author), Jun 28, 2022:

OK, I've replaced the previous method with this get_n_splits() method now.

@@ -1066,13 +1103,13 @@ def split(self, X, y=None, groups=None):
"""
X, y, groups = indexable(X, y, groups)
n_samples = _num_samples(X)

Member:

no need for this change

@ShehanAT (Contributor Author), Jun 28, 2022:

Removed

@@ -1066,13 +1103,13 @@ def split(self, X, y=None, groups=None):
"""
X, y, groups = indexable(X, y, groups)
n_samples = _num_samples(X)

n_splits = self.n_splits
Member:

I think that this is here that we can do:

n_splits = self.get_n_splits(...)

Contributor Author:

Yep, added the line: n_splits = self.get_n_splits(X, y, groups) now

def __init__(self, n_splits=5, *, max_train_size=None, test_size=None, gap=0):
super().__init__(n_splits, shuffle=False, random_state=None)
def __init__(
self, n_splits=5, x_shape=None, *, max_train_size=None, test_size=None, gap=0
Member:

We should not need to have x_shape at the initialization.

Having X at the split call should be enough to get this information.

Contributor Author:

I've removed x_shape now

Contributor:

@ShehanAT Consider updating the pull request description as well.

@@ -1101,6 +1138,105 @@ def split(self, X, y=None, groups=None):
indices[test_start : test_start + test_size],
)

def find_walk_forward_n_splits_value(self, x_value, max_train_size, test_size):
Member:

If we replace this with get_n_splits, we will have the following signature:

def get_n_splits(self, X=None, y=None, groups=None)

Since we have X and self, we will have all the necessary information to compute the number of splits required to make the rolling windows.

Contributor Author:

Good idea

TRAIN: [10 11 12] TEST: [13]
TRAIN: [11 12 13] TEST: [14]
"""
x = np.arange(x_value)
Member:

I think that we only need to have:

def get_n_splits(self, X=None, y=None, groups=None):
    if isinstance(self.n_splits, str):
        return X.shape[0] - (self.max_train_size + self.test_size) + 1
    return self.n_splits
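
As a quick check of what this count corresponds to: n_samples - (max_train_size + test_size) + 1 is the number of positions a full train+test window can occupy when it advances one sample at a time (stride 1). A small sketch with hypothetical helper names, not part of the PR:

```python
def n_stride1_windows(n_samples, max_train_size, test_size):
    # count of full train+test windows sliding forward one sample at a time
    window = max_train_size + test_size
    return n_samples - window + 1

def stride1_windows(n_samples, max_train_size, test_size):
    # enumerate those windows explicitly to confirm the count
    window = max_train_size + test_size
    for start in range(n_samples - window + 1):
        train = list(range(start, start + max_train_size))
        test = list(range(start + max_train_size, start + window))
        yield train, test

count = n_stride1_windows(15, 10, 2)
print(count)  # number of stride-1 windows for 15 samples
```

Whether a stride of 1 or a stride of test_size is the intended walk-forward behaviour is debated further down in this thread.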

Contributor Author:

Yep, this is a lot simpler and more readable than the previous method.

@glemaitre (Member) commented Jun 28, 2022

I just wanted to confirm my intuition. Here is a quick diff that I could come up with without making extensive tests:

diff --git a/sklearn/model_selection/_split.py b/sklearn/model_selection/_split.py
index d2a0b5e1fc..609fda7c63 100644
--- a/sklearn/model_selection/_split.py
+++ b/sklearn/model_selection/_split.py
@@ -275,22 +275,21 @@ class _BaseKFold(BaseCrossValidator, metaclass=ABCMeta):
 
     @abstractmethod
     def __init__(self, n_splits, *, shuffle, random_state):
-        if not isinstance(n_splits, numbers.Integral):
-            raise ValueError(
-                "The number of folds must be of Integral type. "
-                "%s of type %s was passed." % (n_splits, type(n_splits))
-            )
-        n_splits = int(n_splits)
-
-        if n_splits <= 1:
+        if isinstance(n_splits, numbers.Integral):
+            if n_splits <= 1:
+                raise ValueError(
+                    "k-fold cross-validation requires at least one"
+                    " train/test split by setting n_splits=2 or more,"
+                    f" got n_splits={n_splits}."
+                )
+        elif n_splits != "walk_forward":
             raise ValueError(
-                "k-fold cross-validation requires at least one"
-                " train/test split by setting n_splits=2 or more,"
-                " got n_splits={0}.".format(n_splits)
+                "n_splits should be an integer number or 'walk_forward' for "
+                "the TimeSeriesSplit cross-validator."
             )
 
         if not isinstance(shuffle, bool):
-            raise TypeError("shuffle must be True or False; got {0}".format(shuffle))
+            raise TypeError(f"shuffle must be True or False; got {shuffle}")
 
         if not shuffle and random_state is not None:  # None is the default
             raise ValueError(
@@ -1037,6 +1036,13 @@ class TimeSeriesSplit(_BaseKFold):
 
     def __init__(self, n_splits=5, *, max_train_size=None, test_size=None, gap=0):
         super().__init__(n_splits, shuffle=False, random_state=None)
+        if self.n_splits == "walk_forward" and (
+            max_train_size is None or test_size is None
+        ):
+            raise ValueError(
+                "If n_splits is 'walk_forward', then max_train_size and test_size must"
+                " be specified."
+            )
         self.max_train_size = max_train_size
         self.test_size = test_size
         self.gap = gap
@@ -1066,7 +1072,7 @@ class TimeSeriesSplit(_BaseKFold):
         """
         X, y, groups = indexable(X, y, groups)
         n_samples = _num_samples(X)
-        n_splits = self.n_splits
+        n_splits = self.get_n_splits(X, y, groups)
         n_folds = n_splits + 1
         gap = self.gap
         test_size = (
@@ -1101,6 +1107,29 @@ class TimeSeriesSplit(_BaseKFold):
                     indices[test_start : test_start + test_size],
                 )
 
+    def get_n_splits(self, X=None, y=None, groups=None):
+        """Returns the number of splitting iterations in the cross-validator
+
+        Parameters
+        ----------
+        X : object
+            Always ignored, exists for compatibility.
+
+        y : object
+            Always ignored, exists for compatibility.
+
+        groups : object
+            Always ignored, exists for compatibility.
+
+        Returns
+        -------
+        n_splits : int
+            Returns the number of splitting iterations in the cross-validator.
+        """
+        if self.n_splits == "walk_forward":
+            return X.shape[0] - (self.max_train_size + self.test_size) + 1
+        return self.n_splits
+
 
 class LeaveOneGroupOut(BaseCrossValidator):
     """Leave One Group Out cross-validator

@ShehanAT (Contributor Author) commented Jun 28, 2022

Thanks for looking into this.
I've made the changes you've requested now.
Let me know if I need to make further modifications.

@glemaitre (Member):

So now, we need to add unit tests to check the different changes and the new behaviour.
We also need an entry in the changelog.
We also need to update the docstring documentation, and I think it would be great to update an example and the user guide.

@ShehanAT (Contributor Author):

I've added an entry to the doc/whats_new/v1.2.rst file and added two unit tests.

Here is my updated docstring for the TimeSeriesSplit class:

    >>> # Add in a 2 period gap
    >>> tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
    >>> for train_index, test_index in tscv.split(X):
    ...    print("TRAIN:", train_index, "TEST:", test_index)
    ...    X_train, X_test = X[train_index], X[test_index]
    ...    y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0 1 2 3] TEST: [6 7]
    TRAIN: [0 1 2 3 4 5] TEST: [8 9]
    TRAIN: [0 1 2 3 4 5 6 7] TEST: [10 11]
    >>> # Showing rolling window support via `n_splits='walk_forward'`
    >>> x = np.arange(15)
    >>> cv = TimeSeriesSplit(n_splits='walk_forward', max_train_size=10, test_size=3)
    >>> for train_index, test_index in cv.split(x):
    ...     print("TRAIN: ", train_index, "TEST: ", test_index)
    TRAIN:  [0 1 2 3 4 5] TEST:  [6 7 8]
    TRAIN:  [0 1 2 3 4 5 6 7 8] TEST:  [ 9 10 11]
    TRAIN:  [ 2  3  4  5  6  7  8  9 10 11] TEST:  [12 13 14]

    Notes
    -----
    -   The training set has size ``i * n_samples // (n_splits + 1)
        + n_samples % (n_splits + 1)`` in the ``i`` th split,
        with a test set of size ``n_samples//(n_splits + 1)`` by default,
        where ``n_samples`` is the number of samples.
    -   To use the rolling window support, where the train set does not grow
        and the `n_splits` value is computed automatically,
        set `n_splits='walk_forward'`.
    """

Let me know if any changes are required.

    Returns the number of splitting iterations in the cross-validator.
    """
    if self.n_splits == "walk_forward":
        return X.shape[0] - (self.max_train_size + self.test_size) + 1
Review comment:

Am I missing something here?

If n_splits="walk_forward", X.shape[0]=20, self.max_train_size=10, and self.test_size=5 we get a calculated n_splits of 20 - (10 + 5) + 1 = 6. Thus across all splits we have a total test_size of 5 * 6 = 30 which is greater than our number of samples, and as such we will hit a ValueError on line 1099 which is non-descript for the n_splits="walk_forward" use case.

Is it not the expectation that these inputs would yield indices similar to the following?

[0 1 2 3 4] [5 6 7 8 9]
[0 1 2 3 4 5  6  7  8  9 ] [10 11 12 13 14]
[5 6 7 8 9 10 11 12 13 14] [15 16 17 18 19]
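
The arithmetic behind this concern can be checked directly (plain Python, using the values from the comment above):

```python
n_samples, max_train_size, test_size = 20, 10, 5

# the formula under discussion
n_splits = n_samples - (max_train_size + test_size) + 1  # 20 - 15 + 1 = 6
total_test = n_splits * test_size                        # 6 * 5 = 30

# 30 test samples cannot fit in a series of 20 samples,
# so split() would raise a ValueError before yielding any folds
print(n_splits, total_test, total_test > n_samples)
```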

Member:

Yep we should return:

return (X.shape[0] - self.max_train_size) // self.test_size
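
Sanity-checking this revised count against the 40-sample example that appears later in this thread (a quick plain-Python check, values assumed from that example):

```python
n_samples, max_train_size, test_size = 40, 10, 3

# revised formula: one fold per test_size-sized step after the first train window
n_splits = (n_samples - max_train_size) // test_size
print(n_splits)  # ten folds, matching the rolling-window output shown below
```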

@MSchmidt99, Nov 5, 2022:

The new equation only works for the case when the minimum train size is the maximum train size. In the docstring for tscv = TimeSeriesSplit(n_splits='walk_forward', max_train_size=10, test_size=3) it is shown that this is not always the case.

For the case in the docstring we calculate (15 - 10) // 3 = 1, though the examples given show that we would expect 4. See transcription below (this is why docstring tests are failing currently):

    Fold 0:
      Train: index=[0 1 2]
      Test:  index=[3 4 5]
    Fold 1:
      Train: index=[0 1 2 3 4 5]
      Test:  index=[6 7 8]
    Fold 2:
      Train: index=[0 1 2 3 4 5 6 7 8]
      Test:  index=[ 9 10 11]
    Fold 3:
      Train: index=[ 2  3  4  5  6  7  8  9 10 11]
      Test:  index=[12 13 14]

The calculation should instead be (X.shape[0] - self.min_train_size - self.gap) // self.test_size, where min_train_size would perhaps default to self.test_size for docstring accuracy. We could also allow setting self.min_train_size at initialization (to allow for behavior similar to #24589), which would let users force each training batch to be of equal size (when min_train_size == max_train_size) for consistent CV scores across parameter hypotheses.

I created a PR with fixes to merge into this branch of ShehanAT's fork at #1
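
The two candidate counts can be compared on the docstring example above (min_train_size is the proposed, hypothetical parameter discussed in this comment, not an existing TimeSeriesSplit argument):

```python
def count_max_based(n_samples, max_train_size, test_size):
    # (X.shape[0] - self.max_train_size) // self.test_size
    return (n_samples - max_train_size) // test_size

def count_min_based(n_samples, min_train_size, test_size, gap=0):
    # (X.shape[0] - self.min_train_size - self.gap) // self.test_size
    return (n_samples - min_train_size - gap) // test_size

# docstring example: 15 samples, max_train_size=10, test_size=3
print(count_max_based(15, 10, 3))  # 1 fold, but the docstring shows 4
print(count_min_based(15, 3, 3))   # 4 folds, with min_train_size defaulting to test_size
```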

Member:

I think that the behaviour that we would expect with walk_forward is to get train sets of constant size. So here, max_train_size is just too large to have more than one fold, and I think this is completely fine.

@glemaitre changed the title from "ENH: Added "walk forward" feature to sklearn.model_selection.TimeSeriesSplit" to "ENH Add option n_splits='walk_forward' in TimeSeriesSplit" on Nov 3, 2022
@glemaitre added this to the 1.2 milestone on Nov 3, 2022
@glemaitre (Member) left a comment:

I pushed fixes and improved the documentation directly.
This seems good on my side.

@thomasjpfan do you mind having a look at this?

@glemaitre removed their assignment Nov 10, 2022
@jeremiedbb (Member):

We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3

@jeremiedbb modified the milestones: 1.2, 1.3 Nov 24, 2022
@msat59 commented May 25, 2023

I think we shouldn't define n_splits="walk_forward". It could be a new argument, window_method. I provided the code here. It only needs two ifs and a way to get the train_start index.

@thomasjpfan (Member):

@msat59 window_method="rolling" in combination with n_splits makes it harder to reason about. For example, with your implementation, TimeSeriesSplit skips data and is restricted by the n_splits=5 default:

import numpy as np
splitter = TimeSeriesSplit(max_train_size=10, test_size=3, window_method="rolling")
X = np.arange(40)
list(splitter.split(X))

# [(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24]), array([25, 26, 27])),
# (array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), array([28, 29, 30])),
# (array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), array([31, 32, 33])),
# (array([24, 25, 26, 27, 28, 29, 30, 31, 32, 33]), array([34, 35, 36])),
# (array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36]), array([37, 38, 39]))]

To include all the data in the split, one needs to manually figure out what n_splits needs to be.

With this PR's walk_forward, the whole dataset is used by default:

import numpy as np
splitter = TimeSeriesSplit(max_train_size=10, test_size=3, n_splits="walk_forward")
X = np.arange(40)
list(splitter.split(X))

# [(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11, 12])),
# (array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]), array([13, 14, 15])),
# (array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), array([16, 17, 18])),
# (array([ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]), array([19, 20, 21])),
# (array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), array([22, 23, 24])),
# (array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24]), array([25, 26, 27])),
# (array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), array([28, 29, 30])),
# (array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), array([31, 32, 33])),
# (array([24, 25, 26, 27, 28, 29, 30, 31, 32, 33]), array([34, 35, 36])),
# (array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36]), array([37, 38, 39]))]

@glemaitre I'll take a look at this PR this week.

@msat59 commented May 31, 2023

window_method="rolling" in combination with n_splits makes it harder to reason about.

You are right, but it doesn't make sense to set both n_splits and max_train_size and then expect all data to appear in your splits. It may be reasonable only if you want to use the most recent data in your cross-validation process; even then, it doesn't make sense to use both. Most of the time, you set n_splits (the number of folds) and ignore max_train_size, so you don't care about the number of samples in each split. The focus is on the number of folds (model trainings), mainly to keep cross-validation fast.

Does your implementation work with n_splits=5 only, without specifying max_train_size?

I tested a quick fix, and it seems we can calculate n_splits and override the given value when max_train_size is set. What is the output of your code if the array length is 38, for instance?

X = np.arange(38)

@glemaitre (Member):

@thomasjpfan Do you think that we can merge this one for 1.3?

@thomasjpfan modified the milestones: 1.3, 1.4 Jun 15, 2023
@thomasjpfan (Member):

@glemaitre I do not think so right now. I want to properly evaluate #22523 (comment).

I moved this PR to 1.4

@diegolovison commented Aug 22, 2023

I did the following test:

model = TimeSeriesSplit(max_train_size=11, test_size=3, n_splits="walk_forward")
X: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35]
TRAIN: [ 1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14]
TRAIN: [ 4  5  6  7  8  9 10 11 12 13 14] TEST: [15 16 17]
TRAIN: [ 7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20]
TRAIN: [10 11 12 13 14 15 16 17 18 19 20] TEST: [21 22 23]
TRAIN: [13 14 15 16 17 18 19 20 21 22 23] TEST: [24 25 26]
TRAIN: [16 17 18 19 20 21 22 23 24 25 26] TEST: [27 28 29]
TRAIN: [19 20 21 22 23 24 25 26 27 28 29] TEST: [30 31 32]
TRAIN: [22 23 24 25 26 27 28 29 30 31 32] TEST: [33 34 35]

I am not expecting that.

I would like to have:

0..10 11,12,13
1..11 12,13,14
2..12 13,14,15
....

The above seems like a rolling feature.
What do you think?
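
The behaviour described here — the window advancing one sample per fold — corresponds to a stride-1 rolling split, whereas this PR advances the window by test_size per fold. A hypothetical sketch of that stride-1 variant (not part of the PR):

```python
def stride1_rolling_splits(n_samples, train_size, test_size):
    """Rolling windows that advance by exactly one sample per fold (sketch)."""
    for start in range(n_samples - train_size - test_size + 1):
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test

# first folds for 36 samples, train_size=11, test_size=3:
#   [0..10]  [11, 12, 13]
#   [1..11]  [12, 13, 14]
#   ...
for train, test in stride1_rolling_splits(36, 11, 3):
    print("TRAIN:", train, "TEST:", test)
```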

@msat59 commented Aug 24, 2023

The above seems like a rolling feature.

@diegolovison, you are right. If you specify max_train_size, it becomes the rolling window method. Setting max_train_size doesn't make sense when using the expanding window method.

In my opinion, using max_train_size is pointless in the expanding window method, since we want the window to expand. In addition, we want to limit the number of splits via n_splits due to computational power restrictions. As a result, I still believe using n_splits="walk_forward" is wrong. The function should be used like this:

TimeSeriesSplit(n_splits=5, test_size=3, window_method="expanding")
TimeSeriesSplit(n_splits=5, test_size=3, window_method="rolling")

@thomasjpfan, I modified my code after you reported its bug here. So if you use max_train_size, the function ignores the given/default n_splits and the output would be:

splitter = TimeSeriesSplit(max_train_size=10, test_size=3, window_method="rolling")
X = np.arange(40)
list(splitter.split(X))

[(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9]), array([10, 11, 12])),
 (array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]), array([13, 14, 15])),
 (array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15]), array([16, 17, 18])),
 (array([ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]), array([19, 20, 21])),
 (array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), array([22, 23, 24])),
 (array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24]), array([25, 26, 27])),
 (array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), array([28, 29, 30])),
 (array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), array([31, 32, 33])),
 (array([24, 25, 26, 27, 28, 29, 30, 31, 32, 33]), array([34, 35, 36])),
 (array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36]), array([37, 38, 39]))]

@glemaitre (Member):

Moving the milestone to 1.5 since we did not get time to look at this one.
@thomasjpfan, if you have spare time to take a look at this one, I think we can target the next release.

@glemaitre modified the milestones: 1.4, 1.5 Dec 7, 2023
github-actions bot commented Dec 7, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8efb3aa

@jeremiedbb modified the milestones: 1.5, 1.6 May 13, 2024

Successfully merging this pull request may close these issues.

Add rolling window to sklearn.model_selection.TimeSeriesSplit
8 participants