ENH Add option n_splits='walk_forward' in TimeSeriesSplit #23780
base: main
Conversation
sklearn/model_selection/_split.py
Outdated
n_splits = self.n_splits
n_folds = n_splits + 1
gap = self.gap
test_size = (
    self.test_size if self.test_size is not None else n_samples // n_folds
)
no need for this change
sklearn/model_selection/_split.py
Outdated
    max_train_size=None,
    test_size=None,
):
    if isinstance(n_splits, int):
Thinking about it, I think we can keep the base class as is and make it accept a str without raising a warning. Then we can specialize get_n_splits for TimeSeriesSplit so that it does not return self.n_splits directly but instead computes n_splits when a string is provided. Therefore, we only specialize the TimeSeriesSplit class.
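A minimal sketch of that suggestion (class names suffixed with `Sketch` are stand-ins, not the scikit-learn classes): the base class stores `n_splits` without rejecting the string, and only the `TimeSeriesSplit` subclass overrides `get_n_splits` to compute the value from `X`.

```python
import numpy as np

class _BaseKFoldSketch:
    # Base class keeps n_splits as given; it may be an int or "walk_forward".
    def __init__(self, n_splits):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

class TimeSeriesSplitSketch(_BaseKFoldSketch):
    def __init__(self, n_splits=5, *, max_train_size=None, test_size=None):
        super().__init__(n_splits)
        self.max_train_size = max_train_size
        self.test_size = test_size

    def get_n_splits(self, X=None, y=None, groups=None):
        # Only the subclass computes n_splits when a string was provided.
        if isinstance(self.n_splits, str):
            return X.shape[0] - (self.max_train_size + self.test_size) + 1
        return self.n_splits

X = np.arange(20).reshape(-1, 1)
cv = TimeSeriesSplitSketch(n_splits="walk_forward", max_train_size=10, test_size=5)
print(cv.get_n_splits(X))  # 6
```

The formula used here is the one proposed later in this thread; integer `n_splits` passes through unchanged, so existing behaviour is preserved.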
OK, I've replaced the previous method with this get_n_splits() method now.
sklearn/model_selection/_split.py
Outdated
@@ -1066,13 +1103,13 @@ def split(self, X, y=None, groups=None):
    """
    X, y, groups = indexable(X, y, groups)
    n_samples = _num_samples(X)
no need for this change
Removed
sklearn/model_selection/_split.py
Outdated
@@ -1066,13 +1103,13 @@ def split(self, X, y=None, groups=None):
    """
    X, y, groups = indexable(X, y, groups)
    n_samples = _num_samples(X)

    n_splits = self.n_splits
I think this is where we can do: n_splits = self.get_n_splits(...)
Yep, added the line n_splits = self.get_n_splits(X, y, groups) now.
sklearn/model_selection/_split.py
Outdated
def __init__(self, n_splits=5, *, max_train_size=None, test_size=None, gap=0):
    super().__init__(n_splits, shuffle=False, random_state=None)
def __init__(
    self, n_splits=5, x_shape=None, *, max_train_size=None, test_size=None, gap=0
We should not need x_shape at initialization. Having X at the split call should be enough to get this information.
I've removed x_shape now.
@ShehanAT Consider updating the pull request description as well.
sklearn/model_selection/_split.py
Outdated
@@ -1101,6 +1138,105 @@ def split(self, X, y=None, groups=None):
        indices[test_start : test_start + test_size],
    )

def find_walk_forward_n_splits_value(self, x_value, max_train_size, test_size):
If we replace this with get_n_splits, we will have the following signature: def get_n_splits(self, X=None, y=None, groups=None). Since we have X and self, we will have all the necessary information to compute the number of splits required to make the rolling windows.
Good idea
sklearn/model_selection/_split.py
Outdated
TRAIN: [10 11 12] TEST: [13]
TRAIN: [11 12 13] TEST: [14]
"""
x = np.arange(x_value)
I think that we only need to have:
def get_n_splits(self, X=None, y=None, groups=None):
    if isinstance(self.n_splits, str):
        return X.shape[0] - (self.max_train_size + self.test_size) + 1
    return self.n_splits
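As a quick sanity check of that formula (example values assumed, not taken from the comment):

```python
import numpy as np

# Assumed example: 15 samples, max_train_size=10, test_size=3.
X = np.arange(15).reshape(-1, 1)
max_train_size, test_size = 10, 3

# The suggested walk-forward computation: one split per sample that
# remains after the first full train window plus one test window.
n_splits = X.shape[0] - (max_train_size + test_size) + 1
print(n_splits)  # 3
```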
Yep, this is a lot simpler and more readable than the previous method.
I just wanted to confirm my intuition. Here is a quick diff that I could come up with without making extensive tests:
diff --git a/sklearn/model_selection/_split.py b/sklearn/model_selection/_split.py
index d2a0b5e1fc..609fda7c63 100644
--- a/sklearn/model_selection/_split.py
+++ b/sklearn/model_selection/_split.py
@@ -275,22 +275,21 @@ class _BaseKFold(BaseCrossValidator, metaclass=ABCMeta):
@abstractmethod
def __init__(self, n_splits, *, shuffle, random_state):
- if not isinstance(n_splits, numbers.Integral):
- raise ValueError(
- "The number of folds must be of Integral type. "
- "%s of type %s was passed." % (n_splits, type(n_splits))
- )
- n_splits = int(n_splits)
-
- if n_splits <= 1:
+ if isinstance(n_splits, numbers.Integral):
+ if n_splits <= 1:
+ raise ValueError(
+ "k-fold cross-validation requires at least one"
+ " train/test split by setting n_splits=2 or more,"
+ f" got n_splits={n_splits}."
+ )
+ elif n_splits != "walk_forward":
raise ValueError(
- "k-fold cross-validation requires at least one"
- " train/test split by setting n_splits=2 or more,"
- " got n_splits={0}.".format(n_splits)
+ "n_splits should be an integer number or 'walk_forward' for "
+ "the TimeSeriesSplit cross-validator."
)
if not isinstance(shuffle, bool):
- raise TypeError("shuffle must be True or False; got {0}".format(shuffle))
+ raise TypeError(f"shuffle must be True or False; got {shuffle}")
if not shuffle and random_state is not None: # None is the default
raise ValueError(
@@ -1037,6 +1036,13 @@ class TimeSeriesSplit(_BaseKFold):
def __init__(self, n_splits=5, *, max_train_size=None, test_size=None, gap=0):
super().__init__(n_splits, shuffle=False, random_state=None)
+ if self.n_splits == "walk_forward" and (
+ max_train_size is None or test_size is None
+ ):
+ raise ValueError(
+ "If n_splits is 'walk_forward', then max_train_size and test_size must"
+ " be specified."
+ )
self.max_train_size = max_train_size
self.test_size = test_size
self.gap = gap
@@ -1066,7 +1072,7 @@ class TimeSeriesSplit(_BaseKFold):
"""
X, y, groups = indexable(X, y, groups)
n_samples = _num_samples(X)
- n_splits = self.n_splits
+ n_splits = self.get_n_splits(X, y, groups)
n_folds = n_splits + 1
gap = self.gap
test_size = (
@@ -1101,6 +1107,29 @@ class TimeSeriesSplit(_BaseKFold):
indices[test_start : test_start + test_size],
)
+ def get_n_splits(self, X=None, y=None, groups=None):
+ """Returns the number of splitting iterations in the cross-validator
+
+ Parameters
+ ----------
+ X : object
+ Always ignored, exists for compatibility.
+
+ y : object
+ Always ignored, exists for compatibility.
+
+ groups : object
+ Always ignored, exists for compatibility.
+
+ Returns
+ -------
+ n_splits : int
+ Returns the number of splitting iterations in the cross-validator.
+ """
+ if self.n_splits == "walk_forward":
+ return X.shape[0] - (self.max_train_size + self.test_size) + 1
+ return self.n_splits
+
 class LeaveOneGroupOut(BaseCrossValidator):
     """Leave One Group Out cross-validator
Thanks for looking into this.
So now, we need to add unit tests to check the different changes and new behaviour.
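A sketch of what such unit tests might look like. TimeSeriesSplitStandIn is a minimal stand-in (an assumption) reproducing the diff's validation and get_n_splits logic, so the example runs without the patch; it is not the scikit-learn class.

```python
import numbers
import numpy as np

class TimeSeriesSplitStandIn:
    # Stand-in mirroring the validation proposed in the diff above.
    def __init__(self, n_splits=5, *, max_train_size=None, test_size=None):
        if not isinstance(n_splits, numbers.Integral) and n_splits != "walk_forward":
            raise ValueError(
                "n_splits should be an integer number or 'walk_forward' for "
                "the TimeSeriesSplit cross-validator."
            )
        if n_splits == "walk_forward" and (max_train_size is None or test_size is None):
            raise ValueError(
                "If n_splits is 'walk_forward', then max_train_size and "
                "test_size must be specified."
            )
        self.n_splits = n_splits
        self.max_train_size = max_train_size
        self.test_size = test_size

    def get_n_splits(self, X=None, y=None, groups=None):
        if self.n_splits == "walk_forward":
            return X.shape[0] - (self.max_train_size + self.test_size) + 1
        return self.n_splits

# Unit-test style checks of the new behaviour:
X = np.arange(20).reshape(-1, 1)
cv = TimeSeriesSplitStandIn(n_splits="walk_forward", max_train_size=10, test_size=5)
assert cv.get_n_splits(X) == 6

# Missing sizes with the string option should raise the new error.
try:
    TimeSeriesSplitStandIn(n_splits="walk_forward")
except ValueError as err:
    assert "max_train_size" in str(err)
```

In the real test suite these checks would live in sklearn/model_selection/tests/test_split.py and use pytest.raises, but the assertions would be the same.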
I've added the entry. Here is my updated docstring for the new method. Let me know if any changes are required.
sklearn/model_selection/_split.py
Outdated
    Returns the number of splitting iterations in the cross-validator.
    """
    if self.n_splits == "walk_forward":
        return X.shape[0] - (self.max_train_size + self.test_size) + 1
Am I missing something here?
If n_splits="walk_forward", X.shape[0]=20, self.max_train_size=10, and self.test_size=5, we get a calculated n_splits of 20 - (10 + 5) + 1 = 6. Thus across all splits we have a total test_size of 5 * 6 = 30, which is greater than our number of samples, so we will hit a ValueError on line 1099 that is non-descript for the n_splits="walk_forward" use case.
Is it not the expectation that these inputs would yield indices similar to the following?
[0 1 2 3 4] [5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9] [10 11 12 13 14]
[5 6 7 8 9 10 11 12 13 14] [15 16 17 18 19]
Yep, we should return:
return (X.shape[0] - self.max_train_size) // self.test_size
The new equation only works when the minimum train size equals the maximum train size. The docstring for tscv = TimeSeriesSplit(n_splits='walk_forward', max_train_size=10, test_size=3) shows that this is not always the case.
For the docstring's case we calculate (15 - 10) // 3 = 1, though the examples given show that we would expect 4. See the transcription below (this is why the docstring tests are currently failing):
Fold 0:
Train: index=[0 1 2]
Test: index=[3 4 5]
Fold 1:
Train: index=[0 1 2 3 4 5]
Test: index=[6 7 8]
Fold 2:
Train: index=[0 1 2 3 4 5 6 7 8]
Test: index=[ 9 10 11]
Fold 3:
Train: index=[ 2 3 4 5 6 7 8 9 10 11]
Test: index=[12 13 14]
The calculation should instead be (X.shape[0] - self.min_train_size - self.gap) // self.test_size, where perhaps the default for self.min_train_size should be self.test_size for docstring accuracy. We could also allow setting self.min_train_size at initialization (to allow behavior similar to #24589), which would let users force each training batch to be of equal size (when min_train_size == max_train_size) for consistent CV scores on parameter hypotheses.
I created a PR with fixes to merge into this branch of ShehanAT's fork at #1
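The arithmetic in this thread can be checked directly. Values come from the docstring example above; min_train_size defaulting to test_size is the suggestion under discussion, not merged behaviour.

```python
# Docstring example: 15 samples, max_train_size=10, test_size=3, gap=0.
n_samples, max_train_size, test_size, gap = 15, 10, 3, 0
min_train_size = test_size  # suggested default (assumption)

f_original = n_samples - (max_train_size + test_size) + 1  # initial formula
f_rolling = (n_samples - max_train_size) // test_size      # fixed-window formula
f_min = (n_samples - min_train_size - gap) // test_size    # min_train_size formula

print(f_original, f_rolling, f_min)  # 3 1 4
```

Only the last formula yields the 4 folds shown in the transcription; the fixed-window formula gives a single fold, which is the trade-off debated in the next comment.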
I think the behaviour we would expect with walk_forward is to get train sets of constant size. So here, max_train_size is just too large to have more than one fold, and I think this is completely fine.
I pushed fixes and improved the documentation directly. This seems good on my side.
@thomasjpfan do you mind having a look at this?
We won't have time to finish the review on this one before the 1.2 release. Moving it to 1.3
I think we shouldn't define
@msat59
import numpy as np

splitter = TimeSeriesSplit(max_train_size=10, test_size=3, window_method="rolling")
X = np.arange(40)
list(splitter.split(X))
# [(array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24]), array([25, 26, 27])),
#  (array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), array([28, 29, 30])),
#  (array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), array([31, 32, 33])),
#  (array([24, 25, 26, 27, 28, 29, 30, 31, 32, 33]), array([34, 35, 36])),
#  (array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36]), array([37, 38, 39]))]

To include all the data in the split, one needs to manually figure out what n_splits should be. With this PR's n_splits="walk_forward":

import numpy as np

splitter = TimeSeriesSplit(max_train_size=10, test_size=3, n_splits="walk_forward")
X = np.arange(40)
list(splitter.split(X))
# [(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([10, 11, 12])),
#  (array([ 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), array([13, 14, 15])),
#  (array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]), array([16, 17, 18])),
#  (array([ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]), array([19, 20, 21])),
#  (array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), array([22, 23, 24])),
#  (array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24]), array([25, 26, 27])),
#  (array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27]), array([28, 29, 30])),
#  (array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30]), array([31, 32, 33])),
#  (array([24, 25, 26, 27, 28, 29, 30, 31, 32, 33]), array([34, 35, 36])),
#  (array([27, 28, 29, 30, 31, 32, 33, 34, 35, 36]), array([37, 38, 39]))]

@glemaitre I'll take a look at this PR this week.
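The ten walk-forward splits listed above can be reproduced with a short standalone function (a sketch of the assumed semantics: a fixed-size rolling train window, with the split count from the corrected formula earlier in this thread):

```python
import numpy as np

def walk_forward_splits(n_samples, max_train_size, test_size):
    # Number of folds from the formula discussed above.
    n_splits = (n_samples - max_train_size) // test_size
    indices = np.arange(n_samples)
    splits = []
    for i in range(n_splits):
        # Each fold shifts the fixed-size train window forward by test_size.
        test_start = max_train_size + i * test_size
        train_start = test_start - max_train_size
        splits.append(
            (indices[train_start:test_start],
             indices[test_start:test_start + test_size])
        )
    return splits

splits = walk_forward_splits(40, 10, 3)
print(len(splits))                 # 10
print(splits[0])                   # (array([0, ..., 9]), array([10, 11, 12]))
print(splits[-1][1])               # [37 38 39]
```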
You are right, but it doesn't make sense to set both.
Does your implementation work with it? I tested a quick fix and it seems we can calculate it.
@thomasjpfan Do you think that we can merge this one for 1.3?
@glemaitre I do not think so right now. I want to properly evaluate #22523 (comment). I moved this PR to 1.4
I did the following test:
I am not expecting that. I would like to have:
The above seems like a rolling feature.
@diegolovison, you are right. If you specify it, the behaviour changes. In my opinion, using it is clearer.
@thomasjpfan, I had modified my code after you reported its bug here. So if you use the updated version, it should work.
Moving the milestone to 1.5 since we did not get time to look at this one.
Reference Issues/PRs
Fixes #22523
What does this implement/fix? Explain your changes.
This pull request adds the feature requested in issue #22523: "walk_forward" for the TimeSeriesSplit class. Based on what I understand about this feature, here are some important details:
Walking forward is defined by the first element of the first train set starting from 0 and being incremented by 1 in the subsequent train and test indices, while adhering to the max train size and test size values. For example:
To make walking forward possible, the n_splits value is set to walk_forward: cv = TimeSeriesSplit(n_splits="walk_forward", x_shape=x.shape[0], max_train_size=10, test_size=2). However, as this parameter was originally designed to receive only integer values, I had to modify the constructor of the _BaseKFold super class to make this possible. Also, the x.shape[0] value from x = np.arange(15) needs to be passed into the constructor in order to automatically compute a value for the n_splits parameter.
Once the n_splits value is set to walk_forward, the find_walk_forward_n_splits_value() method is called to calculate an appropriate integer value for n_splits so that walking forward is possible. In the case of multiple potential n_splits values, the first element of an array containing those values is returned:
Any other comments?
I have manually regression tested the original functionality and it seems to work fine.
There are two tests in the sklearn/model_selection/_split.py file that are failing (I'm looking into why now).
Let me know if this potential solution works, needs modifications, or if you have any feedback.