Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] ENH: added max_train_size to TimeSeriesSplit #8282

Merged
merged 8 commits into from Jun 8, 2017

Conversation

@dalmia
Copy link
Contributor

@dalmia dalmia commented Feb 3, 2017

Reference Issue

Fixes #8249

What does this implement/fix? Explain your changes.

This adds a parameter max_train_size to TimeSeriesSplit that puts an upper limit on the size of each training fold.

Any other comments?

There is one corner case where the size of the first train fold is smaller than the max_train_size. In my implementation, I have taken the last of those. Please check the tests for a more elaborate description.

@dalmia
Copy link
Contributor Author

@dalmia dalmia commented Feb 3, 2017

The build is successful. Travis is failing due to unknown reasons also encountered in #8040.
Please review.

@dalmia
Copy link
Contributor Author

@dalmia dalmia commented Feb 3, 2017

Restarting the build a few times worked there.

@@ -687,7 +690,8 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
def __init__(self, n_splits=3):
def __init__(self, n_splits=3, max_train_size=0):
self.max_train_size = max_train_size

This comment has been minimized.

@dalmia

dalmia Feb 16, 2017
Author Contributor

This should be below the super call.

assert_array_equal(test, [4])



This comment has been minimized.

@dalmia

dalmia Feb 16, 2017
Author Contributor

flake8 error.

@codecov
Copy link

@codecov codecov bot commented Feb 16, 2017

Codecov Report

Merging #8282 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8282      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60809    60920     +111     
==========================================
+ Hits        57617    57726     +109     
- Misses       3192     3194       +2
Impacted Files Coverage Δ
sklearn/model_selection/_split.py 98.77% <100%> (ø)
sklearn/model_selection/tests/test_split.py 95.65% <100%> (+0.09%)
sklearn/utils/tests/test_validation.py 97.38% <ø> (-0.94%)
sklearn/utils/tests/test_class_weight.py 100% <ø> (ø)
sklearn/neighbors/tests/test_ball_tree.py 98% <ø> (ø)
sklearn/ensemble/tests/test_weight_boosting.py 100% <ø> (ø)
sklearn/linear_model/tests/test_randomized_l1.py 100% <ø> (ø)
sklearn/neighbors/tests/test_kde.py 98.85% <ø> (ø)
sklearn/neighbors/tests/test_kd_tree.py 97.45% <ø> (ø)
sklearn/gaussian_process/tests/test_kernels.py 98.85% <ø> (+0.02%)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 676e863...ac4b88a. Read the comment docs.

@@ -687,10 +690,11 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
def __init__(self, n_splits=3):
def __init__(self, n_splits=3, max_train_size=0):

This comment has been minimized.

@jnothman

jnothman Feb 17, 2017
Member

conventionally we use None to mean "feature disabled"

yield (indices[:test_start],
indices[test_start:test_start + test_size])
if self.max_train_size > 0 and self.max_train_size < test_start:
yield (indices[test_start - self.max_train_size:test_start],

This comment has been minimized.

@jnothman

jnothman Feb 17, 2017
Member

if this becomes negative it means something else. I think you need a max

This comment has been minimized.

@dalmia

dalmia Feb 17, 2017
Author Contributor

The check ensures that only when the current fold's index exceeds the max_train_size will this be yielded. Else, it'll revert back to the default. I have updated the tests to capture what I felt you were mentioning.

@lesteve
Copy link
Member

@lesteve lesteve commented Feb 17, 2017

Not sure why codecov is red, try to rebase on master and see what happens.

@dalmia dalmia force-pushed the dalmia:8249 branch from 979e80d to 84278be Feb 17, 2017
Copy link
Member

@jnothman jnothman left a comment

I would have considered writing tests that avoided writing out each split, but instead checked something like the following invariant: test set is same as without max_train_size, train with max_train_size is a suffix of train without max_train_size but limited to that length.

But this looks fine to me apart from those nitpicks.

@@ -664,14 +664,17 @@ class TimeSeriesSplit(_BaseKFold):
n_splits : int, default=3
Number of splits. Must be at least 1.
max_train_size : int, optional
Maximum size for a single training fold.

This comment has been minimized.

@jnothman

jnothman Feb 21, 2017
Member

I'm never sure about the correct use of "fold". Let's call it a "training set" or a "training sample".

train, test = next(splits)
assert_array_equal(train, [0, 1, 2, 3, 4])
assert_array_equal(test, [5])

This comment has been minimized.

@jnothman

jnothman Feb 21, 2017
Member

Should really do assert_raises(StopIteration, next(splits)).

@@ -1176,6 +1176,46 @@ def test_time_series_cv():
assert_equal(n_splits_actual, 2)


def test_time_series_max_train_size():
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

This comment has been minimized.

@jnothman

jnothman Feb 21, 2017
Member

How about X = np.zeros((6, 1)); then it is clear how many samples in X.

train, test = next(splits)
assert_array_equal(train, [2, 3])
assert_array_equal(test, [4])

This comment has been minimized.

@jnothman

jnothman Feb 21, 2017
Member

Another split?

@dalmia
Copy link
Contributor Author

@dalmia dalmia commented Feb 22, 2017

Yes, that seems much cleaner, writing out the splits should have been avoided in the first place. Thank you for the suggestion 👍

dalmia added 2 commits Feb 22, 2017
@dalmia
Copy link
Contributor Author

@dalmia dalmia commented Feb 22, 2017

CircleCI needs a rebuild.

@jnothman
Copy link
Member

@jnothman jnothman commented Jun 8, 2017

LGTM. Please add a what's new entry.

@jnothman jnothman changed the title [MRG] ENH: added max_train_size to TimeSeriesSplit [MRG+1] ENH: added max_train_size to TimeSeriesSplit Jun 8, 2017
@agramfort
Copy link
Member

@agramfort agramfort commented Jun 8, 2017

merging. I'll add the what's new entry in master directly

@agramfort agramfort merged commit 511c9a8 into scikit-learn:master Jun 8, 2017
5 checks passed
5 checks passed
ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 94.75%)
Details
codecov/project 94.75% (+<.01%) compared to 676e863
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman
Copy link
Member

@jnothman jnothman commented Jun 8, 2017

Thanks @dalmia!

@agramfort
Copy link
Member

@agramfort agramfort commented Jun 8, 2017

done in 305ed51

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants
You can’t perform that action at this time.