[MRG+1] ENH: added max_train_size to TimeSeriesSplit #8282

Merged
merged 8 commits into scikit-learn:master on Jun 8, 2017

Conversation

4 participants
@dalmia
Contributor

dalmia commented Feb 3, 2017

Reference Issue

Fixes #8249

What does this implement/fix? Explain your changes.

This adds a parameter max_train_size to TimeSeriesSplit that puts an upper limit on the size of each training fold.

Any other comments?

There is one corner case where the size of the first training fold is smaller than max_train_size; in my implementation, I have taken the last of those samples. Please check the tests for a more elaborate description.
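For illustration, a short usage sketch of the proposed parameter (assuming a scikit-learn version that includes this branch); with 6 samples and n_splits=3 the training window is capped at the 2 most recent samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((6, 1))  # 6 samples, 1 feature

# Cap every training window at the 2 most recent samples.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# Expected output:
# TRAIN: [1 2] TEST: [3]
# TRAIN: [2 3] TEST: [4]
# TRAIN: [3 4] TEST: [5]
```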

@dalmia
Contributor

dalmia commented Feb 3, 2017

The build is successful. Travis is failing for unknown reasons, which were also encountered in #8040.
Please review.

@dalmia
Contributor

dalmia commented Feb 3, 2017

Restarting the build a few times worked there.

sklearn/model_selection/_split.py
@@ -687,7 +690,8 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
- def __init__(self, n_splits=3):
+ def __init__(self, n_splits=3, max_train_size=0):
+ self.max_train_size = max_train_size

@dalmia

dalmia Feb 16, 2017

Contributor

This should be below the super call.
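For context, the ordering being asked for looks roughly like the sketch below (hypothetical class names, not the merged scikit-learn code): the new attribute is assigned after the call to the parent constructor.

```python
# Minimal stand-in classes, only to illustrate the "below the super call" point.
class _BaseSplitter:
    def __init__(self, n_splits):
        self.n_splits = n_splits


class TimeSeriesSplitSketch(_BaseSplitter):
    def __init__(self, n_splits=3, max_train_size=None):
        super().__init__(n_splits)             # parent initialisation first ...
        self.max_train_size = max_train_size   # ... then store the new parameter
```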

+ assert_array_equal(test, [4])
+
+
+

@dalmia

dalmia Feb 16, 2017

Contributor

flake8 error.

@codecov

codecov bot commented Feb 16, 2017

Codecov Report

Merging #8282 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8282      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60809    60920     +111     
==========================================
+ Hits        57617    57726     +109     
- Misses       3192     3194       +2

Impacted Files Coverage Δ
sklearn/model_selection/_split.py 98.77% <100%> (ø)
sklearn/model_selection/tests/test_split.py 95.65% <100%> (+0.09%)
sklearn/utils/tests/test_validation.py 97.38% <ø> (-0.94%)
sklearn/utils/tests/test_class_weight.py 100% <ø> (ø)
sklearn/neighbors/tests/test_ball_tree.py 98% <ø> (ø)
sklearn/ensemble/tests/test_weight_boosting.py 100% <ø> (ø)
sklearn/linear_model/tests/test_randomized_l1.py 100% <ø> (ø)
sklearn/neighbors/tests/test_kde.py 98.85% <ø> (ø)
sklearn/neighbors/tests/test_kd_tree.py 97.45% <ø> (ø)
sklearn/gaussian_process/tests/test_kernels.py 98.85% <ø> (+0.02%)
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 676e863...ac4b88a. Read the comment docs.

sklearn/model_selection/_split.py
@@ -687,10 +690,11 @@ class TimeSeriesSplit(_BaseKFold):
with a test set of size ``n_samples//(n_splits + 1)``,
where ``n_samples`` is the number of samples.
"""
- def __init__(self, n_splits=3):
+ def __init__(self, n_splits=3, max_train_size=0):

@jnothman

jnothman Feb 17, 2017

Member

conventionally we use None to mean "feature disabled"

sklearn/model_selection/_split.py
- yield (indices[:test_start],
- indices[test_start:test_start + test_size])
+ if self.max_train_size > 0 and self.max_train_size < test_start:
+ yield (indices[test_start - self.max_train_size:test_start],

@jnothman

jnothman Feb 17, 2017

Member

if this becomes negative it means something else. I think you need a max

@dalmia

dalmia Feb 17, 2017

Contributor

The check ensures that this branch is only taken when the start index of the current test fold exceeds max_train_size; otherwise it falls back to the default behaviour. I have updated the tests to capture what I think you were referring to.
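To make the behaviour under discussion concrete, here is a standalone sketch of the split loop with the None default that jnothman suggests (a simplified stand-in for TimeSeriesSplit.split, not the exact merged diff):

```python
import numpy as np


def time_series_splits(n_samples, n_splits=3, max_train_size=None):
    """Simplified stand-in for TimeSeriesSplit.split with a training-size cap."""
    indices = np.arange(n_samples)
    n_folds = n_splits + 1
    test_size = n_samples // n_folds
    test_starts = range(test_size + n_samples % n_folds, n_samples, test_size)
    for test_start in test_starts:
        if max_train_size and max_train_size < test_start:
            # Cap the training window to the most recent max_train_size samples.
            yield (indices[test_start - max_train_size:test_start],
                   indices[test_start:test_start + test_size])
        else:
            # Window already within the cap (or no cap): use all earlier samples.
            yield (indices[:test_start],
                   indices[test_start:test_start + test_size])


for train, test in time_series_splits(6, n_splits=3, max_train_size=2):
    print(train, test)
# [1 2] [3]
# [2 3] [4]
# [3 4] [5]
```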

@lesteve
Member

lesteve commented Feb 17, 2017

Not sure why codecov is red, try to rebase on master and see what happens.

@jnothman

I would have considered writing tests that avoid writing out each split and instead check an invariant like the following: the test set is the same as without max_train_size, and the train set with max_train_size is a suffix of the train set without it, limited to that length (see the sketch below).

But this looks fine to me apart from those nitpicks.
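A sketch of such an invariant check might look like this (hypothetical helper name; it assumes max_train_size is already available on TimeSeriesSplit):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def check_max_train_size_invariant(X, max_train_size, n_splits=3):
    # Hypothetical helper: compare splits with and without the cap.
    full = TimeSeriesSplit(n_splits=n_splits).split(X)
    capped = TimeSeriesSplit(n_splits=n_splits,
                             max_train_size=max_train_size).split(X)
    for (train, test), (train_c, test_c) in zip(full, capped):
        # The test set must be unchanged by max_train_size ...
        np.testing.assert_array_equal(test, test_c)
        # ... and the capped train set must be a suffix of the full train set,
        # no longer than max_train_size.
        assert len(train_c) <= max_train_size
        np.testing.assert_array_equal(train[-len(train_c):], train_c)


check_max_train_size_invariant(np.zeros((6, 1)), max_train_size=2)
```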

sklearn/model_selection/_split.py
@@ -664,14 +664,17 @@ class TimeSeriesSplit(_BaseKFold):
n_splits : int, default=3
Number of splits. Must be at least 1.
+ max_train_size : int, optional
+ Maximum size for a single training fold.

@jnothman

jnothman Feb 21, 2017

Member

I'm never sure about the correct use of "fold". Let's call it a "training set" or a "training sample".

+ train, test = next(splits)
+ assert_array_equal(train, [0, 1, 2, 3, 4])
+ assert_array_equal(test, [5])
+

@jnothman

jnothman Feb 21, 2017

Member

Should really do assert_raises(StopIteration, next(splits)).
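One note on the suggested assertion: for the helper to catch the exception, next has to be passed as a callable rather than called inline, for example (using numpy's assert_raises as a stand-in for the project's test utility):

```python
from numpy.testing import assert_raises

splits = iter([("train", "test")])  # stand-in for a split generator
next(splits)                        # consume the only split

# Pass `next` and `splits` separately so StopIteration is raised inside
# the assertion helper, not while its arguments are being evaluated.
assert_raises(StopIteration, next, splits)
```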

@@ -1176,6 +1176,46 @@ def test_time_series_cv():
assert_equal(n_splits_actual, 2)
+def test_time_series_max_train_size():
+ X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

@jnothman

jnothman Feb 21, 2017

Member

How about X = np.zeros((6, 1)); then it is clear how many samples are in X.

+ train, test = next(splits)
+ assert_array_equal(train, [2, 3])
+ assert_array_equal(test, [4])
+

@jnothman

jnothman Feb 21, 2017

Member

Another split?

@dalmia
Contributor

dalmia commented Feb 22, 2017

Yes, that seems much cleaner, writing out the splits should have been avoided in the first place. Thank you for the suggestion 👍

dalmia added some commits Feb 22, 2017

@dalmia
Contributor

dalmia commented Feb 22, 2017

CircleCI needs a rebuild.

@jnothman
Member

jnothman commented Jun 8, 2017

LGTM. Please add a what's new entry.

@jnothman jnothman changed the title from [MRG] ENH: added max_train_size to TimeSeriesSplit to [MRG+1] ENH: added max_train_size to TimeSeriesSplit Jun 8, 2017

@agramfort
Member

agramfort commented Jun 8, 2017

merging. I'll add the what's new entry in master directly

@agramfort agramfort merged commit 511c9a8 into scikit-learn:master Jun 8, 2017

5 checks passed

ci/circleci Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 94.75%)
codecov/project 94.75% (+<.01%) compared to 676e863
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@jnothman
Member

jnothman commented Jun 8, 2017

Thanks @dalmia!

@agramfort
Member

agramfort commented Jun 8, 2017

done in 305ed51

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)
* ENH: added max_train_size to TimeSeriesSplit

* FIX: update doctest

* FIX: correct error in the previous update

* FIX: added doctest fix for cross_validation.rst

* FIX: remove errors

* TST: tests updated and default value changed to None

* TST: improve split tests

* FIX: reduce code length

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] ENH: added max_train_size to TimeSeriesSplit (#8282)