New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2?] Add homogeneous time series cross validation #6586

Merged
merged 33 commits into from Aug 24, 2016

Conversation

Projects
None yet
8 participants
@yenchenlin
Contributor

yenchenlin commented Mar 24, 2016

This PR is a separated PR from #6351 that tries to solve #6322 ,
and it implements homogeneous time series cross validation and add the corresponding test.

However, there's a discussion on n_folds's meaning here, may reviewers please give me some suggestion?

ping @MechCoder @rvraghav93 @amueller @jnothman 馃檹

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 24, 2016

Rename to n_splits across the board? 馃槢

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Mar 24, 2016

Hi @jnothman, are you serious?
For me, it sounds like a good solution 馃槃

@jnothman

This comment has been minimized.

Member

jnothman commented Mar 24, 2016

I'm serious in that that's how I might name it it in hindsight, especially since we now have get_n_splits. Such change and deprecation should generally be avoided to not force users to change. However, if we were to make that change, now would be the time, given that users of this part of the codebase need to update their code anyway (sklearn.model_selection doesn't exist in master).

Note however, that there's also the further inconsistency with the ShuffleSplit family's use of n_iter for the same notion as n_folds. Indeed, for a consistent and unambiguous interface, [get_]n_iter may be preferable to [get_]n_splits. Unless there is wholesale agreement among devs that we should piggyback further consistency improvements on the model_selection change, I think the choice of parameter name should be limited to the new splitter.

Ordinarily, @GaelVaroquaux righteously champions a conservative position on such change, so it's useful to have his input on whether that piggybacking is at all worth considering.

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Apr 5, 2016

Any update on this?

Would @GaelVaroquaux please share your opinions?
Thanks!

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 10, 2016

Sorry to interrupt, @jnothman and @MechCoder .
I think the functionality of time series cross validation is done.

Problem remains now is that get_n_splits() will return self.n_folds -1, which may be confused since users used to see self.n_folds as how many test folds.

n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
return self.n_folds-1

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

space between operator and args

assert_array_equal(test, [5, 6])
# Check get_n_splits returns the number of folds - 1
assert_equal(2, homo_tscv.get_n_splits())

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

Should assert len(list(splits))) equals get_n_splits(). That is our contract.

This comment has been minimized.

@jnothman

This comment has been minimized.

@yenchenlin

yenchenlin Aug 15, 2016

Contributor

Sorry, and thanks for pointing it out!

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 10, 2016

See #7169, which I should have left to you to post, @yenchenlin !

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 10, 2016

@jnothman thanks! 馃槃

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 14, 2016

Please add to classes.rst

Provides train/test indices to split time series data in train/test sets.
This cross-validation object is a variation of KFold.

This comment has been minimized.

@jnothman

jnothman Aug 14, 2016

Member

Might as well use a :class: link

Provides train/test indices to split time series data in train/test sets.
This cross-validation object is a variation of KFold.
In iteration k, it returns first k folds as train set and k+1 fold as

This comment has been minimized.

@jnothman

jnothman Aug 14, 2016

Member

iteration k -> the kth split. k+1 -> the (k+1)th

This cross-validation object is a variation of KFold.
In iteration k, it returns first k folds as train set and k+1 fold as
test set.

This comment has been minimized.

@jnothman

jnothman Aug 14, 2016

Member

Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 14, 2016

I wonder whether there's any benefit in allowing the user to sub-sample from the training set so that its size is not variable. You could imagine not wanting to choose model parameters that maximise the mean score, but a mean weighted to those with a larger sample of training data: if the amount of training data has an effect, cross-validation scores may not be normally distributed. Allowing the option to sub-sample to a fixed training set size does not fix this, though. It may be worth leaving a note for the user that due to learning curve effects the mean over CV scores might not be as informative as otherwise...

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 14, 2016

This needs some narrative documentation (i.e. content in cross_validation.rst), whose role is to guide the user to the kinds of problems involved in choosing a cross-validation strategy, and the available solutions.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 14, 2016

Test failures pending rebase on #7187, I hope.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 14, 2016

Ideally, this PR would also be supported by an example. Do we know of / have datasets where this would be appropriate?

class HomogeneousTimeSeriesCV(_BaseKFold):
"""Homogeneous Time Series cross-validator
Provides train/test indices to split time series data in train/test sets.

This comment has been minimized.

@jnothman

jnothman Aug 14, 2016

Member

Please clarify to the user that we assume samples from later times have higher indices (and thus shuffling in CV is inappropriate).

This comment has been minimized.

@yenchenlin

yenchenlin Aug 15, 2016

Contributor

done

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Aug 14, 2016

Before this is merged, I think that it is very important that there is a section / subsection added in the model selection documentation on time series data. It doesn't have to be long, but it should point out the particularities of cross-validation on time series, and point to this object.

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 15, 2016

I temporarily pushing commits that rebase #7187 to make sure CI looks good.
Will resolve all these commits after #7187 is merged.

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 15, 2016

I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

I remembered @MechCoder has some experiences with this, would you please suggest some toy dataset?

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 16, 2016

I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is a private property of a startup.

You don't really need to be familiar with time-series data explicitly, as long as you understand why something like KFold is problematic and why this solution is better. "A note on shuffling" already introduces problems related to non-iid data. Similarly describe the time series case where samples are not identically distributed, rather the distribution changes as a function of the dataset ordering.

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 16, 2016

@jnothman I just pushed a new commit to explain why we need time-series CV and what is it, please feel free to give me your opinion.

The first fold has size
``n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``,
and the other folds all have size ``n_samples // (n_splits + 1)``.
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``

This comment has been minimized.

@jnothman

jnothman Aug 24, 2016

Member

You've used a tab character here. And gone over 79 characters. You can break lines within double-backticks if necessary.

for test_start in test_starts:
yield indices[:test_start], indices[test_start:test_start + test_size]
yield (indices[:test_start], \

This comment has been minimized.

@jnothman

jnothman Aug 24, 2016

Member

no backslash needed here.

This comment has been minimized.

@jnothman

jnothman Aug 24, 2016

Member

since it's inside parentheses

@@ -675,8 +675,9 @@ class TimeSeriesCV(_BaseKFold):
Notes
-----
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``,
The training set has size ``i * n_samples // (n_splits + 1)

This comment has been minimized.

@jnothman

jnothman Aug 24, 2016

Member

still a tab character

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 24, 2016

@yenchenlin , can I suggest you install a pep8 checker?
flake8 tends to work well.

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 24, 2016

@jnothman I feel embarrassed, I do have one ...

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 24, 2016

I had no intention to embarrass you. Here's a further hint:

~/repos/scikit-learn$ cat .git/hooks/pre-commit
#!/bin/sh

FILES=$(git diff --cached --name-status | grep -v ^D | awk '$1 $2 { print $2}' | grep -e .py$)
if [ -n "$FILES" ]; then
pep8 -r $FILES
fi

use git commit -n to bypass its warnings. (Though I'm sure that could be done without the greps and I can't be bothered to change that now...)

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 24, 2016

Hurrah! 馃嵒

@jnothman jnothman merged commit 234d256 into scikit-learn:master Aug 24, 2016

0 of 3 checks passed

ci/circleci CircleCI is running your tests
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@jnothman

This comment has been minimized.

Member

jnothman commented Aug 24, 2016

Ugh. forgot to add to classes.rst

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 24, 2016

Now that this is merged, I've realised that it's a bit unconventional for us to have the suffix CV on a splitter, and it has an altogether different conventional meaning. Should this be TimeSeriesSplit or TimeSeriesFold? @amueller??

@amueller

This comment has been minimized.

Member

amueller commented Aug 24, 2016

True. I'd say TimeSeriesSplit

@amueller

This comment has been minimized.

Member

amueller commented Aug 24, 2016

Wanna add to classes and do the rename?

@MechCoder

This comment has been minimized.

Member

MechCoder commented Aug 24, 2016

Thanks a lot @yenchenlin . Yes, I agree that it should be TimeSeriesSplit

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 25, 2016

@yenchenlin are you able to make a patch for these afterthoughts?

@yenchenlin

This comment has been minimized.

Contributor

yenchenlin commented Aug 25, 2016

@jnothman sure!

@raghavrv

This comment has been minimized.

Member

raghavrv commented Aug 25, 2016

TimeSeriesSplitter maybe?

@cbrummitt

This comment has been minimized.

Contributor

cbrummitt commented Nov 30, 2016

I'm not sure I understand what HeterogeneousTimeSeries means. It was mentioned here and in #6322. I think it means that the time stamps are not equally spaced.

I've implemented what I'm calling MultiTimeSeriesSplit for the case of multiple time-series with different time stamps present in different time-series (or, said differently, with missing data). I compute splits by calculating equally spaced quantiles of the distribution of time stamps.

Here's an illustration for a small dataset. There are four time-series (four countries), and not all countries have data in every year (e.g., it's common to have countries be "born" in the middle of a dataset).

image

Would this be of interest?

@yenchenlin what's the status of HeterogeneousTimeSeriesSplit (or whatever is the latest name)?

@jnothman

This comment has been minimized.

Member

jnothman commented Nov 30, 2016

@amueller

This comment has been minimized.

Member

amueller commented Dec 1, 2016

@jnothman also check out the discussion here: #6322

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment