[MRG+2?] Add homogeneous time series cross validation #6586

Merged
merged 33 commits into master on Aug 24, 2016

Conversation
8 participants
@yenchenlin (Contributor) commented Mar 24, 2016

This PR is separated out from #6351 and aims to solve #6322:
it implements homogeneous time series cross validation and adds the corresponding test.

However, there is an ongoing discussion about the meaning of n_folds here; could reviewers please give me some suggestions?

ping @MechCoder @rvraghav93 @amueller @jnothman 🙏

@jnothman (Member) commented Mar 24, 2016

Rename to n_splits across the board? 😛

@yenchenlin (Contributor) commented Mar 24, 2016

Hi @jnothman, are you serious?
To me, that sounds like a good solution 😄

@jnothman (Member) commented Mar 24, 2016

I'm serious in that that's how I might name it in hindsight, especially since we now have get_n_splits. Such a change and deprecation should generally be avoided so as not to force users to change their code. However, if we were to make that change, now would be the time, given that users of this part of the codebase need to update their code anyway (sklearn.model_selection doesn't exist in master).

Note, however, that there's also the further inconsistency with the ShuffleSplit family's use of n_iter for the same notion as n_folds. Indeed, for a consistent and unambiguous interface, [get_]n_iter may be preferable to [get_]n_splits. Unless there is wholesale agreement among devs that we should piggyback further consistency improvements on the model_selection change, I think the choice of parameter name should be limited to the new splitter.

Ordinarily, @GaelVaroquaux righteously champions a conservative position on such change, so it's useful to have his input on whether that piggybacking is at all worth considering.

@yenchenlin (Contributor) commented Apr 5, 2016

Any update on this?

@GaelVaroquaux, would you please share your opinion?
Thanks!

@yenchenlin (Contributor) commented Aug 10, 2016

Sorry to interrupt, @jnothman and @MechCoder.
I think the functionality of time series cross validation is done.

The remaining problem is that get_n_splits() will return self.n_folds - 1, which may be confusing, since users are used to thinking of self.n_folds as the number of test folds.

(outdated diff) sklearn/model_selection/_split.py
n_splits : int
Returns the number of splitting iterations in the cross-validator.
"""
return self.n_folds-1

@jnothman (Member) commented Aug 10, 2016

space between operator and args

(outdated diff) sklearn/model_selection/tests/test_split.py
assert_array_equal(test, [5, 6])
# Check get_n_splits returns the number of folds - 1
assert_equal(2, homo_tscv.get_n_splits())

@jnothman (Member) commented Aug 10, 2016

Should assert that len(list(splits)) equals get_n_splits(). That is our contract.
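That contract can be illustrated with a small standalone sketch. The splitter below is a hypothetical reimplementation of the logic in this diff for illustration only (it is not the sklearn class), with get_n_splits returning n_folds - 1 as discussed above:

```python
def time_series_split(n_samples, n_folds=3):
    """Yield (train, test) index lists for a growing-window split."""
    indices = list(range(n_samples))
    test_size = n_samples // n_folds
    # the first test block starts after the first fold plus any surplus samples
    test_starts = range(test_size + n_samples % n_folds, n_samples, test_size)
    for start in test_starts:
        yield indices[:start], indices[start:start + test_size]

def get_n_splits(n_folds=3):
    # one split per fold boundary: the first fold is never used as a test set
    return n_folds - 1

splits = list(time_series_split(7, n_folds=3))
assert len(splits) == get_n_splits(n_folds=3)  # the contract under discussion
```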

@jnothman (Member) commented Aug 14, 2016

Ping

@yenchenlin (Contributor) commented Aug 15, 2016

Sorry, and thanks for pointing it out!

@jnothman (Member) commented Aug 10, 2016

See #7169, which I should have left to you to post, @yenchenlin!

@yenchenlin (Contributor) commented Aug 10, 2016

@jnothman thanks! 😄

@jnothman (Member) commented Aug 14, 2016

Please add to classes.rst

(outdated diff) sklearn/model_selection/_split.py
Provides train/test indices to split time series data in train/test sets.
This cross-validation object is a variation of KFold.

@jnothman (Member) commented Aug 14, 2016

Might as well use a :class: link

(outdated diff) sklearn/model_selection/_split.py
Provides train/test indices to split time series data in train/test sets.
This cross-validation object is a variation of KFold.
In iteration k, it returns first k folds as train set and k+1 fold as

@jnothman (Member) commented Aug 14, 2016

"iteration k" -> "the kth split"; "k+1" -> "the (k+1)th"

(outdated diff) sklearn/model_selection/_split.py
This cross-validation object is a variation of KFold.
In iteration k, it returns first k folds as train set and k+1 fold as
test set.

@jnothman (Member) commented Aug 14, 2016

Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.
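The superset property can be checked with a short sketch, again using a hypothetical standalone version of the splitter's logic rather than the sklearn class itself:

```python
def time_series_split(n_samples, n_folds=3):
    # growing-window split: each training set is everything before the test block
    indices = list(range(n_samples))
    test_size = n_samples // n_folds
    for start in range(test_size + n_samples % n_folds, n_samples, test_size):
        yield indices[:start], indices[start:start + test_size]

trains = [set(train) for train, _ in time_series_split(9, n_folds=4)]
# each training set contains every earlier one
assert all(trains[i] <= trains[i + 1] for i in range(len(trains) - 1))
```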

@jnothman (Member) commented Aug 14, 2016

I wonder whether there's any benefit in allowing the user to sub-sample from the training set so that its size is not variable. You could imagine not wanting to choose model parameters that maximise the mean score, but rather a mean weighted towards splits with a larger sample of training data: if the amount of training data has an effect, cross-validation scores may not be normally distributed. Allowing the option to sub-sample to a fixed training set size does not fix this, though. It may be worth leaving a note for the user that, due to learning curve effects, the mean over CV scores might not be as informative as otherwise...

@jnothman (Member) commented Aug 14, 2016

This needs some narrative documentation (i.e. content in cross_validation.rst), whose role is to guide the user through the kinds of problems involved in choosing a cross-validation strategy, and the available solutions.

@jnothman (Member) commented Aug 14, 2016

Test failures pending rebase on #7187, I hope.

@jnothman (Member) commented Aug 14, 2016

Ideally, this PR would also be supported by an example. Do we know of / have datasets where this would be appropriate?

(outdated diff) sklearn/model_selection/_split.py
class HomogeneousTimeSeriesCV(_BaseKFold):
"""Homogeneous Time Series cross-validator
Provides train/test indices to split time series data in train/test sets.

@jnothman (Member) commented Aug 14, 2016

Please clarify to the user that we assume samples from later times have higher indices (and thus shuffling in CV is inappropriate).

@yenchenlin (Contributor) commented Aug 15, 2016

done

@GaelVaroquaux (Member) commented Aug 14, 2016

Before this is merged, I think it is very important that a section or subsection on time series data be added to the model selection documentation. It doesn't have to be long, but it should point out the particularities of cross-validation on time series, and point to this object.

@yenchenlin (Contributor) commented Aug 15, 2016

I'm temporarily pushing commits that rebase on #7187 to make sure CI looks good.
I'll resolve all these commits after #7187 is merged.

@yenchenlin (Contributor) commented Aug 15, 2016

I have to admit that I'm not familiar with time-series data; the only time-series data I've played with is the private property of a startup.

I remember @MechCoder has some experience with this; would you please suggest a toy dataset?

@jnothman (Member) commented Aug 16, 2016

> I have to admit that I'm not familiar with time-series data, and the only time-series data I've played with is the private property of a startup.

You don't really need to be familiar with time-series data explicitly, as long as you understand why something like KFold is problematic and why this solution is better. "A note on shuffling" already introduces problems related to non-iid data. Similarly, describe the time series case where samples are not identically distributed; rather, the distribution changes as a function of the dataset ordering.
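The point about KFold can be made concrete with a small illustration (a sketch, not code from this PR): with contiguous k-fold splits on ordered data, every split except the last trains on samples that come after the test block, i.e. the model gets to see the "future".

```python
n_samples, n_folds = 6, 3
# contiguous KFold-style fold boundaries: [0, 2, 4, 6]
bounds = [i * n_samples // n_folds for i in range(n_folds + 1)]
for k in range(n_folds):
    test = list(range(bounds[k], bounds[k + 1]))
    train = [i for i in range(n_samples) if i not in test]
    # does any training sample come after the test block in time?
    trains_on_future = any(i > max(test) for i in train)
    print("test:", test, "train:", train, "sees future:", trains_on_future)
```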

@yenchenlin (Contributor) commented Aug 16, 2016

@jnothman I just pushed a new commit to explain why we need time-series CV and what it is; please feel free to give me your opinion.

(outdated diff) doc/modules/cross_validation.rst
Cross validation of time series data
====================================
Time series data is known for the correlation between observations, so

@jnothman (Member) commented Aug 17, 2016

"known for" -> "characterised by"

"between observations that are near in time (autocorrelation)"

(outdated diff) doc/modules/cross_validation.rst
====================================
Time series data is known for the correlation between observations, so
it is very important to evaluate our models on the "future" observations that

@jnothman (Member) commented Aug 17, 2016

"that are not" -> "least like those that are"

Move "so ..." to after the next sentence.

(outdated diff) doc/modules/cross_validation.rst
test it. Therefore, we need to adopt the special time series cross-validation
to avoid the problem, which is described as follows:
1. In each iteration :math:`i`, use first :mata:`i` fold(s) as training set,

@jnothman (Member) commented Aug 17, 2016

I don't think steps help. Just do "We first split the data into n_splits + 1 folds, then in each iteration ...". Drop points 2 (where we also need to fit, and consider other evaluation metrics) and 3.

@jnothman (Member) commented Aug 17, 2016

Thinking about autocorrelation, it might be useful to leave a gap between the last training instance and the first used for testing in each split. Should we add a train_test_gap param? In a future version, perhaps?
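A sketch of what such a train_test_gap might look like. Note that train_test_gap is a hypothetical parameter suggested above, not something this PR implements; the splitter here is a standalone reimplementation of the diff's logic that simply drops the last gap training samples before each test block:

```python
def time_series_split_with_gap(n_samples, n_folds, gap=0):
    # gap=0 reproduces the PR splitter's behaviour; gap>0 leaves a buffer
    # of unused samples between the training window and the test block
    indices = list(range(n_samples))
    test_size = n_samples // n_folds
    for start in range(test_size + n_samples % n_folds, n_samples, test_size):
        yield indices[:max(start - gap, 0)], indices[start:start + test_size]

for train, test in time_series_split_with_gap(8, n_folds=4, gap=1):
    print("train:", train, "test:", test)
```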

(outdated diff) doc/modules/cross_validation.rst
it is very important to evaluate our models on the "future" observations that
are not seen by our model before. However, the "classical" k-times
Time series data is characterised by the correlation between observations
that are near in time (autocorrelation), so it is very important to

@jnothman (Member) commented Aug 17, 2016

Use * to emphasise "autocorrelation" as a term.

@yenchenlin (Contributor) commented Aug 17, 2016

> Move "so ..." to after the next sentence.

Sorry, could you explain this more explicitly?

(outdated diff) doc/modules/cross_validation.rst
Time series data is characterised by the correlation between observations
that are near in time (autocorrelation), so it is very important to
evaluate our models on the "future" observations least like those that
are seen by our model before. However, the "classical" k-times

@jnothman (Member) commented Aug 17, 2016

A comment I made seems to have disappeared. I suggested "Classical cross-validation techniques assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time-series data." Again, I think this belongs before "so ...".

@yenchenlin (Contributor) commented Aug 17, 2016

Thanks for the opinion! I've changed the doc.

@yenchenlin (Contributor) commented Aug 24, 2016

Comments addressed.

(outdated diff) doc/modules/cross_validation.rst
returns first :math:`k` folds as train set and the :math:`(k+1)` th
fold as test set. Note that unlike standard cross-validation methods,
successive training sets are supersets of those that come before them.
Also, its foldsizes are not in accordance with :class:`KFold`,

@jnothman (Member) commented Aug 24, 2016

I think this is more detail than needed here. If you want to keep the sentence, you can make it more succinct.

(outdated diff) doc/modules/cross_validation.rst
This class can be used to cross-validate time series data samples
that are observed at fixed time intervals.
Example of 3-split time series cross-validation on a dataset with 4 samples::

@jnothman (Member) commented Aug 24, 2016

If you make it a dataset with 6 samples, the surplus behaviour becomes clearer.
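With 6 samples and 3 splits, the surplus samples all land in the first training set, which is the behaviour being referred to. A sketch using a standalone reimplementation of the diff's split logic (not the sklearn class itself):

```python
def time_series_split(n_samples, n_folds):
    # growing-window split; surplus samples go into the first training window
    indices = list(range(n_samples))
    test_size = n_samples // n_folds
    for start in range(test_size + n_samples % n_folds, n_samples, test_size):
        yield indices[:start], indices[start:start + test_size]

# n_folds=4 yields 3 splits over 6 samples
for train, test in time_series_split(6, n_folds=4):
    print("TRAIN:", train, "TEST:", test)
# TRAIN: [0, 1, 2] TEST: [3]
# TRAIN: [0, 1, 2, 3] TEST: [4]
# TRAIN: [0, 1, 2, 3, 4] TEST: [5]
```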

(outdated diff) sklearn/model_selection/_split.py
Notes
-----
The first fold has size

@jnothman (Member) commented Aug 24, 2016

"Folds" remains confusing. "The training set has size i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)" in the ith split, with a test set of size n_samples // (n_splits + 1).
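The suggested sizes can be sanity-checked numerically. A sketch, assuming the integer division is meant as i * (n_samples // (n_splits + 1)) with the remainder added on top (the unparenthesised reading gives different numbers when the division is inexact):

```python
n_samples, n_splits = 13, 4
test_size = n_samples // (n_splits + 1)  # size of each test set
for i in range(1, n_splits + 1):
    # training set size in the ith split, per the wording above
    train_size = i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)
    # the train window plus the remaining test blocks must cover all samples
    assert train_size + (n_splits + 1 - i) * test_size == n_samples
```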

(outdated diff) sklearn/model_selection/_split.py
n_samples))
indices = np.arange(n_samples)
test_size = (n_samples // n_folds)
test_starts = range(test_size + n_samples % n_folds, n_samples, test_size)

@jnothman (Member) commented Aug 24, 2016

PEP8 line length

@jnothman (Member) commented Aug 24, 2016

i.e. split after comma

(outdated diff) sklearn/model_selection/_split.py
test_size = (n_samples // n_folds)
test_starts = range(test_size + n_samples % n_folds, n_samples, test_size)
for test_start in test_starts:
yield indices[:test_start], indices[test_start:test_start + test_size]

@jnothman (Member) commented Aug 24, 2016

ditto

@jnothman (Member) commented Aug 24, 2016

Perhaps rename test_start to start

@jnothman (Member) commented Aug 24, 2016

I think we can merge after those minor changes, and a New Feature entry in whats_new.

We should also create an issue quoting @amueller's suggestion of a restructure of the docs. You are welcome to do this.

(outdated diff) sklearn/model_selection/_split.py
The first fold has size
``n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``,
and the other folds all have size ``n_samples // (n_splits + 1)``.
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``

@jnothman (Member) commented Aug 24, 2016

You've used a tab character here. And gone over 79 characters. You can break lines within double-backticks if necessary.

(outdated diff) sklearn/model_selection/_split.py
for test_start in test_starts:
yield indices[:test_start], indices[test_start:test_start + test_size]
yield (indices[:test_start], \

@jnothman (Member) commented Aug 24, 2016

no backslash needed here

@jnothman (Member) commented Aug 24, 2016

since it's inside parentheses
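For reference, a line break inside parentheses continues implicitly in Python, so the trailing backslash in the diff above is redundant. A generic sketch of the same pattern (illustrative values, not the diff itself):

```python
start, test_size = 3, 2
indices = list(range(7))
# inside parentheses, Python continues the line implicitly; no backslash needed
pair = (indices[:start],
        indices[start:start + test_size])
assert pair == ([0, 1, 2], [3, 4])
```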

(outdated diff) sklearn/model_selection/_split.py
@@ -675,8 +675,9 @@ class TimeSeriesCV(_BaseKFold):
Notes
-----
The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)``
in the ``i``th split, with a test set of size ``n_samples//(n_splits + 1)``,
The training set has size ``i * n_samples // (n_splits + 1)

@jnothman (Member) commented Aug 24, 2016

still a tab character

@jnothman (Member) commented Aug 24, 2016

@yenchenlin, can I suggest you install a PEP 8 checker? flake8 tends to work well.

@yenchenlin (Contributor) commented Aug 24, 2016

@jnothman I feel embarrassed; I do have one ...

@jnothman (Member) commented Aug 24, 2016

I had no intention to embarrass you. Here's a further hint:

~/repos/scikit-learn$ cat .git/hooks/pre-commit
#!/bin/sh

FILES=$(git diff --cached --name-status | grep -v ^D | awk '$1 $2 { print $2}' | grep -e .py$)
if [ -n "$FILES" ]; then
    pep8 -r $FILES
fi

Use git commit -n to bypass its warnings. (Though I'm sure that could be done without the greps, and I can't be bothered to change that now...)
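For what it's worth, the grep/awk pipeline above could likely be replaced by letting git do the filtering itself; a hypothetical grep-free variant of the hook (a sketch, not tested against all the original's edge cases):

```shell
#!/bin/sh
# --diff-filter=d excludes deleted files, and the pathspec keeps only *.py,
# so no grep/awk post-processing is needed.
FILES=$(git diff --cached --name-only --diff-filter=d -- '*.py')
if [ -n "$FILES" ]; then
    pep8 -r $FILES
fi
```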

@jnothman (Member) commented Aug 24, 2016

Hurrah! 🍹

@jnothman merged commit 234d256 into scikit-learn:master on Aug 24, 2016

0 of 3 checks passed at merge time:
ci/circleci: CircleCI is running your tests
continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: The Travis CI build is in progress
@jnothman (Member) commented Aug 24, 2016

Ugh, forgot to add to classes.rst.

@jnothman (Member) commented Aug 24, 2016

Now that this is merged, I've realised that it's a bit unconventional for us to have the suffix CV on a splitter, and it has an altogether different conventional meaning. Should this be TimeSeriesSplit or TimeSeriesFold? @amueller??

@amueller (Member) commented Aug 24, 2016

True. I'd say TimeSeriesSplit.

@amueller (Member) commented Aug 24, 2016

Wanna add to classes and do the rename?

@MechCoder (Member) commented Aug 24, 2016

Thanks a lot @yenchenlin. Yes, I agree that it should be TimeSeriesSplit.

@jnothman (Member) commented Aug 25, 2016

@yenchenlin, are you able to make a patch for these afterthoughts?

@yenchenlin (Contributor) commented Aug 25, 2016

@jnothman sure!

@raghavrv (Member) commented Aug 25, 2016

TimeSeriesSplitter, maybe?

@cbrummitt (Contributor) commented Nov 30, 2016

I'm not sure I understand what HeterogeneousTimeSeries means. It was mentioned here and in #6322. I think it means that the time stamps are not equally spaced.

I've implemented what I'm calling MultiTimeSeriesSplit for the case of multiple time series, with different time stamps present in different series (or, said differently, with missing data). I compute splits by calculating equally spaced quantiles of the distribution of time stamps.

Here's an illustration for a small dataset. There are four time series (four countries), and not all countries have data in every year (e.g., it's common for countries to be "born" in the middle of a dataset).

[image: illustration of the proposed splits]

Would this be of interest?

@yenchenlin what's the status of HeterogeneousTimeSeriesSplit (or whatever is the latest name)?

@jnothman (Member) commented Nov 30, 2016
@amueller (Member) commented Dec 1, 2016

@jnothman also check out the discussion here: #6322
