Documentation #23

Closed
mksamelson opened this issue Mar 19, 2021 · 3 comments

Comments

@mksamelson

The documentation and examples do not address how to split a data set into training and test sets.

When using one of the cross-validators, does the data set need to be sorted in time order? Is there a way to designate a datetime column so the class knows on what basis to split the data sequentially?

@WenjieZ
Owner

WenjieZ commented Mar 20, 2021

The data set is assumed to be in time order, though an explicit datetime column is not required. It should work on lists, arrays, and dataframes. If not, please report a bug.

The following example comes from my blog post:

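import numpy as np
from tscv import gap_train_test_split

# 10 time-ordered samples with 2 features each, and 10 matching targets
X, y = np.arange(20).reshape((10, 2)), np.arange(10)

# Test set of 2 samples, with a gap of 2 samples between train and test
X_train, X_test, y_train, y_test = gap_train_test_split(X, y, test_size=2, gap_size=2)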

In the example, X and y are numpy arrays. They don't carry datetime information, but they represent time-ordered data.
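Because the split depends only on row position, one way to handle data with a timestamp column is to sort by that column first and then split by position. The sketch below uses only the standard library. `gap_split_positions` is a hypothetical helper written for illustration, not part of tscv, and it assumes the test set sits at the end of the data with the gap directly before it.

```python
from datetime import datetime


def gap_split_positions(n_samples, test_size, gap_size):
    """Return (train, gap, test) index lists for a tail-end test set.

    Hypothetical helper for illustration only; not tscv's implementation.
    """
    train_end = n_samples - test_size - gap_size
    if train_end <= 0:
        raise ValueError("test_size + gap_size leaves no training samples")
    train = list(range(train_end))
    gap = list(range(train_end, train_end + gap_size))
    test = list(range(train_end + gap_size, n_samples))
    return train, gap, test


# Made-up (timestamp, value) records that arrive out of time order.
records = [
    (datetime(2021, 1, day), value)
    for day, value in [(3, 30), (1, 10), (5, 50), (2, 20), (4, 40),
                       (8, 80), (6, 60), (10, 100), (7, 70), (9, 90)]
]

# Sort by timestamp first, because the split only looks at position.
records.sort(key=lambda record: record[0])

train, gap, test = gap_split_positions(len(records), test_size=2, gap_size=2)
print("train:", train)
print("gap:  ", gap)
print("test: ", test)
```

With a pandas dataframe, the equivalent preparation step would be sorting on the datetime column before passing the data to the splitter.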

@mksamelson
Author

Thank you for clarifying. I eventually figured it out when I saw a reference in the code comments.

I recommend you make it clear in your documentation.

Also, do you have any recommendations or guidance on how to combine the number of splits, test size, and gap size?

I made some guesses when using it and in some cases got errors. I'm using GapWalkForward and, given that it ignores data after the test set, I'm trying to figure out exactly which samples end up in each fold so that I make the best use of the data.

Thank you so much for creating and publishing this package.
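To make the fold layout concrete, here is a minimal sketch that enumerates the train, gap, test, and unused ranges for a walk-forward scheme. It is a hypothetical layout written for illustration: it assumes test windows are laid out back to back at the end of the data and that each training set is everything before the gap. The actual GapWalkForward may place folds differently, so treat `walk_forward_layout` and its parameters as assumptions.

```python
def walk_forward_layout(n_samples, n_splits, test_size, gap_size):
    """Yield (train, gap, test, unused) as half-open (start, stop) ranges.

    Hypothetical layout for illustration; tscv's GapWalkForward may differ.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap_size <= 0:
        # This kind of combination is one plausible source of errors:
        # the first fold would have no training samples at all.
        raise ValueError("not enough samples for this combination")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        test_stop = test_start + test_size
        train_stop = test_start - gap_size
        yield (
            (0, train_stop),           # training range
            (train_stop, test_start),  # gap, dropped from both sets
            (test_start, test_stop),   # test range
            (test_stop, n_samples),    # data after the test set, unused
        )


folds = list(walk_forward_layout(n_samples=20, n_splits=3, test_size=4, gap_size=2))
for i, (train, gap, test, unused) in enumerate(folds):
    print(f"fold {i}: train={train} gap={gap} test={test} unused={unused}")
```

Printing the ranges this way shows at a glance how much data each fold trains on and how much is left unused after the test window.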

@WenjieZ
Owner

WenjieZ commented Mar 22, 2021

The documentation part is on the roadmap (see the v0.1.0 milestone).

> do you have any recommendations or guidance on how to combine the number of splits, test size, and gap size?

This topic is publication-worthy. No single fixed rule handles all cases, and many heuristics are available. I do have some professional insight on this issue, but it cannot be made clear within a couple of sentences. I can give you some quick and dirty advice though: try some small gap sizes and check whether your conclusion stays the same with and without the gaps (cf. stress test and scenario analysis).
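The quick check described above could be sketched roughly as follows: evaluate two candidate forecasters under several gap sizes and see whether the same one wins each time. Everything here is an assumption made for illustration, including the synthetic series, the two toy forecasters, and the `walk_forward_mae` helper; none of it comes from tscv.

```python
def walk_forward_mae(series, forecast, n_splits, test_size, gap_size):
    """Mean absolute error of `forecast`, pooled over walk-forward folds.

    Hypothetical helper: test windows sit back to back at the end of the
    series, and each training set is everything before the gap.
    """
    first_test_start = len(series) - n_splits * test_size
    errors = []
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = series[: test_start - gap_size]
        if not train:
            raise ValueError("gap_size leaves no training samples")
        prediction = forecast(train)  # one constant forecast per fold
        test = series[test_start : test_start + test_size]
        errors.extend(abs(y - prediction) for y in test)
    return sum(errors) / len(errors)


def last_value(train):
    """Forecast the most recent training observation."""
    return train[-1]


def train_mean(train):
    """Forecast the average of the training observations."""
    return sum(train) / len(train)


# Synthetic series: a slow upward trend plus a repeating wiggle.
series = [0.5 * t + (t % 3) for t in range(40)]

winners = {}
for gap_size in (0, 1, 2, 4):
    mae_last = walk_forward_mae(series, last_value, 3, 4, gap_size)
    mae_mean = walk_forward_mae(series, train_mean, 3, 4, gap_size)
    winners[gap_size] = "last_value" if mae_last < mae_mean else "train_mean"
    print(f"gap={gap_size}: last_value={mae_last:.2f} "
          f"train_mean={mae_mean:.2f} winner={winners[gap_size]}")
```

If the winner changes as the gap grows, the conclusion is sensitive to leakage between neighbouring samples and deserves a closer look.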
