Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] fix temporal_train_test_split for hierarchical and panel data in case where fh is not passed #5330

Merged
merged 46 commits into from Oct 7, 2023

Conversation

fkiraly
Copy link
Collaborator

@fkiraly fkiraly commented Sep 29, 2023

This PR fixes #5107 by replacing the inner workings of temporal_train_test_split with a splitter, making the design both DRY and reliant on splitter boilerplate.

This is done as follows:

  • addition of a new splitter TemporalTrainTestSplitter, which encapsulates the temporal_train_test_split logic
  • internally, TemporalTrainTestSplitter is a re-implementation of the temporal_train_test_split logic as a splitter class, giving access to the BaseSplitter boilerplate.
  • removed osuleaf example, as that was time series classification and got split across instances in the sample rather than time points. This would be inconsistent with the general splitter interface and the docstring specification.

The issue with the panel data is solved by recurring to the existing broadcasting boilerplate of BaseSplitter.

Depends on #5332 for testing.

@fkiraly fkiraly added enhancement Adding new functionality module:splitters&resamplers data splitters and resamplers labels Sep 29, 2023
@fkiraly
Copy link
Collaborator Author

fkiraly commented Sep 29, 2023

not sure why this fails - perhaps undocumented part of contract?

@hazrulakmal
Copy link
Collaborator

not sure why this fails - perhaps undocumented part of contract?

are you referring to the fh argument? I think the original author may not follow the convention of writing the docstring but he/she did specify the fh's expected input type.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Sep 30, 2023

are you referring to the fh argument?

no, to the failure that seems to arise from sth numpy

@fkiraly
Copy link
Collaborator Author

fkiraly commented Oct 1, 2023

works now - re-requested review @hazrulakmal

Copy link
Collaborator

@yarnabrina yarnabrina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked two clarification questions/doubts.

Also, may be it's a silly question, I couldn't spot a change that looked to be specifically for hierarchical/panel data. Can you please point me to the line that handles that part? Does this branch resolves both #4968 and #5107?

sktime/split/temporal_train_test_split.py Show resolved Hide resolved
sktime/split/temporal_train_test_split.py Outdated Show resolved Hide resolved
test_size=test_size,
train_size=train_size,
)
return y_train, y_test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably does not need an else block, since if block contains a return already.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the else block made the condition easier to spot for a reader of the code, because it matches the indentation and can be tracked up to the if condition.

If we remove the else, the if condition will be harder to find.

This is my rationale (in general) to do this even if you have if-return, but I'm not too strongly opined about this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, I did the opposite with if fh is not None:, because the else was very long.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a change to the code flow, and added comments - better? I went with your suggestion, but added comments to the fh and the X branching to make it clear to the reader

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a non-blocking comment really. Usually tools like pylint will complain against these (ref. R1705).

The change looks good to me. pylint or etc. may still complain that function return signature vary between different return statements (ref. R1710), but I don't think there's a way to solve that here at all.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Oct 2, 2023

Also, may be it's a silly question, I couldn't spot a change that looked to be specifically for hierarchical/panel data. Can you please point me to the line that handles that part?

Hm, we should also perhaps add tests for this?

The part that solves this imo is as I claimed in the PR preamble: "The issue with the panel data is solved by recurring to the existing broadcasting boilerplate of BaseSplitter."

In more detail, temporal_train_test_split now calls TemporalTrainTestSplitter's split_series, which is the boilerplate in BaseSplitter (not new, hence DRY, and takes care of the hierarchical case) that ends up calling _split. This includes a broadcasting loop for hierarchical/panel data, all in the boilerplate layer of BaseSplitter, hence solving the issue in #5107.

It does not resolve #4968, as this deals only with the case where fh is not passed, the code for the case where fh is passed remains unchanged.

Can you please point me to the line that handles that part?

To point you to the precise line, there are multiple of relevance, all in split.base._base_splitter.py:

  • split_series calls split, line 224; the surrounding conversions (lines 222, 227, 228) take care of every sktime series type being supported
  • in split, if y is panel or hierarchical, we enter _split_vectorized, in via line 125 and 127
  • the broadcasting happens in _split_vectorized, the key line is 170 which calls _split again for each slice, in our case that's where we return to TemporalTrainTestSplitter._split

@fkiraly fkiraly changed the title [BUG] fix temporal_train_test_split for hierarchical and panel pandas data [BUG] fix temporal_train_test_split for hierarchical and panel data Oct 2, 2023
@fkiraly fkiraly changed the title [BUG] fix temporal_train_test_split for hierarchical and panel data [BUG] fix temporal_train_test_split for hierarchical and panel data in case where fh is not passed Oct 2, 2023
yarnabrina
yarnabrina previously approved these changes Oct 2, 2023
@fkiraly fkiraly merged commit bbfbeed into main Oct 7, 2023
24 checks passed
@fkiraly fkiraly deleted the temporal_train_test_split-hierarchical branch October 7, 2023 09:00
yarnabrina added a commit to yarnabrina/sktime-fork that referenced this pull request Oct 8, 2023
* origin/main: (24 commits)
  [MNT] skpro as a soft dependency (sktime#5273)
  [DOC] Correct code block formatting for pre-commit install command (sktime#5377)
  [ENH] parallelization backend calls in utility module - part 2, backend parameter passing (sktime#5311)
  [MNT] [Dependabot](deps): Bump stefanzweifel/git-auto-commit-action from 4 to 5 (sktime#5373)
  [MNT] python 3.12 compatibility - replace `distutils` calls with equivalent functionality (sktime#5376)
  [DOC] `sktime` intro notebook (sktime#3793)
  [ENH] Skip unnecessary fit in `ForecastX` if inner `forecaster_y` ignores `X` (sktime#5353)
  [ENH] refactor parallelization backend calls in utility module (sktime#5268)
  [ENH] ARCH model (arch package) Issue sktime#2173 (sktime#5326)
  [ENH] add error message return to `deep_equals` assert in `test_reconstruct_identical`  (sktime#4927)
  [DOC] Added all feature names to docstring for DateTimeFeatures class (sktime#5283)
  [BUG] fix `STLForecaster` tag `ignores-exogenous-X` to be correctly set for composites (sktime#5365)
  [BUG] Fix inconsistent date/time index in `plot_windows` sktime#4919 (sktime#5321)
  &benshamza, fkiraly [BUG] Fix `numba` errors when calling `tslearn` `lcss` (sktime#5368)
  [MNT] Dataset file restructure (sktime#5239)
  [BUG] fix `temporal_train_test_split` for hierarchical and panel data in case where `fh` is not passed (sktime#5330)
  [MNT] [Dependabot](deps): Bump styfle/cancel-workflow-action from 0.11.0 to 0.12.0 (sktime#5355)
  [ENH] minor fixes to `NaiveAligner` (sktime#5344)
  [DOC] Improve Readability of Notebook 2 - Classification, Regression & Clustering (sktime#5312)
  [DOC] Documented ax argument and the figure in plot_series (sktime#5325)
  ...
yarnabrina added a commit to yarnabrina/sktime-fork that referenced this pull request Oct 8, 2023
…recasting

* origin/split-ci: (25 commits)
  added flags to codecov
  [MNT] skpro as a soft dependency (sktime#5273)
  [DOC] Correct code block formatting for pre-commit install command (sktime#5377)
  [ENH] parallelization backend calls in utility module - part 2, backend parameter passing (sktime#5311)
  [MNT] [Dependabot](deps): Bump stefanzweifel/git-auto-commit-action from 4 to 5 (sktime#5373)
  [MNT] python 3.12 compatibility - replace `distutils` calls with equivalent functionality (sktime#5376)
  [DOC] `sktime` intro notebook (sktime#3793)
  [ENH] Skip unnecessary fit in `ForecastX` if inner `forecaster_y` ignores `X` (sktime#5353)
  [ENH] refactor parallelization backend calls in utility module (sktime#5268)
  [ENH] ARCH model (arch package) Issue sktime#2173 (sktime#5326)
  [ENH] add error message return to `deep_equals` assert in `test_reconstruct_identical`  (sktime#4927)
  [DOC] Added all feature names to docstring for DateTimeFeatures class (sktime#5283)
  [BUG] fix `STLForecaster` tag `ignores-exogenous-X` to be correctly set for composites (sktime#5365)
  [BUG] Fix inconsistent date/time index in `plot_windows` sktime#4919 (sktime#5321)
  &benshamza, fkiraly [BUG] Fix `numba` errors when calling `tslearn` `lcss` (sktime#5368)
  [MNT] Dataset file restructure (sktime#5239)
  [BUG] fix `temporal_train_test_split` for hierarchical and panel data in case where `fh` is not passed (sktime#5330)
  [MNT] [Dependabot](deps): Bump styfle/cancel-workflow-action from 0.11.0 to 0.12.0 (sktime#5355)
  [ENH] minor fixes to `NaiveAligner` (sktime#5344)
  [DOC] Improve Readability of Notebook 2 - Classification, Regression & Clustering (sktime#5312)
  ...
yarnabrina added a commit to yarnabrina/sktime-fork that referenced this pull request Oct 8, 2023
* origin/split-ci: (25 commits)
  added flags to codecov
  [MNT] skpro as a soft dependency (sktime#5273)
  [DOC] Correct code block formatting for pre-commit install command (sktime#5377)
  [ENH] parallelization backend calls in utility module - part 2, backend parameter passing (sktime#5311)
  [MNT] [Dependabot](deps): Bump stefanzweifel/git-auto-commit-action from 4 to 5 (sktime#5373)
  [MNT] python 3.12 compatibility - replace `distutils` calls with equivalent functionality (sktime#5376)
  [DOC] `sktime` intro notebook (sktime#3793)
  [ENH] Skip unnecessary fit in `ForecastX` if inner `forecaster_y` ignores `X` (sktime#5353)
  [ENH] refactor parallelization backend calls in utility module (sktime#5268)
  [ENH] ARCH model (arch package) Issue sktime#2173 (sktime#5326)
  [ENH] add error message return to `deep_equals` assert in `test_reconstruct_identical`  (sktime#4927)
  [DOC] Added all feature names to docstring for DateTimeFeatures class (sktime#5283)
  [BUG] fix `STLForecaster` tag `ignores-exogenous-X` to be correctly set for composites (sktime#5365)
  [BUG] Fix inconsistent date/time index in `plot_windows` sktime#4919 (sktime#5321)
  &benshamza, fkiraly [BUG] Fix `numba` errors when calling `tslearn` `lcss` (sktime#5368)
  [MNT] Dataset file restructure (sktime#5239)
  [BUG] fix `temporal_train_test_split` for hierarchical and panel data in case where `fh` is not passed (sktime#5330)
  [MNT] [Dependabot](deps): Bump styfle/cancel-workflow-action from 0.11.0 to 0.12.0 (sktime#5355)
  [ENH] minor fixes to `NaiveAligner` (sktime#5344)
  [DOC] Improve Readability of Notebook 2 - Classification, Regression & Clustering (sktime#5312)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:splitters&resamplers data splitters and resamplers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] temporal_train_test_split does not work on panel datatypes with unequal length series.
3 participants