[BUG] fix BATS and TBATS _predict_interval interface #4492 (Merged)
New test file added in this PR:

```python
# -*- coding: utf-8 -*-
"""Tests for TBATS."""
# copyright: sktime developers, BSD-3-Clause License (see LICENSE file)

__author__ = ["fkiraly", "ngupta23"]

import numpy as np
import pandas as pd
import pytest

from sktime.forecasting.tbats import TBATS
from sktime.utils.validation._dependencies import _check_estimator_deps


@pytest.mark.skipif(
    not _check_estimator_deps(TBATS, severity="none"),
    reason="skip test if required soft dependency not available",
)
def test_tbats_long_fh():
    """Test TBATS with long fh, checks for failure condition in bug #4491."""
    np.random.seed(42)
    LEN_HISTORY = 50
    train = pd.Series(data=np.random.randint(1, 100, LEN_HISTORY))

    # train model
    estimator = TBATS(
        use_box_cox=False,
        use_trend=True,
        use_damped_trend=False,
        sp=10,
        use_arma_errors=False,
    )
    estimator.fit(train)

    # failure condition is fh being longer than training data
    long_fh = np.array(range(1, LEN_HISTORY + 2))

    fcst = estimator.predict_interval(coverage=0.8, fh=long_fh)
    assert len(fcst) == len(long_fh)
```

Do unit test inputs need to be reproducible? If not, removing this line will ensure a different sample in different runs.

hm, I suppose not, unless you try to reproduce hard-coded expected outcomes.
I just copy-pasted this from @ngupta23's example in #4491, and I wouldn't see a problem with having this line, or not having it.
Do you have a preference?

My 2 cents on this: I feel it is better for tests to be reproducible, so you can quickly distinguish actual failures from ones that are due to the data changing from run to run. If we need to test with multiple datasets (with different conditions), it is better to parameterize the data used for testing (but still be explicit with the dataset definition), as in the sketch below.
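
For illustration, explicit parametrization could look like this (the datasets and test name are made up for this sketch, and the soft-dependency skip from the actual test is omitted for brevity):

```python
import numpy as np
import pandas as pd
import pytest

from sktime.forecasting.tbats import TBATS

# hypothetical explicit datasets, written out so every run sees the same data
DATASETS = [
    pd.Series(np.arange(1.0, 51.0)),  # trending series
    pd.Series(np.tile([10.0, 20.0, 30.0, 40.0, 50.0], 10)),  # repeating pattern
]


@pytest.mark.parametrize("train", DATASETS)
def test_tbats_long_fh_parametrized(train):
    """Run the long-fh check from this PR against each explicit dataset."""
    estimator = TBATS(use_box_cox=False, use_trend=True, sp=10, use_arma_errors=False)
    estimator.fit(train)

    # same failure condition as in #4491: fh longer than the training data
    long_fh = np.arange(1, len(train) + 2)
    fcst = estimator.predict_interval(coverage=0.8, fh=long_fh)
    assert len(fcst) == len(long_fh)
```
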
yes, I suppose that's the main advantage.
The disadvantage being, if you have a sporadic failure - like the one with quantiles crossing on occasion - you'll never spot it.
(not arguing for either option, still on the fence)

I think unit tests should test with as general an example as possible. I like `hypothesis` a lot specifically for this: if my function takes a numeric input, I should be able to test it for any number (not necessarily all, to save resources), not just the 5 or 6 values I happen to choose at the time of writing, as those almost never get modified afterwards to test other values. Apart from a general dataset, I'd also want to test special cases (for example 0, if division is involved), and for those I'd prefer hardcoded inputs, but only because they are special.

As for this example, we need to test whether `tbats` supports predicting intervals or not, not whether it supports the specific data points that `seed=42` would generate. Our goal is to ensure that `tbats` works, not to pass tests and get a green tick in GHA. So, ideally, I would prefer to test `tbats` on any data, with a good amount of variation. But obviously that would take a good amount of time, so I'd agree to restrict to a very small number of examples, even 1. Still, I think we should at least test the estimator's generalisation to other datasets over multiple runs in different PRs.

Had you used some sort of "gold standard" dataset, it would make sense to me to fix the inputs. Since that is not the case, I'd prefer not to fix it.

(I sincerely apologise if this sounds too harsh or opinionated; I am just trying to share the point of view from which I suggested it. I understand there are other tests in `sktime` which also use `seed`, so it should be okay if this line is kept as well.)
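
For illustration, a minimal property-based test in the style described above could look like this (the function under test and its bounds are made up for demonstration, unrelated to `sktime`):

```python
from hypothesis import given
from hypothesis import strategies as st


def scale_to_percent(x):
    """Hypothetical function under test: convert a fraction to a percentage."""
    return x * 100


@given(st.floats(min_value=0, max_value=1))
def test_scale_to_percent_stays_in_range(x):
    # hypothesis draws many values of x in [0, 1] on each run,
    # instead of the handful a developer would hardcode
    assert 0 <= scale_to_percent(x) <= 100
```
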
Those are good arguments.
Quick question: if a test were to fail with `hypothesis` on GitHub CI, would you be able to reproduce it locally for finding the root cause and fixing it? If so, how would this be accomplished?

Excellent question! @yarnabrina, I'd also be interested in the answer!
You can of course always brute-force it and just sample 100 or 1000 times and break on exception, but probably there's a more elegant way than this.

found the relevant paragraph in the `hypothesis` docs: https://hypothesis.readthedocs.io/en/latest/reproducing.html
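
In short, that page covers a local example database that replays previously failing examples automatically, plus a `@reproduce_failure` decorator for pinning a failure found elsewhere, e.g. on CI. A sketch of the pattern, with a placeholder test (the version string and blob in the comment are made up):

```python
from hypothesis import given
from hypothesis import strategies as st

# When a test fails (and print_blob is enabled, as it is on CI),
# hypothesis prints the falsifying example plus a message like:
#
#   You can reproduce this example by temporarily adding
#   @reproduce_failure('6.75.3', b'AXicY2BgYAQAAAsAAQ==')
#   as a decorator on your test case
#
# Copying that decorator from the CI log onto the local test replays
# the exact same input for debugging.
@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)
```
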

@fkiraly While this is an option, you would still have to run the test(s) with that seed for a lot of iterations for it to fail once on a single case. If there were an actual systematic failure or something, that would help, but I interpreted @ngupta23's question as being about finding the particular choice of example that led to the failure. There's a simpler option for that, as all failing tests will show the corresponding inputs; the Quick start guide in the `hypothesis` docs shows an example of this.

Also, instead of just returning all failing examples or the first one, by default `hypothesis` finds the "minimal" failing example (subject to the defined shrinking strategy).

A few `hypothesis` examples (nothing to do with `sktime`, just for demonstration):
If you run this with `pytest test_hypothesis.py` in an environment with `hypothesis[pytest]` installed, the first 3 should pass and the last one should fail, and it should show that `sample_function_1` failed with `[]`; from that you can start debugging.

nice! `hypothesis` would be a very interesting thing to show in the community collab sessions (if you have time?), we could think about rebasing parts of the test framework on it!

They are also plugin-extensible, so we could think about how to make the fixture generation logic (which, framework-wise, sits in `skbase`) compatible with `hypothesis`.

Going slightly off-topic here. Might be worth a design issue!