[BUG] fix BATS and TBATS `_predict_interval` interface #4492
Conversation
def test_tbats_long_fh():
    """Test TBATS with long fh, checks for failure condition in bug #4491."""
    np.random.seed(42)
Do unit test inputs need to be reproducible? If not, removing this line will ensure a different sample in different runs.
My 2 cents on this - I feel it is better to be reproducible in tests, so you can quickly distinguish actual failures from ones that are due to the data changing from run to run. If we need to test with multiple datasets (with different conditions), it is better to parameterize the data used for testing (but still be explicit with the dataset definition).
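A minimal sketch of what that parameterisation could look like, with explicitly defined datasets; the dataset names, values, and the stand-in assertion below are made up for illustration, not existing sktime fixtures:

```python
import numpy as np
import pandas as pd
import pytest

# explicit, reproducible datasets: each one is fully defined rather than re-sampled per run
DATASETS = {
    "flat": pd.Series(np.ones(36)),
    "trend": pd.Series(np.arange(36, dtype=float)),
    "noisy": pd.Series(np.sin(np.arange(36)) + np.random.RandomState(0).normal(size=36)),
}


@pytest.mark.parametrize("dataset_name", sorted(DATASETS))
def test_tbats_long_fh(dataset_name):
    """Parametrised variant of the seeded test: same check, several explicit datasets."""
    y = DATASETS[dataset_name]
    assert not y.isna().any()  # stand-in for the actual fit / predict_interval check
```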
yes, I suppose that's the main advantage.
The disadvantage being, if you have a sporadic failure - like the one with quantiles crossing on occasion - you'll never spot it.
(not arguing for either option, still on the fence)
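For reference, the kind of assertion that would catch the sporadic quantile crossing mentioned above might look like the sketch below; the plain "lower"/"upper" column names are an assumption for illustration, not sktime's actual output schema:

```python
import pandas as pd


def assert_no_quantile_crossing(intervals: pd.DataFrame) -> None:
    """Fail if any lower bound exceeds its matching upper bound."""
    # assumes two plain columns named "lower" and "upper"
    assert (intervals["lower"] <= intervals["upper"]).all()
```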
I think unit tests should test with as general an example as possible. I like `hypothesis` a lot specifically for this: if my function takes a numeric input, I should be able to test it for any number (which does not necessarily mean all, to save resources), not just the 5 or 6 values I happen to choose at the time of writing, since those almost never get modified afterwards to test other values. Apart from a general dataset, I'd also want to test special cases (for example 0 if division is involved), and for those I'd prefer hardcoded inputs, but only because they are special.
As for this example, we need to test whether `tbats` supports predicting intervals or not, not whether it supports the specific data points that `seed=42` would generate. Our goal is to ensure that `tbats` works, not to pass tests and get a green tick in GHA. So, ideally I would prefer to test `tbats` on any data, and hopefully with a good amount of variation. But obviously that would take a good amount of time, so I'd agree to restrict to a very small number of examples, even 1. But I think we should at least test the estimator's generalisation to other datasets over multiple runs in different PRs.
Had we used some sort of "gold/standard dataset", fixing the inputs would make sense to me. Since that is not the case, I'd prefer not to fix them.
(I sincerely apologise if this sounds too harsh or opinionated. I am just trying to share the point of view from which I made the suggestion. I understand there are other tests in `sktime` which also use a seed, so it should be okay if this line is kept as well.)
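To make that concrete, a test along those lines could draw the seed (or the data itself) from a strategy instead of hardcoding it. The sketch below assumes sktime's `TBATS` forecaster and its `predict_interval` method; the constructor arguments, series length, and horizon are illustrative only, not the actual test from this PR:

```python
import hypothesis
import hypothesis.strategies
import numpy as np
import pandas as pd

from sktime.forecasting.tbats import TBATS


@hypothesis.settings(max_examples=5, deadline=None)  # TBATS fits are slow
@hypothesis.given(seed=hypothesis.strategies.integers(min_value=0, max_value=2**32 - 1))
def test_tbats_interval_on_generated_data(seed):
    """Prediction intervals should be produced for any generated series."""
    rng = np.random.default_rng(seed)
    y = pd.Series(rng.normal(loc=10.0, scale=1.0, size=48))
    forecaster = TBATS(use_box_cox=False, use_trend=False, use_arma_errors=False)
    forecaster.fit(y)
    # fh longer than the training data, as in the failure case from #4491
    intervals = forecaster.predict_interval(fh=np.arange(1, 73), coverage=0.9)
    assert intervals.notna().all().all()
```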
Those are good arguments.
Quick question: if a test were to fail with `hypothesis` on GitHub CI, would you be able to reproduce it locally for finding the RCA and fixing it? If so, how would this be accomplished?
> would you be able to reproduce it locally for finding the RCA and fixing it? If so, how would this be accomplished?

Excellent question! @yarnabrina, I'd also be interested in the answer!
You can of course always brute-force it and just sample 100 or 1000 times and break on exception, but probably there's a more elegant way than this.
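The brute-force version would be something like the following sketch; `run_check` is purely a placeholder for whatever fit/predict logic is being stressed:

```python
def find_failing_seed(run_check, n_tries=1000):
    """Re-run ``run_check(seed)`` on fresh seeds until one raises, then return that seed."""
    for seed in range(n_tries):
        try:
            run_check(seed)  # e.g. generate data from this seed and run the check
        except Exception:
            return seed  # reproduce locally by fixing this seed
    return None  # no failure found within the budget
```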
found the relevant paragraph in the docs of `hypothesis`: https://hypothesis.readthedocs.io/en/latest/reproducing.html

> When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure
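Per that page, the printed seed can be pinned with the `@seed` decorator to replay the same run locally; the numeric value below is a placeholder for whatever Hypothesis printed in the CI log:

```python
import hypothesis
import hypothesis.strategies


@hypothesis.seed(102938)  # placeholder: use the seed printed in the failing CI run
@hypothesis.given(x=hypothesis.strategies.integers())
def test_reproduced_failure(x):
    assert isinstance(x, int)
```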
> found the relevant paragraph in the docs of `hypothesis`: https://hypothesis.readthedocs.io/en/latest/reproducing.html
> When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure

@fkiraly While this is an option, you still have to run all the test(s) with that seed for a lot of iterations for it to fail once with a single case. If there is an actual system failure or something, that would help, but I interpreted @ngupta23's question as how to find the particular choice of example that led to the failure. There's a simpler option for that, as all failing tests will show the corresponding inputs. Here's an example from the Quick start guide:
Falsifying example: test_decode_inverts_encode(s='')
Also, instead of just returning all or the first failing examples, by default `hypothesis` finds the "minimal" example (subject to the defined shrinking strategy).
A few `hypothesis` examples (nothing to do with `sktime`, just for demonstration):
import collections.abc

import hypothesis
import hypothesis.strategies


def sample_function_1(input_list: list[int]) -> float:
    """Should fail if ``input_list`` is empty, or contains `0`."""
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


def sample_function_2(input_list: list[int]) -> float:
    """Should fail if ``input_list`` contains `0`."""
    # handle empty case
    if not input_list:
        return 0.0
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


def sample_function_3(input_list: list[int]) -> float:
    """Should not fail."""
    # handle empty case
    if not input_list:
        return 0.0
    # handle zero case
    if 0 in input_list:
        return float("inf")
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers(), min_size=1)
)
def test_sample_function_1(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_1`` fails."""
    hypothesis.assume(0 not in sample_list)
    return_value = sample_function_1(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_2(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_2`` fails."""
    hypothesis.assume(0 not in sample_list)
    return_value = sample_function_2(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_3(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_3`` fails."""
    return_value = sample_function_3(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()),
    sample_function=hypothesis.strategies.sampled_from(
        [sample_function_1, sample_function_2, sample_function_3]
    ),
)
def test_sample_function(
    sample_list: list[int], sample_function: collections.abc.Callable[[list[int]], float]
) -> None:
    """Useless test that always passes, unless ``sample_function`` fails."""
    return_value = sample_function(sample_list)
    assert isinstance(return_value, float)
If you run this with `pytest test_hypothesis.py` in an environment with `hypothesis[pytest]` installed, the first 3 tests should pass and the last one should fail, and it should show that `sample_function_1` failed with `[]`, from which you can start debugging.
E Falsifying example: test_sample_function(
E sample_list=[],
E sample_function=sample_function_1,
E )
nice! `hypothesis` would be a very interesting thing to show in the community collab sessions (if you have time?), we could think about rebasing parts of the test framework on it!
They are also plugin-extensible, so we could think about how to make the fixture generation logic (which framework-wise sits in `skbase`) compatible with `hypothesis`.
Going slightly off-topic here. Might be worth a design issue!
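As a very rough illustration of that idea (everything below is made up for demonstration; none of it is existing skbase/sktime API): fixture generation could be expressed as hypothesis strategies, e.g. a composite strategy that builds small univariate series:

```python
import hypothesis.strategies as st
import pandas as pd


@st.composite
def univariate_series(draw, min_len=10, max_len=50):
    """Toy fixture strategy: a univariate pd.Series with a RangeIndex."""
    length = draw(st.integers(min_value=min_len, max_value=max_len))
    values = draw(
        st.lists(
            st.floats(allow_nan=False, allow_infinity=False, width=32),
            min_size=length,
            max_size=length,
        )
    )
    return pd.Series(values, index=pd.RangeIndex(length))
```

A test would then take `@hypothesis.given(y=univariate_series())` and receive a freshly generated series for each example.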
Fixes #4491
The `_predict_interval` interface in `_TbatsAdapter` was a bit of a mess. There was also really strange code that triggered specifically whenever the `fh` was longer than the training data; I was unable to understand why that condition should be the same as "contains in-sample prediction intervals". This is the code:
It explains @ngupta23's issues in #4491, but I don't get why this was there in the first place.
Either way, I just completely rewrote the interface to prediction intervals from scratch.
The logic is simple: it produces `nan` for in-sample predictions.
Contains the failure case from #4491 as a test.
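A heavily simplified sketch of the described behaviour (not the actual sktime code; the helper name and column layout are assumptions for illustration): out-of-sample steps get the model's interval, in-sample steps stay `nan`:

```python
import numpy as np
import pandas as pd


def toy_interval_frame(fh_abs, cutoff, lower, upper, coverage=0.9, var_name="y"):
    """NaN intervals for in-sample indices, model intervals for out-of-sample ones.

    ``lower`` and ``upper`` are assumed to have one value per out-of-sample step.
    """
    columns = pd.MultiIndex.from_tuples(
        [(var_name, coverage, "lower"), (var_name, coverage, "upper")]
    )
    # all-NaN frame to start with; in-sample rows are simply never filled in
    frame = pd.DataFrame(index=pd.Index(fh_abs), columns=columns, dtype=float)
    out_of_sample = np.asarray(fh_abs) > cutoff
    frame.loc[out_of_sample, (var_name, coverage, "lower")] = np.asarray(lower)
    frame.loc[out_of_sample, (var_name, coverage, "upper")] = np.asarray(upper)
    return frame
```

For example, `toy_interval_frame([8, 9, 10, 11], cutoff=9, lower=[1.0, 2.0], upper=[3.0, 4.0])` fills rows 10 and 11 and leaves the in-sample rows 8 and 9 as `nan`.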