[BUG] fix BATS and TBATS `_predict_interval` interface #4492
Conversation
def test_tbats_long_fh():
    """Test TBATS with long fh, checks for failure condition in bug #4491."""
    np.random.seed(42)
Do unit test inputs need to be reproducible? If not, removing this line will ensure a different sample in different runs.
My 2 cents on this - I feel it is better to be reproducible in tests, so you can quickly distinguish actual failures from ones that are due to the data changing from run to run. If we need to test with multiple datasets (with different conditions), it is better to parameterize the data used for testing (but still be explicit with the dataset definition).
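A minimal sketch of what that parameterisation could look like, with explicitly defined datasets; the dataset names, values, and the stand-in assertion below are made up for illustration, not existing sktime fixtures:

```python
import numpy as np
import pandas as pd
import pytest

# explicit, reproducible datasets: each one is fully defined rather than re-sampled per run
DATASETS = {
    "flat": pd.Series(np.ones(36)),
    "trend": pd.Series(np.arange(36, dtype=float)),
    "noisy": pd.Series(np.sin(np.arange(36)) + np.random.RandomState(0).normal(size=36)),
}


@pytest.mark.parametrize("dataset_name", sorted(DATASETS))
def test_tbats_long_fh(dataset_name):
    """Parametrised variant of the seeded test: same check, several explicit datasets."""
    y = DATASETS[dataset_name]
    assert not y.isna().any()  # stand-in for the actual fit / predict_interval check
```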
yes, I suppose that's the main advantage.
The disadvantage being, if you have a sporadic failure - like the one with quantiles crossing on occasion - you'll never spot it.
(not arguing for either option, still on the fence)
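For reference, the kind of assertion that would catch the sporadic quantile crossing mentioned above might look like the sketch below; the plain "lower"/"upper" column names are an assumption for illustration, not sktime's actual output schema:

```python
import pandas as pd


def assert_no_quantile_crossing(intervals: pd.DataFrame) -> None:
    """Fail if any lower bound exceeds its matching upper bound."""
    # assumes two plain columns named "lower" and "upper"
    assert (intervals["lower"] <= intervals["upper"]).all()
```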
I think unit tests should test with as general an example as possible. I like `hypothesis` a lot specifically for this: if my function takes a numeric input, I should be able to test it for any number (which does not necessarily mean all, to save resources), not just the 5 or 6 values I happen to choose at the time of writing, since those almost never get modified afterwards to test other values. Apart from a general dataset, I'd also want to test special cases (for example 0 if division is involved), and for those I'd prefer hardcoded inputs, but only because they are special.
As for this example, we need to test whether `tbats` supports predicting intervals or not, not whether it supports the specific data points that `seed=42` would generate. Our goal is to ensure that `tbats` works, not to pass tests and get a green tick in GHA. So, ideally I would prefer to test `tbats` on any data, and hopefully with a good amount of variation. But obviously that would take a good amount of time, so I'd agree to restrict to a very small number of examples, even 1. But I think we should at least test the estimator's generalisation to other datasets over multiple runs in different PRs.
Had we used some sort of "gold/standard dataset", fixing the inputs would make sense to me. Since that is not the case, I'd prefer not to fix them.
(I sincerely apologise if this sounds too harsh or opinionated. I am just trying to share the point of view from which I made the suggestion. I understand there are other tests in `sktime` which also use a seed, so it should be okay if this line is kept as well.)
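To make that concrete, a test along those lines could draw the seed (or the data itself) from a strategy instead of hardcoding it. The sketch below assumes sktime's `TBATS` forecaster and its `predict_interval` method; the constructor arguments, series length, and horizon are illustrative only, not the actual test from this PR:

```python
import hypothesis
import hypothesis.strategies
import numpy as np
import pandas as pd

from sktime.forecasting.tbats import TBATS


@hypothesis.settings(max_examples=5, deadline=None)  # TBATS fits are slow
@hypothesis.given(seed=hypothesis.strategies.integers(min_value=0, max_value=2**32 - 1))
def test_tbats_interval_on_generated_data(seed):
    """Prediction intervals should be produced for any generated series."""
    rng = np.random.default_rng(seed)
    y = pd.Series(rng.normal(loc=10.0, scale=1.0, size=48))
    forecaster = TBATS(use_box_cox=False, use_trend=False, use_arma_errors=False)
    forecaster.fit(y)
    # fh longer than the training data, as in the failure case from #4491
    intervals = forecaster.predict_interval(fh=np.arange(1, 73), coverage=0.9)
    assert intervals.notna().all().all()
```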
Those are good arguments.
Quick question: if a test were to fail with `hypothesis` on GitHub CI, would you be able to reproduce it locally for finding the RCA and fixing it? If so, how would this be accomplished?
> would you be able to reproduce it locally for finding the RCA and fixing it? If so, how would this be accomplished?

Excellent question! @yarnabrina, I'd also be interested in the answer!
You can of course always brute-force it and just sample 100 or 1000 times and break on exception, but probably there's a more elegant way than this.
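The brute-force version would be something like the following sketch; `run_check` is purely a placeholder for whatever fit/predict logic is being stressed:

```python
def find_failing_seed(run_check, n_tries=1000):
    """Re-run ``run_check(seed)`` on fresh seeds until one raises, then return that seed."""
    for seed in range(n_tries):
        try:
            run_check(seed)  # e.g. generate data from this seed and run the check
        except Exception:
            return seed  # reproduce locally by fixing this seed
    return None  # no failure found within the budget
```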
found the relevant paragraph in the docs of `hypothesis`: https://hypothesis.readthedocs.io/en/latest/reproducing.html

> When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure
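Per that page, the printed seed can be pinned with the `@seed` decorator to replay the same run locally; the numeric value below is a placeholder for whatever Hypothesis printed in the CI log:

```python
import hypothesis
import hypothesis.strategies


@hypothesis.seed(102938)  # placeholder: use the seed printed in the failing CI run
@hypothesis.given(x=hypothesis.strategies.integers())
def test_reproduced_failure(x):
    assert isinstance(x, int)
```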
> found the relevant paragraph in the docs of `hypothesis`: https://hypothesis.readthedocs.io/en/latest/reproducing.html
> When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure

@fkiraly While this is an option, you still have to run all the test(s) with that seed for a lot of iterations for it to fail once with a single case. If there is an actual system failure or something, that would help, but I interpreted @ngupta23's question as how to find the particular choice of example that led to the failure. There's a simpler option for that, as all failing tests will show the corresponding inputs. Here's an example from the Quick start guide:
Falsifying example: test_decode_inverts_encode(s='')
Also, instead of just returning all or the first failing examples, by default `hypothesis` finds the "minimal" example (subject to the defined shrinking strategy).
A few `hypothesis` examples (nothing to do with `sktime`, just for demonstration):
import collections.abc

import hypothesis
import hypothesis.strategies


def sample_function_1(input_list: list[int]) -> float:
    """Should fail if ``input_list`` is empty, or contains `0`."""
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


def sample_function_2(input_list: list[int]) -> float:
    """Should fail if ``input_list`` contains `0`."""
    # handle empty case
    if not input_list:
        return 0.0
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


def sample_function_3(input_list: list[int]) -> float:
    """Should not fail."""
    # handle empty case
    if not input_list:
        return 0.0
    # handle zero case
    if 0 in input_list:
        return float("inf")
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)
    return input_length / minimum_absolute_value


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers(), min_size=1)
)
def test_sample_function_1(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_1`` fails."""
    hypothesis.assume(0 not in sample_list)
    return_value = sample_function_1(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_2(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_2`` fails."""
    hypothesis.assume(0 not in sample_list)
    return_value = sample_function_2(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_3(sample_list: list[int]) -> None:
    """Useless test that always passes, unless ``sample_function_3`` fails."""
    return_value = sample_function_3(sample_list)
    assert isinstance(return_value, float)


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()),
    sample_function=hypothesis.strategies.sampled_from(
        [sample_function_1, sample_function_2, sample_function_3]
    ),
)
def test_sample_function(
    sample_list: list[int], sample_function: collections.abc.Callable[[list[int]], float]
) -> None:
    """Useless test that always passes, unless ``sample_function`` fails."""
    return_value = sample_function(sample_list)
    assert isinstance(return_value, float)
If you run this with `pytest test_hypothesis.py` in an environment with `hypothesis[pytest]` installed, the first 3 tests should pass and the last one should fail, and it should show that `sample_function_1` failed with `[]`, from which you can start debugging.
E Falsifying example: test_sample_function(
E sample_list=[],
E sample_function=sample_function_1,
E )
nice! `hypothesis` would be a very interesting thing to show in the community collab sessions (if you have time?), we could think about rebasing parts of the test framework on it!
They are also plugin-extensible, so we could think about how to make the fixture generation logic (which framework-wise sits in `skbase`) compatible with `hypothesis`.
Going slightly off-topic here. Might be worth a design issue!
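As a very rough illustration of that idea (everything below is made up for demonstration; none of it is existing skbase/sktime API): fixture generation could be expressed as hypothesis strategies, e.g. a composite strategy that builds small univariate series:

```python
import hypothesis.strategies as st
import pandas as pd


@st.composite
def univariate_series(draw, min_len=10, max_len=50):
    """Toy fixture strategy: a univariate pd.Series with a RangeIndex."""
    length = draw(st.integers(min_value=min_len, max_value=max_len))
    values = draw(
        st.lists(
            st.floats(allow_nan=False, allow_infinity=False, width=32),
            min_size=length,
            max_size=length,
        )
    )
    return pd.Series(values, index=pd.RangeIndex(length))
```

A test would then take `@hypothesis.given(y=univariate_series())` and receive a freshly generated series for each example.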
Fixes #4491
The `_predict_interval` interface in `_TbatsAdapter` was a bit of a mess. There was also really strange code that triggered specifically whenever the `fh` was longer than the training data; I was unable to understand why that condition should be the same as "contains in-sample prediction intervals". This is the code:
It explains @ngupta23's issues in #4491, but I don't get why this was there in the first place.
Either way, I just completely rewrote the interface to prediction intervals from scratch.
The logic is simple: it produces `nan` for in-sample predictions.
Contains the failure case from #4491 as a test.
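A heavily simplified sketch of the described behaviour (not the actual sktime code; the helper name and column layout are assumptions for illustration): out-of-sample steps get the model's interval, in-sample steps stay `nan`:

```python
import numpy as np
import pandas as pd


def toy_interval_frame(fh_abs, cutoff, lower, upper, coverage=0.9, var_name="y"):
    """NaN intervals for in-sample indices, model intervals for out-of-sample ones.

    ``lower`` and ``upper`` are assumed to have one value per out-of-sample step.
    """
    columns = pd.MultiIndex.from_tuples(
        [(var_name, coverage, "lower"), (var_name, coverage, "upper")]
    )
    # all-NaN frame to start with; in-sample rows are simply never filled in
    frame = pd.DataFrame(index=pd.Index(fh_abs), columns=columns, dtype=float)
    out_of_sample = np.asarray(fh_abs) > cutoff
    frame.loc[out_of_sample, (var_name, coverage, "lower")] = np.asarray(lower)
    frame.loc[out_of_sample, (var_name, coverage, "upper")] = np.asarray(upper)
    return frame
```

For example, `toy_interval_frame([8, 9, 10, 11], cutoff=9, lower=[1.0, 2.0], upper=[3.0, 4.0])` fills rows 10 and 11 and leaves the in-sample rows 8 and 9 as `nan`.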