
[BUG] fix BATS and TBATS _predict_interval interface #4492

Merged

merged 5 commits into main from tbats-fix on Apr 22, 2023
Conversation

@fkiraly (Collaborator) commented Apr 18, 2023

Fixes #4491

The _predict_interval interface in _TbatsAdapter was a bit of a mess.

There was also some really strange code that triggered specifically whenever the fh was longer than the training data; I was unable to understand why that condition should be the same as "contains in-sample prediction intervals".

This is the code:

        # If pred_int contains in-sample prediction intervals
        if len(pred_int) > len(self._y):
            len_out = len(pred_int) - len(self._y)
            # Workaround for slicing with negative index
            pred_int["idx"] = [x for x in range(-len(self._y), len_out)]

It explains @ngupta23's issues in #4491, but I don't get why this was there in the first place.
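For illustration, with made-up lengths (not from the PR): a purely out-of-sample fh that merely happens to be longer than the training series also satisfies that check, so the in-sample workaround fires even though no in-sample intervals were requested.

    # hypothetical lengths, only to show when `len(pred_int) > len(self._y)` fires
    len_y = 24         # training series self._y has 24 points
    len_pred_int = 36  # fh = 1..36, entirely out-of-sample
    print(len_pred_int > len_y)  # True, yet pred_int contains no in-sample part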

Either way, I just completely rewrote the prediction interval interface from scratch.
The logic is simple: it produces NaN for in-sample predictions.
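A minimal sketch of that logic, with made-up function and argument names (not the code merged in this PR), assuming we already have out-of-sample bounds from the wrapped model and only need to pad the in-sample part with NaN:

    import numpy as np
    import pandas as pd

    def interval_with_nan_insample(fh_abs, cutoff, oos_lower, oos_upper):
        """Fill bounds for out-of-sample steps, NaN for in-sample steps.

        fh_abs : pd.Index of absolute horizon indices
        cutoff : last index of the training series
        oos_lower, oos_upper : bounds for the out-of-sample part of fh_abs
        """
        lower = pd.Series(np.nan, index=fh_abs, dtype=float)
        upper = pd.Series(np.nan, index=fh_abs, dtype=float)
        oos_mask = fh_abs > cutoff
        lower[oos_mask] = np.asarray(oos_lower, dtype=float)
        upper[oos_mask] = np.asarray(oos_upper, dtype=float)
        return pd.DataFrame({"lower": lower, "upper": upper})

    # e.g. cutoff 10, fh covering 8..13: steps 11-13 get bounds, 8-10 get NaN
    interval_with_nan_insample(
        pd.Index(range(8, 14)), 10, [1.0, 1.1, 1.2], [2.0, 2.1, 2.2]
    )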

Contains the failure case from #4491 as a test.

@fkiraly fkiraly added module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting bugfix Fixes a known bug or removes unintended behavior labels Apr 18, 2023
yarnabrina previously approved these changes Apr 18, 2023
)
def test_tbats_long_fh():
    """Test TBATS with long fh, checks for failure condition in bug #4491."""
    np.random.seed(42)
Collaborator
Do unit test inputs need to be reproducible? If not, removing this will ensure a different sample in different runs.

Collaborator Author

hm, I suppose not, unless you try to reproduce hard-coded expected outcomes.

I just copy-pasted this from @ngupta23's example in #4491, and I wouldn't see a problem with having this line, or not having it.

Do you have a preference?

Contributor

My 2 cents on this - I feel it is better to be reproducible in tests so you can quickly distinguish actual failures from ones that are due to the data changing from run to run. If we need to test with multiple datasets (with different conditions), it is better to parameterize the data used for testing (but still be explicit with the dataset definition).
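A minimal sketch of that suggestion using pytest.mark.parametrize, with toy datasets (names and data made up, not sktime fixtures):

    import numpy as np
    import pandas as pd
    import pytest

    # explicit, reproducible toy datasets standing in for "different conditions"
    @pytest.mark.parametrize(
        "y",
        [pd.Series(np.ones(24)), pd.Series(np.arange(24, dtype=float))],
        ids=["flat", "trend"],
    )
    def test_long_fh_on_fixed_data(y):
        # a real test would fit a forecaster on y and call predict_interval;
        # here we only check that the dataset is explicit and deterministic
        assert len(y) == 24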

Collaborator Author

@fkiraly fkiraly Apr 18, 2023

yes, I suppose that's the main advantage.
The disadvantage being, if you have a sporadic failure - like the one with quantiles crossing on occasion - you'll never spot it.

(not arguing for either option, still on the fence)

Collaborator

I think unit tests should test with as general an example as possible. I like hypothesis a lot specifically for this: if my function takes a numeric input, I should be able to test it for any number (which does not necessarily mean all, to save resources), not just the 5 or 6 values I happen to choose at the time of writing, as those almost never get modified afterwards to test other values. Apart from a general dataset, I'd also want to test special cases (for example 0 if division is involved), and for that I'd prefer hardcoded inputs, but only because they are special.

As for this example, we need to test whether tbats supports predicting intervals or not, not whether it supports them for the specific data points that seed=42 would generate. Our goal is to ensure that tbats works, not to pass tests and get a green tick in GHA. So, ideally I would prefer to test tbats on arbitrary data, and hopefully with a good amount of variation. But obviously that would take a good amount of time, so I'd agree to restrict to a very small number of examples, even 1. But I think we should at least test the estimator's generalisation to other datasets over multiple runs in different PRs.

Had you used some sort of "gold/standard dataset", it would make sense to me to fix the inputs. Since that is not the case, I'd prefer not to fix it.

(I sincerely apologise if this sounds too harsh or opinionated. I am just trying to share the point of view from which I made the suggestion. I understand there are other tests in sktime which also use a seed, so it should be okay if this line is kept as well.)

Contributor

Those are good arguments.

Quick question: If a test were to fail with hypothesis on GitHub CI, would you be able to reproduce it locally for finding RCA and fixing it? If so, how would this be accomplished?

Collaborator Author

would you be able to reproduce it locally for finding RCA and fixing it? If so, how would this be accomplished?

Excellent question! @yarnabrina, I'd also be interested in the answer!

You can of course always brute-force it and just sample 100 or 1000 times and break on exception, but probably there's a more elegant way than this.
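A rough sketch of that brute-force idea (placeholder names, with a toy check standing in for the real test body):

    import numpy as np

    def find_failing_seed(check, n_tries=1000):
        """Rerun check(seed) for many seeds; return the first seed that raises."""
        for seed in range(n_tries):
            try:
                check(seed)
            except Exception:
                return seed  # replay later by calling check(seed) again
        return None

    def toy_check(seed):
        rng = np.random.default_rng(seed)
        x = rng.normal(size=100)
        assert x.max() < 2.5  # sporadic property, fails only for some seeds

    failing_seed = find_failing_seed(toy_check)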

Collaborator Author

@fkiraly fkiraly Apr 19, 2023

found the relevant paragraph in the docs of hypothesis:
https://hypothesis.readthedocs.io/en/latest/reproducing.html

When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure
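For reference, a toy sketch of replaying such a seed with the @seed decorator described in those docs (the seed value below is just a placeholder):

    import hypothesis
    import hypothesis.strategies as st

    @hypothesis.seed(12345)  # pin the seed Hypothesis printed for the failing run
    @hypothesis.given(st.lists(st.integers()))
    def test_replay_with_seed(xs):
        # toy property standing in for the real assertion
        assert sum(xs) == sum(reversed(xs))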

Collaborator

found the relevant paragraph in the docs of hypothesis: https://hypothesis.readthedocs.io/en/latest/reproducing.html

When a test fails unexpectedly, usually due to a health check failure, Hypothesis will print out a seed that led to that failure

@fkiraly While this is an option, you still have to run the test(s) with that seed for a lot of iterations for it to fail once on a single case. If there is an actual system failure or something, that would help, but I interpreted @ngupta23's question as how to find the particular choice of example that led to the failure. There's a simpler option for that, as all failing tests will show the corresponding inputs. Here's an example from the Quick start guide:

Falsifying example: test_decode_inverts_encode(s='')

Also, instead of just returning all or the first failing example, by default hypothesis finds the "minimal" (subject to the defined shrinking strategy) example.

A few `hypothesis` examples (nothing to do with `sktime`, just for demonstration):
import collections.abc

import hypothesis
import hypothesis.strategies


def sample_function_1(input_list: list[int]) -> float:
    """Should fail if ``input_list`` is empty, or contains `0`."""
    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)

    return input_length / minimum_absolute_value


def sample_function_2(input_list: list[int]) -> float:
    """Should fail if contains `0`."""
    # handle empty case
    if not input_list:
        return 0.0

    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)

    return input_length / minimum_absolute_value


def sample_function_3(input_list: list[int]) -> float:
    """Should not fail."""
    # handle empty case
    if not input_list:
        return 0.0

    # handle zero case
    if 0 in input_list:
        return float("inf")

    input_length = len(input_list)
    minimum_absolute_value = min(abs(value) for value in input_list)

    return input_length / minimum_absolute_value


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers(), min_size=1)
)
def test_sample_function_1(sample_list: list[int]) -> None:
    """Useless test that always pass, unless ``sample_function_1`` fails."""
    hypothesis.assume(0 not in sample_list)

    return_value = sample_function_1(sample_list)

    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_2(sample_list: list[int]) -> None:
    """Useless test that always pass, unless ``sample_function_2`` fails."""
    hypothesis.assume(0 not in sample_list)

    return_value = sample_function_2(sample_list)

    assert isinstance(return_value, float)


@hypothesis.given(sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()))
def test_sample_function_3(sample_list: list[int]) -> None:
    """Useless test that always pass, unless ``sample_function_3`` fails."""
    return_value = sample_function_3(sample_list)

    assert isinstance(return_value, float)


@hypothesis.given(
    sample_list=hypothesis.strategies.lists(hypothesis.strategies.integers()),
    sample_function=hypothesis.strategies.sampled_from(
        [sample_function_1, sample_function_2, sample_function_3]
    ),
)
def test_sample_function(
    sample_list: list[int], sample_function: collections.abc.Callable[[list[int]], float]
) -> None:
    """Useless test that always pass, unless ``sample_function`` fails."""
    return_value = sample_function(sample_list)

    assert isinstance(return_value, float)

If you run this with pytest test_hypothesis.py in an environment with hypothesis and pytest installed, the first 3 should pass and the last one should fail, and it should show that sample_function_1 failed with [], from which you can start debugging.

E       Falsifying example: test_sample_function(
E           sample_list=[],
E           sample_function=sample_function_1,
E       )

Collaborator Author

@fkiraly fkiraly Apr 20, 2023

nice! hypothesis would be a very interesting thing to show in the community collab sessions (if you have time?), we could think about rebasing parts of the test framework on it!

They are also plugin extensible, so we could think about how to make the fixture generation logic (which framework-wise sits in skbase) compatible with hypothesis.

Going slightly off-topic here. Might be worth a design issue!

@fkiraly fkiraly merged commit edda79a into main Apr 22, 2023
22 checks passed
@fkiraly fkiraly deleted the tbats-fix branch April 22, 2023 16:07
fkiraly added a commit that referenced this pull request Apr 23, 2023
…4505)

With more test coverage on `predict_interval` from
#4470, `BATS` and `TBATS` fail some
in-sample cases.

These were not addressed in #4492
Labels
bugfix Fixes a known bug or removes unintended behavior module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting
Development

Successfully merging this pull request may close these issues.

[BUG] TBATS can not produce a forecast with intervals when FH > History Period
3 participants