
[BUG] test_predict_proba is expecting column name of pred_dist to be 0 always. #5349

Closed
Vasudeva-bit opened this issue Oct 3, 2023 · 8 comments · Fixed by #5384
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting
Projects

Comments

@Vasudeva-bit
Contributor

Describe the bug

All the tests in check_estimator expect column names to be preserved, while a single check, check_predict_proba (invoked inside test_predict_proba in sktime.forecasting.tests.test_all_forecasters.py), expects the column name of pred_dist, i.e. the output of predict_proba, to always be 0.

To Reproduce

location: here.

# existing code
if isinstance(y_train, pd.Series):
    assert (pred_cols == pd.Index([0])).all()
else:
    assert (pred_cols == y_train.columns).all()

# expected code
if isinstance(y_train, pd.Series):
    assert (pred_cols == y_train.name).all()
else:
    assert (pred_cols == y_train.columns).all()
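To make the mismatch concrete, here is a small self-contained sketch (the series name `passengers` is made up for illustration): a forecaster that preserves the training series name fails the existing check but passes the proposed one.

```python
import pandas as pd

# hypothetical univariate training series with a non-empty name
y_train = pd.Series([1.0, 2.0, 3.0], name="passengers")

# suppose predict_proba preserves the name as the output's single column
pred_cols = pd.Index([y_train.name])

# existing check: insists on the integer column 0, so it fails here
assert not (pred_cols == pd.Index([0])).all()

# proposed check: compares against the training series name, so it passes
assert (pred_cols == y_train.name).all()
```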

Expected behavior

The check mentioned above should be modified so that it compares the column names of pred_dist against those of y_train, like the rest of the tests.

Additional context

Any changes in this test will affect existing univariate probabilistic forecasters such as NaiveForecaster. Therefore, fixing this issue also requires corresponding changes in all univariate probabilistic forecasters.

@Vasudeva-bit Vasudeva-bit added the bug Something isn't working label Oct 3, 2023
@Vasudeva-bit Vasudeva-bit changed the title [BUG] test_predict_proba function in sktime.forecasting.tests.test_all_forecasters.py is expecting column name of pred_dist to be 0 always. [BUG] test_predict_proba is expecting column name of pred_dist to be 0 always. Oct 3, 2023
@yarnabrina
Collaborator

I think this is related to #4709 and #4780.

@Vasudeva-bit I think the issue is y_train may not always have a name attribute that is non-empty. Maybe something like this (not tested) will solve it?

if isinstance(y_train, pd.Series) and y_train.name:
    assert (pred_cols == y_train.name).all()
elif isinstance(y_train, pd.Series):  # automatically implies empty name
    assert (pred_cols == pd.Index([0])).all()
else:
    assert (pred_cols == y_train.columns).all()
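Wrapped as a runnable function (the name `check_pred_cols` is made up for this sketch; the real check lives in sktime's `test_all_forecasters.py`), the three branches above behave like this:

```python
import pandas as pd

def check_pred_cols(pred_cols, y_train):
    """Sketch of the proposed check: compare predicted columns to y_train."""
    if isinstance(y_train, pd.Series) and y_train.name:
        assert (pred_cols == y_train.name).all()
    elif isinstance(y_train, pd.Series):  # automatically implies empty name
        assert (pred_cols == pd.Index([0])).all()
    else:
        assert (pred_cols == y_train.columns).all()

# named series: the name should be carried over as the column label
check_pred_cols(pd.Index(["y"]), pd.Series([1.0], name="y"))
# unnamed series: the integer 0 is expected as the column label
check_pred_cols(pd.Index([0]), pd.Series([1.0]))
# DataFrame: columns should match exactly
check_pred_cols(pd.Index(["a", "b"]), pd.DataFrame({"a": [1.0], "b": [2.0]}))
```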

@fkiraly
Collaborator

fkiraly commented Oct 3, 2023

I think this is related to #4709 and #4780.

Hm, I think not as closely as one might suspect.

The two PRs and related design problems were restricted to interval and quantile forecasts.

I think it is more related to "should we use name as col or not if a conversion to a DataFrame-like type, from `pd.Series`, happens", like the journey of suffering in these PRs:
#3290
#3297
#3322
#3323
#4150
#4155
#4157

The root of evil is really that, depending on how you convert or coerce `pd.Series` to `pd.DataFrame`, one of two things might happen: the name appears as the single column name, or it gets purged and the integer 0 appears as the single column name instead. It is one of the idiosyncrasies of pandas, and the "how" matters.
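The idiosyncrasy described above can be reproduced in a few lines (plain pandas, no sktime involved):

```python
import pandas as pd

named = pd.Series([1, 2], name="y")
unnamed = pd.Series([1, 2])  # name is None

# a named series keeps its name as the single column label
assert list(named.to_frame().columns) == ["y"]

# an unnamed series gets the integer 0 as its column label instead
assert list(unnamed.to_frame().columns) == [0]

# the "how" matters: an explicit name in the conversion overrides both
assert list(unnamed.to_frame(name="y2").columns) == ["y2"]
```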

Unfortunately this also needs to be consistent in many places, conversions, back-conversions, etc. While the original problem - I hope - has been resolved, it seems to have reappeared with the skpro distribution output, which is strongly pd.DataFrame inspired, and has similar index and columns attributes.

@fkiraly
Collaborator

fkiraly commented Oct 3, 2023

Maybe something like this (not tested) will solve it?

Yes, @yarnabrina, I think your test is the correct expectation. But it seems to me that it would be violated by many current forecasters, as @Vasudeva-bit has observed by accident.

Also, in case it is not the current behaviour, we need to run it through a deprecation period to move it to the one that seems more logical - people might have adapted to it downstream, and we do not want to break their code.

@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label Oct 3, 2023
@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation Oct 3, 2023
@Vasudeva-bit
Contributor Author

Vasudeva-bit commented Oct 4, 2023

@Vasudeva-bit I think the issue is y_train may not always have a name attribute that is non-empty. Maybe something like this (not tested) will solve it?

A few tests expect name to be preserved even if it is None, here. That's why I didn't handle the empty case separately. But as suggested, handling an empty (None) name separately seems reasonable.

Could we return forecasts as pd.DataFrame always, irrespective of whether it is a univariate or multi-variate forecaster?

This way we can test for only one single type. And, while returning the predicted forecasts from _base.py, we could convert pd.DataFrame to pd.Series for univariate forecasters. When a single-column pd.DataFrame is converted to pd.Series, the name of the resulting pd.Series would be the same as the pd.DataFrame's single column label (the name attribute, not the columns attribute).
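For context, here is a minimal sketch of the DataFrame-to-Series direction (the column label `passengers` is illustrative): squeezing a single-column `pd.DataFrame` yields a `pd.Series` whose name is the column label, so the label survives a round trip.

```python
import pandas as pd

df = pd.DataFrame({"passengers": [1.0, 2.0]})

# squeeze the columns axis of a single-column frame to get a Series
s = df.squeeze(axis="columns")
assert isinstance(s, pd.Series)
assert s.name == "passengers"

# converting back restores the same column label
assert list(s.to_frame().columns) == ["passengers"]
```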

@fkiraly
Collaborator

fkiraly commented Oct 7, 2023

Could we return forecasts as pd.DataFrame always, irrespective of whether it is a univariate or multi-variate forecaster?

That is an interface decision we have already discussed a number of times, but then the question becomes, what do we do with other data containers, such as dask or np.ndarray - coerce to pd.DataFrame as well? If not, then only pd.Series gets special treatment here...

Not saying that it would be a bad choice, but it would be an exception from the "support multiple formats" that we currently have.

@Vasudeva-bit
Contributor Author

Oh, I get it. It's better not to treat them specially.

@fkiraly
Collaborator

fkiraly commented Oct 8, 2023

Having said that, the issue you highlighted originally seems to be an inconsistency we may like to fix - although it requires deprecation.
https://www.sktime.net/en/stable/developer_guide/deprecation.html

Could you perhaps open a PR which has only your originally proposed fix (changes to test and arch), so we can use this as a starting point?

I would use that to look into the problem (domino effects on other estimators) and build a deprecation strategy around it.

Vasudeva-bit added a commit to Vasudeva-bit/sktime that referenced this issue Oct 8, 2023
Vasudeva-bit added a commit to Vasudeva-bit/sktime that referenced this issue Oct 8, 2023
@Vasudeva-bit
Contributor Author

I opened a PR #5384 with the originally proposed fix.

Bugfixing automation moved this from Needs triage & validation to Fixed/resolved Oct 11, 2023
fkiraly pushed a commit that referenced this issue Oct 11, 2023
Fixes #5349

Changes an if (check) condition in the `_check_predict_proba` function of
`sktime/forecasting/tests/test_all_forecasters.py` to preserve the column
name in univariate forecaster predictions.

Upon fixing this issue, `ARCH` no longer requires its own `_predict_proba`,
as `BaseForecaster`'s `_predict_proba` is sufficient. Hence, that function
was removed.
Vasudeva-bit added a commit to Vasudeva-bit/sktime that referenced this issue Oct 12, 2023
fkiraly pushed a commit that referenced this issue Dec 13, 2023
Related to #5349 (issue), #5384 (PR).


> After debugging for a while,
> 
> 1. Firstly, the function `_check_predict_proba` only tests `y_train`
generated from `_make_data(n_columns=n_columns)`. If the forecaster is
univariate, the generated `y_train` always has the column name `'None'`
of type `str`. `_check_predict_proba` checks whether `_predict_proba`
changes `None` to `0` internally. Since every forecaster's
`_predict_proba` is doing this, all forecasters passed.
> 2. Actually `ARCH` should fail this test, because its `_predict_proba`
function is deleted in this PR. But `type(y_train.name)` is `str` (the
literal `'None'`), not `None`. Therefore the `None` case is skipped, and
the check `(pred_cols == y_train.name).all()` compares the string `'None'`
against the string `'None'`, so `ARCH` also passed.
> 
> I think it's my bad, there is no bug in `_check_predict_proba` as it is
not designed to check whether column names are preserved. Changes should
only be in `ARCH`'s `_predict_proba`, like below:
> 
> 1. If the column name is `None` (of type `None`/`str`), do
`pred_dist.columns = pd.Index([0])`.
> 2. Otherwise, just return the output of `super()._predict_proba()` as it
is, i.e., column names are preserved by default.

I think there was some ambiguity in my explanation for #5349 (issue) in
#5384 (PR), but what I meant was exactly the changes in this PR.
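The `None`-versus-`'None'` subtlety in the quoted debugging notes can be demonstrated directly (a plain pandas sketch, independent of sktime):

```python
import pandas as pd

truly_unnamed = pd.Series([1, 2], name=None)
string_named = pd.Series([1, 2], name="None")  # literal string "None"

# only the first has no name; the second carries the string "None"
assert truly_unnamed.name is None
assert string_named.name == "None"

# conversion to DataFrame therefore diverges: integer 0 vs the string "None"
assert list(truly_unnamed.to_frame().columns) == [0]
assert list(string_named.to_frame().columns) == ["None"]
```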