[ENH] set `ForecastX` missing data handling tag to `True` to properly cope with future unknown variables #4876

fkiraly · 2023-07-13T11:38:53Z

This PR sets the ForecastX missing data handling tag to True to properly cope with future unknown variables.

An edge case - or, arguably, a general case if none of the future variables are known - is showcased in this issue discussion:
#4848

More precisely, ForecastX should always allow for nans in future X, as these may be in places where there is genuinely no knowledge about the future. Alternatively, dummy values have to be provided which sounds odd.

FYI @davidgilbertson

fkiraly · 2023-07-13T22:31:15Z

quick merge for release - minimal change so not very risky

yarnabrina · 2023-07-14T04:14:29Z

@fkiraly Not opposing the merge (at least not yet), but asking for a clarification about the tag below.

Based on extension_templates/forecasting.py, this is what the tag means.

        # handles-missing-data = can estimator handle missing data?
        "handles-missing-data": False,
        # valid values: boolean True (yes), False (no)
        # if False, raises exception if y or X passed contain missing data (nans)

(If it's not the official tag reference for users or for developers, kindly share the proper reference.)

From the tag definition above, I interpret that if it's set as True, it should be able to handle missing value in either y or X, and not just X. Based on my understanding of ForecastX, that is not the case. Even if I ignore the y case leaving up to the forecaster of choice, it does need all values to be present for columns that are not covered by columns, assuming that is non-empty and does not cover all columns of X.

So, based on my interpretation, either addition of this tag gives a possibly wrong information to users, or the tag definition is not quite clear. Can you please confirm which one is it? Or, if I'm just completely off, please let me know.

fkiraly · 2023-07-14T17:20:01Z

@yarnabrina, happy to unmerge this - my quick merge was based on that it is functionally inconsequential in cases that work as present, while unlocking what I felt was a common use case, important enough to have it in the release quickly.

To reply to your points:

you are right as to that the extension template is currently the best reference for the tags.
- maybe that's not the best from a long-term end-state perspective, but at least it's easy to maintain and in a very findable place to extenders who are looking for the information (or aren't but we want to inform them)
- originally, the idea was to have the reference in the tag registry, registry._tags, but as tags are present there without context and can have different meaning depending on base class they live in, and also require "looking in a separate location" of an extender, this proved less useful and harder to maintain
the crux is that the tag specification is valid only for atomic (non-composite) estimators, and may become misleading in the case of composites.
- for composites, bug reports have repeatedly indicated that in case of ambiguity, tags should be set to the most permissive value (i.e., generating least, not most exceptions), which is a soft policy now (but nowhere written down)
- why: imagine users build some composite which, from the methodological perspective should work - such as the example in [ENH] prioritizing make_reduction related items from davidgilbertson use case #4848 which imo isn't that uncommon. However, the compositor blocks this by raising an exception based on tag/data checks.
- the deeper reason is that the tag inference would need to be based on properties of components, and the actual property may depend in very complex ways not just on the tags of the components, but also on parts of the logic which are not inspectable as tags

The answer to your "confirm which one is it":

very short: it is actually a third thing - tags for composites will in general be parameter conditional, so any constant pick will be incorrect in some cases. This change is about minimizing user frustration (and unexpected errors).
the tag definition is as you say - handling missing values in either y or X (although the documentation could be more precise on this).
however, the actual (formal, mathematical) property is usually conditional on parameters and components for composites like ForecastX (e.g., it can handle any missing values that the X-forecaster overwrites - somewhat concerningly, the "can handle" also applies to missing values in only some positions and not others) .
Therefore, assuming that it must be one or the other always, without dependence on parameters, is misleading, as it is not formally correct for some parameter settings.
so we need to think about how we deal with the conditional case. The "safe" setting - if we must pick a value - is usually the one which raises least exceptions.

yarnabrina · 2023-07-15T04:09:45Z

I understand your point of allowing as much as possible to the composite estimators in order to not avoid failures from composite logic and leave that to the actual non-composite estimators. It makes sense, and so there's no need to unmerge.

However, personally I'd like no tags at all for composites in that case, as you said it doesn't make sense. I do not have any idea as of now on how to achieve that based on current base classes, so let's keep it as is. If I can think of anything, will create a new design issue later for discussion.

fkiraly · 2023-07-15T12:44:38Z

However, personally I'd like no tags at all for composites in that case, as you said it doesn't make sense.

I think that's stronger than the statement I wanted to make. What I tried to convey: the properties the tags indicate may depend on parameters (but that does not make them nonsensical overall)

The current design tries to set the tags dynamically, based on the parameters, for objects, and leaves the tag at the most general setting for classes.

I do not have any idea as of now on how to achieve that based on current base classes, so let's keep it as is. If I can think of anything, will create a new design issue later for discussion.

Ok, would be appreciated - I think it's an interesting question.

Though that concludes the discussion for this PR at least as far as I understand.

Update _pipeline.py

2ddfc8f

fkiraly added module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting enhancement Adding new functionality labels Jul 13, 2023

fkiraly requested review from achieveordie, benHeid and yarnabrina as code owners July 13, 2023 11:38

fkiraly merged commit 00a9160 into main Jul 13, 2023
24 checks passed

fkiraly deleted the ForecastX-missingdata branch July 13, 2023 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] set `ForecastX` missing data handling tag to `True` to properly cope with future unknown variables #4876

[ENH] set `ForecastX` missing data handling tag to `True` to properly cope with future unknown variables #4876

fkiraly commented Jul 13, 2023

fkiraly commented Jul 13, 2023 •

edited

yarnabrina commented Jul 14, 2023

fkiraly commented Jul 14, 2023 •

edited

yarnabrina commented Jul 15, 2023

fkiraly commented Jul 15, 2023

[ENH] set ForecastX missing data handling tag to True to properly cope with future unknown variables #4876

[ENH] set ForecastX missing data handling tag to True to properly cope with future unknown variables #4876

Conversation

fkiraly commented Jul 13, 2023

fkiraly commented Jul 13, 2023 • edited

yarnabrina commented Jul 14, 2023

fkiraly commented Jul 14, 2023 • edited

yarnabrina commented Jul 15, 2023

fkiraly commented Jul 15, 2023

[ENH] set `ForecastX` missing data handling tag to `True` to properly cope with future unknown variables #4876

[ENH] set `ForecastX` missing data handling tag to `True` to properly cope with future unknown variables #4876

fkiraly commented Jul 13, 2023 •

edited

fkiraly commented Jul 14, 2023 •

edited