[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

hliebert · 2023-07-29T11:51:35Z

Describe the bug
Using IndexSubset and PandasTransformAdaptor(method="dropna") together in a pipeline fails in some orderings. PandasTransformAdaptor appears to be operating on the untransformed/not subset X.

To Reproduce
The example below fails. It works with the PandasTransformAdaptor commented out.

The example differences y, lags X, then sets y and X to the same index (effectively removing the missing values from X). Dropping nan values should not have any effect.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import (
    ForecastingPipeline,
    TransformedTargetForecaster,
    RecursiveTabularRegressionForecaster,
)
from sklearn.preprocessing import MinMaxScaler
from sktime.transformations.series.adapt import TabularToSeriesAdaptor, PandasTransformAdaptor
from sktime.transformations.series.boxcox import LogTransformer
from sktime.transformations.series.difference import Differencer
from sktime.transformations.series.subset import IndexSubset
from sktime.transformations.series.lag import Lag

y, X = load_longley()
# X.loc["1955", "GNPDEFL"] = np.nan
# X.loc["1947", "POP"] = np.nan

horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)

pipe = TransformedTargetForecaster(
    [
        LogTransformer(),
        Differencer(lags=1, na_handling="drop_na"),
        ForecastingPipeline(
            [
                Lag(lags=[0, 1], index_out="original"),
                IndexSubset(index_treatment="remove"),
                PandasTransformAdaptor(method="dropna", kwargs={"axis": "columns", "how": "any"}),
                TabularToSeriesAdaptor(MinMaxScaler()),
                RecursiveTabularRegressionForecaster(RandomForestRegressor(), window_length=2),
            ],
        ),
    ]
)
pipe.fit(y, X, fh=horizon)
pipe.predict(fh=horizon, X=X)

pipe._X
pipe.forecaster_.forecaster_._X

Expected behavior
Drop only features in X that are actually missing at this step in the pipeline.

Versions

System: python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] executable: /usr/bin/python3.10 machine: Linux-5.19.0-50-generic-x86_64-with-glibc2.35 Python dependencies: pip: 22.0.2 sktime: 0.21.0 sklearn: 1.1.2 skbase: 0.4.6 numpy: 1.22.4 scipy: 1.8.1 pandas: 1.4.4 matplotlib: 3.5.3 joblib: 1.1.0 statsmodels: 0.13.2 numba: 0.56.0 pmdarima: 2.0.1 tsfresh: 0.19.0 tensorflow: None tensorflow_probability: None

The text was updated successfully, but these errors were encountered:

fkiraly · 2023-07-29T20:59:55Z

I see - the problem is that the PandasTransformAdaptor drops exactly the columns with NAs that it sees at the moment, and these can differ between fit and predict.

So, in a sense, PandasTransformAdaptor does what it's told to do, but the implied behaviour is problematic and does not match the user expectation.

Question, what should the end state be here? Not sure if PandasTransformAdaptor can or should be changed.

If it should be changed, how? If not, then what should be the way to get what you want?

Assuming that what you want is dropping exactly the rows that have at least NA in the fit data, in both fit and predict?

Should this be a DropNA transformer perhaps? Which can be applied to rows and columns, and for columns remembers the drops from fit?

fkiraly · 2023-07-29T21:00:45Z

(re status I'm not sure yet whether this is a bug?)

hliebert · 2023-07-30T11:16:22Z

I see - the problem is that the PandasTransformAdaptor drops exactly the columns with NAs that it sees at the moment, and these can differ between fit and predict.

But in the example, the X in fit and predict are the same? And the example below (not relying on IndexSubset) works.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastingPipeline, RecursiveTabularRegressionForecaster
from sklearn.preprocessing import MinMaxScaler
from sktime.transformations.series.adapt import TabularToSeriesAdaptor, PandasTransformAdaptor

y, X = load_longley()
X.loc["1955", "GNPDEFL"] = np.nan
X.loc["1947", "POP"] = np.nan

horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)

pipe = ForecastingPipeline(
    [
        PandasTransformAdaptor(method="dropna", kwargs={"axis": "columns", "how": "any"}),
        TabularToSeriesAdaptor(MinMaxScaler()),
        RecursiveTabularRegressionForecaster(RandomForestRegressor(), window_length=2),
    ],
)

pipe.fit(y, X, fh=horizon)
pipe.predict(fh=horizon, X=X)

pipe._X
pipe.forecaster_._X

Assuming that what you want is dropping exactly the rows that have at least NA in the fit data, in both fit and predict?

Yes, that's what I had in mind (assuming you meant columns, not rows).

Should this be a DropNA transformer perhaps? Which can be applied to rows and columns, and for columns remembers the drops from fit

Perhaps - having this functionality would be nice (not a bug but a feature request then.)

hliebert · 2023-08-04T11:24:40Z

If it is not overly complex and somewhat beginner friendly, I'd be happy to start working on a DropNA transformer. I'd probably need some guidance on where to begin though.

fkiraly · 2023-08-04T14:19:26Z

Ah, thanks! That might be a useful addition!

The guide for adding estimators is here: https://www.sktime.net/en/stable/developer_guide/add_estimators.html

In your case, I would start with the extension template extension_templates/transformer_supersimple, since the "drop NA" logic should be a few-liner on top of the simplest template:

in _fit, drop the columns and remember which columns were dropped (write to self.dropped_cols_ or similar)
in _transform, remember the dropped columns and drop them
throughout, care should be taken not to mutate the inputs by accidents
not sure whether we need any parameters for this. What would be useful is to define an axis perhaps? Since some users may like to use it to drop rows, not columns. In that case, nothing needs to be remembered.

#### Reference Issues/PRs Fixes #4977. #### What does this implement/fix? Explain your changes. Implements a `DropNA` transformer that saves the dropped index/column names from `fit`. #### What should a reviewer concentrate their feedback on? Any feedback is appreciated. This is mostly a thin wrapper around `pd.DataFrame.dropna`, accepting the arguments `axis`, `how`, and `thresh`, defaulting to `axis=0` and `how=any`. With univariate series, only `axis=0` will work. I am validating the inputs to some degree and I've extended the `thresh` argument to also accept a fraction of non-NA values (pandas only accepts counts). I've always found the pandas default of specifying the threshold in terms of non-NA values strange (as opposed to specifying NA values) , but I've left it as is to be consistent. Thoughts on this? #### Did you add any tests for the change? I've added a couple of tests for the functionality and to make sure the parameter validation works. I haven't added any tests with a univariate series so far.

hliebert added the bug Something isn't working label Jul 29, 2023

hliebert changed the title ~~[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna")~~ [BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails Jul 29, 2023

fkiraly added the module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing label Jul 29, 2023

fkiraly added this to Needs triage & validation in Bugfixing via automation Jul 29, 2023

fkiraly moved this from Needs triage & validation to Reproduced/confirmed in Bugfixing Jul 29, 2023

hliebert mentioned this issue Aug 7, 2023

[ENH] Add DropNA transformer #5049

Merged

2 tasks

fkiraly closed this as completed in #5049 Aug 12, 2023

Bugfixing automation moved this from Reproduced/confirmed to Fixed/resolved Aug 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

hliebert commented Jul 29, 2023 •

edited

Loading

fkiraly commented Jul 29, 2023 •

edited

Loading

fkiraly commented Jul 29, 2023

hliebert commented Jul 30, 2023 •

edited

Loading

hliebert commented Aug 4, 2023

fkiraly commented Aug 4, 2023 •

edited

Loading

[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

Comments

hliebert commented Jul 29, 2023 • edited Loading

fkiraly commented Jul 29, 2023 • edited Loading

fkiraly commented Jul 29, 2023

hliebert commented Jul 30, 2023 • edited Loading

hliebert commented Aug 4, 2023

fkiraly commented Aug 4, 2023 • edited Loading

hliebert commented Jul 29, 2023 •

edited

Loading

fkiraly commented Jul 29, 2023 •

edited

Loading

hliebert commented Jul 30, 2023 •

edited

Loading

fkiraly commented Aug 4, 2023 •

edited

Loading