Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977

Closed
hliebert opened this issue Jul 29, 2023 · 5 comments · Fixed by #5049
Labels
bug Something isn't working module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing
Projects

Comments

@hliebert
Copy link
Contributor

hliebert commented Jul 29, 2023

Describe the bug
Using IndexSubset and PandasTransformAdaptor(method="dropna") together in a pipeline fails in some orderings. PandasTransformAdaptor appears to be operating on the untransformed/not subset X.

To Reproduce
The example below fails. It works with the PandasTransformAdaptor commented out.

The example differences y, lags X, then sets y and X to the same index (effectively removing the missing values from X). Dropping nan values should not have any effect.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import (
    ForecastingPipeline,
    TransformedTargetForecaster,
    RecursiveTabularRegressionForecaster,
)
from sklearn.preprocessing import MinMaxScaler
from sktime.transformations.series.adapt import TabularToSeriesAdaptor, PandasTransformAdaptor
from sktime.transformations.series.boxcox import LogTransformer
from sktime.transformations.series.difference import Differencer
from sktime.transformations.series.subset import IndexSubset
from sktime.transformations.series.lag import Lag

y, X = load_longley()
# X.loc["1955", "GNPDEFL"] = np.nan
# X.loc["1947", "POP"] = np.nan

horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)

pipe = TransformedTargetForecaster(
    [
        LogTransformer(),
        Differencer(lags=1, na_handling="drop_na"),
        ForecastingPipeline(
            [
                Lag(lags=[0, 1], index_out="original"),
                IndexSubset(index_treatment="remove"),
                PandasTransformAdaptor(method="dropna", kwargs={"axis": "columns", "how": "any"}),
                TabularToSeriesAdaptor(MinMaxScaler()),
                RecursiveTabularRegressionForecaster(RandomForestRegressor(), window_length=2),
            ],
        ),
    ]
)
pipe.fit(y, X, fh=horizon)
pipe.predict(fh=horizon, X=X)

pipe._X
pipe.forecaster_.forecaster_._X

Expected behavior
Drop only features in X that are actually missing at this step in the pipeline.

Versions

System: python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] executable: /usr/bin/python3.10 machine: Linux-5.19.0-50-generic-x86_64-with-glibc2.35 Python dependencies: pip: 22.0.2 sktime: 0.21.0 sklearn: 1.1.2 skbase: 0.4.6 numpy: 1.22.4 scipy: 1.8.1 pandas: 1.4.4 matplotlib: 3.5.3 joblib: 1.1.0 statsmodels: 0.13.2 numba: 0.56.0 pmdarima: 2.0.1 tsfresh: 0.19.0 tensorflow: None tensorflow_probability: None
@hliebert hliebert added the bug Something isn't working label Jul 29, 2023
@hliebert hliebert changed the title [BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") [BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails Jul 29, 2023
@fkiraly fkiraly added the module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing label Jul 29, 2023
@fkiraly
Copy link
Collaborator

fkiraly commented Jul 29, 2023

I see - the problem is that the PandasTransformAdaptor drops exactly the columns with NAs that it sees at the moment, and these can differ between fit and predict.

So, in a sense, PandasTransformAdaptor does what it's told to do, but the implied behaviour is problematic and does not match the user expectation.

Question, what should the end state be here? Not sure if PandasTransformAdaptor can or should be changed.

If it should be changed, how? If not, then what should be the way to get what you want?

Assuming that what you want is dropping exactly the rows that have at least NA in the fit data, in both fit and predict?

Should this be a DropNA transformer perhaps? Which can be applied to rows and columns, and for columns remembers the drops from fit?

@fkiraly fkiraly added this to Needs triage & validation in Bugfixing via automation Jul 29, 2023
@fkiraly
Copy link
Collaborator

fkiraly commented Jul 29, 2023

(re status I'm not sure yet whether this is a bug?)

@fkiraly fkiraly moved this from Needs triage & validation to Reproduced/confirmed in Bugfixing Jul 29, 2023
@hliebert
Copy link
Contributor Author

hliebert commented Jul 30, 2023

I see - the problem is that the PandasTransformAdaptor drops exactly the columns with NAs that it sees at the moment, and these can differ between fit and predict.

But in the example, the X in fit and predict are the same? And the example below (not relying on IndexSubset) works.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastingPipeline, RecursiveTabularRegressionForecaster
from sklearn.preprocessing import MinMaxScaler
from sktime.transformations.series.adapt import TabularToSeriesAdaptor, PandasTransformAdaptor

y, X = load_longley()
X.loc["1955", "GNPDEFL"] = np.nan
X.loc["1947", "POP"] = np.nan

horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)

pipe = ForecastingPipeline(
    [
        PandasTransformAdaptor(method="dropna", kwargs={"axis": "columns", "how": "any"}),
        TabularToSeriesAdaptor(MinMaxScaler()),
        RecursiveTabularRegressionForecaster(RandomForestRegressor(), window_length=2),
    ],
)

pipe.fit(y, X, fh=horizon)
pipe.predict(fh=horizon, X=X)

pipe._X
pipe.forecaster_._X

Assuming that what you want is dropping exactly the rows that have at least NA in the fit data, in both fit and predict?

Yes, that's what I had in mind (assuming you meant columns, not rows).

Should this be a DropNA transformer perhaps? Which can be applied to rows and columns, and for columns remembers the drops from fit

Perhaps - having this functionality would be nice (not a bug but a feature request then.)

@hliebert
Copy link
Contributor Author

hliebert commented Aug 4, 2023

If it is not overly complex and somewhat beginner friendly, I'd be happy to start working on a DropNA transformer. I'd probably need some guidance on where to begin though.

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 4, 2023

Ah, thanks! That might be a useful addition!

The guide for adding estimators is here: https://www.sktime.net/en/stable/developer_guide/add_estimators.html

In your case, I would start with the extension template extension_templates/transformer_supersimple, since the "drop NA" logic should be a few-liner on top of the simplest template:

  • in _fit, drop the columns and remember which columns were dropped (write to self.dropped_cols_ or similar)
  • in _transform, remember the dropped columns and drop them
  • throughout, care should be taken not to mutate the inputs by accidents
  • not sure whether we need any parameters for this. What would be useful is to define an axis perhaps? Since some users may like to use it to drop rows, not columns. In that case, nothing needs to be remembered.

Bugfixing automation moved this from Reproduced/confirmed to Fixed/resolved Aug 12, 2023
fkiraly pushed a commit that referenced this issue Aug 12, 2023
#### Reference Issues/PRs
Fixes #4977.

#### What does this implement/fix? Explain your changes.
Implements a `DropNA` transformer that saves the dropped index/column
names from `fit`.

#### What should a reviewer concentrate their feedback on?
Any feedback is appreciated. This is mostly a thin wrapper around
`pd.DataFrame.dropna`, accepting the arguments `axis`, `how`, and
`thresh`, defaulting to `axis=0` and `how=any`. With univariate series,
only `axis=0` will work.

I am validating the inputs to some degree and I've extended the `thresh`
argument to also accept a fraction of non-NA values (pandas only accepts
counts). I've always found the pandas default of specifying the
threshold in terms of non-NA values strange (as opposed to specifying NA
values) , but I've left it as is to be consistent. Thoughts on this?

#### Did you add any tests for the change?
I've added a couple of tests for the functionality and to make sure the
parameter validation works. I haven't added any tests with a univariate
series so far.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing
Projects
Bugfixing
Fixed/resolved
Development

Successfully merging a pull request may close this issue.

2 participants