[BUG] Using IndexSubset together with PandasTransformAdaptor(method="dropna") in a pipeline fails #4977
Comments
I see - the problem is that the So, in a sense, Question, what should the end state be here? Not sure if If it should be changed, how? If not, then what should be the way to get what you want? Assuming that what you want is dropping exactly the rows that have at least NA in the Should this be a
(Re status: I'm not sure yet whether this is a bug.)
But in the example, the X in

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sktime.datasets import load_longley
    from sktime.forecasting.base import ForecastingHorizon
    from sktime.forecasting.compose import ForecastingPipeline, RecursiveTabularRegressionForecaster
    from sklearn.preprocessing import MinMaxScaler
    from sktime.transformations.series.adapt import TabularToSeriesAdaptor, PandasTransformAdaptor

    y, X = load_longley()
    X.loc["1955", "GNPDEFL"] = np.nan
    X.loc["1947", "POP"] = np.nan
    horizon = ForecastingHorizon(np.arange(1, 4), is_relative=True)
    pipe = ForecastingPipeline(
        [
            PandasTransformAdaptor(method="dropna", kwargs={"axis": "columns", "how": "any"}),
            TabularToSeriesAdaptor(MinMaxScaler()),
            RecursiveTabularRegressionForecaster(RandomForestRegressor(), window_length=2),
        ],
    )
    pipe.fit(y, X, fh=horizon)
    pipe.predict(fh=horizon, X=X)
    pipe._X
    pipe.forecaster_._X
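
For context, here is a small pandas-only sketch of why a plain `dropna` can be awkward inside a pipeline: it keeps no state, so the columns it removes depend entirely on the frame it is handed. The two frames below are made up for illustration and are not the Longley data; whether this is exactly the behaviour behind this issue is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up frames for illustration only (not the Longley data).
X_train = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0], "c": [7.0, 8.0, np.nan]}
)
X_new = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 5.0], "c": [7.0, 8.0]})

# dropna has no memory between calls: each call inspects only its own input.
train_cols = list(X_train.dropna(axis="columns", how="any").columns)
new_cols = list(X_new.dropna(axis="columns", how="any").columns)

print(train_cols)  # ['b']
print(new_cols)    # ['a', 'c']
```

The same call yields disjoint column sets for the two frames, which is the kind of mismatch a downstream estimator cannot cope with.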
Yes, that's what I had in mind (assuming you meant columns, not rows).
Perhaps - having this functionality would be nice (not a bug but a feature request then).
If it is not overly complex and somewhat beginner friendly, I'd be happy to start working on a
Ah, thanks! That might be a useful addition! The guide for adding estimators is here: https://www.sktime.net/en/stable/developer_guide/add_estimators.html In your case, I would start with the extension template
#### Reference Issues/PRs

Fixes #4977.

#### What does this implement/fix? Explain your changes.

Implements a `DropNA` transformer that saves the dropped index/column names from `fit`.

#### What should a reviewer concentrate their feedback on?

Any feedback is appreciated. This is mostly a thin wrapper around `pd.DataFrame.dropna`, accepting the arguments `axis`, `how`, and `thresh`, defaulting to `axis=0` and `how="any"`. With univariate series, only `axis=0` will work. I am validating the inputs to some degree, and I've extended the `thresh` argument to also accept a fraction of non-NA values (pandas only accepts counts). I've always found the pandas default of specifying the threshold in terms of non-NA values strange (as opposed to specifying NA values), but I've left it as is to be consistent. Thoughts on this?

#### Did you add any tests for the change?

I've added a couple of tests for the functionality and to make sure the parameter validation works. I haven't added any tests with a univariate series so far.
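
To make the idea in the PR description concrete, here is a rough pandas-only sketch of a transformer that learns which labels to drop in `fit` and reuses them in `transform`. It is not the sktime implementation: the class name `DropNASketch`, the `dropped_` attribute, and the choice to round a fractional `thresh` up with `math.ceil` are all assumptions made for illustration.

```python
import math

import numpy as np
import pandas as pd


class DropNASketch:
    """Hypothetical stand-in for the idea in the PR; not sktime's DropNA."""

    def __init__(self, axis=0, how="any", thresh=None):
        if axis not in (0, 1, "index", "columns"):
            raise ValueError(f"unsupported axis: {axis!r}")
        if how not in ("any", "all"):
            raise ValueError(f"unsupported how: {how!r}")
        if isinstance(thresh, float) and not 0 < thresh <= 1:
            raise ValueError("a fractional thresh must lie in (0, 1]")
        self.axis, self.how, self.thresh = axis, how, thresh
        self._drops_rows = axis in (0, "index")

    def _dropna_kwargs(self, X):
        # pandas rejects `how` and `thresh` together, so pass only one.
        if self.thresh is None:
            return {"axis": self.axis, "how": self.how}
        thresh = self.thresh
        if isinstance(thresh, float):
            # Assumption: a fraction of non-NA values is rounded up to a count.
            # Rows are judged across columns, columns across rows.
            total = X.shape[1] if self._drops_rows else X.shape[0]
            thresh = math.ceil(thresh * total)
        return {"axis": self.axis, "thresh": thresh}

    def fit(self, X):
        kept = X.dropna(**self._dropna_kwargs(X))
        if self._drops_rows:
            self.dropped_ = X.index.difference(kept.index)
        else:
            self.dropped_ = X.columns.difference(kept.columns)
        return self

    def transform(self, X):
        # Drop the labels remembered from fit, whatever NaNs X itself holds.
        if self._drops_rows:
            return X.drop(index=self.dropped_, errors="ignore")
        return X.drop(columns=self.dropped_, errors="ignore")


# Made-up data for illustration.
X_train = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [4.0, 5.0, 6.0, 7.0],
        "c": [np.nan, np.nan, np.nan, 1.0],
    }
)
X_new = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 5.0], "c": [7.0, 8.0]})

t = DropNASketch(axis="columns", how="any").fit(X_train)
dropped = list(t.dropped_)
out_cols = list(t.transform(X_new).columns)
print(dropped, out_cols)  # ['a', 'c'] ['b']

# thresh=0.5 over 4 rows: a column needs at least 2 non-NA values to survive.
frac_dropped = list(DropNASketch(axis="columns", thresh=0.5).fit(X_train).dropped_)
print(frac_dropped)  # ['c']
```

Note that `transform` leaves the NaN in column `b` of `X_new` untouched: in this sketch only the labels seen in `fit` are removed, which keeps the output columns stable between fitting and prediction.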
Describe the bug

Using `IndexSubset` and `PandasTransformAdaptor(method="dropna")` together in a pipeline fails in some orderings. `PandasTransformAdaptor` appears to be operating on the untransformed/not subset `X`.

To Reproduce

The example below fails. It works with the `PandasTransformAdaptor` commented out.

The example differences `y`, lags `X`, then sets `y` and `X` to the same index (effectively removing the missing values from `X`). Dropping `nan` values should not have any effect.

Expected behavior

Drop only features in `X` that are actually missing at this step in the pipeline.

Versions