[BUG] Imputer bugfix for issue #6224 #6253

Merged
merged 11 commits into from Apr 10, 2024
4 changes: 3 additions & 1 deletion .all-contributorsrc
@@ -2342,7 +2342,9 @@
       "profile": "https://github.com/Ram0nB",
       "contributions": [
         "doc",
-        "code"
+        "code",
+        "bug",
+        "test"
       ]
     },
     {
26 changes: 16 additions & 10 deletions sktime/transformations/series/impute.py
@@ -23,7 +23,9 @@ class Imputer(BaseTransformer):
     Parameters
     ----------
     method : str, default="drift"
-        Method to fill the missing values.
+        Method to fill the missing values. Not all methods can extrapolate, so after

Collaborator

makes sense. Do we know which methods are impacted? If only a few, it makes sense to describe in the method bullet point, like for "linear".

Contributor Author

  • "linear" can't extrapolate
  • "ffill"/"pad" can't extrapolate backward
  • "backfill"/"bfill" can't extrapolate forward
  • Methods that are fitted on the data seen in fit ("drift", "mean", "median" and "random") can't impute an instance in transform that was not seen in fit.

Since more than a few methods are affected, I think it makes sense to leave the docstring as is. Let me know if you have other suggestions.
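
To illustrate the first three bullet points, here is a minimal sketch in plain pandas (not the sktime `Imputer` itself; the series `s` is made-up example data). It shows that a forward fill leaves a leading gap, a backward fill leaves a trailing gap, and chaining both closes the two ends:

```python
import numpy as np
import pandas as pd

# hypothetical series with a leading, an interior, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# forward fill has nothing to carry into the first position
forward = s.ffill()

# backward fill has nothing to carry into the last position
backward = s.bfill()

# applying ffill and then bfill closes both ends
both = s.ffill().bfill()

print(forward.isna().tolist())   # [True, False, False, False, False]
print(backward.isna().tolist())  # [False, False, False, False, True]
print(both.tolist())             # [1.0, 1.0, 1.0, 3.0, 3.0]
```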

Collaborator

ok, thanks for the explanation.

Btw, I thought mean, median, random should be ok? This would not depend on a method, since that is not used?

Contributor Author

I wrote that "drift" and "random" are also affected, but when testing I noticed that this is not the case. Sorry about this. "mean" and "median" are affected because those methods don't use sktime's vectorization. When sktime's vectorization is used and an instance not seen in fit is transformed, the following error is raised:

RuntimeError: Imputer is a transformer that applies per individual time series, and broadcasts across instances. In fit, Imputer makes one fit per instance, and applies that fit to the instance with the same index in transform. Vanilla use therefore requires the same number of instances in fit and transform, butfound different number of instances in transform than in fit. number of instances seen in fit: 2; number of instances seen in transform: 3. For fit/transforming per instance, e.g., for pre-processinng in a time series classification, regression or clustering pipeline, wrap this transformer in FitInTransform, from sktime.transformations.compose.

This error is however not raised when transforming an instance not seen in fit with "mean" and "median". As a result, the missing values in instances not seen in fit are not imputed with "mean" or "median" (there are no mean/median values for those instances), but rather with "ffill" then "bfill", since those are applied after every method.
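
For panel data, the "ffill" then "bfill" fallback mentioned above only avoids leakage if it runs per instance. A minimal sketch in plain pandas, assuming a two-level index where all but the last level identify the instance (the frame `X` and its level names are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical panel: two instances ("a", "b") with three time points each
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["instance", "time"]
)
X = pd.DataFrame({"y": [1.0, 2.0, np.nan, np.nan, 5.0, 6.0]}, index=index)

# all index levels except the last (time) level identify an instance
levels = list(range(X.index.nlevels - 1))

# filling without grouping carries the last value of "a" into "b"
leaky = X.ffill().bfill()

# filling per instance keeps every instance self-contained
per_instance = X.groupby(level=levels).ffill()
per_instance = per_instance.groupby(level=levels).bfill()

print(leaky["y"].tolist())         # [1.0, 2.0, 2.0, 2.0, 5.0, 6.0]
print(per_instance["y"].tolist())  # [1.0, 2.0, 2.0, 5.0, 5.0, 6.0]
```

In the ungrouped result, the first value of instance "b" is 2.0, which comes from instance "a"; in the grouped result it is 5.0, taken from "b" itself.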

+        ``method`` is applied the remaining missing values are filled with ``ffill``
+        then ``bfill``.

         * "drift" : drift/trend values by sktime.PolynomialTrendForecaster(degree=1)
           first, X in transform() is filled with ffill then bfill
@@ -231,22 +233,26 @@ def _transform(self, X, y=None):
         elif self.method == "constant":
             return X.fillna(value=self.value)
         elif isinstance(index, pd.MultiIndex):
-            X_grouped = X.groupby(level=list(range(index.nlevels - 1)))
+            X_group_levels = list(range(index.nlevels - 1))

             if self.method in ["backfill", "bfill"]:
-                X = X_grouped.bfill()
-                # fill trailing NAs of panel instances with reverse method
-                return X.ffill()
+                X = X.groupby(level=X_group_levels).bfill()
             elif self.method in ["pad", "ffill"]:
-                X = X_grouped.ffill()
-                # fill leading NAs of panel instances with reverse method
-                return X.bfill()
+                X = X.groupby(level=X_group_levels).ffill()
             elif self.method == "mean":
-                return X_grouped.fillna(value=self._mean)
+                X = X.groupby(level=X_group_levels).fillna(value=self._mean)
             elif self.method == "median":
-                return X_grouped.fillna(value=self._median)
+                X = X.groupby(level=X_group_levels).fillna(value=self._median)
             else:
                 raise AssertionError("Code should not be reached")
+
+            # fill first/last elements of series,
+            # as some methods can't impute those
+            X = X.groupby(level=X_group_levels).ffill()
+            X = X.groupby(level=X_group_levels).bfill()
+
+            return X
+
         else:
             if self.method in ["backfill", "bfill"]:
                 X = X.bfill()
40 changes: 40 additions & 0 deletions sktime/transformations/series/tests/test_imputer.py
@@ -8,7 +8,9 @@
 import numpy as np
 import pytest

+from sktime.datatypes import get_examples
 from sktime.forecasting.naive import NaiveForecaster
+from sktime.transformations.compose import TransformByLevel
 from sktime.transformations.series.impute import Imputer
 from sktime.utils._testing.forecasting import make_forecasting_problem
 from sktime.utils._testing.hierarchical import _make_hierarchical
@@ -58,6 +60,44 @@ def test_imputer(method, Z, value, forecaster):
     assert not y_hat.isnull().to_numpy().any()


+@pytest.mark.parametrize(
+    "method",
+    [
+        "linear",
+        "nearest",
+        "mean",
+        "median",
+        "backfill",
+        "pad",
+    ],
+)
+def test_impute_multiindex(method):
+    """Test for data leakage in case of pd-multiindex data.
+
+    Failure case in bug #6224
+    """
+
+    df = get_examples(mtype="pd-multiindex")[0]

Collaborator

These lines have a side effect on the fixture: they change the example itself, because iloc writes mutate the object in place.

This is why all the weird failures occur: the example is no longer as expected in checks.
We should probably make the function safer.

For now, could you make a copy or deepcopy of df?

Collaborator

(you could also use this PR: #6259)
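
The copy suggested above can be sketched as follows. This uses a stand-in DataFrame named `fixture` rather than the actual `get_examples` return value, so the names and data are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# stand-in for a shared example object that other tests also rely on
fixture = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0]})

# DataFrame.copy() makes a deep copy by default, so writes to df stay local
df = fixture.copy()
df.iloc[:2, :] = np.nan  # in-place write, but only on the copy

print(fixture["y"].isna().any())  # False
print(df["y"].isna().tolist())    # [True, True, False, False]
```

Without the `copy()` call, the `iloc` assignment would write into the shared object and every later consumer of the example would see the injected missing values.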

+    df.iloc[:3, :] = np.nan  # instance 0 entirely missing
+    df.iloc[3:4, :] = np.nan  # instance 1 first timepoint missing
+    df.iloc[8:, :] = np.nan  # instance 2 last timepoint missing
+
+    imp = Imputer(method=method)
+    df_imp = imp.fit_transform(df)
+
+    # instance 0 entirely missing, so it should remain missing
+    assert np.array_equal(df.iloc[:3, :], df_imp.iloc[:3, :], equal_nan=True)
+
+    # instance 1 and 2 should not have any missing values
+    assert not df_imp.iloc[3:, :].isna().any().any()
+
+    # test consistency between applying the imputer to every instance separately,
+    # vs applying them to the panel
+    imp_tbl = TransformByLevel(Imputer(method=method))
+    df_imp_tbl = imp_tbl.fit_transform(df)
+    assert np.array_equal(df_imp, df_imp_tbl, equal_nan=True)
+
+
 def test_imputer_forecaster_y():
     """Test that forecaster imputer works with y.
