New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Imputer bugfix for issue #6224 #6253
Changes from 8 commits
17a6fd5
27f96f9
25be572
7bd70fb
92832b0
c00cc46
4d97e57
f2f0978
de5436e
0aa3c95
7e824f7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,9 @@ | |
import numpy as np | ||
import pytest | ||
|
||
from sktime.datatypes import get_examples | ||
from sktime.forecasting.naive import NaiveForecaster | ||
from sktime.transformations.compose import TransformByLevel | ||
from sktime.transformations.series.impute import Imputer | ||
from sktime.utils._testing.forecasting import make_forecasting_problem | ||
from sktime.utils._testing.hierarchical import _make_hierarchical | ||
|
@@ -58,6 +60,44 @@ def test_imputer(method, Z, value, forecaster): | |
assert not y_hat.isnull().to_numpy().any() | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"method", | ||
[ | ||
"linear", | ||
"nearest", | ||
"mean", | ||
"median", | ||
"backfill", | ||
"pad", | ||
], | ||
) | ||
def test_impute_multiindex(method): | ||
"""Test for data leakage in case of pd-multiindex data. | ||
|
||
Failure case in bug #6224 | ||
""" | ||
|
||
df = get_examples(mtype="pd-multiindex")[0] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. these lines have a side effect on the fixture, i.e., it changes the example itself by mutating the object - For this reason, all the weird failures occur, since the example is no longer as expected in checks. For now, could you make a copy or deepcopy of df? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (you could also use this PR: #6259) |
||
df.iloc[:3, :] = np.nan # instance 0 entirely missing | ||
df.iloc[3:4, :] = np.nan # instance 1 first timepoint missing | ||
df.iloc[8:, :] = np.nan # instance 2 last timepoint missing | ||
|
||
imp = Imputer(method=method) | ||
df_imp = imp.fit_transform(df) | ||
|
||
# instance 0 entirely missing, so it should remain missing | ||
assert np.array_equal(df.iloc[:3, :], df_imp.iloc[:3, :], equal_nan=True) | ||
|
||
# instance 1 and 2 should not have any missing values | ||
assert not df_imp.iloc[3:, :].isna().any().any() | ||
|
||
# test consistency between applying the imputer to every instance separately, | ||
# vs applying them to the panel | ||
imp_tbl = TransformByLevel(Imputer(method=method)) | ||
df_imp_tbl = imp_tbl.fit_transform(df) | ||
assert np.array_equal(df_imp, df_imp_tbl, equal_nan=True) | ||
|
||
|
||
def test_imputer_forecaster_y(): | ||
"""Test that forecaster imputer works with y. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Do we know which methods are impacted? If only a few, it makes sense to describe in the method bullet point, like for
"linear"
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since more than a few, I think it makes sense to leave the docstring as is. Let me know if you have other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks for the explanation.
Btw, I thought
mean
,median
,random
should be ok? This would not depend on amethod
, since that is not used?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote that "drift" and "random" are also affected, but when testing I noticed that this is not the case. Sorry about this. The reason that "mean" and "median" are affected is that those methods don't use Sktime's vectorization. When using Sktime's vectorization and transforming an instance not seen in fit, the following error is raised:
This error is however not raised when transforming an instance not seen in fit with "mean" and "median". This causes the missing values in those instances not seen in fit not to be imputed with mean" and "median" (there are no mean/median values for those instances), but rather with "ffill" then "bfill" since those are applied after every method.