[BUG] Imputer bugfix for issue #6224 #6253

Ram0nB · 2024-04-02T14:40:05Z

Reference Issues/PRs

Fixes #6224

What does this implement/fix? Explain your changes.

For the Imputer:

- Add to docstring that after every method ffill then bfill will be used
- In case of pd-multiindex also always ffill then bfill to keep behaviour consistent with single index case
- In case of pd-multiindex never impute on non-grouped data

For the Imputer tests:

- Add test for bug [BUG] Imputer imputes too many pd-multiindex missing values #6224
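
To illustrate why grouping matters for pd-multiindex data, here is a minimal pandas sketch. It is not the code in `impute.py`; the toy panel, the level names `instance` and `time`, and the variable names are assumptions made up for this example. It compares a non-grouped ffill-then-bfill against the same fill applied per instance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: three instances, three time points each.
idx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1, 2]], names=["instance", "time"]
)
values = [1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, np.nan, np.nan]
X = pd.DataFrame({"y": values}, index=idx)

# Non-grouped fill: values leak across instance boundaries.
# The leading gap of "b" is filled from "a", and the empty instance "c"
# is filled entirely from "b".
leaky = X.ffill().bfill()

# Grouped fill: each instance is filled on its own, ffill first, then bfill.
grouped = X.groupby(level="instance").ffill()
grouped = grouped.groupby(level="instance").bfill()

print(leaky["y"].tolist())
print(grouped["y"].tolist())
```

In the grouped variant, the leading gap of `"b"` is filled from `"b"` itself, and the all-missing instance `"c"` stays missing because there is nothing within that instance to fill from.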

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

All changes

Did you add any tests for the change?

Yes

Any other comments?

No

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
See here for full badge reference
Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
See here for further details on the algorithm maintainer role.
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

fkiraly · 2024-04-02T14:43:42Z

sktime/transformations/series/impute.py

@@ -23,7 +23,9 @@ class Imputer(BaseTransformer):
    Parameters
    ----------
    method : str, default="drift"
-        Method to fill the missing values.
+        Method to fill the missing values. Not all methods can extrapolate, so after


makes sense. Do we know which methods are impacted? If only a few, it makes sense to describe in the method bullet point, like for "linear".

Ram0nB · 2024-04-02

The following methods are affected:

- "linear" can't extrapolate
- "ffill"/"pad" can't extrapolate backward
- "backfill"/"bfill" can't extrapolate forward
- Methods that fit on data seen in fit ("drift", "mean", "median" and "random") are affected when the data in transform contains an instance not seen in fit.

Since this affects more than a few methods, I think it makes sense to leave the docstring as is. Let me know if you have other suggestions.
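
As a small illustration of the extrapolation gaps listed above, the following pandas sketch (plain pandas, not the Imputer itself; the series values are made up) shows that a forward fill leaves a leading gap, a backward fill leaves a trailing gap, and chaining ffill then bfill leaves neither.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading, an interior, and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = s.ffill()       # cannot fill the leading gap
backward = s.bfill()      # cannot fill the trailing gap
both = s.ffill().bfill()  # ffill first, then bfill closes the leading gap

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```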

fkiraly · 2024-04-02

ok, thanks for the explanation.

Btw, I thought mean, median, random should be ok? This would not depend on a method, since that is not used?

Ram0nB · 2024-04-05

I wrote that "drift" and "random" are also affected, but when testing I noticed that this is not the case. Sorry about this. The reason that "mean" and "median" are affected is that those methods don't use sktime's vectorization. When using sktime's vectorization and transforming an instance not seen in fit, the following error is raised:

RuntimeError: Imputer is a transformer that applies per individual time series, and broadcasts across instances. In fit, Imputer makes one fit per instance, and applies that fit to the instance with the same index in transform. Vanilla use therefore requires the same number of instances in fit and transform, butfound different number of instances in transform than in fit. number of instances seen in fit: 2; number of instances seen in transform: 3. For fit/transforming per instance, e.g., for pre-processinng in a time series classification, regression or clustering pipeline, wrap this transformer in FitInTransform, from sktime.transformations.compose.

This error is, however, not raised when transforming an instance not seen in fit with "mean" and "median". As a result, the missing values in instances not seen in fit are not imputed with "mean" or "median" (there are no mean/median values for those instances), but with "ffill" then "bfill", since those are applied after every method.
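
The mechanism described here can be sketched in plain pandas. This is only an illustration under assumptions, not sktime's implementation: the helper `make_panel`, the level names, and the toy values are hypothetical. A per-instance mean is computed on the fit data, looked up for each row at transform time, and instances without a fitted mean fall through to the grouped ffill-then-bfill step.

```python
import numpy as np
import pandas as pd


def make_panel(values_by_instance):
    """Build a small pd-multiindex panel from {instance: [values]} (hypothetical helper)."""
    frames = {k: pd.DataFrame({"y": v}) for k, v in values_by_instance.items()}
    return pd.concat(frames, names=["instance", "time"])


X_fit = make_panel({"a": [1.0, 3.0], "b": [10.0, 30.0]})
X_new = make_panel({"a": [np.nan, 3.0], "c": [np.nan, 7.0]})  # "c" not seen in fit

# "fit": one mean per instance seen in fit.
means = X_fit.groupby(level="instance")["y"].mean()

# "transform": look up the fitted mean for each row; unseen instances get NaN.
instances = X_new.index.get_level_values("instance")
row_means = pd.Series(means.reindex(instances).to_numpy(), index=X_new.index)
after_mean = X_new["y"].fillna(row_means)

# Fallback applied after the method: grouped ffill, then grouped bfill.
final = after_mean.groupby(level="instance").ffill()
final = final.groupby(level="instance").bfill()

print(after_mean.tolist())
print(final.tolist())
```

In this sketch the gap in `"a"` is filled with the mean fitted for `"a"`, while the gap in the unseen instance `"c"` survives the mean step and is only closed by the backward fill.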

sktime/transformations/series/tests/test_imputer.py

fkiraly

Excellent!

No blocking comments, only suggestions for improvement.

fkiraly · 2024-04-05T13:30:20Z

sktime/transformations/series/tests/test_imputer.py

+    Failure case in bug #6224
+    """
+
+    df = get_examples(mtype="pd-multiindex")[0]


These lines have a side effect on the fixture, i.e., they change the example itself by mutating the object: iloc writes are mutating, i.e., in place.

This is why all the weird failures occur: the example is no longer as expected in checks.
We should probably make the function safer.

For now, could you make a copy or deepcopy of df?

(you could also use this PR: #6259)

fkiraly

The failures are due to get_examples being unsafe and inplace mutation, see above.

This PR makes `get_examples` safer against side effects on the data fixtures, by adding a `deepcopy` at the end. This should prevent issues like in #6253 (review) from occurring in the future.
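
A minimal sketch of the side effect and the `deepcopy` remedy, under stated assumptions: the registry `_EXAMPLES` and the functions `get_examples_unsafe` and `get_examples_safe` are hypothetical stand-ins written for this example, not sktime's actual `get_examples`.

```python
from copy import deepcopy

import numpy as np
import pandas as pd

# Hypothetical fixture registry; a stand-in, not sktime's example store.
_EXAMPLES = {"panel": pd.DataFrame({"y": [1.0, 2.0, 3.0]})}


def get_examples_unsafe(name):
    """Return the stored fixture itself, so all callers share one object."""
    return _EXAMPLES[name]


def get_examples_safe(name):
    """Return an independent deep copy, so caller-side writes stay local."""
    return deepcopy(_EXAMPLES[name])


# Writing into the copy leaves the stored fixture untouched.
df = get_examples_safe("panel")
df.iloc[0, 0] = np.nan
fixture_intact = not _EXAMPLES["panel"].isna().any().any()

# Writing into the shared object changes the fixture for every later caller.
leaked = get_examples_unsafe("panel")
leaked.iloc[0, 0] = np.nan
fixture_mutated = bool(_EXAMPLES["panel"].isna().iloc[0, 0])

print(fixture_intact, fixture_mutated)
```

Copying inside the accessor rather than at each call site means a test that forgets to copy cannot corrupt fixtures used by other checks.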

fkiraly

Fixes the get_examples issue, so the test should work now as intended.

Thanks!

fkiraly · 2024-04-10T13:46:09Z

failures are unrelated, #6280

Ram0nB added 4 commits April 2, 2024 16:27

Update impute.py

17a6fd5

- Add to docstring that after every method ffill then bfill will be used - In case of pd-multiindex also always ffill then bfill to keep behaviour consistent with single index case - In case of pd-multiindex never impute on non-grouped data

Update test_imputer.py

27f96f9

Add test for bug sktime#6224

Merge branch 'imputer-bug' of https://github.com/Ram0nB/sktime into i…

25be572

…mputer-bug

Update .all-contributorsrc

7bd70fb

Add bug and test label

Ram0nB requested review from achieveordie, benHeid, fkiraly and yarnabrina as code owners April 2, 2024 14:40

fkiraly added the module:transformations and bugfix labels Apr 2, 2024

Update .all-contributorsrc

92832b0

Spaces instead of tabs for indent

fkiraly reviewed Apr 2, 2024

fkiraly previously approved these changes Apr 2, 2024

Update test_imputer.py

c00cc46

Add test for: - no nans left when only first/last missing (extrapolation) - Consistency between applying the imputer to every instance separately, vs applying them to the panel

Ram0nB dismissed fkiraly’s stale review via c00cc46 April 2, 2024 15:12

Ram0nB marked this pull request as draft April 2, 2024 15:32

Ram0nB added 2 commits April 2, 2024 17:42

Update test_imputer.py

4d97e57

- Only test pd-multiindex with methods that support entire instance missing

Update test_imputer.py

f2f0978

Ram0nB marked this pull request as ready for review April 2, 2024 15:46

Ram0nB requested a review from fkiraly April 5, 2024 09:51

fkiraly reviewed Apr 5, 2024

fkiraly requested changes Apr 5, 2024

fkiraly mentioned this pull request Apr 5, 2024

[ENH] make get_examples side effect safe via deepcopy #6259

Merged

fkiraly and others added 3 commits April 8, 2024 23:57

Merge branch 'main' into pr/6253

de5436e

Update test_imputer.py

0aa3c95

Copy example to prevent mutation to original object

Merge branch 'imputer-bug' of https://github.com/Ram0nB/sktime into i…

7e824f7

…mputer-bug

fkiraly approved these changes Apr 10, 2024

fkiraly merged commit a63537b into sktime:main Apr 10, 2024
52 of 54 checks passed


