New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Imputer bugfix for issue #6224 #6253
Conversation
- Add to docstring that after every method ffill then bfill will be used - In case of pd-multiindex also always ffill then bfill to keep behaviour consistent with single index case - In case of pd-multiindex never impute on non-grouped data
Add test for bug sktime#6224
Add bug and test label
Spaces instead of tabs for indent
@@ -23,7 +23,9 @@ class Imputer(BaseTransformer): | |||
Parameters | |||
---------- | |||
method : str, default="drift" | |||
Method to fill the missing values. | |||
Method to fill the missing values. Not all methods can extrapolate, so after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. Do we know which methods are impacted? If only a few, it makes sense to describe in the method bullet point, like for "linear"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "linear" can't extrapolate
- "ffill"/"pad" can't extrapolate backward
- "backfill"/"bfill" can't extrapolate forward
- In case a method is chosen that fits on data seen in fit ("drift", "mean", "median" and "random"), but the data in transform contains an instance not seen in fit.
Since more than a few, I think it makes sense to leave the docstring as is. Let me know if you have other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks for the explanation.
Btw, I thought mean
, median
, random
should be ok? This would not depend on a method
, since that is not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote that "drift" and "random" are also affected, but when testing I noticed that this is not the case. Sorry about this. The reason that "mean" and "median" are affected is that those methods don't use Sktime's vectorization. When using Sktime's vectorization and transforming an instance not seen in fit, the following error is raised:
RuntimeError: Imputer is a transformer that applies per individual time series, and broadcasts across instances. In fit, Imputer makes one fit per instance, and applies that fit to the instance with the same index in transform. Vanilla use therefore requires the same number of instances in fit and transform, butfound different number of instances in transform than in fit. number of instances seen in fit: 2; number of instances seen in transform: 3. For fit/transforming per instance, e.g., for pre-processinng in a time series classification, regression or clustering pipeline, wrap this transformer in FitInTransform, from sktime.transformations.compose.
This error is however not raised when transforming an instance not seen in fit with "mean" and "median". This causes the missing values in those instances not seen in fit not to be imputed with mean" and "median" (there are no mean/median values for those instances), but rather with "ffill" then "bfill" since those are applied after every method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
No blocking comments, only suggestions for improvement.
Add test for: - no nans left when only first/last missing (extrapolation) - Consistency between applying the imputer to every instance separately, vs applying them to the panel
- Only test pd-multiindex with methods that support entire instance missing
Failure case in bug #6224 | ||
""" | ||
|
||
df = get_examples(mtype="pd-multiindex")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these lines have a side effect on the fixture, i.e., it changes the example itself by mutating the object - iloc
writes are mutating, i.e., inplcae.
For this reason, all the weird failures occur, since the example is no longer as expected in checks.
We should probably make the function safer.
For now, could you make a copy or deepcopy of df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(you could also use this PR: #6259)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failures are due to get_examples
being unsafe and inplace mutation, see above.
This PR makes `get_examples` safer against side effects on the data fixtures, by adding a `deepcopy` at the end. This should prevent issues like in #6253 (review) from occurring in the future.
Copy example to prevent mutation to original object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes the get_examples
issue, so the test should work now as intended.
Thanks!
failures are unrelated, #6280 |
Reference Issues/PRs
Fixes #6224
What does this implement/fix? Explain your changes.
For the Imputer:
For the Imputer tests:
Does your contribution introduce a new dependency? If yes, which one?
No
What should a reviewer concentrate their feedback on?
All changes
Did you add any tests for the change?
Yes
Any other comments?
No
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
sktime
root directory (not theCONTRIBUTORS.md
). Common badges:code
- fixing a bug, or adding code logic.doc
- writing or improving documentation or docstrings.bug
- reporting or diagnosing a bug (get this pluscode
if you also fixed the bug in the PR).maintenance
- CI, test framework, release.See here for full badge reference
maintainers
tag - do this if you want to become the owner or maintainer of an estimator you added.See here for further details on the algorithm maintainer role.