Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Imputer bugfix for issue #6224 #6253

Merged
merged 11 commits into from Apr 10, 2024
Merged

[BUG] Imputer bugfix for issue #6224 #6253

merged 11 commits into from Apr 10, 2024

Conversation

Ram0nB
Copy link
Contributor

@Ram0nB Ram0nB commented Apr 2, 2024

Reference Issues/PRs

Fixes #6224

What does this implement/fix? Explain your changes.

For the Imputer:

  • Add to docstring that after every method ffill then bfill will be used
  • In case of pd-multiindex also always ffill then bfill to keep behaviour consistent with single index case
  • In case of pd-multiindex never impute on non-grouped data

For the Imputer tests:

Does your contribution introduce a new dependency? If yes, which one?

No

What should a reviewer concentrate their feedback on?

All changes

Did you add any tests for the change?

Yes

Any other comments?

No

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

- Add to docstring that after every method ffill then bfill will be used
- In case of pd-multiindex also always ffill then bfill to keep behaviour consistent with single index case
- In case of pd-multiindex never impute on non-grouped data
Add bug and test label
@fkiraly fkiraly added module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing bugfix Fixes a known bug or removes unintended behavior labels Apr 2, 2024
Spaces instead of tabs for indent
@@ -23,7 +23,9 @@ class Imputer(BaseTransformer):
Parameters
----------
method : str, default="drift"
Method to fill the missing values.
Method to fill the missing values. Not all methods can extrapolate, so after
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. Do we know which methods are impacted? If only a few, it makes sense to describe in the method bullet point, like for "linear".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "linear" can't extrapolate
  • "ffill"/"pad" can't extrapolate backward
  • "backfill"/"bfill" can't extrapolate forward
  • In case a method is chosen that fits on data seen in fit ("drift", "mean", "median" and "random"), but the data in transform contains an instance not seen in fit.

Since more than a few, I think it makes sense to leave the docstring as is. Let me know if you have other suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for the explanation.

Btw, I thought mean, median, random should be ok? This would not depend on a method, since that is not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote that "drift" and "random" are also affected, but when testing I noticed that this is not the case. Sorry about this. The reason that "mean" and "median" are affected is that those methods don't use Sktime's vectorization. When using Sktime's vectorization and transforming an instance not seen in fit, the following error is raised:

RuntimeError: Imputer is a transformer that applies per individual time series, and broadcasts across instances. In fit, Imputer makes one fit per instance, and applies that fit to the instance with the same index in transform. Vanilla use therefore requires the same number of instances in fit and transform, butfound different number of instances in transform than in fit. number of instances seen in fit: 2; number of instances seen in transform: 3. For fit/transforming per instance, e.g., for pre-processinng in a time series classification, regression or clustering pipeline, wrap this transformer in FitInTransform, from sktime.transformations.compose.

This error is however not raised when transforming an instance not seen in fit with "mean" and "median". This causes the missing values in those instances not seen in fit not to be imputed with mean" and "median" (there are no mean/median values for those instances), but rather with "ffill" then "bfill" since those are applied after every method.

fkiraly
fkiraly previously approved these changes Apr 2, 2024
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

No blocking comments, only suggestions for improvement.

Add test for:
- no nans left when only first/last missing (extrapolation)
- Consistency between applying the imputer to every instance separately, vs applying them to the panel
@Ram0nB Ram0nB marked this pull request as draft April 2, 2024 15:32
- Only test pd-multiindex with methods that support entire instance missing
@Ram0nB Ram0nB marked this pull request as ready for review April 2, 2024 15:46
@Ram0nB Ram0nB requested a review from fkiraly April 5, 2024 09:51
Failure case in bug #6224
"""

df = get_examples(mtype="pd-multiindex")[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these lines have a side effect on the fixture, i.e., it changes the example itself by mutating the object - iloc writes are mutating, i.e., inplcae.

For this reason, all the weird failures occur, since the example is no longer as expected in checks.
We should probably make the function safer.

For now, could you make a copy or deepcopy of df?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you could also use this PR: #6259)

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failures are due to get_examples being unsafe and inplace mutation, see above.

fkiraly added a commit that referenced this pull request Apr 8, 2024
This PR makes `get_examples` safer against side effects on the data
fixtures, by adding a `deepcopy` at the end.

This should prevent issues like in
#6253 (review)
from occurring in the future.
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes the get_examples issue, so the test should work now as intended.

Thanks!

@fkiraly
Copy link
Collaborator

fkiraly commented Apr 10, 2024

failures are unrelated, #6280

@fkiraly fkiraly merged commit a63537b into sktime:main Apr 10, 2024
52 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix Fixes a known bug or removes unintended behavior module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Imputer imputes too many pd-multiindex missing values
2 participants