Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom imputation strategy to SimpleImputer #28053

Merged
merged 9 commits into from
Jan 4, 2024

Conversation

mark-thm
Copy link
Contributor

@mark-thm mark-thm commented Jan 3, 2024

Reference Issues/PRs

#27986

What does this implement/fix? Explain your changes.

Adds a 'custom' strategy to SimpleImputer that enables supplying ones own statistics to produce an imputation value.

In my experience, it's useful to be able to compute, for instance, minimum and maximum values of the inputs in addition to, for instance, mean, and this enables unifying the imputation logic and producing a single location to manage all imputations.

Any other comments?

I proposed a similar change in #27986 and @adrinjalali and @jnothman requested to see a variation that accepts a callable instead of explicitly supporting new statistics.

Copy link

github-actions bot commented Jan 3, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0c21689. Link to the linter CI: here

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround. What I'd imagined was strategy=np.maximum in which case we apply self.strategy(masked_X_column.compressed()) for each column, or, in an optimisation, identify that self.strategy is a ufunc and apply the function vectorized across columns. I think it's okay, initially, to expect that the callable take a dense 1d array as input.

sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
sklearn/impute/_base.py Outdated Show resolved Hide resolved
@mark-thm mark-thm requested a review from jnothman January 3, 2024 19:47
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from changing the validator to callable, LGTM.

Co-authored-by: Joel Nothman <joeln@canva.com>
sklearn/impute/_base.py Show resolved Hide resolved
Comment on lines 469 to 470
else:
raise RuntimeError(f"Unknown strategy {strategy}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can ever happen since we validate parameters before getting here and people shouldn't be calling these private methods themselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to get code coverage to pass without adding the final condition and this test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this line, codecov doesn't have them to want to cover in the first place.

Comment on lines 541 to 542
else:
raise RuntimeError(f"Unknown strategy {strategy}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the clean solution @mark-thm

@adrinjalali adrinjalali merged commit e2b3785 into scikit-learn:main Jan 4, 2024
26 checks passed
@mark-thm mark-thm deleted the me/custom-imputation2 branch January 4, 2024 15:15
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants