Skip to content

Conversation

mark-thm
Copy link
Contributor

@mark-thm mark-thm commented Jan 3, 2024

Reference Issues/PRs

#27986

What does this implement/fix? Explain your changes.

Adds a 'custom' strategy to SimpleImputer that enables supplying ones own statistics to produce an imputation value.

In my experience, it's useful to be able to compute, for instance, minimum and maximum values of the inputs in addition to, for instance, mean, and this enables unifying the imputation logic and producing a single location to manage all imputations.

Any other comments?

I proposed a similar change in #27986 and @adrinjalali and @jnothman requested to see a variation that accepts a callable instead of explicitly supporting new statistics.

Copy link

github-actions bot commented Jan 3, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0c21689. Link to the linter CI: here

@mark-thm mark-thm force-pushed the me/custom-imputation2 branch from 2c960f9 to 64eb8bf Compare January 3, 2024 01:10
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround. What I'd imagined was strategy=np.maximum in which case we apply self.strategy(masked_X_column.compressed()) for each column, or, in an optimisation, identify that self.strategy is a ufunc and apply the function vectorized across columns. I think it's okay, initially, to expect that the callable take a dense 1d array as input.

@mark-thm mark-thm requested a review from jnothman January 3, 2024 19:47
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from changing the validator to callable, LGTM.

Co-authored-by: Joel Nothman <joeln@canva.com>
Comment on lines 469 to 470
else:
raise RuntimeError(f"Unknown strategy {strategy}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can ever happen since we validate parameters before getting here and people shouldn't be calling these private methods themselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to get code coverage to pass without adding the final condition and this test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove this line, codecov doesn't have them to want to cover in the first place.

Comment on lines 541 to 542
else:
raise RuntimeError(f"Unknown strategy {strategy}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the clean solution @mark-thm

@adrinjalali adrinjalali merged commit e2b3785 into scikit-learn:main Jan 4, 2024
@mark-thm mark-thm deleted the me/custom-imputation2 branch January 4, 2024 15:15
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants