[MRG]Fix simple imputer for string input #17526
sklearn/impute/tests/test_impute.py
Outdated
    @pytest.mark.parametrize(
        "strategy, X, expected_X",
        [("most_frequent", [['a'], [np.nan]],
          np.array([['a'], ['a']], dtype=np.object)),
         ("constant", [['a'], [np.nan]],
          np.array([['a'], ['missing_value']], dtype=np.object))]
    )
    def test_simple_imputation_for_string(strategy, X, expected_X):
        imputer = SimpleImputer(strategy=strategy)
        X_trans = imputer.fit_transform(X)

        assert_array_equal(X_trans, expected_X)
For consistency with current naming, I would split this function into test_imputation_most_frequent_string_list and test_imputation_constant_string_list, and place them near the similar testing functions (e.g., test_imputation_most_frequent_objects and test_imputation_constant_object).
Moreover, do not test X and expected_X as ndarray with dtype=object, since that case is already tested in test_imputation_most_frequent_objects and test_imputation_constant_object.
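The split suggested above could look roughly like the sketch below. It assumes a scikit-learn version that already includes this fix, passes X as a plain list of strings, and compares against plain lists instead of object ndarrays. The test bodies are illustrative, not the code that was merged.

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.impute import SimpleImputer


def test_imputation_most_frequent_string_list():
    # Plain Python list mixing a string with a missing value.
    X = [["a"], [np.nan]]
    X_trans = SimpleImputer(strategy="most_frequent").fit_transform(X)
    assert_array_equal(X_trans, [["a"], ["a"]])


def test_imputation_constant_string_list():
    # With fill_value left unset, string data is filled with "missing_value".
    X = [["a"], [np.nan]]
    X_trans = SimpleImputer(strategy="constant").fit_transform(X)
    assert_array_equal(X_trans, [["a"], ["missing_value"]])
```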
sklearn/impute/_base.py
Outdated
    @@ -228,7 +228,11 @@ def _validate_input(self, X, in_fit):
                                 self.strategy))

            if self.strategy in ("most_frequent", "constant"):
                dtype = None
                if isinstance(X, (list, tuple)) and \
Since tuple is not a typical use case (nor array-like), I would remove this data type.
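A minimal sketch of the check with tuple dropped, as suggested above. The helper name list_contains_str is hypothetical and used only for illustration; in the PR the condition is written inline in _validate_input.

```python
def list_contains_str(X):
    """Return True if X is a list of rows and any element is a str.

    Hypothetical helper: it mirrors the inline condition from the diff,
    restricted to list input only.
    """
    return isinstance(X, list) and any(
        isinstance(elem, str) for row in X for elem in row
    )


print(list_contains_str([["a"], [float("nan")]]))  # list with a string
print(list_contains_str([[1.0], [2.0]]))           # list without strings
print(list_contains_str((("a",),)))                # tuple input is ignored
```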
sklearn/impute/_base.py
Outdated
                dtype = None
                if isinstance(X, (list, tuple)) and \
                        any(isinstance(elem, str) for row in X for elem in row):
                    dtype = np.object
For this particular case, I prefer object instead of np.object (as in the testing functions). Moreover, adding a comment explaining why this data type is required would be helpful.
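One way to see why an explicit object dtype matters for list input is the NumPy-only sketch below. It shows how a list mixing strings and np.nan is converted with and without dtype=object. It does not reproduce the validation code in _validate_input, and the variable names are illustrative.

```python
import numpy as np

X = [["a"], [np.nan]]

# Without an explicit dtype, NumPy picks a unicode dtype and the missing
# value is turned into the literal string "nan".
as_default = np.asarray(X)
print(as_default.dtype.kind, repr(as_default[1, 0]))

# With dtype=object, each element keeps its Python type, so the missing
# value is still a float NaN that can be detected as missing.
as_object = np.asarray(X, dtype=object)
print(as_object.dtype, repr(as_object[1, 0]))
```

Using the builtin object here also avoids the np.object alias, which newer NumPy releases no longer provide.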
LGTM. Thanks @yagi-3!
@alfaro96 Thank you for your quick and kind suggestions!
Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Feature|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan Thank you for the suggestions! I added a change log entry. This is my first time; I tried to follow the other entries, but I am still worried that I may have made mistakes. Any suggestions are welcome :)
Thanks @yagi-3! Just one comment: the tests could be parametrized. Otherwise LGTM.
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Thanks!
applied to #DataUmbrella sprint
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Reference Issues/PRs
Fixes #17525
What does this implement/fix?
SimpleImputer with strategy="most_frequent" or "constant"
Any other comments?