Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG]Fix simple imputer for string input #17526

Merged
merged 15 commits into from Jun 9, 2020

Conversation

yagi-3
Copy link
Contributor

@yagi-3 yagi-3 commented Jun 7, 2020

Reference Issues/PRs

Fixes #17525

What does this implement/fix?

  • Fix to accept list of strings in SimpleImputer with strategy="most_frequent" or "constant"
  • Add unittest

Any other comments?

  • If this PR is not proper, please tell me. I will close:)

@yagi-3 yagi-3 changed the title Fix simple imputer for str input Fix simple imputer for string input Jun 7, 2020
@yagi-3 yagi-3 changed the title Fix simple imputer for string input [WIP]Fix simple imputer for string input Jun 7, 2020
@yagi-3 yagi-3 changed the title [WIP]Fix simple imputer for string input [MRG]Fix simple imputer for string input Jun 8, 2020
Comment on lines 1367 to 1380


@pytest.mark.parametrize(
"strategy, X, expected_X",
[("most_frequent", [['a'], [np.nan]],
np.array([['a'], ['a']], dtype=np.object)),
("constant", [['a'], [np.nan]],
np.array([['a'], ['missing_value']], dtype=np.object))]
)
def test_simple_imputation_for_string(strategy, X, expected_X):
imputer = SimpleImputer(strategy=strategy)
X_trans = imputer.fit_transform(X)

assert_array_equal(X_trans, expected_X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with current naming, I would divide this function to test_imputation_most_frequent_string_list and test_imputation_constant_string_list, including near to similar testing functions (e.g., test_imputation_most_frequent_objects and test_imputation_constant_object).

Moreover, do not test X and expected_X as ndarray with dtype=object, since they are already tested in test_imputation_most_frequent_objects and test_imputation_constant_object.

@@ -228,7 +228,11 @@ def _validate_input(self, X, in_fit):
self.strategy))

if self.strategy in ("most_frequent", "constant"):
dtype = None
if isinstance(X, (list, tuple)) and \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since tuple is not a typical use case (nor array-like), I would remove this data type.

dtype = None
if isinstance(X, (list, tuple)) and \
any(isinstance(elem, str) for row in X for elem in row):
dtype = np.object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this particular case, I prefer object instead of np.object (similarly to testing functions). Moreover, adding a comment why is this data type required would be helpful.

Copy link
Member

@alfaro96 alfaro96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @yagi-3!

@yagi-3
Copy link
Contributor Author

yagi-3 commented Jun 8, 2020

@alfaro96 Thank you for your quick and kind suggestions!
I learned a lot and appreciate the opportunity!:smile:

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an entry to the change log at doc/whats_new/v0.24.rst with tag |Feature|. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

sklearn/impute/tests/test_impute.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_impute.py Outdated Show resolved Hide resolved
@yagi-3
Copy link
Contributor Author

yagi-3 commented Jun 9, 2020

@thomasjpfan Thank you for the suggestions!

I added a change log. This is my first time. I tried to follow other entries but still worried that I may make mistakes. Any suggestion is welcomed:)

@yagi-3
Copy link
Contributor Author

yagi-3 commented Jun 9, 2020

@alfaro96 I'd like to add your name to change log (5ff61f2). Is it ok?

@alfaro96
Copy link
Member

alfaro96 commented Jun 9, 2020

@alfaro96 I'd like to add your name to change log (5ff61f2). Is it ok?

It would be a pleasure. Thanks @yagi-3 for the credit!

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yagi-3 ! Just one comment that tests could be parametrized, otherwise LGTM.

sklearn/impute/tests/test_impute.py Outdated Show resolved Hide resolved
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rth rth merged commit 3fb9e75 into scikit-learn:master Jun 9, 2020
@reshamas
Copy link
Member

applied to #DataUmbrella sprint

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ValueError in SimpleImputer for string input with strategy='most_frequent'/'constant'
6 participants