[MRG] Support pd.NA in StringDtype columns for SimpleImputer #21114

Merged
merged 29 commits into from
Nov 5, 2021

Conversation

Contributor

@yxiong yxiong commented Sep 23, 2021

Reference Issues/PRs

Fixes #21112.

What does this implement/fix? Explain your changes.

This is a starting point for discussing potential fixes for #21112, containing two parts:

  1. Make sklearn.utils.is_scalar_nan(x) return True when x is pd.NA. This is necessary for imputer._validate_input to successfully validate pd.StringDtype data with pd.NA.
  2. Support pd.NA in sklearn.utils._mask._get_dense_mask.

With these changes, the code snippet in #21112 runs successfully and imputes pd.NA with empty strings.
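The two parts above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `is_pandas_na`, `is_scalar_nan_sketch` and `dense_mask_sketch` are hypothetical, and the real `sklearn.utils` helpers may differ in detail.

```python
import math
import numbers
from contextlib import suppress

import numpy as np


def is_pandas_na(x):
    """Hypothetical helper: True only when x is the pandas.NA singleton."""
    with suppress(ImportError):
        # ImportError covers both a missing pandas install and old pandas
        # releases where `NA` cannot be imported.
        from pandas import NA

        return x is NA
    return False


def _is_float_nan(x):
    """True for real-valued, non-integer scalars that are NaN."""
    return (
        isinstance(x, numbers.Real)
        and not isinstance(x, numbers.Integral)
        and math.isnan(x)
    )


def is_scalar_nan_sketch(x):
    """Part 1 (assumed behaviour): treat both float NaN and pandas.NA as missing."""
    return is_pandas_na(x) or _is_float_nan(x)


def dense_mask_sketch(X, value_to_mask):
    """Part 2 (assumed behaviour): boolean mask of entries matching value_to_mask."""
    X = np.asarray(X, dtype=object)
    if is_pandas_na(value_to_mask):
        match = is_pandas_na
    elif _is_float_nan(value_to_mask):
        match = _is_float_nan
    else:
        # Check for pandas.NA first: `pd.NA == value` is itself NA, which has
        # no truth value and would raise when converted to bool.
        def match(v):
            return (not is_pandas_na(v)) and v == value_to_mask

    # Apply the scalar predicate element-wise on the object array.
    return np.frompyfunc(match, 1, 1)(X).astype(bool)
```

Comparing by identity (`x is NA`) avoids relying on `==`, which propagates `pd.NA` instead of returning a boolean.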

Any other comments?

I am new to contributing to sklearn and unfamiliar with its customs and norms (e.g. the proper way to import pandas). This PR is just a proof of concept to start a discussion. If the direction looks promising, I can update the code to follow the package's conventions, add documentation and unit tests, etc. Please advise. Thanks!

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

You can find the failing errors in the CI here: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=32878&view=results

For the linting errors, running black . should resolve the issue.

I left comments on some of my concerns.

sklearn/utils/fixes.py (outdated, resolved)
Member

@thomasjpfan thomasjpfan left a comment

We need to add a test to check that SimpleImputer works with pd.NA and string extension arrays.

This PR treats pd.NA explicitly when the dataframe is converted into an object dtype. This works well when the extension array naturally converts into an object dtype, i.e. strings.

sklearn/utils/__init__.py (outdated, resolved)
sklearn/utils/__init__.py (outdated, resolved)
sklearn/utils/_mask.py (resolved)
sklearn/impute/_base.py (outdated, resolved)
sklearn/impute/_base.py (resolved)
- Make private API: is_pd_na ==> _is_pandas_na.
- Add comments about suppressing `AttributeError`.
- Move the `_more_tags` function from `_BaseImputer` to its children.
- Add comments about skip validation in `_check_inputs_dtype`.
- Add unit test for floating point array
- Fix linter errors on "line too long"
- Add notes to doc/whats_new/v1.1.rst
sklearn/impute/_iterative.py (outdated, resolved)
sklearn/impute/_knn.py (outdated, resolved)
sklearn/utils/__init__.py (resolved)
- Move `_more_tags` back to the base class
- Revert doc/whats_new/_contributors.rst
- Update the docstring of `SimpleImputer` and add another test for using `np.nan` as missing value
Contributor Author

yxiong commented Sep 28, 2021

Thanks again for the code review, @thomasjpfan and @ogrisel. Please take another look.

Contributor Author

yxiong commented Oct 5, 2021

Friendly ping @thomasjpfan @ogrisel. Please let me know if you have other comments. If things look good, what is the right procedure to merge this into the main branch?

@yxiong changed the title from "Support pd.NA in StringDtype columns for SimpleImputer" to "[MRG] Support pd.NA in StringDtype columns for SimpleImputer" on Oct 7, 2021
sklearn/impute/tests/test_impute.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
sklearn/utils/__init__.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
- Add unit test for 'median' strategy on integer-type arrays
- Add xfailing test for 'median' strategy on float-typed arrays
- Update code style to only suppress `ImportError`, not `AttributeError`
Contributor Author

yxiong commented Oct 13, 2021

Updated and all tests passed. Please take another look.

Member

@thomasjpfan thomasjpfan left a comment

Minor comment, otherwise LGTM!

sklearn/utils/__init__.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
Contributor Author

yxiong commented Oct 20, 2021

Thanks again for the review, @thomasjpfan!

@ogrisel please take another look and let me know if you have other comments.

Contributor Author

yxiong commented Oct 27, 2021

Friendly ping on this PR @thomasjpfan @ogrisel. Could you kindly advise on the right procedure to get this merged into the main branch? Are there any other actions needed from my side at the moment? Thanks!

Member

@ogrisel ogrisel left a comment

Sorry for the slow feedback @yxiong.

Overall this looks good but there are things to improve with the dtype handling I think. See below.

@thomasjpfan let me know if you agree.

sklearn/impute/tests/test_impute.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
sklearn/impute/tests/test_impute.py (outdated, resolved)
@yxiong yxiong requested a review from ogrisel October 28, 2021 19:59
Contributor Author

yxiong commented Oct 29, 2021

@ogrisel Please take another look.

Member

@thomasjpfan thomasjpfan left a comment

I like the new dtype checks in the tests.

sklearn/impute/tests/test_impute.py (resolved)
sklearn/impute/tests/test_impute.py (resolved)
- Add test case for `strategy="mean"`
- Use `assert_allclose` for float arrays
Contributor Author

yxiong commented Nov 4, 2021

Friendly ping @thomasjpfan @ogrisel. Please take another look.

Member

@ogrisel ogrisel left a comment

LGTM, thank you very much!

@ogrisel ogrisel merged commit 5256fb3 into scikit-learn:main Nov 5, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
…learn#21114)


Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…learn#21114)


Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
…learn#21114)


Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Development

Successfully merging this pull request may close these issues.

SimpleImputer cannot impute pd.DataFrame of StringDtype
3 participants