
ENH Adds pandas IntegerArray support to check_array #16508

Merged
merged 24 commits into scikit-learn:master on Apr 28, 2020

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Feb 21, 2020

Reference Issues/PRs

Resolves #16498

What does this implement/fix? Explain your changes.

Uses the name of the dtype to check if the dtype is a pandas IntegerArray.
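A minimal sketch of such a name-based check (illustrative only; the helper name `is_pd_integer_dtype` and the explicit set of dtype names are assumptions, not the exact PR code):

```python
import pandas as pd

# Names of pandas' nullable integer dtypes (signed and unsigned variants).
PD_INTEGER_DTYPE_NAMES = {
    "Int8", "Int16", "Int32", "Int64",
    "UInt8", "UInt16", "UInt32", "UInt64",
}

def is_pd_integer_dtype(dtype) -> bool:
    """Return True if `dtype` is one of pandas' nullable integer dtypes."""
    return getattr(dtype, "name", None) in PD_INTEGER_DTYPE_NAMES

print(is_pd_integer_dtype(pd.Series([1, 2, pd.NA], dtype="Int64").dtype))  # True
print(is_pd_integer_dtype(pd.Series([1.0]).dtype))                         # False
```

Checking `dtype.name` avoids importing each `Int*Dtype`/`UInt*Dtype` class individually.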

Any other comments?

The __array__ interface for pandas' IntegerArray results in an object array:

import numpy as np
import pandas as pd

X = pd.Series([1, 2, pd.NA])

np.asarray(X)
# array([1, 2, <NA>], dtype=object)

np.asarray(X, dtype=float)
# ValueError

np.asarray(X.astype(float))
# array([ 1.,  2., nan])

@adrinjalali
Member

adrinjalali commented Apr 21, 2020

@TomAugspurger could you please maybe have a look at this one?

@TomAugspurger
Contributor

Looks reasonable.

Just to confirm: the desired behavior is to coerce NA to NaN when force_all_finite=True? You're comfortable with the precision loss for large integers that comes with that?

if isinstance(dtype_iter, (Int8Dtype, Int16Dtype,
Int32Dtype, Int64Dtype)):
Contributor

We also have unsigned variants. Is that relevant anywhere in check_array?

# convert dataframe with Integer extension arrays with floats
array = array.astype(None)
Contributor

I suspect that IntegerArray.astype(None) returning a float64 ndarray isn't deliberate, right @jorisvandenbossche?

Member

I don't think that is deliberate, indeed (I would say that's a bug, although it seems consistent with numpy, where np.dtype(None) is float64).

@thomasjpfan
Member Author

Just to confirm: the desired behavior is to coerce NA to NaN when force_all_finite=True?

It is to coerce when force_all_finite is False or 'allow-nan'.

You're comfortable with the precision loss for large integers that comes with that?

It is a compromise in trying to push extension arrays into a homogeneous numpy array, while making it easier for users to use pandas + scikit-learn. I am slightly uncomfortable with this and am open to another approach.
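The precision concern can be seen with plain NumPy: float64 has a 53-bit significand, so integers above 2**53 are not all exactly representable (a small illustrative sketch, independent of the PR code):

```python
import numpy as np

big = 2**53 + 1  # smallest positive integer NOT representable as float64

as_float = np.float64(big)
print(int(as_float) == big)  # False: the value was rounded
print(int(as_float))         # 9007199254740992, i.e. 2**53
```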

This PR is using the following to coerce:

import pandas as pd

df = pd.DataFrame(
    {'a': pd.Series([1, 2, 3, None], dtype="UInt8"),
     'b': pd.Series([1, 4, 5, 10], dtype="int")})

df.dtypes
# a    UInt8
# b    int64
# dtype: object

# converts pd.NA to np.nan
df_other = df.astype(None)

df_other.dtypes
# a    float64
# b    float64
# dtype: object

which allows us to call asarray and get a numpy array. I do not think this is documented behavior. I can try another approach if this implementation is not officially supported by pandas.

@@ -487,6 +502,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
# list of accepted types.
dtype = dtype[0]

if has_pd_interger_array:
# If there are any pandas integer extension arrays,
array = array.astype(None)
Contributor

I think .astype(np.dtype('float64')) would be best (assuming you always want 64-bit here).

Member

Or to_numpy()? That's even more explicit (and certainly given that astype behaviour might still change, eg to always return new ExtensionArray), although only available for more recent versions? (which might not be a problem, since this is still experimental anyway)

Member

Looking at the actual code now, array here will typically be a DataFrame (in case of pandas), and it seems that for DataFrame.to_numpy we didn't add the na_value keyword (that's a pity, as I think df.to_numpy(dtype='float64', na_value=np.nan) would have been the most robust going forward).
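For reference, the na_value keyword does exist at the Series level in pandas >= 1.0 (a sketch of that explicit, per-column conversion; whether it applies here depends on the code path, as noted above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Explicit conversion: choose the target dtype and the NA replacement.
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr.dtype)  # float64
```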

@jorisvandenbossche
Member

You're comfortable with the precision loss for large integers that comes with that?

Note that you can already have this problem with dataframes with the standard numpy dtypes if you have eg a mix of float and int columns (which will all get converted to a single float array)
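For example, a sketch of that pre-existing behaviour with plain numpy dtypes (values chosen to make the rounding visible):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i": pd.Series([2**53 + 1], dtype="int64"),  # exact as int64
    "f": [0.5],
})

# Mixed int/float columns are upcast to a single float64 array.
arr = np.asarray(df)
print(arr.dtype)       # float64
print(int(arr[0, 0]))  # 9007199254740992: precision already lost
```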

Member

@adrinjalali adrinjalali left a comment

codecov complaints seem unrelated. LGTM, thanks @thomasjpfan

Member

@jnothman jnothman left a comment

Add to change log, please.

What's the story with imputers?

@jnothman
Member

What's the story with imputers?

Member

@ogrisel ogrisel left a comment

LGTM besides the following comments.

Please also update the docstring of force_all_finite to explicitly mention that pd.NA is accepted and will be converted to np.nan values in an np.float64 array (don't forget the versionchanged directive).
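A possible shape for that docstring update (the wording and version number are a sketch, not the final text):

```rst
force_all_finite : boolean or 'allow-nan', (default=True)
    Whether to raise an error on np.inf and np.nan in array.

    .. versionchanged:: 0.23
       Accepts `pd.NA` and converts it into `np.nan`.
```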

Int32Dtype, Int64Dtype,
UInt8Dtype, UInt16Dtype,
UInt32Dtype, UInt64Dtype)):
has_pd_interger_array = True
Member

Suggested change
has_pd_interger_array = True
has_pd_integer_array = True


if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
dtype_orig = np.result_type(*dtypes_orig)

if has_pd_interger_array:
Member

Suggested change
if has_pd_interger_array:
if has_pd_integer_array:

@@ -462,15 +462,33 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
"It will be converted to a dense numpy array."
)

has_pd_interger_array = False
Member

Suggested change
has_pd_interger_array = False
has_pd_integer_array = False

@@ -473,6 +473,9 @@ Changelog
matrix from a pandas DataFrame that contains only `SparseArray`s.
:pr:`16728` by `Thomas Fan`_.

- |Enhancement| :func:`utils.validation.check_array` supports pandas
IntegerArray. :pr:`16508` by `Thomas Fan`_.
Member

@ogrisel ogrisel Apr 27, 2020

Maybe this could be a bit more descriptive: utils.validation.check_array accepts pandas IntegerArray when force_all_finite is set to False or 'allow-nan', in which case the data is converted to floating point values with pd.NA values replaced by np.nan.

Maybe you could also mention that, as a consequence, all sklearn.preprocessing transformers (also known as "scalers") that accept numeric inputs with missing values represented as np.nan now also accept being directly fed with pandas dataframes with pandas.Int* or pandas.UInt*- typed columns that use pandas.NA as the missing value marker.

@ogrisel
Member

ogrisel commented Apr 27, 2020

What's the story with imputers?

They might work as a result of this PR but we need to update their own tests and the changelog should also mention them.

I believe the HistGradientBoosting* estimators should now also accept dataframes with IntegerArray columns.

Member

@ogrisel ogrisel left a comment

A few more comments:

  • the docstrings of other functions and methods that expose the force_all_finite option in the public API might also need to be updated similarly:
$ git grep "force_all_finite :"
sklearn/metrics/pairwise.py:    force_all_finite : boolean or 'allow-nan', (default=True)
sklearn/metrics/pairwise.py:    force_all_finite : boolean or 'allow-nan', (default=True)
sklearn/utils/validation.py:    force_all_finite : boolean or 'allow-nan', (default=True)
sklearn/utils/validation.py:    force_all_finite : boolean or 'allow-nan', (default=True)
sklearn/utils/validation.py:    force_all_finite : boolean or 'allow-nan', (default=True)
sklearn/utils/validation.py:    force_all_finite : boolean or 'allow-nan', (default=True)
  • we need to extend the tests to check the resulting dtype of this conversion:

@thomasjpfan
Member Author

Added tests to sklearn/impute/tests/test_common.py to test the new functionality with imputers. This can be a little strange because missing_values=np.nan while the missing value in the input is pd.NA, since we convert pd.NA to np.nan.

@ogrisel
Member

ogrisel commented Apr 27, 2020

Added tests to sklearn/impute/tests/test_common.py to test the new functionality with imputers. This can be a little strange because missing_values=np.nan while the missing value in the input is pd.NA, since we convert pd.NA to np.nan.

Maybe we should allow the users to pass missing_values=pd.NA in imputers? Or maybe we should just mention that pd.NA is automatically converted to np.nan in the docstring of the missing_values parameter?
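A sketch of the resulting user-facing behaviour (assumes a scikit-learn version with this PR merged, i.e. >= 0.23; the default missing_values=np.nan marker matches the converted input):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({"a": pd.Series([1, 2, None], dtype="Int64")})

# pd.NA is coerced to np.nan inside check_array, so the default
# missing_values=np.nan applies to the converted float64 data.
imputer = SimpleImputer(strategy="mean")
out = imputer.fit_transform(X)
# out is a float64 ndarray; the NA slot is mean-imputed to 1.5
```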

@ogrisel
Member

ogrisel commented Apr 28, 2020

There are new sphinx warnings in this PR:

Sphinx Warnings in affected files

    sklearn/impute/_base.py:docstring of sklearn.impute.MissingIndicator:15: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    sklearn/impute/_base.py:docstring of sklearn.impute.SimpleImputer:13: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    sklearn/impute/_iterative.py:docstring of sklearn.impute.IterativeImputer:30: WARNING: Inline interpreted text or phrase reference start-string without end-string.
    sklearn/impute/_knn.py:docstring of sklearn.impute.KNNImputer:15: WARNING: Inline interpreted text or phrase reference start-string without end-string. 

Member

@ogrisel ogrisel left a comment

Typo that breaks sphinx:

@@ -316,6 +316,10 @@ Changelog
``max_value`` and ``min_value``. Array-like inputs allow a different max and min to be specified
for each feature. :pr:`16403` by :user:`Narendra Mukherjee <narendramukherjee>`.

- |Enhancement| :class:`impute.SimpleImputer`, :class:`impute.KNNImputer`, and
:class:`impute.SimpleImputer` accepts pandas IntegerArray with nan values.
Member

Suggested change
:class:`impute.SimpleImputer` accepts pandas IntegerArray with nan values.
:class:`impute.SimpleImputer` accepts pandas' nullable integer dtype with missing values.

Just a suggestion, but in general we try to speak about the dtype, as "IntegerArray" is only used under the hood and not directly visible to most users (unless you access the underlying values of a single column), and thus not necessarily a term that users are familiar with

Member

(though it does make it more verbose)

Member

@jorisvandenbossche that's a good point. There are other occurrences of "IntegerArray" in this PR. I think we should convert them all to "pandas' nullable integer dtype/values/column/dataframe" depending on the context.

Member

@thomasjpfan if you agree with this proposal, I can give it a shot.

Member Author

I agree.

Member Author

I updated the PR, replacing the "IntegerArray" wording with "pandas' nullable ...".

@jorisvandenbossche
Member

General comment: although integer columns will certainly be the most common use case for scikit-learn, pd.NA is not only used for nullable integer columns in pandas, but also for the dedicated string data type and for the nullable boolean dtype. So we might want to consider more general handling of this not specific to integers in the future.
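For instance, the same pd.NA marker appears in the nullable boolean and dedicated string dtypes (pandas >= 1.0):

```python
import pandas as pd

b = pd.Series([True, False, None], dtype="boolean")
s = pd.Series(["a", None], dtype="string")

print(b[2] is pd.NA)  # True: same missing-value marker as nullable integers
print(s[1] is pd.NA)  # True
```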

@ogrisel
Member

ogrisel commented Apr 28, 2020

So we might want to consider more general handling of this not specific to integers in the future.

In particular for categorical encoders with missing values, but this is technically a very different and complementary change, I believe.

@ogrisel ogrisel merged commit acbe13c into scikit-learn:master Apr 28, 2020
5 checks passed
@ogrisel
Member

ogrisel commented Apr 28, 2020

Merged!

@thomasjpfan
Member Author

#17010

Successfully merging this pull request may close these issues.

Add support pd.Na in preprocessing module
7 participants