ENH Adds pandas IntegerArray support to check_array #16508
Conversation
@TomAugspurger could you please maybe have a look at this one?
Looks reasonable. Just to confirm: the desired behavior is to coerce NA to NaN when …
sklearn/utils/validation.py (Outdated)
```python
if isinstance(dtype_iter, (Int8Dtype, Int16Dtype,
                           Int32Dtype, Int64Dtype)):
```
We also have unsigned variants. Is that relevant anywhere in check_array?
sklearn/utils/validation.py (Outdated)
```python
# convert dataframe with Integer extension arrays with floats
array = array.astype(None)
```
I suspect that IntegerArray.astype(None) returning a float64 ndarray isn't deliberate, right @jorisvandenbossche?
I don't think that is deliberate, indeed (I would say that's a bug .. although it seems consistent with numpy, where np.dtype(None) is float64).
It is a compromise when trying to push extension arrays into a homogeneous numpy array, while making it easier for users to use pandas + scikit-learn. I am slightly uncomfortable with this and am open to another approach. This PR uses the following to coerce:

```python
df = pd.DataFrame(
    {'a': pd.Series([1, 2, 3, None], dtype="UInt8"),
     'b': pd.Series([1, 4, 5, 10], dtype="int")})
df.dtypes
# a    UInt8
# b    int64
# dtype: object

# converts pd.NA to np.nan
df_other = df.astype(None)
df_other.dtypes
# a    float64
# b    float64
# dtype: object
```

which allows us to call …
sklearn/utils/validation.py
Outdated
@@ -487,6 +502,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True, | |||
# list of accepted types. | |||
dtype = dtype[0] | |||
|
|||
if has_pd_interger_array: | |||
# If there are any pandas integer extension arrays, | |||
array = array.astype(None) |
I think .astype(np.dtype('float64')) would be best (assuming you always want 64-bit here).
Or to_numpy()? That's even more explicit (and certainly given that astype behaviour might still change, e.g. to always return a new ExtensionArray), although only available for more recent versions? (which might not be a problem, since this is still experimental anyway)
Looking at the actual code now, array here will typically be a DataFrame (in the case of pandas), and it seems that for DataFrame.to_numpy we didn't add the na_value keyword .. (that's a pity, as I think df.to_numpy(dtype='float64', na_value=np.nan) would have been the most robust going forward).
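Since DataFrame.to_numpy lacked the na_value keyword at the time, one workaround is to go column by column through Series.to_numpy, which does accept na_value (pandas >= 1.0). A sketch with illustrative names, not the PR's code (later pandas releases did add na_value to DataFrame.to_numpy):

```python
import numpy as np
import pandas as pd

def frame_to_float_array(df):
    # Convert each column via Series.to_numpy, which accepts na_value,
    # then stack the float64 columns into one 2-D array.
    cols = [df[col].to_numpy(dtype="float64", na_value=np.nan)
            for col in df.columns]
    return np.column_stack(cols)

df = pd.DataFrame({"a": pd.Series([1, None], dtype="Int64"),
                   "b": pd.Series([2.0, 3.0])})
arr = frame_to_float_array(df)
print(arr.dtype)  # float64
```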
Note that you can already have this problem with dataframes with the standard numpy dtypes if you have e.g. a mix of float and int columns (which will all get converted to a single float array).
codecov complaints seem unrelated. LGTM, thanks @thomasjpfan
Add to change log, please.
What's the story with imputers?
LGTM besides the following comments.
Please also update the docstring of force_all_finite to explicitly mention that pd.NA is accepted and will be converted to np.nan values in a np.float64 array (don't forget the versionchanged directive).
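For illustration, such a docstring note might look like the sketch below (wording and placement are my own, not the merged text; the parameter description is abbreviated):

```rst
force_all_finite : boolean or 'allow-nan', (default=True)
    Whether to raise an error on np.inf and np.nan in array. ...

    .. versionchanged:: 0.23
       Accepts `pd.NA` and converts it into `np.nan`.
```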
sklearn/utils/validation.py (Outdated)
```python
                Int32Dtype, Int64Dtype,
                UInt8Dtype, UInt16Dtype,
                UInt32Dtype, UInt64Dtype)):
    has_pd_interger_array = True
```
Suggested change:
```diff
-    has_pd_interger_array = True
+    has_pd_integer_array = True
```
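As an aside, rather than enumerating every Int*/UInt* dtype class, the same check can be written generically with the pandas dtype API. A sketch (assuming pandas >= 1.0; the helper name is mine, not the PR's code):

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

def is_nullable_integer_dtype(dtype):
    # Nullable integer dtypes are extension dtypes whose numpy-style
    # kind is 'i' (signed) or 'u' (unsigned).
    return is_extension_array_dtype(dtype) and getattr(dtype, "kind", "") in "iu"

print(is_nullable_integer_dtype(pd.Int64Dtype()))       # True
print(is_nullable_integer_dtype(pd.UInt8Dtype()))       # True
print(is_nullable_integer_dtype(pd.Series([1]).dtype))  # False (plain int64)
```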
sklearn/utils/validation.py (Outdated)
```python
    if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
        dtype_orig = np.result_type(*dtypes_orig)

    if has_pd_interger_array:
```
Suggested change:
```diff
-    if has_pd_interger_array:
+    if has_pd_integer_array:
```
sklearn/utils/validation.py (Outdated)
```python
@@ -462,15 +462,33 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
                    "It will be converted to a dense numpy array."
                )

    has_pd_interger_array = False
```
Suggested change:
```diff
-    has_pd_interger_array = False
+    has_pd_integer_array = False
```
doc/whats_new/v0.23.rst (Outdated)
```rst
@@ -473,6 +473,9 @@ Changelog
   matrix from a pandas DataFrame that contains only `SparseArray`s.
   :pr:`16728` by `Thomas Fan`_.

- |Enhancement| :func:`utils.validation.check_array` supports pandas
  IntegerArray. :pr:`16508` by `Thomas Fan`_.
```
Maybe this could be a bit more descriptive: utils.validation.check_array accepts pandas IntegerArray when force_all_finite is set to False or 'allow-nan', in which case the data is converted to floating point values with pd.NA values replaced by np.nan.
Maybe you could also mention that, as a consequence, all sklearn.preprocessing transformers (also known as "scalers") that accept numeric inputs with missing values represented as np.nan now also accept being directly fed pandas dataframes with pandas.Int*- or pandas.UInt*-typed columns that use pandas.NA as the missing value marker.
They might work as a result of this PR, but we need to update their own tests and the changelog should also mention them. I believe the …
A few more comments:

- the docstring of other functions/methods that expose the force_all_finite option in the public API might also need to be updated similarly:

  ```
  $ git grep "force_all_finite :"
  sklearn/metrics/pairwise.py: force_all_finite : boolean or 'allow-nan', (default=True)
  sklearn/metrics/pairwise.py: force_all_finite : boolean or 'allow-nan', (default=True)
  sklearn/utils/validation.py: force_all_finite : boolean or 'allow-nan', (default=True)
  sklearn/utils/validation.py: force_all_finite : boolean or 'allow-nan', (default=True)
  sklearn/utils/validation.py: force_all_finite : boolean or 'allow-nan', (default=True)
  sklearn/utils/validation.py: force_all_finite : boolean or 'allow-nan', (default=True)
  ```
- we need to extend the test to check for the resulting dtype of this conversion:
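A sketch of the kind of dtype check being asked for (pytest-style; the test name is illustrative, and it exercises the underlying pandas conversion rather than calling scikit-learn's actual check_array):

```python
import numpy as np
import pandas as pd

def test_nullable_integer_conversion_dtype():
    # The converted output should always be float64, with pd.NA -> np.nan,
    # regardless of which nullable integer dtype the input column uses.
    for dtype in ["Int8", "Int16", "UInt8", "UInt32"]:
        df = pd.DataFrame({"a": pd.Series([1, None], dtype=dtype)})
        out = df.astype("float64").to_numpy()
        assert out.dtype == np.float64
        assert np.isnan(out[1, 0])

test_nullable_integer_conversion_dtype()
```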
Added tests to …
Maybe we should allow the users to pass …
There are new sphinx warnings in this PR:
Typo that breaks sphinx:
doc/whats_new/v0.23.rst (Outdated)
```rst
@@ -316,6 +316,10 @@ Changelog
   ``max_value`` and ``min_value``. Array-like inputs allow a different max and min to be specified
   for each feature. :pr:`16403` by :user:`Narendra Mukherjee <narendramukherjee>`.

- |Enhancement| :class:`impute.SimpleImputer`, :class:`impute.KNNImputer`, and
  :class:`impute.SimpleImputer` accepts pandas IntegerArray with nan values.
```
Suggested change:
```diff
- :class:`impute.SimpleImputer` accepts pandas IntegerArray with nan values.
+ :class:`impute.SimpleImputer` accepts pandas' nullable integer dtype with missing values.
```
Just a suggestion, but in general we try to speak about the dtype, as "IntegerArray" is only used under the hood and not directly visible to most users (unless you access the underlying values of a single column), and thus not necessarily a term that users are familiar with.
(but it makes it more verbose, though, ..)
@jorisvandenbossche that's a good point. There are other occurrences of "IntegerArray" in this PR. I think we should convert them all to "pandas' nullable integer dtype/values/column/dataframe" depending on the context.
@thomasjpfan if you agree with this proposal, I can give it a shot.
I agree.
I updated the PR, replacing the "IntegerArray" wording with "pandas' nullable ..."
General comment: although integer columns will certainly be the most common use case for scikit-learn, …
In particular for categorical encoders with missing values, but this is technically a very different and complementary change, I believe.
Merged!
Reference Issues/PRs
Resolves #16498
What does this implement/fix? Explain your changes.
Uses the name of the dtype to check if the dtype is a pandas IntegerArray.

Any other comments?
The __array__ interface for pandas' IntegerArray results in an object array:
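A minimal reproduction of that behavior (assuming pandas >= 1.0; np.asarray goes through the IntegerArray's __array__ method):

```python
import numpy as np
import pandas as pd

# Converting an IntegerArray to a numpy array yields dtype=object,
# because pd.NA has no native numpy representation.
arr = pd.array([1, 2, None], dtype="Int64")
np_arr = np.asarray(arr)
print(np_arr.dtype)  # object
```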