ENH Adds pandas IntegerArray support to check_array #16508

Merged
merged 24 commits into from
Apr 28, 2020
Changes from 21 commits
Commits
b936ffb
ENH Adds support for pandas IntegerArray
thomasjpfan Feb 20, 2020
625eea2
DOC Adds comment regarding dtype
thomasjpfan Feb 20, 2020
371d16e
ENH Checks for type explicity
thomasjpfan Feb 21, 2020
da39b41
CLN Address comments
thomasjpfan Feb 21, 2020
8bd32d8
Merge remote-tracking branch 'upstream/master' into pandas_pd_na_support
thomasjpfan Feb 21, 2020
f1d0a3e
Merge remote-tracking branch 'upstream/master' into pandas_pd_na_support
thomasjpfan Apr 22, 2020
50a0de9
CLN Adds support for unsigned
thomasjpfan Apr 22, 2020
0f43b16
CLN Clean up unneeded code
thomasjpfan Apr 22, 2020
a634d3a
CLN Address comments
thomasjpfan Apr 22, 2020
cc21a5f
CLN Slightly nicer
thomasjpfan Apr 22, 2020
5ef3985
Merge remote-tracking branch 'upstream/master' into pandas_pd_na_support
thomasjpfan Apr 23, 2020
c6273a5
DOC Adds whats new
thomasjpfan Apr 27, 2020
679c08e
TST Adds explicit checking
thomasjpfan Apr 27, 2020
0318eb9
CLN Suggestion
thomasjpfan Apr 27, 2020
752ed5a
Merge remote-tracking branch 'upstream/master' into pandas_pd_na_support
thomasjpfan Apr 27, 2020
39dc7a6
CLN Adds test for imputers
thomasjpfan Apr 27, 2020
c861463
TST Adds test for dtypes
thomasjpfan Apr 27, 2020
8ff73e4
ENH Better dtype checking
thomasjpfan Apr 27, 2020
690cfdb
DOC Adds note regarding pandas integerarray
thomasjpfan Apr 27, 2020
77bf3b5
TST Fix
thomasjpfan Apr 27, 2020
38873e6
Merge branch 'master' into pandas_pd_na_support
jnothman Apr 28, 2020
e2b699c
Apply suggestions from code review
ogrisel Apr 28, 2020
09d69b5
DOC Update to pands nullable integer dtype with missing values
thomasjpfan Apr 28, 2020
774eaba
Merge branch 'master' into pandas_pd_na_support
ogrisel Apr 28, 2020
20 changes: 20 additions & 0 deletions doc/whats_new/v0.23.rst
@@ -316,6 +316,10 @@ Changelog
``max_value`` and ``min_value``. Array-like inputs allow a different max and min to be specified
for each feature. :pr:`16403` by :user:`Narendra Mukherjee <narendramukherjee>`.

- |Enhancement| :class:`impute.SimpleImputer`, :class:`impute.KNNImputer`, and
:class:`impute.IterativeImputer` accepts pandas IntegerArray with nan values.
Member

@jorisvandenbossche jorisvandenbossche Apr 28, 2020

Suggested change
:class:`impute.IterativeImputer` accepts pandas IntegerArray with nan values.
:class:`impute.IterativeImputer` accepts pandas' nullable integer dtype with missing values.

Just a suggestion, but in general we try to speak about the dtype, as "IntegerArray" is only used under the hood and not directly visible to most users (unless you access the underlying values of a single column), and thus not necessarily a term that users are familiar with

Member

(but it makes it more verbose, though, ..)

Member

@jorisvandenbossche that's a good point. There are other occurrences of "IntegerArray" in this PR. I think we should convert them all to "pandas' nullable integer dtype/values/column/dataframe" depending on the context.

Member

@thomasjpfan if you agree with this proposal, I can give it a shot.

Member Author

I agree.

Member Author

I updated the PR, replacing the "IntegerArray" wording with "pandas' nullable ..."

:pr:`16508` by `Thomas Fan`_.

:mod:`sklearn.inspection`
.........................

@@ -485,6 +489,13 @@ Changelog
can now contain `None`, where `drop_idx_[i] = None` means that no category
is dropped for index `i`. :pr:`16585` by :user:`Chiara Marmo <cmarmo>`.

- |Enhancement| :class:`preprocessing.MaxAbsScaler`,
:class:`preprocessing.MinMaxScaler`, :class:`preprocessing.StandardScaler`,
:class:`preprocessing.PowerTransformer`,
:class:`preprocessing.QuantileTransformer`,
:class:`preprocessing.RobustScaler` now support pandas IntegerArrays with
nan values.

- |Efficiency| :class:`preprocessing.OneHotEncoder` is now faster at
transforming. :pr:`15762` by `Thomas Fan`_.

@@ -566,6 +577,15 @@ Changelog
matrix from a pandas DataFrame that contains only `SparseArray` columns.
:pr:`16728` by `Thomas Fan`_.

- |Enhancement| :func:`utils.validation.check_array` supports pandas
IntegerArray when `force_all_finite` is set to `False` or `'allow-nan'`
in which case the data is converted to floating point values where `pd.NA`
values are replaced by `np.nan`. As a consequence, all
:mod:`sklearn.preprocessing` transformers that accept numeric inputs with
missing values represented as `np.nan` now also accept being directly fed
pandas dataframes with `pd.Int*` or `pd.UInt*` typed columns that use
`pd.NA` as a missing value marker. :pr:`16508` by `Thomas Fan`_.

- |API| Passing classes to :func:`utils.estimator_checks.check_estimator` and
:func:`utils.estimator_checks.parametrize_with_checks` is now deprecated,
and support for classes will be removed in 0.24. Pass instances instead.
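
The conversion described in the `check_array` changelog entry above can be illustrated with pandas alone. This is a minimal sketch, assuming pandas 1.0 or later; the column names and values are made up for illustration, and the snippet shows only the cast from `pd.NA` to `np.nan`, not scikit-learn's validation code.

```python
import numpy as np
import pandas as pd

# Columns using pandas' nullable integer dtypes; None is stored as pd.NA.
df = pd.DataFrame({
    "a": pd.array([1, None, 3], dtype="Int16"),
    "b": pd.array([4, 5, None], dtype="UInt8"),
})

# Casting to a float dtype replaces pd.NA with np.nan, the missing-value
# marker that NaN-aware estimators expect.
X = df.astype("float64").to_numpy()
```

After the cast, `X` is an ordinary floating point array, so downstream code that tests for `np.nan` needs no knowledge of `pd.NA`.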
8 changes: 6 additions & 2 deletions sklearn/impute/_base.py
@@ -128,7 +128,9 @@ class SimpleImputer(_BaseImputer):
----------
missing_values : number, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of
`missing_values` will be imputed.
`missing_values` will be imputed. For pandas dataframes with
IntegerArray and `pd.NA` values, `missing_values` should be set to
`np.nan`, since `pd.NA` will be converted to `np.nan`.
ogrisel marked this conversation as resolved.

strategy : string, default='mean'
The imputation strategy.
@@ -477,7 +479,9 @@ class MissingIndicator(TransformerMixin, BaseEstimator):
missing_values : number, string, np.nan (default) or None
The placeholder for the missing values. All occurrences of
`missing_values` will be indicated (True in the output array), the
other values will be marked as False.
other values will be marked as False. For pandas dataframes with
IntegerArray and `pd.NA` values, `missing_values` should be set to
`np.nan`, since `pd.NA` will be converted to `np.nan`.
ogrisel marked this conversation as resolved.

features : str, default=None
Whether the imputer mask should represent all or a subset of
4 changes: 3 additions & 1 deletion sklearn/impute/_iterative.py
@@ -54,7 +54,9 @@ class IterativeImputer(_BaseImputer):

missing_values : int, np.nan, default=np.nan
The placeholder for the missing values. All occurrences of
``missing_values`` will be imputed.
``missing_values`` will be imputed. For pandas dataframes with
IntegerArray and `pd.NA` values, `missing_values` should be set to
`np.nan`, since `pd.NA` will be converted to `np.nan`.
ogrisel marked this conversation as resolved.

sample_posterior : boolean, default=False
Whether to sample from the (Gaussian) predictive posterior of the
4 changes: 3 additions & 1 deletion sklearn/impute/_knn.py
@@ -32,7 +32,9 @@ class KNNImputer(_BaseImputer):
----------
missing_values : number, string, np.nan or None, default=`np.nan`
The placeholder for the missing values. All occurrences of
`missing_values` will be imputed.
``missing_values`` will be imputed. For pandas dataframes with
IntegerArray and `pd.NA` values, `missing_values` should be set to
`np.nan`, since `pd.NA` will be converted to `np.nan`.
ogrisel marked this conversation as resolved.

n_neighbors : int, default=5
Number of neighboring samples to use for imputation.
29 changes: 29 additions & 0 deletions sklearn/impute/tests/test_common.py
@@ -84,3 +84,32 @@ def test_imputers_add_indicator_sparse(imputer, marker):
imputer.set_params(add_indicator=False)
X_trans_no_indicator = imputer.fit_transform(X)
assert_allclose_dense_sparse(X_trans[:, :-4], X_trans_no_indicator)


# ConvergenceWarning will be raised by the IterativeImputer
@pytest.mark.filterwarnings("ignore::sklearn.exceptions.ConvergenceWarning")
@pytest.mark.parametrize("imputer", IMPUTERS)
@pytest.mark.parametrize("add_indicator", [True, False])
def test_imputers_pandas_na_integer_array_support(imputer, add_indicator):
    # Test pandas IntegerArray with pd.NA
    pd = pytest.importorskip('pandas', minversion="1.0")
    marker = np.nan
    imputer = imputer.set_params(add_indicator=add_indicator,
                                 missing_values=marker)

    X = np.array([
        [marker, 1, 5, marker, 1],
        [2, marker, 1, marker, 2],
        [6, 3, marker, marker, 3],
        [1, 2, 9, marker, 4]
    ])
    # fit on numpy array
    X_trans_expected = imputer.fit_transform(X)

    # Creates dataframe with IntegerArrays with pd.NA
    X_df = pd.DataFrame(X, dtype="Int16", columns=["a", "b", "c", "d", "e"])

    # fit on pandas dataframe with IntegerArrays
    X_trans = imputer.fit_transform(X_df)

    assert_allclose(X_trans_expected, X_trans)
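
From the user's side, the behaviour exercised by this test might look like the following. It is a minimal sketch, assuming scikit-learn 0.23 or later and pandas 1.0 or later; the frame contents are invented for illustration. As the updated docstrings recommend, `missing_values` stays at `np.nan` even though the frame holds `pd.NA`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Nullable integer columns; None is stored as pd.NA.
X_df = pd.DataFrame({
    "a": pd.array([1, None, 3], dtype="Int16"),
    "b": pd.array([4, 6, None], dtype="Int16"),
})

# pd.NA is converted to np.nan during input validation, so np.nan is the
# marker the imputer should look for.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X_df)
```

Each missing entry is replaced by the mean of the observed values in its column.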
23 changes: 15 additions & 8 deletions sklearn/metrics/pairwise.py
@@ -100,17 +100,20 @@ def check_pairwise_arrays(X, Y, *, precomputed=False, dtype=None,
raise an error.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in array. The
Whether to raise an error on np.inf, np.nan, pd.NA in array. The
possibilities are:

- True: Force all values of array to be finite.
- False: accept both np.inf and np.nan in array.
- 'allow-nan': accept only np.nan values in array. Values cannot
be infinite.
- False: accepts np.inf, np.nan, pd.NA in array.
- 'allow-nan': accepts only np.nan and pd.NA values in array. Values
cannot be infinite.

.. versionadded:: 0.22
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

copy : bool
Whether a forced copy will be triggered. If copy=False, a copy might
be triggered by a conversion.
@@ -1691,15 +1694,19 @@ def pairwise_distances(X, Y=None, metric="euclidean", *, n_jobs=None,
for more details.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in array. The
Whether to raise an error on np.inf, np.nan, pd.NA in array. The
possibilities are:

- True: Force all values of array to be finite.
- False: accept both np.inf and np.nan in array.
- 'allow-nan': accept only np.nan values in array. Values cannot
be infinite.
- False: accepts np.inf, np.nan, pd.NA in array.
- 'allow-nan': accepts only np.nan and pd.NA values in array. Values
cannot be infinite.

.. versionadded:: 0.22
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

**kwds : optional keyword parameters
Any further parameters are passed directly to the distance function.
30 changes: 30 additions & 0 deletions sklearn/preprocessing/tests/test_common.py
@@ -126,3 +126,33 @@ def test_missing_value_handling(est, func, support_sparse, strictly_positive):
Xt_inv_sp = est_sparse.inverse_transform(Xt_sp)
assert len(records) == 0
assert_allclose(Xt_inv_sp.A, Xt_inv_dense)


@pytest.mark.parametrize(
    "est, func",
    [(MaxAbsScaler(), maxabs_scale),
     (MinMaxScaler(), minmax_scale),
     (StandardScaler(), scale),
     (StandardScaler(with_mean=False), scale),
     (PowerTransformer('yeo-johnson'), power_transform),
     (PowerTransformer('box-cox'), power_transform,),
     (QuantileTransformer(n_quantiles=3), quantile_transform),
     (RobustScaler(), robust_scale),
     (RobustScaler(with_centering=False), robust_scale)]
)
def test_missing_value_pandas_na_support(est, func):
    # Test pandas IntegerArray with pd.NA
    pd = pytest.importorskip('pandas', minversion="1.0")

    X = np.array([[1, 2, 3, np.nan, np.nan, 4, 5, 1],
                  [np.nan, np.nan, 8, 4, 6, np.nan, np.nan, 8],
                  [1, 2, 3, 4, 5, 6, 7, 8]]).T

    # Creates dataframe with IntegerArrays with pd.NA
    X_df = pd.DataFrame(X, dtype="Int16", columns=['a', 'b', 'c'])
    X_df['c'] = X_df['c'].astype('int')

    X_trans = est.fit_transform(X)
    X_df_trans = est.fit_transform(X_df)

    assert_allclose(X_trans, X_df_trans)
31 changes: 31 additions & 0 deletions sklearn/utils/tests/test_validation.py
@@ -349,6 +349,37 @@ def test_check_array():
check_array(X, dtype="numeric")


@pytest.mark.parametrize("pd_dtype", ["Int8", "Int16", "UInt8", "UInt16"])
@pytest.mark.parametrize("dtype, expected_dtype", [
    ([np.float32, np.float64], np.float32),
    (np.float64, np.float64),
    ("numeric", np.float64),
])
def test_check_array_pandas_na_support(pd_dtype, dtype, expected_dtype):
    # Test pandas IntegerArray with pd.NA
    pd = pytest.importorskip('pandas', minversion="1.0")

    X_np = np.array([[1, 2, 3, np.nan, np.nan],
                     [np.nan, np.nan, 8, 4, 6],
                     [1, 2, 3, 4, 5]]).T

    # Creates dataframe with IntegerArrays with pd.NA
    X = pd.DataFrame(X_np, dtype=pd_dtype, columns=['a', 'b', 'c'])
    # column c has no nans
    X['c'] = X['c'].astype('float')
    X_checked = check_array(X, force_all_finite='allow-nan', dtype=dtype)
    assert_allclose(X_checked, X_np)
    assert X_checked.dtype == expected_dtype

    X_checked = check_array(X, force_all_finite=False, dtype=dtype)
    assert_allclose(X_checked, X_np)
    assert X_checked.dtype == expected_dtype

    msg = "Input contains NaN, infinity"
    with pytest.raises(ValueError, match=msg):
        check_array(X, force_all_finite=True)


def test_check_array_pandas_dtype_object_conversion():
# test that data-frame like objects with dtype object
# get converted
70 changes: 49 additions & 21 deletions sklearn/utils/validation.py
@@ -135,17 +135,20 @@ def as_float_array(X, *, copy=True, force_all_finite=True):
returned if X's dtype is not a floating point type.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in X. The possibilities
are:
Whether to raise an error on np.inf, np.nan, pd.NA in X. The
possibilities are:

- True: Force all values of X to be finite.
- False: accept both np.inf and np.nan in X.
- 'allow-nan': accept only np.nan values in X. Values cannot be
infinite.
- False: accepts np.inf, np.nan, pd.NA in X.
- 'allow-nan': accepts only np.nan and pd.NA values in X. Values cannot
be infinite.

.. versionadded:: 0.20
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

Returns
-------
XT : {array, sparse matrix}
@@ -317,17 +320,20 @@ def _ensure_sparse_format(spmatrix, accept_sparse, dtype, copy,
be triggered by a conversion.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in X. The possibilities
are:
Whether to raise an error on np.inf, np.nan, pd.NA in X. The
possibilities are:

- True: Force all values of X to be finite.
- False: accept both np.inf and np.nan in X.
- 'allow-nan': accept only np.nan values in X. Values cannot be
infinite.
- False: accepts np.inf, np.nan, pd.NA in X.
- 'allow-nan': accepts only np.nan and pd.NA values in X. Values cannot
be infinite.

.. versionadded:: 0.20
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

Returns
-------
spmatrix_converted : scipy sparse matrix.
@@ -438,19 +444,20 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
be triggered by a conversion.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in array. The
Whether to raise an error on np.inf, np.nan, pd.NA in array. The
possibilities are:

- True: Force all values of array to be finite.
- False: accept both np.inf and np.nan in array.
- 'allow-nan': accept only np.nan values in array. Values cannot
be infinite.

For object dtyped data, only np.nan is checked and not np.inf.
- False: accepts np.inf, np.nan, pd.NA in array.
- 'allow-nan': accepts only np.nan and pd.NA values in array. Values
cannot be infinite.

.. versionadded:: 0.20
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

ensure_2d : boolean (default=True)
Whether to raise a value error if array is not 2D.
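
The three `force_all_finite` modes documented above can be sketched with plain NumPy. `assert_finite_sketch` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it assumes any `pd.NA` has already been converted to `np.nan`.

```python
import numpy as np


def assert_finite_sketch(X, force_all_finite=True):
    """Hypothetical helper mirroring the three documented modes."""
    if force_all_finite not in (True, False, "allow-nan"):
        raise ValueError('force_all_finite should be a bool or "allow-nan"')
    X = np.asarray(X, dtype=np.float64)
    if force_all_finite is False:
        # Accept np.inf and np.nan alike.
        return X
    if force_all_finite == "allow-nan":
        # np.nan is tolerated, infinite values are not.
        invalid = np.isinf(X).any()
    else:
        # Every value must be finite.
        invalid = not np.isfinite(X).all()
    if invalid:
        raise ValueError("Input contains NaN or infinity")
    return X


# NaN passes in 'allow-nan' mode but would raise with the default True.
checked = assert_finite_sketch([1.0, np.nan], force_all_finite="allow-nan")
```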

@@ -491,6 +498,7 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
    # check if the object contains several dtypes (typically a pandas
    # DataFrame), and store them. If not, store None.
    dtypes_orig = None
    has_pd_integer_array = False
    if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
        # throw warning if columns are sparse. If all columns are sparse, then
        # array.sparse exists and sparsity will be preserved (later).
@@ -508,6 +516,19 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
        for i, dtype_iter in enumerate(dtypes_orig):
            if dtype_iter.kind == 'b':
                dtypes_orig[i] = np.object
            elif dtype_iter.name.startswith(("Int", "UInt")):
                # name looks like an Integer Extension Array, now check for
                # the dtype
                with suppress(ImportError):
                    from pandas import (Int8Dtype, Int16Dtype,
                                        Int32Dtype, Int64Dtype,
                                        UInt8Dtype, UInt16Dtype,
                                        UInt32Dtype, UInt64Dtype)
                    if isinstance(dtype_iter, (Int8Dtype, Int16Dtype,
                                               Int32Dtype, Int64Dtype,
                                               UInt8Dtype, UInt16Dtype,
                                               UInt32Dtype, UInt64Dtype)):
                        has_pd_integer_array = True

        if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
            dtype_orig = np.result_type(*dtypes_orig)
@@ -528,6 +549,10 @@ def check_array(array, accept_sparse=False, *, accept_large_sparse=True,
            # list of accepted types.
            dtype = dtype[0]

    if has_pd_integer_array:
        # If there are any pandas integer extension arrays, cast the input
        # to the requested dtype (pd.NA becomes np.nan for float dtypes).
        array = array.astype(dtype)

    if force_all_finite not in (True, False, 'allow-nan'):
        raise ValueError('force_all_finite should be a bool or "allow-nan"'
                         '. Got {!r} instead'.format(force_all_finite))
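
The detection step in the hunks above (a cheap name check followed by an `isinstance` check, then a cast) can be tried outside scikit-learn. This is a sketch under stated assumptions: `has_nullable_integer_column` and `NULLABLE_INT_DTYPES` are hypothetical names introduced here, and pandas 1.0 or later is assumed.

```python
import pandas as pd

# The eight nullable integer dtypes that pandas exposes at the top level.
NULLABLE_INT_DTYPES = (
    pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype, pd.Int64Dtype,
    pd.UInt8Dtype, pd.UInt16Dtype, pd.UInt32Dtype, pd.UInt64Dtype,
)


def has_nullable_integer_column(df):
    """Hypothetical helper: True if any column has a nullable integer dtype."""
    return any(
        # The name check is cheap; isinstance confirms the actual dtype.
        dtype.name.startswith(("Int", "UInt"))
        and isinstance(dtype, NULLABLE_INT_DTYPES)
        for dtype in df.dtypes
    )


df = pd.DataFrame({
    "a": pd.array([1, None], dtype="Int32"),
    "b": [0.5, 1.5],
})
found = has_nullable_integer_column(df)
if found:
    # Casting the whole frame turns pd.NA into np.nan.
    df = df.astype("float64")
```

NumPy dtypes have lowercase names such as `int64`, so the case-sensitive prefix test skips them before the `isinstance` check runs.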
@@ -712,18 +737,21 @@ def check_X_y(X, y, accept_sparse=False, *, accept_large_sparse=True,
be triggered by a conversion.

force_all_finite : boolean or 'allow-nan', (default=True)
Whether to raise an error on np.inf and np.nan in X. This parameter
does not influence whether y can have np.inf or np.nan values.
Whether to raise an error on np.inf, np.nan, pd.NA in X. This parameter
does not influence whether y can have np.inf, np.nan, pd.NA values.
The possibilities are:

- True: Force all values of X to be finite.
- False: accept both np.inf and np.nan in X.
- 'allow-nan': accept only np.nan values in X. Values cannot be
infinite.
- False: accepts np.inf, np.nan, pd.NA in X.
- 'allow-nan': accepts only np.nan or pd.NA values in X. Values cannot
be infinite.

.. versionadded:: 0.20
``force_all_finite`` accepts the string ``'allow-nan'``.

.. versionchanged:: 0.23
Accepts `pd.NA` and converts it into `np.nan`

ensure_2d : boolean (default=True)
Whether to raise a value error if X is not 2D.
