ENH check_array returns numeric array w/ mixed typed dataframes #22237

thomasjpfan · 2022-01-17T19:40:24Z

Reference Issues/PRs

Related to #22231

What does this implement/fix? Explain your changes.

This PR improves how check_array handles DataFrames that contain boolean columns when dtype=None. It also refactors the code slightly to make it easier to reason about. With the following snippet:

import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame(
    {"bool": [True, False, True], "int": [1, 2, 3]},
    columns=["bool", "int"],
)

array = check_array(df, dtype=None)

On main, the dtype of array is object.
With this PR, the dtype of array is int64.

Another example is:

import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame(
    {"bool": [True, False, True], "int": [1, 2, 3]},
    columns=["bool", "int"],
)

array = check_array(df, dtype="numeric")

On main, the dtype of array is float.
With this PR, the dtype of array is int64.

Any other comments?

I think the semantics of dtype=None is weird when the dataframe has mixed dtypes. There is really no way to "preserve the original dtype". With this PR, the semantics become "use np.result_type to make a good guess".
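To illustrate the "good guess" semantics, here is a minimal sketch of how np.result_type promotes a set of column dtypes to one common dtype. It uses NumPy only; the helper name common_dtype is hypothetical and is not part of scikit-learn, and the sketch does not reproduce the actual check_array code path.

```python
import numpy as np


def common_dtype(dtypes):
    """Return the dtype NumPy would promote all given dtypes to.

    Hypothetical helper for illustration only; not scikit-learn's
    implementation.
    """
    return np.result_type(*dtypes)


# A bool column next to an int64 column promotes to int64.
bool_and_int = common_dtype([np.dtype(bool), np.dtype("int64")])

# A bool column next to a float32 column promotes to float32.
bool_and_float = common_dtype([np.dtype(bool), np.dtype("float32")])

# An int64 column next to a float32 column promotes to float64.
int_and_float = common_dtype([np.dtype("int64"), np.dtype("float32")])

print(bool_and_int, bool_and_float, int_and_float)
```

The bool-plus-int64 case corresponds to the DataFrame in the snippets above, which is why the resulting array can be int64 rather than object.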

CC @glemaitre


glemaitre

LGTM


thomasjpfan · 2022-04-24T13:02:50Z

I am merging this PR, because it already has two +1s.


ENH check_array with dtype=None returns numeric arrays with bool dataframes

17168af

github-actions bot added the module:utils label Jan 17, 2022

thomasjpfan added 3 commits January 17, 2022 14:41

DOC Adds whats new nubmer

25cf0c0

FIX Fixes 32bit testing issues

34c24ce

FIX Better support for pandas < 1.0

f1a4bcc

glemaitre approved these changes Jan 21, 2022


sklearn/utils/tests/test_validation.py (outdated, resolved)

Update sklearn/utils/tests/test_validation.py

82d8b90

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

amueller approved these changes Mar 28, 2022


thomasjpfan added 3 commits April 23, 2022 23:40

Merge remote-tracking branch 'upstream/main' into boolean_check_array

7bca0b0

DOC Adds new line

abfa1f6

CLN Reduce diff

69b128b

thomasjpfan merged commit 8d79358 into scikit-learn:main Apr 24, 2022

jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022

ENH check_array returns numeric array w/ mixed typed dataframes (scikit-learn#22237)

e2247ed

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
