Skip to content

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Related to #22231

What does this implement/fix? Explain your changes.

This PR allows us to better handle DataFrames with boolean dtypes, when dtype=None. It also refactors the code slightly to make it easier to reason about. With the following snippet:

import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame(
    {"bool": [True, False, True], "int": [1, 2, 3]},
    columns=["bool", "int"],
)

array = check_array(df, dtype=None)
  • On main, the dtype of array is object
  • With this PR, the dtype of array is int64.

Another example is:

import pandas as pd
from sklearn.utils.validation import check_array

df = pd.DataFrame(
    {"bool": [True, False, True], "int": [1, 2, 3]},
    columns=["bool", "int"],
)

array = check_array(df, dtype="numeric")
  • On main, the dtype of array is float.
  • With this PR, the dtype of array is int64.

Any other comments?

I think the semantics of dtype=None is weird when the dataframe has mixed dtypes. There is really no way to "preserve the original dtype". With this PR, the semantics become "use np.result_type to make a good guess".

CC @glemaitre

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@thomasjpfan
Copy link
Member Author

I am merging this PR, because it already has two +1s.

@thomasjpfan thomasjpfan merged commit 8d79358 into scikit-learn:main Apr 24, 2022
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…it-learn#22237)

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants