MRG respect dtypes in pandas dataframes if homogeneous #15094

Merged

amueller commented Sep 25, 2019

Fixes #15093.
Does that deserve/need a whatsnew?

jnothman left a comment

Yes, might as well have a change log entry

@jnothman jnothman closed this Sep 25, 2019
@jnothman jnothman reopened this Sep 25, 2019

jnothman commented Sep 25, 2019

Misfire

amueller commented Sep 26, 2019

EDIT: never mind.

amueller commented Sep 26, 2019

Correction: passing float16 to check_array(dtype=FLOAT_DTYPES) works as expected (result is float16), but passing int16 results in float64, which is somewhat unexpected.
Looks like np.common_type(any int) is float64, so maybe this is actually "the correct" behavior?
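
The observation above can be reproduced with NumPy alone. This is a minimal sketch with arbitrary array values; it only exercises `np.common_type` and does not call `check_array`.

```python
import numpy as np

# np.common_type always returns an inexact (floating point) scalar type.
# Integer inputs are promoted to at least float64; float16 is preserved.
int16_result = np.common_type(np.array([1, 2, 3], dtype=np.int16))
float16_result = np.common_type(np.array([1, 2, 3], dtype=np.float16))

print(int16_result)    # <class 'numpy.float64'>
print(float16_result)  # <class 'numpy.float16'>
```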

amueller commented Sep 26, 2019

OK, I think I'm confused about whether np.result_type or np.common_type is the right thing to do. I'm now tending towards np.result_type.
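
For comparison, here is a small sketch of how the two functions differ on the same inputs. The variable names are illustrative only: `np.result_type` follows NumPy's promotion rules and keeps integers as integers, while `np.common_type` always lands on a floating point type.

```python
import numpy as np

int16, int32, float32 = np.dtype("int16"), np.dtype("int32"), np.dtype("float32")

# result_type works on dtypes and keeps an all-integer input integral.
ints_only = np.result_type(int16, int32)      # int32
# int16 fits losslessly in float32, but int32 does not, so it widens to float64.
small_mix = np.result_type(int16, float32)    # float32
large_mix = np.result_type(int32, float32)    # float64

# common_type works on arrays and never returns an integer type.
ints_common = np.common_type(np.zeros(1, dtype=int16), np.zeros(1, dtype=int32))

print(ints_only, small_mix, large_mix, ints_common)
```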

amueller commented Sep 26, 2019

ok now this resolves anything pandas-related. It leaves the numpy-casting as it was, so we're still casting int32 to float64, not float32.

jnothman commented Sep 26, 2019

CI failures, but I agree with your changes

amueller commented Sep 27, 2019

If dtypes contains pandas dtypes then result_type doesn't work, so I think ideally we'd get the corresponding numpy dtype for the pandas dtype here.

amueller commented Sep 27, 2019

It's actually basically impossible to correctly sniff out the types right now:
pandas-dev/pandas#22791 but I think the solution I just pushed should be ok for now (better than master lol).
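
One conservative way to handle this is sketched below. It is not necessarily what was pushed in this PR: the helper name `homogeneous_numpy_dtype` and the `FakeExtensionDtype` stand-in are hypothetical. The idea is to call `np.result_type` only when every column dtype is a plain NumPy dtype and to give up otherwise.

```python
import numpy as np


def homogeneous_numpy_dtype(dtypes):
    """Return the promoted NumPy dtype, or None if any entry is not a NumPy dtype.

    Hypothetical helper for illustration; not part of scikit-learn's API.
    """
    dtypes = list(dtypes)
    if dtypes and all(isinstance(dt, np.dtype) for dt in dtypes):
        return np.result_type(*dtypes)
    return None


class FakeExtensionDtype:
    """Stand-in for a pandas extension dtype (e.g. a categorical dtype)."""


all_int16 = homogeneous_numpy_dtype([np.dtype("int16")] * 3)
mixed = homogeneous_numpy_dtype([np.dtype("int16"), FakeExtensionDtype()])

print(all_int16)  # int16
print(mixed)      # None
```

With a real DataFrame `df`, the input would be `df.dtypes`; a `None` result signals that the original dtype cannot be determined before conversion.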

amueller commented Sep 27, 2019

green again yay

# check that we handle pandas dtypes in a semi-reasonable way
# this is actually tricky because we can't really know that this
# should be integer ahead of converting it.
assert (check_array(pd.DataFrame([pd.Categorical([1, 2, 3])])).dtype

thomasjpfan Sep 28, 2019

For completeness, should we test for dtype=FLOAT_DTYPES as well?

amueller Oct 4, 2019

and check what? That it's float64?

thomasjpfan Oct 4, 2019

Right above this check, we check that an int16 dataframe is cast to float64. It seems reasonable to test that this categorical goes to float64 as well.

amueller Oct 4, 2019

done.

amueller commented Oct 4, 2019

@jnothman still good?

amueller added 3 commits Oct 4, 2019
…r/scikit-learn into respect_pandas_homogeneous_dtype
@jnothman jnothman merged commit b906078 into scikit-learn:master Oct 8, 2019
19 checks passed
LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc artifact Link to 0/doc/_changed.html
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 96.84%)
codecov/project 96.84% (+<.01%) compared to 871b251
scikit-learn.scikit-learn Build #20191004.23 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) Linux py35_ubuntu_atlas succeeded
scikit-learn.scikit-learn (Linux pylatest_conda_mkl) Linux pylatest_conda_mkl succeeded
scikit-learn.scikit-learn (Linux pylatest_pip_openblas_pandas) Linux pylatest_pip_openblas_pandas succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) Linux32 py35_ubuntu_atlas_32bit succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) Windows py35_pip_openblas_32bit succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) Windows py37_conda_mkl succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) macOS pylatest_conda_mkl succeeded

amueller commented Oct 8, 2019

Thanks!
