MRG respect dtypes in pandas dataframes if homogeneous #15094

Merged

amueller
Member

Fixes #15093.
Does that deserve/need a whatsnew?

Member

@jnothman jnothman left a comment

Yes, might as well have a change log entry

@jnothman jnothman closed this Sep 25, 2019
@jnothman jnothman reopened this Sep 25, 2019
@jnothman
Member

Misfire

@amueller
Member Author

amueller commented Sep 26, 2019

EDIT: never mind.

@amueller
Member Author

amueller commented Sep 26, 2019

Correction: passing float16 to check_array(dtype=FLOAT_DTYPES) works as expected (result is float16), but passing int16 results in float64, which is somewhat unexpected.
Looks like np.common_type(any int) is float64, so maybe this is actually "the correct" behavior?
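
As a quick illustration of the observation above, the sketch below calls `np.common_type` on a small integer array and a small float16 array. The array names are made up for the example, and this only shows NumPy's behavior, not what `check_array` does internally.

```python
import numpy as np

# Hypothetical sample arrays, used only to inspect the dtypes NumPy reports.
ints = np.zeros(3, dtype=np.int16)
halves = np.zeros(3, dtype=np.float16)

# np.common_type always returns an inexact (floating) scalar type.
int_common = np.common_type(ints)      # integer input is mapped to float64
half_common = np.common_type(halves)   # float16 input keeps its precision

print(int_common, half_common)
```

So an int16 input ending up as float64 matches what `np.common_type` reports for integer arrays.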

@amueller
Member Author

Ok, I think I'm confused about whether np.result_type or np.common_type is the right thing to do. I'm now tending towards np.result_type.
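
To make the difference between the two functions concrete, here is a minimal sketch comparing them on the same dtypes. The variable names are illustrative only; which function the PR ends up using is decided in the code, not by this snippet.

```python
import numpy as np

# np.result_type applies NumPy's promotion rules to the dtypes themselves.
mixed_result = np.result_type(np.int16, np.float32)
ints_result = np.result_type(np.int16, np.int32)

# np.common_type takes arrays and always returns a floating type;
# any integer input pushes the result to at least float64.
mixed_common = np.common_type(
    np.zeros(1, dtype=np.int16), np.zeros(1, dtype=np.float32)
)

print(mixed_result, ints_result, mixed_common)
```

With `np.result_type`, all-integer inputs stay integer, and int16 combined with float32 stays float32, whereas `np.common_type` reports float64 as soon as an integer array is involved.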

@amueller
Member Author

ok now this resolves anything pandas-related. It leaves the numpy-casting as it was, so we're still casting int32 to float64, not float32.

@jnothman
Member

CI failures, but I agree with your changes

@amueller
Member Author

If dtypes contains pandas dtypes then result_type doesn't work, so I think ideally we'd get the corresponding numpy dtype for the pandas dtype here.
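
One way to guard against this, sketched under the assumption that it is acceptable to skip the dtype lookup for pandas extension dtypes: only call `np.result_type` when every column dtype is a plain `np.dtype`. The helper name `numpy_result_dtype` is hypothetical and is not part of scikit-learn or pandas.

```python
import numpy as np
import pandas as pd


def numpy_result_dtype(df):
    """Hypothetical helper: promoted NumPy dtype of the columns, or None."""
    dtypes = list(df.dtypes)
    # Only call np.result_type when every column has a plain NumPy dtype;
    # pandas extension dtypes such as CategoricalDtype yield None instead.
    if dtypes and all(isinstance(dt, np.dtype) for dt in dtypes):
        return np.result_type(*dtypes)
    return None


ints = pd.DataFrame({"a": np.array([1, 2], dtype=np.int16),
                     "b": np.array([3, 4], dtype=np.int32)})
cats = pd.DataFrame({"c": pd.Categorical([1, 2, 3])})

ints_dtype = numpy_result_dtype(ints)   # promoted NumPy dtype of both columns
cats_dtype = numpy_result_dtype(cats)   # None: categorical is not an np.dtype
print(ints_dtype, cats_dtype)
```

Returning `None` leaves the decision to the later conversion step, which is a simplification rather than a mapping from each pandas dtype to its NumPy equivalent.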

@amueller
Member Author

It's actually basically impossible to correctly sniff out the types right now (pandas-dev/pandas#22791), but I think the solution I just pushed should be OK for now (better than master lol).

@amueller
Member Author

green again yay

# check that we handle pandas dtypes in a semi-reasonable way
# this is actually tricky because we can't really know that this
# should be integer ahead of converting it.
assert (check_array(pd.DataFrame([pd.Categorical([1, 2, 3])])).dtype
Member

For completeness, should we test for dtype=FLOAT_DTYPES as well?

Member Author

And check what? That it's float64?

Member

Right above this check, we check that an int16 dataframe is cast to float64. It seems reasonable to test that this categorical goes to float64 as well.
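
A rough sketch of the two checks being discussed, assuming scikit-learn and pandas are installed. The dataframe construction and variable names here are illustrative and may differ from the test that was actually committed.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

# An all-int16 dataframe and a single categorical column (illustrative data).
int_df = pd.DataFrame(np.array([[1, 2], [3, 4]], dtype=np.int16))
cat_df = pd.DataFrame({"cat_col": pd.Categorical([1, 2, 3])})

# Requesting a float dtype falls back to the first entry of FLOAT_DTYPES.
int_as_float = check_array(int_df, dtype=FLOAT_DTYPES).dtype
cat_as_float = check_array(cat_df, dtype=FLOAT_DTYPES).dtype

assert int_as_float == np.float64
assert cat_as_float == np.float64
```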

Member Author

done.

@amueller
Member Author

amueller commented Oct 4, 2019

@jnothman still good?

@jnothman jnothman merged commit b906078 into scikit-learn:master Oct 8, 2019
@amueller
Member Author

amueller commented Oct 8, 2019

Thanks!

Successfully merging this pull request may close these issues.

MaxAbsScaler Upcasts Pandas to float64