
MaxAbsScaler Upcasts Pandas to float64 #15093

Closed
danhitchcock opened this issue Sep 25, 2019 · 6 comments · Fixed by #15094

@danhitchcock

Description

I am working with ColumnTransformer and, to reduce memory usage, am trying to produce a float32 sparse matrix. Unfortunately, regardless of the pandas input dtype, the output is always float64.

I've identified one of the pipeline scalers, MaxAbsScaler, as the culprit. Other preprocessing classes, such as OneHotEncoder, accept an optional dtype argument, but MaxAbsScaler (among others) does not. The upcasting appears to happen when check_array is executed.

Is it possible to specify a dtype? Or is there a commonly accepted way to do this from the ColumnTransformer?

Thank you!
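In the meantime, a minimal workaround sketch (passing the raw NumPy array via .values, which sidesteps the DataFrame-specific path and preserves float32):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame({'DOW': [0, 1, 2], 'Value': [3.4, 4.0, 8.0]}).astype('float32')

# Passing the underlying ndarray means check_array sees a plain float32
# array, which is already in its accepted dtype list, so no upcast occurs.
scaled = MaxAbsScaler().fit_transform(df.values)
print(scaled.dtype)  # float32
```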

Steps/Code to Reproduce

Example:

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame({
    'DOW': [0, 1, 2, 3, 4, 5, 6],
    'Month': [3, 2, 4, 3, 2, 6, 7],
    'Value': [3.4, 4., 8, 5, 3, 6, 4]
})
df = df.astype('float32')
print(df.dtypes)
a = MaxAbsScaler()
scaled = a.fit_transform(df)  # passing df.values instead preserves float32
print('Transformed Type: ', scaled.dtype)

Expected Results

DOW      float32
Month    float32
Value    float32
dtype: object
Transformed Type: float32

Actual Results

DOW      float32
Month    float32
Value    float32
dtype: object
Transformed Type: float64

Versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:07:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.20.3
Pandas 0.25.1

@jnothman (Member) commented Sep 25, 2019 via email

@danhitchcock (Author)

Thanks for the quick response!
Same issue with 0.21.3

Darwin-18.7.0-x86_64-i386-64bit
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:07:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.17.1
SciPy 1.3.1
Scikit-Learn 0.21.3
Pandas 0.25.1

Upon closer look, this might be a bug in check_array, though I don't know enough about its intended behavior to say for sure. MaxAbsScaler calls check_array with dtype=FLOAT_DTYPES, which has the value ['float64', 'float32', 'float16']. In check_array, the pandas dtypes are correctly collected but never used: instead, check_array takes the first element of the supplied dtype=FLOAT_DTYPES list, which is always 'float64'. I placed inline comments next to what I think is going on:

dtypes_orig = None
if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
    dtypes_orig = np.array(array.dtypes) # correctly pulls the float32 dtypes from pandas

if dtype_numeric:
    if dtype_orig is not None and dtype_orig.kind == "O":
        # if input is object, convert to float.
        dtype = np.float64
    else:
        dtype = None

if isinstance(dtype, (list, tuple)):
    if dtype_orig is not None and dtype_orig in dtype:
        # no dtype conversion required
        dtype = None
    else:
        # dtype conversion required. Let's select the first element of the
        # list of accepted types.
        dtype = dtype[0] # Should this be dtype = dtypes_orig[0]? dtype[0] is always float64
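
To isolate that path, here is a minimal call into check_array with the same dtype list (the FLOAT_DTYPES value below is my assumption about what sklearn passes; the ndarray case has always preserved float32, while the DataFrame case upcasts on affected versions):

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Assumed to mirror sklearn's internal FLOAT_DTYPES
FLOAT_DTYPES = (np.float64, np.float32, np.float16)

arr = np.ones((3, 2), dtype=np.float32)
out = check_array(arr, dtype=FLOAT_DTYPES)
print(out.dtype)  # float32 -- the ndarray dtype is found in the accepted list

df = pd.DataFrame(arr)
out_df = check_array(df, dtype=FLOAT_DTYPES)
print(out_df.dtype)  # float64 on affected versions; float32 once fixed
```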

Thanks again!

@jnothman (Member) commented Sep 25, 2019 via email

@amueller (Member)

Can confirm it's a bug in the handling of pandas introduced in #10949.
If dtypes has more than one entry we need to figure out the best cast, right?
Here we're in the simple case where len(unique(dtypes)) == 1, which is easy to fix.
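
A sketch of that simple-case selection (a hypothetical helper, not the actual patch in #15094; returning None means "no conversion required", matching check_array's convention):

```python
import numpy as np

def select_target_dtype(accepted, dtypes_orig):
    """Pick the dtype check_array should convert to.

    If all original column dtypes are identical and already in the
    accepted list, keep that dtype (return None, i.e. no conversion);
    otherwise fall back to the first accepted dtype, as today.
    """
    uniques = {np.dtype(d) for d in dtypes_orig}
    if len(uniques) == 1:
        only = next(iter(uniques))
        if only in [np.dtype(a) for a in accepted]:
            return None  # no dtype conversion required
    return np.dtype(accepted[0])

print(select_target_dtype(['float64', 'float32'], ['float32', 'float32']))  # None
print(select_target_dtype(['float64', 'float32'], ['float32', 'float64']))  # float64
```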

@amueller (Member)

Fixed in #15094. (I should be writing grants, in case that's not obvious)

@danhitchcock (Author)

Y'all are awesome, thanks!
