Add support for pd.NA in preprocessing module #16498

Closed
Lisska opened this issue Feb 20, 2020 · 3 comments · Fixed by #16508

Lisska commented Feb 20, 2020

Describe the bug

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

After updating pandas to the current version, I get the error TypeError: float() argument must be a string or a number, not 'NAType' when I pass integer data containing missing values, in the form of a pandas DataFrame, to the preprocessing module, in particular to QuantileTransformer and StandardScaler.

Steps/Code to Reproduce

Example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1, 2, 3, np.nan, np.nan],
                   'b': [np.nan, np.nan, 8, 4, 6]},
                  dtype=pd.Int64Dtype())

scaler = StandardScaler()
scaler.fit_transform(df)

Expected Results

array([[-1.22474487,         nan],
       [ 0.        ,         nan],
       [ 1.22474487,  1.22474487],
       [        nan, -1.22474487],
       [        nan,  0.        ]])
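
A possible workaround in the meantime, as a minimal sketch: cast the nullable Int64 columns to float64 before handing the frame to the scaler, so that pd.NA becomes np.nan. This assumes every column can be represented as a float; the variable names `X` and `scaled` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same data as in the report: nullable Int64 columns holding pd.NA
df = pd.DataFrame({"a": [1, 2, 3, np.nan, np.nan],
                   "b": [np.nan, np.nan, 8, 4, 6]},
                  dtype=pd.Int64Dtype())

# Casting to float64 turns pd.NA into np.nan, which StandardScaler
# ignores when computing the mean and standard deviation.
X = df.astype("float64")

scaled = StandardScaler().fit_transform(X)
print(scaled)
```

With this cast in place, the printed array should match the expected result above.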

Actual Results

TypeError                                 Traceback (most recent call last)
<ipython-input-42-2104609ef9c0> in <module>
      7 print(df)
      8 scaler = StandardScaler()
----> 9 scaler.fit_transform(df)

/anaconda3/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    667         # Reset internal state before fitting
    668         self._reset()
--> 669         return self.partial_fit(X, y)
    670 
    671     def partial_fit(self, X, y=None):

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    698         X = check_array(X, accept_sparse=('csr', 'csc'),
    699                         estimator=self, dtype=FLOAT_DTYPES,
--> 700                         force_all_finite='allow-nan')
    701 
    702         # Even in the case of `with_mean=False`, we update the mean anyway

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

/anaconda3/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: float() argument must be a string or a number, not 'NAType'
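
The last frame suggests the failure comes from the conversion to a float array rather than from the scaler itself. A minimal sketch of what appears to be the underlying cause: pd.NA cannot be converted with float(), whereas np.nan is already a float. The exact wording of the message may differ between Python versions.

```python
import numpy as np
import pandas as pd

# np.nan is a regular float, so this conversion is a no-op
nan_as_float = float(np.nan)

# pd.NA does not support conversion to float and raises TypeError
try:
    float(pd.NA)
except TypeError as exc:
    message = str(exc)
    print(message)
```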

Versions

System:
    python: 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:01:53)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/bin/python
   machine: Darwin-17.7.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 45.2.0.post20200210
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

Lisska added the Bug label Feb 20, 2020

rth commented Feb 20, 2020

Thanks @Lisska! It's indeed a known issue that would be good to address. I'm not sure how difficult it would be in general, though.

rth added this to the 0.23 milestone Feb 20, 2020
glemaitre added Enhancement and removed Bug labels Feb 20, 2020

glemaitre commented

I am not sure if we can easily handle this case without importing pandas?
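
One direction that might avoid a hard pandas dependency, sketched under assumptions: detect nullable integer columns by duck typing on the `dtypes` attribute and the dtype name, then let the object cast itself to float. The helper names `looks_like_nullable_integer_frame` and `to_float_array` are hypothetical, not scikit-learn API, and this is not necessarily how the linked fix handles it. pandas is imported here only to build the example data.

```python
import numpy as np
import pandas as pd  # only needed to construct the example input


def looks_like_nullable_integer_frame(X):
    """Hypothetical check that never imports pandas itself.

    pandas nullable integer dtypes are named "Int64", "UInt8", ...,
    while NumPy integer dtypes use lowercase names ("int64", "uint8").
    """
    dtypes = getattr(X, "dtypes", None)
    if dtypes is None or not hasattr(dtypes, "__iter__"):
        return False
    return any(
        getattr(dt, "name", "").startswith(("Int", "UInt")) for dt in dtypes
    )


def to_float_array(X):
    """Hypothetical conversion: cast early so pd.NA becomes np.nan."""
    if looks_like_nullable_integer_frame(X):
        X = X.astype("float64")
    return np.asarray(X, dtype="float64")


df = pd.DataFrame({"a": [1, 2, 3, np.nan, np.nan],
                   "b": [np.nan, np.nan, 8, 4, 6]},
                  dtype=pd.Int64Dtype())
arr = to_float_array(df)
print(arr)
```

Matching on the dtype name is a heuristic; a stricter version could confirm the dtype class when pandas happens to be importable.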

jnothman commented Feb 20, 2020 via email
