Add support for pd.NA in preprocessing module #16498

Lisska · 2020-02-20T15:43:19Z

Describe the bug

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.

After updating pandas to the current version, I get the error TypeError: float() argument must be a string or a number, not 'NAType' when I pass integer data containing missing values, in the form of a pandas DataFrame, to the preprocessing module, in particular to QuantileTransformer and StandardScaler.

Steps/Code to Reproduce

Example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1, 2, 3, np.nan, np.nan],
                   'b': [np.nan, np.nan, 8, 4, 6]},
                  dtype=pd.Int64Dtype())

scaler = StandardScaler()
scaler.fit_transform(df)

Expected Results

array([[-1.22474487,         nan],
       [ 0.        ,         nan],
       [ 1.22474487,  1.22474487],
       [        nan, -1.22474487],
       [        nan,  0.        ]])

Actual Results

TypeError                                 Traceback (most recent call last)
<ipython-input-42-2104609ef9c0> in <module>
      7 print(df)
      8 scaler = StandardScaler()
----> 9 scaler.fit_transform(df)

/anaconda3/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    667         # Reset internal state before fitting
    668         self._reset()
--> 669         return self.partial_fit(X, y)
    670 
    671     def partial_fit(self, X, y=None):

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    698         X = check_array(X, accept_sparse=('csr', 'csc'),
    699                         estimator=self, dtype=FLOAT_DTYPES,
--> 700                         force_all_finite='allow-nan')
    701 
    702         # Even in the case of `with_mean=False`, we update the mean anyway

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

/anaconda3/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

TypeError: float() argument must be a string or a number, not 'NAType'
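A possible workaround in the meantime, sketched under the assumption that casting the nullable integer columns to float64 is acceptable for the data at hand: astype("float64") turns pd.NA into np.nan, which StandardScaler already accepts. The variable names X and scaled are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1, 2, 3, np.nan, np.nan],
                   'b': [np.nan, np.nan, 8, 4, 6]},
                  dtype=pd.Int64Dtype())

# Workaround (assumption: a float64 copy of the data is acceptable):
# casting the nullable Int64 columns to float64 maps pd.NA to np.nan.
X = df.astype("float64")

# StandardScaler ignores NaN when fitting and keeps it in the output.
scaled = StandardScaler().fit_transform(X)
print(scaled)
```

This only sidesteps the error on the caller's side; it does not change how check_array handles the nullable dtypes.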

Versions

System:
    python: 3.6.10 |Anaconda, Inc.| (default, Jan  7 2020, 15:01:53)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/bin/python
   machine: Darwin-17.7.0-x86_64-i386-64bit

Python dependencies:
       pip: 20.0.2
setuptools: 45.2.0.post20200210
   sklearn: 0.22.1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: None
    pandas: 1.0.1
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True


rth · 2020-02-20T16:46:47Z

Thanks @Lisska! It's indeed a known issue that would be good to address. I'm not sure how difficult it would be in general, though.

glemaitre · 2020-02-20T22:15:15Z

I am not sure if we can easily handle this case without importing pandas?

jnothman · 2020-02-20T23:19:41Z

Importing pandas is fine if it is already in sys.modules, which it will be if pd.NA is in use.
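The sys.modules idea could look roughly like the sketch below. The helper name to_float_array is hypothetical and this is not the code that scikit-learn ended up with; it only illustrates looking pandas up in sys.modules, so that pandas is never imported on behalf of users who did not load it themselves.

```python
import sys

import numpy as np


def to_float_array(X):
    """Hypothetical helper: return X as a float64 ndarray, mapping pd.NA to NaN."""
    # Look pandas up instead of importing it: this is None when the caller
    # never imported pandas, in which case X cannot contain pd.NA anyway.
    pd = sys.modules.get("pandas")
    if pd is not None and isinstance(X, pd.DataFrame):
        # Casting nullable integer columns to float64 turns pd.NA into np.nan.
        X = X.astype("float64")
    return np.asarray(X, dtype=np.float64)


print(to_float_array([[1, 2], [3, 4]]))
```

With a DataFrame of Int64 columns, the same call would take the astype branch before the conversion to a NumPy array.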

Lisska added the Bug label Feb 20, 2020

rth added this to the 0.23 milestone Feb 20, 2020

rth added the help wanted label Feb 20, 2020

glemaitre added Enhancement and removed Bug labels Feb 20, 2020

thomasjpfan mentioned this issue Feb 21, 2020

ENH Adds pandas IntegerArray support to check_array #16508

Merged

glemaitre mentioned this issue Feb 24, 2020

SimpleImputer breaks using Pandas 1.0 with Int64 column #16531

Closed

cmarmo removed the help wanted label Mar 29, 2020

ogrisel closed this as completed in #16508 Apr 28, 2020
