gen_features failing with SimpleImputer on bool column #176

Closed
david-waterworth opened this issue Oct 7, 2018 · 9 comments

@david-waterworth

Hi

I'm working through the README at https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/README.rst.

I'm getting a deprecation warning on this code:

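import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper, gen_features

feature_def = gen_features(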
    columns=[['col1'], ['col2'], ['col3']],
    classes=[{'class': sklearn.preprocessing.Imputer, 'strategy': 'most_frequent'}]
)
mapper6 = DataFrameMapper(feature_def)
data6 = pd.DataFrame({
    'col1': [None, 1, 1, 2, 3],
    'col2': [True, False, None, None, True],
    'col3': [0, 0, 0, None, None]
})
mapper6.fit_transform(data6)

So I replaced it with

import sklearn.impute

feature_def = gen_features(
    columns=[['col1'], ['col2'], ['col3']],
    classes=[{'class': sklearn.impute.SimpleImputer, 'strategy': 'most_frequent'}]
)
mapper6 = DataFrameMapper(feature_def)

But this fails with

TypeError: ['col2']: unorderable types: NoneType() < bool()

So I replaced gen_features with an explicit DataFrameMapper and it works

mapper6 = DataFrameMapper([
    (['col1'], sklearn.impute.SimpleImputer(), {'strategy': 'most_frequent'}),
    (['col2'], sklearn.impute.SimpleImputer(), {'strategy': 'most_frequent'}),
    (['col3'], sklearn.impute.SimpleImputer(), {'strategy': 'most_frequent'}),
])
data6 = pd.DataFrame({
    'col1': [None, 1, 1, 2, 3],
    'col2': [True, False, None, None, True],
    'col3': [0, 0, 0, None, None]
})
mapper6.fit_transform(data6)

As far as I can see, the explicit DataFrameMapper should be the same as the one built by DataFrameMapper(feature_def). Have I done something wrong, or is there a bug in gen_features?

@devforfu
Collaborator

devforfu commented Oct 7, 2018

@david-waterworth It could be a bug. The package is not tested on sklearn==0.20.0.

@david-waterworth
Author

david-waterworth commented Oct 7, 2018

Ah right, no worries. I guess I'll have to either put up with the deprecation warnings or downgrade sklearn

Actually, my last example doesn't work either. It doesn't throw an error, but it looks like it converts the bools to 0/1, and in all cases it ignores most_frequent and seems to use the default strategy (mean), so the output doesn't match the original.
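One way to tell which strategy was actually applied is to work out by hand what each strategy would fill in for the toy data. The sketch below uses only the standard library rather than scikit-learn, with the column values copied from the example; the helper name `observed` is made up for illustration.

```python
from statistics import mean, mode

# Column values copied from the example; None marks a missing entry.
col1 = [None, 1, 1, 2, 3]
col2 = [True, False, None, None, True]


def observed(values):
    """Return the entries that are not missing."""
    return [v for v in values if v is not None]


# What a mean strategy would fill in versus a most-frequent strategy.
col1_mean = mean(observed(col1))           # (1 + 1 + 2 + 3) / 4
col1_most_frequent = mode(observed(col1))  # 1 appears twice

# Booleans count as 1/0 in arithmetic, so a mean fill is a fraction.
col2_mean = mean(int(v) for v in observed(col2))  # (1 + 0 + 1) / 3
col2_most_frequent = mode(observed(col2))         # True appears twice

print(col1_mean, col1_most_frequent)
print(col2_mean, col2_most_frequent)
```

A fill value of 1.75 in col1 would point to the mean, while 1 would point to most_frequent. col3 cannot distinguish the two, since both strategies give 0 there.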

Edit: I think I've misunderstood how to pass non-default constructor arguments to the DataFrameMapper classes. I think it should be:

mapper6 = DataFrameMapper([
    (['col1'], sklearn.impute.SimpleImputer(strategy='most_frequent')),
    (['col2'], sklearn.impute.SimpleImputer(strategy='most_frequent')),
    (['col3'], sklearn.impute.SimpleImputer(strategy='most_frequent')),
])
data6 = pd.DataFrame({
    'col1': [None, 1, 1, 2, 3],
    'col2': [True, False, None, None, True],
    'col3': [0, 0, 0, None, None]
})
mapper6.fit_transform(data6)

Which fails in the same way as my original feature_def

So it's probably not a bug in gen_features; more likely it's a change in behaviour in sklearn==0.20.0.
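For readers wondering why the gen_features version and the explicit version with constructor arguments behave the same, here is a simplified sketch of the idea: a `{'class': ..., **kwargs}` dict is turned into an instance by calling the class with the remaining keys as keyword arguments. This is an illustration only, not the library's source; `FakeImputer`, `build_transformer` and `gen_features_sketch` are hypothetical names, and `FakeImputer` stands in for a real transformer so the snippet runs without scikit-learn.

```python
class FakeImputer:
    """Hypothetical stand-in for a transformer class such as SimpleImputer."""

    def __init__(self, strategy="mean"):
        self.strategy = strategy


def build_transformer(spec):
    """Instantiate a transformer from a class or a {'class': ..., **kwargs} dict."""
    if isinstance(spec, dict):
        kwargs = dict(spec)          # copy so the caller's dict is untouched
        klass = kwargs.pop("class")  # everything left over is a constructor argument
        return klass(**kwargs)
    return spec()


def gen_features_sketch(columns, classes):
    """Pair every column selector with a freshly built list of transformers."""
    return [(column, [build_transformer(c) for c in classes]) for column in columns]


feature_def = gen_features_sketch(
    columns=[["col1"], ["col2"], ["col3"]],
    classes=[{"class": FakeImputer, "strategy": "most_frequent"}],
)

for column, transformers in feature_def:
    print(column, [t.strategy for t in transformers])
```

Under this reading, each column ends up with its own transformer constructed with `strategy='most_frequent'`, which matches the explicit `SimpleImputer(strategy='most_frequent')` form rather than passing a dict as a third tuple element.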

@monikamulani

monikamulani commented Oct 15, 2018

The following code doesn't work for me:

x = dataset1.iloc[:,0:3]

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="most_frequent")
imputer = imputer.fit(x.values[:,0:2])
x.values[:,0:2] = imputer.transform(x.values[:,0:2])   

However, if I remove the last column from "x" the imputer object works just fine.

x = dataset1.iloc[:,0:3]
x = x.drop(x.columns[2], axis=1)

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="most_frequent")
imputer = imputer.fit(x.values[:,0:2])
x.values[:,0:2] = imputer.transform(x.values[:,0:2])   

@dukebody
Collaborator

@monikamulani does your issue have something to do with the original post?

@dukebody
Collaborator

@david-waterworth, can you provide the deprecation warning and the full traceback of the error you receive? This will make debugging easier.

@david-waterworth
Author

@dukebody sure, I'm travelling for the next few weeks but once I'm back I'll update.

@david-waterworth
Author

david-waterworth commented Oct 20, 2018

@dukebody deprecation message is as follows

C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\utils\deprecation.py:58: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
  warnings.warn(msg, category=DeprecationWarning)

@david-waterworth
Author

@dukebody and the full traceback is

feature_def = gen_features(
    columns=[['col1'], ['col2'], ['col3']],
    classes=[{'class': sklearn.impute.SimpleImputer, 'strategy': 'most_frequent'}]
)
mapper6 = DataFrameMapper(feature_def)
data6 = pd.DataFrame({
    'col1': [None, 1, 1, 2, 3],
    'col2': [True, False, None, None, True],
    'col3': [0, 0, 0, None, None]
})
mapper6.fit_transform(data6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\base.py", line 462, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn_pandas\dataframe_mapper.py", line 214, in fit
    _call_fit(transformers.fit, Xt, y)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn_pandas\pipeline.py", line 27, in _call_fit
    return fit_method(X, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn_pandas\pipeline.py", line 77, in fit
    _call_fit(self.steps[-1][-1].fit, Xt, y, **fit_params)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn_pandas\pipeline.py", line 27, in _call_fit
    return fit_method(X, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\impute.py", line 259, in fit
    fill_value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\impute.py", line 343, in _dense_fit
    most_frequent[i] = _most_frequent(row, np.nan, 0)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\impute.py", line 70, in _most_frequent
    mode = stats.mode(array)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scipy\stats\stats.py", line 439, in mode
    scores = np.unique(np.ravel(a))       # get ALL unique values
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\lib\arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\lib\arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: ['col2']: '<' not supported between instances of 'NoneType' and 'bool'
>>>
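The traceback ends in `ar.sort()`, which suggests the failure comes from ordering the column's values rather than from sklearn-pandas itself. The sketch below reproduces that kind of error in plain Python, without numpy or scikit-learn, on the assumption that the object-dtype column is compared element by element in the same way. The variable names are illustrative.

```python
# col2 from the example: booleans mixed with None for missing entries.
col2 = [True, False, None, None, True]

try:
    sorted(col2)  # Python 3 cannot order None against bool
    sort_error = None
except TypeError as exc:
    sort_error = exc

print(type(sort_error).__name__, sort_error)

# Sorting succeeds once the missing entries are dropped.
observed_sorted = sorted(v for v in col2 if v is not None)
print(observed_sorted)
```

If this is the cause, any step that sorts the raw column will fail as long as `None` sits alongside the booleans, regardless of whether the mapper was built with gen_features or written out by hand.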

@raoulbia

raoulbia commented Dec 19, 2018

from sklearn/impute.py line 201:

raise ValueError("SimpleImputer does not support data with dtype "
                          "{0}. Please provide either a numeric array (with"
                          " a floating point or integer dtype) or "
                          "categorical data represented either as an array "
                          "with integer dtype or an array of string values "
                          "with an object dtype.".format(X.dtype))

the dtypes of the toy data in the README.rst example are:

col1    float64
col2     object
col3    float64
dtype: object

therefore, if you change the dtype of col2 to category it will work :)
data.col2 = data.col2.astype('category')

@ragrawal ragrawal closed this as completed May 8, 2021