
Handle missing values in OneHotEncoder #11996

Closed
jnothman opened this issue Sep 4, 2018 · 17 comments · Fixed by #17317

@jnothman (Member) commented Sep 4, 2018

A minimum implementation might translate a NaN in input to a row of NaNs in output. I believe this would be the most consistent default behaviour with respect to other preprocessing tools, and with reasonable backwards-compatibility, but other core devs might disagree (see #10465 (comment)).

NaN should also be excluded from the categories identified in fit.

A handle_missing parameter might allow NaN in input to be:

  • replaced with a row of NaNs as above
  • replaced with a row of zeros
  • represented with a separate one-hot column

in the output.

A missing_values parameter might allow the user to configure what object is a placeholder for missingness (e.g. NaN, None, etc.).

See #10465 for background
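A rough standalone sketch (not scikit-learn code) of what these options could mean for a single column, with NaN as the placeholder; the function name and option strings are made up for illustration:

import numpy as np

def one_hot_with_missing(col, handle_missing="all-zero"):
    """One-hot encode a 1-D sequence, treating NaN according to handle_missing."""
    col = np.asarray(col, dtype=object)
    missing = np.array([x != x for x in col])   # NaN is the only value where x != x
    cats = sorted(set(col[~missing]))           # NaN excluded from the categories
    out = np.zeros((len(col), len(cats)))
    for j, cat in enumerate(cats):
        out[:, j] = col == cat
    if handle_missing == "all-missing":
        out[missing] = np.nan                   # row of NaNs for missing inputs
    elif handle_missing == "category":
        out = np.column_stack([out, missing.astype(float)])  # extra one-hot column
    # "all-zero": rows with a missing input simply stay all zero
    return out

print(one_hot_with_missing(["A", "B", np.nan, "B"], handle_missing="category"))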

@jnothman added the Moderate, help wanted, and Easy labels on Sep 4, 2018
@Olamyy commented Sep 4, 2018

Hi @jnothman, can I jump on this, or do I need to wait for other core devs to share their opinions first?

@jnothman (Member, Author) commented Sep 4, 2018

I think an initial implementation would be welcome.

@Olamyy commented Sep 5, 2018

I'm trying to confirm that I understand the task.
Help me run through this pseudocode and point out anywhere I might be wrong.

  1. Add a missing_values parameter to the init method of the OneHotEncoder class. This parameter will allow users to specify what should be treated as a missing value. Available options should be either:

     - NaN
     - None

  2. Add a handle_missing parameter to the init method of the OneHotEncoder class. This parameter will allow users to specify what happens to the missing values identified by the missing_values parameter. Available options should be either:

     - Replace with a row of NaNs: replace the output row for each input row containing a missing value with NaNs (I'm still a little confused here)
     - Replace with a row of zeros: replace the output row for each input row containing a missing value with 0s.
     - Represent with another one-hot column: pick a one-hot encoded column in the data and replace the row with it?

@jnothman (Member, Author) commented Sep 5, 2018

Perhaps:

X = [["A"],
     ["B"],
     [NaN],
     ["B"]]

handle_missing='all-missing':

Xt = [[  1,   0],
      [  0,   1],
      [NaN, NaN],
      [  0,   1]]

handle_missing='all-zero':

Xt = [[  1,   0],
      [  0,   1],
      [  0,   0],
      [  0,   1]]

handle_missing='category':

Xt = [[  1,   0,  0],
      [  0,   1,  0],
      [  0,   0,  1],
      [  0,   1,  0]]
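For comparison, pandas can already produce the handle_missing='category' layout above via get_dummies(dummy_na=True), and dummy_na=False gives the 'all-zero' layout:

import numpy as np
import pandas as pd

s = pd.Series(["A", "B", np.nan, "B"])
print(pd.get_dummies(s, dummy_na=True).to_numpy(dtype=int))   # 'category' layout
print(pd.get_dummies(s, dummy_na=False).to_numpy(dtype=int))  # 'all-zero' layout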

A good idea might be to start by writing things other than the implementation:

  • docstring
  • tests
  • doc/modules/preprocessing.rst where you could outline the pros and cons of each of these options

@jnothman (Member, Author) commented Sep 5, 2018

You don't need a complete implementation to open a PR, either

@datajanko (Contributor) commented:

Is there any update on this issue (missing-value support in OneHotEncoder)? Am I correct in assuming that we also need to change the _fit function of _BaseEncoder? That function calls _check_X and is itself called from all the fit functions. _check_X calls check_array, but NaN values are not allowed there.

Can this issue really be considered easy? To me, this looks rather complex right now.
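For context, a minimal illustration of the validation hurdle mentioned above; the exact call sites inside _BaseEncoder may differ between versions, but check_array rejects NaN unless told otherwise:

import numpy as np
from sklearn.utils import check_array

X = np.array([[0.0], [1.0], [np.nan], [1.0]])

try:
    check_array(X)                                     # default validation rejects NaN
except ValueError as exc:
    print(exc)

print(check_array(X, force_all_finite="allow-nan"))    # validation passes, NaN kept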

@jnothman (Member, Author) commented Dec 9, 2018

#12025 is an open pull request, but it seems to be stalled and lacks tests (@Olamyy?). I wouldn't say the change is trivial, but it's not a big change either, if we only worry about NaN and not other representations of missing values.

@jnothman removed the Easy label on Dec 9, 2018
@baluyotraf (Contributor) commented:

I'll give this a try. I'll make a PR when I have enough progress.

@jnothman (Member, Author) commented:

Thanks @baluyotraf

@amueller (Member) commented:

I'm not sure if having the row of NaNs is worth supporting. It seems to make this much trickier as well.
I think the separate value makes a ton of sense, and if people don't want that, they can use the imputer first.
This should simplify the treatment in OHE.

Given my work on dabl, right now I'm more concerned with making things possible at all than making them very easy with sklearn.

What I found most annoying in this whole area (it's only tangentially related, but I'm not sure which issue would be the correct one, #2888 maybe?) is that I can't actually use the "constant" strategy on the categorical columns within a ColumnTransformer.
My naive approach would be to separate continuous and categorical columns, fill the NaNs in the categorical ones with the string "missing", and then do one-hot encoding.
However, that only works if the categorical variables are strings; otherwise SimpleImputer will raise an error.
Alternatively, we could use "most_frequent" in SimpleImputer and add a MissingIndicator, but that seems less intuitive to me: it would mean the most frequent category's feature would be "1" and the "missing" feature would be "1" as well.
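A sketch of the naive approach described above (column names and values are made up): constant-filling with "missing" works for string-valued categoricals, but SimpleImputer rejects a string fill_value for numeric data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

str_cat = pd.DataFrame({"city": ["Paris", np.nan, "Berlin", "Paris"]})
filled = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(str_cat)
print(OneHotEncoder().fit_transform(filled).toarray())  # "missing" becomes its own category

num_cat = pd.DataFrame({"zip_code": [75001.0, np.nan, 10115.0, 75001.0]})
try:
    SimpleImputer(strategy="constant", fill_value="missing").fit_transform(num_cat)
except ValueError as exc:
    print(exc)  # a string fill_value is rejected when imputing numerical data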

@ogrisel (Member) commented Jul 18, 2019

I am also +1 for not supporting the option that would generate a row of nans, it sounds like YAGNI to me.

Let's consider the following case: a CSV file with 2 categorical columns, where one uses string labels and the other uses integer labels:

>>> import pandas as pd
>>> from io import StringIO
>>> csv_content = """\
... f1,f2
... "a",0
... ,1
... "b",
... ,
... """
>>> raw_df = pd.read_csv(StringIO(csv_content))
>>> raw_df
    f1   f2
0    a  0.0
1  NaN  1.0
2    b  NaN
3  NaN  NaN
>>> raw_df.dtypes
f1     object
f2    float64
dtype: object

So by default pandas will use float64 dtype for the int-valued column so as to be able to use nan as the missing value marker.

It's actually possible to use SimpleImputer with the constant strategy on this kind of heterogeneously typed data as it will convert it to a numpy array with object dtype:

>>> from sklearn.impute import SimpleImputer                                                                                                                                    
>>> imputed = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(raw_df)
>>> imputed
array([['a', 0.0],
       ['missing', 1.0],
       ['b', 'missing'],
       ['missing', 'missing']], dtype=object)

However, putting string values in an otherwise float-valued column is weird and causes the OneHotEncoder to crash on that column:

>>> from sklearn.preprocessing import OneHotEncoder
>>> OneHotEncoder().fit_transform(imputed)
Traceback (most recent call last):
  File "<ipython-input-48-04b9d558c891>", line 1, in <module>
    OneHotEncoder().fit_transform(imputed)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 358, in fit_transform
    return super().fit_transform(X, y)
  File "/home/ogrisel/code/scikit-learn/sklearn/base.py", line 556, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 338, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 86, in _fit
    cats = _encode(Xi)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/label.py", line 114, in _encode
    raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number

Using the debugger to see the underlying exception reveals:

TypeError: '<' not supported between instances of 'str' and 'float'

One could use ColumnTransformer to split the string-valued categorical columns from the number-valued ones and use a suitable fill_value for constant imputation on each side, as sketched below.
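A sketch of that workaround, reusing raw_df from the session above; each branch gets a fill_value matching its column type (the -1 marker is an arbitrary choice):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

encode_str = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)
encode_num = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=-1),  # -1 acts as the "missing" category
    OneHotEncoder(),
)
ct = ColumnTransformer([("f1", encode_str, ["f1"]), ("f2", encode_num, ["f2"])])
Xt = ct.fit_transform(raw_df)
print(Xt.shape)  # (4, 6): three categories per column, including the fill values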

However, from a usability standpoint it would make sense to have OneHotEncoder be able to do constant imputation directly with handle_missing="indicator".

We could also implement the zero strategy with handle_missing="zero". We need to decide on the default.

We also need to make sure that a NaN passed only at transform time (without having been seen in that column at fit time) is accepted (with the zero encoding), so that cross-validation is possible on data with just a few missing values that might all end up in the validation split by chance.

@ogrisel (Member) commented Mar 19, 2020

Note that some datasets such as the Ames housing dataset from fetch_openml use None instead of np.nan for missing values in the object columns of a pandas dataframe that otherwise have str labels: #16702 (comment)

This also leads to a confusing error message TypeError: '<' not supported between instances of 'str' and 'NoneType': #16702.
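A minimal reproduction of that error outside the encoder (the encoder's category discovery sorts the unique values, which is where the comparison fails):

import numpy as np

col = np.array(["a", "b", None, "a"], dtype=object)
try:
    np.unique(col)  # sorting mixed str/None values triggers the comparison error
except TypeError as exc:
    print(exc)      # e.g. '<' not supported between instances of 'NoneType' and 'str'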

@nilichen (Contributor) commented:

take

@netomenoci (Contributor) commented:

Any updates on this?

I can't fit a dataset that contains 'nan' values.

@nilichen (Contributor) commented:

@netomenoci I recently worked on this and here is my comment on this issue: #16749 (comment)

@thomasjpfan (Member) commented:

I am going to work on this with the goal of getting it into 0.24.

@zachmayer (Contributor) commented:

FYI, all of the encoders in scikit-learn-contrib/category_encoders already have a handle_unknown option; it might be pretty straightforward to just port their missing-handling logic from category_encoders.one_hot.OneHotEncoder

https://github.com/scikit-learn-contrib/category_encoders/tree/master/category_encoders
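If I recall its API correctly, usage looks roughly like the following (the handle_missing values are per the category_encoders docs at the time; worth double-checking before relying on this):

import numpy as np
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"f": ["A", "B", np.nan, "B"]})
enc = ce.OneHotEncoder(cols=["f"], handle_missing="indicator", use_cat_names=True)
print(enc.fit_transform(X))  # adds an extra indicator column for the missing value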
