OneHotEncoder should handle unknown values more gracefully #2169

@larsmans

As reported on Stack Overflow, it would be nice if OneHotEncoder could handle unknown values for categorical features at transform time. Currently, the following throws an exception:

>>> from sklearn.preprocessing import OneHotEncoder
>>> oh = OneHotEncoder().fit([[0]])
>>> oh.transform([[1]])
Traceback (most recent call last):
  File "<ipython-input-17-54f21ed7c610>", line 1, in <module>
    oh.transform([[1]])
  File "/home/lars/src/scikit-learn/sklearn/preprocessing.py", line 878, in transform
    self.categorical_features, copy=True)
  File "/home/lars/src/scikit-learn/sklearn/preprocessing.py", line 662, in _transform_selected
    return transform(X)
  File "/home/lars/src/scikit-learn/sklearn/preprocessing.py", line 851, in _transform
    raise ValueError("Feature out of bounds. Try setting n_values.")
ValueError: Feature out of bounds. Try setting n_values.

I personally find this a bit strict, and would expect at most a warning and an appropriate number of zero columns for an unknown value.
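To make the expected behavior concrete, here is a rough sketch in plain Python. It is not scikit-learn code: `fit_categories` and `transform_ignore_unknown` are hypothetical helpers, and the warning text is made up for illustration.

```python
import warnings


def fit_categories(rows):
    """Collect the sorted set of values seen in each column at fit time."""
    n_cols = len(rows[0])
    return [sorted({row[j] for row in rows}) for j in range(n_cols)]


def transform_ignore_unknown(rows, categories):
    """One-hot encode rows; an unseen value gives all-zero columns and a warning."""
    encoded = []
    for row in rows:
        out = []
        for value, known in zip(row, categories):
            block = [0] * len(known)  # one column per value seen during fit
            if value in known:
                block[known.index(value)] = 1
            else:
                # Hypothetical message; the point is warn, don't raise.
                warnings.warn(f"unknown value {value!r} encoded as all zeros")
            out.extend(block)
        encoded.append(out)
    return encoded


categories = fit_categories([[0, 10], [1, 20]])
print(transform_ignore_unknown([[1, 10], [2, 20]], categories))
# [[0, 1, 1, 0], [0, 0, 0, 1]]
```

In the second row the value 2 was never seen in the first column, so that column's block stays all zeros while the second column is encoded as usual.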

This would be consistent with DictVectorizer and CountVectorizer, which ignore whatever features were not in their training set.
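For comparison, a small example of the DictVectorizer behavior referred to above. The feature names and values are invented for illustration, and it assumes a scikit-learn version where `DictVectorizer` accepts string-valued features.

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on two known categories of a single string-valued feature.
dv = DictVectorizer(sparse=False)
dv.fit([{"color": "red"}, {"color": "blue"}])

seen = dv.transform([{"color": "red"}])
unseen = dv.transform([{"color": "green"}])  # "green" was not in the training set

print(dv.feature_names_)  # ['color=blue', 'color=red']
print(seen)               # [[0. 1.]]
print(unseen)             # [[0. 0.]]
```

The unseen category is silently dropped and the row comes back as all zeros, with no exception raised.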
