Skip to content

OneHotEncoder ignore unknown error when categories are strings  #12470

@GMarzinotto

Description

@GMarzinotto

Description

This bug is very specific, but it happens when you set OneHotEncoder to ignore unknown entries.
and your labels are strings. The memory of the arrays is not handled safely and it can lead to a ValueError

Basically, when you call the transform method it will sets all the unknown strings on your array to OneHotEncoder.categories_[i][0] which is the first category alphabetically sorted given for fit
If this OneHotEncoder.categories_[i][0] is a long string, and the array that you want to transform has small strings, then it is impossible to fit the whole OneHotEncoder.categories_[i][0] into the entries of the array we want to transform. So OneHotEncoder.categories_[i][0] is truncated and this raise the ValueError.

Steps/Code to Reproduce


import numpy as np
from sklearn.preprocessing import OneHotEncoder


# It needs to be numpy arrays, the error does not appear 
# is you have lists of lists because it gets treated like an array of objects.
train  = np.array([ '22','333','4444','11111111' ]).reshape((-1,1))
test   = np.array([ '55555',  '22' ]).reshape((-1,1))

ohe = OneHotEncoder(dtype=bool,handle_unknown='ignore')

ohe.fit( train )
enc_test = ohe.transform( test )

Expected Results

Here we should get an sparse matrix 2x4 false everywhere except at (1,1) the '22' that is known

Actual Results

ValueError: y contains previously unseen labels: ['111111']

Versions

System:
python: 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
machine: Linux-4.4.0-138-generic-x86_64-with-Ubuntu-16.04-xenial
executable: /usr/bin/python

BLAS:
macros: HAVE_CBLAS=None
cblas_libs: openblas, openblas
lib_dirs: /usr/lib

Python deps:
Cython: 0.25.2
scipy: 0.18.1
setuptools: 36.7.0
pip: 9.0.1
numpy: 1.15.2
pandas: 0.19.1
sklearn: 0.21.dev0

Comments

I already implemented a fix for this issue, where I check the size of the elements in the array before, and I cast them into objects if necessary.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions