-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Description
Description
This bug is very specific, but it happens when you set OneHotEncoder to ignore unknown entries.
and your labels are strings. The memory of the arrays is not handled safely and it can lead to a ValueError
Basically, when you call the transform method it will sets all the unknown strings on your array to OneHotEncoder.categories_[i][0] which is the first category alphabetically sorted given for fit
If this OneHotEncoder.categories_[i][0] is a long string, and the array that you want to transform has small strings, then it is impossible to fit the whole OneHotEncoder.categories_[i][0] into the entries of the array we want to transform. So OneHotEncoder.categories_[i][0] is truncated and this raise the ValueError.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# It needs to be numpy arrays, the error does not appear
# is you have lists of lists because it gets treated like an array of objects.
train = np.array([ '22','333','4444','11111111' ]).reshape((-1,1))
test = np.array([ '55555', '22' ]).reshape((-1,1))
ohe = OneHotEncoder(dtype=bool,handle_unknown='ignore')
ohe.fit( train )
enc_test = ohe.transform( test )
Expected Results
Here we should get an sparse matrix 2x4 false everywhere except at (1,1) the '22' that is known
Actual Results
ValueError: y contains previously unseen labels: ['111111']
Versions
System:
python: 2.7.12 (default, Dec 4 2017, 14:50:18) [GCC 5.4.0 20160609]
machine: Linux-4.4.0-138-generic-x86_64-with-Ubuntu-16.04-xenial
executable: /usr/bin/python
BLAS:
macros: HAVE_CBLAS=None
cblas_libs: openblas, openblas
lib_dirs: /usr/lib
Python deps:
Cython: 0.25.2
scipy: 0.18.1
setuptools: 36.7.0
pip: 9.0.1
numpy: 1.15.2
pandas: 0.19.1
sklearn: 0.21.dev0
Comments
I already implemented a fix for this issue, where I check the size of the elements in the array before, and I cast them into objects if necessary.