Description
I found a bug when using sklearn_pandas.DataFrameMapper. I have read the source code of DataFrameMapper but could not fix it. Could someone help with it?
First, I create a categorical variable as below:
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
import pandas as pd
testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']})
When I apply the LabelBinarizer() transformer through DataFrameMapper, I get what I expect:
In[11]
mapper1 = DataFrameMapper([('pet', [LabelBinarizer()])], sparse=False)
mapper1.fit_transform(testdata)
Out[11]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1]])
However, when I chain the transformers LabelEncoder() and OneHotEncoder(), it outputs a wrong result along with some warnings:
In[14]
mapper1 = DataFrameMapper([('pet', [LabelEncoder(), OneHotEncoder()])], sparse=False)
mapper1.fit_transform(testdata)
Out[14]:
array([[ 1., 1., 1., 1., 1., 1., 1., 1.]])
The problem lies in OneHotEncoder. It seems that it is not compatible with DataFrameMapper, nor with sklearn.pipeline.Pipeline.
Update:
I found that this bug is the same as #60. The reason is that in scikit-learn 0.17, passing a 1-D array to OneHotEncoder is deprecated. LabelEncoder() outputs a 1-D array, so things go wrong when it is cascaded with OneHotEncoder().
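To illustrate the shape problem outside of DataFrameMapper, here is a minimal sketch using scikit-learn alone. It assumes that the 1-D output of LabelEncoder() was being interpreted as a single sample with 8 features, which would explain the single row of ones above. The variable names (pets, codes, as_row, as_column) are only for illustration, and the explicit reshape is one possible workaround, not necessarily how sklearn_pandas handles it.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

pets = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']

# LabelEncoder returns a 1-D array of integer codes, shape (8,).
codes = LabelEncoder().fit_transform(pets)

# Assumption: reading the 8 codes as ONE sample with 8 features.
# Each feature then has a single observed category, so the result
# is one row with one column per feature.
as_row = OneHotEncoder().fit_transform(codes.reshape(1, -1)).toarray()

# Reshaping to a single column (8 samples, 1 feature) gives the
# intended one-hot encoding, one column per pet category.
as_column = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

print(codes.shape)
print(as_row)
print(as_column)
```

With the column reshape, the result should match the LabelBinarizer() output from Out[11], apart from being floats rather than integers.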