How to deal with OneHotEncoder() in pipeline? #63

@DataTerminatorX

I think I've found a bug in sklearn_pandas.DataFrameMapper. I've read the DataFrameMapper source code but couldn't fix it. Could someone help?

First, I create a categorical variable:

from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
import pandas as pd

testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']})

When I apply LabelBinarizer() through DataFrameMapper, I get the result I expect:

In[11]:
mapper1 = DataFrameMapper([('pet', [LabelBinarizer()])], sparse=False)
mapper1.fit_transform(testdata)
Out[11]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1]])

However, when I chain LabelEncoder() and OneHotEncoder(), I get the wrong result along with some warnings:

In[14]:
mapper1 = DataFrameMapper([('pet', [LabelEncoder(), OneHotEncoder()])], sparse=False)
mapper1.fit_transform(testdata)
Out[14]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])

The problem lies in OneHotEncoder. It seems to be incompatible with DataFrameMapper, and also with sklearn.pipeline.Pipeline.


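Update:

This turns out to be the same bug as #60. The cause is that passing a 1-D array to OneHotEncoder is deprecated as of scikit-learn 0.17. LabelEncoder() outputs a 1-D array, so things go wrong when it is chained with OneHotEncoder(): the eight labels appear to be treated as a single sample with eight features, which is consistent with the single row of eight ones in Out[14].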
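For anyone hitting the same thing, here is a minimal sketch of the 1-D versus 2-D issue outside of DataFrameMapper. It uses only scikit-learn and assumes that reshaping the LabelEncoder output into a single column is an acceptable workaround. It is not a fix inside sklearn_pandas itself, and the variable names (pets, codes, codes_2d, onehot) are just for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

pets = ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']

# LabelEncoder returns a 1-D array of integer codes, shape (8,)
codes = LabelEncoder().fit_transform(pets)

# Reshape into one column, shape (8, 1): one row per sample, one feature.
# This is the step that is missing when the two transformers are chained directly.
codes_2d = codes.reshape(-1, 1)

# OneHotEncoder returns a sparse matrix by default, so convert it to a dense array
onehot = OneHotEncoder().fit_transform(codes_2d).toarray()
print(onehot)
```

With the reshape in place, the result should have one row per sample and one column per category, matching the LabelBinarizer output in Out[11] (as floats rather than integers).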
