How to deal with OneHotEncoder() in pipeline?

I ran into a bug when using `sklearn_pandas.DataFrameMapper`. I have read the source code of `DataFrameMapper` but could not fix it. Could someone help?

First, I create a categorical variable as follows:

``` python
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
import pandas as pd

testdata = pd.DataFrame({'pet':    ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']})
```

When I apply the `LabelBinarizer()` transformer through `DataFrameMapper`, I get what I expect:

``` python
In[11]
mapper1 = DataFrameMapper([('pet',[LabelBinarizer()])],sparse=False)
mapper1.fit_transform(testdata)
Out[11]:
array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1]])
```

However, when I chain the transformers `LabelEncoder()` and `OneHotEncoder()`, the output is wrong and some warnings are raised:

``` python
In[14]
mapper1 = DataFrameMapper([('pet',[LabelEncoder(),OneHotEncoder()])],sparse=False)
mapper1.fit_transform(testdata)
Out[14]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])
```

The problem seems to lie with `OneHotEncoder`: it appears to be incompatible with `DataFrameMapper`, and also with `sklearn.pipeline.Pipeline`.

---
## Update

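This turns out to be the same bug as https://github.com/paulgb/sklearn-pandas/issues/60 . The reason is that in scikit-learn 0.17, passing a 1-D array to `OneHotEncoder` is deprecated. `LabelEncoder()` outputs a 1-D array, so things go wrong when it is chained with `OneHotEncoder()`: the 8 labels appear to be treated as a single sample with 8 features, which would explain the 1x8 row of ones above.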
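
To illustrate the shape mismatch, here is a minimal sketch that chains the two encoders by hand, outside `DataFrameMapper`, and reshapes the labels into a single column in between. The variable names `codes`, `codes_2d` and `onehot` are only for illustration, and this is one possible workaround rather than a fix inside `sklearn_pandas` itself.

``` python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

testdata = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish']})

# LabelEncoder returns a 1-D array of integer codes, shape (8,)
codes = LabelEncoder().fit_transform(testdata['pet'])
print(codes.shape)

# Reshape to a single column so that each row is one sample, shape (8, 1)
codes_2d = codes.reshape(-1, 1)
print(codes_2d.shape)

# OneHotEncoder returns a sparse matrix by default; convert it to a dense array
onehot = OneHotEncoder().fit_transform(codes_2d).toarray()
print(onehot)
```

With the 2-D input, `onehot` has one row per sample and one column per category, matching the `LabelBinarizer()` result above except that the values are floats.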

