
LabelEncoder throws an error when it's used in a Pipeline or in a ColumnTransformer #12720

Closed
hal-314 opened this issue Dec 4, 2018 · 11 comments

@hal-314

hal-314 commented Dec 4, 2018

Description

The fit and fit_transform methods in LabelEncoder don't follow the standard scikit-learn convention for these methods, fit(X[, y]) and fit_transform(X[, y]). LabelEncoder's fit and fit_transform accept only one argument: fit(y) and fit_transform(y).

Therefore, LabelEncoder can't be used inside a Pipeline or a ColumnTransformer. I suspect there are other classes in which it doesn't work either (GridSearchCV, ...), but I haven't tested them.

In contrast, the fit and fit_transform methods of OneHotEncoder and OrdinalEncoder follow the standard scikit-learn signature.

See reference:
LabelEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
OrdinalEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

Steps/Code to Reproduce

Example:

import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
import sklearn.tree as tree

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Both of the following raise:
# TypeError: fit_transform() takes 2 positional arguments but 3 were given
column_trans = ColumnTransformer(
    [('title_bow', LabelEncoder(), 'title')],
    remainder='drop').fit(X)

pipe = make_pipeline(LabelEncoder(), tree.DecisionTreeClassifier()).fit(X)

Expected Results

No error is thrown.

Actual Results

The same error in both cases:
TypeError: fit_transform() takes 2 positional arguments but 3 were given.

Versions

System:
python: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16) [GCC 7.3.0]
executable: /home/twins/anaconda3/envs/pytorch/bin/python
machine: Linux-4.8.0-56-generic-x86_64-with-debian-stretch-sid

BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/twins/anaconda3/envs/pytorch/lib
cblas_libs: mkl_rt, pthread

Python deps:
pip: 18.1
setuptools: 40.6.2
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: None
pandas: 0.23.4

Thanks for the amazing job you do!

@TomDLT
Member

TomDLT commented Dec 4, 2018

Yes indeed, from the user guide:

These are transformers that are not intended to be used on features, only on supervised learning targets.

To encode features, you need to use OneHotEncoder or OrdinalEncoder.
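
For illustration, here is a minimal sketch of the reproduction above rewritten with OrdinalEncoder (assuming scikit-learn >= 0.20, where OrdinalEncoder and ColumnTransformer are available); the step name 'title_ordinal' is just illustrative:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Note the list ['title']: OrdinalEncoder expects 2D input, so the column
# is selected as a one-column frame rather than a 1D Series.
column_trans = ColumnTransformer(
    [('title_ordinal', OrdinalEncoder(), ['title'])],
    remainder='drop').fit(X)

print(column_trans.transform(X))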

@qinhanmin2014
Member

duplicate of #3956
Please use the latest OneHotEncoder (in scikit-learn >= 0.20)
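
As a sketch of that suggestion (again assuming scikit-learn >= 0.20), swapping LabelEncoder for OneHotEncoder in the original reproduction makes the ColumnTransformer work, since OneHotEncoder follows the fit(X[, y]) convention:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = pd.DataFrame({'title': ["His Last Bow", "How Watson Learned the Trick",
                            "A Moveable Feast", "The Grapes of Wrath"],
                  'expert_rating': [5, 3, 4, 5]})

column_trans = ColumnTransformer(
    [('title_bow', OneHotEncoder(), ['title'])],
    remainder='drop').fit(X)

print(column_trans.transform(X))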

@gmichaelson

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

@maggiex

maggiex commented Apr 13, 2019

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

Yes... any suggestions on how to resolve this?

@qinhanmin2014
Member

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

Why? See the example section in our doc. https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html
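
For reference, a minimal sketch along the lines of that documentation example (assuming scikit-learn >= 0.20), showing OneHotEncoder accepting string-valued features directly:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 'London'],
     ['Female', 'Paris'],
     ['Female', 'Sallisaw']]
enc.fit(X)

print(enc.categories_)
# A category unseen during fit ('Oslo') is encoded as an all-zero block
print(enc.transform([['Female', 'Oslo']]).toarray())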

@kuabhish

Yes indeed, from the user guide:

These are transformers that are not intended to be used on features, only on supervised learning targets.

To encode features, you need to use OneHotEncoder or OrdinalEncoder.

What if I want to label-encode the input features?
With LabelEncoder there is also a problem with unseen categories (e.g. in the test set).
What should I do then?
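
One possible pattern, sketched under the assumption that the features are categorical strings: fit OneHotEncoder with handle_unknown='ignore' so that categories unseen at fit time don't raise at transform time. (In scikit-learn >= 0.24, OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) is another option.)

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['London'], ['Paris']])

# 'Sallisaw' was not seen during fit; it becomes an all-zero row
print(enc.transform([['Paris'], ['Sallisaw']]).toarray())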

@trialzuki

Maybe it's because LabelEncoder can't handle multiple features in one step, but OneHotEncoder can?

Check this:
OneHotEncoder().fit(df[multi_cols]) - works
vs
LabelEncoder().fit(df[multi_cols]) - raises an error
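
A small self-contained sketch of that difference (the DataFrame and column names are just for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'city': ['London', 'Paris'],
                   'title': ['His Last Bow', 'A Moveable Feast']})
multi_cols = ['city', 'title']

OneHotEncoder().fit(df[multi_cols])  # works: accepts a 2D block of columns
LabelEncoder().fit(df[multi_cols])   # raises ValueError: expects a single 1D array of labels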

@kuabhish

kuabhish commented Dec 13, 2019

I have written a label encoder that can handle unknown values and multiple columns, and can be inserted into a pipeline with some tweaks.

Code:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class My_LabelEncoder(BaseEstimator, TransformerMixin):

    def fit(self, df, df_y=None):
        # Build one category -> integer mapping per column
        maps_ = {}
        for col in df:
            uni = np.unique(df[col])
            map_ = {}
            for c in uni:
                map_[c] = len(map_)
            maps_[col] = map_
        self.maps_ = maps_
        return self

    def transform(self, df):
        ndf = df.copy()
        for col in df:
            ny = []
            map_ = self.maps_[col]
            for c in np.array(df[col]):
                if c in map_:
                    ny.append(map_[c])
                else:
                    # Categories unseen during fit are encoded as -1
                    ny.append(-1)
            ndf[col] = ny
        return ndf
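
If it helps, a hypothetical usage sketch of the class above inside a pipeline (the DataFrames and target below are made up for illustration):

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({'city': ['London', 'London', 'Paris'],
                      'title': ['His Last Bow', 'How Watson Learned the Trick',
                                'A Moveable Feast']})
y = [0, 1, 0]

pipe = make_pipeline(My_LabelEncoder(), DecisionTreeClassifier()).fit(train, y)

# Categories unseen during fit are mapped to -1 instead of raising
test = pd.DataFrame({'city': ['Sallisaw'], 'title': ['The Grapes of Wrath']})
print(pipe.predict(test))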

@jinsel

jinsel commented Jan 21, 2020

Here is an example of converting multiple categorical columns to integers:

from sklearn.preprocessing import LabelEncoder

# df_training1 is the DataFrame being encoded
column = ['ethnicity', 'gender', 'hospital_admit_source', 'icu_stay_type',
          'icu_type', 'icu_admit_source', 'icu_id', 'apache_3j_bodysystem',
          'apache_2_bodysystem']

def cat_to_int(column):
    for col in column:
        encoder = LabelEncoder()
        str_to_int = df_training1[col]
        df_training1[col] = encoder.fit_transform(str_to_int)

cat_to_int(column)
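
A note on that pattern: the LabelEncoder fitted for each column is discarded when the loop ends, so the same mapping can't be reapplied to a test set. A small variant (sketch only; df_training1 is assumed to be the DataFrame being encoded) that keeps the fitted encoders around:

from sklearn.preprocessing import LabelEncoder

encoders = {}

def cat_to_int(df, columns):
    # Fit one LabelEncoder per column and keep it for later reuse
    for col in columns:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])

cat_to_int(df_training1, column)

# Later, the same mapping can be applied to new data, e.g.:
# df_test[col] = encoders[col].transform(df_test[col])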

@jnothman
Member

jnothman commented Jan 21, 2020 via email

@Vikrant-Deshmukh

Vikrant-Deshmukh commented Apr 12, 2020

Traceback (most recent call last):
File "/Users/vikrant/Downloads/Car/catthatcodes/app/app.py", line 7, in
from pipeline import pipeline
File "/Users/vikrant/Downloads/Car/catthatcodes/app/pipeline/pipeline.py", line 8, in
graph = tf.get_default_graph()
AttributeError: module 'tensorflow' has no attribute 'get_default_graph'

# Here is the code:
import json
import urllib.request
import numpy as np
import pickle as pk

import tensorflow as tf
global graph, model
# Fails under TensorFlow 2.x, where get_default_graph was removed from the
# top-level API (it is still available as tf.compat.v1.get_default_graph)
graph = tf.get_default_graph()

from tensorflow.keras.models import load_model
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img
import keras.utils.data_utils
