
LabelEncoder throws an error when it's used in a Pipeline or in a ColumnTransformer #12720

Closed
hal-314 opened this issue Dec 4, 2018 · 11 comments

@hal-314

hal-314 commented Dec 4, 2018

Description

The fit and fit_transform methods in LabelEncoder don't follow the standard scikit-learn convention for these methods, fit(X[, y]) and fit_transform(X[, y]). LabelEncoder's fit and fit_transform accept only one argument: fit(y) and fit_transform(y).

Therefore, LabelEncoder can't be used inside a Pipeline or a ColumnTransformer. I suspect there are other classes in which it doesn't work either (GridSearchCV, ...), but I haven't tested them.

In contrast, the fit and fit_transform methods of OneHotEncoder and OrdinalEncoder follow the standard scikit-learn signature.

See reference:
LabelEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
OrdinalEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder

Steps/Code to Reproduce

Example:

import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
import sklearn.tree as tree

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Both of the following raise:
# TypeError: fit_transform() takes 2 positional arguments but 3 were given
column_trans = ColumnTransformer(
    [('title_bow', LabelEncoder(), 'title')],
    remainder='drop').fit(X)

pipe = make_pipeline(LabelEncoder(), tree.DecisionTreeClassifier()).fit(X)

Expected Results

No error is thrown.

Actual Results

The same error in both cases:
TypeError: fit_transform() takes 2 positional arguments but 3 were given.

Versions

System:
python: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16) [GCC 7.3.0]
executable: /home/twins/anaconda3/envs/pytorch/bin/python
machine: Linux-4.8.0-56-generic-x86_64-with-debian-stretch-sid

BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/twins/anaconda3/envs/pytorch/lib
cblas_libs: mkl_rt, pthread

Python deps:
pip: 18.1
setuptools: 40.6.2
sklearn: 0.20.1
numpy: 1.15.4
scipy: 1.1.0
Cython: None
pandas: 0.23.4

Thanks for the amazing job you do!

@TomDLT
Member

TomDLT commented Dec 4, 2018

Yes indeed, from the user guide:

These are transformers that are not intended to be used on features, only on supervised learning targets.

To encode features, you need to use OneHotEncoder or OrdinalEncoder.
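
For illustration, here is a minimal sketch of the reproduction above rewritten with OrdinalEncoder (assuming scikit-learn >= 0.20, where OrdinalEncoder and ColumnTransformer are available); the step name 'title_ordinal' is just illustrative:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Note the list ['title']: OrdinalEncoder expects 2D input, so the column
# is selected as a one-column frame rather than a 1D Series.
column_trans = ColumnTransformer(
    [('title_ordinal', OrdinalEncoder(), ['title'])],
    remainder='drop').fit(X)

print(column_trans.transform(X))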

@qinhanmin2014
Member

duplicate of #3956
Please use the latest OneHotEncoder (in scikit-learn >= 0.20)
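
As a sketch of that suggestion (again assuming scikit-learn >= 0.20), swapping LabelEncoder for OneHotEncoder in the original reproduction makes the ColumnTransformer work, since OneHotEncoder follows the fit(X[, y]) convention:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = pd.DataFrame({'title': ["His Last Bow", "How Watson Learned the Trick",
                            "A Moveable Feast", "The Grapes of Wrath"],
                  'expert_rating': [5, 3, 4, 5]})

column_trans = ColumnTransformer(
    [('title_bow', OneHotEncoder(), ['title'])],
    remainder='drop').fit(X)

print(column_trans.transform(X))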

@gmichaelson

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

@maggiex

maggiex commented Apr 13, 2019

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

Yes... any suggestions on how to resolve this?

@qinhanmin2014
Member

The problem is that OneHotEncoder is broken in >0.20... it fails if you pass it features with string values...

Why? See the example section in our doc. https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html
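
For reference, a minimal sketch along the lines of that documentation example (assuming scikit-learn >= 0.20), showing OneHotEncoder accepting string-valued features directly:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 'London'],
     ['Female', 'Paris'],
     ['Female', 'Sallisaw']]
enc.fit(X)

print(enc.categories_)
# A category unseen during fit ('Oslo') is encoded as an all-zero block
print(enc.transform([['Female', 'Oslo']]).toarray())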

@kuabhish

Yes indeed, from the user guide:

These are transformers that are not intended to be used on features, only on supervised learning targets.

To encode features, you need to use OneHotEncoder or OrdinalEncoder.

What if I want to label-encode the input features?
With LabelEncoder there is also a problem with unseen categories (e.g. in the test set).
What should I do then?
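
One possible pattern, sketched under the assumption that the features are categorical strings: fit OneHotEncoder with handle_unknown='ignore' so that categories unseen at fit time don't raise at transform time. (In scikit-learn >= 0.24, OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) is another option.)

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['London'], ['Paris']])

# 'Sallisaw' was not seen during fit; it becomes an all-zero row
print(enc.transform([['Paris'], ['Sallisaw']]).toarray())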

@trialzuki

Maybe it's because LabelEncoder can't handle multiple features in one step, but OneHotEncoder can?

Check this:
OneHotEncoder().fit(df[multi_cols]) - works
vs
LabelEncoder().fit(df[multi_cols]) - raises an error
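
A small self-contained sketch of that difference (the DataFrame and column names are just for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'city': ['London', 'Paris'],
                   'title': ['His Last Bow', 'A Moveable Feast']})
multi_cols = ['city', 'title']

OneHotEncoder().fit(df[multi_cols])  # works: accepts a 2D block of columns
LabelEncoder().fit(df[multi_cols])   # raises ValueError: expects a single 1D array of labels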

@kuabhish

kuabhish commented Dec 13, 2019

I have written a label encoder that can handle unknown values and multiple columns, and can be inserted into a pipeline with some tweaks.

Code:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class My_LabelEncoder(BaseEstimator, TransformerMixin):

    def fit(self, df, df_y=None):
        # Build one category -> integer mapping per column
        maps_ = {}
        for col in df:
            uni = np.unique(df[col])
            map_ = {}
            for c in uni:
                map_[c] = len(map_)
            maps_[col] = map_
        self.maps_ = maps_
        return self

    def transform(self, df):
        ndf = df.copy()
        for col in df:
            ny = []
            map_ = self.maps_[col]
            for c in np.array(df[col]):
                if c in map_:
                    ny.append(map_[c])
                else:
                    # Categories unseen during fit are encoded as -1
                    ny.append(-1)
            ndf[col] = ny
        return ndf
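
If it helps, a hypothetical usage sketch of the class above inside a pipeline (the DataFrames and target below are made up for illustration):

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({'city': ['London', 'London', 'Paris'],
                      'title': ['His Last Bow', 'How Watson Learned the Trick',
                                'A Moveable Feast']})
y = [0, 1, 0]

pipe = make_pipeline(My_LabelEncoder(), DecisionTreeClassifier()).fit(train, y)

# Categories unseen during fit are mapped to -1 instead of raising
test = pd.DataFrame({'city': ['Sallisaw'], 'title': ['The Grapes of Wrath']})
print(pipe.predict(test))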

@jinsel

jinsel commented Jan 21, 2020

Here is an example of converting multiple categorical columns to integers:

from sklearn.preprocessing import LabelEncoder

# df_training1 is the DataFrame being encoded
column = ['ethnicity', 'gender', 'hospital_admit_source', 'icu_stay_type',
          'icu_type', 'icu_admit_source', 'icu_id', 'apache_3j_bodysystem',
          'apache_2_bodysystem']

def cat_to_int(column):
    for col in column:
        encoder = LabelEncoder()
        str_to_int = df_training1[col]
        df_training1[col] = encoder.fit_transform(str_to_int)

cat_to_int(column)
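
A note on that pattern: the LabelEncoder fitted for each column is discarded when the loop ends, so the same mapping can't be reapplied to a test set. A small variant (sketch only; df_training1 is assumed to be the DataFrame being encoded) that keeps the fitted encoders around:

from sklearn.preprocessing import LabelEncoder

encoders = {}

def cat_to_int(df, columns):
    # Fit one LabelEncoder per column and keep it for later reuse
    for col in columns:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])

cat_to_int(df_training1, column)

# Later, the same mapping can be applied to new data, e.g.:
# df_test[col] = encoders[col].transform(df_test[col])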

@jnothman
Member

jnothman commented Jan 21, 2020 via email

@Vikrant-Deshmukh

Vikrant-Deshmukh commented Apr 12, 2020

Traceback (most recent call last):
File "/Users/vikrant/Downloads/Car/catthatcodes/app/app.py", line 7, in
from pipeline import pipeline
File "/Users/vikrant/Downloads/Car/catthatcodes/app/pipeline/pipeline.py", line 8, in
graph = tf.get_default_graph()
AttributeError: module 'tensorflow' has no attribute 'get_default_graph'

# Here is the code:
import json
import urllib.request
import numpy as np
import pickle as pk

import tensorflow as tf
global graph, model
# Fails under TensorFlow 2.x, where get_default_graph was removed from the
# top-level API (it is still available as tf.compat.v1.get_default_graph)
graph = tf.get_default_graph()

from tensorflow.keras.models import load_model
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.imagenet_utils import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img
import keras.utils.data_utils
