LabelEncoder with mixed typed labels feature or bug? #17294

Open
thomasjpfan opened this issue May 20, 2020 · 4 comments

Comments

@thomasjpfan
Member

thomasjpfan commented May 20, 2020

Is the following a bug or a feature?

import numpy as np
from sklearn.preprocessing import LabelEncoder

LabelEncoder().fit_transform(['1', '2', 2])
# array([0, 1, 1])

LabelEncoder().fit_transform(np.array(['1', '2', 2]))
# array([0, 1, 1])

LabelEncoder().fit_transform(np.array(['1', '2', 2], dtype=object))
# TypeError: Encoders require their input to be uniformly strings or numbers.

This stems from how we only check for multiple types in _encode when dtype=object. Note this behavior is the same in 0.22, 0.23, and on master.
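The difference between the second and third calls can be reproduced with NumPy alone. The sketch below is only an illustration of the array coercion step, not of scikit-learn's internals: without an explicit dtype, NumPy converts the integer to a string, so no mixed types remain for a later check to detect.

```python
import numpy as np

mixed = ['1', '2', 2]

# Without an explicit dtype, NumPy picks one common dtype for all elements:
# here the integer 2 is converted to the string '2'.
coerced = np.array(mixed)
print(coerced.dtype.kind)  # 'U' (unicode string)
print(coerced.tolist())    # ['1', '2', '2']

# With dtype=object the original Python objects are stored unchanged,
# so the str/int mix is still visible to any later type check.
preserved = np.array(mixed, dtype=object)
type_names = sorted({type(v).__name__ for v in preserved})
print(type_names)          # ['int', 'str']
```

After coercion, '2' and 2 are the same label, which matches the array([0, 1, 1]) result reported above.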

@thomasjpfan thomasjpfan changed the title LabelEncoder with different labels feature or bug? LabelEncoder with mixed typed labels feature or bug? May 20, 2020
@amueller
Member

For the middle one we have no choice; the bottom one makes sense to me, since you don't necessarily want to identify '1' with 1.
You might argue that the top one is a bug, but I'm not sure if we want to replace the numpy casting logic.

Btw, you run into this when concatenating dataframes from multiple CSV files where a column is sometimes parsed as a string and sometimes as an integer. Great Fun!
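A minimal sketch of that CSV scenario, assuming pandas is available. The file contents, the column name "label" and the helper read are made up for illustration. Forcing a single dtype at read time is one possible way to avoid the mix, not necessarily the recommended one.

```python
import io
import pandas as pd

# Hypothetical contents of two CSV files that share a "label" column.
CSV_A = "label\n1\n2\n"  # every value looks numeric, so it parses as integers
CSV_B = "label\n2\nx\n"  # one non-numeric value, so it parses as strings

def read(text, **kwargs):
    """Hypothetical helper: parse CSV text as if it were a file."""
    return pd.read_csv(io.StringIO(text), **kwargs)

# Default parsing: the concatenated column holds both integers and strings.
combined = pd.concat([read(CSV_A), read(CSV_B)], ignore_index=True)
is_str = [isinstance(v, str) for v in combined["label"]]
print(is_str)  # [False, False, True, True]

# Reading the column as strings in both files keeps the labels uniform.
uniform = pd.concat(
    [read(CSV_A, dtype={"label": str}), read(CSV_B, dtype={"label": str})],
    ignore_index=True,
)
print(uniform["label"].tolist())  # ['1', '2', '2', 'x']
```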

@jnothman
Member

jnothman commented May 21, 2020 via email

@jnothman
Member

jnothman commented May 21, 2020 via email

@glemaitre
Member

If we want to make this consistent, I am really scared that we will stumble into a dtype nightmare and end up with ad hoc workaround implementations for something that, I think, should already be solved elsewhere.

Better error messages that point to potential avenues for resolution would already be great.
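As a rough idea of what such a message could look like, here is a sketch of a standalone check. Both function names are hypothetical and nothing here is part of scikit-learn. It reports which types were found and suggests one possible fix.

```python
from collections import Counter

def describe_label_types(labels):
    """Hypothetical helper: count labels per Python type name."""
    return Counter(type(v).__name__ for v in labels)

def check_uniform_labels(labels):
    """Hypothetical check: raise an informative error on mixed label types."""
    counts = describe_label_types(labels)
    if len(counts) > 1:
        summary = ", ".join(f"{n} {name}" for name, n in sorted(counts.items()))
        raise TypeError(
            f"Labels have mixed types ({summary}). "
            "Consider casting them to a single type first, "
            "e.g. [str(v) for v in labels]."
        )
    return labels

try:
    check_uniform_labels(['1', '2', 2])
except TypeError as exc:
    message = str(exc)
    print(message)
```

Note that a check like this only sees the mix if it runs before the labels are converted to a NumPy array without dtype=object.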
