GapEncoder and MinHashEncoder modify their input inplace #921

Closed
jeromedockes opened this issue May 31, 2024 · 2 comments · Fixed by #930
Labels: bug (Something isn't working)
@jeromedockes (Member):
Nulls are replaced by empty strings in the original array:

>>> import pandas as pd
>>> from skrub import GapEncoder

>>> encoder = GapEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> encoder.fit(df)
GapEncoder(n_components=2)
>>> df
     a
0  one
1
2  one

As the inputs are strings that need to be processed anyway, it is easy to avoid this in-place modification without making an extra copy of the input.

For example, we can change the preprocessor of the CountVectorizer or HashingVectorizer:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]
>>> v = CountVectorizer(analyzer='char')
>>> v.fit_transform(x).A
Traceback (most recent call last):
    ...
AttributeError: 'NoneType' object has no attribute 'lower'

>>> v = CountVectorizer(analyzer='char', preprocessor=lambda x: x if isinstance(x, str) else '')
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])

Maybe it would make sense to ask scikit-learn to add an option letting the default preprocessor do this imputation. In the meantime, I would suggest we do it in skrub to avoid modifying X in place.
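A minimal sketch of what "doing it in skrub" could look like (the helper name is hypothetical, not skrub's actual code): build a new sequence of strings instead of writing the imputed values back into the input.

```python
import pandas as pd

def strings_with_imputed_nulls(column):
    # Hypothetical helper (illustration only, not skrub's implementation):
    # produce a NEW list of strings, mapping nulls to "", without touching
    # the input column.
    return [v if isinstance(v, str) else "" for v in column]

df = pd.DataFrame(dict(a=["one", None, "one"]))
cleaned = strings_with_imputed_nulls(df["a"])
# The original dataframe still contains its null:
assert df["a"].isna().sum() == 1
assert cleaned == ["one", "", "one"]
```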

@jeromedockes jeromedockes added the bug Something isn't working label May 31, 2024
@jeromedockes jeromedockes changed the title GapEncoder modifies its input inplace GapEncoder and MinHashEncoder modify their input inplace May 31, 2024
@jeromedockes (Member, Author):
Note that the MinHashEncoder has the same issue:

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import MinHashEncoder

>>> encoder = MinHashEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> df.isna()
       a
0  False
1   True
2  False
>>> _ = encoder.fit_transform(df)
>>> df
     a
0  one
1  NAN
2  one
>>> df.isna()
       a
0  False
1  False
2  False
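A simple non-regression check for this class of bug could look like the following (a sketch; the helper and the example callables are illustrative, not skrub's actual test suite):

```python
import pandas as pd

def check_leaves_input_unchanged(fit, df):
    # Snapshot the input, run the fit-like callable, then verify the
    # dataframe is identical afterwards.
    before = df.copy(deep=True)
    fit(df)
    pd.testing.assert_frame_equal(df, before)

df = pd.DataFrame(dict(a=["one", None, "one"]))
# A well-behaved transformer copies before imputing:
check_leaves_input_unchanged(lambda d: d.copy().fillna(""), df)
# A buggy one, e.g. lambda d: d.fillna("", inplace=True),
# would make this check raise an AssertionError.
```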

@jeromedockes (Member, Author):
Here is a way to keep the default preprocessing while adding the NaN handling; CountVectorizer exposes a public build_preprocessor() method:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]

>>> default_preprocessor = CountVectorizer().build_preprocessor()
>>> def preprocessor(text):
...     return default_preprocessor(text) if isinstance(text, str) else ''

>>> v = CountVectorizer(analyzer='char', preprocessor=preprocessor)
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])
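For what it's worth, the same wrapped preprocessor can be passed to a HashingVectorizer as well (a sketch; whether this maps onto skrub's internals is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

default_preprocessor = CountVectorizer().build_preprocessor()

def preprocessor(text):
    # Fall back to an empty string for non-string values (e.g. None).
    return default_preprocessor(text) if isinstance(text, str) else ''

x = ['abcd', 'abcd', 'cdef', None]
v = HashingVectorizer(analyzer='char', n_features=8, preprocessor=preprocessor)
X = v.fit_transform(x)
# The row for the None entry has no non-zero features:
assert X[3].nnz == 0
```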
