GapEncoder and MinHashEncoder modify their input inplace #921

Closed
jeromedockes opened this issue May 31, 2024 · 2 comments · Fixed by #930
Labels: bug (Something isn't working)
@jeromedockes (Member):
Nulls are replaced by empty strings in the original array:

>>> import pandas as pd
>>> from skrub import GapEncoder

>>> encoder = GapEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> encoder.fit(df)
GapEncoder(n_components=2)
>>> df
     a
0  one
1
2  one

As the inputs are strings that need to be processed anyway, it is easy to avoid this in-place modification without making an extra copy of the input.

For example, we can change the preprocessor of the CountVectorizer or HashingVectorizer:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]
>>> v = CountVectorizer(analyzer='char')
>>> v.fit_transform(x).A
Traceback (most recent call last):
    ...
AttributeError: 'NoneType' object has no attribute 'lower'

>>> v = CountVectorizer(analyzer='char', preprocessor=lambda x: x if isinstance(x, str) else '')
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])

Maybe it would make sense to ask scikit-learn to add an option letting the default preprocessor do this imputation. In the meantime, I would suggest we do it in skrub to avoid modifying X in place.
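A minimal sketch of what "doing it in skrub" could look like (the helper name is hypothetical, not skrub's actual code): build a new sequence of strings instead of writing the imputed values back into the input.

```python
import pandas as pd

def strings_with_imputed_nulls(column):
    # Hypothetical helper (illustration only, not skrub's implementation):
    # produce a NEW list of strings, mapping nulls to "", without touching
    # the input column.
    return [v if isinstance(v, str) else "" for v in column]

df = pd.DataFrame(dict(a=["one", None, "one"]))
cleaned = strings_with_imputed_nulls(df["a"])
# The original dataframe still contains its null:
assert df["a"].isna().sum() == 1
assert cleaned == ["one", "", "one"]
```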

@jeromedockes jeromedockes added the bug Something isn't working label May 31, 2024
@jeromedockes jeromedockes changed the title GapEncoder modifies its input inplace GapEncoder and MinHashEncoder modify their input inplace May 31, 2024
@jeromedockes (Member, Author):
Note that the MinHashEncoder has the same issue:

>>> import pandas as pd
>>> import numpy as np
>>> from skrub import MinHashEncoder

>>> encoder = MinHashEncoder(n_components=2)
>>> df = pd.DataFrame(dict(a=["one", None, "one"]))
>>> df
      a
0   one
1  None
2   one
>>> df.isna()
       a
0  False
1   True
2  False
>>> _ = encoder.fit_transform(df)
>>> df
     a
0  one
1  NAN
2  one
>>> df.isna()
       a
0  False
1  False
2  False
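A simple non-regression check for this class of bug could look like the following (a sketch; the helper and the example callables are illustrative, not skrub's actual test suite):

```python
import pandas as pd

def check_leaves_input_unchanged(fit, df):
    # Snapshot the input, run the fit-like callable, then verify the
    # dataframe is identical afterwards.
    before = df.copy(deep=True)
    fit(df)
    pd.testing.assert_frame_equal(df, before)

df = pd.DataFrame(dict(a=["one", None, "one"]))
# A well-behaved transformer copies before imputing:
check_leaves_input_unchanged(lambda d: d.copy().fillna(""), df)
# A buggy one, e.g. lambda d: d.fillna("", inplace=True),
# would make this check raise an AssertionError.
```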

@jeromedockes (Member, Author):
Here is a way to keep the default preprocessing while adding the NaN handling; CountVectorizer exposes a public build_preprocessor() method:

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> x = ['abcd', 'abcd', 'cdef', None]

>>> default_preprocessor = CountVectorizer().build_preprocessor()
>>> def preprocessor(text):
...     return default_preprocessor(text) if isinstance(text, str) else ''

>>> v = CountVectorizer(analyzer='char', preprocessor=preprocessor)
>>> v.fit_transform(x).A
array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0]])
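For what it's worth, the same wrapped preprocessor can be passed to a HashingVectorizer as well (a sketch; whether this maps onto skrub's internals is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

default_preprocessor = CountVectorizer().build_preprocessor()

def preprocessor(text):
    # Fall back to an empty string for non-string values (e.g. None).
    return default_preprocessor(text) if isinstance(text, str) else ''

x = ['abcd', 'abcd', 'cdef', None]
v = HashingVectorizer(analyzer='char', n_features=8, preprocessor=preprocessor)
X = v.fit_transform(x)
# The row for the None entry has no non-zero features:
assert X[3].nnz == 0
```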
