ENH Fit multiple columns with MinHash #243

jovan-stojanovic · 2022-03-22T16:01:12Z

Currently, it does not seem possible to fit the MinHashEncoder on several columns at once.
This can cause errors when it is used in sklearn.compose.make_column_transformer. Here is an example:

Option 1:

    encoder = make_column_transformer(
        (one_hot, ['var0']),
        ('passthrough', ['var1', 'var2']),
        (MinHashEncoder(n_components=100), ['dirty1', 'dirty2']),
        remainder='drop')

Option 2:

    encoder = make_column_transformer(
        (one_hot, ['var0']),
        ('passthrough', ['var1', 'var2']),
        (MinHashEncoder(n_components=100), ['dirty1']),
        (MinHashEncoder(n_components=100), ['dirty2']),
        remainder='drop')

Options 1 and 2 should be equivalent. Option 1 currently does not work, because the encoder tries to encode the two variables simultaneously, while option 2 works, because each column is handled independently.
I have modified the encoder so that several columns can now be passed at once. The columns are of course still encoded independently, as they should be, but you can now do it all with one command.

I also added a test to check that encoding several variables at once is possible and works as expected.

Please check if this works for you.

Thanks.

GaelVaroquaux

Thanks for the PR. Super useful!!

A few comments.

GaelVaroquaux · 2022-03-22T16:06:11Z

dirty_cat/minhash_encoder.py

@@ -156,7 +156,7 @@ def fit(self, X, y=None):
        self
            The fitted MinHashEncoder instance.
        """
-        self.hash_dict = LRUDict(capacity=self._capacity)
+        self.hash_dict = LRUDict(capacity=self._capacity)        


You've added whitespace at the end of the line here.

GaelVaroquaux · 2022-03-22T16:16:05Z

dirty_cat/test/test_minhash_encoder.py

+        encode the column independently """
+    from dirty_cat.datasets import fetch_employee_salaries
+    employee_salaries = fetch_employee_salaries()
+    X = employee_salaries.X


Can you do the test on purely synthetic data, please, rather than downloading from the internet? Downloading from the internet is fragile.

GaelVaroquaux · 2022-03-22T16:16:55Z

dirty_cat/test/test_minhash_encoder.py

+    from dirty_cat.datasets import fetch_employee_salaries
+    employee_salaries = fetch_employee_salaries()
+    X = employee_salaries.X
+    try:        


I'm not sure that I understand the role of the try/except: if an error is raised, let it be raised. The test will fail, and that's good.

GaelVaroquaux · 2022-03-22T16:19:26Z

dirty_cat/test/test_minhash_encoder.py

+    employee_salaries = fetch_employee_salaries()
+    X = employee_salaries.X
+    try:        
+        MinHashEncoder().fit_transform(X[['employee_position_title','department_name']])


Suggested change

-        MinHashEncoder().fit_transform(X[['employee_position_title','department_name']])
+        MinHashEncoder().fit_transform(X[['employee_position_title', 'department_name']])

PEP8 formatting: space after comma. You need to be careful to apply the PEP8 standard systematically :)

jovan-stojanovic · 2022-03-22T16:41:29Z

Thanks for the comments! I hope this will work.

GaelVaroquaux

Sorry, a few more comments

GaelVaroquaux · 2022-03-23T08:14:55Z

dirty_cat/test/test_minhash_encoder.py

+        with the MinHashEncoder will not produce an error, but will 
+        encode the column independently """
+    X = pd.DataFrame([('bird', 'parrot'),
+                   ('bird', 'nightingale'),


Indent these lines so that the "('bird'" on each line is aligned with the one on the line above.

GaelVaroquaux · 2022-03-23T08:15:26Z

dirty_cat/minhash_encoder.py

-            raise ValueError(msg)
-
-        return X_out
+        X_enc = []


The idea is to get rid of "X_enc" as a list, and of the "append" and "hstack" at the end.
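A rough sketch of the two patterns, to make the suggestion concrete. The real code works on NumPy arrays; plain lists are used here only to keep the example dependency-free, and `encode_column` is a hypothetical placeholder for the per-column hashing.

```python
def encode_column(values, n_components):
    """Hypothetical per-column encoder: one row of floats per value."""
    return [[float(len(v) + j) for j in range(n_components)] for v in values]


def transform_with_hstack(columns, n_components):
    # Pattern under review: collect per-column blocks, then stack them.
    blocks = [encode_column(col, n_components) for col in columns]
    n_rows = len(columns[0])
    return [sum((block[i] for block in blocks), []) for i in range(n_rows)]


def transform_preallocated(columns, n_components):
    # Suggested pattern: allocate the full output once, then fill the
    # slice that belongs to each column.
    n_rows = len(columns[0])
    out = [[0.0] * (n_components * len(columns)) for _ in range(n_rows)]
    for k, col in enumerate(columns):
        start = k * n_components
        for i, row in enumerate(encode_column(col, n_components)):
            out[i][start:start + n_components] = row
    return out


cols = [["bird", "bird"], ["parrot", "nightingale"]]
assert transform_with_hstack(cols, 3) == transform_preallocated(cols, 3)
```

Both give the same result; the second avoids the intermediate list and the final stacking step.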

GaelVaroquaux · 2022-03-23T08:16:47Z

dirty_cat/minhash_encoder.py

+            nan_idx = []
+
+            if self.hashing == 'fast':
+                for i, x in enumerate(X_in):


We should merge the for loop above with the for loop here: move the for loop above here to have the two next to each other.

Would this solution work?

GaelVaroquaux

I'm sorry, still a bit of work.

I think that there is a bug with the handling of the nan.

We should find a test that highlights this bug before correcting it, as our current test suite does not seem stringent enough.

GaelVaroquaux · 2022-03-28T12:38:10Z

dirty_cat/minhash_encoder.py

+            X_out = np.zeros((len(X[:]), self.n_components * X.shape[1]))
+            counter = self.n_components
+            for k in range(X.shape[1]):
+                X_in = X[:,k].reshape(-1)


Suggested change

-                X_in = X[:,k].reshape(-1)
+                X_in = X[:, k].reshape(-1)

PEP8 formatting :)

You need to learn it and be systematic about it :)

GaelVaroquaux · 2022-03-28T13:11:41Z

dirty_cat/minhash_encoder.py

+                X_in = X[:,k].reshape(-1)
+                for i, x in enumerate(X_in):
+                    if isinstance(x, float): # true if x is a missing value
+                        nan_idx.append(i)


Given this, I wonder if the nan_idx list should not be reinitialized for each column.

The fact that we have no test that breaks tells me that our test suite is not detailed enough.

I think the nan_idx list should still stay outside of the loop, because the only goal of the list is to check whether there are missing values and, if so, force the user to use the handle_missing='zero_impute' option (see the code below, from lines 223-226).

    if self.handle_missing == 'error' and nan_idx:
        msg = ("Found missing values in input data; set "
               "handle_missing='zero_impute' to encode with missing values")
        raise ValueError(msg)

I can confirm that you are correct.

Our code is ugly. We need to change this to use a boolean rather than a list. Can you do this, please?
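A minimal sketch of what the boolean version could look like, detached from the encoder class. The names `is_missing`, `check_missing` and `has_missing` are hypothetical, and the NaN test is an assumption that is stricter than the `isinstance(x, float)` check in the diff.

```python
import math


def is_missing(x):
    # Assumption: a missing value is a float NaN; the code under review
    # uses the looser check `isinstance(x, float)`.
    return isinstance(x, float) and math.isnan(x)


def check_missing(columns, handle_missing="zero_impute"):
    """Return True if any column holds a missing value; raise in 'error' mode."""
    has_missing = False  # a single flag shared by all columns
    for col in columns:
        for x in col:
            if is_missing(x):
                has_missing = True
    if handle_missing == "error" and has_missing:
        raise ValueError(
            "Found missing values in input data; set "
            "handle_missing='zero_impute' to encode with missing values"
        )
    return has_missing
```

Since the flag only records whether any missing value was seen, it does not need to be reset between columns, which matches the reasoning given in this thread.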

GaelVaroquaux · 2022-03-28T13:12:09Z

dirty_cat/minhash_encoder.py

+            X_out = np.zeros((len(X[:]), self.n_components * X.shape[1]))
+            counter = self.n_components
+            for k in range(X.shape[1]):
+                X_in = X[:,k].reshape(-1)


Suggested change

-                X_in = X[:,k].reshape(-1)
+                X_in = X[:, k].reshape(-1)

PEP8

GaelVaroquaux · 2022-03-28T13:30:19Z

dirty_cat/test/test_minhash_encoder.py

+                      ('mammal', 'monkey'),
+                      ('mammal', np.nan)],
+                      columns=('class', 'type'))
+    MinHashEncoder().fit_transform(X)


We should push this test a bit further. For instance, we could compare the result to column-wise encoding.
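One possible shape for such a test, sketched with a toy min-hash written for this example (`ngrams`, `toy_minhash` and `encode` are hypothetical stand-ins, not the dirty_cat implementation). Missing values are left out to keep the sketch short. The same assertion could presumably be applied to `MinHashEncoder().fit_transform(X)` against column-by-column calls.

```python
import hashlib


def ngrams(s, n=3):
    """Character n-grams of a padded string."""
    s = f" {s} "
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def toy_minhash(s, n_components):
    """Toy min-hash: for each seed, the smallest hash over the n-grams."""
    return [
        min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{g}".encode()).digest()[:4], "big"
            )
            for g in ngrams(s)
        )
        for seed in range(n_components)
    ]


def encode(rows, n_components=4):
    """Encode every column independently and concatenate per row."""
    return [
        [h for value in row for h in toy_minhash(value, n_components)]
        for row in rows
    ]


def test_multiple_columns_match_columnwise():
    rows = [("bird", "parrot"), ("bird", "nightingale"),
            ("mammal", "monkey"), ("mammal", "dog")]
    both = encode(rows)
    first = encode([(r[0],) for r in rows])
    second = encode([(r[1],) for r in rows])
    # Joint encoding should equal the per-column encodings side by side.
    assert both == [a + b for a, b in zip(first, second)]
```

Comparing against the column-wise result checks the values, not just that the call does not raise.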

GaelVaroquaux

Only one tiny comment and we are good!

GaelVaroquaux · 2022-03-29T16:36:47Z

dirty_cat/minhash_encoder.py

+                X_in = X[:,k].reshape(-1)
+                for i, x in enumerate(X_in):
+                    if isinstance(x, float): # true if x is a missing value
+                        nan_idx.append(i)


I can confirm that you are correct.

Our code is ugly. We need to change this to use a boolean rather than a list. Can you do this, please?

jovan-stojanovic · 2022-03-30T09:56:00Z

Agreed, hope this works!

GaelVaroquaux

LGTM. Merging. Thank you!!

jovan-stojanovic added 2 commits March 22, 2022 16:30

Added solution to multiple column encoding and test

bdd4d3c

Fixed links

007f33e

GaelVaroquaux requested changes Mar 22, 2022

jovan-stojanovic added 2 commits March 22, 2022 17:35

Fixed whitespace and test example

7ec98e5

Fixed links

19e4bac

GaelVaroquaux requested changes Mar 23, 2022

jovan-stojanovic added 3 commits March 23, 2022 10:48

Merging loops and indent

f1ff82d

fixed links

6635c95

removed whitespace

dc9c495

GaelVaroquaux requested changes Mar 28, 2022

jovan-stojanovic added 2 commits March 29, 2022 16:59

Improved test and PEP8 standard

e226915

Removed whitespace

2f7b0bb

GaelVaroquaux reviewed Mar 29, 2022

jovan-stojanovic added 2 commits March 30, 2022 09:51

Replace list by boolean

ac743f1

Fix links

0485823

GaelVaroquaux approved these changes Mar 30, 2022

GaelVaroquaux merged commit 7c298f4 into skrub-data:master Mar 30, 2022

jovan-stojanovic deleted the minhashcolumns branch March 30, 2022 13:51

jovan-stojanovic changed the title ~~Minhashcolumns~~ ENH Fit multiple columns with MinHash Nov 16, 2022
