Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Fit multiple columns with MinHash #243

Merged
merged 11 commits into from
Mar 30, 2022

Conversation

jovan-stojanovic
Copy link
Member

Currently, it seems that it is not possible to fit multiple columns with the MinHashEncoder simultaneously.
This may provoke errors when using the sklearn.compose.make_column_transformer. Here is an example:

Option 1:

    encoder = make_column_transformer(
        (one_hot, ['var0']),
        ('passthrough', ['var1','var2']),
        ( MinHashEncoder(n_components=100), ['dirty1','dirty2']),
        remainder='drop')

Option 2:

    encoder = make_column_transformer(
        (one_hot, ['var0']),
        ('passthrough', ['var1','var2']),
        ( MinHashEncoder(n_components=100), ['dirty1']),
        ( MinHashEncoder(n_components=100), ['dirty2']),
        remainder='drop')

Option 1 and 2 should be equivalent. Option 1 will currently not work, as the method tries to encode 2 variables simultaneously, while option 2 will work, as they are independently taken into account.
I have modified the encoder so that it is now possible to do it simultaneously. The columns will of course still be encoded independently, as they should be, but you could all do it with one command.

I also added a test, to check if encoding variables simultaneously is possible and goes well.

Please check if this works for you.

Thanks.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Super useful!!

A few comments.

@@ -156,7 +156,7 @@ def fit(self, X, y=None):
self
The fitted MinHashEncoder instance.
"""
self.hash_dict = LRUDict(capacity=self._capacity)
self.hash_dict = LRUDict(capacity=self._capacity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've added whitespace at the end of the line here.

encode the column independently """
from dirty_cat.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the test on purely synthetic data, please, rather than downloading from internet: dowloading from internet is fragile.

encode the column independently """
from dirty_cat.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do the test on purely synthetic data, please, rather than downloading from internet: dowloading from internet is fragile.

from dirty_cat.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I understand the role of the try/except: if an error is raised, let it be raised. The test will fail, and that's good.

employee_salaries = fetch_employee_salaries()
X = employee_salaries.X
try:
MinHashEncoder().fit_transform(X[['employee_position_title','department_name']])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
MinHashEncoder().fit_transform(X[['employee_position_title','department_name']])
MinHashEncoder().fit_transform(X[['employee_position_title', 'department_name']])

PEP8 formating: space after coma. You need to be careful to systematically apply the PEP8 standard :) @

@jovan-stojanovic
Copy link
Member Author

Thanks for the comments! I hope this will work.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a few more comments

with the MinHashEncoder will not produce an error, but will
encode the column independently """
X = pd.DataFrame([('bird', 'parrot'),
('bird', 'nightingale'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent these lines so that the "('bird'" of a line is indented aligned with that above

raise ValueError(msg)

return X_out
X_enc = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is the get rid of the "X_enc" as a list, and the "append" and "hstack" at the end

nan_idx = []

if self.hashing == 'fast':
for i, x in enumerate(X_in):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should merge the for loop above with the for loop here: move the for loop above here to have the two next to each other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this solution work?

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, still a bit of work.

I think that there is a bug with the handling of the nan.

We should find a test that highlight this bug before correcting it, as our current test suite does not seem stringent enough

X_out = np.zeros((len(X[:]), self.n_components * X.shape[1]))
counter = self.n_components
for k in range(X.shape[1]):
X_in = X[:,k].reshape(-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
X_in = X[:,k].reshape(-1)
X_in = X[:, k].reshape(-1)

PEP8 formating :)

You need to learn it and be systematic about it :)

X_in = X[:,k].reshape(-1)
for i, x in enumerate(X_in):
if isinstance(x, float): # true if x is a missing value
nan_idx.append(i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this, I wonder if the nan_idx list should not be reinitialized for each column.

The fact that we have no test that breaks tells me that our test suite is not detailed enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the nan_idx list should still stay outside of the loop, because the only goal of the list is to see if there are missing values and, if yes, force the user to use the handle_missing='zero_impute' option (see code below from line 223-226).

if self.handle_missing == 'error' and nan_idx:
            msg = ("Found missing values in input data; set "
                   "handle_missing='zero_impute' to encode with missing values")
            raise ValueError(msg)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that you are correct.

Our code is ugly. We need to change this to using a boolean rather than a list. Can you do this, please

X_out = np.zeros((len(X[:]), self.n_components * X.shape[1]))
counter = self.n_components
for k in range(X.shape[1]):
X_in = X[:,k].reshape(-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
X_in = X[:,k].reshape(-1)
X_in = X[:, k].reshape(-1)

PEP8

('mammal', 'monkey'),
('mammal', np.nan)],
columns=('class', 'type'))
MinHashEncoder().fit_transform(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should push a bit this test. For instance we could compare this to column-wise encoding.

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one tiny comment and we are good!

X_in = X[:,k].reshape(-1)
for i, x in enumerate(X_in):
if isinstance(x, float): # true if x is a missing value
nan_idx.append(i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that you are correct.

Our code is ugly. We need to change this to using a boolean rather than a list. Can you do this, please

@jovan-stojanovic
Copy link
Member Author

Agreed, hope this works!

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging. Thank you!!

@GaelVaroquaux GaelVaroquaux merged commit 7c298f4 into skrub-data:master Mar 30, 2022
@jovan-stojanovic jovan-stojanovic deleted the minhashcolumns branch March 30, 2022 13:51
@jovan-stojanovic jovan-stojanovic changed the title Minhashcolumns ENH Fit multiple columns with MinHash Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants