-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Fit multiple columns with MinHash #243
ENH Fit multiple columns with MinHash #243
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Super useful!!
A few comments.
dirty_cat/minhash_encoder.py
Outdated
@@ -156,7 +156,7 @@ def fit(self, X, y=None): | |||
self | |||
The fitted MinHashEncoder instance. | |||
""" | |||
self.hash_dict = LRUDict(capacity=self._capacity) | |||
self.hash_dict = LRUDict(capacity=self._capacity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've added whitespace at the end of the line here.
encode the column independently """ | ||
from dirty_cat.datasets import fetch_employee_salaries | ||
employee_salaries = fetch_employee_salaries() | ||
X = employee_salaries.X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do the test on purely synthetic data, please, rather than downloading from internet: dowloading from internet is fragile.
encode the column independently """ | ||
from dirty_cat.datasets import fetch_employee_salaries | ||
employee_salaries = fetch_employee_salaries() | ||
X = employee_salaries.X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do the test on purely synthetic data, please, rather than downloading from internet: dowloading from internet is fragile.
from dirty_cat.datasets import fetch_employee_salaries | ||
employee_salaries = fetch_employee_salaries() | ||
X = employee_salaries.X | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I understand the role of the try/except: if an error is raised, let it be raised. The test will fail, and that's good.
employee_salaries = fetch_employee_salaries() | ||
X = employee_salaries.X | ||
try: | ||
MinHashEncoder().fit_transform(X[['employee_position_title','department_name']]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MinHashEncoder().fit_transform(X[['employee_position_title','department_name']]) | |
MinHashEncoder().fit_transform(X[['employee_position_title', 'department_name']]) |
PEP8 formating: space after coma. You need to be careful to systematically apply the PEP8 standard :) @
Thanks for the comments! I hope this will work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, a few more comments
with the MinHashEncoder will not produce an error, but will | ||
encode the column independently """ | ||
X = pd.DataFrame([('bird', 'parrot'), | ||
('bird', 'nightingale'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent these lines so that the "('bird'" of a line is indented aligned with that above
dirty_cat/minhash_encoder.py
Outdated
raise ValueError(msg) | ||
|
||
return X_out | ||
X_enc = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is the get rid of the "X_enc" as a list, and the "append" and "hstack" at the end
nan_idx = [] | ||
|
||
if self.hashing == 'fast': | ||
for i, x in enumerate(X_in): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should merge the for loop above with the for loop here: move the for loop above here to have the two next to each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this solution work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, still a bit of work.
I think that there is a bug with the handling of the nan.
We should find a test that highlight this bug before correcting it, as our current test suite does not seem stringent enough
dirty_cat/minhash_encoder.py
Outdated
X_out = np.zeros((len(X[:]), self.n_components * X.shape[1])) | ||
counter = self.n_components | ||
for k in range(X.shape[1]): | ||
X_in = X[:,k].reshape(-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_in = X[:,k].reshape(-1) | |
X_in = X[:, k].reshape(-1) |
PEP8 formating :)
You need to learn it and be systematic about it :)
dirty_cat/minhash_encoder.py
Outdated
X_in = X[:,k].reshape(-1) | ||
for i, x in enumerate(X_in): | ||
if isinstance(x, float): # true if x is a missing value | ||
nan_idx.append(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this, I wonder if the nan_idx list should not be reinitialized for each column.
The fact that we have no test that breaks tells me that our test suite is not detailed enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the nan_idx list should still stay outside of the loop, because the only goal of the list is to see if there are missing values and, if yes, force the user to use the handle_missing='zero_impute' option (see code below from line 223-226).
if self.handle_missing == 'error' and nan_idx:
msg = ("Found missing values in input data; set "
"handle_missing='zero_impute' to encode with missing values")
raise ValueError(msg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that you are correct.
Our code is ugly. We need to change this to using a boolean rather than a list. Can you do this, please
dirty_cat/minhash_encoder.py
Outdated
X_out = np.zeros((len(X[:]), self.n_components * X.shape[1])) | ||
counter = self.n_components | ||
for k in range(X.shape[1]): | ||
X_in = X[:,k].reshape(-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X_in = X[:,k].reshape(-1) | |
X_in = X[:, k].reshape(-1) |
PEP8
('mammal', 'monkey'), | ||
('mammal', np.nan)], | ||
columns=('class', 'type')) | ||
MinHashEncoder().fit_transform(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should push a bit this test. For instance we could compare this to column-wise encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one tiny comment and we are good!
dirty_cat/minhash_encoder.py
Outdated
X_in = X[:,k].reshape(-1) | ||
for i, x in enumerate(X_in): | ||
if isinstance(x, float): # true if x is a missing value | ||
nan_idx.append(i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that you are correct.
Our code is ugly. We need to change this to using a boolean rather than a list. Can you do this, please
Agreed, hope this works! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Merging. Thank you!!
Currently, it seems that it is not possible to fit multiple columns with the MinHashEncoder simultaneously.
This may provoke errors when using the sklearn.compose.make_column_transformer. Here is an example:
Option 1:
Option 2:
Option 1 and 2 should be equivalent. Option 1 will currently not work, as the method tries to encode 2 variables simultaneously, while option 2 will work, as they are independently taken into account.
I have modified the encoder so that it is now possible to do it simultaneously. The columns will of course still be encoded independently, as they should be, but you could all do it with one command.
I also added a test, to check if encoding variables simultaneously is possible and goes well.
Please check if this works for you.
Thanks.