-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] harmonizing the Joiner parameters #757
[MRG] harmonizing the Joiner parameters #757
Conversation
start applying changes decided in skrub-data#751
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I would expect only a couple of changes in the documentation mainly.
skrub/_joiner.py
Outdated
aux_table : :obj:`~pandas.DataFrame` | ||
The auxiliary table, which will be fuzzy-joined to the main table when | ||
calling ``transform``. | ||
main_key : str or list of str or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main_key : str or list of str or None | |
main_key : str, list of str or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that in scikit-learn we will have the following convention
main_key: str or list of str, default=None
And somehow, we want to have details regarding None
when the meaning is not straightforward: scikit-learn/scikit-learn#17295
We might want to have something similar (I still need to figure out what the class are doign :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am not sure what default=None
means. As a pandas user, I would expect to match the index of the tables (let see if this is this behaviour ;))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, this is not. It is only to know a sentinel to know if we have a matching key (using key
) or different keys (so using main_key
and aux_key
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a Pandas user, this is what I'm expecting ;)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Note that they don't document the None
skrub/_joiner.py
Outdated
for col in self._aux_key: | ||
if col not in self.aux_table.columns: | ||
raise ValueError( | ||
f"Column key {col!r} not found in columns of " | ||
f"auxiliary table: {self.aux_table.columns.tolist()}. " | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would not make sense to have it in check_key
? Or is it really specific to this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the check_key
expect aux_key
, main_key
so I would expect to make such check as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I'm not sure this check is super useful, if we don't do it we get a more accurate KeyError when trying to use the key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe we should think of removing it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice to raise an error early on (during fit
instead of during predict
), IMHO. I'm +1 on keeping it or putting it into check_key
as @glemaitre suggests since both checks will be performed in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in a different function as check_key is rather for figuring out what key to use from main_key, aux_key and key, and because this check happens in fit and transform whereas check_key is only in fit
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I am undecided regarding removing or not the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few additional comments!
skrub/_joiner.py
Outdated
aux_table : :obj:`~pandas.DataFrame` | ||
The auxiliary table, which will be fuzzy-joined to the main table when | ||
calling ``transform``. | ||
main_key : str or list of str or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a Pandas user, this is what I'm expecting ;)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
Note that they don't document the None
skrub/_joiner.py
Outdated
for col in self._aux_key: | ||
if col not in self.aux_table.columns: | ||
raise ValueError( | ||
f"Column key {col!r} not found in columns of " | ||
f"auxiliary table: {self.aux_table.columns.tolist()}. " | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's nice to raise an error early on (during fit
instead of during predict
), IMHO. I'm +1 on keeping it or putting it into check_key
as @glemaitre suggests since both checks will be performed in practice.
Co-authored-by: Vincent M <maladiere.vincent@yahoo.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Let's move forward with #802
start applying changes decided in #751