[MRG] harmonizing the Joiner parameters #757

Merged

jeromedockes
Contributor

Start applying the changes decided in #751.

@jeromedockes jeromedockes changed the title [WIP] harmonizing the Joiner parameters harmonizing the Joiner parameters Sep 27, 2023
Member

@glemaitre glemaitre left a comment

LGTM. I would mainly expect a couple of changes in the documentation.

skrub/_joiner.py Outdated
aux_table : :obj:`~pandas.DataFrame`
    The auxiliary table, which will be fuzzy-joined to the main table when
    calling ``transform``.
main_key : str or list of str or None
Member

Suggested change
- main_key : str or list of str or None
+ main_key : str, list of str or None

Member

Note that in scikit-learn we have the following convention:

main_key: str or list of str, default=None

And we want to give details about None when its meaning is not straightforward: scikit-learn/scikit-learn#17295

We might want to have something similar (I still need to figure out what the classes are doing :))
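
To make the convention concrete, here is a minimal numpydoc-style sketch of how it could look on a class like the Joiner. The class name `JoinerSketch` and the wording of each parameter description are assumptions made for this illustration; this is not skrub's actual docstring.

```python
class JoinerSketch:
    """Hypothetical stand-in used only to illustrate the docstring convention.

    Parameters
    ----------
    aux_table : pandas.DataFrame
        The auxiliary table to join onto the main table.
    main_key : str or list of str, default=None
        Column name(s) to use in the main table. If None, ``key`` is
        used for both tables (assumed wording).
    aux_key : str or list of str, default=None
        Column name(s) to use in the auxiliary table. If None, ``key`` is
        used for both tables (assumed wording).
    key : str or list of str, default=None
        Column name(s) shared by both tables. Must be None when
        ``main_key`` and ``aux_key`` are given (assumed wording).
    """

    def __init__(self, aux_table, *, main_key=None, aux_key=None, key=None):
        # Store the parameters unchanged, as scikit-learn estimators do.
        self.aux_table = aux_table
        self.main_key = main_key
        self.aux_key = aux_key
        self.key = key
```

The type line stays short (`str or list of str, default=None`), and the body of each entry says what None means, which is the part the linked scikit-learn issue asks for.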

Member

Here I am not sure what default=None means. As a pandas user, I would expect it to match on the index of the tables (let's see if this is the behaviour ;))

Member

Apparently, it is not. None is only a sentinel that tells us whether we have a matching key (using key) or different keys (using main_key and aux_key).
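
A minimal sketch of that sentinel logic, assuming the three parameters are named `key`, `main_key` and `aux_key` as in this thread. The function name `resolve_keys` and the error messages are hypothetical; this is not skrub's actual `check_key`.

```python
def resolve_keys(main_key=None, aux_key=None, key=None):
    """Return ``(main_key, aux_key)`` as two lists of column names.

    Hypothetical helper: ``None`` only means "not provided".
    """

    def as_list(value):
        # Accept a single column name or a list of names.
        return [value] if isinstance(value, str) else list(value)

    if key is not None:
        # A shared key excludes the table-specific keys.
        if main_key is not None or aux_key is not None:
            raise ValueError(
                "Pass either 'key', or 'main_key' and 'aux_key', not both."
            )
        return as_list(key), as_list(key)
    if main_key is None or aux_key is None:
        raise ValueError("Pass either 'key', or both 'main_key' and 'aux_key'.")
    return as_list(main_key), as_list(aux_key)
```

With this reading, `resolve_keys(key="country")` uses the same column on both sides, while `resolve_keys(main_key="a", aux_key="b")` uses different ones. None never means "join on the index".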

Member

As a Pandas user, this is what I'm expecting ;)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Note that they don't document the None

skrub/_joiner.py Outdated
Comment on lines 158 to 163
for col in self._aux_key:
    if col not in self.aux_table.columns:
        raise ValueError(
            f"Column key {col!r} not found in columns of "
            f"auxiliary table: {self.aux_table.columns.tolist()}. "
        )
Member

Wouldn't it make sense to have this in check_key? Or is it really specific to this class?

Member

Actually, check_key expects aux_key and main_key, so I would expect it to make such a check as well.

Contributor Author

Actually, I'm not sure this check is very useful: if we don't do it, we get a more accurate KeyError when trying to use the key.

Member

So maybe we should think of removing it then.

Member

It's nice to raise an error early on (during fit instead of during predict), IMHO. I'm +1 on keeping it or putting it into check_key as @glemaitre suggests since both checks will be performed in practice.

Contributor Author

I put it in a different function, since check_key is rather for figuring out which key to use from main_key, aux_key and key, and because this check happens in both fit and transform whereas check_key is only called in fit.
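
For reference, such a separate check could look roughly like the sketch below. The name `check_missing_columns` and its signature are hypothetical and chosen for illustration; it takes any iterable of column names, so it does not depend on pandas.

```python
def check_missing_columns(columns, key, table_name):
    """Raise a ValueError naming every key column absent from ``columns``.

    Hypothetical helper, kept separate from the key-resolution step so it
    can be called from both ``fit`` and ``transform``.
    """
    available = list(columns)
    missing = [col for col in key if col not in available]
    if missing:
        raise ValueError(
            f"Column(s) {missing!r} not found in columns of "
            f"{table_name} table: {available!r}."
        )
```

Under these assumptions, `fit` would call it with the auxiliary table's columns and the resolved `aux_key`, and `transform` would call it with the main table's columns and `main_key`, so a missing column is reported before any join is attempted.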

Member

@glemaitre glemaitre left a comment

LGTM. I am undecided about whether to remove the check.

@jeromedockes jeromedockes changed the title harmonizing the Joiner parameters [MRG] harmonizing the Joiner parameters Oct 20, 2023
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

A few additional comments!

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

LGTM! Let's move forward with #802

@Vincent-Maladiere Vincent-Maladiere merged commit 3ffd9df into skrub-data:main Nov 2, 2023
22 checks passed