Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joiner store state in fit + add other distance scaling strategies #821

Merged
merged 76 commits into from
Dec 12, 2023

Conversation

jeromedockes
Copy link
Contributor

@jeromedockes jeromedockes commented Nov 13, 2023

fixes #762, fixes #760, fixes #758

@jeromedockes jeromedockes marked this pull request as draft November 13, 2023 14:21
@jeromedockes
Copy link
Contributor Author

ATM, grid-searching the threshold for the distance between matched rows is very inefficient: we redo the full vectorization, nearest-neighbor search and joining just to apply a different threshold to the same column. Some options could be

  • add caching to the Joiner itself
  • separate the steps of joining and filling rows that are too far apart with nans, to take advantage of the Pipeline's existing caching mechanism
  • investigate more in which situations the thresholding is necessary. Maybe a good learner with enough data, if provided with the distance (or score) column, can learn to disregard the untrustworthy features when the distance is too large
  • ... ?

@jeromedockes
Copy link
Contributor Author

ok @Vincent-Maladiere @jovan-stojanovic I think I've addressed the main comments if you want to have another look

@jeromedockes jeromedockes changed the title [WIP] Joiner store state in fit + add other distance scaling strategies Joiner store state in fit + add other distance scaling strategies Dec 1, 2023
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments

skrub/_fuzzy_join.py Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_joiner.py Show resolved Hide resolved
skrub/_joiner.py Show resolved Hide resolved
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thank you, @jeromedockes, LGTM!

@jeromedockes
Copy link
Contributor Author

@jovan-stojanovic (and others @LeoGrin @GaelVaroquaux if you have the time) would you like to have another look I think we are converging on this one?

Copy link
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good! Here is a final review @jeromedockes, some small things to change and I guess we are ready for the release! 🚀

# score, that we will use later to show what are the worst matches.
# We set the ``add_match_info`` parameter to `True` to show distances
# between the rows that have been matched, that we will use later to show
# what are the worst matches.

###############################################################################
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment below but L146-147:
"Czechia"/"Czech Republic" and "Luxembourg*"/"Luxembourg" should be replaced by "Egypt"/"Egypt, Arab Rep." and "Lesotho*"/"Lesotho" to reflect well what was printed above.

examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
examples/04_fuzzy_joining.py Outdated Show resolved Hide resolved
Comment on lines 370 to 373
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor

pipeline = make_pipeline(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO maybe its better to do it in two times:

  • create the selector
  • add it to the pipeline

Just to help the user grasp it more easily.

Suggested change
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor
pipeline = make_pipeline(
# We create a selector that we will insert at the end of our pipeline, to
# select the relevant columns before fitting the regressor
selector = SelectCols(
[
"GDP per capita (current US$)",
"Life expectancy at birth, total (years)",
"Strength of legal rights index (0=weak to 12=strong)",
"GDP per capita (current US$) gdp",
"Life expectancy at birth, total (years) life_exp",
"Strength of legal rights index (0=weak to 12=strong) legal_rights",
]
# We create our pipeline
pipeline = make_pipeline(

skrub/_fuzzy_join.py Outdated Show resolved Hide resolved
Comment on lines +96 to +105
def check_column_name_duplicates(
main_table,
aux_table,
suffix,
main_table_name="main_table",
aux_table_name="aux_table",
):
"""Check that there are no duplicate column names after applying a suffix.

The suffix is applied to (a copy of) `aux_columns` before checking for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super useful! 🎉

Comment on lines 132 to 134
suffix : str, default=""
Suffix to append to the ``aux_table``'s column names. You can use it
to avoid duplicate column names in the join.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT, shouldn't the suffix by default be something like _aux ?
In any, case this is applied only if there are duplicate columns. (same remark for the fuzzy_join)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is always applied, we decided not to apply it only when there are duplicates (at least for now). note that the pandas and polars approach does not work 100% because they add the suffix only if there are duplicates but then don't check if there are duplicates after adding the suffix. also we thought it is useful to be able to easily know what will be the output column names. however in a later pr we want to add an option for generating an automatic suffix.

Re what should be the default, _aux does make sense although many users will want no suffix (if they don't have duplicated column names), and at the same time _aux might be too short to prevent duplicates in some cases.
So I'm not really sure what's best, I guess in many cases users will have to provide their own suffix
WDYT @Vincent-Maladiere and @skrub-data/devs ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, ah yes you are right that checking duplicates after the suffix is (a great asset of the Joiner) changing the logic here.
I guess this is anyway not a blocking issue for this PR, I'm ok for resolving this with future issues.

skrub/_joiner.py Show resolved Hide resolved
@jeromedockes
Copy link
Contributor Author

thanks a lot for the review, @jovan-stojanovic ! I think the last outstanding question is what should be the default for "suffix". (this also applies to other joiners AggJoiner InterpolationJoiner)

@Vincent-Maladiere
Copy link
Member

Let's discuss the suffix strategy outside of this PR and move forward :)

@Vincent-Maladiere Vincent-Maladiere merged commit 9fa3bd3 into skrub-data:main Dec 12, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants