[MRG] Fix the match score scaling #802

Merged

Member

@jeromedockes commented Oct 19, 2023

closes #763, #764

Note: example 04 will fail because of an issue with duplicate column names; fixing it will have to wait until #757 is merged.

@jeromedockes changed the title from "fix the match score scaling" to "[MRG] fix the match score scaling" on Oct 20, 2023
@jeromedockes changed the title from "[MRG] fix the match score scaling" to "fix the match score scaling" on Oct 20, 2023
@jeromedockes marked this pull request as draft on October 20, 2023 10:19
@jeromedockes
Member Author

This is ready for review, but I made it a draft so that it is not merged by mistake: #757 must be merged first to fix example 04.

@jeromedockes marked this pull request as ready for review on October 30, 2023 14:00
@jeromedockes changed the title from "fix the match score scaling" to "[MRG] Fix the match score scaling" on Oct 30, 2023
"joiner-2__match_score": [0.2, 0.9],
"joiner-3__match_score": [0.2, 0.9],
"joiner-1__match_score": [0.1, 0.9],
"joiner-2__match_score": [0.1, 0.9],
Member

I can't comment on it directly, but you need to update the description of the best match_score value here and in other places of the example, for instance:
"The grid searching gave us the best value of 0.5 for the parameter"

# ``match_score``. Let's use this value in our regression:
#

print(f"Mean R2 score with pipeline is {grid.score(df, y):.2f}")
Member Author

@jeromedockes Oct 31, 2023

@Vincent-Maladiere on closer inspection, this part is not informative: we are scoring on the training data. If we cross-validate the grid search correctly, the score does not improve as much.

Member Author

Depending on the CV random state, the grid search score can be the same as or worse than the score without grid search.
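
The difference between the two ways of scoring can be sketched with scikit-learn alone. The snippet below is a minimal illustration, not the code of example 04: the synthetic dataset, the `Ridge` estimator and its `alpha` grid are stand-ins chosen here (assumptions), whereas the example tunes `match_score`. It compares `grid.score` on the training data with a nested cross-validation in which the whole grid search is refit inside each outer fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression data (assumption: stands in for the example's dataset).
X, y = make_regression(n_samples=200, n_features=20, noise=20.0, random_state=0)

# Inner CV selects the hyperparameter; outer CV estimates generalization.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Assumption: Ridge's alpha stands in for the tuned parameter.
grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner_cv)

# Optimistic estimate: fit on all the data, then score on that same data.
grid.fit(X, y)
train_score = grid.score(X, y)

# Nested estimate: the grid search is refit on each outer training split
# and scored on the corresponding held-out split.
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)

print(f"R2 on training data: {train_score:.2f}")
print(f"Nested CV R2: {nested_scores.mean():.2f} +/- {nested_scores.std():.2f}")
```

Changing the `random_state` of `outer_cv` changes the nested scores, which is why a comparison between a tuned and an untuned pipeline can go either way on a single split.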

Member

@Vincent-Maladiere left a comment

LGTM!

@jeromedockes
Member Author

@jovan-stojanovic when you have time, would you mind having a look at this one? (Note that the example might still be improved in a later PR, e.g. to highlight the gain of fuzzy_join over a regular join, and we will think about other score rescaling strategies, but the goal of this PR is just to fix the scale so that it starts at 0 and to remove the division by 0.)
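
As an illustration of what "starts at 0" and "no division by 0" could mean in practice, here is a hypothetical rescaling helper. It is not skrub's implementation: the function name `rescale_match_score`, the choice to divide by the largest observed distance, and the fallback of returning a score of 1 when all distances are 0 are assumptions made for this sketch.

```python
import numpy as np


def rescale_match_score(distances):
    """Map distances to scores in [0, 1] (hypothetical helper, not skrub's API).

    The closest possible match (distance 0) gets a score of 1 and the
    farthest observed match gets a score of 0.
    """
    distances = np.asarray(distances, dtype=float)
    max_dist = distances.max() if distances.size else 0.0
    if max_dist == 0.0:
        # Every match is exact (or there are no matches): avoid dividing by 0.
        # Assumption: exact matches are all given the top score of 1.
        return np.ones_like(distances)
    return 1.0 - distances / max_dist


print(rescale_match_score([0.0, 1.0, 2.0]))
print(rescale_match_score([0.0, 0.0]))
```

With this kind of rescaling the scores span the full range from 0 to 1, so a threshold such as 0.1 or 0.9 refers to a position on that whole range.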

@Vincent-Maladiere merged commit 77b1ccc into skrub-data:main on Nov 8, 2023
24 checks passed
Development

Successfully merging this pull request may close these issues.

fuzzy_join's match_score starts at 0.5 not 0.0
3 participants