[MRG] Fix the match score scaling #802

jeromedockes · 2023-10-19T16:01:00Z

Note: example 04 will fail because of an issue with duplicate column names; fixing it will have to wait until #757 is merged.

jeromedockes · 2023-10-20T10:20:06Z

This is ready for review, but I made it a draft so that it is not merged by mistake: #757 must be merged first to fix example 04.

skrub/_fuzzy_join.py

Vincent-Maladiere · 2023-10-30T14:29:30Z

examples/04_fuzzy_joining.py

-    "joiner-2__match_score": [0.2, 0.9],
-    "joiner-3__match_score": [0.2, 0.9],
+    "joiner-1__match_score": [0.1, 0.9],
+    "joiner-2__match_score": [0.1, 0.9],


I can't comment on those lines directly, but you need to update the description of the best match_score parameter here and in other places of the example, e.g.:
"The grid searching gave us the best value of 0.5 for the parameter"

jeromedockes · 2023-10-31T13:26:50Z

examples/04_fuzzy_joining.py

-# ``match_score``. Let's use this value in our regression:
-#
-
-print(f"Mean R2 score with pipeline is {grid.score(df, y):.2f}")


@Vincent-Maladiere on closer inspection, this part is not informative: we are scoring on the training data. If we cross-validate the grid search correctly, the score does not improve as much.

Depending on the CV random state, the grid search score can be the same as or worse than the score without grid search.
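
To illustrate the point about scoring on the training data, here is a minimal sketch using scikit-learn on synthetic data. It is not the code of example 04: the `Ridge` estimator, the `alpha` grid and the random data are stand-ins chosen only for illustration. It contrasts the optimistic score obtained by fitting and scoring the grid search on the same data with a nested cross-validation of the whole grid search.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: not the data used in example 04).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] + rng.normal(scale=1.0, size=60)

# Inner loop: choose the hyperparameter by cross-validation.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]}, cv=inner_cv)

# Optimistic estimate: the data used for fitting is also used for scoring.
train_score = grid.fit(X, y).score(X, y)

# Nested cross-validation: the grid search is refit on each outer training
# fold and scored on the held-out fold it never saw.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(grid, X, y, cv=outer_cv)

print(f"R2 on the training data: {train_score:.2f}")
print(f"Nested CV R2: {nested_scores.mean():.2f} +/- {nested_scores.std():.2f}")
```

Changing `random_state` in `outer_cv` gives a rough idea of how much the cross-validated score moves between splits.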

Vincent-Maladiere

LGTM!

jeromedockes · 2023-11-06T10:25:06Z

@jovan-stojanovic when you have time, would you mind having a look at this one? (Note that the example might still be improved in a later PR, e.g. to highlight the gain of fuzzy_join vs. a regular join, and we will think about other score rescaling strategies, but the goal of this PR is just to fix the scale so it starts at 0 and to remove the division by 0.)
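
For readers following along, the rescaling goal described above can be sketched as follows. This is a hypothetical helper, not the code merged in this PR: the name `rescale_match_score` and the choice of dividing by the largest distance are assumptions made for illustration. It shows one way to get a score that starts at 0, does not subtract the minimum distance (so the best score is not necessarily 1), and avoids dividing by 0 when there are only perfect matches (#764).

```python
import numpy as np


def rescale_match_score(distances):
    """Hypothetical helper: map non-negative distances to scores in [0, 1]."""
    distances = np.asarray(distances, dtype=float)
    if distances.size == 0:
        return distances.copy()
    max_dist = distances.max()
    if max_dist == 0:
        # Only perfect matches: avoid computing 0 / 0 and give every row
        # the top score.
        return np.ones_like(distances)
    # The worst match gets a score of 0. The minimum distance is not
    # subtracted, so a score of 1 is reached only by an exact match.
    return 1.0 - distances / max_dist


print(rescale_match_score([0.0, 1.0, 2.0]))
print(rescale_match_score([0.0, 0.0]))
print(rescale_match_score([1.0, 2.0]))
```

In the last call there is no exact match, so the best score stays below 1.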

jeromedockes added 2 commits October 19, 2023 17:58

fix the match score scaling

aa19518

changes

03b27fe

jeromedockes changed the title ~~fix the match score scaling~~ [MRG] fix the match score scaling Oct 20, 2023

jeromedockes changed the title ~~[MRG] fix the match score scaling~~ fix the match score scaling Oct 20, 2023

jeromedockes mentioned this pull request Oct 20, 2023

division by 0 in fuzzy_join when there are only perfect matches #764

Closed

jeromedockes added 4 commits October 20, 2023 12:03

add non-regression test for skrub-data#764

5fff3f1

check matching_score

681e6e3

do not subtract min distance ie best score is not necessarily 1

45a5771

update changes

fb9eb3f

jeromedockes marked this pull request as draft October 20, 2023 10:19

Tialo reviewed Oct 20, 2023

skrub/_fuzzy_join.py (resolved)

jeromedockes added 2 commits October 30, 2023 14:55

Merge remote-tracking branch 'upstream/main' into matching_score_scaling

cbb83d9

update example

2ec7055

jeromedockes marked this pull request as ready for review October 30, 2023 14:00

jeromedockes changed the title ~~fix the match score scaling~~ [MRG] Fix the match score scaling Oct 30, 2023

Vincent-Maladiere reviewed Oct 30, 2023

jeromedockes added 3 commits October 31, 2023 14:17

fix overfitting in example

89a35a7

Merge remote-tracking branch 'upstream/main' into matching_score_scaling

fac938d

add note

a49e0a6

jeromedockes commented Oct 31, 2023

jeromedockes added 2 commits October 31, 2023 15:09

correct comment

307674b

update comment

8bfee9f

jeromedockes mentioned this pull request Oct 31, 2023

[MRG] harmonizing the Joiner parameters #757

Merged

jeromedockes added 4 commits November 2, 2023 11:04

typo

f98478a

Merge remote-tracking branch 'upstream/main' into matching_score_scaling

fbef8e0

iter example

b1d2f7b

Merge remote-tracking branch 'upstream/main' into matching_score_scaling

87d8bd4

Vincent-Maladiere approved these changes Nov 3, 2023

jeromedockes requested a review from jovan-stojanovic November 6, 2023 10:23

Vincent-Maladiere merged commit 77b1ccc into skrub-data:main Nov 8, 2023
24 checks passed
