replace fuzzywuzzy #858
for eachLetter in str2:
    if eachLetter.isalnum() or eachLetter.isspace():
        newStr2 += eachLetter
I am pretty certain that this should be str2, since otherwise it would just compare str1 with str1.
Yes 😅
-    A wrapper around `fuzzywuzzy.partial_ratio` to handle UTF-8 encoded
+    A wrapper around `rapidfuzz.fuzz.partial_ratio` to handle UTF-8 encoded
     emojis that usually cause errors
     '''
#! this will throw an error if either string contains a UTF-8 encoded emoji
Do you have an example of when this occurs? I failed to reproduce this with both fuzzywuzzy and rapidfuzz.
This mostly happens when you try to parse YouTube Music playlists. We actually blanket-parse all YouTube Music results, so if a playlist has emojis, the whole thing breaks even though we have no use for emojis. The same goes for video results with emojis in them.
E.g., in "EMOJI CHALLENGE ★ Guess the Fifth Harmony (including Camila) Song Titles", the ⭐ will cause an error.
Hm, strange, I still fail to reproduce this:
fuzz.partial_ratio("EMOJI CHALLENGE ★ Guess the Fifth Harmony (including Camila) Song Titles", "EMOJI CHALLENGE ★")
works fine for me. At least in RapidFuzz I would consider this a bug.
Btw, I did see that you usually lowercase the strings as well before you match them, so you could use
fuzz.partial_ratio(str1, str2, score_cutoff=score_cutoff, processor=True)
instead, which would lowercase the strings, remove all non-alphanumeric characters, and trim whitespace at the start and end of the string (and is faster than doing the same thing in Python),
or pass a custom function when you want a different kind of preprocessing (the preprocessor has to accept a string as its argument and return the preprocessed string):
fuzz.partial_ratio(str1, str2, score_cutoff=score_cutoff, processor=your_preprocessor_function)
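As a rough illustration of what such a custom preprocessor could look like, here is a minimal sketch in plain Python. The name `simple_preprocessor` is hypothetical, and collapsing runs of whitespace is an extra choice made here, not something the built-in processor is stated to do. It follows the contract described above: take a string, return the preprocessed string.

```python
def simple_preprocessor(text: str) -> str:
    """Hypothetical preprocessor: takes a string, returns the cleaned string."""
    # Keep only letters, digits and whitespace; this drops emojis and punctuation.
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    # Lowercase, then collapse whitespace runs and trim both ends
    # (collapsing is an assumption of this sketch, not a documented behavior).
    return " ".join(kept.lower().split())


if __name__ == "__main__":
    print(simple_preprocessor("EMOJI CHALLENGE ★ Guess the Fifth Harmony"))
```

A function like this would be passed as `processor=simple_preprocessor` in the call shown above, replacing the manual character-filtering loop.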
@@ -402,15 +406,15 @@ def search_and_order_ytm_results(songName: str, songArtists: List[str],
             #! we use fuzzy matching because YouTube spellings might be mucked up
             if result['type'] == 'song':
                 for artist in songArtists:
-                    if match_percentage (artist.lower(), result['artist'].lower()) > 85:
+                    if match_percentage (artist.lower(), result['artist'].lower(), 85):
Using score_cutoff this way allows the fuzzy matching to exit early when the score cannot be reached.
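To make the early-exit idea concrete, here is a small sketch using only the standard library's `difflib`. It is not RapidFuzz's implementation, and `ratio_with_cutoff` is a hypothetical name. It assumes, as the changed line suggests, that a score below the cutoff is reported as 0 so the result can be used directly in an `if`.

```python
from difflib import SequenceMatcher


def ratio_with_cutoff(str1: str, str2: str, score_cutoff: float = 0.0) -> float:
    """Return a 0-100 similarity score, or 0.0 if it cannot reach score_cutoff."""
    matcher = SequenceMatcher(None, str1, str2, autojunk=False)

    # Cheap upper bounds first: if even an upper bound is below the cutoff,
    # the exact score cannot reach it, so the expensive step is skipped.
    if matcher.real_quick_ratio() * 100 < score_cutoff:
        return 0.0
    if matcher.quick_ratio() * 100 < score_cutoff:
        return 0.0

    # Only now pay for the exact similarity computation.
    score = matcher.ratio() * 100
    return score if score >= score_cutoff else 0.0


if __name__ == "__main__":
    if ratio_with_cutoff("fifth harmony", "fifth harmony", 85):
        print("artist matches")
```

The caller no longer compares against 85 itself; passing the threshold in lets the matcher decide when further work is pointless.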
Good design decision.
Good job.
Use commit messages, not just commit titles.
You're right, I should really do this 👍
Author: @maxbachmann comma missing in install_requires since #858. closes #869
FuzzyWuzzy is GPLv2 licensed, which would force you to license the whole project under GPLv2.
For this reason, this pull request replaces FuzzyWuzzy with rapidfuzz, which implements the same algorithm but is based on a version of fuzzywuzzy that was MIT licensed.
Rapidfuzz is:
Since it is written in C++14, on Windows it requires the C++ Redistributable 2015 to be installed (or a newer version such as 2019, which includes the 2015 version automatically).