replace fuzzywuzzy #858

Merged
1 commit merged on Oct 4, 2020

maxbachmann
Contributor

FuzzyWuzzy is GPLv2 licensed, which would force you to license the whole project under GPLv2.
For this reason, this pull request replaces FuzzyWuzzy with RapidFuzz, which implements the same algorithm but is based on a version of FuzzyWuzzy that was MIT licensed.
RapidFuzz is:

  • MIT licensed, so it can be used with the license used by this project
  • faster than FuzzyWuzzy

Since it is written in C++14, on Windows it requires the C++ Redistributable 2015 to be installed (or a newer version such as 2019, which includes the 2015 version automatically).

Comment on lines +60 to 62
for eachLetter in str2:
    if eachLetter.isalnum() or eachLetter.isspace():
        newStr2 += eachLetter

Contributor Author

I am pretty certain that this should be str2, since otherwise it would just compare str1 with str1.

Yesh, 😅
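
The copy-paste slip discussed in this thread can be avoided by routing both inputs through a single helper. A minimal sketch, assuming a hypothetical helper name `keep_alnum_and_spaces` and a `newStr1` variable that mirrors the `newStr2` shown in the diff (neither is confirmed to exist in the project):

```python
def keep_alnum_and_spaces(text: str) -> str:
    """Drop every character that is neither alphanumeric nor whitespace."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


# Illustrative inputs, not taken from the project.
str1 = "Fifth Harmony ★"
str2 = "fifth harmony (official)"

# Each input goes through the same helper, so neither result can be
# built from the wrong source string by accident.
newStr1 = keep_alnum_and_spaces(str1)
newStr2 = keep_alnum_and_spaces(str2)

print(repr(newStr1), repr(newStr2))
```
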


-    A wrapper around `fuzzywuzzy.partial_ratio` to handle UTF-8 encoded
+    A wrapper around `rapidfuzz.fuzz.partial_ratio` to handle UTF-8 encoded
     emojis that usually cause errors
     '''

     #! this will throw an error if either string contains a UTF-8 encoded emoji

Contributor Author

Do you have an example of when this occurs? I failed to reproduce this with both fuzzywuzzy and rapidfuzz.

This mostly happens when you try to parse YouTube Music playlists. We blanket-parse all YouTube Music results, so if a playlist has emojis, the whole thing breaks even though we have no use for emojis. The same goes for video results with emojis in them.

E.g. in "EMOJI CHALLENGE ★ Guess the Fifth Harmony (including Camila) Song Titles", the ⭐ will cause an error.

Contributor Author

Hm, strange, I still fail to reproduce this:

fuzz.partial_ratio("EMOJI CHALLENGE ★ Guess the Fifth Harmony (including Camila) Song Titles", "EMOJI CHALLENGE ★")

works fine for me. At least in RapidFuzz, I would consider this a bug.

Btw, I saw that you usually lowercase the strings as well before you match them, so you could use

fuzz.partial_ratio(str1, str2, score_cutoff=score_cutoff, processor=True)

instead, which would lowercase the strings, remove all non-alphanumeric characters, and trim whitespace at the start and end of the string (but is faster than doing the same thing in Python),

or pass a custom function when you want a different kind of preprocessing (the preprocessor has to accept a string as its argument and return the preprocessed string):

fuzz.partial_ratio(str1, str2, score_cutoff=score_cutoff, processor=your_preprocessor_function)
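
A custom preprocessor of the kind described above might look like the following. This is only a sketch in plain Python: the function body is an assumption about what the project would want (drop emojis and punctuation, normalize whitespace, lowercase), and it only approximates the built-in processing described in the comment, it does not reproduce RapidFuzz's own implementation.

```python
import re


def your_preprocessor_function(text: str) -> str:
    """Hypothetical custom preprocessor: takes a string, returns the cleaned string."""
    # Replace every character that is neither alphanumeric nor whitespace
    # (emojis, punctuation) with a space so neighbouring words stay separated.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # Collapse the whitespace runs this leaves behind, then trim and lowercase.
    return re.sub(r"\s+", " ", cleaned).strip().lower()


title = "EMOJI CHALLENGE ★ Guess the Fifth Harmony (including Camila) Song Titles"
print(your_preprocessor_function(title))
# emoji challenge guess the fifth harmony including camila song titles
```

Because the function takes one string and returns one string, it matches the contract stated in the comment and could be passed as the `processor` argument.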

@@ -402,15 +406,15 @@ def search_and_order_ytm_results(songName: str, songArtists: List[str],
             #! we use fuzzy matching because YouTube spellings might be mucked up
             if result['type'] == 'song':
                 for artist in songArtists:
-                    if match_percentage (artist.lower(), result['artist'].lower()) > 85:
+                    if match_percentage (artist.lower(), result['artist'].lower(), 85):

Contributor Author

Using the score_cutoff this way allows the fuzzy matching to exit early when the score cannot be reached.

Good design decision.
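
The cutoff idea from this thread can be illustrated with the standard library alone. The sketch below is not the project's `match_percentage` and not RapidFuzz: it uses `difflib.SequenceMatcher` as a stand-in scorer (a full ratio rather than a partial ratio) and assumes the convention that a score below the cutoff is reported as 0, which is what makes the result usable directly in an `if`.

```python
from difflib import SequenceMatcher


def match_percentage(str1: str, str2: str, score_cutoff: float = 0) -> float:
    """Stand-in scorer: similarity in percent, or 0 when below score_cutoff."""
    matcher = SequenceMatcher(None, str1, str2)

    # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio().
    # If even an upper bound misses the cutoff, the expensive exact
    # comparison can be skipped entirely.
    if matcher.real_quick_ratio() * 100 < score_cutoff:
        return 0.0
    if matcher.quick_ratio() * 100 < score_cutoff:
        return 0.0

    score = matcher.ratio() * 100
    return score if score >= score_cutoff else 0.0


# A score of 0 is falsy, so the call itself can serve as the condition.
if match_percentage("drake", "drake", 85):
    print("artist matches")

if not match_percentage("drake", "a very different long artist name", 85):
    print("artist rejected without a full comparison")
```

With this convention the `> 85` comparison from the old code is no longer needed at the call site, which mirrors the change in the diff above.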

@ghost ghost left a comment

Good job.

Use commit messages, not just commit titles.

@ghost ghost merged commit 845fd41 into spotDL:master Oct 4, 2020
Contributor Author

maxbachmann commented Oct 4, 2020

Use commit messages, not just commit titles.

You're right, I should really do this 👍

@maxbachmann maxbachmann mentioned this pull request Oct 4, 2020
ghost pushed a commit that referenced this pull request Oct 5, 2020
Author: @maxbachmann 

comma missing in install_requires since #858.
closes #869
This pull request was closed.