Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Porting progress #1

Open
28 of 59 tasks
maxbachmann opened this issue Nov 15, 2023 · 5 comments
Open
28 of 59 tasks

Porting progress #1

maxbachmann opened this issue Nov 15, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@maxbachmann
Copy link
Member

maxbachmann commented Nov 15, 2023

This issue tracks missing components in the rust port:

  • Levenshtein
    • basic distances
    • cached distances
    • simd implementation
    • edit operations
  • DamerauLevenshtein
    • basic distances
    • cached distances
  • Hamming
    • basic distances
    • cached distances
    • edit operations
  • Indel
    • basic distances
    • cached distances
    • simd implementation
    • edit operations
  • LCS
    • basic distances
    • cached distances
    • simd implementation
    • edit operations
  • OSA
    • basic distances
    • cached distances
  • Postfix
    • basic distances
    • cached distances
  • Prefix
    • basic distances
    • cached distances
  • Jaro
    • basic distances
    • cached distances
  • JaroWinkler
    • basic distances
    • cached distances
  • fuzz
    • basic distances
      • ratio
      • partial_ratio
      • token_ratio
      • token_set_ratio
      • token_sort_ratio
      • partial_token_ratio
      • partial_token_set_ratio
      • partial_token_sort_ratio
      • wratio
    • cached distances
      • ratio
      • partial_ratio
      • token_ratio
      • token_set_ratio
      • token_sort_ratio
      • partial_token_ratio
      • partial_token_set_ratio
      • partial_token_sort_ratio
      • wratio
    • simd implementation

The goal for the first release is to have all of the basic + cached distances implemented. Edit operations and simd are more niche and so not required for a first release.

@abstractqqq
Copy link

First, rapid fuzz is a beautiful library with cutting edge algorithms and I admire the work a lot. Second, I am working on an in-dataframe (polars dataframe plugin) data analysis tool. For strings, fuzzy string matching, similarity metrics are crucial. Right now I am relying on strsim and some of my own code for that, which has mediocre performance and would be really happy to see rapid fuzz coming to Rust.

I have done some benchmarks using my Rust impl vs. rapid fuzz (via Python UDF), and you can see results here: abstractqqq/polars_ds_extension#17 . Those are some interesting numbers.

Is there any way I can support this project? Thank you.

@maxbachmann
Copy link
Member Author

maxbachmann commented Nov 28, 2023

A couple of comments to the issue you linked:

  1. in regards to jaro and jaro-winkler in strsim, they are actually both implemented incorrectly 😅
    • Jaro Winkler doesn't properly restrict the prefix boosting mechanism. This appears to be intended, but since it breaks the algorithm I really consider it a bug
    • the Jaro implementation doesn't count transpositions correctly.
  2. The python test is far from optimal, since it involves calling the function N times. The Python implementation heavily relies on the user calling higher level functions in the process module which allow me to do some trickery behind the scenes to avoid the Python overhead (I store a C function pointer with every Python function that allows me to lower them). So in practice it could be a lot faster.

I uploaded a first in progress version of the library to cargo a couple of hours ago, which includes most basic implementations which is pretty much what you are after. To get the best out of it you should:

  1. try to call the batched implementations if you compare one string to multiple strings, since the allows caching parts of the algorithms.
  2. pass in the score_cutoff value if you are only interested in results above a certain threshold. For some algorithms this allows a more performant implementation (e.g. for Levenshtein it allows using Ukkonen bands to calculate only parts of the Levenshtein matrix)

Things still missing in the port are:

  • SIMD implementations of some of the algorithms which allow comparing multiple short sequences in parallel. This allows e.g. comparing 16 16 character texts using AVX2
  • multithreaded probably won't be implemented in the lib similar to the C++ implementation since this is better left to the user.
  • editops implementations which allow backtracing edits e.g. to visualize them in documents
  • most of the APIs from fuzzywuzzy/thefuzz in the fuzz module

Overall I am really surprised how fast the port went so far. Especially considering I didn't touch the language before. I would say I have probably around 2/3 of the code volume ported over.

As to how the project can be supported: Right now I am especially in need of people who have experience in rust and can help with code review. In particular any suggestions for improvements of the public interface would be very useful, so I can reduce breaking changes in the future.

@maxbachmann
Copy link
Member Author

@abstractqqq I did now pretty much finalize the API for rust. This changes all function signatures and so will need updates in your project as well. Let me know if you run into any issues with this.

This should now be pretty much the final API. I do not expect any more signature changes in the closer future, unless something about it is fundamentally broken.

@abstractqqq
Copy link

@abstractqqq I did now pretty much finalize the API for rust. This changes all function signatures and so will need updates in your project as well. Let me know if you run into any issues with this.

This should now be pretty much the final API. I do not expect any more signature changes in the closer future, unless something about it is fundamentally broken.

Thank you. I will update once I come back

@abstractqqq
Copy link

I updated to the latest version of rapidfuzz and everything is working great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants