General architecture feedback #52
For string similarities, these are very close adaptations of the NLTK Python code, except for Dice similarity, which is relatively straightforward. There is probably some room for improvement, in particular for Levenshtein. Actually, I just discovered https://github.com/dguo/strsim-rs, which also covers most of these.
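For illustration, here is a minimal sketch of a bigram-based Dice (Sørensen–Dice) similarity; the function names are hypothetical and not this crate's actual API, and the strsim-rs crate mentioned above exposes comparable functions if you'd rather not hand-roll it.

```rust
use std::collections::HashSet;

// Collect the set of character bigrams of a string.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

// Dice similarity: twice the bigram overlap divided by the total
// number of bigrams in both strings.
fn dice_similarity(a: &str, b: &str) -> f64 {
    let (ba, bb) = (bigrams(a), bigrams(b));
    if ba.is_empty() && bb.is_empty() {
        return 1.0; // both strings are too short to have any bigrams
    }
    let overlap = ba.intersection(&bb).count() as f64;
    2.0 * overlap / (ba.len() + bb.len()) as f64
}

fn main() {
    // "night" and "nacht" share only the bigram "ht": 2 * 1 / (4 + 4) = 0.25
    println!("{:.2}", dice_similarity("night", "nacht"));
}
```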
@rth thanks for the welcome, I'll be going through the package this weekend and next week; if I'm productive enough I'll have some examples worked through (and maybe even a blog post!).
Great, thank you @jbowles! Feel free to write any general comments you have about this project here (or in any of the related issues).
To give you some background, I have been working on topics related to `CountVectorizer`/`HashingVectorizer` in scikit-learn for a few years, and this project originated as an attempt at making those faster. A few things got added along the way. I'm a fairly beginner Rust programmer, so general feedback about the architecture of this crate would be very welcome. In particular, adding more common traits per module would probably be good (I started some of that work in #48). Some of it was also limited by the fact that I wanted to make a thin wrapper in PyO3 to expose the functionality in Python, which adds some constraints (e.g. #48 (comment)).

For tokenization, one thing I saw was that if one takes the `unicode-segmentation` crate, it will tokenize the text almost exactly as expected for NLP applications, with a few exceptions. The nice thing about it is that it's language independent and based on the Unicode spec, which removes the need to maintain a large number of regexps / custom rules. To improve the F1 score for tokenization on the UD treebank, a few custom rules are additionally applied.

On the other side, we can imagine other tokenizers. In particular, the fact that some tasks require custom processing is a valid point. I'm not sure how to make that easier.
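As a quick illustration, here is a minimal sketch of the underlying `unicode-segmentation` call (not this crate's own tokenizer API); it assumes `unicode-segmentation = "1"` in `Cargo.toml`, and the example text/output are adapted from that crate's documentation.

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    // unicode_words() splits on UAX #29 word boundaries and skips
    // segments that contain no alphanumeric characters, so punctuation
    // like parentheses and the trailing "?" is dropped automatically.
    let tokens: Vec<&str> = text.unicode_words().collect();
    println!("{:?}", tokens);
    // e.g. ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
}
```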
Yes, it looks quite good. Related issue #51
Generally, if I can do anything to make this collaboration easier, please let me know :)