-
-
Notifications
You must be signed in to change notification settings - Fork 673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add regex tokenizer #1759
Add regex tokenizer #1759
Conversation
Wow, snap! I just authored #1759 to do much the same, albeit with capture group support. |
Haha, great. If you have capture group support then you have probably the better version. |
Codecov Report
@@ Coverage Diff @@
## main #1759 +/- ##
==========================================
+ Coverage 94.13% 94.14% +0.01%
==========================================
Files 267 269 +2
Lines 50900 50985 +85
==========================================
+ Hits 47915 48002 +87
+ Misses 2985 2983 -2
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
4e931f0
to
7672b0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would work. Some nits on the formatting, one concern about the intended meaning of regular expression anchors.
I would suggest doing capture group support as a follow-up.
I think the performance question of whether to clone Regex
or Arc<Regex>
can only be solved via benchmarks. I would suggest keeping it as it is for now if only for simplicity until such benchmarks are written.
@adamreichold Thank you for your feedback and the review. I like your idea of first merging this simpler version and then going ahead and adding the group capture support afterwards. I can wrap this up pretty quickly. @Gearme would that be OK for you as well? We could make the follow-up based on your implementation. I would also like to try out a benchmark to address #1759 (review), but also as a follow-up. |
Ups, @PSeitz removing you from review was not intentionally. |
746d533
to
f4a43f9
Compare
Sorry for requesting multiple reviews. I did not realize I can only request one review at the time. |
f4a43f9
to
43537a5
Compare
This adds a regex tokenizer which tokenizes the text by using a regex pattern to split.
43537a5
to
0c99e75
Compare
I just discovered there is a off-by-one on the offsets of the second token. I will address that. |
I think this is fine, wrong alarm. |
This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>
This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>
This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. This is my first attempt and works for my usecase, but it's not ideal from a code and a configuration perspective.
Closes #1670