Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex tokenizer #1759

Merged
merged 1 commit into from
Jan 10, 2023
Merged

Conversation

mkleen
Copy link
Contributor

@mkleen mkleen commented Jan 4, 2023

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. This is my first attempt and works for my usecase, but it's not ideal from a code and a configuration perspective.

Closes #1670

@Gearme
Copy link

Gearme commented Jan 4, 2023

Wow, snap! I just authored #1759 to do much the same, albeit with capture group support.
Didn't find a better solution for the cloned Regex either though ;)

@mkleen mkleen marked this pull request as ready for review January 4, 2023 11:48
@mkleen
Copy link
Contributor Author

mkleen commented Jan 4, 2023

Wow, snap! I just authored #1759 to do much the same, albeit with capture group support. Didn't find a better solution for the cloned Regex either though ;)

Haha, great. If you have capture group support then you have probably the better version.

@codecov-commenter
Copy link

codecov-commenter commented Jan 4, 2023

Codecov Report

Merging #1759 (0c99e75) into main (b78dc5e) will increase coverage by 0.01%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##             main    #1759      +/-   ##
==========================================
+ Coverage   94.13%   94.14%   +0.01%     
==========================================
  Files         267      269       +2     
  Lines       50900    50985      +85     
==========================================
+ Hits        47915    48002      +87     
+ Misses       2985     2983       -2     
Impacted Files Coverage Δ
src/tokenizer/mod.rs 95.61% <ø> (ø)
src/tokenizer/regex_tokenizer.rs 92.85% <92.85%> (ø)
src/lib.rs 96.04% <0.00%> (-0.44%) ⬇️
src/fastfield/readers.rs 89.47% <0.00%> (-0.36%) ⬇️
src/fastfield/mod.rs 99.73% <0.00%> (-0.01%) ⬇️
src/query/mod.rs 100.00% <0.00%> (ø)
src/query/range_query.rs
src/query/range_query_ip_fastfield.rs
src/query/range_query/range_query.rs 91.16% <0.00%> (ø)
src/query/range_query/range_query_ip_fastfield.rs 97.00% <0.00%> (ø)
... and 6 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Collaborator

@adamreichold adamreichold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would work. Some nits on the formatting, one concern about the intended meaning of regular expression anchors.

I would suggest doing capture group support as a follow-up.

I think the performance question of whether to clone Regex or Arc<Regex> can only be solved via benchmarks. I would suggest keeping it as it is for now if only for simplicity until such benchmarks are written.

src/tokenizer/regex_tokenizer.rs Outdated Show resolved Hide resolved
src/tokenizer/regex_tokenizer.rs Show resolved Hide resolved
src/tokenizer/regex_tokenizer.rs Outdated Show resolved Hide resolved
src/tokenizer/regex_tokenizer.rs Outdated Show resolved Hide resolved
src/tokenizer/regex_tokenizer.rs Show resolved Hide resolved
@mkleen
Copy link
Contributor Author

mkleen commented Jan 6, 2023

@adamreichold Thank you for your feedback and the review. I like your idea of first merging this simpler version and then going ahead and adding the group capture support afterwards. I can wrap this up pretty quickly. @Gearme would that be OK for you as well? We could make the follow-up based on your implementation. I would also like to try out a benchmark to address #1759 (review), but also as a follow-up.

@mkleen mkleen requested review from adamreichold and fulmicoton and removed request for PSeitz, adamreichold and fulmicoton January 6, 2023 13:12
@mkleen
Copy link
Contributor Author

mkleen commented Jan 6, 2023

Ups, @PSeitz removing you from review was not intentionally.

@mkleen mkleen force-pushed the mkleen/regex_tokenizer branch 3 times, most recently from 746d533 to f4a43f9 Compare January 6, 2023 13:19
@mkleen mkleen requested review from fulmicoton and adamreichold and removed request for adamreichold and fulmicoton January 6, 2023 14:11
@mkleen
Copy link
Contributor Author

mkleen commented Jan 6, 2023

Sorry for requesting multiple reviews. I did not realize I can only request one review at the time.

This adds a regex tokenizer which tokenizes the text by using a
regex pattern to split.
@mkleen
Copy link
Contributor Author

mkleen commented Jan 6, 2023

I just discovered there is a off-by-one on the offsets of the second token. I will address that.

@mkleen
Copy link
Contributor Author

mkleen commented Jan 6, 2023

I just discovered there is a off-by-one on the offsets of the second token. I will address that.

I think this is fine, wrong alarm.

@fulmicoton fulmicoton merged commit 196e42f into quickwit-oss:main Jan 10, 2023
This was referenced Jan 13, 2023
fulmicoton pushed a commit that referenced this pull request Jan 16, 2023
This adds a regex tokenizer which tokenizes the text by using a
regex pattern to split.

Co-authored-by: Michael Kleen <mkleen@gmailw.com>
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023
This adds a regex tokenizer which tokenizes the text by using a
regex pattern to split.

Co-authored-by: Michael Kleen <mkleen@gmailw.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a configurable RegexTokenizer
5 participants