Add regex tokenizer #1759

mkleen · 2023-01-04T11:40:29Z

This adds a regex tokenizer that tokenizes text by using a regex pattern to split it. This is my first attempt and it works for my use case, but it's not ideal from a code or configuration perspective.

Closes #1670
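
For readers following along, here is a rough, std-only sketch of the general idea of cutting text into tokens while tracking byte offsets. It is not the code in this PR: a character predicate stands in for the regex so the example has no dependencies, and `SketchToken` and `tokenize_sketch` are hypothetical names chosen for illustration.

```rust
/// A token together with the byte offsets it was cut from.
/// (Hypothetical type for this sketch, not tantivy's `Token`.)
#[derive(Debug, PartialEq)]
pub struct SketchToken {
    pub text: String,
    pub offset_from: usize,
    pub offset_to: usize,
}

/// Emits every maximal run of characters accepted by `is_token_char`.
/// The closure is a stand-in for a regex match. Offsets are byte offsets
/// into `text`, so `&text[offset_from..offset_to]` reproduces the token.
pub fn tokenize_sketch(text: &str, is_token_char: impl Fn(char) -> bool) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    // Byte index where the current token started, if we are inside one.
    let mut start: Option<usize> = None;
    for (idx, ch) in text.char_indices() {
        match (is_token_char(ch), start) {
            // First accepted character: a new token begins here.
            (true, None) => start = Some(idx),
            // First rejected character after a token: close it.
            (false, Some(from)) => {
                tokens.push(SketchToken {
                    text: text[from..idx].to_string(),
                    offset_from: from,
                    offset_to: idx,
                });
                start = None;
            }
            // Still inside a token, or still between tokens.
            _ => {}
        }
    }
    // A token that runs to the end of the input still has to be emitted.
    if let Some(from) = start {
        tokens.push(SketchToken {
            text: text[from..].to_string(),
            offset_from: from,
            offset_to: text.len(),
        });
    }
    tokens
}
```

Calling `tokenize_sketch(text, |c| c.is_alphanumeric())` yields one token per alphanumeric run. Empty tokens are never produced, and the end offset is exclusive, which is the kind of detail worth asserting in a test when checking for off-by-one errors.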

src/tokenizer/regex_tokenizer.rs

Gearme · 2023-01-04T11:47:01Z

Wow, snap! I just authored #1758 to do much the same, albeit with capture group support.
Didn't find a better solution for the cloned Regex either though ;)

src/tokenizer/regex_tokenizer.rs

mkleen · 2023-01-04T11:50:33Z

> Wow, snap! I just authored #1758 to do much the same, albeit with capture group support. Didn't find a better solution for the cloned Regex either though ;)

Haha, great. If you have capture group support then you have probably the better version.

src/tokenizer/regex_tokenizer.rs

codecov-commenter · 2023-01-04T13:41:59Z

Codecov Report

Merging #1759 (0c99e75) into main (b78dc5e) will increase coverage by 0.01%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##             main    #1759      +/-   ##
==========================================
+ Coverage   94.13%   94.14%   +0.01%     
==========================================
  Files         267      269       +2     
  Lines       50900    50985      +85     
==========================================
+ Hits        47915    48002      +87     
+ Misses       2985     2983       -2

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/tokenizer/mod.rs | `95.61% <ø> (ø)` | |
| src/tokenizer/regex_tokenizer.rs | `92.85% <92.85%> (ø)` | |
| src/lib.rs | `96.04% <0.00%> (-0.44%)` | ⬇️ |
| src/fastfield/readers.rs | `89.47% <0.00%> (-0.36%)` | ⬇️ |
| src/fastfield/mod.rs | `99.73% <0.00%> (-0.01%)` | ⬇️ |
| src/query/mod.rs | `100.00% <0.00%> (ø)` | |
| src/query/range_query.rs | | |
| src/query/range_query_ip_fastfield.rs | | |
| src/query/range_query/range_query.rs | `91.16% <0.00%> (ø)` | |
| src/query/range_query/range_query_ip_fastfield.rs | `97.00% <0.00%> (ø)` | |

... and 6 more


adamreichold

I think this would work. Some nits on the formatting, one concern about the intended meaning of regular expression anchors.

I would suggest doing capture group support as a follow-up.

I think the performance question of whether to clone Regex or Arc<Regex> can only be solved via benchmarks. I would suggest keeping it as it is for now if only for simplicity until such benchmarks are written.
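
To make the two options concrete, here is a minimal sketch of what "clone the pattern" versus "share it behind an `Arc`" means structurally. `CompiledPattern`, `OwningTokenizer`, and `SharingTokenizer` are hypothetical stand-ins invented for this illustration; they are not the `regex` crate's types or the code in this PR, and the sketch says nothing about which option is actually faster.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a compiled pattern (not `regex::Regex`).
#[derive(Clone, Debug)]
pub struct CompiledPattern {
    pub source: String,
}

/// Option 1: the tokenizer owns the pattern. Cloning the tokenizer
/// copies whatever the pattern type's `Clone` impl copies.
#[derive(Clone, Debug)]
pub struct OwningTokenizer {
    pub pattern: CompiledPattern,
}

/// Option 2: the tokenizer holds an `Arc`. Cloning the tokenizer only
/// bumps a reference count, and all clones share one allocation.
#[derive(Clone, Debug)]
pub struct SharingTokenizer {
    pub pattern: Arc<CompiledPattern>,
}

/// Builds one tokenizer of each kind from the same pattern source.
pub fn build_both(source: &str) -> (OwningTokenizer, SharingTokenizer) {
    let pattern = CompiledPattern { source: source.to_string() };
    let owning = OwningTokenizer { pattern: pattern.clone() };
    let sharing = SharingTokenizer { pattern: Arc::new(pattern) };
    (owning, sharing)
}
```

Whether the extra indirection of `Arc` pays off depends on how expensive the real pattern type's `Clone` is, which is why measuring with a benchmark is the sensible way to decide.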

src/tokenizer/regex_tokenizer.rs

mkleen · 2023-01-06T11:08:03Z

@adamreichold Thank you for your feedback and the review. I like your idea of first merging this simpler version and then going ahead and adding the group capture support afterwards. I can wrap this up pretty quickly. @Gearme would that be OK for you as well? We could make the follow-up based on your implementation. I would also like to try out a benchmark to address #1759 (review), but also as a follow-up.

mkleen · 2023-01-06T13:13:48Z

Oops, @PSeitz, removing you from the review was not intentional.

mkleen · 2023-01-06T14:29:31Z

Sorry for requesting multiple reviews. I did not realize I can only request one review at a time.

mkleen · 2023-01-06T18:32:07Z

I just discovered there is an off-by-one error in the offsets of the second token. I will address that.

mkleen · 2023-01-06T20:29:48Z

> I just discovered there is an off-by-one error in the offsets of the second token. I will address that.

I think this is fine; false alarm.

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>

mkleen commented Jan 4, 2023

src/tokenizer/regex_tokenizer.rs (resolved)

mkleen commented Jan 4, 2023

src/tokenizer/regex_tokenizer.rs (resolved)

mkleen marked this pull request as ready for review January 4, 2023 11:48

fulmicoton reviewed Jan 4, 2023

src/tokenizer/regex_tokenizer.rs (outdated, resolved)

mkleen force-pushed the mkleen/regex_tokenizer branch from 4e931f0 to 7672b0a Compare January 5, 2023 07:10

fulmicoton requested review from adamreichold and PSeitz January 6, 2023 03:45

adamreichold reviewed Jan 6, 2023

adamreichold mentioned this pull request Jan 6, 2023

Implement RegexTokenizer #1758

Closed

mkleen requested review from adamreichold and fulmicoton and removed request for PSeitz, adamreichold and fulmicoton January 6, 2023 13:12

mkleen force-pushed the mkleen/regex_tokenizer branch 3 times, most recently from 746d533 to f4a43f9 Compare January 6, 2023 13:19

mkleen requested review from fulmicoton and adamreichold and removed request for adamreichold and fulmicoton January 6, 2023 14:11

mkleen requested a review from adamreichold January 6, 2023 14:12

adamreichold approved these changes Jan 6, 2023

mkleen force-pushed the mkleen/regex_tokenizer branch from f4a43f9 to 43537a5 Compare January 6, 2023 17:21

Add regex tokenizer

0c99e75

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split.

mkleen force-pushed the mkleen/regex_tokenizer branch from 43537a5 to 0c99e75 Compare January 6, 2023 17:25

fulmicoton approved these changes Jan 10, 2023

fulmicoton merged commit 196e42f into quickwit-oss:main Jan 10, 2023

This was referenced Jan 13, 2023

truncation comment PSeitz/tantivy#30

Closed

use stats PSeitz/tantivy#31

Closed

fulmicoton pushed a commit that referenced this pull request Jan 16, 2023

Add regex tokenizer (#1759)

4363e11

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>

Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023

Add regex tokenizer (quickwit-oss#1759)

5c08f6e

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. Co-authored-by: Michael Kleen <mkleen@gmailw.com>

PSeitz mentioned this pull request Jan 31, 2023

update lz4 flex PSeitz/tantivy#33

Closed
