Numerical range query on an "optional" fast field not returning all results #2225
Fix range query end check in advance Rename vars to reduce ambiguity Fixes #2225
Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225
@appaquet Thanks for the great bug report; your analysis is correct. The fast field range query checks blocks of doc ids instead of making a single call to the fast field per document. When the query amounts to a full scan, this variant is much faster, especially since we use SIMD for the range matching on the fast field. The minimum fetch range is 128 docs, so this path is not well covered by existing tests, most of which index fewer than 128 docs. I created a PR (#2226) that includes your test, and I also updated the catch-all test suite in index_writer to exercise more range queries. Regarding your comment: the minimum buffer size is 15_000_000; otherwise it will commit on every doc. This is fixed on main.
Thanks for the swift response! The fix was a bit more involved than anticipated, but I'm happy that my analysis was right.
At its core the fix is just one line, but I also like to do postmortems to fix anything that contributed to a bug.
When running a numerical range query over a fast field that contains optional values (i.e. not multi-valued, but holding either 0 or 1 value per document), not all matching results are returned. The number of results returned is nondeterministic when segment merging and multi-threaded indexing are used, but becomes deterministic with a single indexing thread and a no-merge policy.
I think I have narrowed down the problem here: https://github.com/tantivy-search/tantivy/blob/66ff53b0f48b83373b4595abc4af15995876f614/src/query/range_query/fast_field_range_query.rs#L139
I may be wrong, but I think it exits early when building the matching docset, because it compares the next document start to the number of values in the column. When I compare it to the number of documents in the column instead (`self.column.num_docs()`), the tests still pass and my issue is fixed.
If my fix seems adequate and I'm not missing anything, I can open a PR for it. Of course, I'd like to include a test as well, so advice on where such a test should ideally live would be appreciated.
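To make the suspected early exit concrete, here is a minimal, self-contained sketch of a block-wise scan over an optional column. It is not tantivy's code: `OptionalColumn`, `sparse_column`, `scan`, and `BLOCK` are hypothetical names made up for illustration, and the block size of 128 is borrowed from the minimum fetch range mentioned in the maintainer's reply. The only point is to show how bounding the loop by the number of values, rather than the number of documents, can drop matches when many documents have no value.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for an optional fast-field column:
/// one slot per document, holding at most one value.
struct OptionalColumn {
    values: Vec<Option<u64>>,
}

impl OptionalColumn {
    /// Number of documents covered by the column.
    fn num_docs(&self) -> u32 {
        self.values.len() as u32
    }

    /// Number of documents that actually hold a value.
    fn num_vals(&self) -> u32 {
        self.values.iter().filter(|v| v.is_some()).count() as u32
    }

    /// Appends to `out` the doc ids in `start..end` whose value lies in `range`.
    fn docs_in_range(
        &self,
        range: &RangeInclusive<u64>,
        start: u32,
        end: u32,
        out: &mut Vec<u32>,
    ) {
        for doc in start..end.min(self.num_docs()) {
            if let Some(value) = self.values[doc as usize] {
                if range.contains(&value) {
                    out.push(doc);
                }
            }
        }
    }
}

/// Assumed block size, taken from the "minimum fetch range" in the discussion.
const BLOCK: u32 = 128;

/// Scans the column block by block, stopping once the next block start
/// reaches `doc_limit`. Passing `num_vals()` models the suspected bug,
/// passing `num_docs()` models the proposed fix.
fn scan(column: &OptionalColumn, range: RangeInclusive<u64>, doc_limit: u32) -> Vec<u32> {
    let mut hits = Vec::new();
    let mut next_start = 0u32;
    while next_start < doc_limit {
        let end = next_start + BLOCK;
        column.docs_in_range(&range, next_start, end, &mut hits);
        next_start = end;
    }
    hits
}

/// Builds a column where only every third document has a value (its own doc id).
fn sparse_column(num_docs: u32) -> OptionalColumn {
    let values = (0..num_docs)
        .map(|doc| if doc % 3 == 0 { Some(doc as u64) } else { None })
        .collect();
    OptionalColumn { values }
}
```

In this toy model, `scan(&sparse_column(300), 0..=1000, column.num_vals())` stops after the first block because the column holds only 100 values, so matches past doc 127 are never visited, while passing `column.num_docs()` as the limit visits all 300 documents. With fewer than 128 documents the first block already covers everything, which is consistent with small test corpora not surfacing the problem.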
Thanks!
Which version of tantivy are you using?
Latest version (0.21)
To Reproduce
Minimum producible code
Deterministic reproducible code with breadcrumbs info