Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix proximity search #1934 #2000

Closed
wants to merge 2 commits into from
Closed

fix proximity search #1934 #2000

wants to merge 2 commits into from

Conversation

sftse
Copy link

@sftse sftse commented Apr 17, 2023

This is one possible way to fix proximity search.

It might be worthwhile to note that after we made this fix, we discovered that the tantivy docs are inconsistent with Lucene. Our fix targets the definition of slop that is the maximum allowed offset distance between terms.

We've split the changes into two commits to highlight the race condition. One can test that one commit still leads to nondeterminism in the originally stated case in #1934 while the second commit eliminates this non-determinism.

 the left array into the left array instead of storing the matching entries
 from the right array. Also, do not skip possible matches within the slop window.
 This makes the property (A intersect B) intersect C = (A intersect C) intersect B
 hold when used with a slop, which makes reasoning about correctness
 straight forward.
Copy link
Author

@sftse sftse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've added a test case to demonstrate why we think skipping the saving of intermediate matches within the slop window in fn intersection_with_slop is incorrect.

@codecov-commenter
Copy link

Codecov Report

Merging #2000 (d41b384) into main (9e2faec) will increase coverage by 94.43%.
The diff coverage is 97.14%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff            @@
##           main    #2000       +/-   ##
=========================================
+ Coverage      0   94.43%   +94.43%     
=========================================
  Files         0      320      +320     
  Lines         0    59258    +59258     
=========================================
+ Hits          0    55961    +55961     
- Misses        0     3297     +3297     
Impacted Files Coverage Δ
src/query/intersection.rs 97.76% <ø> (ø)
src/query/phrase_query/phrase_scorer.rs 88.88% <97.10%> (ø)
columnar/src/column_values/mod.rs 81.08% <100.00%> (ø)

... and 317 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@fulmicoton fulmicoton requested a review from PSeitz April 25, 2023 07:14
@fulmicoton
Copy link
Collaborator

@PSeitz can you have a look?

@PSeitz
Copy link
Contributor

PSeitz commented May 1, 2023

Thanks for the PR, but as I already have shown in the minimal example in the original issue, this is not a race condition. In one commit it's mentioned that the segments are sorted before they are passed. These are not segments, but docsets on a single segment.

The root cause is the algorithm expects the order of the docsets to match the order of the terms. That is not uphold when the the docsets are sorted by size. #2020 fixes that, by allowing slop in both directions.

@sftse
Copy link
Author

sftse commented May 3, 2023

The minimal example is independent to what we're claiming. We split our changes into two commits to show that 1) our first commit fixes your minimal example, but 2) continues to be non-deterministic in the multithreaded case. We agree the root cause is the sorting of docsets, which should have been clear from our PR. It is possible we're mistaken, but we don't see how the minimal example provides the evidence that this is not a race condition. We don't disagree with your statement, but you seem to object to ours.

#2020 is incorrect. A slop that works in both directions interacts poorly with PostingsWithOffset: "wendy subject subject captain" matches the query "wendy subject captain"~1 which should match only "wendy subject captain" or "captain subject wendy". Another counterexample is "captain subject wendy" that does not match the query "wendy captain"~1.

We provided a test demonstrating that (updated to match new functions)

let mut left = vec![3, 5];
intersection_count_with_slop(&mut left, &[1, 2], 2, true);
intersection_count_with_slop(&mut left, &[4], 2, true);

is not equivalent with

let mut left = vec![3, 5];
intersection_count_with_slop(&mut left, &[4], 2, true);
intersection_count_with_slop(&mut left, &[1, 2], 2, true);

which seems like an obvious bug, which #2020 does not address, and our PR suggests a solution. We were looking for clarification in the subsequent comment to OP, can we ask you directly to clarify if this is intended behavior? We have not attempted to construct a minimal example with the public API to trigger these arguments, but it would be odd if intersection doesn't need to be commutative.

Edit: Our PR also does not make intersection commutative, but makes (A intersection B) intersection C = (A intersection C) intersection B hold which is the weakest property under which we could reason our solution to be correct. This is because we sort all docsets except the first. If arbitrary sorting is allowed, we don't know how to argue about the correctness without assuming A intersection B = B intersection A.

@PSeitz
Copy link
Contributor

PSeitz commented May 5, 2023

It is possible we're mistaken, but we don't see how the minimal example provides the evidence that this is not a race condition.

It shows that it's an error in the algorithm. 100% deterministic without any threading.

If you create an index, what data gets into which segment is not deterministic. A lot of algorithms in tantivy depend on the actual data in a segment. Just because they depend on what data is in a segment does not mean it's a race condition. After segment creation queries are deterministic.

A slop that works in both directions interacts poorly with PostingsWithOffset: "wendy subject subject captain" matches the query "wendy subject captain"~1

Yes, as it should be. The distance is 1.

Another counterexample is "captain subject wendy" that does not match the query "wendy captain"~1

Yes, this should not match, the distance is bigger than 1.

We provided a test demonstrating that (updated to match new functions)

let mut left = vec![3, 5];
intersection_count_with_slop(&mut left, &[1, 2], 2, true);
intersection_count_with_slop(&mut left, &[4], 2, true);

is not equivalent with

let mut left = vec![3, 5];
intersection_count_with_slop(&mut left, &[4], 2, true);
intersection_count_with_slop(&mut left, &[1, 2], 2, true);

Thanks for bringing that up, I was wondering how this algorithm could work correctly by only storing one value in left.
It doesn't, it's buggy.
This test fails for both PRs. On your PR even for slop = 2.

    #[test]
    pub fn test_phrase_score_with_slop_bug_2() -> crate::Result<()> {
        let index = create_index(&["a x b x c", "a a c"])?;
        let scores = test_query(1, &index, vec!["a", "b", "c"]);
        assert_eq!(scores.len(), 1);
        Ok(())
    }

I think we need to store multiple hits similar like in your PR.

I also checked how slop is handled in elastic, and there the slop is more like budget for between all terms. This would required a different approach, probably where we carry the remaining slop distance with each hit.

@sftse
Copy link
Author

sftse commented May 5, 2023

I see, we were erroneously under the impression that the slop indicates the size of a window within which all the terms must be found. Our counterexamples and the PR that additionally assumes the terms must appear in order don't make much sense with any other definition.

@fulmicoton
Copy link
Collaborator

@sftse @PSeitz Is the problem entirely solved by #2020 and should we close this PR?

@sftse sftse closed this May 8, 2023
@PSeitz
Copy link
Contributor

PSeitz commented May 9, 2023

I created a PR, that fixes open issues with terms > 2 and adds missing clarifications
#2031

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants