Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numerical range query on an "optional" fast field not returning all results #2225

Closed
appaquet opened this issue Oct 24, 2023 · 3 comments · Fixed by #2226
Closed

Numerical range query on an "optional" fast field not returning all results #2225

appaquet opened this issue Oct 24, 2023 · 3 comments · Fixed by #2226
Assignees

Comments

@appaquet
Copy link
Collaborator

appaquet commented Oct 24, 2023

When running a numerical range query over a fast field that contains optional values (i.e. no multi-valued, but either 0 or 1 value), not all matching results are returned. The number of results returned is nondeterministic when segment merge and multi-threads indexation are used, but becomes deterministic in single-thread with a no merge policy.

I think I have narrowed down the problem here: https://github.com/tantivy-search/tantivy/blob/66ff53b0f48b83373b4595abc4af15995876f614/src/query/range_query/fast_field_range_query.rs#L139

I may be wrong, but I think it's early exiting when building the matching docset since it's comparing next document start to the number of values in the column. By comparing it to number of documents in the column instead (self.column.num_docs()), tests are still passing and it's fixing my issue.

If my fix seems adequate and I'm not missing anything, I can open a PR for the fix. Of course, I'd like to include a test as well, so advise on where such a test could ideally be written would be nice.

Thanks!

Which version of tantivy are you using?
Latest version (0.21)

To Reproduce

Minimum producible code
#[test]
fn range_query_fast_optional_field_minimum() {
    let mut schema_builder = tantivy::schema::SchemaBuilder::new();
    let id_field = schema_builder.add_text_field("id", tantivy::schema::STRING);
    let score_field =
        schema_builder.add_u64_field("score", tantivy::schema::FAST | tantivy::schema::INDEXED);

    let dir = tantivy::directory::RamDirectory::default();
    let index = tantivy::IndexBuilder::new()
        .schema(schema_builder.build())
        .open_or_create(dir)
        .unwrap();

    {
        let mut writer = index.writer(100_000_000).unwrap();

        let count = 1000;
        for i in 0..count {
            let mut doc = tantivy::Document::new();
            doc.add_text(id_field, format!("doc{i}"));

            let nb_scores = i % 2; // 0 or 1 scores
            for _ in 0..nb_scores {
                doc.add_u64(score_field, 80);
            }

            writer.add_document(doc).unwrap();
        }
        writer.commit().unwrap();
    }

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let query = tantivy::query::RangeQuery::new_u64_bounds(
        "score".to_string(),
        std::ops::Bound::Included(70),
        std::ops::Bound::Unbounded,
    );

    let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    assert_eq!(count, 500);
}
Deterministic reproducible code with breadcrumbs info
    let mut schema_builder = tantivy::schema::SchemaBuilder::new();
    let id_field = schema_builder.add_text_field("id", tantivy::schema::STRING);

    // This is working:
    // let score_field =
    //     schema_builder.add_u64_field("score", tantivy::schema::STORED | tantivy::schema::INDEXED);

    // But this doesn't work, so this is in the fast field section of the code.
    let score_field =
        schema_builder.add_u64_field("score", tantivy::schema::FAST | tantivy::schema::INDEXED);

    // Tested with both u64 and f64, same result.

    let dir = tantivy::directory::RamDirectory::default();
    let index = tantivy::IndexBuilder::new()
        .schema(schema_builder.build())
        .open_or_create(dir)
        .unwrap();

    {
        // At 1 thread, it's consistently wrong with the same value (448)
        // Varying arena doesn't impact, as long as there is enough to fit in 1 segment
        // FIXME: BUT USING A BUFFER OF 10_000_000 makes it loop forever. probably another bug.
        let mut writer = index.writer_with_num_threads(1, 20_000_000).unwrap();
        // let mut writer = index.writer_with_num_threads(2, 500_000_000).unwrap();
        // let mut writer = index.writer_with_num_threads(3, 500_000_000).unwrap();

        // Fails with or without merging, so it's not a merge issue
        // But disabling merge makes it easier to debug
        writer.set_merge_policy(Box::new(NoMergePolicy));

        let count = 1000;
        for i in 0..count {
            let mut doc = tantivy::Document::new();
            doc.add_text(id_field, format!("doc{i}"));

            let nb_scores = i % 2; // 0 or 1, but as soon as we have more than 1, it always works
            for _ in 0..nb_scores {
                doc.add_u64(score_field, 80); // high value has no impact here
            }

            // this works, so it seems to be when we interleave
            // if i < 500 {
            //     doc.add_u64(score_field, 80);
            // }

            writer.add_document(doc).unwrap();
        }
        writer.commit().unwrap();
    }

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let query = tantivy::query::RangeQuery::new_u64_bounds(
        "score".to_string(),
        std::ops::Bound::Included(70),
        std::ops::Bound::Unbounded,
    );

    // This is consistently wrong, so it's not just some random corruption when reading.
    // Either the index is corrupted or the reader reads it wrong.
    // for _ in 0..1 {
    //     let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    //     println!("got count: {}, expected 500", count);
    // }

    let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    assert_eq!(count, 500);
PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
Fixes #2225
PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
@PSeitz
Copy link
Contributor

PSeitz commented Oct 25, 2023

@appaquet Thanks for the great bug report, your analysis is correct.

The fast field range query tries to check blocks of docids instead of single calls to the fast field. If you have something that equals a full-scan this variant is much faster, especially since we employ SIMD in the range matching on the fast field.
A little bit odd is that it operates on docs and not on values. So the actual checked values depends on the field cardinality.

The minimum fetch range is 128, so this is not well covered by existing tests, since most tests have less than 128 docs.
The fast field range query preceded optional values on the fast field, so this check got incorrectly changed without any tests failing, when we added optional cardinality.

I created a PR here and added your test (#2226), and also updated the catch-all test-suite in index_writer to test for more range queries.

On your comment. The minimum buffer size is 15_000_000, or it will commit every doc. This is fixed on main.

// FIXME: BUT USING A BUFFER OF 10_000_000 makes it loop forever. probably another bug.

PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
@appaquet
Copy link
Collaborator Author

Thanks for the swift response! The fix was a bit more involved than anticipated, but happy that my analysis was right.
Any plan to release a new version soon with the fix? I'll switch to the main branch to fix the issue on our side for now.

PSeitz added a commit that referenced this issue Oct 25, 2023
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
@PSeitz
Copy link
Contributor

PSeitz commented Oct 26, 2023

At it's core the fix is just one line, but I also like to do postmortems to fix anything that contributed to a bug.
The bugfix has been released with 0.21.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants