Numerical range query on an "optional" fast field not returning all results #2225

appaquet · 2023-10-24T19:54:23Z

When running a numerical range query over a fast field that contains optional values (i.e. no multi-valued, but either 0 or 1 value), not all matching results are returned. The number of results returned is nondeterministic when segment merge and multi-threads indexation are used, but becomes deterministic in single-thread with a no merge policy.

I think I have narrowed down the problem here: https://github.com/tantivy-search/tantivy/blob/66ff53b0f48b83373b4595abc4af15995876f614/src/query/range_query/fast_field_range_query.rs#L139

I may be wrong, but I think it's early exiting when building the matching docset since it's comparing next document start to the number of values in the column. By comparing it to number of documents in the column instead (self.column.num_docs()), tests are still passing and it's fixing my issue.

If my fix seems adequate and I'm not missing anything, I can open a PR for the fix. Of course, I'd like to include a test as well, so advise on where such a test could ideally be written would be nice.

Thanks!

Which version of tantivy are you using?
Latest version (0.21)

To Reproduce

Minimum producible code

#[test]
fn range_query_fast_optional_field_minimum() {
    let mut schema_builder = tantivy::schema::SchemaBuilder::new();
    let id_field = schema_builder.add_text_field("id", tantivy::schema::STRING);
    let score_field =
        schema_builder.add_u64_field("score", tantivy::schema::FAST | tantivy::schema::INDEXED);

    let dir = tantivy::directory::RamDirectory::default();
    let index = tantivy::IndexBuilder::new()
        .schema(schema_builder.build())
        .open_or_create(dir)
        .unwrap();

    {
        let mut writer = index.writer(100_000_000).unwrap();

        let count = 1000;
        for i in 0..count {
            let mut doc = tantivy::Document::new();
            doc.add_text(id_field, format!("doc{i}"));

            let nb_scores = i % 2; // 0 or 1 scores
            for _ in 0..nb_scores {
                doc.add_u64(score_field, 80);
            }

            writer.add_document(doc).unwrap();
        }
        writer.commit().unwrap();
    }

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let query = tantivy::query::RangeQuery::new_u64_bounds(
        "score".to_string(),
        std::ops::Bound::Included(70),
        std::ops::Bound::Unbounded,
    );

    let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    assert_eq!(count, 500);
}

Deterministic reproducible code with breadcrumbs info

    let mut schema_builder = tantivy::schema::SchemaBuilder::new();
    let id_field = schema_builder.add_text_field("id", tantivy::schema::STRING);

    // This is working:
    // let score_field =
    //     schema_builder.add_u64_field("score", tantivy::schema::STORED | tantivy::schema::INDEXED);

    // But this doesn't work, so this is in the fast field section of the code.
    let score_field =
        schema_builder.add_u64_field("score", tantivy::schema::FAST | tantivy::schema::INDEXED);

    // Tested with both u64 and f64, same result.

    let dir = tantivy::directory::RamDirectory::default();
    let index = tantivy::IndexBuilder::new()
        .schema(schema_builder.build())
        .open_or_create(dir)
        .unwrap();

    {
        // At 1 thread, it's consistently wrong with the same value (448)
        // Varying arena doesn't impact, as long as there is enough to fit in 1 segment
        // FIXME: BUT USING A BUFFER OF 10_000_000 makes it loop forever. probably another bug.
        let mut writer = index.writer_with_num_threads(1, 20_000_000).unwrap();
        // let mut writer = index.writer_with_num_threads(2, 500_000_000).unwrap();
        // let mut writer = index.writer_with_num_threads(3, 500_000_000).unwrap();

        // Fails with or without merging, so it's not a merge issue
        // But disabling merge makes it easier to debug
        writer.set_merge_policy(Box::new(NoMergePolicy));

        let count = 1000;
        for i in 0..count {
            let mut doc = tantivy::Document::new();
            doc.add_text(id_field, format!("doc{i}"));

            let nb_scores = i % 2; // 0 or 1, but as soon as we have more than 1, it always works
            for _ in 0..nb_scores {
                doc.add_u64(score_field, 80); // high value has no impact here
            }

            // this works, so it seems to be when we interleave
            // if i < 500 {
            //     doc.add_u64(score_field, 80);
            // }

            writer.add_document(doc).unwrap();
        }
        writer.commit().unwrap();
    }

    let reader = index.reader().unwrap();
    let searcher = reader.searcher();

    let query = tantivy::query::RangeQuery::new_u64_bounds(
        "score".to_string(),
        std::ops::Bound::Included(70),
        std::ops::Bound::Unbounded,
    );

    // This is consistently wrong, so it's not just some random corruption when reading.
    // Either the index is corrupted or the reader reads it wrong.
    // for _ in 0..1 {
    //     let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    //     println!("got count: {}, expected 500", count);
    // }

    let count = searcher.search(&query, &tantivy::collector::Count).unwrap();
    assert_eq!(count, 500);

The text was updated successfully, but these errors were encountered:

Fix range query end check in advance Rename vars to reduce ambiguity Fixes #2225

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz · 2023-10-25T05:28:36Z

@appaquet Thanks for the great bug report, your analysis is correct.

The fast field range query tries to check blocks of docids instead of single calls to the fast field. If you have something that equals a full-scan this variant is much faster, especially since we employ SIMD in the range matching on the fast field.
A little bit odd is that it operates on docs and not on values. So the actual checked values depends on the field cardinality.

The minimum fetch range is 128, so this is not well covered by existing tests, since most tests have less than 128 docs.
The fast field range query preceded optional values on the fast field, so this check got incorrectly changed without any tests failing, when we added optional cardinality.

I created a PR here and added your test (#2226), and also updated the catch-all test-suite in index_writer to test for more range queries.

On your comment. The minimum buffer size is 15_000_000, or it will commit every doc. This is fixed on main.

// FIXME: BUT USING A BUFFER OF 10_000_000 makes it loop forever. probably another bug.

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

appaquet · 2023-10-25T12:27:18Z

Thanks for the swift response! The fix was a bit more involved than anticipated, but happy that my analysis was right.
Any plan to release a new version soon with the fix? I'll switch to the main branch to fix the issue on our side for now.

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz · 2023-10-26T04:25:19Z

At it's core the fix is just one line, but I also like to do postmortems to fix anything that contributed to a bug.
The bugfix has been released with 0.21.1

fulmicoton added bug high priority labels Oct 24, 2023

fulmicoton assigned PSeitz Oct 24, 2023

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query

19ba038

Fix range query end check in advance Rename vars to reduce ambiguity Fixes #2225

PSeitz mentioned this issue Oct 25, 2023

Fix range query #2226

Merged

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query

32f662f

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query

833fe37

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query

e166e64

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz closed this as completed in #2226 Oct 25, 2023

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query (#2226)

83af14c

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

PSeitz added a commit that referenced this issue Oct 25, 2023

Fix range query (#2226)

21aabf9

Fix range query end check in advance Rename vars to reduce ambiguity add tests Fixes #2225

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numerical range query on an "optional" fast field not returning all results #2225

Numerical range query on an "optional" fast field not returning all results #2225

appaquet commented Oct 24, 2023 •

edited

Loading

PSeitz commented Oct 25, 2023

appaquet commented Oct 25, 2023

PSeitz commented Oct 26, 2023

Numerical range query on an "optional" fast field not returning all results #2225

Numerical range query on an "optional" fast field not returning all results #2225

Comments

appaquet commented Oct 24, 2023 • edited Loading

PSeitz commented Oct 25, 2023

appaquet commented Oct 25, 2023

PSeitz commented Oct 26, 2023

appaquet commented Oct 24, 2023 •

edited

Loading