Skip to content

A little optimization about BKDReader #14717

@kkewwei

Description

@kkewwei
Contributor

Description

1.Reduce the match times for BKD node whose relation=CELL_CROSSES_QUERY.

When performing range query, if the Relation is CELL_CROSSES_QUERY, we need to traverse each value in this leaf node and matches. A segment query typically involves two leaf nodes with Relation set to CELL_CROSSES_QUERY, leading to 512×2 matches.

In reality, the 512 values on each BKD leaf node are stored in a compressed format, such as "30 consecutive 1, 30 consecutive 2, 30 consecutive 3.....". Notably, matching for 30 consecutive 1 only requires 1 times, instead of 30. With this optimization, the BKD tree only needs a few matching rather than 512 per leaf.

For a shard with 25 segments, the original approach would involve 25×512×2 = 25,600 matching operations. Through the above optimization, however, a lot of reduction in matching can be achieved.

visitor.visit(scratchIterator, scratchPackedValue);

......
scratchIterator.reset(i, length);
 if (visitor.compare(scratchPackedValue, scratchPackedValue) == PointValues.Relation.CELL_INSIDE_QUERY) {
          visitor.visit(new IntsRef(scratchIterator.docIDs, i, length));
}
......

2.When calling PointValues.intersect(), if there is no matching node left, we should exit promptly.

For example, in the following case, only leaf node 7 matches. In reality, we would visit nodes from 1 to 13.

  • When visiting leaf node 7: relation = CELL_CROSSES_QUERY
  • When visiting leaf node 8: relation = CELL_OUTSIDE_QUERY

Once the relation changes to CELL_OUTSIDE_QUERY in node 8, we should exit immediately instead of continuing to visit nodes 9 to 13.
Image

Activity

iverase

iverase commented on May 26, 2025

@iverase
Contributor

This approach is only considering one dimensional data with queries represented by simple ranges. I don't think this will work for high dimensional data, e.g LatLonShape and complex queries, e.g multipolygon queries.

gf2121

gf2121 commented on May 27, 2025

@gf2121
Contributor

Is this a duplicate of #14244?

jainankitk

jainankitk commented on May 29, 2025

@jainankitk
Contributor

While I can see the concerns with high dimensional data, I am wondering if this can be a good improvement for PointTreeBulkCollector, as it is limited to single dimensional fields. We also have specific jmh benchmarks to understand the impact better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @jainankitk@kkewwei@iverase@gf2121

      Issue actions

        A little optimization about BKDReader · Issue #14717 · apache/lucene