-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Description
1.Reduce the match times for BKD node whose relation=CELL_CROSSES_QUERY
.
When performing range query, if the Relation is CELL_CROSSES_QUERY
, we need to traverse each value in this leaf node and matches
. A segment query typically involves two leaf nodes with Relation set to CELL_CROSSES_QUERY
, leading to 512×2 matches
.
In reality, the 512 values on each BKD leaf node are stored in a compressed format, such as "30 consecutive 1, 30 consecutive 2, 30 consecutive 3.....". Notably, matching
for 30 consecutive 1
only requires 1 times, instead of 30. With this optimization, the BKD tree only needs a few matching
rather than 512 per leaf.
For a shard with 25 segments, the original approach would involve 25×512×2 = 25,600 matching
operations. Through the above optimization, however, a lot of reduction in matching
can be achieved.
visitor.visit(scratchIterator, scratchPackedValue); |
......
scratchIterator.reset(i, length);
if (visitor.compare(scratchPackedValue, scratchPackedValue) == PointValues.Relation.CELL_INSIDE_QUERY) {
visitor.visit(new IntsRef(scratchIterator.docIDs, i, length));
}
......
2.When calling PointValues.intersect()
, if there is no matching node left, we should exit promptly.
For example, in the following case, only leaf node 7 matches. In reality, we would visit nodes from 1 to 13.
- When visiting leaf node 7: relation = CELL_CROSSES_QUERY
- When visiting leaf node 8: relation = CELL_OUTSIDE_QUERY
Once the relation changes to CELL_OUTSIDE_QUERY in node 8, we should exit immediately instead of continuing to visit nodes 9 to 13.
Activity
iverase commentedon May 26, 2025
This approach is only considering one dimensional data with queries represented by simple ranges. I don't think this will work for high dimensional data, e.g LatLonShape and complex queries, e.g multipolygon queries.
gf2121 commentedon May 27, 2025
Is this a duplicate of #14244?
jainankitk commentedon May 29, 2025
While I can see the concerns with high dimensional data, I am wondering if this can be a good improvement for PointTreeBulkCollector, as it is limited to single dimensional fields. We also have specific jmh benchmarks to understand the impact better.