Add aggregation bucket limit #1363

PSeitz · 2022-05-10T08:33:05Z

Change SegmentCollector.collect to return a Result

Validation happens in different phases depending on the aggregation:
Term: During segment collection
Histogram: At the end, when converting to intermediate buckets (we preallocate empty buckets for the range). Revisit after #1370
Range: When validating the request
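
As a rough illustration of the term case (validation during segment collection), the check could look like the sketch below. `TermBuckets`, `TooManyBuckets` and `max_buckets` are hypothetical names chosen for this example and are not taken from the tantivy code base.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch; tantivy's real error type differs.
#[derive(Debug, PartialEq)]
pub struct TooManyBuckets {
    pub limit: usize,
}

/// Counts documents per term and refuses to create more than `max_buckets` buckets.
pub struct TermBuckets {
    max_buckets: usize,
    counts: HashMap<u64, u64>,
}

impl TermBuckets {
    pub fn new(max_buckets: usize) -> Self {
        Self { max_buckets, counts: HashMap::new() }
    }

    /// Called once per document with the term id of that document.
    pub fn collect(&mut self, term_id: u64) -> Result<(), TooManyBuckets> {
        // Only a previously unseen term can grow the bucket count,
        // so the limit is checked only in that case.
        if !self.counts.contains_key(&term_id) && self.counts.len() >= self.max_buckets {
            return Err(TooManyBuckets { limit: self.max_buckets });
        }
        *self.counts.entry(term_id).or_insert(0) += 1;
        Ok(())
    }

    pub fn num_buckets(&self) -> usize {
        self.counts.len()
    }
}
```

Documents that fall into an existing bucket never fail here; only the creation of a bucket beyond the limit does.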

Closes #1331

codecov-commenter · 2022-05-11T06:30:30Z

Codecov Report

Merging #1363 (2e2822f) into main (cbd06ab) will increase coverage by 0.01%.
The diff coverage is 99.15%.

@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
+ Coverage   94.29%   94.31%   +0.01%     
==========================================
  Files         234      236       +2     
  Lines       42769    43524     +755     
==========================================
+ Hits        40331    41050     +719     
- Misses       2438     2474      +36

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/aggregation/agg_result.rs | `73.46% <33.33%> (-19.79%)` | ⬇️ |
| src/aggregation/segment_agg_result.rs | `91.50% <97.14%> (+0.53%)` | ⬆️ |
| src/aggregation/agg_req_with_accessor.rs | `94.36% <100.00%> (+0.66%)` | ⬆️ |
| src/aggregation/bucket/histogram/histogram.rs | `99.60% <100.00%> (+<0.01%)` | ⬆️ |
| src/aggregation/bucket/range.rs | `96.31% <100.00%> (+0.07%)` | ⬆️ |
| src/aggregation/bucket/term_agg.rs | `99.10% <100.00%> (+0.03%)` | ⬆️ |
| src/aggregation/collector.rs | `100.00% <100.00%> (ø)` | |
| src/aggregation/intermediate_agg_result.rs | `98.24% <100.00%> (+0.65%)` | ⬆️ |
| src/aggregation/metric/stats.rs | `97.20% <100.00%> (ø)` | |
| src/aggregation/mod.rs | `98.77% <100.00%> (+<0.01%)` | ⬆️ |

... and 51 more

fulmicoton · 2022-05-16T07:48:40Z

examples/custom_collector.rs

        let value = self.fast_field_reader.get(doc) as f64;
        self.stats.count += 1;
        self.stats.sum += value;
        self.stats.squared_sum += value * value;
+        Ok(())


Can you run the bench before and after this change to check it does not impact the compiler perf?
(Top + Count on "the" term query)

The MultiCollector is likely to be the trickiest because of the dynamic dispatch. Unfortunately, it is not part of the benchmark.

PSeitz · 2022-05-16

Yep, but in my benches so far, adding Result didn't add overhead, except for very tight and simple loops.

E.g. here `let value = self.fast_field_reader.get(doc) as f64;` probably considerably outweighs the Result overhead.

fulmicoton

I'm scared of the tantivy API change (Collector::collect returning a Result).

If I understand correctly, the goal is to detect that we have reached the collection limit and abort the query.

An alternative could be to detect the excess of buckets, keep collecting but avoid adding extra buckets, and return an error when harvesting the segment.
Would that be unreasonable?
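
A minimal sketch of this alternative, assuming a term-style aggregation: `collect` stays infallible, the first limit violation is cached, and the error only surfaces at harvest time. `CachingTermCollector`, `BucketLimitExceeded` and `harvest` are illustrative names for this example, not tantivy's API. As a simplification, the sketch stops doing any work once an error is cached, since the result would be discarded anyway.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct BucketLimitExceeded {
    pub limit: usize,
}

pub struct CachingTermCollector {
    max_buckets: usize,
    counts: HashMap<u64, u64>,
    /// First error seen during collection; reported when harvesting.
    cached_error: Option<BucketLimitExceeded>,
}

impl CachingTermCollector {
    pub fn new(max_buckets: usize) -> Self {
        Self { max_buckets, counts: HashMap::new(), cached_error: None }
    }

    /// Infallible, so the `collect` signature does not need to change.
    pub fn collect(&mut self, term_id: u64) {
        if self.cached_error.is_some() {
            // The result will be an error anyway, so skip further work.
            return;
        }
        let is_new = !self.counts.contains_key(&term_id);
        if is_new && self.counts.len() >= self.max_buckets {
            // Do not add the extra bucket; remember the failure instead.
            self.cached_error = Some(BucketLimitExceeded { limit: self.max_buckets });
            return;
        }
        *self.counts.entry(term_id).or_insert(0) += 1;
    }

    /// Returns the buckets, or the error cached during collection.
    pub fn harvest(self) -> Result<HashMap<u64, u64>, BucketLimitExceeded> {
        match self.cached_error {
            Some(err) => Err(err),
            None => Ok(self.counts),
        }
    }
}
```

The trade-off is that the query is not aborted early: the remaining documents of the segment are still iterated, even though each call returns almost immediately.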

PSeitz · 2022-05-17T03:57:46Z

> I'm scared of the tantivy API change (Collector::collect returning a Result).
>
> If I understand correctly, the goal is to detect that we have reached the collection limit and abort the query.

Scared because of performance?

I checked the performance and the "the" TOP10_COUNT query is indeed slower (TOP_10 and COUNT are the same speed, though). I introduced a change that I had been considering for some time, which is to collect hits and pass them as a block. I need to add a benchmark for the MultiCollector; I think it would profit more than average from that. Aggregations may also benefit.

In the current change, collect_block is implemented quite simply, just forwarding to the underlying function, but with manual unrolling we could probably gain some more performance.
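
The forwarding idea can be sketched as a trait with a default `collect_block` that calls `collect` per hit, which individual collectors may override. The trait name `BlockSegmentCollector`, the associated `Error` type and the `(DocId, Score)` slice layout are assumptions made for this example; the signatures in the actual change may differ.

```rust
pub type DocId = u32;
pub type Score = f32;

/// Hypothetical trait for this sketch, not tantivy's `SegmentCollector`.
pub trait BlockSegmentCollector {
    type Error;

    fn collect(&mut self, doc: DocId, score: Score) -> Result<(), Self::Error>;

    /// Default implementation: forward every hit of the block to `collect`
    /// and stop at the first error.
    fn collect_block(&mut self, docs: &[(DocId, Score)]) -> Result<(), Self::Error> {
        for &(doc, score) in docs {
            self.collect(doc, score)?;
        }
        Ok(())
    }
}

pub struct CountCollector {
    pub count: usize,
}

impl BlockSegmentCollector for CountCollector {
    type Error = std::convert::Infallible;

    fn collect(&mut self, _doc: DocId, _score: Score) -> Result<(), Self::Error> {
        self.count += 1;
        Ok(())
    }

    /// Override: a counter only needs the block length, not each hit.
    fn collect_block(&mut self, docs: &[(DocId, Score)]) -> Result<(), Self::Error> {
        self.count += docs.len();
        Ok(())
    }
}
```

With this shape, the `Result` is checked once per block by the caller rather than once per document, which is one plausible reason the block variant recovers most of the overhead.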

| Orig | With Result | With Result collect_block |
|---|---|---|
| 1,632 μs | 1,847 μs +13.2 % | 1,664 μs +2.3 % |
| 336,574 docs | 336,574 docs | 336,574 docs |

Something currently untested that may help is to always pass a fixed-size array of, e.g., 64 to collect_block.

Currently we always pass the score, but having some mechanism to request only docs without scores could yield gains in some scenarios.

> An alternative could be to detect the excess of buckets, keep collecting but avoid adding extra buckets, and return an error when harvesting the segment. Would that be unreasonable?

I thought about that, but returning a Result in collect is the right approach imo, since we may also want to read from sources other than infallible in-memory ones during segment collection.

fulmicoton · 2022-05-17T07:38:58Z

> Scared because of performance?

Performance and misuse yes.
I prefer the collect method to not return a Result.

Maybe a bool would do the trick?

COUNT and TOP_10 have a special code path. They are irrelevant here.

PSeitz · 2022-05-17T14:34:13Z

> Performance and misuse yes. I prefer the collect method to not return a Result.

Misuse how? Every tantivy user should know how to handle Result.

> Maybe a bool would do the trick?

Like a C-style API, where on false the error is collected somewhere? It's likely faster, but also pretty ugly.

I added a bench with the MultiCollector (Count + Top10). It is faster with the collect_block approach.

| Orig | With Result collect_block |
|---|---|
| 2,275 μs +26.1 % | 1,804 μs |
| 336,574 docs | 336,574 docs |

collect_block has other potential performance gains, e.g. for TopN we could check upfront if the best score from the block would make it in the TopN, if not skip the whole block. That would probably work pretty well for large results, when the TopN is becoming saturated.
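
A sketch of that block-skipping idea, under stated assumptions: `TopN` below is a simplified stand-in kept as a sorted `Vec` (a heap would be the more usual choice), scores are assumed not to be NaN, and hits that tie with the current threshold are not collected. None of these names or choices come from the PR itself.

```rust
/// Simplified top-N collector for this sketch; not tantivy's implementation.
pub struct TopN {
    limit: usize,
    /// Hits as (score, doc), kept sorted by descending score.
    hits: Vec<(f32, u32)>,
}

impl TopN {
    pub fn new(limit: usize) -> Self {
        Self { limit, hits: Vec::with_capacity(limit + 1) }
    }

    /// Lowest score currently in the top N, once the collector is saturated.
    fn threshold(&self) -> Option<f32> {
        if self.hits.len() == self.limit {
            self.hits.last().map(|hit| hit.0)
        } else {
            None
        }
    }

    pub fn collect(&mut self, doc: u32, score: f32) {
        if let Some(min) = self.threshold() {
            if score <= min {
                return;
            }
        }
        // Insert while keeping the descending order, then drop any overflow.
        let pos = self.hits.partition_point(|hit| hit.0 >= score);
        self.hits.insert(pos, (score, doc));
        self.hits.truncate(self.limit);
    }

    /// Collects a block of (doc, score) hits. Returns true if the whole
    /// block was skipped because its best score could not enter the top N.
    pub fn collect_block(&mut self, docs: &[(u32, f32)]) -> bool {
        if let Some(min) = self.threshold() {
            let best = docs.iter().map(|hit| hit.1).fold(f32::NEG_INFINITY, f32::max);
            if best <= min {
                return true;
            }
        }
        for &(doc, score) in docs {
            self.collect(doc, score);
        }
        false
    }

    /// Doc ids of the current top hits, best first.
    pub fn docs(&self) -> Vec<u32> {
        self.hits.iter().map(|hit| hit.1).collect()
    }
}
```

The skip only pays off once the collector is saturated, which matches the expectation above that it helps mostly for large result sets.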

fulmicoton · 2022-06-23T00:46:07Z

src/aggregation/bucket/range.rs

@@ -153,7 +154,7 @@ impl SegmentRangeCollector {
    ) -> crate::Result<IntermediateBucketResult> {
        let field_type = self.field_type;

-        let buckets = self
+        let buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry> = self


that's super helpful! thanks

PSeitz force-pushed the refactor_aggregation branch from 379d702 to 7c8e1e3 Compare May 11, 2022 06:12

PSeitz force-pushed the refactor_aggregation branch from 7c8e1e3 to 7bf00b3 Compare May 11, 2022 08:32

PSeitz added 4 commits May 12, 2022 12:26

refactor aggregations

3f88718

return result from segment collector

a99e545

forward error in aggregation collect

6a46322

PSeitz force-pushed the refactor_aggregation branch from cfa4d1b to 11ac451 Compare May 12, 2022 04:26

PSeitz changed the title ~~refactor aggregations~~ Add aggregation bucket limit May 12, 2022

PSeitz requested a review from fulmicoton May 12, 2022 04:28

PSeitz force-pushed the refactor_aggregation branch 3 times, most recently from f83dcb7 to 11792a0 Compare May 13, 2022 05:17

set max bucket size as parameter

44ea731

PSeitz force-pushed the refactor_aggregation branch from 11792a0 to 44ea731 Compare May 13, 2022 05:21

fulmicoton reviewed May 16, 2022

fulmicoton requested changes May 16, 2022

PSeitz force-pushed the refactor_aggregation branch from ac5f3bf to 9ca04b4 Compare May 17, 2022 03:37

introduce optional collect_block in segmentcollector

c5c2e59

add collect_block in segment_collector to handle groups of documents as performance optimization add collect_block for MultiCollector

PSeitz force-pushed the refactor_aggregation branch from f5a9123 to c5c2e59 Compare May 19, 2022 08:24

PSeitz added 3 commits May 19, 2022 16:25

Revert "introduce optional collect_block in segmentcollector"

17dcc99

This reverts commit c5c2e59.

Revert "return result from segment collector"

b114e55

This reverts commit a99e545.

cache and return error in aggregations

71f7507

fulmicoton reviewed Jun 23, 2022

fulmicoton reviewed Jun 23, 2022

Apply suggestions from code review

2e2822f

fulmicoton approved these changes Jun 23, 2022

PSeitz merged commit 6ca5f77 into main Jun 23, 2022

PSeitz deleted the refactor_aggregation branch June 23, 2022 02:27

PSeitz mentioned this pull request Nov 30, 2022

Limit number of histogram buckets. quickwit-oss/quickwit#2503

Closed
