Add `clickbench_pushdown` benchmark #16731

alamb · 2025-07-09T21:02:26Z

Which issue does this PR close?

Related to Enable parquet filter pushdown (filter_pushdown) by default #3463
Closes Add a datafusion benchmark for filter_pushdown #16729

Rationale for this change

In order to enable filter_pushdown by default, we need to ensure it doesn't regress existing performance.

However, it has been very hard to make forward progress on improving filter pushdown because all our benchmarks compare running with filter pushdown to running without it, so the bar for change is quite high.
Here is the most recent example:

POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) #16711

It seems obvious, but the right metric for improvements to filter pushdown is a comparison against a baseline where filter pushdown is already on. However, we don't have any such benchmark (see #16729 and #16730 for why the existing benchmarks are not good enough).

What changes are included in this PR?

Add a benchmark (clickbench_pushdown) that turns on both filter_pushdown and reorder_filters.

You can run it like this:

`./benchmarks/bench.sh run clickbench_pushdown`

Which then invokes

+ cargo run --release --bin dfbench -- clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json

Are these changes tested?

I tested it manually. You can see the Q30 time increase when --pushdown is enabled, as expected.

with --pushdown:

...     Running `target/profiling/dfbench clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: true, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
Query 30 avg time: 492.42 ms

without --pushdown:

...
     Running `target/profiling/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
Query 30 avg time: 283.43 ms
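
To put a number on the Q30 regression, here is a minimal sketch that recomputes the two averages from the iteration lines above and reports the ratio. It is not part of the PR; the helper name `iteration_times_ms` and the regular expression are illustrative assumptions. Because it averages the rounded per-iteration values printed in the log, the results can differ from the reported averages in the last digit.

```python
import re
from statistics import mean

# Iteration lines copied verbatim from the two Q30 runs above.
WITH_PUSHDOWN = """\
Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
"""

WITHOUT_PUSHDOWN = """\
Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
"""

# Hypothetical pattern matching the "iteration N took X ms" log lines.
ITERATION_RE = re.compile(r"iteration \d+ took ([\d.]+) ms")


def iteration_times_ms(log: str) -> list[float]:
    """Extract the per-iteration timings (in ms) from benchmark log text."""
    return [float(value) for value in ITERATION_RE.findall(log)]


with_avg = mean(iteration_times_ms(WITH_PUSHDOWN))
without_avg = mean(iteration_times_ms(WITHOUT_PUSHDOWN))
slowdown = with_avg / without_avg

print(f"with --pushdown:    {with_avg:.2f} ms")
print(f"without --pushdown: {without_avg:.2f} ms")
print(f"slowdown:           {slowdown:.2f}x")
```

On these numbers Q30 runs roughly 1.7x slower with --pushdown, which is the kind of gap the new benchmark is meant to make visible.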

Are there any user-facing changes?

No, this is a development process change only.

zhuqi-lucas

LGTM, thank you @alamb !

This is very helpful for monitoring performance in the pushdown case!

alamb · 2025-07-10T12:27:43Z

I tested this benchmark with our filter pushdown work here, and I think it is useful

#16711 (comment)

Thank you @zhuqi-lucas for the review

Add clickbench_pushdown benchmark

5cda900

alamb marked this pull request as draft July 9, 2025 21:02

alamb marked this pull request as ready for review July 9, 2025 21:39

alamb mentioned this pull request Jul 9, 2025

POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) #16711

zhuqi-lucas approved these changes Jul 10, 2025

adjust benchmark name

7ffd8e9
