Add clickbench_pushdown benchmark #16731

@alamb alamb commented Jul 9, 2025

Which issue does this PR close?

Rationale for this change

In order to enable filter_pushdown by default, we need to ensure it doesn't regress existing performance.

However, it has been very hard to make forward progress on improving filter pushdown because all our benchmarks compare runs with filter pushdown enabled against runs with it disabled, so the bar for change is quite high.
Here is the most recent example:

It may seem obvious, but the right metric for improvements to filter pushdown is a comparison against a baseline where filter pushdown is already on. However, we don't have any such benchmark (see #16729 and #16730 for why the existing benchmarks are not good enough).

What changes are included in this PR?

Add a benchmark (clickbench_pushdown) that turns on filter_pushdown and reorder_filters.

You can run it like this:

`./benchmarks/bench.sh run clickbench_pushdown`

Which then invokes

+ cargo run --release --bin dfbench -- clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json
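To compare runs with and without pushdown from a script, the invocation above can be assembled programmatically. The following is a minimal sketch: the helper name `dfbench_clickbench_args` and the relative paths are hypothetical, and only the flags that appear in the command above (`--pushdown`, `--iterations`, `--path`, `--queries-path`, `-o`, `--query`) are used.

```python
from typing import List, Optional


def dfbench_clickbench_args(
    data_path: str,
    queries_path: str,
    output_path: str,
    *,
    pushdown: bool,
    iterations: int = 5,
    query: Optional[int] = None,
) -> List[str]:
    """Build the argv list for a dfbench clickbench run (hypothetical helper)."""
    args = ["cargo", "run", "--release", "--bin", "dfbench", "--", "clickbench"]
    if pushdown:
        # The only difference between the two benchmark variants.
        args.append("--pushdown")
    args += [
        "--iterations", str(iterations),
        "--path", data_path,
        "--queries-path", queries_path,
        "-o", output_path,
    ]
    if query is not None:
        # Restrict the run to a single query, e.g. 30.
        args += ["--query", str(query)]
    return args


# Placeholder paths, assumed to be relative to a datafusion checkout.
with_pushdown = dfbench_clickbench_args(
    "benchmarks/data/hits_partitioned",
    "benchmarks/queries/clickbench/queries",
    "results/clickbench_pushdown.json",
    pushdown=True,
    query=30,
)
without_pushdown = dfbench_clickbench_args(
    "benchmarks/data/hits_partitioned",
    "benchmarks/queries/clickbench/queries",
    "results/clickbench_partitioned.json",
    pushdown=False,
    query=30,
)
print(" ".join(with_pushdown))
print(" ".join(without_pushdown))
```

Either list can be passed to `subprocess.run(args, check=True)` from the repository root.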

Are these changes tested?

I tested it manually. You can see the Q30 time increase when --pushdown is enabled (average 492.42 ms versus 283.43 ms), as expected.

With --pushdown:

...     Running `target/profiling/dfbench clickbench --pushdown --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: true, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
Query 30 avg time: 492.42 ms

Without --pushdown:

...
     Running `target/profiling/dfbench clickbench --iterations 5 --path /Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned --queries-path /Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries -o /Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json --query 30`
Running benchmarks with the following options: RunOpt { query: Some(30), pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned", queries_path: "/Users/andrewlamb/Software/datafusion/benchmarks/queries/clickbench/queries", output_path: Some("/Users/andrewlamb/Software/datafusion/benchmarks/results/alamb_new_filter_pushdown/clickbench_partitioned.json") }
Q30: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591
-- set datafusion.execution.parquet.binary_as_string = true

SELECT "SearchEngineID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
Query 30 avg time: 283.43 ms
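The size of the Q30 regression can be worked out directly from the iteration lines above. The snippet below is a small sketch that parses the `Query 30 iteration N took X ms` lines copied from the two logs and computes the mean and the slowdown ratio; the function and variable names are illustrative only. The recomputed means differ from the reported averages in the last digit, presumably because the printed iteration times are rounded to one decimal place.

```python
import re

# Matches lines such as "Query 30 iteration 0 took 546.0 ms and returned 10 rows".
ITERATION_RE = re.compile(r"Query (\d+) iteration (\d+) took ([\d.]+) ms")


def iteration_times_ms(log: str) -> list:
    """Extract the per-iteration times (in ms) from dfbench output."""
    return [float(m.group(3)) for m in ITERATION_RE.finditer(log)]


def mean(values: list) -> float:
    return sum(values) / len(values)


WITH_PUSHDOWN_LOG = """
Query 30 iteration 0 took 546.0 ms and returned 10 rows
Query 30 iteration 1 took 503.1 ms and returned 10 rows
Query 30 iteration 2 took 488.2 ms and returned 10 rows
Query 30 iteration 3 took 462.6 ms and returned 10 rows
Query 30 iteration 4 took 462.3 ms and returned 10 rows
"""

WITHOUT_PUSHDOWN_LOG = """
Query 30 iteration 0 took 305.7 ms and returned 10 rows
Query 30 iteration 1 took 289.1 ms and returned 10 rows
Query 30 iteration 2 took 287.7 ms and returned 10 rows
Query 30 iteration 3 took 266.3 ms and returned 10 rows
Query 30 iteration 4 took 268.3 ms and returned 10 rows
"""

with_avg = mean(iteration_times_ms(WITH_PUSHDOWN_LOG))
without_avg = mean(iteration_times_ms(WITHOUT_PUSHDOWN_LOG))
slowdown = with_avg / without_avg

print(f"with pushdown:    {with_avg:.2f} ms")
print(f"without pushdown: {without_avg:.2f} ms")
print(f"slowdown:         {slowdown:.2f}x")
```

On these numbers Q30 is roughly 1.7x slower with `--pushdown`, which is the kind of regression the new benchmark is meant to make visible.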

Are there any user-facing changes?

No, this is a development process change only.

@alamb alamb marked this pull request as draft July 9, 2025 21:02
@alamb alamb marked this pull request as ready for review July 9, 2025 21:39
@zhuqi-lucas zhuqi-lucas left a comment

LGTM, thank you @alamb !

This is very helpful for monitoring the performance of the pushdown case!

alamb commented Jul 10, 2025

I tested this benchmark with our filter pushdown work here, and I think it is useful:

#16711 (comment)

Thank you @zhuqi-lucas for the review.

Development

Successfully merging this pull request may close these issues.

Add a datafusion benchmark for filter_pushdown