[Data] Add TPCH Q20, 21, 22 benchmark scripts to nightly tests #62333
goutamvenkat-anyscale merged 3 commits into ray-project:master
Code Review
This pull request introduces Ray Data implementations for TPC-H queries 20, 21, and 22, along with their corresponding configurations in the nightly release test suite for scale factor 100. The review feedback highlights several performance optimization opportunities, specifically recommending against hardcoding low partition counts (like num_partitions=16) which can cause memory pressure at 100GB scale. Additionally, suggestions were made for Query 21 to push filters down before materialization and remove redundant filter steps to improve efficiency.
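The filter-pushdown suggestion for Query 21 can be illustrated with a small, library-free sketch. This is not the PR's Ray Data code: the row layout, the `scan_rows` generator, and the helper names are all hypothetical, and dates are simplified to integer day numbers. It only shows why applying a predicate such as `l_receiptdate > l_commitdate` before materializing leaves fewer rows held in memory.

```python
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, int]


def scan_rows(n: int) -> Iterator[Row]:
    """Lazily yield hypothetical lineitem-like rows (dates as day numbers)."""
    for i in range(n):
        yield {
            "l_orderkey": i,
            "l_commitdate": 10,
            # Receipt day cycles through 9, 10, 11, 12, so half the rows are late.
            "l_receiptdate": 10 + (i % 4) - 1,
        }


def is_late(row: Row) -> bool:
    """Predicate in the spirit of Q21: the item was received after its commit date."""
    return row["l_receiptdate"] > row["l_commitdate"]


def filter_after_materialize(n: int) -> Tuple[List[Row], int]:
    """Materialize every row first, then filter. Returns (late rows, rows held)."""
    materialized = list(scan_rows(n))
    return [r for r in materialized if is_late(r)], len(materialized)


def filter_before_materialize(n: int) -> Tuple[List[Row], int]:
    """Filter while scanning, then materialize. Returns (late rows, rows held)."""
    materialized = [r for r in scan_rows(n) if is_late(r)]
    return materialized, len(materialized)


if __name__ == "__main__":
    _, held_late_filter = filter_after_materialize(1000)
    _, held_early_filter = filter_before_materialize(1000)
    print(f"rows held, filter after materialize:  {held_late_filter}")
    print(f"rows held, filter before materialize: {held_early_filter}")
```

Both functions return the same late rows; the only difference is how many rows are held at the materialization step.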
    ps_forest = partsupp.join(
        forest_parts,
        join_type="left_semi",
        num_partitions=16,
Hardcoding num_partitions=16 is likely too low for Scale Factor 100 (100GB), where tables like lineitem and partsupp contain hundreds of millions of rows. This can lead to excessively large partitions (several GBs each), causing memory pressure or underutilization of the cluster. It is generally better to let Ray Data automatically determine the number of partitions or set it to a much higher value (e.g., 200+) for this scale.
It looks like this is a convention across the existing tests.
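To put the partition-count concern from this thread in rough numbers, here is a back-of-the-envelope sketch. It assumes the data is spread evenly across partitions and uses a hypothetical 100 GB join input; the real input to any single join depends on which tables and columns are read, so treat the figures as an upper-bound illustration rather than a measurement. `partition_size_gb` and `JOIN_INPUT_GB` are illustrative names, not part of Ray Data.

```python
def partition_size_gb(total_gb: float, num_partitions: int) -> float:
    """Average partition size, assuming data is spread evenly across partitions."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_gb / num_partitions


# Hypothetical input size for one join at scale factor 100 (an assumption).
JOIN_INPUT_GB = 100.0

if __name__ == "__main__":
    for n in (16, 200):
        size = partition_size_gb(JOIN_INPUT_GB, n)
        print(f"num_partitions={n}: ~{size:.2f} GB per partition")
```

Under these assumptions, 16 partitions average about 6.25 GB each, while 200 partitions average about 0.5 GB each.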
e9620ce to 752350c
Fixed at 44f1bbe.
Signed-off-by: ryankert01 <ryankert01@gmail.com>
44f1bbe to a884618
…roject#62333)

## Description
As title. All three queries follow the established patterns from the existing TPC-H benchmark suite

Signed-off-by: ryankert01 <ryankert01@gmail.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Description
As the title states. All three queries follow the established patterns of the existing TPC-H benchmark suite.