[Data] Add TPCH queries 7,8,9 for benchmarking#60662
bveeramani merged 30 commits into ray-project:master
Conversation
Code Review
This pull request adds TPC-H queries 5, 7, 8, and 9 for benchmarking purposes. The overall structure of the new query files is consistent. However, I've found several correctness issues where the implementations deviate significantly from the TPC-H specifications for queries 7, 8, and 9. These need to be addressed to ensure the benchmarks are valid. Additionally, there are opportunities to improve performance in queries 5 and 9 by optimizing the join logic. The configuration changes in the YAML file are appropriate.
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: ZTE Ray <dai.ping88@zte.com.cn>
… for improved clarity and consistency. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
@owenowenisme Please review the code. Looking forward to any suggestions.

Hi @daiping8, can you help me understand why you are adding these benchmarks?

Hi. This is a task assigned by the Ray Data team: https://docs.google.com/document/d/1OFFp2jMMnrCPiE0Gxdi0ronXGVqtDYDbUoS3fsNc54Q/edit?pli=1&tab=t.0
owenowenisme
left a comment
I think you're missing some tables in common.py? How about opening a PR first to add the name mapping?
FYI
=== region ===
Column names: ['column0', 'column1', 'column2', 'column3']
column0: int64
column1: string
column2: string
column3: string
=== supplier ===
Column names: ['column0', 'column1', 'column2', 'column3', 'column4', 'column5', 'column6', 'column7']
column0: int64
column1: string
column2: string
column3: int64
column4: string
column5: double
column6: string
column7: string
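As a sketch of what such a name mapping in common.py might look like (the actual names and structure there may differ), based on the schemas dumped above, with the trailing all-null column left unmapped:

```python
import pandas as pd

# Hypothetical sketch of the column-name mapping the reviewer is asking for.
# Maps the generic parquet column names to TPC-H column names; the trailing
# all-null column (region.column3, supplier.column7) is intentionally unmapped.
COLUMN_NAME_MAP = {
    "region": {
        "column0": "r_regionkey",
        "column1": "r_name",
        "column2": "r_comment",
    },
    "supplier": {
        "column0": "s_suppkey",
        "column1": "s_name",
        "column2": "s_address",
        "column3": "s_nationkey",
        "column4": "s_phone",
        "column5": "s_acctbal",
        "column6": "s_comment",
    },
}


def rename_tpch_columns(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """Rename generic parquet columns to TPC-H names; unmapped columns are kept."""
    return df.rename(columns=COLUMN_NAME_MAP[table_name])
```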
… nation, supplier, customer, orders, part, and partsupp Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
owenowenisme
left a comment
@daiping8 I ran the release tests and there are some failed tests, could you take a look?
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Release test passed:
```python
nation_region_pd = nation_region.to_pandas()[["n_nationkey", "n_name"]].copy()


def _join_supplier(batch: pd.DataFrame) -> pd.DataFrame:
    out = batch.merge(
        nation_region_pd,
        left_on="s_nationkey",
        right_on="n_nationkey",
        how="inner",
    )
    return out.rename(
        columns={
            "n_nationkey": "n_nationkey_supp",
            "n_name": "n_name_supp",
        }
    )


supplier_nation = supplier.map_batches(
    _join_supplier,
    batch_format="pandas",
)


def _join_customer(batch: pd.DataFrame) -> pd.DataFrame:
```
Why are we doing this?
This is a broadcast join: after filtering nation_region to ASIA, only about five rows remain, so the table is very small. Converting it to pandas and merging it into each batch via map_batches avoids shuffling the large supplier and customer tables, which reduces shuffle and network-transfer costs.
In earlier tests, using the Ray Data join resulted in an OOM error. I have added comments here and a TODO for future improvements.
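For illustration, the broadcast-join pattern can be sketched in plain pandas, with a list of DataFrames standing in for the batches that map_batches would feed the function (all values below are synthetic):

```python
import pandas as pd

# Small "broadcast" side: nations in the target region (only a few rows
# after filtering), cheap to replicate to every batch.
nation_region_pd = pd.DataFrame(
    {"n_nationkey": [8, 9], "n_name": ["INDIA", "INDONESIA"]}
)


def _join_supplier(batch: pd.DataFrame) -> pd.DataFrame:
    # Merge the tiny nation table into each supplier batch locally,
    # avoiding a distributed shuffle of the large supplier table.
    return batch.merge(
        nation_region_pd,
        left_on="s_nationkey",
        right_on="n_nationkey",
        how="inner",
    )


# Simulate two batches that map_batches would pass to the function.
batches = [
    pd.DataFrame({"s_suppkey": [1, 2], "s_nationkey": [8, 3]}),
    pd.DataFrame({"s_suppkey": [3], "s_nationkey": [9]}),
]
result = pd.concat([_join_supplier(b) for b in batches], ignore_index=True)
# Supplier 2 (nationkey 3) drops out; suppliers 1 and 3 survive the inner join.
```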
Totally understand where you're coming from @daiping8, but the purpose of these tests is to track the way people are expected to use Ray Data: we'd use join as the operation for joining, and then continuously improve it to perform better over time.
Does that make sense?
So what I'm going to ask you to do is the following:
- Please implement it using join (tuning it as necessary to make it work out of the box first).
- Then, if you see opportunities to improve joins themselves, please go ahead and implement them too.
…CH dataset loading. Update tpch_q5.py with comments on future optimizations for column selection and join operations. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
owenowenisme
left a comment
I see there's a lot of branching in the pipeline; I think the logic will be much more readable if we linearize the join chain. By moving away from a bushy DAG to a single-stream pipeline, we allow Ray to better orchestrate task execution and data flow. This ensures we don't hold multiple large intermediate datasets in memory simultaneously, which reduces the risk of object store spilling.
```python
supplier_nation = supplier.join(
    asia_for_supplier,
    num_partitions=16,
    join_type="inner",
    on=("s_nationkey",),
    right_on=("n_nationkey",),
)
supplier_nation = (
    supplier_nation.rename_columns({"n_name": "n_name_supp"})
    .select_columns(["s_suppkey", "s_nationkey", "n_name_supp"])
)

# customer ⋈ asia_for_customer (Ray join), get customer nations
customer_nation = customer.join(
    asia_for_customer,
    num_partitions=16,
    join_type="inner",
    on=("c_nationkey",),
    right_on=("n_nationkey",),
)
customer_nation = (
    customer_nation.rename_columns({"n_name": "n_name_cust"})
    .select_columns(["c_custkey", "c_nationkey", "n_name_cust"])
)
```
I think we can simplify this. For a 6-table join, we only need 5 join operations. By joining the tables in a linear chain (Region → Nation → Customer → Orders → Lineitem → Supplier), we eliminate the redundant branch where nation is joined twice. This reduces the number of global shuffles, keeps the intermediate dataset smaller by filtering for the target region early, and allows us to implement the c_nationkey = s_nationkey constraint as a local filter rather than an expensive distributed join.
Also, the logic is hard to follow here.
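The linear chain described above can be sketched in plain pandas on tiny synthetic fragments (all table contents below are made up; the real pipeline would use Ray Data's Dataset.join at each step):

```python
import pandas as pd

# Tiny synthetic TPC-H fragments, illustrative only.
region = pd.DataFrame({"r_regionkey": [2, 3], "r_name": ["ASIA", "EUROPE"]})
nation = pd.DataFrame({"n_nationkey": [8, 9, 10], "n_regionkey": [2, 2, 3],
                       "n_name": ["INDIA", "INDONESIA", "IRAN"]})
customer = pd.DataFrame({"c_custkey": [1, 2], "c_nationkey": [8, 10]})
orders = pd.DataFrame({"o_orderkey": [100, 101], "o_custkey": [1, 2]})
lineitem = pd.DataFrame({"l_orderkey": [100, 101], "l_suppkey": [5, 6],
                         "l_extendedprice": [1000.0, 500.0],
                         "l_discount": [0.1, 0.0]})
supplier = pd.DataFrame({"s_suppkey": [5, 6], "s_nationkey": [8, 9]})

# Filter for the target region first to keep intermediates small, then join
# in one linear chain: 5 joins for 6 tables, no branch that reuses nation.
df = (
    region[region["r_name"] == "ASIA"]
    .merge(nation, left_on="r_regionkey", right_on="n_regionkey")
    .merge(customer, left_on="n_nationkey", right_on="c_nationkey")
    .merge(orders, left_on="c_custkey", right_on="o_custkey")
    .merge(lineitem, left_on="o_orderkey", right_on="l_orderkey")
    .merge(supplier, left_on="l_suppkey", right_on="s_suppkey")
)
# c_nationkey = s_nationkey becomes a cheap local filter, not another join.
df = df[df["c_nationkey"] == df["s_nationkey"]]
revenue = (df["l_extendedprice"] * (1 - df["l_discount"])).groupby(df["n_name"]).sum()
```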
Also, it would be easier to understand if we commented the SQL query in the file like #61305; would you mind adding it?
…, and Q9 Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
| "region": { | ||
| "column0": "r_regionkey", | ||
| "column1": "r_name", | ||
| "column2": "r_comment", |
I think there are 4 columns in region?
In fact, all the data in the fourth column of the parquet file is None. The data in the first three columns has actual meaning and matches the definitions.
schema: column0: int64
column1: string
column2: string
column3: string
column0 column1 column2 column3
0 AFRICA lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to None
1 AMERICA hs use ironic, even requests. s None
2 ASIA ges. thinly even pinto beans ca None
3 EUROPE ly final courts cajole furiously final excuse None
4 MIDDLE EAST uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl None
Got it, thanks for clarifying.
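One way to confirm that the trailing column really carries no data, before leaving it out of the mapping, is a quick all-null check. A minimal pandas sketch (the in-memory frame below stands in for the real parquet table, whose path is omitted here):

```python
import pandas as pd

# Stand-in for pd.read_parquet(<region parquet path>): the first three
# columns carry data, the trailing column is entirely null.
df = pd.DataFrame({
    "column0": [0, 1, 2],
    "column1": ["AFRICA", "AMERICA", "ASIA"],
    "column2": ["...", "...", "..."],
    "column3": [None, None, None],
})

# True only if every value in the column is missing.
all_null = df["column3"].isna().all()
```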
…he purpose of each join. Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
owenowenisme
left a comment
Release tests passed for Q7, Q8, and Q9.
## Description

Adding Query Q7, Q8, Q9 for TPCH tests

There are some issues with TPCH Q5. For details, see #61354.

---

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Signed-off-by: ZTE Ray <dai.ping88@zte.com.cn>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>