[Data]Improve hash partition for columns contains Struct/list/map #62550

Closed
laysfire wants to merge 3 commits into ray-project:master from laysfire:improve_hash_partition_performance

Conversation

@laysfire
Contributor

@laysfire laysfire commented Apr 13, 2026

Description

This PR speeds up hash partitioning by moving the operations that retrieve table columns out of the per-row loop.
Use the following script to verify:

import time
import pyarrow as pa
from ray.data._internal.arrow_ops.transform_pyarrow import hash_partition

idx = list(range(50000000))
ints = list(range(50000000))
t = pa.Table.from_pydict(
   {
       "idx": pa.array(idx),
       "ints": pa.array(ints),
   }
)

start = time.time()
hash_partition(t, hash_cols=["idx", "ints"], num_partitions=10)
end = time.time()
print(end - start)

The test result is:

| CPU spec | Code version | Time consumed |
| --- | --- | --- |
| Apple M4 | original | 60s |
| Apple M4 | optimized | 15s |

Signed-off-by: yifan.xie <xyfabcd@163.com>
@laysfire laysfire requested a review from a team as a code owner April 13, 2026 03:01
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the hash partitioning logic in transform_pyarrow.py by replacing manual column indexing with a zip-based iteration. Feedback suggests using enumerate for a more idiomatic Python implementation and highlights a potential behavioral discrepancy when partitioning a table with no columns.

Comment on lines +84 to +87
    i = 0
    for _tuple in zip(*table.columns):
        partitions[i] = hash(_tuple) % num_partitions
        i += 1
Contributor

medium

Using enumerate is more idiomatic in Python than manually managing an index variable.

Also, note that zip(*table.columns) will result in an empty iterator if the table has no columns, causing the loop to be skipped entirely. In this case, partitions will remain all zeros. This differs from the previous implementation which would fill the array with hash(()) % num_partitions. While this edge case (partitioning on zero columns) is rare, you might want to handle it explicitly if exact backward compatibility is required.

Suggested change
-    i = 0
-    for _tuple in zip(*table.columns):
-        partitions[i] = hash(_tuple) % num_partitions
-        i += 1
+    for i, _tuple in enumerate(zip(*table.columns)):
+        partitions[i] = hash(_tuple) % num_partitions
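
The zero-column edge case mentioned in this review can be reproduced without Ray or PyArrow. The following is a minimal sketch in plain Python, assuming the previous implementation built one tuple per row by index; `num_rows`, `zipped`, and `indexed` are illustrative names only.

```python
num_rows = 4
num_partitions = 10
columns = []  # a table projected onto zero hash columns

# zip-based loop: zip() with no iterables yields nothing, so the body never
# runs and every entry keeps its initial value of 0.
zipped = [0] * num_rows
for i, row in enumerate(zip(*columns)):
    zipped[i] = hash(row) % num_partitions

# Index-based loop: each row becomes the empty tuple, so every entry is
# hash(()) % num_partitions.
indexed = [
    hash(tuple(col[i] for col in columns)) % num_partitions
    for i in range(num_rows)
]

print(zipped, indexed)
```

Whether the two results differ depends on the value of `hash(())` in the running interpreter, so an explicit branch for the empty case is the safer way to keep the old behavior.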


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d1c5c4d.

Comment thread python/ray/data/_internal/arrow_ops/transform_pyarrow.py Outdated
Signed-off-by: yifan.xie <xyfabcd@163.com>
@ray-gardener ray-gardener Bot added data Ray Data-related issues community-contribution Contributed by the community labels Apr 13, 2026
Signed-off-by: laysfire <xyfabcd@163.com>
@laysfire
Contributor Author

Since PR #62757 proposes a better solution, I am closing this PR.

@laysfire laysfire closed this Apr 20, 2026
@owenowenisme
Member

Sorry I didn't see your PR, thanks for your contribution anyway! If you need other stuff to work on you can DM me on ray slack.

@laysfire laysfire changed the title [Data]Improve hash partition performance [Data]Improve hash partition for columns contains Struct/list/map May 6, 2026
Labels

community-contribution Contributed by the community data Ray Data-related issues

2 participants