Improve hash partition for columns contains Struct/list/map #63152

Open: laysfire wants to merge 3 commits into ray-project:master from laysfire:improve_hash_partition_performance

@laysfire laysfire commented May 6, 2026

Description

This PR improves hash partition performance for tables that contain types pandas can't handle (such as struct, list, and map columns) by moving the table column retrieval operations out of the loop.
Use the following script to verify:

import time

import pyarrow as pa
from ray.data._internal.arrow_ops.transform_pyarrow import hash_partition

NUM_ROWS = 50_000_000

# One plain integer column and one list column (a type pandas can't handle).
idx = list(range(NUM_ROWS))
ints = [[i] for i in range(NUM_ROWS)]
t = pa.Table.from_pydict(
    {
        "idx": pa.array(idx),
        "ints": pa.array(ints),
    }
)

# Time a single hash_partition call over both columns.
start = time.perf_counter()
hash_partition(t, hash_cols=["idx", "ints"], num_partitions=10)
end = time.perf_counter()
print(end - start)

The test results are:

| CPU spec | Code version | Time consumed |
| --- | --- | --- |
| Apple M4 | original | 66s |
| Apple M4 | optimized | 18s |
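
To illustrate why moving column retrieval out of the loop can help, here is a minimal sketch in plain Python. It is not the PR's code: `CountingTable`, `to_hashable`, and both `assign_*` functions are hypothetical stand-ins, and Python's built-in `hash` is used only for illustration. The stand-in table counts how often a column is fetched, so the difference between per-row and one-time retrieval is visible without PyArrow or Ray installed.

```python
class CountingTable:
    """Hypothetical stand-in for a columnar table; not PyArrow's or Ray's API."""

    def __init__(self, data):
        self._data = data   # dict: column name -> list of values
        self.lookups = 0    # number of times a column was fetched

    def column(self, name):
        self.lookups += 1
        return self._data[name]

    @property
    def num_rows(self):
        return len(next(iter(self._data.values())))


def to_hashable(value):
    """Recursively turn lists/dicts into tuples so hash() accepts them."""
    if isinstance(value, list):
        return tuple(to_hashable(v) for v in value)
    if isinstance(value, dict):
        return tuple(sorted((k, to_hashable(v)) for k, v in value.items()))
    return value


def assign_per_row_lookup(table, hash_cols, num_partitions):
    # Fetches every column again for every row: rows * cols lookups.
    out = []
    for i in range(table.num_rows):
        key = tuple(to_hashable(table.column(c)[i]) for c in hash_cols)
        out.append(hash(key) % num_partitions)
    return out


def assign_hoisted_lookup(table, hash_cols, num_partitions):
    # Fetches each column once, before the loop: cols lookups in total.
    columns = [table.column(c) for c in hash_cols]
    out = []
    for row in zip(*columns):
        key = tuple(to_hashable(v) for v in row)
        out.append(hash(key) % num_partitions)
    return out


data = {"idx": list(range(1000)), "ints": [[i] for i in range(1000)]}

slow = CountingTable(data)
fast = CountingTable(data)
slow_result = assign_per_row_lookup(slow, ["idx", "ints"], 10)
fast_result = assign_hoisted_lookup(fast, ["idx", "ints"], 10)

print(slow.lookups, fast.lookups)   # 2000 2
print(slow_result == fast_result)   # True
```

Both variants assign every row to the same partition; only the number of column fetches changes. How much time this saves on a real table depends on how expensive each fetch is, which this sketch does not model.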

Related issues

Closes #62550. The earlier PR couldn't be reopened after a force push, so this new PR was created.

Signed-off-by: yifan.xie <xyfabcd@163.com>
@laysfire laysfire requested a review from a team as a code owner May 6, 2026 03:45

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the row-by-row hashing logic in _hash_partition by replacing manual column indexing with a more efficient enumerate(zip(*table.columns)) approach. This change improves code readability and potentially performance when processing unhashable PyArrow columns. I have no feedback to provide as there are no review comments.
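
The `enumerate(zip(*table.columns))` pattern mentioned in the review can be sketched on plain Python lists as follows. This is an illustration under stated assumptions, not Ray's implementation: `partition_rows` is a hypothetical helper, the columns are ordinary lists rather than PyArrow arrays, and `repr` plus CRC32 stands in for whatever hash the real code uses.

```python
import zlib
from collections import defaultdict


def partition_rows(columns, num_partitions):
    """Group row indices by partition, walking rows via enumerate(zip(*columns))."""
    partitions = defaultdict(list)
    # zip(*columns) transposes column-major data into one tuple per row.
    for row_idx, row in enumerate(zip(*columns)):
        # repr() + CRC32 is an illustrative stable hash, not Ray's hash function.
        digest = zlib.crc32(repr(row).encode("utf-8"))
        partitions[digest % num_partitions].append(row_idx)
    return dict(partitions)


columns = [
    [1, 2, 1, 3],                              # integer column
    [[1], [2], [1], [3]],                      # list column
    [{"a": 1}, {"a": 2}, {"a": 1}, {"a": 3}],  # struct-like column
]

parts = partition_rows(columns, num_partitions=4)
print(parts)
```

Rows 0 and 2 are identical, so they always land in the same partition, which is the property a hash partition needs to preserve.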

laysfire commented May 6, 2026

@owenowenisme Hi, could you help review this?

@ray-gardener ray-gardener Bot added the "data" (Ray Data-related issues) and "community-contribution" (Contributed by the community) labels May 6, 2026
@owenowenisme owenowenisme self-assigned this May 7, 2026

@owenowenisme owenowenisme left a comment

Great improvement! Could you add comments explaining this improvement in the code?

@owenowenisme owenowenisme added the "go" (add ONLY when ready to merge, run all tests) label May 16, 2026
Signed-off-by: yifan.xie <xyfabcd@163.com>