[Data]Improve hash partition for columns contains Struct/list/map#62550
[Data]Improve hash partition for columns contains Struct/list/map#62550laysfire wants to merge 3 commits into
Conversation
Signed-off-by: yifan.xie <xyfabcd@163.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the hash partitioning logic in transform_pyarrow.py by replacing manual column indexing with a zip-based iteration. Feedback suggests using enumerate for a more idiomatic Python implementation and highlights a potential behavioral discrepancy when partitioning a table with no columns.
| i = 0 | ||
| for _tuple in zip(*table.columns): | ||
| partitions[i] = hash(_tuple) % num_partitions | ||
| i += 1 |
There was a problem hiding this comment.
Using enumerate is more idiomatic in Python than manually managing an index variable.
Also, note that zip(*table.columns) will result in an empty iterator if the table has no columns, causing the loop to be skipped entirely. In this case, partitions will remain all zeros. This differs from the previous implementation which would fill the array with hash(()) % num_partitions. While this edge case (partitioning on zero columns) is rare, you might want to handle it explicitly if exact backward compatibility is required.
| i = 0 | |
| for _tuple in zip(*table.columns): | |
| partitions[i] = hash(_tuple) % num_partitions | |
| i += 1 | |
| for i, _tuple in enumerate(zip(*table.columns)): | |
| partitions[i] = hash(_tuple) % num_partitions |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d1c5c4d. Configure here.
Signed-off-by: yifan.xie <xyfabcd@163.com>
Signed-off-by: laysfire <xyfabcd@163.com>
|
Since this pr #62757 propose a better solution, close this pr. |
|
Sorry I didn't see your PR, thanks for your contribution anyway! If you need other stuff to work on you can DM me on ray slack. |

Description
To improve the performance of hash partitioning, operations that retrieve table columns need to be moved out of the loop.
Use the following script to verify:
The test result is: