[data] Defer hash-shuffle schema broadcast to first row-bearing block#63136
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the hash shuffle operator to defer schema broadcasting until the first row-bearing block is received, preventing issues where zero-row marker blocks cause downstream join failures. A regression test was added to verify this fix. Feedback suggests refining the broadcast logic to handle cases where a dataset is entirely empty but contains a valid schema by also checking for the presence of columns in the block metadata.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6d27c8f. Configure here.
7118771 to
242f1ac
Compare
Dataset.join()'s hash shuffle used the first input block of each side to broadcast the input schema to every aggregator. When the upstream is itself a hash shuffle (e.g. groupby.map_groups whose finalize emits empty marker blocks for empty groups), the first input block can be a zero-row, zero-column block. Using it for the broadcast caused _create_empty_table to propagate column-less blocks to all aggregators, after which any aggregator whose partition was empty on a side ended up with only schema-less blocks. Downstream join finalize then failed to resolve the join key, raising e.g. `ArrowInvalid: No match or multiple matches for key field reference FieldRef.Name(<key>) on left side of the join`. Defer the broadcast to the first block that actually carries a schema - either by having rows (in which case its schema is intact) or by being described by a RefBundle whose schema field is non-empty (e.g. an empty-but-typed table from `from_arrow`). The streaming executor strips per-block schema off `BlockMetadata` and hoists it to `RefBundle.schema` in DataOpTask.on_data_ready, so the bundle is the right place to read from. Adds regression tests for the `from_pandas -> groupby.map_groups -> join` shape (where the upstream emits column-less marker blocks) and for the `from_arrow(empty_typed) U from_arrow(populated) -> join` shape (where the first block has zero rows but a real schema). Signed-off-by: Curtis Howard <curtis.james.howard@gmail.com>
242f1ac to
51e0af6
Compare
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |

Why are these changes needed?
The hash shuffle used by
Dataset.join()broadcasts the input schema to aggregators. When the upstream step is itself a hash shuffle (e.g.groupby.map_groups) it can emit empty / zero-column blocks. If an empty block is the first to reach the join's hash shuffle, the empty schema is broadcast to the aggregators. Aggregators end up with schema-less blocks, andpa.Table.joinfails resolving the join key:This is reproducible on the latest stable (Ray 2.55.1) with a
from_pandas → groupby.map_groups → joinwheredefault_hash_shuffle_parallelismis high enough relative to the distinct key cardinality to leave partitions empty.The fix defers the broadcast to the first row-bearing block. Zero-row marker blocks fall through with
send_empty_blocks=Falseand are filtered as before; a row-bearing block with a real schema will arrive whenever the dataset is non-empty, and the broadcast then propagates the correct schema to every aggregator.Minimal repro (fails on master, passes with this PR)
Related issue number
N/A.
Checks
-sflag, i.e.,git commit -s) in this PR.scripts/format.shto lint the changes in this PR.doc/source/tune/api/under the corresponding.rstfile.