[Data] Add SchedulingHints sibling field on BlockEntry / wire envelope #64517

Open
goutamvenkat-anyscale wants to merge 1 commit into ray-project:master from goutamvenkat-anyscale:scheduling-hints

@goutamvenkat-anyscale

Contributor

Adds an opt-in, producer-supplied forecast about what the next consumer of a block will need (memory today; cpu/gpu/locality/strategy in future additions on the same dataclass). Designed to be layered on top of the BlockEntry foundation without disturbing existing call sites.

Surface:

  • New module ray.data._internal.scheduling_hints with:
    • SchedulingHints frozen dataclass (memory: Optional[int]; additive)
    • stage_scheduling_hints(hints) — producer helper; writes to TaskContext.next_block_scheduling_hints before each yield.
    • stage_memory_hint(memory) — single-axis convenience.
  • BlockEntry.scheduling_hints: Optional[SchedulingHints] — sibling field to metadata, defaulting to None.
  • BlockMetadataWithSchema.scheduling_hints — wire-envelope field that carries the hint from worker to driver alongside the per-block metadata. from_metadata / from_block thread it through.
  • TaskContext.next_block_scheduling_hints + consume_next_block_scheduling_hints() — staging slot consumed and cleared by _map_task after each yield, so a stale value can't silently mis-tag later blocks.
  • RefBundle.scheduling_hints accessor — parallel list to block_refs / metadata, returns the per-block forecasts.
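
The staging contract in the list above can be sketched in isolation. This is a minimal stand-in, not the PR's code: the class and helper names mirror the description, but the `memory=None` default, the unit (bytes), and passing `ctx` explicitly to the helpers are assumptions made so the example runs without Ray (the real helpers are described as taking only `hints` / `memory`).

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class SchedulingHints:
    """Stand-in for the PR's dataclass: a forecast for the next consumer."""

    memory: Optional[int] = None  # assumed to be bytes; default is an assumption


class TaskContext:
    """Minimal stand-in that holds only the staging slot."""

    def __init__(self) -> None:
        self.next_block_scheduling_hints: Optional[SchedulingHints] = None

    def consume_next_block_scheduling_hints(self) -> Optional[SchedulingHints]:
        # Return the staged value and clear the slot, so a hint staged for
        # one block cannot leak onto a later block.
        hints = self.next_block_scheduling_hints
        self.next_block_scheduling_hints = None
        return hints


def stage_scheduling_hints(ctx: TaskContext, hints: SchedulingHints) -> None:
    # Producer helper: call before yielding the block the hint describes.
    # `ctx` is passed explicitly here only to keep the sketch self-contained.
    ctx.next_block_scheduling_hints = hints


def stage_memory_hint(ctx: TaskContext, memory: int) -> None:
    # Single-axis convenience wrapper around stage_scheduling_hints.
    stage_scheduling_hints(ctx, SchedulingHints(memory=memory))
```

Consuming twice in a row returns the staged hint once and then `None`, which is the property that keeps a stale value from tagging later blocks.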

Plumbing:

  • _map_task reads the staged hints, attaches them to the BlockMetadataWithSchema it pickles per yield.
  • PhysicalOperator driver-side bundle assembly lifts hints from BMWS into BlockEntry.scheduling_hints so consumers see them via the bundle's accessor.
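
The flow described above (stage, consume per yield, carry on the envelope, lift into the entry) can be illustrated with a toy pipeline. Everything here is hypothetical: `Envelope`, `Entry`, `Slot`, `producer`, `map_task`, and `assemble` are stand-ins for `BlockMetadataWithSchema`, `BlockEntry`, the `TaskContext` slot, and the worker and driver code paths, and the pickling step between worker and driver is omitted.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class SchedulingHints:
    memory: Optional[int] = None


@dataclass
class Envelope:
    """Hypothetical stand-in for BlockMetadataWithSchema."""

    num_rows: int
    scheduling_hints: Optional[SchedulingHints] = None


@dataclass
class Entry:
    """Hypothetical stand-in for BlockEntry."""

    block: Any
    num_rows: int
    scheduling_hints: Optional[SchedulingHints] = None


class Slot:
    """Hypothetical stand-in for the TaskContext staging slot."""

    def __init__(self) -> None:
        self.value: Optional[SchedulingHints] = None

    def consume(self) -> Optional[SchedulingHints]:
        hints, self.value = self.value, None
        return hints


def producer(slot: Slot) -> Iterator[list]:
    # Stages a hint for the first block only; the second block is unhinted.
    slot.value = SchedulingHints(memory=2048)
    yield [1, 2, 3]
    yield [4]


def map_task(slot: Slot) -> Iterator[Tuple[list, Envelope]]:
    # Worker side: consume (and clear) the slot once per yielded block and
    # attach the result to that block's envelope.
    for block in producer(slot):
        yield block, Envelope(num_rows=len(block), scheduling_hints=slot.consume())


def assemble(outputs: Iterator[Tuple[list, Envelope]]) -> List[Entry]:
    # Driver side: lift the hint from the envelope into the per-block entry.
    return [Entry(b, env.num_rows, env.scheduling_hints) for b, env in outputs]


entries = assemble(map_task(Slot()))
# Parallel list of per-block forecasts, in the spirit of the bundle accessor.
hints_per_block = [e.scheduling_hints for e in entries]
```

In this sketch only the first block carries a hint; the second block's entry holds `None` because the slot was cleared after the first yield.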

This PR ships the infra only; no operator currently stages or reads hints. A follow-on PR will wire up the Download operator as both producer (file-size totals → memory forecast) and consumer (bundle hint sum → per-task memory resource).
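
As a rough illustration of the arithmetic the follow-on describes, the two directions might look like the sketch below. Both function names are hypothetical, and two policies are assumptions rather than anything stated in this PR: using the raw file-size total as the forecast, and ignoring unhinted blocks when summing.

```python
from typing import Optional, Sequence


def forecast_memory(file_sizes: Sequence[int]) -> int:
    # Producer side (assumed policy): forecast = total bytes to download.
    return sum(file_sizes)


def bundle_memory_request(hints: Sequence[Optional[int]]) -> Optional[int]:
    # Consumer side (assumed policy): sum the per-block memory hints and
    # skip blocks without one; return None if no block carried a hint.
    known = [h for h in hints if h is not None]
    return sum(known) if known else None
```

For example, `forecast_memory([100, 250])` gives 350, and `bundle_memory_request([350, None, 50])` gives 400.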

Tests:
- `test_scheduling_hints.py` covers dataclass behavior, staging
  helpers, TaskContext consume/clear, and BMWS pickle round-trip.
- `test_ref_bundle.py` gains BlockEntry hint-field tests and the
  parallel-list accessor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner July 2, 2026 19:27
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label Jul 2, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the infrastructure for pre-scheduling hints, allowing producers to stage prospective resource forecasts (such as memory requirements) for downstream consumer tasks. The changes include adding a new SchedulingHints dataclass, updating TaskContext to stage and consume these hints, carrying them on the wire envelope via BlockMetadataWithSchema and BlockEntry, and integrating them into the map operator's block yielding process. Comprehensive unit tests have also been added to verify the staging, consumption, and serialization of these hints. I have no feedback to provide.
