[Data] Add scheduling-loop max metric to DatasetStatsSummary by xinyuangui2 · Pull Request #63345 · ray-project/ray

xinyuangui2 · 2026-05-14T17:14:56Z

Description

Adds two new fields to DatasetStatsSummary:

| Field | Source | Meaning |
| --- | --- | --- |
| `streaming_exec_schedule_s` | `Timer.get()` (unchanged) | total wall-clock time across scheduler iterations |
| `streaming_exec_schedule_avg_s` (new) | `Timer.avg()` | per-iteration average |
| `streaming_exec_schedule_max_s` (new) | `Timer.max()` | per-iteration max |

The existing streaming_exec_schedule_s keeps its total-duration meaning so runtime_metrics() (which divides it by total_wall_time to compute a percentage) continues to produce a correct breakdown. Per-iteration values are exposed under separate field names.
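To illustrate why the total has to stay in `streaming_exec_schedule_s`, here is a minimal sketch of a percentage-of-wall-time breakdown. The helper `schedule_share` and the sample durations are hypothetical and are not taken from Ray's `runtime_metrics()`; they only mirror the division described above.

```python
def schedule_share(schedule_s: float, total_wall_time_s: float) -> float:
    """Return the share of wall time spent scheduling, as a percentage.

    Hypothetical helper: it mirrors the "divide by total_wall_time" step
    described in the PR, not Ray's actual implementation.
    """
    if total_wall_time_s <= 0:
        return 0.0
    return 100.0 * schedule_s / total_wall_time_s


# Made-up per-iteration scheduler durations, in seconds.
iteration_durations_s = [0.25, 0.25, 0.5, 1.0]
total_wall_time_s = 8.0

total_schedule_s = sum(iteration_durations_s)
avg_schedule_s = total_schedule_s / len(iteration_durations_s)

# Dividing the total by wall time gives a meaningful percentage.
correct_pct = schedule_share(total_schedule_s, total_wall_time_s)

# Dividing the per-iteration average by wall time understates the share.
wrong_pct = schedule_share(avg_schedule_s, total_wall_time_s)

print(f"total-based: {correct_pct}%  average-based: {wrong_pct}%")
```

With these numbers the total-based share is 25% while the average-based value is 6.25%, which is why the per-iteration quantities live under separate field names.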

Memory stays O(1) per Timer — no sample buffer; _max is already tracked.
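The constant-memory claim can be sketched as follows. `SketchTimer` is a hypothetical stand-in, not Ray's `Timer`: the attribute names and the `add` method are assumptions, and only the `get()` / `avg()` / `max()` names and the empty-timer `inf` average come from this PR's discussion.

```python
class SketchTimer:
    """Hypothetical constant-memory timer; not Ray's Timer class."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0
        self._max = 0.0

    def add(self, seconds: float) -> None:
        # Fold the sample into running aggregates; no sample is stored.
        self._total += seconds
        self._count += 1
        if seconds > self._max:
            self._max = seconds

    def get(self) -> float:
        # Total time across all recorded samples.
        return self._total

    def max(self) -> float:
        # Largest single sample seen so far (0 when empty).
        return self._max

    def avg(self) -> float:
        # Assumed to match the empty-timer behavior discussed in this PR:
        # an average over zero samples is reported as infinity.
        if self._count == 0:
            return float("inf")
        return self._total / self._count


timer = SketchTimer()
for sample_s in (0.5, 1.5, 1.0):
    timer.add(sample_s)

print(timer.get(), timer.avg(), timer.max())
```

Because only three scalars are kept, total, average, and max cost the same memory regardless of how many scheduler iterations run; percentiles would need either a sample buffer or an approximate sketch structure.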

Tested using https://buildkite.com/ray-project/release/builds/92867#019e28c9-c8ad-4fb3-9776-c89f52f93200

gemini-code-assist

Code Review

This pull request introduces percentile and average tracking for dataset execution statistics, specifically for the streaming scheduling loop duration, and adds optional py-spy profiling support to the worker scaling benchmark. The code review identified several critical issues: returning infinity instead of zero when no samples exist will cause crashes in formatting utilities due to overflow errors; storing all timing samples in an unbounded list creates a memory leak risk; and changing the scheduling duration metric from a total to an average breaks existing wall-time breakdown calculations. Additionally, the reviewer noted that sorting the sample list on every percentile calculation is inefficient and suggested using more performant data structures or caching.

Adds two new fields to ``DatasetStatsSummary``:

- ``streaming_exec_schedule_avg_s``: per-iteration average (sourced from ``Timer.avg()``)
- ``streaming_exec_schedule_max_s``: per-iteration max (sourced from the already-tracked ``Timer.max()``)

``streaming_exec_schedule_s`` keeps its existing meaning (total wall-clock time across all scheduler iterations) so the ``runtime_metrics()`` breakdown, which divides this value by total_wall_time to compute a percentage, remains correct. Memory stays O(1) per Timer; no sample buffer.

Release-test helper ``collect_dataset_stats`` is updated to surface the per-iteration values as ``avg_/max_scheduling_loop_duration_s``.

Zero-iteration safety: the previous ``if self.streaming_exec_schedule_s`` guard at the build site was a dead check (``Timer()`` is always truthy). With no scheduler iterations recorded (e.g. non-streaming execution), ``Timer.avg()`` returns ``float("inf")``, which would break JSON serialization in release-test output and produce nonsense in ``runtime_metrics()``. Replaced with an explicit ``_total_count > 0`` check that returns 0 for all three fields when no samples are present.

Rationale: total scheduler time scales with run duration and is hard to compare across runs of different lengths. The new avg/max fields give per-iteration quantities that directly reflect scheduler efficiency, without breaking the existing total-based breakdown.

Signed-off-by: xgui <xgui@anyscale.com>
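The JSON concern in the commit message above can be reproduced with the standard library alone. This is a sketch, not the release-test code: the key name is borrowed from the ``avg_/max_scheduling_loop_duration_s`` naming mentioned above, and `json_safe` is a hypothetical helper.

```python
import json
import math

# Stand-in for an average computed over zero samples.
empty_avg = float("inf")

# By default, json.dumps emits the token Infinity, which is not valid JSON
# and is rejected by strict parsers.
encoded = json.dumps({"avg_scheduling_loop_duration_s": empty_avg})

# With allow_nan=False, the encoder refuses non-finite floats outright.
try:
    json.dumps({"avg_scheduling_loop_duration_s": empty_avg}, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc


def json_safe(value: float, default: float = 0.0) -> float:
    """Hypothetical helper: replace inf/nan with a finite default."""
    return value if math.isfinite(value) else default


safe_encoded = json.dumps(
    {"avg_scheduling_loop_duration_s": json_safe(empty_avg)},
    allow_nan=False,
)
print(encoded, strict_error, safe_encoded, sep="\n")
```

Whether to map the empty case to 0 or keep the `inf` signal is the design question the follow-up commits in this PR go back and forth on.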

Three small follow-ups to the review thread, none requiring a guard or private-attribute access at the call site:

1. ``Timer.avg()`` now returns 0 when no samples have been recorded (was ``float("inf")``). Matches the zero-sample semantics of ``Timer.get()`` and ``Timer.max()``, which both return 0. The previous ``inf`` return broke JSON serialization downstream and forced consumers to special-case the empty path.
2. ``StreamingExecutor._generate_stats`` now assigns ``Timer()`` (an empty Timer) instead of ``None`` when ``_initial_stats`` is falsy. The type annotation on the field is ``Timer``, not ``Optional[Timer]``; this makes runtime match the annotation.
3. The ``DatasetStats.to_summary`` call site drops the ``schedule_timer._total_count > 0`` guard. It now just calls ``.get()`` / ``.avg()`` / ``.max()`` directly: the Timer is always present (per #2) and the methods all return 0 for an empty Timer (per #1).

The pickle PR's reviewers asked for both:

- "not access ``schedule_timer._total_count``"
- "make ``streaming_exec_schedule_s`` always non-None"

This commit does both, plus the ``Timer.avg()`` consistency fix that makes the simplified call site safe.

Signed-off-by: xgui <xgui@anyscale.com>

Reverts the global ``Timer.avg() -> 0`` change from the prior commit. The previous behavior of returning ``float("inf")`` for an empty Timer is retained as the default to preserve the "undefined" signal for display callers (``fmt(timer.avg())`` renders it as ``"inf s"``).

The summary build site that needs a JSON-safe / arithmetic-safe value opts out explicitly:

    schedule_timer.avg(default=0.0)

Only the schedule-Timer call site uses the override. All other ``Timer.avg()`` callers (the iterator-side metrics) keep the existing ``inf`` semantics unchanged.

Signed-off-by: xgui <xgui@anyscale.com>
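The opt-out pattern described in this commit message (which a later commit in this PR reverts) can be sketched as below. `DefaultAvgTimer` is a hypothetical class; only the ``avg(default=0.0)`` call shape and the `inf` default come from the message above.

```python
class DefaultAvgTimer:
    """Hypothetical timer whose empty-average value is caller-selectable."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def add(self, seconds: float) -> None:
        self._total += seconds
        self._count += 1

    def avg(self, default: float = float("inf")) -> float:
        # With no samples, return the caller's default. Display callers keep
        # the "undefined" inf signal; serialization callers can pass 0.0.
        if self._count == 0:
            return default
        return self._total / self._count


schedule_timer = DefaultAvgTimer()
display_value = schedule_timer.avg()            # inf: "undefined" signal
summary_value = schedule_timer.avg(default=0.0)  # 0.0: JSON-safe

schedule_timer.add(2.0)
schedule_timer.add(4.0)
populated_value = schedule_timer.avg(default=0.0)  # default ignored once samples exist
print(display_value, summary_value, populated_value)
```

The keyword keeps existing callers untouched while letting one call site choose a finite value, at the cost of widening the method's signature.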

Reverts the ``default=`` kwarg added in the previous commit; call ``avg()`` directly with its existing signature. ``StreamingExecutor._generate_stats`` always assigns a ``Timer``, and downstream consumers of ``streaming_exec_schedule_avg_s`` can interpret the ``float('inf')`` empty-Timer signal the same way they already do for the other ``Timer.avg()`` call sites.

Signed-off-by: xgui <xgui@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 29ba77e.

xinyuangui2 requested a review from a team as a code owner May 14, 2026 17:14

gemini-code-assist Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/stats.py Outdated

Comment thread python/ray/data/_internal/stats.py Outdated

Comment thread python/ray/data/_internal/stats.py Outdated

Comment thread python/ray/data/_internal/stats.py Outdated

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 7a5ccc9 to 622cafb Compare May 14, 2026 17:17

cursor Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/stats.py Outdated

Comment thread python/ray/data/_internal/stats.py Outdated

Comment thread python/ray/data/_internal/stats.py Outdated

ray-gardener Bot added the data, observability, and release-test labels May 14, 2026

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 622cafb to a68bf78 Compare May 14, 2026 20:18

xinyuangui2 changed the title ~~[Data] Add scheduling-loop p90/max stats and py-spy flag to worker_scaling_benchmark~~ [Data] Add scheduling-loop max stat, py-spy profiling coordinator, and benchmark wiring May 14, 2026

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from a68bf78 to e9366e2 Compare May 14, 2026 20:20

xinyuangui2 changed the title ~~[Data] Add scheduling-loop max stat, py-spy profiling coordinator, and benchmark wiring~~ [Data] Add scheduling-loop max metric to DatasetStatsSummary May 14, 2026

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch 2 times, most recently from 77d7af2 to 49300ca Compare May 14, 2026 20:25

xinyuangui2 requested a review from iamjustinhsu May 14, 2026 21:06

xinyuangui2 added the go label (add ONLY when ready to merge, run all tests) May 14, 2026

iamjustinhsu approved these changes May 14, 2026

Comment thread python/ray/data/_internal/stats.py Outdated

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 49300ca to 8a150e0 Compare May 14, 2026 23:21

cursor Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/stats.py Outdated

xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 8a150e0 to a1fd253 Compare May 14, 2026 23:31

xinyuangui2 mentioned this pull request May 15, 2026

[DRAFT][Data] Wide-schema worker_scaling release test + scheduling-loop avg/max stats #63377

Closed

3 tasks

bveeramani approved these changes May 15, 2026

Comment thread python/ray/data/_internal/stats.py Outdated

xinyuangui2 added 2 commits May 15, 2026 23:36

cursor Bot reviewed May 15, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py

bveeramani merged commit 944df88 into ray-project:master May 16, 2026
6 checks passed

xinyuangui2 mentioned this pull request May 20, 2026

[Data] Reducing StreamingExecutor scheduling-loop overhead #63544

Open

bveeramani merged 4 commits into ray-project:master from xinyuangui2:xgui/data-pyspy-flag-stats

3 participants
