[Data] Add scheduling-loop max metric to DatasetStatsSummary #63345

Merged
bveeramani merged 4 commits into ray-project:master from xinyuangui2:xgui/data-pyspy-flag-stats on May 16, 2026

Conversation

@xinyuangui2 xinyuangui2 commented May 14, 2026

Description

Adds two new fields to DatasetStatsSummary and leaves the existing total field unchanged:

| Field | Source | Meaning |
| --- | --- | --- |
| streaming_exec_schedule_s | Timer.get() (unchanged) | total wall-clock time across scheduler iterations |
| streaming_exec_schedule_avg_s (new) | Timer.avg() | per-iteration average |
| streaming_exec_schedule_max_s (new) | Timer.max() | per-iteration max |

The existing streaming_exec_schedule_s keeps its total-duration meaning so runtime_metrics() (which divides it by total_wall_time to compute a percentage) continues to produce a correct breakdown. Per-iteration values are exposed under separate field names.

Memory stays O(1) per Timer — no sample buffer; _max is already tracked.
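
To illustrate the O(1) claim, here is a minimal sketch of a timer that keeps only running aggregates. SketchTimer is a hypothetical stand-in written for this example, not Ray's actual Timer class; the sample durations are made up.

```python
class SketchTimer:
    """Hypothetical stand-in for a Timer that stores no per-sample buffer."""

    def __init__(self) -> None:
        self._total = 0.0      # sum of all recorded durations
        self._max = 0.0        # largest single duration seen so far
        self._total_count = 0  # number of recorded durations

    def add(self, value: float) -> None:
        # Constant work and constant memory per sample.
        self._total += value
        self._total_count += 1
        if value > self._max:
            self._max = value

    def get(self) -> float:
        return self._total

    def max(self) -> float:
        return self._max

    def avg(self) -> float:
        # Assumed empty-timer behavior: inf signals "undefined".
        return self._total / self._total_count if self._total_count else float("inf")


timer = SketchTimer()
for duration_s in (0.5, 0.25, 0.75):  # three scheduler iterations
    timer.add(duration_s)

summary = {
    "streaming_exec_schedule_s": timer.get(),
    "streaming_exec_schedule_avg_s": timer.avg(),
    "streaming_exec_schedule_max_s": timer.max(),
}
```

Because only three scalars are kept, the average and max are available without retaining samples. Percentiles such as p90 would need a sample buffer or a streaming estimator, which this sketch does not attempt.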

Tested using https://buildkite.com/ray-project/release/builds/92867#019e28c9-c8ad-4fb3-9776-c89f52f93200

@xinyuangui2 xinyuangui2 requested a review from a team as a code owner May 14, 2026 17:14

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces percentile and average tracking for dataset execution statistics, specifically for the streaming scheduling loop duration, and adds optional py-spy profiling support to the worker scaling benchmark. The code review identified several critical issues: returning infinity instead of zero when no samples exist will cause crashes in formatting utilities due to overflow errors; storing all timing samples in an unbounded list creates a memory leak risk; and changing the scheduling duration metric from a total to an average breaks existing wall-time breakdown calculations. Additionally, the reviewer noted that sorting the sample list on every percentile calculation is inefficient and suggested using more performant data structures or caching.

Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
@xinyuangui2 xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 7a5ccc9 to 622cafb May 14, 2026 17:17
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
Comment thread python/ray/data/_internal/stats.py Outdated
@ray-gardener ray-gardener Bot added the data, observability, and release-test labels May 14, 2026
@xinyuangui2 xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 622cafb to a68bf78 May 14, 2026 20:18
@xinyuangui2 xinyuangui2 changed the title from "[Data] Add scheduling-loop p90/max stats and py-spy flag to worker_scaling_benchmark" to "[Data] Add scheduling-loop max stat, py-spy profiling coordinator, and benchmark wiring" May 14, 2026
@xinyuangui2 xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from a68bf78 to e9366e2 May 14, 2026 20:20
@xinyuangui2 xinyuangui2 changed the title from "[Data] Add scheduling-loop max stat, py-spy profiling coordinator, and benchmark wiring" to "[Data] Add scheduling-loop max metric to DatasetStatsSummary" May 14, 2026
@xinyuangui2 xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch 2 times, most recently from 77d7af2 to 49300ca May 14, 2026 20:25
@xinyuangui2 xinyuangui2 requested a review from iamjustinhsu May 14, 2026 21:06
@xinyuangui2 xinyuangui2 added the go label (add ONLY when ready to merge, run all tests) May 14, 2026
Comment thread python/ray/data/_internal/stats.py Outdated
@xinyuangui2 xinyuangui2 force-pushed the xgui/data-pyspy-flag-stats branch from 49300ca to 8a150e0 May 14, 2026 23:21
Comment thread python/ray/data/_internal/stats.py Outdated
Adds two new fields to ``DatasetStatsSummary``:

  - ``streaming_exec_schedule_avg_s``: per-iteration average
    (sourced from ``Timer.avg()``)
  - ``streaming_exec_schedule_max_s``: per-iteration max
    (sourced from the already-tracked ``Timer.max()``)

``streaming_exec_schedule_s`` keeps its existing meaning (total
wall-clock time across all scheduler iterations) so the
``runtime_metrics()`` breakdown — which divides this value by
total_wall_time to compute a percentage — remains correct.
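
A minimal sketch of why the total has to stay a total. The helper ``schedule_share_pct`` and the sample numbers are illustrative assumptions, not the actual ``runtime_metrics()`` code:

```python
def schedule_share_pct(schedule_total_s: float, total_wall_time_s: float) -> float:
    """Share of wall time spent in the scheduling loop, as a percentage."""
    if total_wall_time_s <= 0:
        return 0.0
    return 100.0 * schedule_total_s / total_wall_time_s


iteration_durations_s = [0.5, 0.25, 0.75]  # made-up per-iteration timings
total_wall_time_s = 6.0                    # made-up run duration

total_s = sum(iteration_durations_s)
avg_s = total_s / len(iteration_durations_s)

# Dividing the total by wall time gives the intended share.
correct_pct = schedule_share_pct(total_s, total_wall_time_s)

# Feeding the per-iteration average into the same formula understates
# the share by a factor equal to the number of iterations.
wrong_pct = schedule_share_pct(avg_s, total_wall_time_s)
```

With these numbers the total-based share is 25%, while the average-based figure is one third of that.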

Memory stays O(1) per Timer; no sample buffer.

Release-test helper ``collect_dataset_stats`` is updated to surface
the per-iteration values as ``avg_/max_scheduling_loop_duration_s``.

Zero-iteration safety: the previous ``if self.streaming_exec_schedule_s``
guard at the build site was a dead check (``Timer()`` is always
truthy). With no scheduler iterations recorded (e.g. non-streaming
execution), ``Timer.avg()`` returns ``float("inf")``, which would
break JSON serialization in release-test output and produce nonsense
in ``runtime_metrics()``. Replaced with an explicit
``_total_count > 0`` check that returns 0 for all three fields when
no samples are present.
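
Both failure modes can be reproduced with the standard library alone. In this sketch ``EmptyTimer`` and ``safe_avg`` are hypothetical names used only for illustration; the ``json`` behavior shown is that of Python's built-in module.

```python
import json


class EmptyTimer:
    """Hypothetical placeholder that defines neither __bool__ nor __len__."""


# Any such instance is truthy, so `if timer:` never takes the else branch.
always_truthy = bool(EmptyTimer())

empty_avg = float("inf")

# By default json.dumps emits the non-standard token Infinity,
# which strict JSON parsers reject.
permissive = json.dumps({"avg_s": empty_avg})

# With allow_nan=False the same call raises ValueError instead.
try:
    json.dumps({"avg_s": empty_avg}, allow_nan=False)
    strict_rejected = False
except ValueError:
    strict_rejected = True


def safe_avg(total_s: float, count: int) -> float:
    """Count-based guard: return 0 when no samples were recorded."""
    return total_s / count if count > 0 else 0.0
```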

Rationale: total scheduler time scales with run duration and is hard
to compare across runs of different lengths. The new avg/max fields
give per-iteration quantities that directly reflect scheduler
efficiency, without breaking the existing total-based breakdown.

Signed-off-by: xgui <xgui@anyscale.com>

Three small follow-ups to the review thread, none requiring a guard
or private-attribute access at the call site:

1. ``Timer.avg()`` now returns 0 when no samples have been recorded
   (was ``float("inf")``). Matches the zero-sample semantics of
   ``Timer.get()`` and ``Timer.max()``, which both return 0. The
   previous ``inf`` return broke JSON serialization downstream and
   forced consumers to special-case the empty path.

2. ``StreamingExecutor._generate_stats`` now assigns ``Timer()`` (an
   empty Timer) instead of ``None`` when ``_initial_stats`` is falsy.
   The type annotation on the field is ``Timer``, not
   ``Optional[Timer]``; this makes runtime match the annotation.

3. The ``DatasetStats.to_summary`` call site drops the
   ``schedule_timer._total_count > 0`` guard. It now just calls
   ``.get()`` / ``.avg()`` / ``.max()`` directly: the Timer is
   always present (per item 2) and the methods all return 0 for an
   empty Timer (per item 1).

The pickle PR's reviewers asked for both:
  - "not access ``schedule_timer._total_count``"
  - "make ``streaming_exec_schedule_s`` always non-None"

This commit does both, plus the Timer.avg() consistency fix that
makes the simplified call site safe.

Signed-off-by: xgui <xgui@anyscale.com>
Comment thread python/ray/data/_internal/stats.py Outdated
Reverts the global ``Timer.avg() -> 0`` change from the prior commit.
The previous behavior of returning ``float("inf")`` for an empty Timer
is retained as the default to preserve the "undefined" signal for
display callers (``fmt(timer.avg())`` renders it as ``"inf s"``).

The summary build site that needs a JSON-safe / arithmetic-safe value
opts out explicitly:

    schedule_timer.avg(default=0.0)

Only the schedule-Timer call site uses the override. All other
``Timer.avg()`` callers (the iterator-side metrics) keep the existing
``inf`` semantics unchanged.
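
A sketch of what such an opt-in override could look like. ``SketchTimer`` is a hypothetical stand-in, and the exact signature here is an assumption based on the call shown above:

```python
class SketchTimer:
    """Hypothetical timer with a caller-selectable empty-timer value."""

    def __init__(self) -> None:
        self._total = 0.0
        self._total_count = 0

    def add(self, value: float) -> None:
        self._total += value
        self._total_count += 1

    def avg(self, default: float = float("inf")) -> float:
        # With no samples, return the caller's default; inf remains the
        # fallback so existing callers see no behavior change.
        if self._total_count == 0:
            return default
        return self._total / self._total_count


schedule_timer = SketchTimer()
display_value = schedule_timer.avg()             # keeps the "undefined" signal
summary_value = schedule_timer.avg(default=0.0)  # JSON-safe for the summary
```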

Signed-off-by: xgui <xgui@anyscale.com>

Reverts the ``default=`` kwarg added in the previous commit; call
``avg()`` directly with its existing signature. ``StreamingExecutor.
_generate_stats`` always assigns a ``Timer``, and downstream
consumers of ``streaming_exec_schedule_avg_s`` can interpret the
``float('inf')`` empty-Timer signal the same way they already do for
the other ``Timer.avg()`` call sites.

Signed-off-by: xgui <xgui@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 29ba77e.

Comment thread python/ray/data/_internal/execution/streaming_executor.py
@bveeramani bveeramani merged commit 944df88 into ray-project:master May 16, 2026
6 checks passed
Labels

- data: Ray Data-related issues
- go: add ONLY when ready to merge, run all tests
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
- release-test: release test

3 participants