[DNR][data] Refractor OpRuntimeMetrics #58950

Draft

iamjustinhsu wants to merge 10 commits into ray-project:master from iamjustinhsu:jhsu/refractor-op-runtime-metrics

Contributor

iamjustinhsu commented Nov 24, 2025 (edited)

Description

Currently, Ray Data internal metrics are handled by the monolithic OpRuntimeMetrics class, which handles both task metrics and input/output metrics. While having a single OpRuntimeMetrics class does simplify the abstraction, task metrics are only used by MapOperator and HashShuffleOperator.

Proposal

Split OpRuntimeMetrics into BaseOpMetrics and TaskOpMetrics, where BaseOpMetrics handles input/output metrics, and TaskOpMetrics inherits from BaseOpMetrics and additionally keeps track of task metrics.
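A minimal sketch of the proposed split, for illustration only. The field names and the `as_dict` helper are assumptions made for this example and are not taken from the PR diff; only the class names BaseOpMetrics and TaskOpMetrics and the inheritance relationship come from the proposal above.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseOpMetrics:
    """Input/output metrics that every operator tracks (illustrative fields)."""

    num_inputs_received: int = 0
    num_outputs_generated: int = 0

    def on_input_received(self) -> None:
        self.num_inputs_received += 1

    def on_output_generated(self) -> None:
        self.num_outputs_generated += 1

    def as_dict(self) -> dict:
        # Hypothetical helper: exports only the fields declared on this
        # class and its parents.
        return {f.name: getattr(self, f.name) for f in fields(self)}


@dataclass
class TaskOpMetrics(BaseOpMetrics):
    """Adds task metrics for operators that actually track tasks."""

    num_tasks_submitted: int = 0
    num_tasks_finished: int = 0

    def on_task_submitted(self) -> None:
        self.num_tasks_submitted += 1

    def on_task_finished(self) -> None:
        self.num_tasks_finished += 1

    @property
    def num_tasks_running(self) -> int:
        # Derived value, not a stored field, so as_dict() does not export it.
        return self.num_tasks_submitted - self.num_tasks_finished
```

Under this sketch, an operator that does not track tasks would hold a BaseOpMetrics instance, while MapOperator and HashShuffleOperator would hold a TaskOpMetrics instance.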

Why

I think it's pretty hard to understand which metrics are relevant for an operator and which are not. For example, the limit operator launches tasks to split blocks, but doesn't keep track of task metrics.
In the future, I would like to have dedicated metrics on aggregator health. For example, I could extend TaskOpMetrics (which itself extends BaseOpMetrics) with metrics such as partition_size, num_partitions_per_aggregator, etc.
It's pretty confusing which metrics are exposed publicly (by public, I mean outside the operator, e.g. to the StreamingExecutor or ResourceManager) and which are kept internal and only exported to Prometheus. By implementing BaseOpMetrics, it becomes explicit what gets exposed.
Slightly smaller ray-data.log file, because we only export the metrics the operator actually uses.
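To make the extension and export-size points concrete, here is a rough sketch. `AggregatorOpMetrics` and `exported_metric_names` are hypothetical names invented for this example, and the fields on the two base classes are placeholders; only partition_size and num_partitions_per_aggregator are mentioned in the description.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BaseOpMetrics:
    num_inputs_received: int = 0  # placeholder input/output metric


@dataclass
class TaskOpMetrics(BaseOpMetrics):
    num_tasks_finished: int = 0  # placeholder task metric


@dataclass
class AggregatorOpMetrics(TaskOpMetrics):
    """Hypothetical subclass carrying aggregator-health metrics."""

    partition_size: int = 0
    num_partitions_per_aggregator: int = 0


def exported_metric_names(metrics: BaseOpMetrics) -> List[str]:
    # Hypothetical exporter: lists only the fields the instance declares,
    # so an operator without task metrics logs fewer entries.
    return [f.name for f in fields(metrics)]
```

With this layout, `exported_metric_names(BaseOpMetrics())` yields a single name, while the aggregator subclass yields all four, which is the intuition behind the smaller log file.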

Why not

More abstractions can add complexity: you need to understand which metrics each operator uses.
It is not as easy to use a metric outside the operator: anything needed externally would have to be implemented in the BaseOpMetrics class.

Related issues

Additional information

Assuming the PR is worth merging, I will run the release tests to make sure I didn't miss anything.

iamjustinhsu added 2 commits

November 24, 2025 09:06


[data] Refractor OpRuntimeMetrics

b0078dc

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>


rebase

9895e24

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu requested a review from a team as a code owner

November 24, 2025 21:40

gemini-code-assist bot reviewed

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request is a well-executed refactoring of OpRuntimeMetrics into a more modular and understandable BaseOpMetrics and TaskOpMetrics hierarchy. The changes are applied consistently throughout the codebase, improving clarity and maintainability. I've found one critical issue related to a circular import that would cause a runtime error, which I've commented on.

python/ray/data/_internal/execution/interfaces/op_runtime_metrics/common.py (outdated, resolved)


add num_tasks_running

a4a065a

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu force-pushed the jhsu/refractor-op-runtime-metrics branch from 45da494 to a4a065a Compare

November 24, 2025 21:50

cursor bot reviewed

python/ray/data/_internal/issue_detection/detectors/hanging_detector.py (outdated, resolved)

python/ray/data/_internal/issue_detection/detectors/high_memory_detector.py (resolved)

iamjustinhsu added 2 commits

November 24, 2025 14:02

fix

f8e8e43

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

fix

ed4d013

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor bot reviewed

python/ray/data/_internal/execution/interfaces/op_runtime_metrics/common.py (outdated, resolved)

iamjustinhsu added 2 commits

November 24, 2025 14:16

fix

f6b26b5

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>


rebase

c5faf7a

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor bot reviewed

python/ray/data/_internal/execution/interfaces/op_runtime_metrics/task.py (outdated, resolved)

iamjustinhsu added 2 commits

November 24, 2025 16:36

fix

23bb111

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>


fix + lint

00b787d

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

iamjustinhsu force-pushed the jhsu/refractor-op-runtime-metrics branch from 766b9c6 to 00b787d Compare

November 25, 2025 00:38

ray-gardener bot added the data label


fix abstractions a lil bit

1b5d066

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

cursor bot reviewed

python/ray/data/_internal/execution/interfaces/op_runtime_metrics/task.py

+                      if actor_info is not None:
+                          self.num_alive_actors = actor_info.running
+                          self.num_pending_actors = actor_info.pending
+                          self.num_restarting_actors = actor_info.restarting

cursor bot Nov 25, 2025

Bug: Actor metrics only updated on task completion

Actor pool metrics (num_alive_actors, num_pending_actors, num_restarting_actors) are now only updated when on_task_finished is called with actor_info. Previously, these metrics were updated in add_output (in streaming_executor_state.py), which was called more frequently. This causes actor metrics to become stale between task completions, failing to reflect real-time changes in actor pool state like newly started or restarted actors.
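The staleness described in this comment can be reproduced with a small sketch. This is not the fix adopted in the PR: `ActorPoolInfo`, `TaskOpMetricsSketch`, and `update_actor_info` are hypothetical names for illustration, and only the `running`, `pending`, and `restarting` attributes and the three metric names come from the diff above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActorPoolInfo:
    """Hypothetical snapshot of actor pool state."""

    running: int = 0
    pending: int = 0
    restarting: int = 0


class TaskOpMetricsSketch:
    def __init__(self) -> None:
        self.num_alive_actors = 0
        self.num_pending_actors = 0
        self.num_restarting_actors = 0
        self.num_tasks_finished = 0

    def update_actor_info(self, actor_info: Optional[ActorPoolInfo]) -> None:
        # Hypothetical hook that can be called whenever the pool changes,
        # independently of task completion.
        if actor_info is not None:
            self.num_alive_actors = actor_info.running
            self.num_pending_actors = actor_info.pending
            self.num_restarting_actors = actor_info.restarting

    def on_task_finished(self, actor_info: Optional[ActorPoolInfo] = None) -> None:
        self.num_tasks_finished += 1
        self.update_actor_info(actor_info)


metrics = TaskOpMetricsSketch()
metrics.on_task_finished(ActorPoolInfo(running=2))

# The pool grows, but no task finishes: the metrics still hold the old values.
pool_now = ActorPoolInfo(running=3, pending=1)
stale_alive = metrics.num_alive_actors

# Refreshing from a more frequent call site brings the metrics up to date.
metrics.update_actor_info(pool_now)
fresh_alive = metrics.num_alive_actors
```

One possible remedy, under these assumptions, is to call a refresh hook like `update_actor_info` from a path that runs more often than task completion, as `add_output` did before the change.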

iamjustinhsu marked this pull request as draft

December 2, 2025 18:44

iamjustinhsu changed the title ~~[data] Refractor OpRuntimeMetrics~~ [DNR][data] Refractor OpRuntimeMetrics

github-actions bot commented Dec 17, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions bot added the stale label

iamjustinhsu removed the stale label

github-actions bot commented Jan 1, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions bot added the stale label

iamjustinhsu added unstale and removed stale labels
