[DNR][data] Refractor OpRuntimeMetrics #58950

Draft
iamjustinhsu wants to merge 10 commits into ray-project:master from iamjustinhsu:jhsu/refractor-op-runtime-metrics

@iamjustinhsu iamjustinhsu commented Nov 24, 2025

Description

Currently, Ray Data internal metrics are handled by the monolithic OpRuntimeMetrics class, which handles both task metrics and input/output metrics. While I do think having a single OpRuntimeMetrics simplifies the abstraction, task metrics are only used by MapOperator and HashShuffleOperator.

Proposal

Split OpRuntimeMetrics into BaseOpMetrics and TaskOpMetrics, where BaseOpMetrics handles input/output metrics, and TaskOpMetrics inherits from BaseOpMetrics and additionally keeps track of task metrics.
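
A minimal sketch of the proposed split, for illustration only. The class names come from the proposal; the individual fields and the `on_*` / `as_dict` methods are assumptions chosen to keep the example small, not the actual code in this PR.

```python
from dataclasses import asdict, dataclass


@dataclass
class BaseOpMetrics:
    """Input/output metrics that every operator tracks (illustrative fields)."""

    num_inputs_received: int = 0
    num_outputs_taken: int = 0

    def on_input_received(self) -> None:
        self.num_inputs_received += 1

    def on_output_taken(self) -> None:
        self.num_outputs_taken += 1

    def as_dict(self) -> dict:
        # Includes inherited fields when called on a subclass instance.
        return asdict(self)


@dataclass
class TaskOpMetrics(BaseOpMetrics):
    """Adds task metrics for operators that need them (illustrative fields)."""

    num_tasks_submitted: int = 0
    num_tasks_finished: int = 0

    def on_task_submitted(self) -> None:
        self.num_tasks_submitted += 1

    def on_task_finished(self) -> None:
        self.num_tasks_finished += 1


# An operator that does not track tasks only carries input/output metrics.
limit_metrics = BaseOpMetrics()
limit_metrics.on_input_received()

# A task-tracking operator carries both sets of metrics.
map_metrics = TaskOpMetrics()
map_metrics.on_input_received()
map_metrics.on_task_submitted()
map_metrics.on_task_finished()
```

With this shape, code that only needs input/output metrics can accept any `BaseOpMetrics`, while task-specific consumers can require `TaskOpMetrics`.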

Why

  1. It's pretty hard to tell which metrics are relevant to an operator and which are not. For example, the limit operator launches tasks to split blocks but doesn't keep track of task metrics.
  2. In the future, I would like to have dedicated metrics on aggregator health. For example, I could extend TaskOpMetrics(BaseOpMetrics) with metrics such as partition_size, num_partitions_per_aggregator, etc.
  3. It's pretty confusing which metrics are exposed publicly (by public, I mean outside the operator, e.g. to the StreamingExecutor or ResourceManager) and which are only kept internal for Prometheus. With BaseOpMetrics, you know what gets exposed.
  4. A slightly smaller ray-data.log file, because we only export the metrics the operator actually uses.
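
Points 2 and 4 can be sketched as follows. `AggregatorOpMetrics` and `exported_metric_names` are hypothetical names introduced for this example, and the base-class fields are placeholders. Only `partition_size` and `num_partitions_per_aggregator` are taken from the description above.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseOpMetrics:
    # Placeholder input/output fields.
    num_inputs_received: int = 0
    num_outputs_taken: int = 0


@dataclass
class TaskOpMetrics(BaseOpMetrics):
    # Placeholder task field.
    num_tasks_finished: int = 0


@dataclass
class AggregatorOpMetrics(TaskOpMetrics):
    # Hypothetical subclass holding aggregator-health metrics.
    partition_size: int = 0
    num_partitions_per_aggregator: int = 0


def exported_metric_names(metrics: BaseOpMetrics) -> list:
    """Return only the metric names that this operator's metrics class defines."""
    return [f.name for f in fields(metrics)]


base_names = exported_metric_names(BaseOpMetrics())
aggregator_names = exported_metric_names(AggregatorOpMetrics())
```

Because the exported names are derived from the concrete class, an operator using `BaseOpMetrics` would not emit task or aggregator fields at all.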

Why not

  1. More abstractions can lead to additional complexity, so you need to understand which metrics an operator uses.
  2. Not as easily extensible for metrics used outside the operator; those would need to be implemented in the BaseOpMetrics class.

Related issues

Additional information

Will run release tests, assuming the PR is worth merging, to make sure I didn't miss anything.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu requested a review from a team as a code owner November 24, 2025 21:40

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a well-executed refactoring of OpRuntimeMetrics into a more modular and understandable BaseOpMetrics and TaskOpMetrics hierarchy. The changes are applied consistently throughout the codebase, improving clarity and maintainability. I've found one critical issue related to a circular import that would cause a runtime error, which I've commented on.

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/refractor-op-runtime-metrics branch from 45da494 to a4a065a November 24, 2025 21:50
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
@iamjustinhsu iamjustinhsu force-pushed the jhsu/refractor-op-runtime-metrics branch from 766b9c6 to 00b787d November 25, 2025 00:38
@ray-gardener ray-gardener bot added the data label (Ray Data-related issues) Nov 25, 2025
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
if actor_info is not None:
    self.num_alive_actors = actor_info.running
    self.num_pending_actors = actor_info.pending
    self.num_restarting_actors = actor_info.restarting

Bug: Actor metrics only updated on task completion

Actor pool metrics (num_alive_actors, num_pending_actors, num_restarting_actors) are now only updated when on_task_finished is called with actor_info. Previously, these metrics were updated in add_output (in streaming_executor_state.py), which was called more frequently. This causes actor metrics to become stale between task completions, failing to reflect real-time changes in actor pool state like newly started or restarted actors.
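
The staleness described above can be reproduced with a small sketch. `ActorPoolInfo`, `ActorMetrics`, and `update_actor_info` are hypothetical stand-ins, not the classes in this PR. Only the `running` / `pending` / `restarting` attributes and the `on_task_finished` hook are taken from the snippet under review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActorPoolInfo:
    # Hypothetical snapshot of the actor pool state.
    running: int = 0
    pending: int = 0
    restarting: int = 0


class ActorMetrics:
    def __init__(self) -> None:
        self.num_alive_actors = 0
        self.num_pending_actors = 0
        self.num_restarting_actors = 0

    def update_actor_info(self, actor_info: Optional[ActorPoolInfo]) -> None:
        # Hypothetical refresh hook that can be called from any code path.
        if actor_info is not None:
            self.num_alive_actors = actor_info.running
            self.num_pending_actors = actor_info.pending
            self.num_restarting_actors = actor_info.restarting

    def on_task_finished(self, actor_info: Optional[ActorPoolInfo] = None) -> None:
        self.update_actor_info(actor_info)


metrics = ActorMetrics()
metrics.on_task_finished(ActorPoolInfo(running=1, pending=1))

# The pending actor starts, but no task has finished since the last update.
current_pool = ActorPoolInfo(running=2, pending=0)
stale_alive = metrics.num_alive_actors  # still reflects the old snapshot

# Refreshing from a more frequent code path picks up the change.
metrics.update_actor_info(current_pool)
fresh_alive = metrics.num_alive_actors
```

One possible remedy, under these assumptions, is to call a refresh hook like `update_actor_info` from a path that runs more often than task completion, as the previous `add_output` call site did.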

@iamjustinhsu iamjustinhsu marked this pull request as draft December 2, 2025 18:44
@iamjustinhsu iamjustinhsu changed the title from "[data] Refractor OpRuntimeMetrics" to "[DNR][data] Refractor OpRuntimeMetrics" Dec 2, 2025
github-actions bot commented

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label ("The issue is stale. It will be closed within 7 days unless there are further conversation") Dec 17, 2025
@iamjustinhsu iamjustinhsu removed the stale label Dec 17, 2025
github-actions bot commented Jan 1, 2026

@github-actions github-actions bot added the stale label Jan 1, 2026
@iamjustinhsu iamjustinhsu added the unstale label and removed the stale label Jan 2, 2026

Labels

data: Ray Data-related issues
unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

1 participant