[DNR][data] Refractor OpRuntimeMetrics#58950
[DNR][data] Refractor OpRuntimeMetrics#58950iamjustinhsu wants to merge 10 commits intoray-project:masterfrom
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request is a well-executed refactoring of OpRuntimeMetrics into a more modular and understandable BaseOpMetrics and TaskOpMetrics hierarchy. The changes are applied consistently throughout the codebase, improving clarity and maintainability. I've found one critical issue related to a circular import that would cause a runtime error, which I've commented on.
python/ray/data/_internal/execution/interfaces/op_runtime_metrics/common.py
Outdated
Show resolved
Hide resolved
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
45da494 to
a4a065a
Compare
python/ray/data/_internal/issue_detection/detectors/hanging_detector.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/issue_detection/detectors/high_memory_detector.py
Show resolved
Hide resolved
python/ray/data/_internal/execution/interfaces/op_runtime_metrics/common.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/interfaces/op_runtime_metrics/task.py
Outdated
Show resolved
Hide resolved
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
766b9c6 to
00b787d
Compare
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
| if actor_info is not None: | ||
| self.num_alive_actors = actor_info.running | ||
| self.num_pending_actors = actor_info.pending | ||
| self.num_restarting_actors = actor_info.restarting |
There was a problem hiding this comment.
Bug: Actor metrics only updated on task completion
Actor pool metrics (num_alive_actors, num_pending_actors, num_restarting_actors) are now only updated when on_task_finished is called with actor_info. Previously, these metrics were updated in add_output (in streaming_executor_state.py), which was called more frequently. This causes actor metrics to become stale between task completions, failing to reflect real-time changes in actor pool state like newly started or restarted actors.
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Description
Currently, ray data internal metrics were handled by the monolith
OpRuntimeMetricsclass, which handled task and input/output metrics. While I do think it simplifies abstractions by only thinking about oneOpRuntimeMetrics, task metrics is only used forMapOperatorandHashShuffleOperator.Proposal
Split
OpRuntimeMetricsintoBaseOpMetricsandTaskOpMetrics, whereBaseOpMetricswill handle input/outputs, andTaskOpMetricswill inherit fromBaseOpMetricsand keep track of task metrics.Why
TaskOpMetrics(BaseOpMetrics). This will includepartition_size,num_partitions_per_aggregator, etc...StreamingExecutor,ResourceManager), and which metrics are only kept internal for prometheus. By implementing theBaseOpMetrics, you know what gets exposed.ray-data.loglog file because we only export the metrics the operator actually uses.Why not
BaseOpMetricsclassRelated issues
Additional information
Will run release test, assuming PR is worth merging to make sure I didn't miss anything.