[Core] Optimize open telemetry metric recording calls by sampan-s-nayak · Pull Request #59337 · ray-project/ray

sampan-s-nayak · 2025-12-10T05:39:19Z

Description

This PR introduces the following optimizations in the OpenTelemetryMetricRecorder and some of its consumers:

- Use asynchronous instruments wherever available (counter and up-down counter).
- Introduce a batch API to record histogram metrics (to prevent lock contention caused by repeated set_metric_value() calls).
- Batch the "events received" metric update in aggregator_agent instead of making individual calls.
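
As a rough illustration of the first item, the sketch below shows the general shape of an asynchronous (observable) counter: recording only updates an in-memory dict under a lock, and a callback reports cumulative totals when a collection cycle runs. It uses only the standard library. The names `CounterStore`, `record`, and `observe` are hypothetical and not taken from the PR; in a real recorder the callback would be registered with OpenTelemetry (for example through `Meter.create_observable_counter`) rather than called by hand.

```python
import threading
from collections import defaultdict


class CounterStore:
    """Hypothetical stand-in for the recorder's per-metric counter state."""

    def __init__(self):
        self._lock = threading.Lock()
        # metric name -> {frozenset of (tag, value) pairs -> cumulative total}
        self._totals = defaultdict(lambda: defaultdict(float))

    def record(self, name, tags, value):
        # Cheap hot path: a single dict update, no exporter work.
        key = frozenset(tags.items())
        with self._lock:
            self._totals[name][key] += value

    def observe(self, name):
        # Called once per collection cycle; returns (attributes, total) pairs.
        # Totals are not cleared because counters are cumulative.
        with self._lock:
            snapshot = dict(self._totals.get(name, {}))
        return [(dict(key), total) for key, total in snapshot.items()]


store = CounterStore()
store.record("tasks_finished", {"node": "a"}, 1)
store.record("tasks_finished", {"node": "a"}, 2)
store.record("tasks_finished", {"node": "b"}, 5)
observations = sorted(store.observe("tasks_finished"), key=lambda o: o[1])
```

The point of the design is that the cost of exporting moves from every `record` call to the periodic `observe` call.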

Signed-off-by: sampan <sampan@anyscale.com>

gemini-code-assist · 2025-12-10T05:39:23Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

cursor · 2025-12-10T05:44:22Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+                    filtered = frozenset(
+                        (k, v) for k, v in tag_set if k not in high_cardinality_labels
+                    )
+                    aggregated[filtered] += val


Bug: Gauge aggregation now sums values instead of metric-specific function

The new _create_observable_callback method uses aggregated[filtered] += val for all metric types, including gauges. The previous implementation used MetricCardinality.get_aggregation_function(name)(values) which applies a metric-specific aggregation function. For metrics like "tasks" and "actors", this returns sum(values), but for other gauges it returns values[0] (the first value). The new code incorrectly sums all gauge values when high-cardinality labels are dropped and observations are aggregated, changing the semantics for gauges that aren't configured to use sum aggregation.

This is only a problem when dealing with high-cardinality tags, but I'm handling it just in case.
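
To make the difference concrete, here is a minimal sketch of label dropping followed by a metric-specific aggregation, assuming observations are keyed by a frozenset of (label, value) pairs as in the diff above. The table `GAUGE_AGGREGATION`, the helper `first_value`, and the function `aggregate_gauge` are hypothetical names for illustration, not the PR's actual code.

```python
from collections import defaultdict

# Hypothetical per-metric aggregation table; real names and entries may differ.
GAUGE_AGGREGATION = {"tasks": sum, "actors": sum}


def first_value(values):
    # Default for gauges without a configured aggregation.
    return values[0]


def aggregate_gauge(name, observations, high_cardinality_labels):
    """Drop high-cardinality labels, then aggregate series that now collide.

    observations: dict mapping frozenset of (label, value) pairs -> gauge value.
    """
    grouped = defaultdict(list)
    for tag_set, val in observations.items():
        filtered = frozenset(
            (k, v) for k, v in tag_set if k not in high_cardinality_labels
        )
        grouped[filtered].append(val)
    aggregate = GAUGE_AGGREGATION.get(name, first_value)
    return {tags: aggregate(vals) for tags, vals in grouped.items()}


obs = {
    frozenset({("node", "a"), ("worker_id", "w1")}): 4,
    frozenset({("node", "a"), ("worker_id", "w2")}): 6,
}
summed = aggregate_gauge("tasks", obs, {"worker_id"})
first = aggregate_gauge("cpu_percent", obs, {"worker_id"})
```

With `worker_id` dropped, both series collapse onto `node=a`: the "tasks" gauge sums to 10, while the unconfigured gauge keeps the first value, 4. Summing unconditionally would report 10 for both.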

jjyao · 2025-12-10T05:57:42Z

Could you add perf numbers?

sampan-s-nayak · 2025-12-10T06:17:57Z

@jjyao With these changes (and with aggregator to GCS enabled), the stress_test_many_tasks.aws (None) (0) stage 3 time comes back down to roughly normal: stage_3_time = 1978.6315276622772 (run: https://buildkite.com/ray-project/release/builds/71111), down from stage_3_time = 3084.111141204834 (run: https://buildkite.com/ray-project/release/builds/69176#019ab3f6-8bef-4fe2-aeae-4571adab0090).

sampan-s-nayak · 2025-12-10T06:21:09Z

Also, I personally feel we should revisit this in the future and, in the long term, consider moving either the reporter agent or the aggregator agent out of the dashboard agent event loop. (I prefer moving the reporter's metric handling logic out of the event loop, since it performs synchronous work.) The way we emit histogram metrics can also be improved further; this is the part adding the most overhead right now.
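
For reference, one generic way to keep synchronous work from blocking an asyncio event loop is to hand it to a worker thread. This is only a sketch of that general technique, not a proposal for how the reporter agent should be restructured; `handle_metrics_sync` and `heartbeat` are hypothetical stand-ins.

```python
import asyncio
import time


def handle_metrics_sync(batch):
    """Hypothetical stand-in for synchronous metric handling work."""
    time.sleep(0.05)  # simulates blocking work
    return len(batch)


async def heartbeat(ticks):
    # Records a timestamp each time the event loop gets to run this task.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.005)


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker
    # thread, so the loop keeps servicing other coroutines in the meantime.
    handled = await asyncio.to_thread(handle_metrics_sync, list(range(100)))
    hb.cancel()
    return handled, len(ticks)


handled, tick_count = asyncio.run(main())
```

Note that a thread only helps when the blocking call releases the GIL (as `time.sleep` does here); CPU-bound pure-Python work would still compete with the event loop thread.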

can-anyscale · 2025-12-10T18:21:00Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+                    observations = self._counter_observations_by_name.get(
+                        metric_name, {}
+                    )
+                    # Don't clear - counters are cumulative


This might not be correct, by the way: if the sub-component (a worker, for example, but it could also be a Ray Data concept such as an operator) no longer emits metrics, this should be cleared. Otherwise the metric will show up as unchanging in Grafana instead of no longer reporting values.

This might be acceptable behavior, but I'm not sure.

It might also cause a memory leak or unbounded growth if counter entries are never evicted from the set.

In my experience with counter metrics, once you stop publishing values for a counter it just shows up as a straight line (at the last emitted value); it doesn't get cleared automatically.

Also, from the Prometheus docs:

"A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart."

So a reset is only possible on process restart. https://prometheus.io/docs/concepts/metric_types/#counter

To hit memory issues we would need to be emitting a very large number of counters. Do you think that is possible today? If so, I can use cachetools.TTLCache to automatically delete entries after a configured timeout.
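
A minimal sketch of the time-based eviction idea, written with the standard library only as a stand-in for cachetools.TTLCache. The class `TTLObservations` and its methods are hypothetical; the injectable `clock` parameter exists only so the example is deterministic.

```python
import time


class TTLObservations:
    """Hypothetical store that forgets series not updated within ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = {}  # tag key -> (cumulative value, last update time)

    def add(self, tag_key, value):
        total, _ = self._data.get(tag_key, (0, None))
        self._data[tag_key] = (total + value, self._clock())

    def snapshot(self):
        # Evict stale series, then return the surviving cumulative values.
        now = self._clock()
        self._data = {
            k: (v, t) for k, (v, t) in self._data.items() if now - t <= self._ttl_s
        }
        return {k: v for k, (v, _) in self._data.items()}


fake_now = [0.0]
store = TTLObservations(ttl_s=60, clock=lambda: fake_now[0])
store.add("worker=w1", 3)
fake_now[0] = 50.0
store.add("worker=w2", 1)
fake_now[0] = 100.0
remaining = store.snapshot()
```

One trade-off to keep in mind: in this sketch, a series that is evicted and later recorded again restarts from zero, which a downstream system would see as a counter reset.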

can-anyscale · 2025-12-10T18:26:08Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

-                elif isinstance(instrument, metrics.Histogram):
-                    instrument.record(value, attributes=tags)
+                if isinstance(instrument, metrics.Histogram):
+                    # Filter out high cardinality labels.


This implements cardinality reduction for histograms, which is good, but I would recommend splitting it into a separate PR (dropping labels for histograms probably needs a separate discussion to make sure the logic is correct).

The existing implementation of set_metric_value() already filters out high-cardinality labels for histogram metrics (we create the tags dict at the start and then use it for all metric types).

sampan-s-nayak · 2025-12-16T03:11:54Z

python/ray/dashboard/modules/reporter/reporter_agent.py

-                    continue
-                bucket_midpoint = bucket_midpoints[i]
-                for _ in range(bucket_count):
-                    self._open_telemetry_metric_recorder.set_metric_value(


Previously we called set_metric_value() once for every histogram observation. Since we don't have the actual values, we called set_metric_value() with value = bucket_midpoint, bucket_count times per bucket. This is the primary code block responsible for blocking the dashboard event loop.
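
The contrast between the two approaches can be sketched as follows. This is a simplified model, not the PR's code: `HistogramRecorder`, `record_batch`, and `expand_buckets` are hypothetical names, and the recorder only counts lock acquisitions to show where the per-observation cost comes from.

```python
import threading


class HistogramRecorder:
    """Hypothetical recorder that counts how often its lock is taken."""

    def __init__(self):
        self._lock = threading.Lock()
        self.lock_acquisitions = 0
        self.values = []

    def set_metric_value(self, value):
        # One lock acquisition per observation.
        with self._lock:
            self.lock_acquisitions += 1
            self.values.append(value)

    def record_batch(self, values):
        # One lock acquisition for the whole batch.
        with self._lock:
            self.lock_acquisitions += 1
            self.values.extend(values)


def expand_buckets(bucket_midpoints, bucket_counts):
    # The actual observations are unavailable, so each bucket contributes
    # its midpoint repeated bucket_count times.
    expanded = []
    for midpoint, count in zip(bucket_midpoints, bucket_counts):
        expanded.extend([midpoint] * count)
    return expanded


midpoints, counts = [0.5, 1.5, 2.5], [2, 0, 3]

per_call = HistogramRecorder()
for v in expand_buckets(midpoints, counts):
    per_call.set_metric_value(v)

batched = HistogramRecorder()
batched.record_batch(expand_buckets(midpoints, counts))
```

Both recorders end up with the same five values, but the per-call version takes the lock five times and the batched version once; with large bucket counts the gap grows proportionally.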

ZacAttack · 2025-12-17T00:26:51Z

python/ray/_private/telemetry/metric_cardinality.py

+            return sum
+        # Gauge metrics use metric-specific aggregation or default to first value
+        if metric_name in HIGH_CARDINALITY_GAUGE_AGGREGATION:
+            return HIGH_CARDINALITY_GAUGE_AGGREGATION[metric_name]


Can we flag in the documentation that this implementation just doesn't work for histogram types?

Added docs, and I now raise an error if the metric type is histogram.
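
A minimal sketch of what such a selection function could look like, based only on the diff fragment above. Everything beyond `HIGH_CARDINALITY_GAUGE_AGGREGATION` and `return sum` is an assumption: the `MetricType` members other than GAUGE, the table contents, the function signature, and the choice of `ValueError` are all hypothetical.

```python
from enum import Enum


class MetricType(Enum):
    # Hypothetical members; only GAUGE = 0 is shown in this thread.
    GAUGE = 0
    COUNTER = 1
    SUM = 2
    HISTOGRAM = 3


def first_value(values):
    return values[0]


# Hypothetical contents; the real table lives in metric_cardinality.py.
HIGH_CARDINALITY_GAUGE_AGGREGATION = {"tasks": sum, "actors": sum}


def get_aggregation_function(metric_name, metric_type):
    # Histograms cannot be merged by summing or picking raw values,
    # so reject them explicitly (the exception type is an assumption).
    if metric_type is MetricType.HISTOGRAM:
        raise ValueError(
            f"No high-cardinality aggregation for histogram metric {metric_name!r}"
        )
    if metric_type in (MetricType.COUNTER, MetricType.SUM):
        return sum
    # Gauge metrics use metric-specific aggregation or default to first value.
    if metric_name in HIGH_CARDINALITY_GAUGE_AGGREGATION:
        return HIGH_CARDINALITY_GAUGE_AGGREGATION[metric_name]
    return first_value


tasks_agg = get_aggregation_function("tasks", MetricType.GAUGE)
other_agg = get_aggregation_function("cpu_percent", MetricType.GAUGE)
try:
    get_aggregation_function("latency", MetricType.HISTOGRAM)
    histogram_rejected = False
except ValueError:
    histogram_rejected = True
```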

ZacAttack · 2025-12-17T00:28:10Z

python/ray/_private/telemetry/metric_types.py

+class MetricType(Enum):
+    """Types of metrics supported by the telemetry system."""
+
+    GAUGE = 0


Maybe also call out why SUMMARY isn't supported (doesn't aggregate). Just for completeness.

sampan-s-nayak · 2025-12-17T04:33:19Z

Addressed @ZacAttack's comments.

cursor · 2025-12-17T08:18:08Z

python/ray/_private/telemetry/open_telemetry_metric_recorder.py

+                # Sum - add the value for the given tags.
+                self._sum_observations_by_name[name][tag_key] = (
+                    self._sum_observations_by_name[name].get(tag_key, 0) + value
+                )


Bug: Unbounded counter tag storage growth

set_metric_value() accumulates counter/sum values in _counter_observations_by_name/_sum_observations_by_name keyed by the full tag set and the observable callback never clears or evicts them. If tag values churn or are high-cardinality, this can cause unbounded in-memory growth in long-running processes even when labels are later dropped during export.

Additional Locations (1)

python/ray/_private/telemetry/open_telemetry_metric_recorder.py#L60-L69

ZacAttack

Looks good!

[Core] Optimise open telemetry metric recording calls

2b5df24

Signed-off-by: sampan <sampan@anyscale.com>

sampan-s-nayak requested a review from a team as a code owner December 10, 2025 05:39

sampan-s-nayak added the "go" label (add ONLY when ready to merge, run all tests) Dec 10, 2025

sampan-s-nayak requested a review from can-anyscale December 10, 2025 05:39

cursor bot reviewed Dec 10, 2025

sampan-s-nayak changed the title ~~[Core] Optimise open telemetry metric recording calls~~ [Core] Optimize open telemetry metric recording calls Dec 10, 2025

address comment

b3427a2

Signed-off-by: sampan <sampan@anyscale.com>

ray-gardener bot added the "core" (Issues that should be addressed in Ray Core) and "observability" (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Dec 10, 2025

Merge branch 'master' into optimize-otel

32e60fe

cursor bot reviewed Dec 10, 2025

Merge branch 'master' into optimize-otel

4ad444e

can-anyscale reviewed Dec 10, 2025

address comments + fix test

06b5bf0

Signed-off-by: sampan <sampan@anyscale.com>

sampan-s-nayak requested a review from can-anyscale December 11, 2025 04:47

sampan-s-nayak commented Dec 16, 2025

sampan-s-nayak mentioned this pull request Dec 16, 2025

[core] Run state api and task event tests using both existing and new event aggregator based flows #56880

Closed

8 tasks

ZacAttack reviewed Dec 17, 2025

address comments

2d61096

Signed-off-by: sampan <sampan@anyscale.com>

sampan-s-nayak requested a review from ZacAttack December 17, 2025 04:33

cursor bot reviewed Dec 17, 2025

Merge branch 'master' into optimize-otel

d46f02b

cursor bot reviewed Dec 17, 2025

reorder condition

f18dc98

Signed-off-by: sampan <sampan@anyscale.com>

ZacAttack approved these changes Dec 17, 2025

edoakes merged commit 05e7efd into ray-project:master Dec 17, 2025
6 checks passed

sampan-s-nayak mentioned this pull request Jan 6, 2026

[Core] Update metric exporter docs #59874

Merged
