[Core] Export Ray Counter as Prometheus Counter metric #43795

Merged: 5 commits, Mar 8, 2024

@jjyao (Contributor) commented Mar 7, 2024

Why are these changes needed?

Currently ray.util.metrics.Counter is exported as a Prometheus gauge metric, which is the wrong metric type. However, directly fixing the metric type is backward incompatible because Prometheus changes the metric name for the counter type (it appends a _total suffix).

To address this issue, #41446 has a fix where we vendor the Prometheus client and change its internal code to not append the _total suffix. I feel this is undesirable since it's a well-known convention that counters have a _total suffix. It's even mandatory in the OpenMetrics spec (https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md), which is a revision of the original Prometheus exposition format that Prometheus now fully supports; that's actually why Prometheus always appends the _total suffix for counters. By not following this convention, we may accidentally break systems that rely on it (https://stackoverflow.com/questions/75202155/how-can-i-prevent-micrometer-from-adding-total-suffix-to-counter-metric-name).

Instead, this PR keeps backward compatibility in a different way: it exports both the wrong gauge metric and the correct counter metric for ray.util.metrics.Counter. This doubles the number of exported Counter metrics, but I searched our codebase and we don't have that many of them (tens), so I think it's fine. Later on, when we have a major release, we can stop exporting the wrong gauge metric. This also gives users time to migrate their dashboards in the meantime.

# HELP ray_demo_total hello
# TYPE ray_demo_total counter
ray_demo_total{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2024-03-07_14-08-43_433336_3963",Version="3.0.0.dev0",WorkerId="01000000ffffffffffffffffffffffffffffffffffffffffffffffff"} 2.0
# HELP ray_demo (DEPRECATED, use ray_demo_total metric instead) hello
# TYPE ray_demo gauge
ray_demo{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2024-03-07_14-08-43_433336_3963",Version="3.0.0.dev0",WorkerId="01000000ffffffffffffffffffffffffffffffffffffffffffffffff"} 2.0
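
For readers who want to see the naming scheme in isolation, the dual export shown above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: `CounterSample`, `format_labels`, and `render_dual_export` are hypothetical names, it does not use Ray or the Prometheus client, and it skips label-value escaping.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CounterSample:
    """One counter reading. Hypothetical type, used only for this sketch."""
    name: str
    description: str
    labels: Dict[str, str]
    value: float


def format_labels(labels: Dict[str, str]) -> str:
    """Render labels as {k="v",...}; escaping is intentionally omitted."""
    if not labels:
        return ""
    inner = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return "{" + inner + "}"


def render_dual_export(sample: CounterSample) -> str:
    """Emit the counter (with _total suffix) plus the deprecated gauge."""
    total_name = f"{sample.name}_total"
    labels = format_labels(sample.labels)
    lines = [
        # Correct metric: counter type, name carries the _total suffix.
        f"# HELP {total_name} {sample.description}",
        f"# TYPE {total_name} counter",
        f"{total_name}{labels} {sample.value}",
        # Backward-compatible metric: old name, gauge type, marked deprecated.
        f"# HELP {sample.name} (DEPRECATED, use {total_name} metric instead) "
        f"{sample.description}",
        f"# TYPE {sample.name} gauge",
        f"{sample.name}{labels} {sample.value}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = CounterSample("ray_demo", "hello", {"Component": "core_worker"}, 2.0)
    print(render_dual_export(demo), end="")
```

Both series carry the same labels and value; only the name and the declared type differ, which is why existing dashboards that query the old name keep working while new ones can move to the `_total` series.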

Related issue number

Closes #37768

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

@rkooo567 (Contributor) left a comment

Code LGTM.
A couple of high-level comments:

  • We should update the API doc.
  • Can we allow disabling this via an env var (and document it in the API doc)?

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from rkooo567 March 8, 2024 10:46
python/ray/util/metrics.py (outdated, resolved)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao merged commit d4121f8 into ray-project:master Mar 8, 2024
9 checks passed
@jjyao jjyao deleted the jjyao/metrics branch March 8, 2024 18:22

@edoakes (Contributor) commented Mar 8, 2024

@jjyao I assume you are going to follow up on a separate PR w/ documentation? Please make sure to update the serve monitoring page and docs for the serve metric wrappers serve/metrics.py

jjyao added a commit that referenced this pull request Mar 13, 2024
Update the Serve Counter metric doc to mention the change in #43795

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
jjyao added a commit that referenced this pull request Mar 14, 2024
Update the Serve Counter metric doc to mention the change in #43795

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
khluu pushed a commit that referenced this pull request Mar 14, 2024
Cherry pick #43901. This is needed as a follow-up of #43795 which is included in 2.10
Update the Serve Counter metric doc to mention the change in #43795

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
Update the Serve Counter metric doc to mention the change in ray-project#43795

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Metrics] Application metrics API Counter create Gauge when imported to Prometheus
3 participants