Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] add grpc opencensus plugin #39082

Merged
merged 16 commits into from
Sep 13, 2023
Merged

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Aug 29, 2023

gRPC has a built in plugin to export metrics via opencensus, as detailed in https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md . We already use opencensus to collect Ray metrics, and here we add the gRPC-sourced metrics to the list.

In consideration of potential performance impacts, we disable it by default and can be enabled by component via config enable_grpc_metrics_collection_for. Now we only support gcs, in following patches we plan to support raylet and core_worker. The reason we don't support it right away is, we need to configure before any gRPC traffic, but in raylet and in core worker we init our stats after a grpc call to gcs/raylet to get the configs updated.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 30, 2023
@jjyao
Copy link
Contributor

jjyao commented Aug 30, 2023

Nice, it would be very valuable.

@rynewang
Copy link
Contributor Author

it works now!

image

@rynewang rynewang marked this pull request as ready for review August 30, 2023 20:40
@rynewang rynewang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 30, 2023
dashboard/modules/reporter/reporter_agent.py Outdated Show resolved Hide resolved
src/ray/stats/metric.cc Outdated Show resolved Hide resolved
@rynewang
Copy link
Contributor Author

wait, it compiles in my laptop but not in CI

Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM but I'll defer to @rkooo567 since he is more familiar with this code.

@rkooo567
Copy link
Contributor

can we make the grpc stats off by default (add a system config to enable it)?

Also can you give me a list of grpc stats? I wonder if we should filter some of them since it will be a lot.

Lastly, can you pick which stats we should track from the multi cloud setup?

@rynewang
Copy link
Contributor Author

can we make the grpc stats off by default (add a system config to enable it)?

config added.

Also can you give me a list of grpc stats? I wonder if we should filter some of them since it will be a lot.

https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md

Lastly, can you pick which stats we should track from the multi cloud setup?

grpc.io/client/roundtrip_latency and grpc.io/server/server_latency is my special interest

@rkooo567
Copy link
Contributor

SGTM. I guess we can also track the completed rpcs with status to understand the network failures

Why no error counts?
Error counts can be computed on your metrics backend by totalling the different per-status values.

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Excited to have grpc metrics :)! Nit comments on adding unit tests

python/ray/_private/metrics_agent.py Outdated Show resolved Hide resolved
python/ray/_private/metrics_agent.py Outdated Show resolved Hide resolved
python/ray/_private/metrics_agent.py Outdated Show resolved Hide resolved
src/ray/stats/metric_exporter.cc Show resolved Hide resolved
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 31, 2023
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

  • Allow to choose components (gcs, worker, raylet) to export grpc metrics due to cardinality concern.

I think we can do it by creating a separate method & call it in the beg of the entrypoint script

@rynewang
Copy link
Contributor Author

rynewang commented Sep 5, 2023

Blocked on #39210 because we need grpc to be upgraded to >= 1.49.0 to have grpc/grpc#30567. Once our #39210 is merged I will merge to this PR and CI should work.

@rynewang rynewang requested a review from a team as a code owner September 6, 2023 00:08
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
… fix.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
src/ray/common/ray_config_def.h Outdated Show resolved Hide resolved
python/ray/_private/metrics_agent.py Show resolved Hide resolved
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 9, 2023

Can you merge the latest master? Also lint failure + left of comments

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 10, 2023
@rynewang
Copy link
Contributor Author

ready to merge

@rkooo567
Copy link
Contributor

Can you update the PR description? Also, test_prometheus_physical_stats_record failure seesm related

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 11, 2023
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 11, 2023
@rynewang
Copy link
Contributor Author

updated

@rkooo567 rkooo567 merged commit 77b4cb9 into ray-project:master Sep 13, 2023
107 of 112 checks passed
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Sep 15, 2023
gRPC has a built in plugin to export metrics via opencensus, as detailed in https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md . We already use opencensus to collect Ray metrics, and here we add the gRPC-sourced metrics to the list.

In consideration of potential performance impacts, we disable it by default and can be enabled by component via config enable_grpc_metrics_collection_for. Now we only support gcs, in following patches we plan to support raylet and core_worker. The reason we don't support it right away is, we need to configure before any gRPC traffic, but in raylet and in core worker we init our stats after a grpc call to gcs/raylet to get the configs updated.
Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
gRPC has a built in plugin to export metrics via opencensus, as detailed in https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md . We already use opencensus to collect Ray metrics, and here we add the gRPC-sourced metrics to the list.

In consideration of potential performance impacts, we disable it by default and can be enabled by component via config enable_grpc_metrics_collection_for. Now we only support gcs, in following patches we plan to support raylet and core_worker. The reason we don't support it right away is, we need to configure before any gRPC traffic, but in raylet and in core worker we init our stats after a grpc call to gcs/raylet to get the configs updated.

Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants