[Observability] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB #27952

joshua-cogliati-inl opened this issue Aug 17, 2022 · 3 comments
Labels: bug, dashboard, observability, P2


What happened + What you expected to happen

  1. The bug: ray timeline does not seem to work.
  2. Expected behavior: ray timeline would output a json file.
  3. logs etc:
$ ray timeline --address=''
2022-08-17 08:38:21,165	INFO -- Connecting to Ray instance at
2022-08-17 08:38:21,165	INFO -- Connecting to existing Ray cluster at address:
2022-08-17 08:38:47,311	INFO -- Trace file being written to /tmp/ray-timeline-2022-08-17_08-38-47.json
(pid=gcs_server) [libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/] ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB: 16494724187
(pid=gcs_server) [libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/] ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB: 16509463013
(pid=gcs_server) [libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/] ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB: 16524627863
(pid=gcs_server) [libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/] ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB: 16539796049
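
To put the numbers in the log above in perspective, here is a small sketch of the arithmetic. It assumes the protobuf limit is 2**31 - 1 bytes (the "2GB" in the error message) and that the trailing number on each log line is the attempted reply size in bytes; the variable names are only for illustration.

```python
# Sizes (in bytes) copied from the four gcs_server error lines above.
reported_sizes = [16494724187, 16509463013, 16524627863, 16539796049]

# Assumption: the "2GB" cap in the error is the signed 32-bit maximum.
PROTOBUF_MAX_BYTES = 2**31 - 1
GIB = 2**30

for size in reported_sizes:
    # Show each attempted reply in GiB and as a multiple of the cap.
    print(f"{size / GIB:.2f} GiB, {size / PROTOBUF_MAX_BYTES:.1f}x the limit")

# Difference between consecutive attempts, in bytes.
deltas = [b - a for a, b in zip(reported_sizes, reported_sizes[1:])]
print(deltas)
```

Each attempted reply is several times over the cap, and the size grows by roughly 15 MB between consecutive attempts, which would be consistent with profile data continuing to accumulate while the command runs.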

Ray status seems fine:

$ ray status --address=''
======== Autoscaler status: 2022-08-17 08:41:02.793078 ========
Node status
 1 node_769e5331d9364c80db37c1400aced64a0515045125e5274b263252ab
 1 node_1dc43a88c2e9f0d9862f7d9e73abcc8ed4612caa3bd642370730f8eb
 1 node_8d68e860fe245b90a2a7eecdac4c1605a65a9e1753094420b3647ffb
 1 node_7b21b4cce2e071c3a242f9968052ca33a39e75b9f595f8bdfaaacdfb
 1 node_049948c1360b2a340771483cb796626681a6a36b9aba91ad9abc38c2
 1 node_736d602ec8b9cad236273ee49af94b01ca54f22cfe8ad8eb672fc73d
 1 node_d8970ba57ac61fe6015ee223672a9985aaf21f18aadc72f5ae73216c
 (no pending nodes)
Recent failures:
 (no failures)

 4.0/70.0 CPU
 0.00/863.949 GiB memory
 0.94/374.255 GiB object_store_memory

 (no resource demands)

(Note: I added the "Trace file being written to {filename}" log message before the call to ray.timeline in ray/scripts/ )

$ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Reproduction script

Partial reproduction script.
Ray is started on head:

ray start --head

and the code uses ssh to log in to each node and start Ray there:

ray start --verbose --address= --num-cpus 11 --min-worker-port 10002 --max-worker-port 10090

ray start --verbose --address= --num-cpus 11 --min-worker-port 10002 --max-worker-port 10090

ray start --verbose --address= --num-cpus 11 --min-worker-port 10002 --max-worker-port 10090

ray start --verbose --address= --num-cpus 11 --min-worker-port 10002 --max-worker-port 10090

ray start --verbose --address= --num-cpus 11 --min-worker-port 10002 --max-worker-port 10090

ray start --verbose --address= --num-cpus 6 --min-worker-port 10002 --max-worker-port 10050

and then, after Ray has been running, I try to run ray timeline:

ray timeline --address=''

Issue Severity

No response

@joshua-cogliati-inl joshua-cogliati-inl changed the title [<Ray component: Core] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB [<Ray component: Core>] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB Aug 17, 2022
@jjyao jjyao changed the title [<Ray component: Core>] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB [Core] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB Aug 18, 2022
jjyao commented Aug 18, 2022

cc @rkooo567

@jjyao jjyao added the core Issues that should be addressed in Ray Core label Aug 18, 2022
rkooo567 commented Oct 4, 2022

I think this is a known issue. ray timeline doesn't work well at large scale right now. We are planning to fix it soon (optimistically by Ray 2.2 or 2.3).

@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 4, 2022
@rkooo567 rkooo567 added observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling triage Needs triage (eg: priority, bug/not-bug, and owning component) and removed core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks labels Oct 24, 2022
@rkooo567 rkooo567 removed their assignment Oct 24, 2022
@richardliaw richardliaw added the core Issues that should be addressed in Ray Core label Oct 29, 2022
@rkooo567 rkooo567 added the dashboard Issues specific to the Ray Dashboard label Oct 30, 2022
Actually, what's the Ray version? It may already be fixed on master. We now have an upper limit on the amount of profile data stored, so the return payload should always be well below 2GB. Can you check?
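
As a rough illustration of what an upper limit on stored profile data means, here is a minimal sketch of a capped buffer that discards the oldest events once it is full. This is not Ray's implementation; the class name `BoundedProfileBuffer`, the `max_events` parameter, and the event shape are all hypothetical.

```python
from collections import deque


class BoundedProfileBuffer:
    """Hypothetical store that keeps only the newest `max_events` events."""

    def __init__(self, max_events):
        # A deque with maxlen silently evicts the oldest item when full.
        self._events = deque(maxlen=max_events)
        self.dropped = 0

    def add(self, event):
        # Count the event that is about to be evicted, if any.
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def snapshot(self):
        # The reply size is bounded because at most max_events are returned.
        return list(self._events)


buf = BoundedProfileBuffer(max_events=3)
for i in range(5):
    buf.add({"event_id": i})

print([e["event_id"] for e in buf.snapshot()], "dropped:", buf.dropped)
```

With a cap like this, the size of a "get all profile info" style reply depends on the cap rather than on how long the cluster has been running, at the cost of losing the oldest events.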

@hora-anyscale hora-anyscale removed the core Issues that should be addressed in Ray Core label Dec 14, 2022
@hora-anyscale hora-anyscale changed the title [Core] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB [Observability] ray timeline errors with ray.rpc.GetAllProfileInfoReply exceeded maximum protobuf size of 2GB Dec 14, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core and removed core Issues that should be addressed in Ray Core labels Dec 16, 2022
@alanwguo alanwguo added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 17, 2022
