
[Dashboard] Head node exited unexpectedly because the dashboard process exited #31261

Open

AndreKuu opened this issue Dec 21, 2022 · 5 comments

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical)

AndreKuu (Contributor) commented Dec 21, 2022

What happened + What you expected to happen

Hello guys, have a nice day!

I followed the documentation on building wheels to build Python wheels for a new version. However, when I started the head node with the newly built Ray, it exited with the error Some Ray subprocesses exited unexpectedly: dashboard [exit code=255].

  1. I built my own Ray wheel on my Ubuntu laptop after modifying some source code in the dashboard module (job-related functions) to add a feature: persisting job info in JobInfoStorageClient.

     I followed python/build-wheel-manylinux2014.sh to build the wheels, and found that the wheels I built are much smaller than the official release. My branch is based on releases/2.0.0. Here are the wheels I built on my laptop:

INFO     .whl/ray-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (29.1 MB)                                                             
INFO     .whl/ray-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (29.0 MB)                                                              
INFO     .whl/ray-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (29.1 MB)                                                              
INFO     .whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (29.1 MB)                                                               
INFO     .whl/ray-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (29.1 MB)                                                               
INFO     .whl/ray_cpp-2.0.0.10-cp310-cp310-manylinux2014_x86_64.whl (21.3 MB)                                                         
INFO     .whl/ray_cpp-2.0.0.10-cp36-cp36m-manylinux2014_x86_64.whl (21.7 MB)                                                          
INFO     .whl/ray_cpp-2.0.0.10-cp37-cp37m-manylinux2014_x86_64.whl (21.3 MB)                                                          
INFO     .whl/ray_cpp-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl (21.3 MB)                                                           
INFO     .whl/ray_cpp-2.0.0.10-cp39-cp39-manylinux2014_x86_64.whl (21.3 MB)   

However, the official release is much bigger than mine. For example, for Python 3.8 + Linux + Ray 2.0.0, pip reports Downloading ray-2.0.0-cp38-cp38-manylinux2014_x86_64.whl (59.2 MB).

Is this normal? (A quick way to dig into the size difference is sketched below.)
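
Since a wheel is just a zip archive, listing its largest members shows where the bytes went; comparing the listings for the two wheels should reveal what the smaller one is missing. A minimal sketch (the wheel path is taken from the build output above):

import zipfile

def biggest_members(wheel_path, n=10):
    # Sort the archive members by size and print the largest ones.
    with zipfile.ZipFile(wheel_path) as whl:
        infos = sorted(whl.infolist(), key=lambda i: i.file_size, reverse=True)
        for info in infos[:n]:
            print(f"{info.file_size / 1e6:8.1f} MB  {info.filename}")

biggest_members(".whl/ray-2.0.0.10-cp38-cp38-manylinux2014_x86_64.whl")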

  2. When I installed the wheel I built and started the head node with ray start --head --dashboard-host 0.0.0.0 --dashboard-port 8265 --block, something strange happened: after about 3 minutes, the process exited. Here is the stdout:
...
Some Ray subprocesses exited unexpectedly:
  dashboard [exit code=255]

Remaining processes will be killed.
  3. I found these logs in session_latest/logs/dashboard.log:
...
2022-12-20 21:59:21,240 INFO http_server_head.py:142 -- Registered 51 routes.
2022-12-20 21:59:21,242 INFO datacenter.py:70 -- Purge data.
2022-12-20 21:59:21,242 INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.0622056 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['GCS'].
2022-12-20 21:59:21,244 INFO usage_stats_head.py:102 -- Usage reporting is enabled.
2022-12-20 21:59:21,244 INFO actor_head.py:111 -- Getting all actor info from GCS.
2022-12-20 21:59:21,246 INFO actor_head.py:137 -- Received 0 actor info from GCS.
2022-12-20 21:59:32,244 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 21:59:48,245 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:04,248 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:20,252 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:36,255 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:00:52,257 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:08,260 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:24,263 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:40,267 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:01:56,270 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:127 -- Failed to check gcs health, client timed out.
2022-12-20 22:02:12,273 ERROR head.py:138 -- Dashboard exiting because it received too many GCS RPC errors count: 11, threshold is 10.

And in session_latest/logs/dashboard_agent.log:

...
2022-12-20 21:59:23,084	INFO event_agent.py:46 -- Report events to 10.9.2.41:34684
2022-12-20 21:59:23,084	INFO event_utils.py:123 -- Monitor events logs modified after 1671542961.9415762 on /tmp/ray/session_2022-12-20_21-59-19_363980_201878/logs/events, the source types are ['COMMON', 'CORE_WORKER', 'RAYLET'].
2022-12-20 22:02:13,087	ERROR reporter_agent.py:809 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 806, in _perform_iteration
    await publisher.publish_resource_usage(self._key, jsonify_asdict(stats))
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 452, in publish_resource_usage
    await self._stub.GcsPublish(req)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1671544933.087241918","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1671544933.087241207","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-12-20 22:02:13,602	ERROR agent.py:217 -- Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] 	NodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms
    [state-dump] 	NodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms
    [state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms
    [state-dump] 	NodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us
    [state-dump] 	NodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms
    [state-dump] 	NodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms
    [state-dump] 	PeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us
    [state-dump] 	NodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms
    [state-dump] 	AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us
    [state-dump] 	NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us
    [state-dump] 	JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us
    [state-dump] 	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us
    [state-dump] 	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump] DebugString() time ms: 0
    [state-dump] 
    [state-dump] 

2022-12-20 22:03:13,680	ERROR utils.py:224 -- Failed to publish error job_id: "\377\377\377\377"
type: "raylet_died"
error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n    [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n    [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n    [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n    [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n    [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n    [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n    [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n    [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n    [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n    [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n    [state-dump] \tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n    [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n    [state-dump] DebugString() time ms: 0\n    [state-dump] \n    [state-dump] \n"
timestamp: 1671544933.6033757
Traceback (most recent call last):
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/utils.py", line 222, in publish_error_to_driver
    gcs_publisher.publish_error(job_id.hex().encode(), error_data)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 169, in publish_error
    self._gcs_publish(req)
  File "/home2/hanwen.qiu/miniconda3/envs/ray_build/lib/python3.8/site-packages/ray/_private/gcs_pubsub.py", line 191, in _gcs_publish
    raise TimeoutError(f"Failed to publish after retries: {req}")
TimeoutError: Failed to publish after retries: pub_messages {
  channel_type: RAY_ERROR_INFO_CHANNEL
  key_id: "ffffffff"
  error_info_message {
    job_id: "\377\377\377\377"
    type: "raylet_died"
    error_message: "Raylet is terminated: ip=10.9.2.41, id=5fa7195f6fdcb2f6f9f378604ecc5253871ddde0dffc54186ee82d09. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:\n    [state-dump] \tNodeManagerService.grpc_server.RequestResourceReport - 1187 total (0 active), CPU time: mean = 52.014 us, total = 61.740 ms\n    [state-dump] \tNodeManagerService.grpc_server.UpdateResourceUsage - 1186 total (0 active), CPU time: mean = 43.173 us, total = 51.204 ms\n    [state-dump] \tRayletWorkerPool.deadline_timer.kill_idle_workers - 600 total (1 active), CPU time: mean = 4.524 us, total = 2.715 ms\n    [state-dump] \tNodeManager.deadline_timer.flush_free_objects - 120 total (1 active), CPU time: mean = 3.878 us, total = 465.349 us\n    [state-dump] \tNodeManagerService.grpc_server.GetResourceLoad - 120 total (0 active), CPU time: mean = 38.647 us, total = 4.638 ms\n    [state-dump] \tNodeManagerService.grpc_server.GetNodeStats - 119 total (0 active), CPU time: mean = 475.065 us, total = 56.533 ms\n    [state-dump] \tNodeManager.deadline_timer.record_metrics - 24 total (1 active), CPU time: mean = 155.176 us, total = 3.724 ms\n    [state-dump] \tNodeManager.deadline_timer.debug_state_dump - 12 total (1 active), CPU time: mean = 319.161 us, total = 3.830 ms\n    [state-dump] \tPeriodicalRunner.RunFnPeriodically - 7 total (0 active), CPU time: mean = 271.586 us, total = 1.901 ms\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 75.287 us, total = 150.575 us\n    [state-dump] \tNodeManager.deadline_timer.print_event_loop_stats - 2 total (1 active, 1 running), CPU time: mean = 265.413 us, total = 530.827 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 50.795 ms, total = 50.795 ms\n    [state-dump] \tAgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 225.880 us, total = 225.880 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 381.775 us, total = 381.775 us\n    [state-dump] \tJobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 4.573 us, total = 4.573 us\n    [state-dump] \tNodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 39.528 us, total = 39.528 us\n    [state-dump] \tInternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s\n    [state-dump] DebugString() time ms: 0\n    [state-dump] \n    [state-dump] \n"
    timestamp: 1671544933.6033757
  }
}
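
The dashboard.log excerpt above shows the dashboard counting consecutive GCS health-check failures and exiting once a threshold (10) is crossed, which in turn takes down the whole head node. A rough sketch of that logic, with assumed names and constants rather than the actual head.py code:

import asyncio

CHECK_INTERVAL_S = 16      # matches the ~16 s spacing between the log lines
RPC_ERROR_THRESHOLD = 10   # "threshold is 10" in the final dashboard.log line

async def gcs_health_check_loop(gcs_client):
    failures = 0
    while True:
        try:
            await gcs_client.check_alive(timeout=10)  # hypothetical client call
            failures = 0
        except Exception:
            failures += 1
            print("Failed to check gcs health, client timed out.")
            if failures > RPC_ERROR_THRESHOLD:
                # The dashboard exits; `ray start` then reports
                # "dashboard [exit code=255]" and kills the remaining processes.
                raise SystemExit(255)
        await asyncio.sleep(CHECK_INTERVAL_S)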


  4. It looks like the wheel I built broke something in the GCS? To make sure this was not related to the source code I had changed, I followed the wheel-build workflow again and rebuilt directly from the original releases/2.0.0 branch. The same error occurred as above.

Am I doing something wrong? The official documentation only describes how to build and install after modifying the source code; I cannot find the actual build-and-release process, so maybe something is wrong in my wheel-building workflow. If you have any idea, please let me know! Thank you!

Have a nice day!

Versions / Dependencies

OS: Ubuntu 16.04.5 LTS (Xenial Xerus)
Python: 3.8.15
Ray: ray-2.0.0

Reproduction script

docker run -e TRAVIS_COMMIT= --rm -w /ray -v "$(pwd)":/ray -ti quay.io/pypa/manylinux2014_x86_64 /ray/python/build-wheel-manylinux2014.sh
pip3 install .whl/ray-2.0.0-cp38-cp38-manylinux2014_x86_64.whl
ray start --head --dashboard-host 0.0.0.0 --dashboard-port 8265 --block
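
A quick sanity check after installing the wheel, before starting the head node (a minimal snippet):

import ray
print(ray.__version__)  # confirm the wheel's Python-side version string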

Issue Severity

None

AndreKuu added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage) labels Dec 21, 2022
AndreKuu changed the title from [Dashboard] to [Dashboard] Head node exited unexpectedly because the dashboard process exited Dec 21, 2022
AndreKuu (Contributor, Author) commented Dec 23, 2022

I finally traced the problem to the machine the wheel was built on, although I don't know why the difference between machines caused the issue. More details are in the related discussion: https://discuss.ray.io/t/built-an-unavailable-wheel-with-doc/8725/15

AndreKuu reopened this Dec 24, 2022

AndreKuu (Contributor, Author) commented:
I realized that the problem might be that I changed the version number.

AndreKuu (Contributor, Author) commented Dec 24, 2022

I think it is caused by the version modification.
In my business scenario, I need to modify some source code and produce a new distribution release of the project. For example, I change line 107 in python/ray/__init__.py to __version__ = "3.0.0.1" and push it to our private package index.

I grepped for the version in the repo and found line 55 in src/ray/common/constants.h:
constexpr char kRayVersion[] = "3.0.0.dev0";
I changed this version to match python/ray/__init__.py and built the wheel again. The problem is solved.

First, to be honest, I'm not sure this change is sufficient, assuming the issue really is that these two versions need to be consistent. Is my modification correct? Is there anything missing?

Second, if the issue is just down to this mismatch, how about a proposal?
Proposal:

  • Would it be necessary and correct to modify the build script so that the version variable in the header file src/ray/common/constants.h is kept in sync with the version in python/ray/__init__.py during the build? (A minimal sketch of such a sync step follows this list.)
  • Alternatively, the right fix may be to add related error logs when this mismatch occurs; as it stands, the problem cannot be located clearly because there is no informative log.
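
A minimal sketch of the first option (the regexes and repo-relative paths are assumptions, not an existing Ray build step), to be run before invoking the wheel build:

import re
from pathlib import Path

# Read the canonical version from python/ray/__init__.py ...
init_text = Path("python/ray/__init__.py").read_text()
version = re.search(r'__version__\s*=\s*"([^"]+)"', init_text).group(1)

# ... and rewrite kRayVersion in src/ray/common/constants.h to match.
header = Path("src/ray/common/constants.h")
patched = re.sub(
    r'(constexpr char kRayVersion\[\] = ")[^"]*(")',
    lambda m: m.group(1) + version + m.group(2),
    header.read_text(),
)
header.write_text(patched)
print(f"kRayVersion synced to {version}")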

scottsun94 (Contributor) commented:
cc: @scv119

rkooo567 (Contributor) commented Jan 5, 2023

Looks like these are the changes we make whenever we cut a new release (I assume we use some sort of script).

c5f0aeb#diff-f95026a08bcb464b58b036437876716d21d3b8630e61258303bcd5384d1d707c

I think what you changed is the strictly necessary part; the other occurrences may not be that important.

I think the proposal makes sense. Feel free to create a PR to support it!

rkooo567 added the P2 (Important issue, but not time-critical) and core (Issues that should be addressed in Ray Core) labels and removed the triage label Jan 20, 2023