[Core][GCS FT] Raylet on the worker node crashes unexpectedly when head crashes #41343

Closed
kevin85421 opened this issue Nov 22, 2023 · 1 comment · Fixed by #41777
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-gcs (Ray core global control storage), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release)

Comments

@kevin85421
Member

kevin85421 commented Nov 22, 2023

What happened + What you expected to happen

In KubeRay, there is a test named test_detached_actor.

  • Step 1: Create a RayCluster

    • 1 head: Set num-cpus: 0 to prevent the detached actor from being scheduled on the head node.
      • RAY_gcs_rpc_server_reconnect_timeout_s: 20
    • 1 worker
      • RAY_gcs_rpc_server_reconnect_timeout_s: 120
    • GCS fault tolerance enabled.
  • Step 2: Create a detached actor TestCounter; the actor will be scheduled on the worker node. Then, call the actor's increment function twice; the expected results should be 1 and 2 (see the sketch after this list).

  • Step 3: Kill the GCS process on the head node. The head node will crash after 20s (RAY_gcs_rpc_server_reconnect_timeout_s).

  • Step 4: Wait until the new head node recovers.

  • Step 5: Call the function increment again, and the expected result should be 3.
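
For reference, the actor and retry logic used by the test look roughly like the minimal sketch below, reconstructed from the traceback in the log. The actor name, namespace, timeout, and the `retry_with_timeout` helper are assumptions; the `ray.init()` call is simplified (the real test connects through the Ray Client), and the actual `samples/test_detached_actor.py` script in KubeRay may differ.

```python
import time
import ray

ray.init(namespace="default")  # assumption: simplified; the real test connects via Ray Client

@ray.remote
class TestCounter:
    """Counter actor used to verify that state survives a head-node crash."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


def retry_with_timeout(func, timeout=90):
    """Hypothetical helper reconstructed from the traceback: retry `func`
    until it succeeds or the timeout expires."""
    err = None
    deadline = time.time() + timeout
    i = 0
    while time.time() < deadline:
        print(f"retry iter: {i}")
        i += 1
        try:
            return func()
        except Exception as e:
            err = e
            time.sleep(1)
    raise err


# Step 2: create the detached actor; with num-cpus=0 on the head node,
# it can only be scheduled on the worker node.
tc = TestCounter.options(name="test_counter", lifetime="detached").remote()
assert retry_with_timeout(lambda: ray.get(tc.increment.remote())) == 1
assert retry_with_timeout(lambda: ray.get(tc.increment.remote())) == 2
```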

This test is very flaky with the Ray 2.8.0 and nightly images. In Step 5, it fails because the Raylet process on the worker node crashes when the dashboard agent process fails unexpectedly. See the section "`test_detached_actor_2.py`'s log (Step 5)" below for more details.

`test_detached_actor_2.py`'s log (Step 5)

```
SIGTERM handler is not set because current thread is not the main thread.
Get TestCounter actor.
retry iter: 0
Try to call remote function 'increment'.
retry iter: 0
retry iter: 1
retry iter: 2
(autoscaler +10s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +10s) Error: No available node types can fulfill resource request {'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
retry iter: 3
retry iter: 4
retry iter: 5
retry iter: 6
retry iter: 7
retry iter: 8
retry iter: 9
retry iter: 10
retry iter: 11
retry iter: 12
retry iter: 13
retry iter: 14
(autoscaler +45s) Error: No available node types can fulfill resource request {'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
retry iter: 15
retry iter: 16
retry iter: 17
retry iter: 18
retry iter: 19
retry iter: 20
retry iter: 21
retry iter: 22
retry iter: 23
retry iter: 24
retry iter: 25
(autoscaler +1m20s) Error: No available node types can fulfill resource request {'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
retry iter: 26
retry iter: 27
retry iter: 28
retry iter: 29
retry iter: 30
(raylet, ip=10.244.0.7) [2023-11-22 13:47:00,634 E 24 59] (raylet) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent/1015601482.
(raylet, ip=10.244.0.7) The raylet fate shares with the agent. This can happen because
(raylet, ip=10.244.0.7) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet, ip=10.244.0.7) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.244.0.7) - The agent is killed by the OS (e.g., out of memory).
Traceback (most recent call last):
  File "samples/test_detached_actor_2.py", line 32, in <module>
    val = retry_with_timeout(lambda: ray.get(tc.increment.remote()))
  File "samples/test_detached_actor_2.py", line 18, in retry_with_timeout
    raise err
  File "samples/test_detached_actor_2.py", line 13, in retry_with_timeout
    return func()
  File "samples/test_detached_actor_2.py", line 32, in <lambda>
    val = retry_with_timeout(lambda: ray.get(tc.increment.remote()))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 102, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
command terminated with exit code 1
```

dashboard_agent.log on the worker node

```
2023-11-22:19:52:33,301 INFO     [utils.py:358] Execute command: kubectl exec raycluster-external-redis-worker-small-group-2f9qt -n default -- cat /tmp/ray/session_latest/logs/dashboard_agent.log
Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
2023-11-22 11:52:31,598	INFO agent.py:145 -- Dashboard agent grpc address: 0.0.0.0:55177
2023-11-22 11:52:31,598	INFO utils.py:112 -- Get all modules by type: DashboardAgentModule
2023-11-22 11:52:32,155	INFO utils.py:145 -- Available modules: [<class 'ray.dashboard.modules.event.event_agent.EventAgent'>, <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>, <class 'ray.dashboard.modules.job.job_agent.JobAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgent'>, <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>, <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>, <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>]
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2023-11-22 11:52:32,155	INFO event_agent.py:38 -- Event agent cache buffer size: 10240
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.healthz.healthz_agent.HealthzAgent'>
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.job.job_agent.JobAgent'>
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgentV1Grpc'>
2023-11-22 11:52:32,155	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2023-11-22 11:52:32,157	INFO agent.py:161 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.serve.serve_rest_api_impl.create_serve_rest_api.<locals>.ServeRestApiImpl'>
2023-11-22 11:52:32,157	INFO agent.py:166 -- Loaded 7 modules.
2023-11-22 11:52:32,159	INFO http_server_agent.py:71 -- Dashboard agent http address: 0.0.0.0:52365
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/local_raylet_healthz> -> <function HealthzAgent.health_check at 0x7f4070691700>
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/local_raylet_healthz> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [POST] <PlainResource  /api/job_agent/jobs/> -> <function JobAgent.submit_job at 0x7f4070656280>
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/job_agent/jobs/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [POST] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <function JobAgent.stop_job at 0x7f40705db5e0>
2023-11-22 11:52:32,159	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/stop> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <function JobAgent.delete_job at 0x7f40705db790>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <function JobAgent.get_job_logs at 0x7f40705db940>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <function JobAgent.tail_job_logs at 0x7f40705dbaf0>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <DynamicResource  /api/job_agent/jobs/{job_or_submission_id}/logs/tail> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/ray/version> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_version at 0x7f40702320d0>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/ray/version> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/serve/deployments/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_all_deployments at 0x7f4070232160>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/deployments/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_serve_instance_details at 0x7f40702323a0>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/applications/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <PlainResource  /api/serve/deployments/status> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.get_all_deployment_statuses at 0x7f40702325e0>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/deployments/status> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/deployments/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.delete_serve_application at 0x7f4070232820>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/deployments/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [DELETE] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.delete_serve_applications at 0x7f4070232a60>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/applications/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [PUT] <PlainResource  /api/serve/deployments/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.put_all_deployments at 0x7f4070232c10>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/deployments/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,160	INFO http_server_agent.py:78 -- <ResourceRoute [PUT] <PlainResource  /api/serve/applications/> -> <function create_serve_rest_api.<locals>.ServeRestApiImpl.put_all_applications at 0x7f4070232e50>
2023-11-22 11:52:32,161	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <PlainResource  /api/serve/applications/> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,161	INFO http_server_agent.py:78 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-11-22_11-40-23_265582_9/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-11-22_11-40-23_265582_9/logs')>>
2023-11-22 11:52:32,161	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-11-22_11-40-23_265582_9/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f40701719a0>>
2023-11-22 11:52:32,161	INFO http_server_agent.py:79 -- Registered 30 routes.
2023-11-22 11:52:32,163	INFO process_watcher.py:44 -- raylet pid is 24
2023-11-22 11:52:32,164	INFO event_agent.py:56 -- Report events to 10.244.0.9:41103
2023-11-22 11:52:32,165	INFO event_utils.py:132 -- Monitor events logs modified after 1700680952.030946 on /tmp/ray/session_2023-11-22_11-40-23_265582_9/logs/events, the source types are all.
```

Versions / Dependencies

Ray 2.8.0

I tracked all failed GitHub Actions workflows in KubeRay CI. The failure first happened on Oct. 18, 2023 (error logs). At that time, the nightly image was Ray 2.8.0, which was published on November 1st.

Reproduction script

Run test_detached_actor.py in KubeRay CI. It is very flaky in the KubeRay CI (GitHub Actions), but I cannot reproduce it in my environment. It has become very flaky over the past 3 weeks. The sketch below shows roughly what the failing verification step does.
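
For context, the verification step that fails (Step 5, `test_detached_actor_2.py`) essentially reconnects to the recovered cluster, looks up the detached actor by name, and retries `increment` until it returns 3. Below is a minimal, self-contained sketch under assumptions: the `retry_with_timeout` helper is reconstructed from the traceback, and the Ray Client address, namespace, and actor name are illustrative, not taken from the actual KubeRay test.

```python
import time
import ray


def retry_with_timeout(func, timeout=120):
    """Hypothetical helper mirroring the one in the traceback: retry until success."""
    err = None
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            return func()
        except Exception as e:
            err = e
            time.sleep(1)
    raise err


# The ray.util.client frames in the traceback suggest the test connects via
# Ray Client; the head service address and namespace below are illustrative.
ray.init("ray://raycluster-external-redis-head-svc:10001", namespace="default")

# Step 5: the detached actor created before the crash should still be reachable,
# and its state should have survived the head-node restart.
tc = retry_with_timeout(lambda: ray.get_actor("test_counter"))
val = retry_with_timeout(lambda: ray.get(tc.increment.remote()))
assert val == 3, f"expected 3, got {val}"
```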

Issue Severity

High: It blocks me from completing my task.

kevin85421 added the bug and triage labels on Nov 22, 2023
kevin85421 changed the title to "[WIP][Core][GCS FT] Raylet on the worker node crashes unexpectedly when head crashes" on Nov 22, 2023
kevin85421 added the release-blocker, P0, and core-gcs labels and removed the triage label on Nov 22, 2023
kevin85421 changed the title back to "[Core][GCS FT] Raylet on the worker node crashes unexpectedly when head crashes" on Nov 22, 2023
jonathan-anyscale added the core label on Nov 23, 2023
jonathan-anyscale self-assigned this on Nov 23, 2023
@jonathan-anyscale
Contributor

jonathan-anyscale commented Dec 8, 2023

Update: Still triaging the root cause.

What we found so far:

  • The issue can be reproduced pretty often: killing the head node's GCS server about 3 times will crash the dashboard agent (see the reproduction sketch after the screenshot below).
  • The main issue is that the dashboard_agent crashed with a segmentation fault instead of exiting with exit code 0 as described in raylet.out.
  • Stack trace: the segfault happened in the GCS client's Connect call.
(Screenshot: stack trace, captured 2023-12-08 at 1:28 AM)
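
To reproduce along the lines described above, one can repeatedly kill the GCS process on the head pod and then inspect the worker's dashboard agent log. The sketch below is illustrative only: the head pod name, namespace, wait time, and the use of `pkill gcs_server` inside the container are assumptions about the test environment; only the worker pod name and log path come from the log above.

```python
import subprocess
import time

# Hypothetical pod names; the worker pod name comes from the log above.
HEAD_POD = "raycluster-external-redis-head-xxxxx"
WORKER_POD = "raycluster-external-redis-worker-small-group-2f9qt"
NAMESPACE = "default"


def kubectl(*args):
    """Run a kubectl command and return its stdout (raises on failure)."""
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout


for attempt in range(3):
    # Kill the GCS server process on the head pod; with GCS FT enabled,
    # the worker node is expected to reconnect once the GCS recovers.
    kubectl("exec", HEAD_POD, "-n", NAMESPACE, "--", "pkill", "gcs_server")
    time.sleep(60)  # rough wait for the head pod to restart and GCS to recover

    # Check whether the worker's dashboard agent crashed (it fate-shares
    # with the raylet).
    log = kubectl(
        "exec", WORKER_POD, "-n", NAMESPACE, "--",
        "cat", "/tmp/ray/session_latest/logs/dashboard_agent.log",
    )
    print(f"--- attempt {attempt}: tail of dashboard_agent.log ---")
    print("\n".join(log.splitlines()[-20:]))
```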
