
Use ray start block in Pod's entrypoint #77

Merged Dec 2, 2021 (7 commits)

Conversation

chenk008
Contributor

Signed-off-by: chenk008 <kongchen28@gmail.com>

Why are these changes needed?

When a Ray process (e.g. raylet, GCS) exits, the Pod should restart so that the Ray process can fail over.

This PR generates the Pod args with `ray start --block`. It also removes the test script ray-code: when Ray starts with --block, ray-code would not be executed.
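For illustration, the generated entrypoint might look roughly like the sketch below; the image tag and the exact flags are assumptions based on the settings pasted later in this thread, not the operator's literal output:

```yaml
# Hypothetical sketch of a head-group container spec using --block.
# With --block, `ray start` stays in the foreground and exits when a
# Ray process (raylet, GCS, ...) dies, so Kubernetes restarts the Pod.
containers:
  - name: ray-head
    image: rayproject/ray:1.8.0        # example image, not fixed by this PR
    command: ["/bin/bash", "-c", "--"]
    args:
      - "ulimit -n 65536; ray start --head --port=6379 --dashboard-host=0.0.0.0 --block"
```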

Related issue number

Close #62

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: chenk008 <kongchen28@gmail.com>
@chenk008 chenk008 changed the title use ray start block Use ray start block in Pod's entrypoint Oct 18, 2021
@chenk008 chenk008 requested review from akanso and Jeffwan and removed request for akanso October 18, 2021 14:07
@chenk008
Contributor Author

@akanso @Jeffwan ray-project/ray#19546 is merged. I think this PR is a workaround to support raylet failover until we get liveness exec probes.
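Once exec probes are available, a liveness check could look roughly like the sketch below; the probe command (`ray health-check`) and the thresholds are assumptions for illustration, not something this PR adds:

```yaml
# Hypothetical liveness probe: assumes a `ray health-check` command is
# available in the image and returns non-zero when Ray is unhealthy.
livenessProbe:
  exec:
    command: ["bash", "-c", "ray health-check"]
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 3
```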

@Jeffwan
Collaborator

Jeffwan commented Nov 26, 2021

@chenk008 Let's rebase the change and move this forward. As @chenk008 mentioned, here's another issue that encountered the same problem: #104

@chenk008
Contributor Author

chenk008 commented Nov 29, 2021

I have removed sample_code from the RayCluster CRD samples. I think we can move job submission to the RayCluster CRD. Here is the related issue: #106

@Jeffwan
Collaborator

Jeffwan commented Nov 29, 2021

Sounds good. One last comment: since Ali raised the question, can we at least keep one example file that uses code.py (with a custom command to override the entrypoint)? Users will then know how to use the current solution to submit jobs. What's more, if some users do not like the block approach, they will know how to switch back to the sleep infinity approach. @chenk008
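Such a kept example might override the generated entrypoint roughly as sketched below; the script path and flags are illustrative, not the actual sample file:

```yaml
# Hypothetical custom command: start Ray without --block, run the sample
# job, then keep the container alive the old way with `sleep infinity`.
command: ["/bin/bash", "-c", "--"]
args:
  - "ray start --head --port=6379 && python /opt/code.py && sleep infinity"
```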

@akanso
Collaborator

akanso commented Nov 29, 2021

Yes, that is a good idea: have one example using --block, and the others without it.

@Jeffwan
Collaborator

Jeffwan commented Nov 30, 2021

@chenk008 Did you get a chance to verify the changes? I used the following steps to verify the restarts.

  1. Create a cluster using --block, verify from dashboard there's 1 head and 1 worker
  2. Delete head node and wait for worker node to join the ray cluster
  3. However, it seems that even though the connection is broken, the worker does not exit; it is still up.
    k logs -f raycluster-complete-worker-small-group-q9bcs
    [2021-11-30 06:45:41,315 W 8 8] global_state_accessor.cc:365: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?


    (base) ray@raycluster-complete-worker-small-group-q9bcs:~$ ls -al /tmp/ray/session_latest/logs/raylet.*
    -rw-r--r-- 1 ray users    0 Nov 30 06:45 /tmp/ray/session_latest/logs/raylet.err
    -rw-r--r-- 1 ray users 1691 Nov 30 06:45 /tmp/ray/session_latest/logs/raylet.out

raylet.out logs

    k exec -it raycluster-complete-worker-small-group-q9bcs bash
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    Defaulted container "machine-learning" out of: machine-learning, init-myservice (init)
    (base) ray@raycluster-complete-worker-small-group-q9bcs:~$ tail -f /tmp/ray/session_latest/logs/raylet.out
    [2021-11-30 06:45:41,320 I 17 17] store_runner.cc:46: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
    [2021-11-30 06:45:41,320 I 17 41] dlmalloc.cc:146: create_and_mmap_buffer(306053128, /dev/shm/plasmaXXXXXX)
    [2021-11-30 06:45:41,321 I 17 17] grpc_server.cc:71: ObjectManager server started, listening on port 41749.
    [2021-11-30 06:45:41,323 I 17 17] node_manager.cc:285: Initializing NodeManager with ID fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272
    [2021-11-30 06:45:41,323 I 17 17] grpc_server.cc:71: NodeManager server started, listening on port 36855.
    [2021-11-30 06:45:41,326 I 17 50] agent_manager.cc:78: Monitor agent process with pid 49, register timeout 30000ms.
    [2021-11-30 06:45:41,328 I 17 17] raylet.cc:100: Raylet of id, fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272 started. Raylet consists of node_manager and object_manager. node_manager address: 10.244.0.21:36855 object_manager address: 10.244.0.21:41749 hostname: 10.244.0.21
    [2021-11-30 06:45:41,334 I 17 17] service_based_accessor.cc:610: Received notification for node id = 79612b988e8802e35b4c2ab179b71c6d065c0be48cdbd27035d8b88a, IsAlive = 1
    [2021-11-30 06:45:41,334 I 17 17] service_based_accessor.cc:610: Received notification for node id = fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272, IsAlive = 1
    [2021-11-30 06:45:42,664 I 17 17] agent_manager.cc:34: HandleRegisterAgent, ip: 10.244.0.21, port: 44559, pid: 49

Settings:

## Head

    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start --head --redis-password=LetMeInRay --object-store-memory=100000000 --port=6379 --node-manager-port=12346 --object-manager-port=12345 --dashboard-host=0.0.0.0 --node-ip-address=$MY_POD_IP --num-cpus=1 --block

## Worker

    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start --block --node-ip-address=$MY_POD_IP --redis-password=LetMeInRay --address=raycluster-complete-head-svc:6379

Is there anything I am missing here?
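As an aside, in the RayCluster CR these entrypoint args are driven by rayStartParams (the commit history below mentions "add block into rayStartParams"); a head group might be configured roughly like this sketch, with the exact schema assumed:

```yaml
# Hypothetical RayCluster excerpt; the field names follow the
# rayStartParams idea from this PR's commits, but the exact schema
# here is an assumption.
headGroupSpec:
  rayStartParams:
    block: "true"              # rendered into `ray start ... --block`
    port: "6379"
    dashboard-host: "0.0.0.0"
```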

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

> yes, that is a good idea, to have one example using the --block, and the others without it

I think --block should be the default config. Without the --block flag, the Pod will be useless if the raylet exits. The ability to fail over is a basic requirement.

@akanso
Collaborator

akanso commented Dec 2, 2021

I am good with the PR.

Can we add a YAML comment in the file, e.g. `# Without the --block flag ...`, just to explain to the user of the example the impact of --block?
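Such a comment might read something like the sketch below (wording is illustrative):

```yaml
# `ray start ... --block` keeps the entrypoint in the foreground, so the
# container exits (and Kubernetes restarts the Pod) when a Ray process dies.
# Without the --block flag, the container keeps running even after raylet
# or GCS has exited, and the Pod never fails over.
```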

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

Yeah, I will add a YAML comment.

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

@Jeffwan I think there is an issue in Ray core.

I did the same test with ray:1.8. When the head node exited and restarted, the other raylets (all except the raylet on the head node) stayed alive but did not reconnect to the head node; on the dashboard we can see only one host, which is the head.

The other raylets exit 14 minutes after the head exited. The log is shown below:

    [2021-12-01 17:51:39,367 W 8 8] global_state_accessor.cc:427: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
    [2021-12-01 17:51:40,369 W 8 8] global_state_accessor.cc:427: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
    2021-12-01 17:51:37,194	INFO scripts.py:740 -- Local node IP: 10.244.0.19
    2021-12-01 17:51:41,372	SUCC scripts.py:748 -- --------------------
    2021-12-01 17:51:41,372	SUCC scripts.py:749 -- Ray runtime started.
    2021-12-01 17:51:41,372	SUCC scripts.py:750 -- --------------------
    2021-12-01 17:51:41,372	INFO scripts.py:752 -- To terminate the Ray runtime, run
    2021-12-01 17:51:41,372	INFO scripts.py:753 --   ray stop
    2021-12-01 17:51:41,372	INFO scripts.py:757 -- --block
    2021-12-01 17:51:41,372	INFO scripts.py:759 -- This command will now block until terminated by a signal.
    2021-12-01 17:51:41,372	INFO scripts.py:761 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
    2021-12-01 18:05:16,547	ERR scripts.py:769 -- Some Ray subprcesses exited unexpectedly:
    2021-12-01 18:05:16,548	ERR scripts.py:776 -- raylet [exit code=1]
    2021-12-01 18:05:16,548	ERR scripts.py:780 -- Remaining processes will be killed.

@chenk008 chenk008 merged commit 3102c53 into ray-project:master Dec 2, 2021
Jeffwan added a commit that referenced this pull request Mar 14, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* use ray start block

Signed-off-by: chenk008 <kongchen28@gmail.com>

* add block into rayStartParams

* fix ut

* add block in sample config

* add sample without block

Co-authored-by: wuhua.ck <wuhua.ck@alibaba-inc.com>
Successfully merging this pull request may close these issues.

[Feature] Restart worker pod when raylet exited