Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Add default init container in workers to wait for GCS to be ready #973

Merged
merged 2 commits into from
Mar 20, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Mar 17, 2023

Why are these changes needed?

Currently, the init container logic is wrong. It waits for the head service rather than GCS server. The head service will be ready when the image pull finishes. The current retry logic is implemented by Ray internal.

initContainers:
# the env var $FQ_RAY_IP is set by the operator if missing, with the value of the head service name
- name: init
image: busybox:1.28
# Change the cluster postfix if you don't have a default setting
command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]

For example, add command: ["sleep 180"] in the headGroupSpec. Then, the head Pod command will be sleep 180 && ulimit -n 65536; ray start .... To clarify, the GCS server requires a minimum of 120 seconds to become ready after the head service is ready. It exceed the timeout of the Ray internal retry mechanism, so the worker will fail.

Screen Shot 2023-03-16 at 9 52 00 PM

In this PR, we add a default init container to use ray health-check to check the status of GCS so that can prevent this issue. In addition, each init container must complete successfully before the next one starts. Hence, it is fine for us to have two init containers at this moment to keep backward compatibility. We will remove the original init container from sample YAML files after release 0.5.0.

Related issue number

Closes #476

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
  • We have two init containers at this moment.
    Screen Shot 2023-03-16 at 10 05 51 PM

@kevin85421 kevin85421 changed the title WIP [Feature] Add default init container in workers to wait for GCS to be ready Mar 17, 2023
@kevin85421 kevin85421 marked this pull request as ready for review March 17, 2023 05:13
@kevin85421
Copy link
Member Author

cc @Yicheng-Lu-llll

Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
There's a slight concern some users may require additional configuration (security policy, etc.) copied over to the init container.
I wonder if we should copy the entire container spec and replace the entry point -- that could have unforeseen consequences for some users, though.
( OSS is hard :) )

@Yicheng-Lu-llll
Copy link
Contributor

Yicheng-Lu-llll commented Mar 18, 2023

Examined the implementation of the ray health-check: relevant PR, health-check code here, and here. The change Looks good to me!

btw, looks like ray health-check can also be used for the head pod's liveness probe.

@kevin85421
Copy link
Member Author

Looks good. There's a slight concern some users may require additional configuration (security policy, etc.) copied over to the init container.

Good point, I completely overlooked securityContext.

  • Test
    OPERATOR_IMAGE=controller:latest python3 tests/test_security.py 2>&1 | tee log
    
    # Use `kubectl edit pod -n pod-security ${WORKER_POD}` to check the init container's `securityContext`
    Screen Shot 2023-03-18 at 11 24 56 AM Screen Shot 2023-03-18 at 11 13 01 AM

I wonder if we should copy the entire container spec and replace the entry point -- that could have unforeseen consequences for some users, though. ( OSS is hard :) )

I prefer to make it simple at this moment. If we copy the entire Ray container, the init container may have many unused information (e.g. some environment variables.)

@kevin85421 kevin85421 merged commit ed44425 into ray-project:master Mar 20, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
… ready (ray-project#973)

Add default init container in workers to wait for GCS to be ready
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Clean up init container configuration and startup sequence.
4 participants