
[Post v0.5.0] Remove init containers from YAML files #1010

Merged · 1 commit merged into ray-project:master on Apr 5, 2023

Conversation

kevin85421 (Member) commented Apr 4, 2023

Why are these changes needed?

See #973 for more details.
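
For context, the sample YAMLs previously shipped a wait-for-head-Service init container for each worker group, similar to the sketch below (illustrative only; the exact image and command vary across samples, and RAY_IP is assumed to resolve to the head Service name):

# Illustrative sketch of the kind of block this PR removes from the sample YAMLs.
initContainers:
  - name: init
    image: busybox:1.28
    # Waits only for the head Kubernetes Service to resolve, not for the GCS server.
    command:
      - sh
      - -c
      - until nslookup $RAY_IP; do echo "Waiting for K8s Service $RAY_IP."; sleep 2; done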

test_sample_raycluster_yamls.py

[Screenshot: test_sample_raycluster_yamls.py output]

RAY_IMAGE=rayproject/ray:2.3.0 OPERATOR_IMAGE=kuberay/operator:v0.5.0 python3 tests/test_sample_raycluster_yamls.py 2>&1 | tee log

test_sample_rayservice_yamls.py

[Screenshot: test_sample_rayservice_yamls.py output]

RAY_IMAGE=rayproject/ray:2.3.0 OPERATOR_IMAGE=kuberay/operator:v0.5.0 python3 tests/test_sample_rayservice_yamls.py 2>&1 | tee log

Test RayCluster Helm chart

kind create cluster --image=kindest/node:v1.23.0
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
# path: helm-chart/ray-cluster
helm install raycluster .

test_security.py

[Screenshot: test_security.py output]

# path: kuberay/
OPERATOR_IMAGE=kuberay/operator:v0.5.0 python3 tests/test_security.py

Test RayJob YAML (ray_v1alpha1_rayjob.yaml)

kind create cluster --image=kindest/node:v1.23.0
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0
# path: ray-operator/config/samples
kubectl apply -f ray_v1alpha1_rayjob.yaml

# Wait for 1 min
kubectl describe rayjobs.ray.io rayjob-sample | grep "Job Status"
# Job Status:             SUCCEEDED

Related issue number

#973
Closes #974

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

kevin85421 (Member, Author) commented:

cc @jasoonn

kevin85421 (Member, Author) commented:

@Jeffwan @scarlet25151 @wilsonwang371 would you mind reviewing the changes in apiserver/pkg/util/cluster.go? Thanks!

@kevin85421 kevin85421 marked this pull request as ready for review April 5, 2023 18:13
@kevin85421 kevin85421 changed the title from "[WIP] Remove init containers from YAML files" to "Remove init containers from YAML files" on Apr 5, 2023
kevin85421 (Member, Author) commented:

I will update the kuberay-helm repository after this PR is merged.

@kevin85421 kevin85421 changed the title from "Remove init containers from YAML files" to "[Post v0.5.0] Remove init containers from YAML files" on Apr 5, 2023
scarlet25151 (Collaborator) left a comment:

This PR removes the initContainer config from the API server and moves that logic to the operator-side change in #973, so the config is aligned with the sample YAMLs. /LGTM.

@kevin85421 kevin85421 merged commit ded9454 into ray-project:master Apr 5, 2023
apiserver/pkg/util/cluster.go
@@ -216,17 +216,6 @@ func buildWorkerPodTemplate(imageVersion string, envs map[string]string, spec *a
        Annotations: buildNodeGroupAnnotations(computeRuntime, spec.Image),
    },
    Spec: v1.PodSpec{
        InitContainers: []v1.Container{
Collaborator commented on this diff:

Sorry for the late reply. BTW, I'm trying to get more background on this: the workers will fail at the beginning, and even though they will restart, that can be misleading from an observability perspective. Is that a concern?

kevin85421 (Member, Author) replied:

Currently, the old init container logic is wrong: it waits for the head Service rather than the GCS server. The head Service is ready as soon as the image pull finishes. The workers will not fail under normal conditions because Ray internally retries connecting to the GCS server multiple times. However, the retry mechanism in Ray has a timeout, so the workers will fail if the GCS server cannot become ready within that timeout.

In #973, we inject a default init container that uses ray health-check to wait until the GCS server is ready. We removed the nslookup init container after release v0.5.0 to keep the compatibility philosophy of KubeRay (see #940 for more details).
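
The injected container behaves roughly like the sketch below (the container name, environment variable, and GCS port are illustrative assumptions, not the operator's exact values):

# Sketch of the default init container injected by the KubeRay operator (#973).
# RAY_HEAD_SERVICE and the container name are hypothetical placeholders.
initContainers:
  - name: wait-gcs-ready
    image: rayproject/ray:2.3.0   # same image as the Ray container in the Pod
    command:
      - sh
      - -c
      - until ray health-check --address $RAY_HEAD_SERVICE:6379 > /dev/null 2>&1; do echo "Waiting for GCS to be ready."; sleep 5; done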

skliarpawlo commented Apr 17, 2023:

Hi, I also have a quick question about this change: it seems to prevent Helm chart users from configuring the Docker image used for the init container. Is there something I'm missing? We have a strong requirement to use an internal Docker registry, so this makes the 0.5.0 chart unusable as-is for us. I checked that the init container is still created and not removed completely.

kevin85421 (Member, Author) replied:

Hi @skliarpawlo, your question is unrelated to the KubeRay APIServer, so I opened a new thread to answer it.

kevin85421 (Member, Author) commented:

Hi @skliarpawlo,

In #973, the default init container is injected by the KubeRay operator automatically, and the injected container uses the same Docker image as the Ray container in the same Pod.

> We have a strong requirement to use an internal Docker registry, so this makes the 0.5.0 chart unusable as-is for us. I checked that the init container is still created and not removed completely.

This PR was merged after the KubeRay v0.5.0 chart was released. Hence, the Helm chart still has two init containers. I will update the Helm chart release tomorrow. The workaround for now is to remove the following lines from your local raycluster-cluster.yaml.

[Screenshot: init container lines to remove from raycluster-cluster.yaml]
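
The lines in question are most likely the chart's wait-for-head-Service init container in the worker group spec, much like the sketch in the PR description above; a hypothetical reminder (compare against your own raycluster-cluster.yaml before deleting):

# Hypothetical sketch; the exact block in your chart output may differ.
initContainers:
  - name: init
    image: busybox:1.28
    command:
      - sh
      - -c
      - until nslookup $RAY_IP; do echo "Waiting for K8s Service $RAY_IP."; sleep 2; done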

skliarpawlo replied:

Thank you! I'm trying to use the ray-cluster Helm chart as a dependency rather than copying it locally, so that I can upgrade it easily. If the fixed Helm chart is released soon, I can wait until then and just refresh my dependencies.

kevin85421 (Member, Author) commented:

Hi @skliarpawlo, I created release v0.5.1 for the Helm charts. You can use:

helm repo update
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.1
helm install raycluster kuberay/ray-cluster --version 0.5.1

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Successfully merging this pull request may close: [Feature] Remove init container from sample YAML files after release 0.5.0