Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug][RayJob] RayJob with custom head service name #1332

Merged
merged 10 commits into from
Aug 16, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Aug 15, 2023

Why are these changes needed?

As @mmourafiq pointed out in #1326 (comment), some users prefer a deterministic head service name for RayJob in order to set up a reverse proxy. Although we've made the entire head service configurable, there are some issues that arise when we set the name of the head service:

  • Kubernetes Jobs cannot communicate with the Ray head because of an incorrect service name.
  • Ray workers fail to connect to the Ray head due to the same naming issue.

Related issue number

Closes #1294

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 0: Build a KubeRay operator image
# Step 1: Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0 --set image.repository=controller,image.tag=latest

# Step 2: Create a RayJob with custom head service
# https://gist.github.com/kevin85421/b8e0259467ffdb1b000c4177159f57e9#file-ray-job-custom-head-svc-name-yaml
kubectl apply -f ray-job-custom-head-svc-name.yaml

# Step 3: Check RayJob's status
kubectl get rayjobs.ray.io rayjob-sample -o=jsonpath='{.status.jobStatus}'
# [expected output]: "SUCCEEDED"

@kevin85421 kevin85421 changed the title [WIP] RayJob with custom head service name [Bug][RayJob] RayJob with custom head service name Aug 15, 2023
@mmourafiq
Copy link
Contributor

This looks good 👍

@kevin85421 kevin85421 marked this pull request as ready for review August 15, 2023 20:52
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

ray-operator/controllers/ray/utils/util.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util.go Show resolved Hide resolved
@architkulkarni
Copy link
Contributor

architkulkarni commented Aug 15, 2023

Do you think we should add your manual test YAML to the RayJob sample YAML tests so that the e2e behavior is tested in CI?

@kevin85421
Copy link
Member Author

Do you think we should add your manual test YAML to the RayJob sample YAML tests so that the e2e behavior is tested in CI?

This would be great. Should I wait for #1320 or #1321 to be merged or I can work on the test without these 2 PRs? Thanks!

@architkulkarni
Copy link
Contributor

Do you think we should add your manual test YAML to the RayJob sample YAML tests so that the e2e behavior is tested in CI?

This would be great. Should I wait for #1320 or #1321 to be merged or I can work on the test without these 2 PRs? Thanks!

@kevin85421 I think it's independent and you don't need to wait, I think you would just need to add the yaml to the list of tested yamls in test_sample_rayjob_yamls.py

@kevin85421
Copy link
Member Author

@architkulkarni I have already added a test ray-job.custom-head-svc.yaml for this PR.

@kevin85421 kevin85421 merged commit f106737 into ray-project:master Aug 16, 2023
19 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 22, 2023
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 29, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Allow setting the RayClusterName in the job specification to have determinitic service head
3 participants