Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: worker node can't connect to head node service #445

Merged
merged 1 commit into from
Aug 9, 2022

Conversation

pingsutw
Copy link
Contributor

@pingsutw pingsutw commented Aug 9, 2022

Signed-off-by: Kevin Su pingsutw@apache.org

Why are these changes needed?

If Ray job name is too long, the worker node will try to connect to the wrong head node service.

(base) ➜  ~ k get svc -n flytesnacks-development
NAME                                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
l94lnxh5ktptbgzsfnj-n0-0-raycluster-xg6hj-head-svc   NodePort    10.96.123.235   <none>        6379:32288/TCP,10001:31845/TCP,8265:32389/TCP   33s
(base) ➜  ~ k describe pods -n flytesnacks-development ptbgzsfnj-n0-0-raycluster-xg6hj-worker-test-group-5kbn5 | grep address 
      ulimit -n 65536; ray start  --node-ip-address=$MY_POD_IP  --address=al94lnxh5ktptbgzsfnj-n0-0-raycluster-xg6hj-head-svc:6379  --metrics-export-port=8080  --num-cpus=2  --memory=2097152000  && sleep infinity

The service name is l94lnxh5ktptbgzsfnj-n0-0-raycluster-xg6hj-head-svc , but worker node try to connect to al94lnxh5ktptbgzsfnj-n0-0-raycluster-xg6hj-head-svc:6379.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Kevin Su <pingsutw@apache.org>
@@ -43,7 +43,7 @@ func BuildIngressForHeadService(cluster rayiov1alpha1.RayCluster) (*networkingv1
PathType: &pathType,
Backend: networkingv1.IngressBackend{
Service: &networkingv1.IngressServiceBackend{
Name: utils.GenerateServiceName(cluster.Name),
Name: utils.CheckName(utils.GenerateServiceName(cluster.Name)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. We internally use short names to overcome such issues but agree that this is indeed a problem worth revisiting.

@@ -47,11 +47,11 @@ func IsRunningAndReady(pod *corev1.Pod) bool {

// CheckName makes sure the name does not start with a numeric value and the total length is < 63 char
func CheckName(s string) string {
maxLenght := 50 // 63 - (max(8,6) + 5 ) // 6 to 8 char are consumed at the end with "-head-" or -worker- + 5 generated.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch on the typo

@Jeffwan Jeffwan merged commit 939034c into ray-project:master Aug 9, 2022
Jeffwan pushed a commit to Jeffwan/kuberay that referenced this pull request Aug 9, 2022
Jeffwan added a commit that referenced this pull request Aug 10, 2022
* Fix nil pointer dereference (#429)

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Fix wrong ray start command (#431)

Signed-off-by: Kevin Su <pingsutw@apache.org>

* Add ray state api doc link in ray service doc (#428)

* Add ray state api doc link in ray service doc

* Update doc

* update

* [doc] Fix config typos

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>

Fixes a couple of typos in recently introduced sample configs.

* Add http resp code check for kuberay (#435)

* Clean up example samples (#434)

This PR cleans up the "complete" and "autoscaler" sample yamls a bit.
Unnecessary pod spec fields are removed without sacrificing the completeness of the examples.
The idea is to make the configuration look less intimidating.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>

* Add more env for RayService head or worker pods (#439)

* fix: worker node can't connect to head node service (#445)

Signed-off-by: Kevin Su <pingsutw@apache.org>

* helm-chart/ray-cluster: allow head autoscaling (#443)

Also allow setting rayVersion

Signed-off-by: Christos Kotsis <28815556+ulfox@users.noreply.github.com>

* Disable async serve handler in Ray Service cluster (#447)

* Add wget timeout to probes (#448)

* Enable tests against release-0.3 branch

Signed-off-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Christos Kotsis <28815556+ulfox@users.noreply.github.com>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: bruce <103957904+brucez-anyscale@users.noreply.github.com>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Christos Kotsis <28815556+ulfox@users.noreply.github.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Co-authored-by: Wilson Wang <3913185+wilsonwang371@users.noreply.github.com>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Signed-off-by: Kevin Su <pingsutw@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants