[RayService] Avoid Duplicate Serve Service #1867

Merged

Conversation

Contributor

@Yicheng-Lu-llll Yicheng-Lu-llll commented Jan 24, 2024

Why are these changes needed?

Currently, creating a RayService produces two duplicate serve services, as shown below:

# Run a RayService sample from https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples
kubectl apply -f /home/ubuntu/kuberay/ray-operator/config/samples/ray-service.sample.yaml
kubectl get svc
# NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
# kuberay-operator                               ClusterIP   10.96.97.111    <none>        8080/TCP                                        11m
# kubernetes                                     ClusterIP   10.96.0.1       <none>        443/TCP                                         15m
# rayservice-sample-head-svc                     ClusterIP   10.96.229.244   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   2m48s
# rayservice-sample-raycluster-bw4xz-head-svc    ClusterIP   10.96.192.127   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   4m22s
# rayservice-sample-raycluster-bw4xz-serve-svc   ClusterIP   10.96.6.218     <none>        8000/TCP                                        4m22s
# rayservice-sample-serve-svc                    ClusterIP   10.96.205.231   <none>        8000/TCP                                        2m48s

rayservice-sample-serve-svc is the serve service created by the RayService controller, whereas rayservice-sample-raycluster-bw4xz-serve-svc is the serve service intended exclusively for the standalone "RayCluster with serve support" feature. The latter should not be created when the RayCluster is owned by a RayService.

Changes in this PR

  1. The RayCluster controller needs to add specific labels and a serve readiness probe to Pods of a RayCluster created by a RayService. Previously, the ray.io/enable-serve-service annotation was used to determine whether a RayCluster was created by a RayService. This is not always accurate: since "RayCluster with serve support", the same annotation also indicates that a standalone RayCluster is used directly for serving. This overlap is the primary reason two serve services are created. This PR instead uses the ray.io/originated-from-crd label, introduced in "RayCluster with serve support", as the sole indicator that a RayCluster was created by a RayService, while the ray.io/enable-serve-service annotation continues to indicate that a standalone RayCluster is used directly for serving (see the sketch after this list).

  2. As described in 1, the ray.io/enable-serve-service annotation is no longer needed for RayClusters created by a RayService, so this PR removes it from the RayService controller.
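
For reference, here is a minimal sketch (not the exact KubeRay implementation) of how pod-building logic can key off the ray.io/originated-from-crd label rather than the ray.io/enable-serve-service annotation. The helper name and the expected label value are illustrative assumptions:

package main

import "fmt"

// Label key that the RayService controller sets on every RayCluster it creates.
const rayOriginatedFromCRDLabelKey = "ray.io/originated-from-crd"

// originatedFromRayService reports whether a RayCluster's labels indicate it was
// created by a RayService. The expected value "RayService" is an assumption here.
func originatedFromRayService(labels map[string]string) bool {
	return labels[rayOriginatedFromCRDLabelKey] == "RayService"
}

func main() {
	clusterLabels := map[string]string{rayOriginatedFromCRDLabelKey: "RayService"}
	if originatedFromRayService(clusterLabels) {
		// This is where the controller would add the ray.io/serve label and the
		// serve readiness probe, and skip creating a second serve service.
		fmt.Println("RayCluster was created by a RayService")
	}
}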

Summary of the current serving behavior

When using RayService:

  1. The ray.io/serve label is added to both head and worker Pods.
  2. Serve health checks are added to the readiness probes of all worker Pods.
  3. Because of 1 and 2, RayService supports high availability.
  4. A serve service is created.

When using RayCluster with serve support:

  1. Only a serve service is created.
  2. No high availability by default: users need to manually add Serve health checks to the readiness probes of worker Pods (see the sketch after this list).
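
A minimal sketch, assuming a recent k8s.io/api where Probe embeds ProbeHandler, of the kind of readiness probe a RayCluster user would add to worker Pods to get the serve health check that the RayService controller injects automatically. The command mirrors the worker readiness probe shown under Checks below; the thresholds are illustrative:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// serveReadinessProbe builds a worker readiness probe that checks the Serve proxy
// health endpoint on port 8000 in addition to the local raylet health endpoint.
func serveReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"bash", "-c",
					"wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && " +
						"wget -T 2 -q -O- http://localhost:8000/-/healthz | grep success",
				},
			},
		},
		InitialDelaySeconds: 10,
		TimeoutSeconds:      1,
		PeriodSeconds:       5,
		FailureThreshold:    1,
	}
}

func main() {
	fmt.Println(serveReadinessProbe().Exec.Command)
}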

Related issue number

Checks

# Run a RayService sample from https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples
kubectl apply -f /home/ubuntu/kuberay/ray-operator/config/samples/ray-service.sample.yaml
kubectl get pod
# NAME                                                      READY   STATUS    RESTARTS   AGE
# ervice-sample-raycluster-6rqjr-worker-small-group-rmths   1/1     Running   0          71s
# kuberay-operator-5987588ffc-4nwxg                         1/1     Running   0          6m49s
# rayservice-sample-raycluster-6rqjr-head-qbmpl             1/1     Running   0          71s
kubectl get svc
# NAME                                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE
# kuberay-operator                              ClusterIP   10.96.241.73   <none>        8080/TCP                                        8m36s
# kubernetes                                    ClusterIP   10.96.0.1      <none>        443/TCP                                         14m
# rayservice-sample-head-svc                    ClusterIP   10.96.33.104   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   111s
# rayservice-sample-raycluster-6rqjr-head-svc   ClusterIP   10.96.90.19    <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   2m58s
# rayservice-sample-serve-svc                   ClusterIP   10.96.90.247   <none>        8000/TCP                                        111s
kubectl describe svc rayservice-sample-serve-svc | grep "Endpoints"
# Endpoints:         10.244.0.6:8000,10.244.0.7:8000
kubectl describe $(kubectl get pods -o=name | grep head) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=10
kubectl describe $(kubectl get pods -o=name | grep worker) | grep  "Readiness:\|Liveness:"
# Liveness:   exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
# Readiness:  exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8000/-/healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=1
kubectl describe $(kubectl get pods -o=name | grep head) | grep "ray.io/serve"
# ray.io/serve=true
kubectl describe $(kubectl get pods -o=name | grep worker) | grep "ray.io/serve"
# ray.io/serve=true

@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review January 26, 2024 03:00
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@kevin85421 kevin85421 self-requested a review January 26, 2024 21:08
@kevin85421 kevin85421 self-assigned this Jan 26, 2024
Member

@kevin85421 kevin85421 left a comment


Nice! Just left some nits.

@@ -43,6 +43,8 @@ const (
// Finalizers for GCS fault tolerance
GCSFaultToleranceRedisCleanupFinalizer = "ray.io/gcs-ft-redis-cleanup-finalizer"

// EnableServeServiceKey is exclusively utilized to indicate if a Raycluster is directly used for serving.
Member


nit

Suggested change
// EnableServeServiceKey is exclusively utilized to indicate if a Raycluster is directly used for serving.
// EnableServeServiceKey is exclusively utilized to indicate if a RayCluster is directly used for serving.

@@ -1220,7 +1224,7 @@ func TestInitLivenessAndReadinessProbe(t *testing.T) {
// Test 2: User does not define a custom probe. KubeRay will inject Exec probe.
Member


Update the comment.

Contributor Author


Can I get a hint on which part I need to update? Thank you!

val, ok := pod.Labels[utils.RayClusterServingServiceLabelKey]
assert.True(t, ok, "Expected serve label is not present")
assert.Equal(t, utils.EnableRayClusterServingServiceFalse, val, "Wrong serve label value")
CheckHasCorrectDeathEnv(t, &pod.Spec.Containers[utils.RayContainerIndex])
Member


Use utils.EnvVarExists instead.
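
For context, a self-contained sketch of the kind of helper the reviewer is pointing to: a generic "does this env var exist" check over the container's Env, rather than a bespoke assertion. The helper body and env var name below are illustrative, not copied from KubeRay's utils package:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// envVarExists reports whether an env var with the given name is set on the container.
func envVarExists(name string, envVars []corev1.EnvVar) bool {
	for _, envVar := range envVars {
		if envVar.Name == name {
			return true
		}
	}
	return false
}

func main() {
	env := []corev1.EnvVar{{Name: "EXAMPLE_ENV", Value: "1"}} // placeholder env var name
	fmt.Println(envVarExists("EXAMPLE_ENV", env))             // true
}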

@anyscalesam anyscalesam added the enhancement (New feature or request) and P1 (Issue that should be fixed within a few weeks) labels on Jan 29, 2024
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@kevin85421 kevin85421 merged commit edd332b into ray-project:master Jan 30, 2024
23 checks passed
ryanaoleary pushed a commit to ryanaoleary/kuberay that referenced this pull request Feb 13, 2024