[Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService #1110

Merged May 26, 2023 (7 commits)

Conversation

@kevin85421 (Member) commented May 25, 2023

Why are these changes needed?

As mentioned in #1105, in Kubernetes 1.20.7, the selector in the Serve service of RayService cannot be updated, but Kubernetes 1.23.0 does not have this issue.

The root cause of this issue is the immutability of the ClusterIP field. If a service is created without specifying the ClusterIP, Kubernetes automatically assigns it. Therefore, in the reconcileServices function, the oldSvc object retrieved from the Kubernetes cluster contains the ClusterIP value.

Without this PR (as shown in the following code snippet), raySvc is built from the CR spec, so its ClusterIP is "". The rayService object fetched from the cluster, however, carries the ClusterIP that Kubernetes assigned. Assigning raySvc.Spec to rayService.Spec therefore resets the ClusterIP to "", and the update is invalid because the ClusterIP field is immutable.

func (r *RayServiceReconciler) reconcileServices(ctx context.Context, rayServiceInstance *rayv1alpha1.RayService, rayClusterInstance *rayv1alpha1.RayCluster, serviceType common.ServiceType) error {
    // Create Service Struct.
    var raySvc *corev1.Service
    var err error
    if serviceType == common.HeadService {
        raySvc, err = common.BuildHeadServiceForRayService(*rayServiceInstance, *rayClusterInstance)
    } else if serviceType == common.ServingService {
        raySvc, err = common.BuildServeServiceForRayService(*rayServiceInstance, *rayClusterInstance)
    }
    if err != nil {
        return err
    }
    raySvc.Name = utils.CheckName(raySvc.Name)
    // Get Service instance.
    rayService := &corev1.Service{}
    err = r.Get(ctx, client.ObjectKey{Name: raySvc.Name, Namespace: rayServiceInstance.Namespace}, rayService)
    if err == nil {
        // Update Service. Overwriting the whole Spec also resets ClusterIP to "".
        rayService.Spec = raySvc.Spec
        r.Log.V(1).Info("reconcileServices update service")
        if updateErr := r.Update(ctx, rayService); updateErr != nil {
            r.Log.Error(updateErr, "raySvc Update error!", "raySvc.Error", updateErr)
            return updateErr
        }
        ...
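
The review thread further down shows the fix as carrying the assigned ClusterIP over to the desired spec before overwriting the live one. A minimal, self-contained sketch of that idea follows. The `Service` and `ServiceSpec` types, the `deepCopy` method, the `mergeSpec` helper, the label key, and the IP address are simplified stand-ins invented for illustration; they are not the `corev1` types or the operator's actual code.

```go
package main

import "fmt"

// ServiceSpec is a simplified, hypothetical stand-in for corev1.ServiceSpec.
type ServiceSpec struct {
	ClusterIP string
	Selector  map[string]string
}

// Service is a simplified, hypothetical stand-in for corev1.Service.
type Service struct {
	Name string
	Spec ServiceSpec
}

// deepCopy returns a copy of the spec that does not share the selector map.
func (s ServiceSpec) deepCopy() ServiceSpec {
	out := ServiceSpec{ClusterIP: s.ClusterIP}
	if s.Selector != nil {
		out.Selector = make(map[string]string, len(s.Selector))
		for k, v := range s.Selector {
			out.Selector[k] = v
		}
	}
	return out
}

// mergeSpec (hypothetical helper) overwrites the live spec with the desired
// one, but keeps the cluster-assigned ClusterIP when the desired spec leaves
// the field empty, so the immutable field never changes.
func mergeSpec(oldSvc, newSvc *Service) {
	if newSvc.Spec.ClusterIP == "" {
		newSvc.Spec.ClusterIP = oldSvc.Spec.ClusterIP
	}
	oldSvc.Spec = newSvc.Spec.deepCopy()
}

func main() {
	// live models the object fetched from the cluster; the IP is made up.
	live := &Service{
		Name: "serve-svc",
		Spec: ServiceSpec{ClusterIP: "10.96.0.12", Selector: map[string]string{"cluster": "old"}},
	}
	// desired models the object built from the CR spec: no ClusterIP set.
	desired := &Service{
		Name: "serve-svc",
		Spec: ServiceSpec{Selector: map[string]string{"cluster": "new"}},
	}
	mergeSpec(live, desired)
	fmt.Println(live.Spec.ClusterIP, live.Spec.Selector["cluster"])
}
```

With this shape, the selector changes while the ClusterIP that reaches the update call stays identical to the one already stored.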

Optional: Why does Kubernetes 1.23.0 not have this issue?

I used binary search to identify that the issue was resolved between Kubernetes v1.21.2 and Kubernetes v1.21.10.

Trace code (Kubernetes v1.21.10):

  • Kubernetes Service Update RESTful API (code) -> Call BeforeUpdate() (code) -> Call strategy.PrepareForUpdate to prepare for update (code)

  • Kubernetes Service's PrepareForUpdate strategy (code) -> Call patchAllocatedValues (code)

  • patchAllocatedValues: if the new service does not specify a ClusterIP, the function will assign the ClusterIP of the old service to the new one. Hence, the update will not update ClusterIP from non-empty to empty.

     func patchAllocatedValues(newSvc, oldSvc *api.Service) {
         if needsClusterIP(oldSvc) && needsClusterIP(newSvc) {
             if newSvc.Spec.ClusterIP == "" {
                 newSvc.Spec.ClusterIP = oldSvc.Spec.ClusterIP
             }
             if len(newSvc.Spec.ClusterIPs) == 0 {
                 newSvc.Spec.ClusterIPs = oldSvc.Spec.ClusterIPs
             }
         }
         ...
     }

The function patchAllocatedValues was added in Kubernetes v1.21.5.
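
To make the difference between the two server behaviors concrete, here is a toy model of a single update request, with and without the patching step. Everything in it is an assumption made for illustration: the `spec` type, the `validateUpdate` check, the error text, and the `update` wrapper are simplified stand-ins, and `needsClusterIP` is reduced to "any type other than ExternalName". It is not the Kubernetes API server code.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical, stripped-down service spec.
type spec struct {
	Type      string
	ClusterIP string
}

// errImmutable stands in for the validation error; the wording is illustrative.
var errImmutable = errors.New("spec.clusterIP: field is immutable")

// validateUpdate is a toy immutability check: once a ClusterIP has been
// assigned, an update may not change it, including changing it to "".
func validateUpdate(newSpec, oldSpec spec) error {
	if oldSpec.ClusterIP != "" && newSpec.ClusterIP != oldSpec.ClusterIP {
		return errImmutable
	}
	return nil
}

// needsClusterIP is a simplified assumption: every type except ExternalName.
func needsClusterIP(s spec) bool {
	return s.Type != "ExternalName"
}

// patchAllocated mirrors the idea of patchAllocatedValues: an empty ClusterIP
// in the new spec is filled in from the old spec before validation runs.
func patchAllocated(newSpec *spec, oldSpec spec) {
	if needsClusterIP(oldSpec) && needsClusterIP(*newSpec) && newSpec.ClusterIP == "" {
		newSpec.ClusterIP = oldSpec.ClusterIP
	}
}

// update models one update request; patch selects the newer behavior.
func update(newSpec, oldSpec spec, patch bool) error {
	if patch {
		patchAllocated(&newSpec, oldSpec)
	}
	return validateUpdate(newSpec, oldSpec)
}

func main() {
	old := spec{Type: "ClusterIP", ClusterIP: "10.96.0.12"} // made-up address
	desired := spec{Type: "ClusterIP"}                      // ClusterIP left empty

	fmt.Println("without patching:", update(desired, old, false))
	fmt.Println("with patching:", update(desired, old, true))
}
```

In this model the same request fails when the empty ClusterIP reaches validation and succeeds when it is filled in first, which matches the behavior difference described above.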

Related issue number

Closes #1105

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 0: Create a Kind cluster
kind create cluster --image=kindest/node:v1.20.7

# Step 1: Build docker image for this PR (path: ray-operator/) and load into the Kind cluster
make docker-image
kind load docker-image controller:latest

# Step 2: Install a KubeRay operator
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest

# Step 3: Create a RayService
# path: ray-operator/config/samples
kubectl apply -f ray_v1alpha1_rayservice.yaml

# Step 4: Edit `spec.rayClusterConfig.rayVersion` from 2.4.0 to 2.100.0.
kubectl edit rayservices.ray.io rayservice-sample

# Step 5: Wait for the serve deployments on the new RayCluster to become ready.
# Check the service's selector
kubectl describe svc rayservice-sample-head-svc

@kevin85421 kevin85421 changed the title [Bug] k8s v1.20.7 ClusterIP svc do not updated under RayService [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService May 25, 2023
@@ -118,6 +118,11 @@ func (in *DashboardStatus) DeepCopy() *DashboardStatus {
// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil.
func (in *HeadGroupSpec) DeepCopyInto(out *HeadGroupSpec) {
*out = *in
if in.HeadService != nil {
Member Author

This is unrelated to #1105. The generated code is for #1040. We should enhance the CI to detect such issues. The command make docker-image will not trigger the local zz_generated.deepcopy.go generation. We need to run make build to update zz_generated.deepcopy.go.

Member Author

Opened an issue: #1111

if err != nil {
return err
}
raySvc.Name = utils.CheckName(raySvc.Name)
Member Author

The functions common.BuildHeadServiceForRayService and common.BuildServeServiceForRayService use utils.GenerateServiceName and utils.GenerateServeServiceName, respectively, and both of those already call CheckName internally. This line is therefore redundant, so this PR deletes it.

if newSvc.Spec.ClusterIP == "" {
newSvc.Spec.ClusterIP = oldSvc.Spec.ClusterIP
}
oldSvc.Spec = *newSvc.Spec.DeepCopy()
Member Author

This currently takes only Spec into consideration. Updates to ObjectMeta need to be handled in subsequent PRs.

Contributor

Shall we add a TODO in the code comment?

Member Author

Added in 3119e67.

@kevin85421 kevin85421 marked this pull request as ready for review May 25, 2023 05:40
@kevin85421
Member Author

cc @Yicheng-Lu-llll

@kevin85421
Member Author

cc @jamm1985

@jamm1985

@kevin85421

It works fine. Is it possible to release a v0.5.1 with this fix?

Contributor

@architkulkarni architkulkarni left a comment

Looks good to me, just a few minor questions (non-blocking)

r.Log.V(1).Info("reconcileServices update service")
if updateErr := r.Update(ctx, rayService); updateErr != nil {
r.Log.Error(updateErr, "raySvc Update error!", "raySvc.Error", updateErr)
// Only update the service if the RayCluster switches.
Contributor

Is "Only update the service if the RayCluster switches" a new behavior change in this PR? What's the reasoning behind it? If the reason is easy to state it might be worth adding it to this code comment.

Member Author

> Is "Only update the service if the RayCluster switches" a new behavior change in this PR?

Yes.

> What's the reasoning behind it? If the reason is easy to state it might be worth adding it to this code comment.

Similar to #1065, redundant Update invocations place an unnecessary burden on the Kubernetes API server.
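
The thread does not spell out how the operator detects that the RayCluster has switched, so the following is only a sketch of the general pattern of skipping no-op updates. It assumes, purely for illustration, that a switch shows up as a change in the service selector. The `fakeClient` type, the `reconcileSelector` helper, and the label key are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
)

// fakeClient is a hypothetical stand-in for a Kubernetes client; it only
// counts how many update calls it receives.
type fakeClient struct {
	updates int
}

// Update records one write to the (imaginary) API server.
func (c *fakeClient) Update(selector map[string]string) {
	c.updates++
}

// reconcileSelector (hypothetical) issues an update only when the desired
// selector differs from the live one, and reports whether it did so.
func reconcileSelector(c *fakeClient, live, desired map[string]string) bool {
	if reflect.DeepEqual(live, desired) {
		// Nothing changed, so skip the API call entirely.
		return false
	}
	c.Update(desired)
	return true
}

func main() {
	c := &fakeClient{}
	live := map[string]string{"cluster": "a"}

	// Repeated reconciles against an unchanged cluster cause no writes.
	for i := 0; i < 3; i++ {
		reconcileSelector(c, live, map[string]string{"cluster": "a"})
	}
	// A switch to another cluster causes exactly one write.
	reconcileSelector(c, live, map[string]string{"cluster": "b"})

	fmt.Println("update calls:", c.updates)
}
```

Because a reconcile loop runs many times for every real change, a guard of this kind keeps steady-state reconciles from generating write traffic.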

@kevin85421
Member Author

cc @jamm1985 Is it acceptable for you to use the nightly build or image for a specific commit (DockerHub)? We anticipate releasing the first release candidate for v0.6.0 at the end of June, and currently, there are no plans for v0.5.1.

@kevin85421 kevin85421 merged commit f6a172f into ray-project:master May 26, 2023
19 checks passed
@jamm1985

@kevin85421 OK. Thanks a lot for the quick fix. I'm looking forward to the new v0.6.0 release and wish you all the best with the work on it.

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…er RayService (ray-project#1110)

[Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService