Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile Ray Workers when VolumeMounts change #945

Closed

Conversation

peterghaddad
Copy link

@peterghaddad peterghaddad commented Mar 6, 2023

Why are these changes needed?

Ensures Ray Pods are up-to-date when volumeMounts change in the cluster manifest. This needs to be utilized with the flag: -forced-cluster-upgrade

Related issue number

#944

Checks

When using the -forced-cluster-upgrade argument with the Ray Operator. The following manual steps have taken place ensuring new functionality works as intended:

  1. Create small Ray Cluster utilizing example manifest.
  2. When the KubeRay operator creates the Ray Cluster, the K8s API server attaches volumeMounts to the head and worker pods. Ensure this does not trigger reconciliation.
        - name: kube-api-access-sgtdfd
          readOnly: true
  1. Update the live Ray Cluster manifest and attach a new VolumeMount to the head pod. Ensure this causes reconciliation of the head node.
  2. Update the live Ray Cluster manifest and attach a new VolumeMount to the worker pod. Ensure this causes reconciliation of the worker node.
  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@peterghaddad peterghaddad changed the title Reconcile Reconcile Ray Workers when VolumeMounts change Mar 6, 2023
@peterghaddad
Copy link
Author

Similar to #527

@peterghaddad
Copy link
Author

@DmitriGekhtman This isn't 100% ready for review just yet, but think this would be good for you to be one of the reviewers if that's okay.

@DmitriGekhtman
Copy link
Collaborator

Sure, keep @kevin85421 in the loop as well!

@kevin85421 kevin85421 self-requested a review March 8, 2023 07:00
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! Would you mind adding some unit tests for the function? In addition, a detailed script for manually testing is good. I will also test this PR by myself.

ray-operator/Dockerfile Outdated Show resolved Hide resolved
ray-operator/apis/ray/v1alpha1/zz_generated.deepcopy.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/raycluster_controller.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util.go Outdated Show resolved Hide resolved
@peterghaddad
Copy link
Author

peterghaddad commented Mar 13, 2023

@kevin85421 @DmitriGekhtman This is ready for review.

@kevin85421 Feel free to resolve any threads you created.
Thanks!

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update! Would you mind adding some tests in util_test.go?

ray-operator/controllers/ray/utils/util_test.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util_test.go Outdated Show resolved Hide resolved
@@ -346,6 +347,10 @@ func PodNotMatchingTemplate(pod corev1.Pod, template corev1.PodTemplateSpec) boo
// resource entries do not match
return true
}
if !equality.Semantic.DeepDerivative(container1.VolumeMounts, container2.VolumeMounts) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use DeepDerivative rather than DeepEqual?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeepEqual causes issues when default volumeMounts are mounted via the Kubernetes cluster. When using DeepEqual, I noticed pods would continuously reconcile.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"I also have the problem using reflect.DeepEqual or equality.Semantic.DeepEqual because some fields are set with default non-zero values by some controller (like ImagePullPolicy, RestartPolicy, and so on) if these fields are not set by the operator" - kubernetes-sigs/kubebuilder#592

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes-sigs/kubebuilder#592 (comment)
kubernetes-sigs/kubebuilder#592 (comment)

DeepDerivative seems to be risky. Would you mind explaining the behavior of this function and why does it work in our case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kevin85421, I think those comments are actually what we want. Kubernetes generates empty or nil values, and using DeepEqual will cause problems since it will detect a drift when it shouldn't..

DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.

Hope this clears things up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.

I knew this. My point is: are there any situations where the worker update could be triggered unintentionally? I have this concern because:

(1) DeepDeriative is much more complex than what we want to compare. (code)
(2) Which fields will be updated by Kubernetes APIServer? Will the changes trigger the worker update unintentionally?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kevin. I believe we have the same concern. Using DeepDeriative for comparing volumeMounts does not trigger unintentionally.

  1. Is is more complex, but it seems that the complexity comes for ensuring fields are properly compared based off data type.
  2. The common example is empty and kubeApiAccess volumeMounts are added via the APIServer causing workers to update unintentionally.

@peterghaddad
Copy link
Author

@kevin85421 Thanks for taking a look! Added in a test specifically aimed at testing the volumeMounts and changed the envVar to volumeMounts.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you ever ran the unit tests on your local environment? You can check this doc for more details.

ray-operator/controllers/ray/utils/util_test.go Outdated Show resolved Hide resolved
ray-operator/controllers/ray/utils/util_test.go Outdated Show resolved Hide resolved
@peterghaddad
Copy link
Author

@kevin85421 Thanks for all the review! The tests have been passing locally, and thought of a couple other cases to add. Let me know if you have anything else.

@peterghaddad
Copy link
Author

@kevin85421 wanted to politely follow up.

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I am busy with the release v0.5.0 these weeks. Will get back to you when the release process finishes. I do not review the PR again, but something can be improved:

(1) Please provide a detailed test script in the PR description outlining how you tested this PR. Currently, KubeRay CI does not have any integration tests for -forced-cluster-upgrade, so I will review the related PRs carefully and test them myself. Please ensure that you thoroughly understand the behavior of both the Kubernetes API server and the DeepDerivative function. Both of these components make me feel risky to merge this PR. (https://github.com/ray-project/kuberay/pull/945/files#r1139782405)

(2) Can we make the test in util_test.go become a separate unit test?

Thanks!

@peterghaddad
Copy link
Author

@kevin85421 I addressed the above. Thank you for taking another look!

@peterghaddad
Copy link
Author

peterghaddad commented May 17, 2023

Hey @kevin85421 wanted to follow up on this. I also noticed no labels on this PR, is that expected at this state? I appreciate the time!

@kevin85421 kevin85421 closed this Mar 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants