fix: only filter RayCluster events for reconciliation #882
Conversation
Nice catch, it sounds like this was a pretty subtle issue. I didn't quite understand how the code change works; any chance you could add more detail here?
Also, I heard that one reason the filter was added originally was to prevent reconciling in a tight loop. Would that issue be reintroduced with this PR?
This change only applies the predicates to the RayCluster resource and not to all watched resources like Pods and Services, so the original bug won't be reintroduced. @architkulkarni would you be able to test out this change?
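For illustration, the distinction can be sketched with the controller-runtime builder API. This is only a sketch, not the exact kuberay code; the RayCluster import path and the predicate used are assumptions.

package ray

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // assumed import path
)

// setupWithManager sketches the fix: builder.WithPredicates scopes the filter
// to RayCluster events only, whereas a builder-wide WithEventFilter call would
// also filter events from the owned Pods and Services.
func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler, rayClusterPred predicate.Predicate) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(rayClusterPred)). // filtered
		Owns(&corev1.Pod{}).     // Pod events still trigger RayCluster reconciliation
		Owns(&corev1.Service{}). // Service events still trigger RayCluster reconciliation
		Complete(r)
}

Calling .WithEventFilter(rayClusterPred) on the builder instead would apply the same predicate to every watched type, which matches the accidental filtering of child-resource events described above.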
Sure, I'll run a manual test.
The manual test passed for this PR (I confirmed that the job was successfully submitted and didn't hang). I also checked the logs and confirmed that reconciling wasn't happening in a tight loop (no more reconciling happened after the last reconcile).
(force-pushed from 2ae9940 to ba5719f)
Looks good -- could you push an empty commit to re-trigger the CI?
ray-project#639 accidentally applied event filters to the child resources Pods and Services. This change does not filter Pod- or Service-related events, which means Pod updates will trigger RayCluster reconciliation. Closes ray-project#872
Good catch! LGTM.
@davidxia would you mind checking the error message in CI? The new unit test fails.
Thanks
I added debug logs to see why the start params are invalid.
@kevin85421 The debug logs I added in the last commit don't show up in the failing test logs. Do you know how I can debug further? I think it's because the ray start params are invalid.
Seems we need to access the logs of the operator in the context of the test.
I can reproduce this bug on my laptop with:

	# path: kuberay/ray-operator
	make test

There are three reasons this test fails.

Reason 1

object-store-memory is defined, but the RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE (i.e. AllowSlowStorageEnvVar) environment variable does not exist. Hence, the function ValidateHeadRayStartParams returns false. We can either remove object-store-memory or add RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE to solve this bug.
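For illustration, a hypothetical sketch of the kind of check described here; this is not the actual ValidateHeadRayStartParams implementation, and the function name and signature below are made up:

package common

import corev1 "k8s.io/api/core/v1"

// headStartParamsValid sketches the Reason 1 failure mode: object-store-memory
// in rayStartParams is only accepted when the RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE
// environment variable is present on the head container.
func headStartParamsValid(rayStartParams map[string]string, env []corev1.EnvVar) bool {
	if _, ok := rayStartParams["object-store-memory"]; !ok {
		return true // nothing to validate
	}
	for _, e := range env {
		if e.Name == "RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE" { // AllowSlowStorageEnvVar
			return true
		}
	}
	return false // param is set but the allow-slow-storage env var is missing
}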
Reason 2

kuberay/ray-operator/controllers/ray/raycluster_controller_test.go, lines 151 to 158 in db0ee96:

It("cluster's .status.state should be updated to 'ready'", func() {
	listResourceFunc(ctx, &workerPods, filterLabels, &client.ListOptions{Namespace: "default"})
	for _, aPod := range workerPods.Items {
		fmt.Printf("Pod Phase: %v\n", aPod.Status.Phase)
	}
	Eventually(
		getResourceFunc(ctx, client.ObjectKey{Name: myRayCluster.Name, Namespace: "default"}, myRayCluster),
		time.Second*60, time.Millisecond*500).Should(BeNil(), "My myRayCluster = %v", myRayCluster.Name)
	Expect(myRayCluster.Status.State).Should(Equal(rayiov1alpha1.Ready))
})

The new test runs just after the test "should create 3 workers". However, at that point the worker Pods are very likely still Pending rather than Running, so .status.state never becomes ready and the assertion fails. Maybe the test case logic should be:
Reason 3

The following change may be unnecessary.
(force-pushed from fcadb8e to acdf703)
@@ -62,7 +63,6 @@ var _ = Context("Inside the default namespace", func() {
	"port": "6379",
	"object-manager-port": "12345",
	"node-manager-port": "12346",
	"object-store-memory": "100000000",
Need to remove this so the controller considers the ray start params valid.
It("cluster's .status.state should be updated to 'ready' shortly after all Pods are Running", func() { | ||
Eventually( | ||
getClusterState(ctx, "default", myRayCluster.Name), | ||
time.Second*(common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS+5), time.Millisecond*500).Should(Equal(rayiov1alpha1.Ready)) | ||
}) |
the important part
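For context, getClusterState is a helper defined elsewhere in the test suite. A plausible sketch, assuming the suite's package-level k8sClient and the v1alpha1 ClusterState type (not necessarily the exact helper in this PR):

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // assumed import path
)

// getClusterState returns a function that Gomega's Eventually can poll: it
// fetches the RayCluster and reports its current .status.state.
func getClusterState(ctx context.Context, namespace string, name string) func() rayiov1alpha1.ClusterState {
	return func() rayiov1alpha1.ClusterState {
		var cluster rayiov1alpha1.RayCluster
		if err := k8sClient.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &cluster); err != nil {
			return "" // not found yet or transient error; Eventually retries
		}
		return cluster.Status.State
	}
}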
@@ -79,6 +81,7 @@ var _ = BeforeSuite(func(done Done) {
	Expect(k8sClient).ToNot(BeNil())

	// Suggested way to run tests
	os.Setenv(common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV, "10")
decrease requeue delay to make test faster
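A hedged sketch of how the operator could resolve this override, falling back to the compiled-in default; the real lookup lives in kuberay's common package, and the helper name and exact behavior below are assumptions:

import (
	"os"
	"strconv"
	"time"

	"github.com/ray-project/kuberay/ray-operator/controllers/ray/common" // assumed import path
)

// requeuePeriod reads the requeue delay from the environment variable the test
// sets above, falling back to the default number of seconds when the variable
// is unset or unparsable.
func requeuePeriod() time.Duration {
	seconds := common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS
	if v := os.Getenv(common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV); v != "" {
		if parsed, err := strconv.Atoi(v); err == nil && parsed > 0 {
			seconds = parsed
		}
	}
	return time.Duration(seconds) * time.Second
}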
It("should be able to update all Pods to Running", func() { | ||
for _, workerPod := range workerPods.Items { | ||
workerPod.Status.Phase = corev1.PodRunning | ||
Expect(k8sClient.Status().Update(ctx, &workerPod)).Should(BeNil()) | ||
} | ||
Consistently( | ||
listResourceFunc(ctx, &workerPods, workerFilterLabels, &client.ListOptions{Namespace: "default"}), | ||
time.Second*5, time.Millisecond*500).Should(Equal(3), fmt.Sprintf("workerGroup %v", workerPods.Items)) | ||
for _, workerPod := range workerPods.Items { | ||
Expect(workerPod.Status.Phase).Should(Equal(corev1.PodRunning)) | ||
} | ||
|
||
for _, headPod := range headPods.Items { | ||
headPod.Status.Phase = corev1.PodRunning | ||
Expect(k8sClient.Status().Update(ctx, &headPod)).Should(BeNil()) | ||
} | ||
Consistently( | ||
listResourceFunc(ctx, &headPods, headFilterLabels, &client.ListOptions{Namespace: "default"}), | ||
time.Second*5, time.Millisecond*500).Should(Equal(1), fmt.Sprintf("headGroup %v", headPods.Items)) | ||
for _, headPod := range headPods.Items { | ||
Expect(headPod.Status.Phase).Should(Equal(corev1.PodRunning)) | ||
} | ||
}) |
Need to manually update the Pod statuses, otherwise they'll always be Pending. envtest doesn't create a full K8s cluster; it's only the control plane. There's no container runtime or any other K8s controllers, so Pods are created but no controller ever moves them from Pending to Running. As the envtest docs describe it:

"setting up and starting an instance of etcd and the Kubernetes API server, without kubelet, controller-manager or other components"
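A minimal envtest sketch showing why this is so; the CRD directory path is a placeholder, but the envtest calls are the standard controller-runtime ones:

package ray_test

import (
	"path/filepath"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestControlPlaneOnly starts only etcd and the kube-apiserver. Nothing in this
// environment ever moves a Pod out of Pending, so tests have to patch Pod
// statuses themselves, as the test above does.
func TestControlPlaneOnly(t *testing.T) {
	testEnv := &envtest.Environment{
		CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")}, // placeholder path
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatalf("starting envtest control plane: %v", err)
	}
	defer testEnv.Stop()
	_ = cfg // *rest.Config for building a client against the test API server
}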
Cool!
LGTM! Thank you for this contribution!
It("should be able to update all Pods to Running", func() { | ||
for _, workerPod := range workerPods.Items { | ||
workerPod.Status.Phase = corev1.PodRunning | ||
Expect(k8sClient.Status().Update(ctx, &workerPod)).Should(BeNil()) | ||
} | ||
Consistently( | ||
listResourceFunc(ctx, &workerPods, workerFilterLabels, &client.ListOptions{Namespace: "default"}), | ||
time.Second*5, time.Millisecond*500).Should(Equal(3), fmt.Sprintf("workerGroup %v", workerPods.Items)) | ||
for _, workerPod := range workerPods.Items { | ||
Expect(workerPod.Status.Phase).Should(Equal(corev1.PodRunning)) | ||
} | ||
|
||
for _, headPod := range headPods.Items { | ||
headPod.Status.Phase = corev1.PodRunning | ||
Expect(k8sClient.Status().Update(ctx, &headPod)).Should(BeNil()) | ||
} | ||
Consistently( | ||
listResourceFunc(ctx, &headPods, headFilterLabels, &client.ListOptions{Namespace: "default"}), | ||
time.Second*5, time.Millisecond*500).Should(Equal(1), fmt.Sprintf("headGroup %v", headPods.Items)) | ||
for _, headPod := range headPods.Items { | ||
Expect(headPod.Status.Phase).Should(Equal(corev1.PodRunning)) | ||
} | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
Merge this PR. The CI failure is due to #852.