
fix: only filter RayCluster events for reconciliation #882

Merged: 2 commits merged into ray-project:master on Jan 29, 2023

Conversation

@davidxia (Contributor) commented on Jan 23, 2023:

#639 accidentally applied the event filters to the child resources (Pods and Services) as well. This change stops filtering Pod- and Service-related events, which means Pod updates will trigger RayCluster reconciliation again.

closes #872

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@architkulkarni (Contributor) commented:
Nice catch, it sounds like this was a pretty subtle issue. I didn't quite understand how the code change works; any chance you could add more detail here? Somehow WithPredicates doesn't apply the filters to child resources, but EventFilter does?

@architkulkarni (Contributor) commented:

Also, I heard that one reason the filter was added originally was to prevent reconciling in a tight loop, because instance.Status.LastUpdateTime is updated at each reconcile (@brucez-anyscale may have more context here):

instance.Status.LastUpdateTime = &timeNow

Would that issue be reintroduced with this PR?

@davidxia (Contributor, Author) commented:

This change only applies the predicates to the RayCluster resource, not to all watched resources like Pods and Services, so the original bug won't be reintroduced. @architkulkarni would you be able to test out this change?
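
For illustration, here's a minimal controller-runtime sketch of the difference (not the exact KubeRay code; the reconciler stub, the predicate body, and the import paths are placeholders):

```go
package ray

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// Placeholder reconciler, just enough to make the sketch self-contained.
type RayClusterReconciler struct{}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Example predicate: ignore RayCluster updates that don't change .spec
	// (status-only writes such as LastUpdateTime bump the resourceVersion
	// but not the generation).
	rayClusterFilter := predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
	}

	return ctrl.NewControllerManagedBy(mgr).
		// builder.WithPredicates scopes the filter to RayCluster events only.
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(rayClusterFilter)).
		// Events for the owned Pods and Services still trigger reconciliation.
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		// By contrast, calling .WithEventFilter(rayClusterFilter) on the builder
		// would apply the filter to every watched type, including the child
		// Pods and Services -- the behavior introduced by #639.
		Complete(r)
}
```

With a builder-level .WithEventFilter, the same predicate would also drop Pod and Service updates before they reach the reconciler, which is the behavior this PR removes.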

@architkulkarni (Contributor) commented:

Sure, I'll run a manual test.

@architkulkarni (Contributor) commented on Jan 24, 2023:

The manual test passed for this PR (I confirmed that the job was successfully submitted and didn't hang). I also checked the logs and confirmed that reconciling wasn't happening in a tight loop (no more reconciling happened after the last reconcile).

@DmitriGekhtman (Collaborator) left a comment:

Looks good -- could you push an empty commit to re-trigger the CI?

ray-project#639 accidentally applied event filters for child resources Pods and Services. This change does not filter Pod or Service related events. This means Pod updates will trigger RayCluster reconciliation.

closes ray-project#872
@kevin85421 (Member) left a comment:

Good catch! LGTM.

@kevin85421 (Member) commented:

@davidxia would you mind checking the error message of CI? The new unit test fails.

kevin85421 mentioned this pull request on Jan 26, 2023
@davidxia (Contributor, Author) commented:

Thanks

• Failure [0.000 seconds]
Inside the default namespace
/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:46
  When creating a raycluster
  /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:131
    cluster's .status.state should be updated to 'ready' [It]
    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:160

    Expected
        <v1alpha1.ClusterState>: unhealthy
    to equal
        <v1alpha1.ClusterState>: ready

    /home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:164

I added debug logs to see why the start params are invalid.

@davidxia (Contributor, Author) commented:

@kevin85421 The debug logs I added in last commit don't show up in the failing test logs. Do you know how I can debug further? I think it's because the ray start params are invalid.

@DmitriGekhtman (Collaborator) commented:

> @kevin85421 The debug logs I added in last commit don't show up in the failing test logs. Do you know how I can debug further? I think it's because the ray start params are invalid.

Seems we need to access the logs of the operator in the context of the test. Or try copying rayStartParams from an example CR.

@kevin85421 (Member) commented:

> @kevin85421 The debug logs I added in last commit don't show up in the failing test logs. Do you know how I can debug further? I think it's because the ray start params are invalid.

I can reproduce this bug on my laptop by:

# path: kuberay/ray-operator
make test

There are three reasons why this test fails.

Reason 1

"object-store-memory": "100000000",

object-store-memory is defined, but the RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE (i.e. AllowSlowStorageEnvVar) environment variable does not exist. Hence, the function ValidateHeadRayStartParams will return false. We can either remove object-store-memory or add RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE to solve this bug.
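
A rough sketch of the kind of check described above (not the actual ValidateHeadRayStartParams source; the function name and signature here are invented for illustration):

```go
package ray

import corev1 "k8s.io/api/core/v1"

// headStartParamsValid is a simplified illustration: object-store-memory is
// only accepted when the head container also sets
// RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE.
func headStartParamsValid(rayStartParams map[string]string, headEnv []corev1.EnvVar) bool {
	if _, ok := rayStartParams["object-store-memory"]; !ok {
		return true // nothing to validate
	}
	for _, envVar := range headEnv {
		if envVar.Name == "RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE" {
			return true
		}
	}
	// The param is set but the allow-slow-storage env var is missing, so the
	// start params are considered invalid (this is what makes the test fail).
	return false
}
```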

Reason 2

It("should create 3 workers", func() {
Eventually(
listResourceFunc(ctx, &workerPods, filterLabels, &client.ListOptions{Namespace: "default"}),
time.Second*15, time.Millisecond*500).Should(Equal(3), fmt.Sprintf("workerGroup %v", workerPods.Items))
if len(workerPods.Items) > 0 {
Expect(workerPods.Items[0].Status.Phase).Should(Or(Equal(corev1.PodRunning), Equal(corev1.PodPending)))
}
})

The new test comes right after the test "should create 3 workers". However, at that point the worker Pods are very likely to be Pending rather than Running. Hence, the function CheckAllPodsRunning will return false. You can print the Pods' states with the following code:

It("cluster's .status.state should be updated to 'ready'", func() {
	listResourceFunc(ctx, &workerPods, filterLabels, &client.ListOptions{Namespace: "default"})
	for _, aPod := range workerPods.Items {
		fmt.Printf("Pod Phase: %v\n", aPod.Status.Phase)
	}
	Eventually(
		getResourceFunc(ctx, client.ObjectKey{Name: myRayCluster.Name, Namespace: "default"}, myRayCluster),
		time.Second*60, time.Millisecond*500).Should(BeNil(), "My myRayCluster  = %v", myRayCluster.Name)
	Expect(myRayCluster.Status.State).Should(Equal(rayiov1alpha1.Ready))
})

Maybe the test case logic should be:

It("cluster's .status.state should be updated to 'ready'", func() {
  // Step 1: Wait for all Pods becoming running.
  // Step 2: Check myRayCluster.Status.State
})

Reason 3

The following change may be unnecessary.
Eventually(..., time.Millisecond*500) -> Eventually(..., time.Second*15, time.Millisecond*500)
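
For reference, Gomega reads Eventually's optional duration arguments as the timeout first and the polling interval second (defaults: 1s timeout, 10ms poll), so the change above mainly raises the timeout from 500ms to 15s. A minimal sketch (f is a placeholder for the polled function, and the usual Ginkgo/Gomega suite setup is assumed):

```go
package ray_test

import (
	"time"

	. "github.com/onsi/gomega"
)

// eventuallyForms shows the two call shapes from Reason 3.
func eventuallyForms(f func() int) {
	// One duration: 500ms timeout, default 10ms polling interval.
	Eventually(f, time.Millisecond*500).Should(Equal(3))
	// Two durations: 15s timeout, polling every 500ms.
	Eventually(f, time.Second*15, time.Millisecond*500).Should(Equal(3))
}
```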

@@ -62,7 +63,6 @@ var _ = Context("Inside the default namespace", func() {
 "port": "6379",
 "object-manager-port": "12345",
 "node-manager-port": "12346",
-"object-store-memory": "100000000",
@davidxia (Contributor, Author) commented:

Need to remove this param so the controller considers the ray start params valid.

Comment on lines +216 to +224
It("cluster's .status.state should be updated to 'ready' shortly after all Pods are Running", func() {
Eventually(
getClusterState(ctx, "default", myRayCluster.Name),
time.Second*(common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS+5), time.Millisecond*500).Should(Equal(rayiov1alpha1.Ready))
})
@davidxia (Contributor, Author) commented:

the important part

@@ -79,6 +81,7 @@ var _ = BeforeSuite(func(done Done) {
 Expect(k8sClient).ToNot(BeNil())

 // Suggested way to run tests
+os.Setenv(common.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV, "10")
@davidxia (Contributor, Author) commented:

decrease requeue delay to make test faster

Comment on lines 192 to 218
It("should be able to update all Pods to Running", func() {
for _, workerPod := range workerPods.Items {
workerPod.Status.Phase = corev1.PodRunning
Expect(k8sClient.Status().Update(ctx, &workerPod)).Should(BeNil())
}
Consistently(
listResourceFunc(ctx, &workerPods, workerFilterLabels, &client.ListOptions{Namespace: "default"}),
time.Second*5, time.Millisecond*500).Should(Equal(3), fmt.Sprintf("workerGroup %v", workerPods.Items))
for _, workerPod := range workerPods.Items {
Expect(workerPod.Status.Phase).Should(Equal(corev1.PodRunning))
}

for _, headPod := range headPods.Items {
headPod.Status.Phase = corev1.PodRunning
Expect(k8sClient.Status().Update(ctx, &headPod)).Should(BeNil())
}
Consistently(
listResourceFunc(ctx, &headPods, headFilterLabels, &client.ListOptions{Namespace: "default"}),
time.Second*5, time.Millisecond*500).Should(Equal(1), fmt.Sprintf("headGroup %v", headPods.Items))
for _, headPod := range headPods.Items {
Expect(headPod.Status.Phase).Should(Equal(corev1.PodRunning))
}
})
@davidxia (Contributor, Author) commented:

Need to manually update Pod statuses, otherwise they'll always be Pending. envtest doesn't create a full K8s cluster; it's only the control plane. There's no container runtime or any other K8s controllers, so Pods are created, but no controller updates them from Pending to Running.

> setting up and starting an instance of etcd and the Kubernetes API server, without kubelet, controller-manager or other components

https://book.kubebuilder.io/reference/envtest.html

@kevin85421 (Member) commented:

Cool!

@kevin85421 (Member) left a comment:

LGTM! Thank you for this contribution!

ray-operator/controllers/ray/raycluster_controller_test.go (outdated review thread, resolved)
Comment on lines 192 to 218 (same snippet and "Cool!" reply as quoted above)

@kevin85421 (Member) commented:

Merging this PR. The CI failure is due to #852.

kevin85421 merged commit 5953b48 into ray-project:master on Jan 29, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023:
ray-project#639 accidentally applied event filters for child resources Pods and Services. This change does not filter Pod or Service related events. This means Pod updates will trigger RayCluster reconciliation.
davidxia deleted the fix1 branch on October 28, 2023.
Successfully merging this pull request may close these issues:

[Bug] Pod updates should trigger RayCluster reconciliation