
Fix issue with operator OOM restart #946

Merged · 9 commits merged into ray-project:master from wilson/fix-601 on Mar 28, 2023

Conversation

@wilsonwang371 (Collaborator) commented Mar 7, 2023

Why are these changes needed?

This patch fixes issue #601.

In the KubeRay controller, we watch all events so that we can detect failed health-probe events; when such an event occurs, we create a new pod to replace the failed one. Ideally we should not receive a large number of these events, but we were not filtering out unrelated events correctly. When many unrelated events arrive, the controller ends up printing far too many log messages and slows down.

In this patch, we use filters to drop the event messages we do not care about as early as possible.
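
For context, here is a minimal sketch of the early-filtering idea, written as a controller-runtime predicate. The conditions mirror the CreateFunc snippet quoted later in this conversation; the file name, package name, and variable name are assumptions for illustration only, not the merged code.

// events_filter_sketch.go -- sketch only; names and exact conditions are assumptions.
package ray

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// unhealthyPodEventFilter drops every Event that is not a "Readiness probe failed"
// warning on a Pod, so unrelated events never reach the reconciler or its logs.
var unhealthyPodEventFilter = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		eventObj, ok := e.Object.(*corev1.Event)
		if !ok {
			return false
		}
		return eventObj.InvolvedObject.Kind == "Pod" &&
			eventObj.Type == "Warning" &&
			eventObj.Reason == "Unhealthy" &&
			strings.Contains(eventObj.Message, "Readiness probe failed")
	},
}

Such a predicate would typically be attached to the Event watch (for example via builder.WithPredicates) so the filtering happens before any reconcile or logging work.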

Related issue number

#601

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@wilsonwang371 wilsonwang371 changed the title [DO NOT MERGE]fix issue with operator OOM restart [WIP]fix issue with operator OOM restart Mar 7, 2023
@wilsonwang371 wilsonwang371 linked an issue Mar 7, 2023 that may be closed by this pull request
@Jeffwan (Collaborator) commented Mar 7, 2023

For bug fixes, we normally add some tests to compare before & after. Could you add some tests?

@kevin85421 kevin85421 self-requested a review March 7, 2023 23:16
@wilsonwang371 wilsonwang371 changed the title [WIP]fix issue with operator OOM restart Fix issue with operator OOM restart Mar 8, 2023
@wilsonwang371 wilsonwang371 self-assigned this Mar 8, 2023
@wilsonwang371 wilsonwang371 added this to the v0.5.0 release milestone Mar 8, 2023
@Jeffwan (Collaborator) left a comment

overall looks good to me

ray-operator/controllers/ray/raycluster_controller.go — two outdated review threads, resolved
@kevin85421 (Member) left a comment

Thanks for the contribution! I will check out this branch and test it myself next week.

ray-operator/Makefile — review thread, resolved
"event", event)
return ctrl.Result{}, nil
} else if len(pods.Items) > 1 {
// This happens when we use fake client
kevin85421 (Member):

Why do we need to have two Pods with the same name and same namespace in test mode?

wilsonwang371 (Collaborator, Author):

This happens during tests, where server-side filtering is not available.

kevin85421 (Member):

Would you mind explaining a bit more? In my understanding, it is impossible to have two Pods with the same name in a real cluster. Why do we need to test this case? Am I missing anything? Thanks!

wilsonwang371 (Collaborator, Author):

This happens during tests, where we use the fake client, which does not support the different kinds of filters applied via ListOption. Without this check, the test would fail even though there is no real failure.
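
For illustration, a hedged sketch of the kind of guard being discussed; the helper name and log wording are hypothetical, and it assumes the podUIDIndexField index and ListOptions shown elsewhere in this PR, a reconciler that embeds client.Client, and the usual imports (context, corev1, sigs.k8s.io/controller-runtime/pkg/client). On a real API server the filters narrow the list to at most one Pod; the fake client used in unit tests skips that server-side filtering, so extra items can come back.

// Sketch only; podForEvent is a hypothetical helper name.
func (r *RayClusterReconciler) podForEvent(ctx context.Context, event *corev1.Event) (*corev1.Pod, error) {
	pods := corev1.PodList{}
	options := []client.ListOption{
		client.MatchingFields(map[string]string{podUIDIndexField: string(event.InvolvedObject.UID)}),
		client.InNamespace(event.InvolvedObject.Namespace),
	}
	if err := r.List(ctx, &pods, options...); err != nil {
		return nil, err
	}
	if len(pods.Items) == 0 {
		return nil, nil // not a Pod this controller manages
	}
	if len(pods.Items) > 1 {
		// Only expected with the fake client, which does not apply the filters above.
		r.Log.Info("multiple Pods matched the event; using the first", "event", event.Name)
	}
	return &pods.Items[0], nil
}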

ray-operator/controllers/ray/raycluster_controller.go — two outdated review threads, resolved
testPodEventName := fmt.Sprintf("%s.15f0c0c5c5c5c5c5", testPodName)

// add a pod in a different namespace
newPods := []runtime.Object{
kevin85421 (Member):

The specifications of the Pods are very similar (some parts are identical). Can we use variables instead of spelling them all out?

wilsonwang371 (Collaborator, Author):

This is a test case, and it specifies common.RayNodeLabelKey: "yes". I know some parts are the same, but putting them in a separate place might not be good for test cases.

kevin85421 (Member):

Let me explain a bit more. What I mean is:

spec :=  corev1.PodSpec{
  Containers: []corev1.Container{
	  {
		  Name:    "ray-worker",
		  Image:   "rayproject/ray:2.2.0",
		  Command: []string{"echo"},
		  Args:    []string{"Hello Ray"},
	  },
  },
}

&corev1.Pod {
   ...
   Spec: spec (Maybe need to use Deep copy)
}

&corev1.Pod {
   ...
   Spec: spec (Maybe need to use Deep copy)
}

instead of

&corev1.Pod {
   ...
   Spec: corev1.PodSpec{
    Containers: []corev1.Container{
	    {
		    Name:    "ray-worker",
		    Image:   "rayproject/ray:2.2.0",
		    Command: []string{"echo"},
		    Args:    []string{"Hello Ray"},
	    },
    },
  }
}

&corev1.Pod {
   ...
   Spec: corev1.PodSpec{
    Containers: []corev1.Container{
	    {
		    Name:    "ray-worker",
		    Image:   "rayproject/ray:2.2.0",
		    Command: []string{"echo"},
		    Args:    []string{"Hello Ray"},
	    },
    },
  }
}

wilsonwang371 (Collaborator, Author):

This is not the only location where we would need to make this change. I'd suggest making a separate PR to change all of them.

kevin85421 (Member):

There is a lot of technical debt in KubeRay now. It's good for us to reduce the system's complexity in every PR :)
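
To make the DeepCopy suggestion above concrete, here is a hedged sketch of what the shared-spec version of the test fixtures could look like; the pod names and namespaces are placeholders, not the test's actual values.

// Sketch only; a shared spec plus DeepCopy avoids repeating the container
// definition for every test Pod.
workerSpec := corev1.PodSpec{
	Containers: []corev1.Container{
		{
			Name:    "ray-worker",
			Image:   "rayproject/ray:2.2.0",
			Command: []string{"echo"},
			Args:    []string{"Hello Ray"},
		},
	},
}

podA := &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "pod-a", Namespace: "ns-a"},
	Spec:       *workerSpec.DeepCopy(),
}
podB := &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "pod-b", Namespace: "ns-b"},
	Spec:       *workerSpec.DeepCopy(),
}

// newPods would then be passed to the fake client builder in the test setup.
newPods := []runtime.Object{podA, podB}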

@wilsonwang371 force-pushed the wilson/fix-601 branch 4 times, most recently from be253eb to 26c5593 on March 16, 2023 05:35
@wilsonwang371 force-pushed the wilson/fix-601 branch 4 times, most recently from 32776b4 to 820a135 on March 21, 2023 21:03
@Jeffwan (Collaborator) commented Mar 22, 2023

@wilsonwang371 Please help address @kevin85421's comment.

@architkulkarni architkulkarni self-requested a review March 27, 2023 18:28
@kevin85421 (Member)

Hi @wilsonwang371, would you mind rebasing with the master branch? Because of the tight deadline for the v0.5.0 release, @architkulkarni and I will merge this PR today and file a new PR to fix/refactor if we find any bugs in it. Thanks!

@architkulkarni (Contributor) left a comment

Overall looks good to me

return ctrl.Result{}, nil
}

_ = r.Log.WithValues("event", request.NamespacedName)
r.Log.Info("reconcile RayCluster Event", "event name", request.Name)
architkulkarni (Contributor):

Why remove this log? Is it no longer useful?

Log: ctrl.Log.WithName("controllers").WithName("RayCluster"),
}

// add event for reconcile
architkulkarni (Contributor):

Does it make sense to also add an event which should get filtered out and not trigger a reconcile?

CreateFunc: func(e event.CreateEvent) bool {
if eventObj, ok := e.Object.(*corev1.Event); ok {
if eventObj.InvolvedObject.Kind != "Pod" || eventObj.Type != "Warning" ||
eventObj.Reason != "Unhealthy" || !strings.Contains(eventObj.Message, "Readiness probe failed") {
architkulkarni (Contributor):

Is there a more programmatic way to get the cause of the unhealthiness, or is scanning the message string standard practice? Also, for my education, why do we only care about failures in the readiness probe and not other pod unhealthy events? Maybe @kevin85421 you can also respond if you know the answer

wilsonwang371 (Collaborator, Author):

A readiness probe failure will trigger a liveness probe failure. A readiness probe failure will not automatically trigger a pod rebuild, but a liveness probe failure will.

@architkulkarni (Contributor)

@wilsonwang371 Do you mind adding a bit more detail to the PR description?

wilsonwang371 and others added 6 commits March 27, 2023 15:34
@wilsonwang371 (Collaborator, Author)

Sorry, I have been a little busy these days. I just took some time to rebase and update the comments. Let's wait and see how @jeevb replies to this latest patch.

)

// NewReconciler returns a new reconcile.Reconciler
func NewReconciler(mgr manager.Manager) *RayClusterReconciler {
if err := mgr.GetFieldIndexer().IndexField(context.Background(), &corev1.Pod{}, podUIDIndexField, func(rawObj client.Object) []string {
pod := rawObj.(*corev1.Pod)
kevin85421 (Member):

The following change may fix #601 (comment). cc @jeevb

pod := rawObj.(*corev1.Pod)
owner := metav1.GetControllerOf(pod)
if owner == nil || owner.APIVersion != rayiov1alpha1.GroupVersion.String() || owner.Kind != "RayCluster" {
	return nil
}
return []string{string(pod.UID)}

kevin85421 (Member):

The comment above is wrong. This index is only used to enable efficient lookup operations against the local cache; the informer will still fetch cluster state from the Kubernetes API server. That is, events unrelated to Ray Pods will still be fetched by the informer. Sorry for the misleading comment.

@kevin85421 (Member) left a comment

If you do not have time to update & test this PR tomorrow, I can merge this PR and open a new patch to fix the issues.

options := []client.ListOption{
client.MatchingFields(map[string]string{podUIDIndexField: string(event.InvolvedObject.UID)}),
client.InNamespace(event.InvolvedObject.Namespace),
client.MatchingLabels(map[string]string{common.RayNodeLabelKey: "yes"}),
kevin85421 (Member):

Remove all common.RayNodeLabelKey

wilsonwang371 (Collaborator, Author):

Why?

@kevin85421 (Member) commented Mar 29, 2023

I found out that I misunderstood some things. Initially, I thought that we should implement all filter logic in the resource event handler (i.e. CreateFunc) and return true only when an event is issued by a Ray Pod. With this approach, we can make the function eventReconcile only handle event processing and not worry about event filtering. Therefore, the label common.RayNodeLabelKey is not necessary.

However, I have since realized that the owners of the unhealthy events are Pods, which may include Ray Pods and other Pods. Therefore, we cannot use GetControllerOf to get the owner and determine whether the event belongs to a Ray Pod or not by checking whether the owner's kind is "RayCluster" or not.

Hence, the implementation will be more complex. If we want to distinguish whether an event belongs to a Ray Pod or not, we need to

  • Get the UID of the event's InvolvedObject.
  • Use the UID to look up Pods from the local cache.
  • Use GetControllerOf(pod) to get the owner, and check whether the kind is RayCluster or not.

It would make the function eventReconcile easier to understand, but it requires more I/O (reading the local cache), which may cause performance overhead.

In summary, we should use common.RayNodeLabelKey to distinguish Ray Pods in eventReconcile until we understand the cost of the look-up operation.
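
For reference, a hedged sketch of the three-step look-up alternative described above; the helper name is hypothetical, and it assumes the podUIDIndexField index plus standard apimachinery/controller-runtime APIs (metav1.GetControllerOf, client.MatchingFields).

// Sketch only; isRayPodEvent is a hypothetical helper name.
// Step 1: take the UID of the event's InvolvedObject.
// Step 2: use the UID to look the Pod up from the local cache.
// Step 3: check whether the Pod's controller owner is a RayCluster.
func isRayPodEvent(ctx context.Context, c client.Client, event *corev1.Event) (bool, error) {
	pods := corev1.PodList{}
	options := []client.ListOption{
		client.MatchingFields(map[string]string{podUIDIndexField: string(event.InvolvedObject.UID)}),
		client.InNamespace(event.InvolvedObject.Namespace),
	}
	if err := c.List(ctx, &pods, options...); err != nil {
		return false, err
	}
	if len(pods.Items) == 0 {
		return false, nil
	}
	owner := metav1.GetControllerOf(&pods.Items[0])
	return owner != nil &&
		owner.APIVersion == rayiov1alpha1.GroupVersion.String() &&
		owner.Kind == "RayCluster", nil
}

As noted above, this would keep eventReconcile simpler but adds a cache read per event, which is why the common.RayNodeLabelKey check is kept for now.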

@wilsonwang371 (Collaborator, Author)

> If you do not have time to update & test this PR tomorrow, I can merge this PR and open a new patch to fix the issues.

If you have time to help with this, that would be great.

@Jeffwan (Collaborator) commented Mar 28, 2023

@kevin85421 @architkulkarni If there are minor changes to be applied, I think we can check in the code and create separate PRs for the improvements. If there are minor changes you can suggest directly on the PR, feel free to use https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/incorporating-feedback-in-your-pull-request. That could reduce response time.

@kevin85421 (Member)

Merging this PR; I will file a separate PR to fix and improve some minor issues.

@kevin85421 kevin85421 merged commit 3c53af6 into ray-project:master Mar 28, 2023
@kevin85421 kevin85421 mentioned this pull request Mar 29, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Fix issue with operator OOM restart
Labels: None yet
Projects: None yet
Development

Successfully merging this pull request may close these issues.

[Bug][Operator][Leader election?] Operator failure and restart, logs attached
4 participants