feat: update RayCluster `.status.reason` field with pod creation error #639
Conversation
This requires an update to the RayCluster CRD, which I know you like to avoid. The only other way is to set this info as an annotation or label. I think this case is worth updating the CRD to have a new field. Just have to get it right the first time. :) Lmk what you think.
Adding optional fields is safe for backwards compatibility.
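For reference, a minimal sketch of what the new optional field could look like on the status struct (the `Reason` field matches this PR's description; the surrounding field and comments are illustrative, not the exact kuberay source):

```go
// RayClusterStatus defines the observed state of RayCluster.
type RayClusterStatus struct {
	// State reflects the overall lifecycle phase of the cluster, e.g. "failed".
	State ClusterState `json:"state,omitempty"`
	// Reason records the error that drove the cluster into its current state,
	// e.g. a Pod creation failure. Marked omitempty so objects written by
	// older operator versions stay valid: adding an optional field is
	// backwards compatible.
	Reason string `json:"reason,omitempty"`
}
```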
For comparison, here's the behavior of K8s Deployments when they can't create Pods due to an exhausted ResourceQuota. Listing Deployments shows whether each is READY with actual/desired Pod counts. Getting the YAML shows a list of conditions, and similar detailed info for the underlying ReplicaSet. Describing the ReplicaSet shows events about the root cause.
Might be nice to add a READY column with actual/desired Pods for RayCluster, and also events?
```diff
@@ -214,6 +214,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
 		if updateErr := r.updateClusterState(instance, rayiov1alpha1.Failed); updateErr != nil {
 			r.Log.Error(updateErr, "RayCluster update state error", "cluster name", request.Name)
 		}
+		if updateErr := r.updateClusterReason(instance, err.Error()); updateErr != nil {
+			r.Log.Error(updateErr, "RayCluster update reason error", "cluster name", request.Name)
+		}
```
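For context, a minimal sketch of what the `updateClusterReason` helper called above might look like (hypothetical body; the actual implementation in the PR may differ):

```go
// updateClusterReason writes clusterReason to the RayCluster's
// .status.reason field via the status subresource. It skips the write when
// the value is unchanged, so a no-op update event isn't generated on every
// reconcile pass.
func (r *RayClusterReconciler) updateClusterReason(instance *rayiov1alpha1.RayCluster, clusterReason string) error {
	if instance.Status.Reason == clusterReason {
		return nil
	}
	instance.Status.Reason = clusterReason
	return r.Status().Update(context.Background(), instance)
}
```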
This is a good start. We may want to expose more for other states.
Agreed, conditions would be great to have. A single status field can't show the user the entire state machine, but conditions can.
@davidxia Is this PR ready? If so, please remove the WIP status.
@Jeffwan would you like me to implement conditions in this PR instead of status, or do you want conditions to be in addition to status and implemented in a future PR?
@davidxia I think future PRs sound good. I saw you marked the PR as WIP and wasn't sure whether you wanted to implement it in the same PR or not.
This is a good start. Before merging, it would be great to add a simple unit test. As a follow-up in a different PR, we could emit an event with the RayCluster as subject -- I think that would get the error into the kubectl describe output.
Thanks, I'll add some tests and mark as ready for review. Any tips for how to make the "Go-build-and-test / Compatibility Test - 2.0.0 (pull_request)" check pass?
I'd ignore it. We need to smooth out some tests of experimental features.
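A minimal sketch of what such a unit test could look like, using controller-runtime's fake client (hypothetical: the import paths, reconciler construction, and the `updateClusterReason` helper sketched above are assumptions, not the exact code merged here):

```go
package controllers

import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"

	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1" // illustrative import path
)

func TestUpdateClusterReason(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := rayiov1alpha1.AddToScheme(scheme); err != nil {
		t.Fatal(err)
	}

	cluster := &rayiov1alpha1.RayCluster{
		ObjectMeta: metav1.ObjectMeta{Name: "test-cluster", Namespace: "default"},
	}
	r := &RayClusterReconciler{
		Client: fake.NewClientBuilder().WithScheme(scheme).WithObjects(cluster).Build(),
	}

	reason := `pods "test-cluster-head" is forbidden: exceeded quota`
	if err := r.updateClusterReason(cluster, reason); err != nil {
		t.Fatalf("updateClusterReason: %v", err)
	}
	if cluster.Status.Reason != reason {
		t.Errorf("got reason %q, want %q", cluster.Status.Reason, reason)
	}
}
```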
Thanks, added a unit test and marked as ready for review. (Follow-up issue: "ray-operator emit RayCluster events and .status.conditions[] for Pod creation errors".)

I tested manually on my K8s cluster (deploy the new operator image, apply a ResourceQuota that's already exceeded, try to create a RayCluster in that namespace) and noticed the following difference in behavior. With the v0.3.0 release, the reconcile loop retried once every second and backed off pretty quickly on errors. With this change, for some reason the operator tries to reconcile a failed RayCluster every 100ms with no backoff. See the before and after logs here. Any idea why this is happening and/or if it's an issue? I can't see what changes I made here that would alter this behavior.
Do you see the problematic behavior in current master? Is it possible another commit caused it?
@kevin85421 if you have some time, could you check whether the undesirable reconciler behavior described a couple of comments up occurs in master?
I've opened an issue to track investigation of the overactive reconciliation loop:
Seems master is fine. I hope we can help debug the PR branch so that it can be merged.
@DmitriGekhtman after rebasing this PR on latest master, building the image, and testing the new operator on my cluster, I can't repro the original tight reconcile loop. Now it behaves as we expect. 🤷♂️
Actually nvm, I can still repro it when I update the RayCluster CRD. Looking more...
```diff
@@ -214,6 +214,9 @@ func (r *RayClusterReconciler) rayClusterReconcile(request ctrl.Request, instanc
 		if updateErr := r.updateClusterState(instance, rayiov1alpha1.Failed); updateErr != nil {
 			r.Log.Error(updateErr, "RayCluster update state error", "cluster name", request.Name)
 		}
+		if updateErr := r.updateClusterReason(instance, err.Error()); updateErr != nil {
```
@DmitriGekhtman The reconcile loop doesn't cease because the `err.Error()` string here is always different. It begins with `pods "POD_NAME" is forbidden: exceeded quota: quota...` and `POD_NAME` keeps changing. So updating `.status.reason` queues the RayCluster for another reconciliation, which updates `.status.reason`, which queues it again, on and on. I'll make the string here stable, which should fix it.
Ah, that's interesting! Outputting the same error sgtm.
@DmitriGekhtman this is ugly and brittle. Is there a way to have the reconcile loop ignore changes to `.status.reason`, or perhaps `.status` altogether? I want to do something like this. I think the relevant code is here and/or here? But I'm not sure and need help.
Hm, maybe we could just set the reason to something like "Pod reconcile failed" and then display the full error by emitting an event, which would be easily accessible via kubectl describe.
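A minimal sketch of what that could look like at the call site shown above (assuming a `Recorder record.EventRecorder` field on the reconciler, obtained via `mgr.GetEventRecorderFor(...)`; the event reason string is illustrative):

```go
// corev1 is "k8s.io/api/core/v1".
// Keep .status.reason stable so the status write can't retrigger
// reconciliation, and put the full, per-Pod error text into an event,
// which surfaces in `kubectl describe raycluster`.
if updateErr := r.updateClusterReason(instance, "Pod reconcile failed"); updateErr != nil {
	r.Log.Error(updateErr, "RayCluster update reason error", "cluster name", request.Name)
}
r.Recorder.Eventf(instance, corev1.EventTypeWarning, "PodReconciliationError",
	"failed to create Pods for RayCluster: %v", err)
```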
Sgtm, I’ll take a look soon
Let's use `Or` to also watch for changes to labels and annotations. CR annotations are currently taken into account by the operator. Labels are not taken into account for now, but they might be later.
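A minimal sketch of that event-filter wiring with controller-runtime predicates (illustrative, not the exact diff; `LabelChangedPredicate` and `AnnotationChangedPredicate` are the stock predicates covering the label/annotation cases above):

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}).
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		// Reconcile when .spec changes (metadata.generation bumps) or when
		// labels/annotations change; ignore pure status updates so a status
		// write can't requeue the same object forever.
		WithEventFilter(predicate.Or(
			predicate.GenerationChangedPredicate{},
			predicate.LabelChangedPredicate{},
			predicate.AnnotationChangedPredicate{},
		)).
		Complete(r)
}
```

One caveat: `WithEventFilter` applies to every watched resource, including the owned Pods and Services, which is the side effect that ray-project#872 below later had to undo.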
The con of ignoring status changes is that extraneous status changes are not quickly fixed. I think that's a minor issue, though.
Thanks, updated. Also included emitting an event since it's small and straightforward. Lmk if there are any good examples of how to test events or the label/annotation update-triggers-reconciliation logic.
Tested on my cluster. Updating labels or annotations triggers reconciliation as expected. Events also show up as expected. See gist
Nice, thanks!
fixes ray-project#603

### example before if Pod can't be created due to ResourceQuota exceeded

```
kubectl get rayclusters dxia-test2 -o yaml
...
status:
  state: failed

kubectl describe rayclusters dxia-test2
...
Status:
  State:  failed
```

### example after if Pod can't be created due to ResourceQuota exceeded

```
kubectl get rayclusters dxia-test2 -o yaml
...
status:
  reason: 'pods "dxia-test2-head-lbvdc" is forbidden: exceeded quota: quota, requested:
    limits.cpu=15,requests.cpu=15, used: limits.cpu=60,requests.cpu=60, limited:
    limits.cpu=10,requests.cpu=10'
  state: failed

kubectl describe rayclusters dxia-test2
...
Status:
  Reason:  pods "dxia-test2-head-9mdm5" is forbidden: exceeded quota: quota, requested: limits.cpu=15,requests.cpu=15, used: limits.cpu=60,requests.cpu=60, limited: limits.cpu=10,requests.cpu=10
  State:   failed
```
…sn't change by adding `predicate.GenerationChangedPredicate` to the reconciler.

> This predicate will skip update events that have no change in the object's
> metadata.generation field. The metadata.generation field of an object is
> incremented by the API server when writes are made to the spec field of an
> object. This allows a controller to ignore update events where the spec is
> unchanged, and only the metadata and/or status fields are changed.

https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/predicate?utm_source=godoc#GenerationChangedPredicate

Without this change, the controller gets stuck in a tight loop repeatedly updating the same RayCluster when a ResourceQuota is exhausted. The controller updates the RayCluster .status, gets an event about the forbidden Pod with a different name, updates the .status, gets an event, on and on.
and emit an event on the RayCluster for Pod reconciliation errors.
Thanks, this is a great improvement!
Lmk if this can be included in the upcoming 0.4.0 release.
It will probably be included.
Thanks so much!
ray-project#639 accidentally applied event filters to the child resources Pods and Services. This change does not filter Pod or Service related events, which means Pod updates will trigger RayCluster reconciliation again. closes ray-project#872
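A minimal sketch of the scoped alternative (hypothetical: `builder.WithPredicates` attaches the filter to the RayCluster watch only, so Pod and Service events pass through unfiltered):

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// The predicates now apply only to RayCluster events, not to the
		// owned Pods and Services watched below.
		For(&rayiov1alpha1.RayCluster{}, builder.WithPredicates(predicate.Or(
			predicate.GenerationChangedPredicate{},
			predicate.LabelChangedPredicate{},
			predicate.AnnotationChangedPredicate{},
		))).
		Owns(&corev1.Pod{}).
		Owns(&corev1.Service{}).
		Complete(r)
}
```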
feat: update RayCluster `.status.reason` field with pod creation error (ray-project#639)

Makes RayCluster errors related to Pod creation more apparent to the user.
Why are these changes needed?
Makes RayCluster errors related to Pod creation more apparent to the user.
Example before, if a Pod can't be created due to an exceeded ResourceQuota
Example after, if a Pod can't be created due to an exceeded ResourceQuota
Related issue number
fixes #603
Checks