[Bug][Operator][Leader election?] Operator failure and restart, logs attached #601
Comments
We are observing this as well. It seems likely that the issue is a result of this line, which watches every event in the cluster (and consequently triggers a reconcile on each one): https://github.com/kubernetes-sigs/controller-runtime/blob/f6f37e6cc1ec7b7d18a266a6614f86df211b1a0a/pkg/handler/enqueue.go#L35
We are now running a test with this line omitted, and can confirm that the spurious logs of unrelated resources are gone and memory usage is lower.
To watch events related to child objects only, this seems more appropriate, I believe: https://github.com/kubernetes-sigs/controller-runtime/blob/f6f37e6cc1ec7b7d18a266a6614f86df211b1a0a/pkg/handler/enqueue_owner.go#L42
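For context, here is a minimal sketch of the two handler choices, using the struct-based controller-runtime API from the commit linked above. The reconciler type, package name, and KubeRay import path are illustrative assumptions, not the actual KubeRay code:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	// Assumed import path for the RayCluster API types.
	rayiov1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1"
)

// RayClusterReconciler is a stand-in for the operator's reconciler type.
type RayClusterReconciler struct {
	client.Client
}

func (r *RayClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *RayClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	c, err := ctrl.NewControllerManagedBy(mgr).
		For(&rayiov1alpha1.RayCluster{}).
		Build(r)
	if err != nil {
		return err
	}

	// Problematic: EnqueueRequestForObject enqueues a reconcile for
	// *every* Event in the cluster, related to a RayCluster or not.
	//
	// err = c.Watch(&source.Kind{Type: &corev1.Event{}},
	// 	&handler.EnqueueRequestForObject{})

	// Scoped: EnqueueRequestForOwner only enqueues reconciles for Pods
	// whose controller owner reference points at a RayCluster.
	return c.Watch(&source.Kind{Type: &corev1.Pod{}},
		&handler.EnqueueRequestForOwner{
			OwnerType:    &rayiov1alpha1.RayCluster{},
			IsController: true,
		})
}
```

The builder's Owns() helper wires up the same EnqueueRequestForOwner handler under the hood, which is the usual way to watch child objects.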
Ok, watching events was not a good idea. cc @kevin85421 @Jeffwan @wilsonwang371 -- I'd recommend simplifying the implementation of the fault tolerance feature so that it doesn't do this.
Sounds reasonable. I will take a look at this and discuss it with @Jeffwan.
This PR seems to suggest that the event watching was meant to support recovering from readiness/liveness probe failures. My understanding is that this is already covered by watching the child Pods. I’m curious if I’m missing something, and whether I should be looking out for anything in particular while we run a test with event watching disabled.
We are filtering out the events that we do not care about here.
So I think we may need to change this part so that it only processes the Pod events it cares about.
Let's try filtering events with owner type = RayCluster; that should be enough to bring down the memory usage. A sketch of what such a filter could look like follows below.
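Here is one way such a filter might look as a controller-runtime predicate. Note that an Event only records the object it is about, not that object’s owner, so the check below is an assumption about how “owner type = RayCluster” would be approximated in a first pass:

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// eventFilter drops Events before they reach the event handler. An Event
// only carries a reference to the object it is about (InvolvedObject), so
// this first pass keeps Pod events only; whether that Pod is actually owned
// by a RayCluster still has to be checked after fetching the Pod, e.g. via
// metav1.GetControllerOf on its owner references.
var eventFilter = predicate.NewPredicateFuncs(func(obj client.Object) bool {
	ev, ok := obj.(*corev1.Event)
	if !ok {
		return false
	}
	return ev.InvolvedObject.Kind == "Pod"
})

// Usage: passed as an extra argument when setting up the watch, so that
// filtered-out Events never enqueue a reconcile:
//
//	err = c.Watch(&source.Kind{Type: &corev1.Event{}},
//		&handler.EnqueueRequestForObject{}, eventFilter)
```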
We can land a hotfix PR for this issue for now, but eventually we need to stop watching events altogether, because operator operations should be idempotent and stateless. Events are time-sensitive, and deleting Pods based on events is not idempotent.
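To make the idempotency point concrete, here is a level-triggered sketch that derives the delete decision from the current state of the Pods rather than from Events; re-running it against the same cluster state yields the same result. It reuses the RayClusterReconciler stand-in from the sketch above, and the ray.io/cluster label key and isUnhealthy helper are hypothetical:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePodHealth looks at the Pods as they are *now*, rather than
// reacting to a stream of time-sensitive Events. Deleting an already-gone
// Pod is tolerated, so repeated runs converge to the same state.
func (r *RayClusterReconciler) reconcilePodHealth(ctx context.Context, namespace, clusterName string) error {
	var pods corev1.PodList
	// "ray.io/cluster" is an illustrative label key for selecting the
	// cluster's child Pods.
	if err := r.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if !isUnhealthy(pod) { // hypothetical health check on pod status
			continue
		}
		if err := r.Delete(ctx, pod); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}

// isUnhealthy is a placeholder for whatever readiness/liveness-based
// check the operator applies.
func isUnhealthy(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionFalse {
			return true
		}
	}
	return false
}
```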
Hi Jeev, are you able to try our patched version later and help us verify this?
Yes, I’ll deploy from the PR branch tomorrow and let it run for a while to collect metrics. Thanks for working on this so promptly! :)
Getting lots of spurious logs related to events from unrelated objects (on
Thanks. Let me take a look.
Hi @jeevb, I manually tested the latest code on my machine, and it is working as expected now. The issue you are seeing is caused by too many debug messages that I forgot to disable. You can try the latest patch and see the result.
Hi @jeevb, can you take a look at this again and confirm there are no more extra logs?
Yes, will test today and report back! |
Not seeing the spam of messages anymore, but still seeing these ~10 log messages associated with unrelated pods at startup:
This is generally OK: it covers the case where we see an unhealthy Pod but are not going to act on it. If this is also something we don't want, we can remove it later.
Everything looks good so far. Anything in particular I should test? |
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I was running the nightly KubeRay operator for development purposes and observed a failure with some strange error messages logged.
The logs had roughly 1000 lines of "Read request instance not found error" with names of pods unrelated to KubeRay, followed by some complaints about leases and leader election, followed by a crash and restart.
Ideally, there shouldn't be any issues related to leader election, since we don't support leader election right now.
Only the last few lines of "Read request instance not found error!" are pasted below.
Reproduction script
I don't know how to reproduce this yet.
Anything else
I don't know yet.
Are you willing to submit a PR?