Revisit `crash_loop_limit` property #1068
Comments
At the risk of going off on an (already understood) tangent, isn't the more fundamental issue to solve here just "Redpanda should be resilient to repeated quick restarts"? As an example, even with this feature, we'd still be vulnerable if something external restarted RP many times, right? I appreciate that's not a conversation for the helm-charts repo, but I'm raising it here as I wonder if we should be linking this issue across to something in the core repo?
Talked with @mattschumpert about this one and I think we're going to go with a band-aid solution for now. We should add a bash script at the start of Redpanda that checks if the crash_loop_file exists. If so, log a warning explaining what it means.
So I finally got around to giving this a shot, and didn't realize until I got to testing it that the crash loop tracking file isn't something we can easily work with. In order to add a sleep delay, we'd have to parse that file, which I don't think we should do; it's an internal format. @mattschumpert @RafalKorepta any thoughts on what we should do with this? Maybe this turns into an ask of the core team to either be resilient to crash looping or add a sleep in at the Redpanda level? I don't think there's a way to hook into Redpanda post-crash, either.
I think we should be completely OK parsing this internal format, since this is a Redpanda-specific feature and this is a Redpanda-specific Helm chart. For the rest, I suggest speaking with @mmaslankaprv about what we can do here.
Our best bet in that case is probably to put something into
It feels like Redpanda is going too far into managing runtime concerns that are usually left to a process supervisor such as systemd, the kubelet, etc.
@chrisseto this needs to go back through @mmaslankaprv for a rethinking of how this should work in K8s. We've already proven we NEED crash loop detection on by default (that part is not negotiable). For the rest, if we want to change the approach of that file being there or not, or what it looks like, please work with @mmaslankaprv on what we should do. Maybe worth a live discussion if needed.
FWIW, we had to disable this feature in Azure because it was getting in the way. So I'm not sure it is truly needed, and if it is, it probably points to a design or implementation smell in Redpanda.
Do we have an idea of how many restarts are too many? I know we touched on this, @mattschumpert, but I don't recall the number. If we could raise the default in the chart to something that's less likely to trigger (5 is VERY low when taking Kubernetes backoffs into account), we could move this ticket a bit further into the backlog. Just as a spitball, what if we went to 50?
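For reference, raising the default would be a one-line values override. A sketch, assuming the chart exposes Redpanda node configuration under `config.node` (the exact key path may differ by chart version, so verify against the chart's values reference):

```yaml
# values.yaml override (key path is an assumption; verify against the chart)
config:
  node:
    crash_loop_limit: 50  # chart default is 5; a higher value tolerates Kubernetes backoff churn
```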
@c4milo Brand new clusters never need it. You should also speak to @mmaslankaprv, who has the experience to explain why it's needed.
@chrisseto The number was driven by incidents, and there is no sense in making it very high, as you'd just cause more damage and get into unrecoverable situations. I'll stop here and let you speak with the replication team to jointly design the proper solution.
@mattschumpert Just to note that the current implementation also causes significant issues during incidents, where undoing the crash loop protection to bring a cluster back online after an issue is resolved requires a good deal of extra work. This is true in Cloud, but doubly so in self-hosted environments where the same tools may not be available.
@mattschumpert Thanks for the additional insight on the default! There might be an easier answer, now that I think about it. The node/broker config is just a ConfigMap mounted into the Pod, and one way to get RP out of a crash loop is to change the node/broker config. In theory, one could just increment the number in the ConfigMap (or even make a simple whitespace change, IIUC) to recover a crashing node. If we test that and it works as expected, do we think a docs update would be a possible resolution to this issue?
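A hypothetical recovery sequence along those lines, assuming a release named `redpanda` in the `redpanda` namespace (the ConfigMap and Pod names are illustrative and vary by installation):

```
# Pull down the mounted broker config (names are illustrative).
kubectl -n redpanda get configmap redpanda -o yaml > redpanda-config.yaml

# Edit redpanda-config.yaml: bump crash_loop_limit, or make any trivial
# whitespace change, so the broker sees a "different" node config on start.
kubectl -n redpanda apply -f redpanda-config.yaml

# Restart the crash-looping broker so it re-reads the config.
kubectl -n redpanda delete pod redpanda-0
```

This requires a live cluster, so treat it as a sketch to validate rather than a tested procedure.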
In #718 we set the default value of `crash_loop_limit` to 5 to avoid accidentally accumulating many short log segments. While an excellent mitigation, having to wait an entire hour after an accidental misconfiguration is a non-starter. We should either set `crash_loop_limit` to `null` and rely on Kubernetes backoffs, or add a bash shim that delays starting the Redpanda process for a few seconds if the crash limiter file exists. The latter would provide enough of a window of opportunity to A) log what is happening and B) exec in and delete the file.

cc @vuldin for thoughts, as we recently had to suffer due to this feature :) and @JakeSCahill, who is writing up documentation related to this.
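The bash-shim option could look something like the following. This is a minimal sketch: the file path, environment variable names, and default delay are all assumptions, and the real chart would need to wire this into the container entrypoint.

```shell
#!/usr/bin/env bash
# Pre-start shim sketch: if the crash-loop tracking file exists, warn and
# pause so an operator can exec into the container and remove it before
# Redpanda starts again. Path and delay are illustrative assumptions.
maybe_delay_start() {
  local crash_file="${CRASH_FILE:-/var/lib/redpanda/data/startup_log}"
  local delay="${STARTUP_DELAY:-30}"
  if [ -f "$crash_file" ]; then
    echo "WARNING: $crash_file exists: Redpanda crashed on a recent start." >&2
    echo "Pausing ${delay}s; delete the file to reset the crash counter." >&2
    sleep "$delay"
  fi
}

maybe_delay_start
# exec /opt/redpanda/bin/rpk redpanda start "$@"   # hand off to the real entrypoint
```

The window between the warning and the `sleep` expiring is what gives operators the chance to `kubectl exec` in and delete the file.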
JIRA Link: K8S-110