Revisit crash_loop_limit property #1068

Open
chrisseto opened this issue Mar 1, 2024 · 13 comments

Comments

@chrisseto
Contributor

chrisseto commented Mar 1, 2024

In #718 we set the default value of crash_loop_limit to 5 to avoid accidentally accumulating many short log segments. While that's an excellent mitigation, having to wait an entire hour after an accidental misconfiguration is a non-starter.

We should either set crash_loop_limit to null and rely on Kubernetes backoffs, or add a bash shim that delays starting the Redpanda process for a few seconds if the crash-limiter file exists. The latter would provide enough of a window of opportunity to (a) log what is happening and (b) exec in and delete the file.

cc @vuldin for thoughts, as we recently had to suffer due to this feature :), and @JakeSCahill, who is writing up documentation related to this.

JIRA Link: K8S-110

@JFlath
Contributor

JFlath commented May 7, 2024

At the risk of going off on an (already understood) tangent: isn't the more fundamental issue to solve here just that "Redpanda should be resilient to repeated quick restarts"? As an example, even with this feature, we'd still be vulnerable if something external restarted RP many times, right?

I appreciate that's not a conversation for the helm charts repo, but I'm raising it here as I wonder if we should link this issue to something in the core repo?

@chrisseto
Contributor Author

Talked with @mattschumpert about this one, and I think we're going to go with a band-aid solution for now. We should add a bash script at the start of redpanda that checks whether the crash_loop_file exists. If it does, log a warning explaining what it means and the kubectl command users can run to remove the file, then sleep for ~30-60s to give them plenty of time to remove said file.
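
For illustration, a rough sketch of what that band-aid could look like; the file path, grace period, and namespace/pod names are placeholders I'm assuming, not anything the chart actually ships (and, as the next comment explains, a plain existence check on the file turns out not to be enough):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values; the real path and container layout may differ.
STARTUP_LOG="${STARTUP_LOG:-/var/lib/redpanda/data/startup_log}"  # assumed crash-loop tracking file
GRACE_SECONDS="${GRACE_SECONDS:-45}"                               # window to intervene

if [ -f "${STARTUP_LOG}" ]; then
  echo "WARNING: ${STARTUP_LOG} exists; redpanda may refuse to start if crash_loop_limit was hit." >&2
  echo "To reset the crash-loop counter, run (adjust namespace/pod):" >&2
  echo "  kubectl exec -n <namespace> <pod> -c redpanda -- rm ${STARTUP_LOG}" >&2
  echo "Sleeping ${GRACE_SECONDS}s before starting redpanda..." >&2
  sleep "${GRACE_SECONDS}"
fi

# exec so redpanda replaces this shell and receives signals directly.
exec rpk redpanda start "$@"
```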

@chrisseto
Contributor Author

So I finally got around to giving this a shot and didn't realize until I got to testing it that the startup_log file always exists. It contains the metadata required to decide whether the crash loop limit has been hit.

To add a sleep delay, we'd have to parse that file, which... I don't think we should. It's an internal format.

@mattschumpert @RafalKorepta any thoughts on what we should do with this? Maybe this turns into an ask of the core team to either be resilient to crash looping or add a sleep at the redpanda level?

I don't think there's a way to hook into redpanda post-crash, as we'd need to use exec to correctly launch redpanda, and rpk is doing the same thing.

@mattschumpert

I think we should be completely OK to parse this internal format, since this is a redpanda-specific feature and this is a redpanda-specific helm chart. For the rest, I suggest speaking with @mmaslankaprv about what we can do here.

@chrisseto
Contributor Author

Our best bet in that case is probably to put something into rpk? Otherwise we're going to be parsing binary in bash, which I'd really rather not depend on. Even in the case of rpk, the definition lives in redpanda and seems specific to how C++ marshals structs. Keep me posted on Michał's opinion!

@c4milo
Member

c4milo commented Jul 1, 2024

It feels like Redpanda is going too far into managing runtime concerns that are usually left to a process supervisor, such as systemd, the kubelet, etc.

@mattschumpert

@chrisseto this needs to go back through @mmaslankaprv for a re-thinking of how this should work in K8s.

We've already proven we NEED crash loop detection on by default (that part is not negotiable).

For the rest, if we want to change the approach (whether that file is there, isn't there, or what it looks like), please work with @mmaslankaprv on what we should do. Maybe it's worth a live discussion if needed.

@c4milo
Member

c4milo commented Jul 2, 2024

FWIW, we had to disable this feature in Azure because it was getting in the way. So I'm not sure it is truly needed, and if it is, it probably points to a design or implementation smell in Redpanda.

@chrisseto
Contributor Author

Do we have an idea of how many restarts are too many? I know we touched on this, @mattschumpert, but I don't recall the number. If we could raise the default in the chart to something that's less likely to trigger (5 is VERY low) when taking Kubernetes backoffs into account, we can move this ticket a bit further into the backlog.

Just as a spitball, what if we went to 50?
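
For reference, overriding the chart default would look roughly like this; the config.node.crash_loop_limit key path is my assumption, so verify it against the values.yaml of the chart version in use:

```bash
# Raise the limit to 50 (key path assumed; check your chart's values.yaml).
helm upgrade redpanda redpanda/redpanda \
  --namespace redpanda \
  --reuse-values \
  --set config.node.crash_loop_limit=50

# Or disable the limit entirely and fall back to Kubernetes' CrashLoopBackOff:
#   --set config.node.crash_loop_limit=null
```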

@mattschumpert

@c4milo brand new clusters never need it.

You should also speak to @mmaslankaprv, who has the experience to explain why it's needed.

@mattschumpert

mattschumpert commented Jul 2, 2024

@chrisseto the number was driven by incidents, and there is no sense in making it very high, as you just cause more damage and get into unrecoverable situations. I'll stop here and let you speak with the replication team to jointly design the proper solution.

@JFlath
Contributor

JFlath commented Jul 8, 2024

@mattschumpert Just to note that the current implementation also causes significant issues in incidents: it requires a good deal of extra work to undo the crash loop protection and bring a cluster back online after an issue is resolved. This is true in Cloud, but doubly so in self-hosted environments, where the same tools may not be available.

@chrisseto
Contributor Author

@mattschumpert thanks for the additional insight on the default!

There might be an easier answer now that I think about it. The node/broker config is just a ConfigMap mounted into the Pod, and one way to get Redpanda out of a crash loop is to change the node/broker config. In theory, one could just increment the number in the ConfigMap (or even make a simple whitespace change, IIUC) to recover a crashing node.
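
For illustration, a rough sketch of that recovery path; the ConfigMap name, pod name, namespace, and file path below are placeholders, not verified against the chart:

```bash
# Bump crash_loop_limit (or make any edit) in the mounted node config; the
# crash-looping broker should pick it up on a subsequent restart once the
# ConfigMap change propagates to the volume (this can take a minute or so).
kubectl edit configmap redpanda -n redpanda

# Alternative discussed earlier in this thread: remove the crash-loop
# tracking file directly, if you can exec in while the container is up.
kubectl exec -n redpanda redpanda-0 -c redpanda -- rm /var/lib/redpanda/data/startup_log
```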

If we test that and it works as expected/well, do we think a docs update would be a possible resolution to this issue?
