Revisit crash_loop_limit property #1068

Open
chrisseto opened this issue Mar 1, 2024 · 13 comments

Comments

@chrisseto
Contributor

chrisseto commented Mar 1, 2024

In #718 we set the default value of crash_loop_limit to 5 to avoid accidentally accumulating many short log segments. While that's an excellent mitigation, having to wait an entire hour after an accidental misconfiguration is a non-starter.

We should either set crash_loop_limit to null and rely on Kubernetes backoffs, or add a bash shim that delays starting the Redpanda process for a few seconds if the crash-limiter file exists. The latter would provide enough of a window of opportunity to (a) log what is happening and (b) exec in and delete the file.

cc @vuldin for thoughts, as we recently had to suffer due to this feature :), and @JakeSCahill, who is writing up documentation related to this.

JIRA Link: K8S-110

@JFlath
Contributor

JFlath commented May 7, 2024

At the risk of going off on an (already understood) tangent: isn't the more fundamental issue to solve here just that "Redpanda should be resilient to repeated quick restarts"? As an example, even with this feature, we'd still be vulnerable if something external restarted RP many times, right?

I appreciate that's not a conversation for the helm charts repo, but I'm raising it here as I wonder if we should link this issue to something in the core repo?

@chrisseto
Contributor Author

Talked with @mattschumpert about this one, and I think we're going to go with a band-aid solution for now. We should add a bash script at the start of redpanda that checks whether the crash_loop_file exists. If it does, log a warning explaining what it means and the kubectl command users can run to remove the file, then sleep for ~30-60s to give them plenty of time to remove said file.
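
For illustration, a rough sketch of what that band-aid could look like; the file path, grace period, and namespace/pod names are placeholders I'm assuming, not anything the chart actually ships (and, as the next comment explains, a plain existence check on the file turns out not to be enough):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values; the real path and container layout may differ.
STARTUP_LOG="${STARTUP_LOG:-/var/lib/redpanda/data/startup_log}"  # assumed crash-loop tracking file
GRACE_SECONDS="${GRACE_SECONDS:-45}"                               # window to intervene

if [ -f "${STARTUP_LOG}" ]; then
  echo "WARNING: ${STARTUP_LOG} exists; redpanda may refuse to start if crash_loop_limit was hit." >&2
  echo "To reset the crash-loop counter, run (adjust namespace/pod):" >&2
  echo "  kubectl exec -n <namespace> <pod> -c redpanda -- rm ${STARTUP_LOG}" >&2
  echo "Sleeping ${GRACE_SECONDS}s before starting redpanda..." >&2
  sleep "${GRACE_SECONDS}"
fi

# exec so redpanda replaces this shell and receives signals directly.
exec rpk redpanda start "$@"
```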

@chrisseto
Contributor Author

So I finally got around to giving this a shot and didn't realize until I got to testing it that the startup_log file always exists. It contains the metadata required to decide whether the crash loop limit has been hit.

To add a sleep delay, we'd have to parse that file, which... I don't think we should. It's an internal format.

@mattschumpert @RafalKorepta any thoughts on what we should do with this? Maybe this turns into an ask of the core team to either be resilient to crash looping or add a sleep at the redpanda level?

I don't think there's a way to hook into redpanda post-crash, as we'd need to use exec to correctly launch redpanda, and rpk is doing the same thing.

@mattschumpert

I think we should be completely OK to parse this internal format, since this is a redpanda-specific feature and this is a redpanda-specific helm chart. For the rest, I suggest speaking with @mmaslankaprv about what we can do here.

@chrisseto
Contributor Author

Our best bet in that case is probably to put something into rpk? Otherwise we're going to be parsing binary in bash, which I'd really rather not depend on. Even in the case of rpk, the definition lives in redpanda and seems specific to how C++ marshals structs. Keep me posted on Michał's opinion!

@c4milo
Member

c4milo commented Jul 1, 2024

It feels like Redpanda is going too far into managing runtime concerns that are usually left to a process supervisor, such as systemd, the kubelet, etc.

@mattschumpert

@chrisseto this needs to go back through @mmaslankaprv for a re-thinking of how this should work in K8s.

We've already proven we NEED crash loop detection on by default (that part is not negotiable).

For the rest, if we want to change the approach (whether that file is there, isn't there, or what it looks like), please work with @mmaslankaprv on what we should do. Maybe it's worth a live discussion if needed.

@c4milo
Member

c4milo commented Jul 2, 2024

FWIW, we had to disable this feature in Azure because it was getting in the way. So I'm not sure it is truly needed, and if it is, it probably points to a design or implementation smell in Redpanda.

@chrisseto
Contributor Author

Do we have an idea of how many restarts are too many? I know we touched on this, @mattschumpert, but I don't recall the number. If we could raise the default in the chart to something that's less likely to trigger (5 is VERY low) when taking Kubernetes backoffs into account, we can move this ticket a bit further into the backlog.

Just as a spitball, what if we went to 50?
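
For reference, overriding the chart default would look roughly like this; the config.node.crash_loop_limit key path is my assumption, so verify it against the values.yaml of the chart version in use:

```bash
# Raise the limit to 50 (key path assumed; check your chart's values.yaml).
helm upgrade redpanda redpanda/redpanda \
  --namespace redpanda \
  --reuse-values \
  --set config.node.crash_loop_limit=50

# Or disable the limit entirely and fall back to Kubernetes' CrashLoopBackOff:
#   --set config.node.crash_loop_limit=null
```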

@mattschumpert

@c4milo brand new clusters never need it.

You should also speak to @mmaslankaprv, who has the experience to explain why it's needed.

@mattschumpert

mattschumpert commented Jul 2, 2024

@chrisseto the number was driven by incidents, and there is no sense in making it very high, as you just cause more damage and get into unrecoverable situations. I'll stop here and let you speak with the replication team to jointly design the proper solution.

@JFlath
Contributor

JFlath commented Jul 8, 2024

@mattschumpert Just to note that the current implementation also causes significant issues in incidents: it requires a good deal of extra work to undo the crash loop protection and bring a cluster back online after an issue is resolved. This is true in Cloud, but doubly so in self-hosted environments, where the same tools may not be available.

@chrisseto
Contributor Author

@mattschumpert thanks for the additional insight on the default!

There might be an easier answer now that I think about it. The node/broker config is just a ConfigMap mounted into the Pod, and one way to get Redpanda out of a crash loop is to change the node/broker config. In theory, one could just increment the number in the ConfigMap (or even make a simple whitespace change, IIUC) to recover a crashing node.
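
For illustration, a rough sketch of that recovery path; the ConfigMap name, pod name, namespace, and file path below are placeholders, not verified against the chart:

```bash
# Bump crash_loop_limit (or make any edit) in the mounted node config; the
# crash-looping broker should pick it up on a subsequent restart once the
# ConfigMap change propagates to the volume (this can take a minute or so).
kubectl edit configmap redpanda -n redpanda

# Alternative discussed earlier in this thread: remove the crash-loop
# tracking file directly, if you can exec in while the container is up.
kubectl exec -n redpanda redpanda-0 -c redpanda -- rm /var/lib/redpanda/data/startup_log
```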

If we test that and it works as expected/well, do we think a docs update would be a possible resolution to this issue?
