
Fail fast for unrecoverable Kubernetes jobs #18159


Merged: 8 commits merged into apache:master on Jul 11, 2025

Conversation

@GWphua (Contributor) commented on Jun 19, 2025

Description

Currently, Kubernetes jobs will retry a total of 10 times to create a peon pod for ingestion tasks. However, some pain points surface during periods of heavy configuration, or when starting new clusters from similar (copy-pasted) Druid Helm charts:

  1. The error Error when looking for K8s pod with label[job-name=%s] is not useful on its own. We need to dig further with kubectl describe job ${jobname}. If errors happen overnight and the job has already been deleted by Kubernetes, it becomes very difficult to diagnose the root cause of ingestion task failures.
  2. Given the 10 retries for pod creation, along with Kubernetes exponential backoff, jobs will stay pending for ~2 minutes before failing and logging.

This PR improves error handling and retry logic in druid-kubernetes-overlord-extension when the Kubernetes job fails to create peon pods.

Enhanced Error Handling for Pod Creation Failures

  1. Provides more descriptive error messages when a Kubernetes job is unable to find its pod. The job's latest Kubernetes event message may shed some light on what is going wrong.
  2. Categorizes errors better by using DruidException with an appropriate persona (OPERATOR) and category (NOT_FOUND); a rough sketch follows this list.
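
As a rough illustration (not the PR's exact code), such an error could be raised through DruidException's builder API. getLastEventMessageForJob() below is a hypothetical stand-in for the event lookup the extension performs through the Kubernetes client.

import org.apache.druid.error.DruidException;

class PeonPodFailureSketch
{
  // Hypothetical helper: stands in for reading the job's latest Event message
  // via the Kubernetes client.
  private String getLastEventMessageForJob(String jobName)
  {
    return "Error creating: pods \"" + jobName + "\" is forbidden: ...";
  }

  // Fail with an operator-facing, NOT_FOUND-categorized error that carries the
  // latest Kubernetes event message for the job.
  void failWithLatestJobEvent(String jobName)
  {
    throw DruidException.forPersona(DruidException.Persona.OPERATOR)
                        .ofCategory(DruidException.Category.NOT_FOUND)
                        .build(
                            "Job[%s] failed to start up pods. Latest event: [%s]",
                            jobName,
                            getLastEventMessageForJob(jobName)
                        );
  }
}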

Important Note: you (or your K8s operators) will need to allow event logging for the Druid service account so that the fail-fast feature works properly.

If you somehow forget to allow event logging for your Druid service account, the behaviour of jobs that successfully spin up pods will not be affected, but you will get a Failed to get events for job[%s] warning and receive the old K8s pod with label[job-name=%s] not found message.

Improved Retry Logic Using Blacklisted Error Messages

Implements retry logic that avoids retrying when the failure is due to a known unrecoverable condition. A list of unrecoverable event-message substrings is specified under BLACKLISTED_PEON_POD_ERROR_MESSAGES. The shouldRetryStartingPeonPod() method checks exception messages (which carry the Kubernetes Job event messages) against this blacklist to determine whether retrying would be futile.

BLACKLISTED_PEON_POD_ERROR_MESSAGES currently includes "max limit to request ratio", which catches failures where a resource's (CPU, memory, etc.) limit-to-request ratio exceeds the allowed value.
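
A minimal sketch of what such a check could look like (the actual method lives in KubernetesPeonClient and may differ in detail):

import com.google.common.collect.ImmutableList;

class PeonPodRetrySketch
{
  static final ImmutableList<String> BLACKLISTED_PEON_POD_ERROR_MESSAGES =
      ImmutableList.of("max limit to request ratio");

  // Returns false when the failure message matches a known-unrecoverable
  // condition, so the caller fails fast instead of retrying pod creation.
  static boolean shouldRetryStartingPeonPod(String errorMessage)
  {
    if (errorMessage == null) {
      return true;
    }
    return BLACKLISTED_PEON_POD_ERROR_MESSAGES.stream().noneMatch(errorMessage::contains);
  }
}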

This is how the exception message looks now:

Job[XXX] failed to start up pods. Latest event: [Error creating: pods "XXX" is forbidden: memory max limit to request ratio per Container is 1, but provided ratio is 1.333333]

I have only added the one event message substring that hits me most often. Feel free to add to this constant should you discover more unrecoverable issues (or even make this list configurable?).

Release note

Task pods will now output Kubernetes job events when failing during the pod creation phase, and K8s tasks will fail fast under unrecoverable conditions.


Key changed/added classes in this PR
  • KubernetesPeonClient.java
  • DruidK8sConstants.java
  • KubernetesPeonClientTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

public static final ImmutableList<String> BLACKLISTED_PEON_POD_ERROR_MESSAGES = ImmutableList.of(
    // Catches the limit-to-request ratio constraint: https://github.com/kubernetes/kubernetes/blob/3e39d1074fc717a883aaf57b966dd7a06dfca2ec/plugin/pkg/admission/limitranger/admission.go#L359
    // "{resource} max limit to request ratio per Container is {value}, but provided ratio is {value}"
    "max limit to request ratio"
);
Member commented:
there might be some other cases where pods will not be created successfully, such as quota being exceeded

@GWphua (Contributor, Author) replied:
Added to the list according to Kubernetes code

@FrankChen021 (Member) left a comment:
lgtm

@GWphua (Contributor, Author) commented on Jul 10, 2025

Failed test unrelated to changes.

@FrankChen021 merged commit a576b3c into apache:master on Jul 11, 2025
140 of 142 checks passed
@GWphua deleted the handle-job-failures branch on July 11, 2025 at 08:53