Fail fast for unrecoverable Kubernetes jobs #18159
Conversation
```java
public static final ImmutableList<String> BLACKLISTED_PEON_POD_ERROR_MESSAGES = ImmutableList.of(
    // Catches the limit-to-request ratio constraint: https://github.com/kubernetes/kubernetes/blob/3e39d1074fc717a883aaf57b966dd7a06dfca2ec/plugin/pkg/admission/limitranger/admission.go#L359
    // "{resource} max limit to request ratio per Container is {value}, but provided ratio is {value}"
    "max limit to request ratio"
);
```
There might be some other cases where pods will not be created successfully, such as exceeded quota.
Added to the list according to the Kubernetes code.
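For illustration, the quota case could be appended to the constant like this. This is just a sketch: the "exceeded quota" substring is an assumption based on the message format emitted by Kubernetes' resourcequota admission controller, not necessarily what the PR adds.

```java
public static final ImmutableList<String> BLACKLISTED_PEON_POD_ERROR_MESSAGES = ImmutableList.of(
    // Limit-to-request ratio constraint (limitranger admission plugin):
    // "{resource} max limit to request ratio per Container is {value}, but provided ratio is {value}"
    "max limit to request ratio",
    // Quota constraint (resourcequota admission plugin), e.g.
    // "exceeded quota: {quota}, requested: {r}, used: {u}, limited: {l}"
    "exceeded quota"
);
```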
lgtm
Failed test unrelated to changes.
Description
Currently, Kubernetes jobs will retry a total of 10 times to create a peon pod for ingestion tasks. However, some pain points surface during periods of heavy configuration change, or when starting new clusters from similar Druid Helm charts (copy-paste): the error `Error when looking for K8s pod with label[job-name=%s]` is not useful on its own. We have to dig into what is actually going on by running `kubectl describe job ${jobname}`. Sometimes, if errors happen overnight and the job has already been deleted by Kubernetes, it is very difficult to diagnose the root cause of ingestion task failures.

This PR improves error handling and retry logic in `druid-kubernetes-overlord-extension` when the Kubernetes job fails to create peon pods.

Enhanced Error Handling for Pod Creation Failures
Pod creation failures now throw a `DruidException` with the appropriate persona (OPERATOR) and category (NOT_FOUND); see the sketch below.
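As a rough illustration of the above (not the PR's literal code; the message text, call site, and variable names are assumptions), building such an exception with Druid's `DruidException` builder could look like:

```java
import org.apache.druid.error.DruidException;

// Sketch: report the pod-creation failure to the operator, attaching the
// Kubernetes job event messages instead of a bare "pod not found" error.
// jobName and eventMessages are hypothetical local variables.
throw DruidException.forPersona(DruidException.Persona.OPERATOR)
                    .ofCategory(DruidException.Category.NOT_FOUND)
                    .build(
                        "K8s pod with label[job-name=%s] not found. Job events: %s",
                        jobName,
                        eventMessages
                    );
```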
Important note: you (or your K8s operators) will need to allow event logging for Druid service accounts for the fail-fast feature to work properly. If you somehow forget to allow event logging for your Druid service account, the behaviour of jobs that successfully spin up pods will not be affected, but you will get a warning `Failed to get events for job[%s]` and receive the old `K8s pod with label[job-name=%s] not found` message.
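For context, here is a sketch of how the job's events might be read with the fabric8 Kubernetes client the extension uses (the field selector and helper name are illustrative assumptions, not the PR's actual code):

```java
import io.fabric8.kubernetes.api.model.Event;
import io.fabric8.kubernetes.client.KubernetesClient;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: collect the event messages attached to a job. If the Druid service
// account is not allowed to read events, this call fails and the client falls
// back to the old "pod not found" message after logging the warning above.
static List<String> getJobEventMessages(KubernetesClient client, String namespace, String jobName)
{
  return client.v1().events()
               .inNamespace(namespace)
               .withField("involvedObject.name", jobName)
               .list()
               .getItems()
               .stream()
               .map(Event::getMessage)
               .collect(Collectors.toList());
}
```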
Improved Retry Logic Using Blacklisted Error Messages

Implements intelligent retry logic that avoids retrying when a failure is due to a known unrecoverable condition. A list of unrecoverable event-message substrings is specified in `BLACKLISTED_PEON_POD_ERROR_MESSAGES`. The `shouldRetryStartingPeonPod()` method checks exception messages (in the form of Kubernetes job event messages) against this blacklist to determine whether retrying would be futile (see the sketch below). `BLACKLISTED_PEON_POD_ERROR_MESSAGES` currently includes "max limit to request ratio", which catches failures where a resource (CPU, memory, etc.) request-to-limit ratio is beyond the allowable amount.
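A minimal sketch of that check, assuming the method receives the caught exception (the exact signature in the PR may differ):

```java
// Sketch: retry only when the failure message does not contain any known
// unrecoverable substring from BLACKLISTED_PEON_POD_ERROR_MESSAGES.
static boolean shouldRetryStartingPeonPod(Throwable t)
{
  final String message = t.getMessage();
  if (message == null) {
    return true;
  }
  return BLACKLISTED_PEON_POD_ERROR_MESSAGES.stream().noneMatch(message::contains);
}
```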
This is how the exception message will look now:

I have only added the one event-message substring that hits me most often. Feel free to add to this constant should you discover more unrecoverable issues (or even make this list configurable?).
Release note
Task pods will now output Kubernetes job events when failing during the pod creation phase, and K8s tasks will fail fast under unrecoverable conditions.
Key changed/added classes in this PR
KubernetesPeonClient.java
DruidK8sConstants.java
KubernetesPeonClientTest.java
This PR has: