
Job never becomes ready when re-attaching to existing env when pods for the job are deleted #6272

Closed
dnephin opened this issue Dec 3, 2023 · 0 comments · Fixed by #6276
Labels
bug Something isn't working

dnephin commented Dec 3, 2023

Background

I believe this is the same issue as described in #5822. That issue was closed over a year ago, so creating a new one.

We have a system that pre-creates tilt environments for tilt ci, so when tilt ci runs it resumes from an existing environment. We've noticed that occasionally those resumes will fail with a timeout. Using the new output from #6262 we noticed that when it happens it's always a job that never goes ready. Looking at logs we notice that these jobs had deleted pods.

#5859 added special handling for "completed jobs without pods" to kubernetesapply/reconciler.go, but that does not appear to work for processExitCondition.

It appears that session/conv.go:Reconciler.k8sRuntimeTarget sees a job with status == v1alpha1.RuntimeStatusOK, but because there are no pods, v1.PodSucceeded == phase is false, and the resource is marked as Active with Active.Ready = true instead of Terminated. In processExitCondition, the check

} else if res.State.Active != nil && (!res.State.Active.Ready || res.Type == v1alpha1.TargetTypeJob) {

means that ready == true is ignored for jobs, so they never appear ready once their pods are gone.
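As a minimal illustration of the diagnosis above (using simplified stand-in types — TargetType, ActiveState, Target, and stillPending here are illustrative, not the real v1alpha1 definitions), the quoted condition keeps any active Job pending regardless of its Ready flag:

```go
package main

import "fmt"

// Simplified stand-ins for the Tilt session types; names are
// illustrative, not the real v1alpha1 API.
type TargetType string

const TargetTypeJob TargetType = "job"

type ActiveState struct {
	Ready bool
}

type TargetState struct {
	Active *ActiveState
}

type Target struct {
	Type  TargetType
	State TargetState
}

// stillPending mirrors the condition quoted from processExitCondition:
// an active target counts as pending if it is not ready, OR if it is a
// Job. So a Job stuck in the Active state can never satisfy the exit
// condition, even when Ready is true.
func stillPending(res Target) bool {
	return res.State.Active != nil &&
		(!res.State.Active.Ready || res.Type == TargetTypeJob)
}

func main() {
	// A re-attached Job whose pods were deleted ends up Active with
	// Ready=true...
	job := Target{Type: TargetTypeJob, State: TargetState{Active: &ActiveState{Ready: true}}}
	// ...but the exit condition still treats it as pending, so
	// `tilt ci` waits until it times out.
	fmt.Println(stillPending(job)) // prints "true"
}
```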

Expected Behavior

Re-attaching to an environment with tilt ci recognizes a job is completed, even when the pods for that job are deleted.

Current Behavior

Today it will time out with something like:

Error: Timeout after 30m0s: 1 resources not ready (db-init:runtime)

Steps to Reproduce

  1. Using the example Tiltfile from job never becomes ready when re-attaching to existing env #5822
  2. tilt ci and wait for it to complete
  3. kubectl delete the pod/db-init- pod
  4. tilt ci --timeout 1m - this will timeout instead of going ready

Implementation Questions

I think we understand the problem well enough, but I'm not entirely sure how to fix it yet.

I think in k8sRuntimeTarget we can handle the case where no pod is found by krs.MostRecentPod and produce a state of Terminated. It seems like the condition is available at that time.
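A rough sketch of that idea, using collapsed stand-in types rather than the real Tilt API (RuntimeStatus, Pod, TargetState, and jobTargetState here are all illustrative assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the Tilt types; names are illustrative,
// not the real API.
type RuntimeStatus string

const RuntimeStatusOK RuntimeStatus = "ok"

type Pod struct {
	Name string
}

type TargetState struct {
	Active     bool // collapsed from the real *ActiveState
	Terminated bool // collapsed from the real *TerminatedState
}

// jobTargetState sketches the proposed fix: if the job's runtime status
// is already OK but MostRecentPod found nothing (the pod was deleted),
// report Terminated instead of falling through to an Active state that
// can never satisfy processExitCondition.
func jobTargetState(status RuntimeStatus, mostRecentPod *Pod) TargetState {
	if status == RuntimeStatusOK && mostRecentPod == nil {
		return TargetState{Terminated: true}
	}
	// Otherwise fall back to the existing pod-phase-based logic
	// (elided in this sketch).
	return TargetState{Active: true}
}

func main() {
	// No pod survives, but the Job says it completed: terminated.
	fmt.Println(jobTargetState(RuntimeStatusOK, nil).Terminated) // prints "true"
	// A pod still exists: use the normal pod-based path.
	fmt.Println(jobTargetState(RuntimeStatusOK, &Pod{Name: "db-init"}).Terminated) // prints "false"
}
```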

@dnephin dnephin added the bug Something isn't working label Dec 3, 2023
nicks added a commit to nicks/tilt that referenced this issue Dec 5, 2023
fixes tilt-dev#6272

History of this change:
- we used to have a very naive definition of "Ready"
- callers of the runtime status API had hacks to look at Job pods to determine completeness (tilt-dev#4367)
- later, we changed it to have a more sophisticated definition, so that
  Jobs only count as 'ready' when they're complete. (tilt-dev#5013)
- we forgot to remove the old caller hacks :grimace:

in any case, i added a bunch of tests, since we need more job integration tests anyway.

Signed-off-by: Nick Santos <nick.santos@docker.com>
nicks added a commit that referenced this issue Dec 8, 2023

* Handle jobs with deleted pods

Signed-off-by: Daniel Nephin <dnephin@gmail.com>

* session: ensure 'tilt ci' exits properly when reattaching to jobs

fixes #6272


---------

Signed-off-by: Daniel Nephin <dnephin@gmail.com>
Signed-off-by: Nick Santos <nick.santos@docker.com>
Co-authored-by: Daniel Nephin <dnephin@gmail.com>