
Job never becomes ready when re-attaching to existing env when pods for the job are deleted #6272

Closed
dnephin opened this issue Dec 3, 2023 · 0 comments · Fixed by #6276
Labels
bug Something isn't working

dnephin commented Dec 3, 2023

Background

I believe this is the same issue as described in #5822. That issue was closed over a year ago, so creating a new one.

We have a system that pre-creates tilt environments for tilt ci, so when tilt ci runs it resumes from an existing environment. We've noticed that occasionally those resumes will fail with a timeout. Using the new output from #6262 we noticed that when it happens it's always a job that never goes ready. Looking at logs we notice that these jobs had deleted pods.

#5859 added special handling for "completed jobs without pods" to kubernetesapply/reconciler.go, but that does not appear to work for processExitCondition.

It appears that session/conv.go:Reconciler.k8sRuntimeTarget sees a job with status == v1alpha1.RuntimeStatusOK, but because there are no pods, v1.PodSucceeded == phase is false, and the resource is marked as Active with Active.Ready = true instead of Terminated. In processExitCondition, the check

} else if res.State.Active != nil && (!res.State.Active.Ready || res.Type == v1alpha1.TargetTypeJob) {

means that ready == true is ignored for jobs, so they never appear ready once their pods are gone.
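As a minimal illustration of the diagnosis above (using simplified stand-in types — TargetType, ActiveState, Target, and stillPending here are illustrative, not the real v1alpha1 definitions), the quoted condition keeps any active Job pending regardless of its Ready flag:

```go
package main

import "fmt"

// Simplified stand-ins for the Tilt session types; names are
// illustrative, not the real v1alpha1 API.
type TargetType string

const TargetTypeJob TargetType = "job"

type ActiveState struct {
	Ready bool
}

type TargetState struct {
	Active *ActiveState
}

type Target struct {
	Type  TargetType
	State TargetState
}

// stillPending mirrors the condition quoted from processExitCondition:
// an active target counts as pending if it is not ready, OR if it is a
// Job. So a Job stuck in the Active state can never satisfy the exit
// condition, even when Ready is true.
func stillPending(res Target) bool {
	return res.State.Active != nil &&
		(!res.State.Active.Ready || res.Type == TargetTypeJob)
}

func main() {
	// A re-attached Job whose pods were deleted ends up Active with
	// Ready=true...
	job := Target{Type: TargetTypeJob, State: TargetState{Active: &ActiveState{Ready: true}}}
	// ...but the exit condition still treats it as pending, so
	// `tilt ci` waits until it times out.
	fmt.Println(stillPending(job)) // prints "true"
}
```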

Expected Behavior

Re-attaching to an environment with tilt ci recognizes a job is completed, even when the pods for that job are deleted.

Current Behavior

Today it will time out with something like:

Error: Timeout after 30m0s: 1 resources not ready (db-init:runtime)

Steps to Reproduce

  1. Using the example Tiltfile from job never becomes ready when re-attaching to existing env #5822
  2. tilt ci and wait for it to complete
  3. kubectl delete the pod/db-init- pod
  4. tilt ci --timeout 1m - this will timeout instead of going ready

Implementation Questions

I think we understand the problem well enough, but I'm not entirely sure how to fix it yet.

I think in k8sRuntimeTarget we can handle the case where no pod is found by krs.MostRecentPod and produce a state of Terminated. It seems like the condition is available at that time.
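A rough sketch of that idea, using collapsed stand-in types rather than the real Tilt API (RuntimeStatus, Pod, TargetState, and jobTargetState here are all illustrative assumptions):

```go
package main

import "fmt"

// Simplified stand-ins for the Tilt types; names are illustrative,
// not the real API.
type RuntimeStatus string

const RuntimeStatusOK RuntimeStatus = "ok"

type Pod struct {
	Name string
}

type TargetState struct {
	Active     bool // collapsed from the real *ActiveState
	Terminated bool // collapsed from the real *TerminatedState
}

// jobTargetState sketches the proposed fix: if the job's runtime status
// is already OK but MostRecentPod found nothing (the pod was deleted),
// report Terminated instead of falling through to an Active state that
// can never satisfy processExitCondition.
func jobTargetState(status RuntimeStatus, mostRecentPod *Pod) TargetState {
	if status == RuntimeStatusOK && mostRecentPod == nil {
		return TargetState{Terminated: true}
	}
	// Otherwise fall back to the existing pod-phase-based logic
	// (elided in this sketch).
	return TargetState{Active: true}
}

func main() {
	// No pod survives, but the Job says it completed: terminated.
	fmt.Println(jobTargetState(RuntimeStatusOK, nil).Terminated) // prints "true"
	// A pod still exists: use the normal pod-based path.
	fmt.Println(jobTargetState(RuntimeStatusOK, &Pod{Name: "db-init"}).Terminated) // prints "false"
}
```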

@dnephin dnephin added the bug Something isn't working label Dec 3, 2023
nicks added a commit to nicks/tilt that referenced this issue Dec 5, 2023
fixes tilt-dev#6272

History of this change:
- we used to have a very naive definition of "Ready"
- callers of the runtime status API had hacks to look at Job pods to determine completeness (tilt-dev#4367)
- later, we changed it to have a more sophisticated definition, so that
  Jobs only count as 'ready' when they're complete. (tilt-dev#5013)
- we forgot to remove the old caller hacks :grimace:

in any case, i added a bunch of tests, since we need more job integration tests anyway.

Signed-off-by: Nick Santos <nick.santos@docker.com>
nicks added a commit that referenced this issue Dec 8, 2023

* Handle jobs with deleted pods

Signed-off-by: Daniel Nephin <dnephin@gmail.com>

* session: ensure 'tilt ci' exits properly when reattaching to jobs

fixes #6272


---------

Signed-off-by: Daniel Nephin <dnephin@gmail.com>
Signed-off-by: Nick Santos <nick.santos@docker.com>
Co-authored-by: Daniel Nephin <dnephin@gmail.com>