
Cleaner should also cover cases where pipeline run's pod is gone, but pipeline run object is marked non-terminal #1088

Closed
augray opened this issue Oct 31, 2023 · 0 comments · Fixed by #1100
Labels: bug · infrastructure · reliability

Comments


augray commented Oct 31, 2023

This can likely be grouped with stale-pipeline-runs

@tscurtu added the bug and infrastructure labels Nov 13, 2023
github-merge-queue bot pushed a commit that referenced this issue Nov 16, 2023
…'s DB doesn't see that (#1100)

Closes #1088 

We have observed that this can happen if, for example, the runner pod
OOM'd. We don't want the Sematic dashboard to show such runs as still
active when they're really not.
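
Conceptually, the cleaner pass for this case looks something like the sketch below (a minimal Python illustration; `list_runs`, `pod_exists`, and `mark_run_failed` are hypothetical names, not Sematic's actual API):

```python
# Hypothetical sketch of the cleaner check; the db/k8s helpers are
# illustrative stand-ins, not Sematic's real interfaces.
TERMINAL_STATES = {"RESOLVED", "FAILED", "CANCELED"}

def clean_orphaned_runs(db, k8s):
    """Force runs to FAILED when the DB shows them as non-terminal
    but their runner pod no longer exists (e.g. it was OOM-killed)."""
    for run in db.list_runs():
        if run.state in TERMINAL_STATES:
            continue  # already terminal; nothing to clean up
        if not k8s.pod_exists(run.pod_name):
            # The pod vanished without reporting a terminal state, so the
            # dashboard would otherwise show this run as active forever.
            db.mark_run_failed(run.id, message="Runner pod disappeared")
```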

Additionally, the runner jobs' statuses were only being updated when the
jobs were created and when they were killed. I added intermediate
updates by refreshing each job's status whenever the resolution object
is saved.
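
In sketch form, the idea is to piggy-back a status refresh onto the save path (again with hypothetical names, assuming the k8s client can be polled for a job's current state):

```python
def save_resolution(db, k8s, resolution):
    """Persist the resolution and refresh its jobs' statuses, so jobs
    are updated at intermediate points, not only at create/kill time."""
    db.save(resolution)
    for job in db.jobs_for_resolution(resolution.id):
        latest = k8s.get_job_status(job.name)  # poll k8s for current state
        if latest != job.status:
            job.status = latest
            db.save(job)
```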

Finally, there are cases where a runner job that merely looked pending
on k8s was being mistaken for still active. To catch these, I added
logic so that if a job still hasn't started within 24 hours (run jobs
count as started as soon as k8s acknowledges them; resolution jobs as
soon as the runner pod updates the resolution status at its start), the
job is considered defunct and no longer active.
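
A minimal sketch of that timeout check (the job fields and the `JOB_CREATION_TIMEOUT` constant are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

JOB_CREATION_TIMEOUT = timedelta(hours=24)

def is_defunct(job) -> bool:
    """A job that never reached its 'started' marker (k8s acknowledgment
    for run jobs, the first resolution-status update for resolution jobs)
    within the timeout is treated as defunct and no longer active."""
    if job.started_at is not None:
        return False  # the job did start; judge it by its actual status
    # Assumes created_at is a timezone-aware UTC timestamp.
    age = datetime.now(timezone.utc) - job.created_at
    return age > JOB_CREATION_TIMEOUT
```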

Testing
--------

Hacked the runner code so that it wouldn't respond to signals by calling
the cancellation API, and hacked the job creation timeout down to 10
minutes rather than 24 hours. Then:

- Had the runner immediately exit without doing anything else. Confirmed
that this was seen as garbage and cleaned up.
- Started the testing pipeline with two jobs set to wait for 15 minutes.
Once the jobs were actually in progress and at the sleep runs, I killed
the runner pod for one. Confirmed that it got cleaned up (though not
until AFTER the sleep run had finished its work), while the one whose
runner I didn't kill finished successfully.

Also deployed to dev1, and it cleaned up all the garbage we had there
except for stuff from the LocalRunner that hadn't been marked terminal
yet. A separate strategy will be needed to address defunct local
runners; that's out of scope for this PR.

---------

Co-authored-by: Josh Bauer <josh@sematic.dev>
neutralino1 pushed a commit that referenced this issue Apr 3, 2024
…'s DB doesn't see that (#1100)
