
Cleaner should also cover cases where pipeline run's pod is gone, but pipeline run object is marked non-terminal #1088

Closed
augray opened this issue Oct 31, 2023 · 0 comments · Fixed by #1100
Labels: bug · infrastructure · reliability

Comments


augray commented Oct 31, 2023

This can likely be grouped with stale-pipeline-runs

@tscurtu added the bug and infrastructure labels Nov 13, 2023
github-merge-queue bot pushed a commit that referenced this issue Nov 16, 2023
…'s DB doesn't see that (#1100)

Closes #1088 

We have observed that this can happen if, for example, the runner pod
OOM'd. We don't want the Sematic dashboard to show such runs as still
active when they're really not.
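
Conceptually, the cleaner pass for this case looks something like the sketch below (a minimal Python illustration; `list_runs`, `pod_exists`, and `mark_run_failed` are hypothetical names, not Sematic's actual API):

```python
# Hypothetical sketch of the cleaner check; the db/k8s helpers are
# illustrative stand-ins, not Sematic's real interfaces.
TERMINAL_STATES = {"RESOLVED", "FAILED", "CANCELED"}

def clean_orphaned_runs(db, k8s):
    """Force runs to FAILED when the DB shows them as non-terminal
    but their runner pod no longer exists (e.g. it was OOM-killed)."""
    for run in db.list_runs():
        if run.state in TERMINAL_STATES:
            continue  # already terminal; nothing to clean up
        if not k8s.pod_exists(run.pod_name):
            # The pod vanished without reporting a terminal state, so the
            # dashboard would otherwise show this run as active forever.
            db.mark_run_failed(run.id, message="Runner pod disappeared")
```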

Additionally, the runner jobs' statuses were only being updated when the
jobs were created and when they were killed. I added intermediate
updates by refreshing each job's status whenever the resolution object
is saved.
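
In sketch form, the idea is to piggy-back a status refresh onto the save path (again with hypothetical names, assuming the k8s client can be polled for a job's current state):

```python
def save_resolution(db, k8s, resolution):
    """Persist the resolution and refresh its jobs' statuses, so jobs
    are updated at intermediate points, not only at create/kill time."""
    db.save(resolution)
    for job in db.jobs_for_resolution(resolution.id):
        latest = k8s.get_job_status(job.name)  # poll k8s for current state
        if latest != job.status:
            job.status = latest
            db.save(job)
```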

Finally, there are cases where a runner job that merely looked pending
on k8s was being mistaken for still active. To catch these, I added
logic so that if a job still hasn't started within 24 hours (run jobs
count as started as soon as k8s acknowledges them; resolution jobs as
soon as the runner pod updates the resolution status at its start), the
job is considered defunct and no longer active.
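
A minimal sketch of that timeout check (the job fields and the `JOB_CREATION_TIMEOUT` constant are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

JOB_CREATION_TIMEOUT = timedelta(hours=24)

def is_defunct(job) -> bool:
    """A job that never reached its 'started' marker (k8s acknowledgment
    for run jobs, the first resolution-status update for resolution jobs)
    within the timeout is treated as defunct and no longer active."""
    if job.started_at is not None:
        return False  # the job did start; judge it by its actual status
    # Assumes created_at is a timezone-aware UTC timestamp.
    age = datetime.now(timezone.utc) - job.created_at
    return age > JOB_CREATION_TIMEOUT
```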

Testing
--------

Hacked the runner code so that it wouldn't respond to signals by calling
the cancellation API, and hacked the job creation timeout down to 10
minutes rather than 24 hours. Then:

- Had the runner immediately exit without doing anything else. Confirmed
that this was seen as garbage and cleaned up.
- Started the testing pipeline with two jobs set to wait for 15 minutes.
Once the jobs were actually in progress and at the sleep runs, I killed
the runner pod for one. Confirmed that it got cleaned up (though not
until AFTER the sleep run had finished its work), while the one whose
runner I didn't kill finished successfully.

Also deployed to dev1, and it cleaned up all the garbage we had there
except for stuff from the LocalRunner that hadn't been marked terminal
yet. A separate strategy will be needed to address defunct local
runners; that's out of scope for this PR.

---------

Co-authored-by: Josh Bauer <josh@sematic.dev>
neutralino1 pushed a commit that referenced this issue Apr 3, 2024
…'s DB doesn't see that (#1100)
