Demos of github actions job cancellation signal handling, showing exactly how it behaves. (Spoiler: badly.)
Why is this needed? Because the github actions docs for this appear to be nonexistent at the time of writing, and the answers on the relevant github "community" forum post are nowhere near clear or detailed enough. (It's always possible I'm just bad at searching, but you'd think the post above would link to the docs if that were the case.)
I was having issues with Terraform runs leaving lock files lying around on S3 state storage buckets when github jobs were cancelled. This shouldn't be happening, because terraform will try to clean up and remove its lock file when gracefully killed. Given the lack of usable github docs, I had to write a demo to find out what exactly happens when a github actions job is cancelled.
TL;DR: unless special care is taken, child processes don't get any signal delivered to them before they're hard-destroyed when a github actions job is cancelled. As a workaround, `if: always()` blocks can be used to do necessary cleanup steps, since they run on cancel. Or for simple cases you can `exec` your process, so it becomes the top-level process for a step and does receive signals on cancel.
A set of github actions workflows demonstrates how github job cancellation signal handling works. To try them you must run them via github actions, then cancel them using the github actions UI.
- `cancel-test-exec-child-ignore-sigquit.yaml`: `exec`'s a script that ignores `SIGINT`, `SIGQUIT` and `SIGTERM`. It shows that github delivers a `SIGINT`, waits 7.5s, delivers a `SIGTERM`, waits 2.5s, then presumably sends a `SIGKILL`. It then runs any `if: always()` steps after destroying all processes running in the killed step.
- `cancel-test-exec-child.yaml`: `exec`'s the same script, but only ignores `SIGINT`, so the child process will terminate on the subsequent `SIGTERM`. This is more realistic, and the subsequent test cases do the same thing. Since it checks the process tree in the `if: always()` cleanup step, this test also shows that github destroys all processes under the step recursively before it begins any cleanup steps. It must be keeping track of all processes.
- `cancel-test-shell-plain.yaml`: Represents the "normal" case of a github actions step using a bash shell that runs a child process as a blocking command within the shell. You will see that the child process (the same script as in the above demo) does not receive any `SIGINT` or `SIGTERM`. The bash leader process does, but you can't see that because bash defers `trap` execution until child exit while blocked waiting for a foreground child process, and the whole lot gets `SIGKILL`'d before the child process exits to return control to bash. This means that the workload running in the inner script got no chance to clean up its work.
- `cancel-test-shell-sigfwd.yaml`: Demonstrates that it is possible to use a top-level shell with job control enabled as a process-group leader that forwards signals to its child processes. It's ugly though. Because of deferred traps, every subcommand that needs a chance to handle signals must be run as a background job with `&` then `wait`ed for, and there's plenty of fiddling about to make it work. See the comments in `signal_forwarding_wrapper.sh` for details.
- `cancel-test-long-running-if-always.yaml`: Explores what happens when an `if: always()` step takes too long or refuses to exit. It seems like github will let the cleanup run for about 4 minutes then kill it, initially with a `SIGINT`. Repeated cancels sent during the `if: always()` run appear to have no effect. Interestingly, the job won't retain logs if the cleanup step doesn't exit within the overall job timeout; you can only see the logs if you were streaming them during the run.
- `test-daemonize.yaml` and `test-daemonize-cancel.yaml`: Show that by detaching from the current session a process can remain running after the end of the containing step, or the cancellation of the step. So it's not possible to rely on Github to terminate everything running under a step; there could still be workloads running when your `if: always()` steps run after cancellation of work. You may need to ensure you record their pids and kill them yourself before handling cleanup (a sketch of this follows the list). I presume they're destroyed when the containing `job` is cleaned up - I didn't bother giving the task telemetry to some public endpoint so it could report how long it stays running for.
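For illustration, here's a rough sketch of that last point: recording the pid of a detached workload and killing it yourself from a cleanup step. The step names and `long-running-task.sh` are made up for the example; this is not the contents of the daemonize demos.

```yaml
- name: start detached workload
  shell: bash
  run: |
    # A process detached into its own session can survive the end or
    # cancellation of this step. (Without --fork, setsid exec's the task in
    # place here, so $! is the task's pid.)
    setsid ./long-running-task.sh > task.log 2>&1 < /dev/null &
    echo "$!" > task.pid

- name: cleanup
  if: always()
  shell: bash
  run: |
    # Runs even on cancel. The detached task may still be alive, so kill it
    # ourselves before doing any other cleanup.
    if [ -f task.pid ]; then
      kill -TERM "$(cat task.pid)" 2>/dev/null || true
    fi
```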
Consider a simple job step like:
```yaml
- name: whatever
  shell: bash
  run: |
    my-long-running-task
```
You might expect that if the actions run is cancelled, `my-long-running-task` would get some kind of signal to give it a chance to clean up before the whole actions runner is destroyed. As if you'd pressed `Ctrl-C` in the shell, then waited a bit, then pressed `Ctrl-\` (`SIGQUIT`).

In reality, it exits (presumably on `SIGKILL`) without any chance to clean up.

On cancel, github actions delivers a `SIGINT` only to the top-level process for the current step of each active job. Then 7.5s later it delivers a `SIGTERM`, again to the top-level process only. 2.5s later it sends a `SIGKILL` (presumably to everything in the process tree).
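For reference, a minimal step along the lines of the exec-child demos above (a sketch, not one of the repo's workflows) that logs what the step's top-level process receives on cancel:

```yaml
- name: log the cancel signal sequence
  shell: bash
  run: |
    # The run script's bash is the step's top-level process, so the runner's
    # cancel signals are delivered here.
    trap 'echo "$(date +%T) SIGINT received"' INT
    trap 'echo "$(date +%T) SIGTERM received, exiting"; exit 0' TERM
    # Background + wait, so the traps aren't deferred behind a foreground
    # child (see the bash notes below).
    sleep 600 & child=$!
    while kill -0 "$child" 2>/dev/null; do wait "$child" || true; done
```

On cancel you should see the `SIGINT` line, then the `SIGTERM` line roughly 7.5s later, assuming you're streaming the logs at the time.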
You'd think that delivery scheme is fine. But signals don't propagate down process trees, so child processes running under the top-level step process won't see a signal unless the top-level process explicitly forwards it.
A typical Github actions job will be a `run` step with `shell: bash` that invokes some task as a child process of the step's top-level shell. If you cancel a job with this, github actions will signal the top pid (the shell) with `SIGINT`.
Bash will behave as documented:
> When Bash is running without job control enabled and receives SIGINT while waiting for a foreground command, it waits until that foreground command terminates and then decides what to do about the SIGINT [...]
The child process will never get that `SIGINT`, so it'll never exit, and bash never gets to do anything. And you can't use a trap on `SIGINT` to forward signals to the child process(es) either, because:

> If Bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes.
This issue isn't specific to bash; bash just makes a useful demo because it's the widely used default shell for github actions.
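To make the deferral concrete, a sketch (not one of the repo's demos): on cancel, the step below logs nothing, because the trap can't run until the foreground sleep exits, and the whole tree is `SIGKILL`'d long before that. Running the child in the background and `wait`ing for it is what makes traps usable, which is what the forwarding sketch further down relies on.

```yaml
- name: trap deferred behind a foreground child
  shell: bash
  run: |
    # Bash receives the SIGINT/SIGTERM, but won't run this trap until the
    # foreground sleep exits - which never happens before the SIGKILL.
    trap 'echo "this never prints on cancel"' INT TERM
    sleep 600
```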
Instead of signalling the parent process, github actions should at least offer the option of signalling all processes running under the current step.
A simple, albeit imperfect, way to do this would be to spawn the `bash` shell as a process group leader by setting the `-m` flag. Github actions could then signal the whole process group for each active step on cancel, by sending the signal to the negated pid of the leader shell. This works well for simple cases, but process groups aren't nestable, so if some workload under the step creates its own process group it'll be orphaned and won't receive the process-group signal unless its parent notices and propagates it explicitly.
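Until github offers something like that, the user-side version of the same process-group trick is what `cancel-test-shell-sigfwd.yaml` demonstrates. Roughly sketched here (this is not the contents of `signal_forwarding_wrapper.sh`, and `./my-workload.sh` is a stand-in):

```yaml
- name: run workload with signal forwarding
  shell: bash
  run: |
    # Enable job control so each background job gets its own process group,
    # with the job's pid as the process group id.
    set -m
    ./my-workload.sh & child=$!
    # Forward the runner's signals to the child's whole process group.
    forward() { kill -"$1" -- "-$child" 2>/dev/null || true; }
    trap 'forward INT' INT
    trap 'forward TERM' TERM
    # wait is interrupted by trapped signals (unlike a foreground command),
    # so loop until the child has actually gone away.
    # (A real wrapper would also capture and propagate the workload's exit
    # status - part of the fiddling mentioned above.)
    while kill -0 "$child" 2>/dev/null; do
      wait "$child" || true
    done
```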
It'd be better still if github used linux sessions, pid namespaces, or other suitable constructs that can properly contain all processes in the step and signal them all at once.
While bash's usual awfulness contributes to making this hard to get right, GitHub's lack of docs, or of any means to configure job signal handling to deliver to a process group, certainly makes it much worse.
Until/unless github improves their actions runners with configurable cancel timeouts and the option to signal the whole process tree, there are a few possible workarounds:
- `exec` any long-running tasks that might need a chance to respond to `SIGINT` or `SIGTERM` with cleanup steps before termination, so they become the top-level process in a `step` (sketched after this list). Split a series of such tasks into separate github actions steps instead of a single step with multiple sub-commands called by a shell. This'll work ok if your job can respond to `SIGINT` and clean up within 7.5s, or to `SIGTERM` and clean up within 2.5s. But if it's relying on network resources that's going to be iffy - and it probably is, since the runner itself is usually stateless.
- Have your command or the controlling shell write a state file to the runner working directory and remove it on clean exit. Then use an `if: always()` action to check for the state file and perform the cleanup actions if found (also sketched after this list). There is going to be a race between the task completing its work and deleting the state file, so you'll need to ensure that the cleanup action is idempotent. In other words, it must be safe to run the cleanup steps twice. For example, if you're unlocking something you should write a unique lock-instance-id to your state file. Then when you unlock, you can ensure the unlock request gets silently ignored if the lock is currently held with a different lock-instance-id. Handily, this is how Terraform `force-unlock` works. It's unclear what the rules for `if: always()` actions are during github actions cancellation. From a quick experiment it looks like you've got about 4 minutes to complete cleanup, and repeated cancel requests during the `if: always()` steps appear to be ignored.
- Hope that steps inside Docker containers work better? I haven't tested this yet, but maybe github has saner cancel behaviour when a job runs in a user-defined container?
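To make the first two workarounds concrete, here's a hedged sketch that combines them. The script names, the `cleanup-needed` state file and the unlock scheme are all illustrative, not from this repo:

```yaml
- name: run the task as the step's top-level process
  shell: bash
  run: |
    # Record a unique lock-instance-id before starting; the task removes the
    # state file itself on a clean exit. exec makes the task the step's
    # top-level process, so it gets SIGINT/SIGTERM directly on cancel.
    echo "${{ github.run_id }}-${{ github.run_attempt }}" > cleanup-needed
    exec ./do-work.sh   # hypothetical task script

- name: cleanup on cancel or failure
  if: always()
  shell: bash
  run: |
    # Runs even after a cancel. Must be idempotent: the task may have finished
    # and removed the state file between the cancel and this step running.
    if [ -f cleanup-needed ]; then
      lock_id="$(cat cleanup-needed)"
      # Hypothetical unlock that silently ignores ids it doesn't currently hold.
      ./unlock.sh "$lock_id" || true
    fi
```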
Things I haven't dug into yet:

- How github behaves with jobs using containers.
- Why github actions, which is a job control system, doesn't document fundamental and basic properties of job cancellation behaviour.