[Issue:3308] SIGTERM Graceful shutdown functionality #3340
Any ETA on when we can expect a review on this PR? |
This would be really great to get in assuming it works, we're also experiencing this. |
Would love to see this merged |
This PR is an essential bug fix for using GitHub runners with Karpenter. |
Karpenter support is essential for significant cost savings across companies. We easily save $300k+ per year; scaled across thousands of tech companies, Karpenter support could save a lot of money and the associated CO2 emissions. Can we prioritize reviewing and merging this PR? |
Upvote for the PR. We ended up implementing a custom image and baking in the script. However, we noticed that it is not behaving properly in |
@velkovb could I inquire as to the errors you saw when the runner did not capture the signal correctly? I have observed behavior in with ephemeral PVCs get stuck in the Have been leaning towards using the Kubernetes BuildKit driver as the solution, but a sidecar would certainly be easier |
We were seeing errors that the connection to the Docker socket was lost during an image build. We get a SIGTERM signal and the runner container handles it properly, but the dind container doesn't and terminates, so the Docker host disappears and the build breaks. |
that's a different issue, and fixed in PR actions/actions-runner-controller#3601 |
For a while this proposed change seemed to do the trick for our runners; however, something seems to have changed somewhere where due to the
Received SIGTERM, Graceful shutdown in 1800 Secs ...
error: list of process IDs must follow -p
Usage:
ps [options]
Try 'ps --help <simple|list|output|threads|misc|all>'
or 'ps --help <s|l|o|t|m|a>'
for additional help text.
For more details see ps(1).
Exiting runner...
Despite suggesting that it was exiting, it does not in fact exit on all occasions; instead it hangs, leaving the listener process running:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
runner 1 0.0 0.0 4500 3532 ? Ss 12:10 0:00 /bin/bash /home/runner/run.sh
runner 10 0.0 0.0 4368 3228 ? S 12:10 0:00 /bin/bash /home/runner/run-helper.sh
runner 25 0.2 0.0 274048668 109492 ? Sl 12:10 0:01 /home/runner/bin/Runner.Listener run
runner 86 0.0 0.0 4632 3788 pts/0 Ss 12:21 0:00 /bin/bash
runner 101 0.0 0.0 7068 1568 pts/0 R+ 12:21 0:00 ps -aux
I amended the script further to handle exiting cleanly when only this listener is present:
handle_sigterm() {
    # Default graceful stop timeout is 3 seconds
    RUNNER_GRACEFUL_STOP_TIMEOUT=${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}
    echo "Received SIGTERM, " \
         "Graceful shutdown in $RUNNER_GRACEFUL_STOP_TIMEOUT Secs ..."
    if [ -n "$RUNNER_TOKEN" ]; then
        echo "Runner token is still set, de-registering runner..."
        idle_runner="/runner/config.sh remove --token $RUNNER_TOKEN"
    else
        # Workaround for Issue #3330:
        # for the case where JITCONFIG is used instead of a registration token,
        # fall back to checking whether a worker is running (race-condition prone).
        worker_process_id=$(pgrep Runner.Worker)
        idle_runner="test -z \"$worker_process_id\""
    fi
    # Check if the runner is idle; if not, wait for the job to finish before stopping
    if ! eval "$idle_runner"; then
        echo "Running a job, waiting for $RUNNER_GRACEFUL_STOP_TIMEOUT s to finish.."
        i=0
        while [[ $i -lt $RUNNER_GRACEFUL_STOP_TIMEOUT ]]; do
            echo "Still waiting for job to finish.."
            # Check again if the runner is idle to handle a potential race condition
            if [ -z "$worker_process_id" ]; then
                echo "Worker process id not found, trying to find it again.."
                worker_process_id=$(pgrep Runner.Worker)
                # If the worker process id is still not found, exit
                if [ -z "$worker_process_id" ]; then
                    echo "Worker process id still not found, exiting.."
                    return
                fi
            fi
            # Check if the runner stopped itself
            if ! ps -p "$worker_process_id" > /dev/null; then
                echo "Runner stopped itself while graceful period waiting."
                return
            fi
            sleep 1
            i=$((i + 1))
        done
        echo "Graceful period over, terminating..."
    fi
    # Graceful wait period over, kill the worker process.
    # If the worker process was not found, check for the listener process and kill it instead.
    if [ -z "$worker_process_id" ]; then
        echo "Worker process id not found, checking for listener process.."
        listener_process_id=$(pgrep Runner.Listener)
        # Quoted so that an empty result is really treated as "no listener"
        if [ -n "$listener_process_id" ]; then
            echo "Killing listener process id: $listener_process_id"
            kill -INT "$listener_process_id"
        fi
    else
        echo "Killing worker process id: $worker_process_id"
        kill -INT "-$worker_process_id"
    fi
} |
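Two shell details matter for a handler like this one. The `ps` usage error in the log above is what `ps -p` prints when it is handed an empty PID list, and a `[ -n $var ]` test on an unquoted empty variable always succeeds. The following is only a minimal sketch of how both could be guarded against; `pid_alive` is a hypothetical helper name, not part of the runner scripts, and it assumes the target process belongs to the same user as the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the runner image): succeed only if a
# non-empty PID refers to a process we are allowed to signal.
pid_alive() {
    local pid="$1"
    # An empty PID counts as "not running" instead of being passed on to ps/kill.
    # kill -0 sends no signal; it only checks that the process can be signalled.
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Quoting pitfall: with an empty variable, the unquoted test collapses to
# `[ -n ]`, which is true because the single argument "-n" is a non-empty string.
empty=""
if [ -n $empty ]; then unquoted_result="true"; else unquoted_result="false"; fi
if [ -n "$empty" ]; then quoted_result="true"; else quoted_result="false"; fi
echo "unquoted: $unquoted_result, quoted: $quoted_result"
```

With a guard of this kind, a missing worker PID takes the "not running" branch rather than reaching `ps -p` with no argument.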
This makes the runner compatible with Kubernetes' Karpenter and, more generally, with Kubernetes pod movement. It fixes #3308 by handling graceful shutdown of the runner. It does the following:
Waits for RUNNER_GRACEFUL_STOP_TIMEOUT seconds before terminating, or for job completion, whichever happens first.
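The "whichever happens first" behaviour can be illustrated with a small polling loop. This is a sketch under stated assumptions, not the PR's implementation: `wait_for_exit` is a hypothetical helper, and a background `sleep` stands in for a running job.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait up to $2 seconds for PID $1 to exit.
# Returns 0 if the process exited in time, 1 if the timeout elapsed first.
wait_for_exit() {
    local pid="$1" timeout="$2" waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demo: `sleep 1` is a stand-in for a job that finishes on its own.
sleep 1 &
short_job=$!
if wait_for_exit "$short_job" "${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}"; then
    echo "job finished within the grace period"
else
    echo "grace period over, terminating"
    kill -TERM "$short_job" 2>/dev/null
fi
```

In an entrypoint script, a function like this would typically be called from a `trap ... TERM` handler, with the runner started in the background and the script blocked in `wait`, because bash defers trap handlers until the current foreground command returns.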