[Issue:3308] SIGTERM Graceful shutdown functionality #3340

Open
wants to merge 1 commit into
base: main


@jaswanthikolla jaswanthikolla commented Jun 14, 2024

This makes the runner compatible with Karpenter on Kubernetes and, more generally, with k8s pod movement. It fixes #3308 by handling graceful shutdown of the runner. It does the following:

  1. If the runner is idle and just listening for jobs, it exits immediately.
  2. If the runner is running a job, it waits until the job completes or RUNNER_GRACEFUL_STOP_TIMEOUT seconds have elapsed, whichever happens first, and then terminates.
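The "wait for the job or the timeout, whichever comes first" behavior described above can be sketched as a small polling loop. This is only an illustration under assumptions: `wait_for_job` and its arguments are hypothetical names and are not taken from the PR's actual script.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_job is a hypothetical helper, not code from the PR.

# Wait until process $1 exits or $2 seconds pass, whichever comes first.
# Returns 0 if the process exited in time, 1 if the timeout was reached.
wait_for_job() {
    local pid=$1 timeout=$2 waited=0
    # kill -0 sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A SIGTERM handler installed with `trap` could call such a helper with the job's PID and `$RUNNER_GRACEFUL_STOP_TIMEOUT`, then terminate the job only if the helper reports a timeout.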

@jaswanthikolla jaswanthikolla requested a review from a team as a code owner June 14, 2024 02:44
@jaswanthikolla
Author

Any ETA on when we can expect a review of this PR?

@ccincotti3

This would be really great to get in, assuming it works. We're also experiencing this.

@moosh3

moosh3 commented Oct 27, 2024

Would love to see this merged

@joosangkim

This PR is an essential bug fix for using the GitHub runner with Karpenter.

@jaswanthikolla
Author

jaswanthikolla commented Nov 20, 2024

Karpenter support is essential for achieving significant cost savings. We easily save $300k+ per year. Scaled across thousands of tech companies, Karpenter support could save a lot of money and the associated CO2 emissions.

Can we prioritize reviewing and merging this PR?

@velkovb

velkovb commented Dec 12, 2024

Upvote for the PR. We ended up implementing a custom image and baking in the script. However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies. Moving dind to a sidecar container has solved it for us - actions/actions-runner-controller#3842

@alec-drw

@velkovb could I inquire about the errors you saw when the runner did not capture the signal correctly? I have observed ephemeral PVCs getting stuck in the Released state after docker fails to shut down cleanly, leading to an eventual break in the storage provisioner.

I have been leaning towards using the Kubernetes buildkit driver as the solution, but a sidecar would certainly be easier.

@velkovb

velkovb commented Dec 12, 2024

> @velkovb could I inquire about the errors you saw when the runner did not capture the signal correctly? I have observed ephemeral PVCs getting stuck in the Released state after docker fails to shut down cleanly, leading to an eventual break in the storage provisioner.
>
> I have been leaning towards using the Kubernetes buildkit driver as the solution, but a sidecar would certainly be easier.

We were seeing errors saying that the connection to the docker socket was lost during an image build. On SIGTERM, the runner container handles the signal properly, but the dind container does not and terminates, so the docker host disappears and the build breaks.

@jaswanthikolla
Author

> However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies

That's a different issue, fixed in PR actions/actions-runner-controller#3601.

@marknet15

For a while this proposed change seemed to do the trick for our runners. However, something seems to have changed: because the Runner.Worker process only exists while a job is in progress, the script would end up hanging, leaving the Runner.Listener process running:

Received SIGTERM,  Graceful shutdown in 1800 Secs ...
error: list of process IDs must follow -p

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
Exiting runner...

Although the output suggests the runner is exiting, it does not always do so. Instead it hangs, leaving the listener process running:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner         1  0.0  0.0   4500  3532 ?        Ss   12:10   0:00 /bin/bash /home/runner/run.sh
runner        10  0.0  0.0   4368  3228 ?        S    12:10   0:00 /bin/bash /home/runner/run-helper.sh
runner        25  0.2  0.0 274048668 109492 ?    Sl   12:10   0:01 /home/runner/bin/Runner.Listener run
runner        86  0.0  0.0   4632  3788 pts/0    Ss   12:21   0:00 /bin/bash
runner       101  0.0  0.0   7068  1568 pts/0    R+   12:21   0:00 ps -aux
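The `ps` usage error in the log above is the message procps prints when `-p` is given no PID, which would be consistent with `pgrep Runner.Worker` returning nothing on an idle runner. A minimal guard, assuming a hypothetical helper named `is_running` (not part of the script in this thread), might look like this:

```shell
# Hypothetical helper for illustration; not part of the script in this thread.
# Returns 0 if a process with the given PID exists, 1 otherwise.
# An empty PID is treated as "not running" instead of being passed to ps.
is_running() {
    local pid=$1
    [ -n "$pid" ] || return 1
    ps -p "$pid" > /dev/null 2>&1
}
```

With a guard like this, an idle runner (no worker PID) takes the "not running" path without ever invoking `ps` with an empty argument.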

I amended the script further so that it exits cleanly when only the listener is present:

handle_sigterm() {
    # Default graceful stop timeout is 3 seconds
    RUNNER_GRACEFUL_STOP_TIMEOUT=${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}
    echo "Received SIGTERM, " \
        "Graceful shutdown in $RUNNER_GRACEFUL_STOP_TIMEOUT Secs ..."

    if [ -n "$RUNNER_TOKEN" ]; then
        echo "Runner token is still set, de-registering runner..."
        idle_runner="/runner/config.sh remove --token $RUNNER_TOKEN"
    else
        # workaround for Issue#3330
        # For the case JITCONFIG is used instead of reg token.
        # Fallback to check if worker is running,race condition prone.
        worker_process_id=$(pgrep Runner.Worker)
        idle_runner="test -z \"$worker_process_id\""
    fi

    # Check if runner is idle if not then wait for job to finish before stopping
    if ! eval "$idle_runner"; then
        echo "Running a job, waiting for $RUNNER_GRACEFUL_STOP_TIMEOUT s to finish.."
        i=0
        while [[ $i -lt $RUNNER_GRACEFUL_STOP_TIMEOUT ]]; do
            echo "Still waiting for job to finish.."

            # Check again if runner is idle to handle potential race condition
            if [ -z "$worker_process_id" ]; then
                echo "Worker process id not found, trying to find it again.."
                worker_process_id=$(pgrep Runner.Worker)
                # If worker process id is still not found, exit
                if [ -z "$worker_process_id" ]; then
                    echo "Worker process id still not found, exiting.."
                    return
                fi
            fi

            # Check if runner stopped itself (PID quoted so ps never sees a bare -p)
            if ! ps -p "$worker_process_id" > /dev/null; then
                echo "Runner stopped itself while graceful period waiting."
                return
            fi
            sleep 1
            ((i++))
        done
        echo "Graceful period over, terminating..."
    fi

    # Graceful wait period over, kill the worker process
    # Or if worker process was not found, then check for listener process and kill it
    if [ -z "$worker_process_id" ]; then
        echo "Worker process id not found, checking for listener process.."
        listener_process_id=$(pgrep Runner.Listener)
        # Quoted so the test is false when no listener was found
        if [ -n "$listener_process_id" ]; then
            echo "Killing listener process id: $listener_process_id"
            kill -INT "$listener_process_id"
        fi
    else
        echo "Killing worker process id: $worker_process_id"
        kill -INT "-$worker_process_id"
    fi
    fi
}
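One shell detail matters for PID checks like the ones in this handler: inside `[ ... ]`, an unquoted empty variable disappears entirely, so `[ -n $var ]` collapses to `[ -n ]`, which tests whether the string `-n` is non-empty and is therefore always true. The variable names below are chosen for illustration only:

```shell
empty=""

# Unquoted: the test collapses to `[ -n ]`, which is true.
if [ -n $empty ]; then unquoted_result="true"; else unquoted_result="false"; fi

# Quoted: the test is `[ -n "" ]`, which is false.
if [ -n "$empty" ]; then quoted_result="true"; else quoted_result="false"; fi

echo "unquoted: $unquoted_result, quoted: $quoted_result"
```

This prints `unquoted: true, quoted: false`, so a check such as "is there a listener PID?" only behaves as intended when the variable is quoted.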
