Checkout bricks a self-hosted runner and cannot recover #1148

Open
@kvanbere

Description

Something went wrong, and all of our self-hosted runners ended up with bad or somehow corrupted .git folders. It happened on around 13 of our runners at the same time. I think it was a random occurrence, because once I manually logged in and deleted the repository folder on each runner, everything was fine again.

Here are our logs:

2023-01-30T02:56:34.9249114Z Waiting for a runner to pick up this job...
2023-01-30T04:54:24.3969588Z Job is about to start running on the runner: XXXXXXXXXXXXXXXXXXXXXXXX (organization)
2023-01-30T04:54:29.3070556Z Current runner version: '2.301.1'
2023-01-30T04:54:29.3077744Z Runner name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3078128Z Runner group name: 'Default'
2023-01-30T04:54:29.3078642Z Machine name: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
2023-01-30T04:54:29.3080746Z ##[group]GITHUB_TOKEN Permissions
2023-01-30T04:54:29.3081343Z Actions: write
2023-01-30T04:54:29.3081520Z Checks: write
2023-01-30T04:54:29.3081693Z Contents: write
2023-01-30T04:54:29.3081906Z Deployments: write
2023-01-30T04:54:29.3082186Z Discussions: write
2023-01-30T04:54:29.3082429Z Issues: write
2023-01-30T04:54:29.3082608Z Metadata: read
2023-01-30T04:54:29.3082779Z Packages: write
2023-01-30T04:54:29.3082958Z Pages: write
2023-01-30T04:54:29.3083147Z PullRequests: write
2023-01-30T04:54:29.3083476Z RepositoryProjects: write
2023-01-30T04:54:29.3083696Z SecurityEvents: write
2023-01-30T04:54:29.3083888Z Statuses: write
2023-01-30T04:54:29.3084056Z ##[endgroup]
2023-01-30T04:54:29.3087171Z Secret source: Actions
2023-01-30T04:54:29.3087569Z Prepare workflow directory
2023-01-30T04:54:29.4388409Z Prepare all required actions
2023-01-30T04:54:29.4550014Z Getting action download info
2023-01-30T04:54:29.8524043Z Download action repository 'actions/checkout@v3' (SHA:ac593985615ec2ede58e132d2e21d2b1cbd6127c)
2023-01-30T04:54:30.9083915Z Complete job name: XXXXXXXXXXXXXXXXXXXXXXXX
2023-01-30T04:54:31.0985565Z ##[group]Run actions/checkout@v3
2023-01-30T04:54:31.0985877Z with:
2023-01-30T04:54:31.0986059Z   repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:31.0986462Z   token: ***
2023-01-30T04:54:31.0986609Z   ssh-strict: true
2023-01-30T04:54:31.0986786Z   persist-credentials: true
2023-01-30T04:54:31.0986951Z   clean: true
2023-01-30T04:54:31.0987092Z   fetch-depth: 1
2023-01-30T04:54:31.0987234Z   lfs: false
2023-01-30T04:54:31.0987377Z   submodules: false
2023-01-30T04:54:31.0987547Z   set-safe-directory: true
2023-01-30T04:54:31.0987702Z env:
2023-01-30T04:54:31.0987887Z   TMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988151Z   TEMP: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988398Z   TMPDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.temp
2023-01-30T04:54:31.0988665Z   MATLAB_PREFDIR: C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX/.preferences
2023-01-30T04:54:31.0988870Z ##[endgroup]
2023-01-30T04:54:34.6968863Z Syncing repository: XXXXXXXX/XXXXXXXX
2023-01-30T04:54:34.6970512Z ##[group]Getting Git version info
2023-01-30T04:54:34.6970936Z Working directory is 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:34.6971402Z [command]"C:\Program Files\Git\cmd\git.exe" version
2023-01-30T04:54:34.7493487Z git version 2.36.1.windows.1
2023-01-30T04:54:34.7592122Z ##[endgroup]
2023-01-30T04:54:34.7607048Z Temporarily overriding HOME='C:\runner\e595c9b9\_work\_temp\bcafa367-f8cb-4d31-84b1-63d10aaaabed' before making global git config changes
2023-01-30T04:54:34.7607516Z Adding repository directory to the temporary git global config as a safe directory
2023-01-30T04:54:34.7608114Z [command]"C:\Program Files\Git\cmd\git.exe" config --global --add safe.directory C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX
2023-01-30T04:54:34.8483251Z [command]"C:\Program Files\Git\cmd\git.exe" config --local --get remote.origin.url
2023-01-30T04:54:34.8992096Z ##[error]fatal: --local can only be used inside a git repository
2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'
2023-01-30T04:54:35.4710729Z Post job cleanup.
2023-01-30T04:54:38.8875206Z Cleaning up orphan processes

In this case, checkout seems to bail out fatally: after the error "fatal: --local can only be used inside a git repository", the Actions run ends immediately with a failure and does not try to continue.

This effectively bricked the runner, because every job the bad runner picked up failed instantly. Worse, the bad runner kept taking jobs from the queue and failing them almost immediately, which made quite a mess of our job history.

Since the fix was simply to log in and delete the offending folder, it would be nice if checkout automatically deleted the folder and retried once.

It seems like it tried this:

2023-01-30T04:54:34.9013542Z Deleting the contents of 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX'
2023-01-30T04:54:35.0573716Z ##[error]EPERM: operation not permitted, unlink 'C:\runner\e595c9b9\_work\XXXXXXXX\XXXXXXXX\.git'

I am not sure why that didn't work, since I was able to log in as the same user and rm the folder without any problem. In any case, all 13 runners failed to delete the folder automatically.
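
To make the request concrete, here is a rough sketch of the "delete the folder and retry once" behavior. It is written in Python purely for illustration and is not how checkout is implemented. The function name nuke_workspace and its parameters are made up, and the read-only handling is only a guess at one common reason deletes fail on Windows; I have not confirmed that this is what caused the EPERM above.

```python
import os
import shutil
import stat
import sys
import time


def _clear_readonly_and_retry(func, path, _exc):
    # Assumption: the delete failed because of a read-only attribute.
    # Other causes (open file handles, ACLs) are not handled here.
    os.chmod(path, stat.S_IRWXU)
    func(path)


def nuke_workspace(path, attempts=2, delay_seconds=1.0):
    """Recursively delete `path`, retrying a limited number of times.

    Returns True if `path` no longer exists afterwards.
    """
    # shutil.rmtree takes `onexc` on Python 3.12+ and `onerror` before that.
    key = "onexc" if sys.version_info >= (3, 12) else "onerror"
    handler = {key: _clear_readonly_and_retry}

    for attempt in range(attempts):
        if not os.path.exists(path):
            return True
        try:
            shutil.rmtree(path, **handler)
        except OSError:
            # Give a transient lock a moment to clear before the next try.
            if attempt + 1 < attempts:
                time.sleep(delay_seconds)
    return not os.path.exists(path)
```

With attempts=2 this amounts to "try, then retry once". If it still returns False, failing the job with a clear message about the stuck folder would at least make the cause obvious.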

To reproduce, I would suggest:

  • Install a self-hosted runner on Windows Server 2022, running as a service under a non-admin service user (e.g. Bob)
  • Set up a workflow that checks out the repository
  • Manually corrupt the .git folder by adding extra random files into it (?)
  • Ensure git config --local --get remote.origin.url fails
  • Observe that subsequent jobs acquired by this runner fail instantly and that the runner fails to recover
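
The failure condition from the steps above can also be checked from a script, which may help when confirming that a runner is in the bad state. This is a minimal sketch and assumes git is on PATH. The function name workspace_has_valid_repo is hypothetical; only the git command itself comes from the log.

```python
import subprocess


def workspace_has_valid_repo(workspace):
    """Return True if `git config --local --get remote.origin.url`
    succeeds in `workspace` and prints a non-empty URL."""
    try:
        result = subprocess.run(
            ["git", "config", "--local", "--get", "remote.origin.url"],
            cwd=workspace,
            capture_output=True,
            text=True,
            check=False,
        )
    except (FileNotFoundError, NotADirectoryError):
        # git is not installed, or the workspace path does not exist.
        return False
    return result.returncode == 0 and result.stdout.strip() != ""
```

A False result for a workspace that should contain a checkout corresponds to the "fatal: --local can only be used inside a git repository" case in the log.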
