Skip to content

generic worker sometimes shuts down a worker while seemingly working on a task #6681

@bhearsum

Description

@bhearsum

A concrete case of this that I came across yesterday was with https://firefox-ci-tc.services.mozilla.com/tasks/bs8RlnGKTOebtoPHGHt_4A/runs/0, which ran on Windows in Azure. We can see that run 0 was claimed at 2023-11-13T17:40:38.174Z by vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam, and then resolved 20 minutes later with CLAIM_EXPIRED. In the worker logs, I found what seems to be contradictory information:

Nov 13 17:40:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:40:36 polling for termination-time 
Nov 13 17:41:06Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:41:06 polling for termination-time 
Nov 13 17:41:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:41:36 polling for termination-time 
Nov 13 17:42:06Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:06 polling for termination-time 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:30 Error: (Intermittent) HTTP response code 502 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service HTTP/2.0 502 Bad Gateway 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Content-Length: 332 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Alt-Svc: clear 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Content-Type: text/html; charset=UTF-8 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Date: Mon, 13 Nov 2023 17:42:30 GMT 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Referrer-Policy: no-referrer 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <html><head> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <meta http-equiv="content-type" content="text/html;charset=utf-8"> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <title>502 Server Error</title> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service </head> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <body text=#000000 bgcolor=#ffffff> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h1>Error: Server Error</h1> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h2></h2> 
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service </body></html> 
Nov 13 17:42:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:36 polling for termination-time 
Nov 13 17:42:50Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:50 No task claimed. Idle for 7m14.3394143s (will exit if no task claimed in 2m45.6605857s). 
<more of the same for a bit>
 Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Looking for existing task users to delete... 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: wmic useraccount get name 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Worker idle for idleShutdownTimeoutSecs seconds (10m10.2536984s) 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Saving file file-caches.json (absolute path: C:\worker-runner\file-caches.json) 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Saving file directory-caches.json (absolute path: C:\worker-runner\directory-caches.json) 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'file-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'directory-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'tasks-resolved-count.txt' /grant:r 'Administrators:(GA)' /inheritance:r 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Exiting worker with exit code 68 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service
2023/11/13 17:45:46 WORKER_METRICS {"eventType":"instanceShutdown","instanceType":"Standard_F8s_v2","region":"eastus","timestamp":1699897546,"worker":"generic-worker","workerId":"vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam","workerPoolId":"gecko-t/win10-64-2009"} 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Immediate shutdown being issued... 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: 'C:\Windows\System32\shutdown.exe' /s /t 3 /c 'generic-worker idle timeout' 
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 exit status 68 

Notably, there's nothing around the claim time that indicates the worker claimed the task (just the polling for termination-time messages). A couple of minutes later there's a 502 which presumably originated from the termination polling code, although it's difficult to be sure. Eventually we get No task claimed, and a warning that the worker will shut down, which it eventually does.

https://firefox-ci-tc.services.mozilla.com/tasks/DZvVQ-VUTPSyPBBS13Bwfg/runs/0 is another example of where this may have happened (we don't have worker nor live logs for these, unfortunately, so it's difficult to confirm). This is a Linux task in GCP, and in both run 0 and run 1 the worker disappeared after a number of hours. https://firefox-ci-tc.services.mozilla.com/tasks/Gc5b4d-JT-OrIhhcppbd3g/runs/0 is another case like that, except we see the same thing with the Windows task above, where the expiration happens after 20 minutes.

I haven't had a chance to run any queries yet, but 9/10 retries I found on https://treeherder.mozilla.org/jobs?repo=mozilla-central&resultStatus=retry%2Crunning%2Cpending%2Crunnable&selectedTaskRun=AyJIoo5gQXCrAjBvCLSFow.0 seem to be hitting this, and it happens on Linux, Mac, and Windows.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions