-
Notifications
You must be signed in to change notification settings - Fork 261
Description
A concrete case of this that I came across yesterday was with https://firefox-ci-tc.services.mozilla.com/tasks/bs8RlnGKTOebtoPHGHt_4A/runs/0, which ran on Windows in Azure. We can see that run 0 was claimed at 2023-11-13T17:40:38.174Z by vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam, and then resolved 20 minutes later with CLAIM_EXPIRED. In the worker logs, I found what seems to be contradictory information:
Nov 13 17:40:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:40:36 polling for termination-time
Nov 13 17:41:06Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:41:06 polling for termination-time
Nov 13 17:41:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:41:36 polling for termination-time
Nov 13 17:42:06Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:06 polling for termination-time
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:30 Error: (Intermittent) HTTP response code 502
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service HTTP/2.0 502 Bad Gateway
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Content-Length: 332
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Alt-Svc: clear
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Content-Type: text/html; charset=UTF-8
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Date: Mon, 13 Nov 2023 17:42:30 GMT
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service Referrer-Policy: no-referrer
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <html><head>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <meta http-equiv="content-type" content="text/html;charset=utf-8">
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <title>502 Server Error</title>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service </head>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <body text=#000000 bgcolor=#ffffff>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h1>Error: Server Error</h1>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service <h2></h2>
Nov 13 17:42:30Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service </body></html>
Nov 13 17:42:36Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:36 polling for termination-time
Nov 13 17:42:50Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:42:50 No task claimed. Idle for 7m14.3394143s (will exit if no task claimed in 2m45.6605857s).
<more of the same for a bit>
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Looking for existing task users to delete...
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: wmic useraccount get name
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Worker idle for idleShutdownTimeoutSecs seconds (10m10.2536984s)
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Saving file file-caches.json (absolute path: C:\worker-runner\file-caches.json)
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Saving file directory-caches.json (absolute path: C:\worker-runner\directory-caches.json)
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'file-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'directory-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: icacls 'tasks-resolved-count.txt' /grant:r 'Administrators:(GA)' /inheritance:r
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Exiting worker with exit code 68
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service
2023/11/13 17:45:46 WORKER_METRICS {"eventType":"instanceShutdown","instanceType":"Standard_F8s_v2","region":"eastus","timestamp":1699897546,"worker":"generic-worker","workerId":"vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam","workerPoolId":"gecko-t/win10-64-2009"}
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Immediate shutdown being issued...
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 Running command: 'C:\Windows\System32\shutdown.exe' /s /t 3 /c 'generic-worker idle timeout'
Nov 13 17:45:47Z vm-adjkhsxbq02zfovzpgatfaftfjp3dlqewam.reddog.microsoft.com worker-runner-service 2023/11/13 17:45:46 exit status 68
Notably, there's nothing around the claim time that indicates the worker claimed the task (just the polling for termination-time messages). A couple of minutes later there's a 502 which presumably originated from the termination polling code, although it's difficult to be sure. Eventually we get No task claimed, and a warning that the worker will shut down, which it eventually does.
https://firefox-ci-tc.services.mozilla.com/tasks/DZvVQ-VUTPSyPBBS13Bwfg/runs/0 is another example of where this may have happened (we don't have worker nor live logs for these, unfortunately, so it's difficult to confirm). This is a Linux task in GCP, and in both run 0 and run 1 the worker disappeared after a number of hours. https://firefox-ci-tc.services.mozilla.com/tasks/Gc5b4d-JT-OrIhhcppbd3g/runs/0 is another case like that, except we see the same thing with the Windows task above, where the expiration happens after 20 minutes.
I haven't had a chance to run any queries yet, but 9/10 retries I found on https://treeherder.mozilla.org/jobs?repo=mozilla-central&resultStatus=retry%2Crunning%2Cpending%2Crunnable&selectedTaskRun=AyJIoo5gQXCrAjBvCLSFow.0 seem to be hitting this, and it happens on Linux, Mac, and Windows.