Buildbot: workers detached every minute and "no space left on device" issue #85808
Comments
It seems many of the RHEL and Fedora builds fail due to disk space https://buildbot.python.org/all/#/builders/185/builds/2 ./configure: line 2382: cannot create temp file for here-document: No space left on device |
These workers have different owners, so we need to reach different people. We should list all impacted workers. AMD64 RHEL8 3.x is the worker: cstratak-RHEL8-x86_64. |
There is an issue which I discovered after I returned from holidays: the buildbot-worker keeps getting disconnected from the master, so builds start and end abruptly, leaving some artifacts behind. The next second it tries again with the same result, eventually filling the hard disk with the artifacts. It might be due to an updated package, but I've yet to discover what the issue is. |
There were almost 10GB of remnant cc* files in /tmp from the compilers used, which I presume were also the temporary artifacts which remained there after the disconnects. Cleaned those up and rebooted the RHEL8 x86_64 buildbot. |
python-builder-rawhide had its /tmp partition full of temporary "ccXXXX.XXX" files. Before: /tmp was full at 100% (3.9 GB). After sudo rm -f /tmp/cc*, only 52 KB are used (1%). I'm not sure why gcc/clang left so many temporary files :-/ There are many large (22 MB) assembly files (.s). |
Statistics for the fullest partition on each worker.
Fedora Rawhide x86-64 is ok: /dev/mapper/vg_root_python--builder--rawhide.osci.io-root 14G 5,4G 7,6G 42% /
Fedora Stable x86-64 is ok: /dev/mapper/vg_root_python--builder2--rawhide.osci.io-root 14G 7,7G 5,2G 60% /
RHEL8 x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel8.osci.io-root 14G 3,5G 9,5G 27% /
RHEL7 x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel7.osci.io-root 7,6G 3,6G 3,7G 49% /
RHEL8 FIPS x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel8--fips.osci.io-root 15G 2.8G 12G 20% /
Fedora Rawhide AArch64 is ok: /dev/mapper/fedora-root 44G 26G 19G 58% /
Fedora Stable AArch64 is ok: /dev/mapper/fedora-root 44G 33G 11G 76% /
RHEL7 AArch64 is ok: /dev/mapper/rhel-root 44G 15G 30G 33% /
RHEL8 AArch64 had about 22 GB in /tmp, which I removed. It's now better:
Fedora Stable ppc64le /tmp contained 1 GB of temporary files. I removed them. Before: /dev/mapper/fedora-root 45G 29G 17G 63% /
Fedora Rawhide ppc64le is ok: /dev/mapper/fedora-root 45G 27G 19G 59% /
RHEL7 ppc64le is ok: /dev/mapper/rhel-root 45G 19G 27G 42% /
RHEL8 ppc64le had 22 GB of old files in /tmp: removed, rebooted. Before: /dev/mapper/rhel-root 45G 41G 4.2G 91% / |
27 buildbot workers are detached. They are detached every minute!
Affected workers: aixtools-aix-power6
Output of: grep Worker.detach twistd.log|sed -e 's!.*(!!g;s!)!!g'|sort -u|uniq
Example of server logs: 2020-08-27 15:33:49+0000 [Broker,33580,10.132.169.156] Worker.detached(billenstein-macos) |
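For anyone who prefers to do the same extraction in Python, here is a minimal sketch of the grep/sed pipeline above. It assumes the detach lines look like the example quoted in this comment; the function name `count_detaches` is made up for illustration. Unlike the shell pipeline, it also counts how often each worker detached.

```python
import re
from collections import Counter

# Matches detach lines such as "... Worker.detached(billenstein-macos)".
DETACH_RE = re.compile(r"Worker\.detached\(([^)]*)\)")


def count_detaches(lines):
    """Count Worker.detached events per worker name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DETACH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open twistd.log file object to `count_detaches()` and calling `sorted()` on the result should give the same unique worker names as the shell command.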
I closed bpo-41648 "edelsohn-* buildbot worker failing with: No space left on device" as a duplicate of this issue. |
I have found a large number of un-removed files in /tmp. Things seem to function better with Buildbots running older 0.x "buildslave" as opposed to newer "buildbot-worker" instances. |
Right. I found many /tmp/ccXXXX.XXX and /tmp/tmpXXXXX files. Around 20 GB of these files! Maybe passing "-pipe" to gcc/clang would avoid the /tmp/ccXXXX.XXX files when a build is interrupted. For example, I saw assembly files (.s) of around 20 MB. I don't know what the /tmp/tmpXXXXX files are. I'm disappointed that in 2020, buildbot has no safe way to ensure that all created files are removed at the end of a build. chroot, containers, etc. are efficient ways to ensure that everything is removed at the end of a build. |
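To check a worker for leftovers before deleting anything, a small report script can help. This is only a sketch: the "cc*" and "tmp*" patterns come from the file names reported in this thread, while the function names and the age threshold are made up for illustration. It only lists regular files and never deletes anything.

```python
import fnmatch
import os
import time

# File name patterns reported in this thread (compiler and other temp files).
PATTERNS = ("cc*", "tmp*")


def find_stale_files(directory, min_age_seconds, patterns=PATTERNS, now=None):
    """Return sorted (path, size) pairs for matching regular files older than min_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip directories and symlinks; only plain files are reported.
            if not entry.is_file(follow_symlinks=False):
                continue
            if not any(fnmatch.fnmatch(entry.name, p) for p in patterns):
                continue
            info = entry.stat(follow_symlinks=False)
            if now - info.st_mtime >= min_age_seconds:
                stale.append((entry.path, info.st_size))
    return sorted(stale)


def total_size(stale):
    """Sum the sizes (in bytes) of the files returned by find_stale_files()."""
    return sum(size for _, size in stale)
```

For example, `find_stale_files("/tmp", 24 * 3600)` would list matching files untouched for a day. Review the list before removing anything, since a running build may still own recent files.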
On the worker (client) side, I see many "lost remote step" every 1 to 3 minutes. Example with the PPC64LE Fedora Stable (cstratak-fedora-stable-ppc64le) worker: 2020-08-27 01:30:09-0400 [Broker,client] lost remote step |
The buildbot server migrated to a new machine and is now behind a load balancer. tcp/80 (buildbot web page, HTTP) and tcp/9020 (used by buildbot workers) are both behind the load balancer. Maybe the load balancer closes TCP connections which are idle for 60 seconds? Buildbot workers have a TCP keepalive option of 1 hour (3600 seconds) by default: Pablo told me that his worker uses a keepalive of 2 minutes (120 seconds). |
Ernest confirmed that there are edge load balancers for the PSF infra in DigitalOcean. He updated the load balancers to offer a full 24 hour timeout on buildbot TCP connections. (Yesterday around 17:30 UTC.) It seems like it doesn't fix the issue. Example in server logs: (...) |
I added the keepalive_interval=60 parameter to Worker() in the server configuration: |
There are multiple errors in the buildbot server logs. I'm not sure if they are related or not.
2020-08-28 09:16:25+0000 [-] while invoking <bound method HttpStatusPushBase.buildStarted of <buildbot.reporters.github.GitHubStatusPush object at 0x7f661dd58850>>
Traceback (most recent call last):
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/http.py", line 80, in getMoreInfoAndSend
yield self.send(build)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here>
Error when stopping the server. It seems like this one is just a client trying to reconnect while the server is down:
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] BuildMaster is stopped
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] invalid login from unknown user 'ware-win81-release'
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Peer will receive following PB traceback:
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Unhandled Error
Traceback (most recent call last):
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 460, in callback
self._startRunCallbacks(result)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
self._runCallbacks()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1475, in gotResult
_inlineCallbacks(r, g, status)
--- <exception caught here>
|
On the server side, it seems like the "edelsohn-rhel8-z" worker is detached because its TCP connection is closed, only 87 seconds after the worker was attached. I added some debug traces: 2020-08-28 09:44:02+0000 [Broker,2,10.132.169.156] worker 'edelsohn-rhel8-z' attaching from IPv4Address(type='TCP', host='10.132.169.156', port=56234) It doesn't say if the client closed the connection on purpose, or if the load balancer closed an inactive connection. |
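To measure how long each connection survives, the attach and detach lines can be paired up. This is a rough sketch: it assumes the attach lines look like the debug trace quoted above and the detach lines look like the "Worker.detached(name)" lines quoted earlier in this issue. The function name `session_durations` is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the twistd.log lines quoted in this issue.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) ")
# Attach format taken from the debug trace above; detach format from earlier logs.
ATTACH_RE = re.compile(r"worker '([^']+)' attaching from")
DETACH_RE = re.compile(r"Worker\.detached\(([^)]*)\)")


def session_durations(lines):
    """Yield (worker, seconds) for each attach followed by a detach of the same worker."""
    attached_at = {}
    for line in lines:
        ts_match = TIMESTAMP_RE.match(line)
        if not ts_match:
            continue  # traceback lines and other continuation lines
        when = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S%z")
        attach = ATTACH_RE.search(line)
        if attach:
            attached_at[attach.group(1)] = when
            continue
        detach = DETACH_RE.search(line)
        if detach and detach.group(1) in attached_at:
            start = attached_at.pop(detach.group(1))
            yield detach.group(1), (when - start).total_seconds()
```

Durations clustered around a fixed value would point at an idle timeout somewhere on the path rather than at random network failures, although the logs alone cannot say which side closed the connection.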
I can provide some information from the logs of one of the buildbots, or change a parameter. Let me know. |
David: can you please change the buildbot client configuration to use "buildbot-api.python.org" host name? This host name doesn't go through the PSF load balancer. It seems like TCP connection issues are coming from the load balancer. Also, please reduce the keepalive to 60 seconds (keepalive_interval). |
Charris, Pablo and I identified that TCP connections are closed by the load balancer on some buildbot workers. When the "buildbot.python.org" host name is used, TCP connections (tcp port 9020) go through a load balancer. Ernest exposed the TCP port 9020 directly to the Internet (without the load balancer) using a new host name: "buildbot-api.python.org". Buildbot workers should be updated to use "buildbot-api.python.org". I also suggest using a keepalive of 60 seconds, rather than 600 seconds. If your worker was impacted by this issue, I strongly advise you to manually clean up the temporary directory (/tmp). When a worker was disconnected, the build was interrupted without removing temporary files. On some workers, we got around 20 GB of temporary files in /tmp: "ccXXXX" files and "tmpXXXX" files. I guess that some files come from the compiler and others from the Python test suite. I updated the buildbot client configuration of the 9 workers operated by Red Hat: Fedora Rawhide x86-64 On our workers, I used the following commands: systemctl stop buildbot-worker.service |
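For worker owners who want to script the change described above, here is a minimal sketch that rewrites the worker's buildbot.tac text. The variable names `buildmaster_host` and `keepalive` are an assumption based on what a typical generated buildbot-worker buildbot.tac contains, so check your own file first. The helper name `update_tac` is made up for illustration.

```python
import re


def update_tac(text, host="buildbot-api.python.org", keepalive=60):
    """Return buildbot.tac text with buildmaster_host and keepalive replaced.

    Assumes each setting is a single top-level assignment on its own line.
    """
    text, n_host = re.subn(
        r"(?m)^([ \t]*buildmaster_host[ \t]*=[ \t]*).*$",
        lambda m: m.group(1) + repr(host),
        text,
    )
    text, n_keep = re.subn(
        r"(?m)^([ \t]*keepalive[ \t]*=[ \t]*).*$",
        lambda m: m.group(1) + str(keepalive),
        text,
    )
    # Refuse to guess if the file does not look as expected.
    if n_host != 1 or n_keep != 1:
        raise ValueError("expected exactly one buildmaster_host and one keepalive line")
    return text
```

Stop the worker service before writing the result back and start it again afterwards. Raising an error when a line is missing is deliberate, so an unusual buildbot.tac is left untouched rather than half-edited.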
Ah, I also updated: Fedora Stable ppc64le |
I have updated edelsohn-aix-ppc64 |
The gps-* bots have been updated. |
All the pablogsal-* buildbots have been updated |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.