Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buildbot: workers detached every minute and "no space left on device" issue #85808

Closed
tirkarthi opened this issue Aug 26, 2020 · 23 comments
Closed
Labels
3.10 only security fixes tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@tirkarthi
Copy link
Member

BPO 41642
Nosy @gpshead, @vstinner, @stratakis, @pablogsal, @tirkarthi

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-08-26.09:29:54.047>
labels = ['type-bug', 'tests', '3.10']
title = 'Buildbot: workers detached every minute and "no space left on device" issue'
updated_at = <Date 2020-09-02.19:11:07.427>
user = 'https://github.com/tirkarthi'

bugs.python.org fields:

activity = <Date 2020-09-02.19:11:07.427>
actor = 'pablogsal'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Tests']
creation = <Date 2020-08-26.09:29:54.047>
creator = 'xtreak'
dependencies = []
files = []
hgrepos = []
issue_num = 41642
keywords = []
message_count = 23.0
messages = ['375925', '375926', '375928', '375930', '375931', '375934', '376006', '376007', '376008', '376009', '376010', '376011', '376022', '376023', '376026', '376027', '376038', '376230', '376234', '376235', '376238', '376258', '376259']
nosy_count = 6.0
nosy_names = ['gregory.p.smith', 'vstinner', 'David.Edelsohn', 'cstratak', 'pablogsal', 'xtreak']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue41642'
versions = ['Python 3.10']

@tirkarthi
Copy link
Member Author

It seems many of the RHEL and Fedora builds fail due to disk space

https://buildbot.python.org/all/#/builders/185/builds/2

./configure: line 2382: cannot create temp file for here-document: No space left on device
./configure: line 2394: cannot create temp file for here-document: No space left on device
./configure: line 2429: cannot create temp file for here-document: No space left on device
./configure: line 2591: cannot create temp file for here-document: No space left on device
./configure: line 2595: cannot create temp file for here-document: No space left on device
./configure: line 2599: cannot create temp file for here-document: No space left on device
./configure: line 2603: cannot create temp file for here-document: No space left on device
./configure: line 2607: cannot create temp file for here-document: No space left on device
./configure: line 2611: cannot create temp file for here-document: No space left on device

@tirkarthi tirkarthi added 3.10 only security fixes tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Aug 26, 2020
@vstinner
Copy link
Member

It seems many of the RHEL and Fedora builds fail due to disk space

These workers have different owners and so need to reach different people. We should list all impacted workers.

https://buildbot.python.org/all/#/builders/185/builds/2

AMD64 RHEL8 3.x is the worker: cstratak-RHEL8-x86_64.

@stratakis
Copy link
Mannequin

stratakis mannequin commented Aug 26, 2020

There is an issue which I discovered after I returned from holidays, basically the buildbot-worker keeps getting disconnected from master, so builds start and end abruptly, retaining some artifacts.

The next second it tried again with the same results, eventually filling the hard disk with the artifacts.

Might be due to an updated package, but I've yet to discover what the issue is.

@stratakis
Copy link
Mannequin

stratakis mannequin commented Aug 26, 2020

There were almost 10GB of remnant cc* files in /tmp from the compilers used, which I presume were also the temporary artifacts which remained there after the disconnects.

Cleaned those up and rebooted the RHEL8 x86_64 buildbot.

@vstinner
Copy link
Member

python-builder-rawhide had its /tmp partition full of temporary "ccXXXX.XXX" files. Before: /tmp was full at 100% (3.9 GB). After sudo rm -f /tmp/cc*, only 52 KB are used (1%).

I'm not sure why gcc/clang left so many temporary files :-/ There are many large (22 MB) assembly files (.s).

@vstinner
Copy link
Member

Statistics on partition which are the most full.

Fedora Rawhide x86-64 is ok:

/dev/mapper/vg_root_python--builder--rawhide.osci.io-root 14G 5,4G 7,6G 42% /
/dev/mapper/vg_root_python--builder--rawhide.osci.io-home 36G 24G 11G 70% /home

Fedora Stable x86-64 is ok:

/dev/mapper/vg_root_python--builder2--rawhide.osci.io-root 14G 7,7G 5,2G 60% /
/dev/mapper/vg_root_python--builder2--rawhide.osci.io-home 36G 23G 12G 67% /home

RHEL8 x86-64 is ok:

/dev/mapper/vg_root_python--builder--rhel8.osci.io-root 14G 3,5G 9,5G 27% /
/dev/mapper/vg_root_python--builder--rhel8.osci.io-home 36G 9,7G 24G 29% /home

RHEL7 x86-64 is ok:

/dev/mapper/vg_root_python--builder--rhel7.osci.io-root 7,6G 3,6G 3,7G 49% /
/dev/mapper/vg_root_python--builder--rhel7.osci.io-home 22G 15G 5,9G 71% /home

RHEL8 FIPS x86-64 is ok:

/dev/mapper/vg_root_python--builder--rhel8--fips.osci.io-root 15G 2.8G 12G 20% /
/dev/mapper/vg_root_python--builder--rhel8--fips.osci.io-home 34G 3.7G 29G 12% /home

Fedora Rawhide AArch64 is ok:

/dev/mapper/fedora-root 44G 26G 19G 58% /
tmpfs 16G 436K 16G 1% /tmp

Fedora Stable AArch64 is ok:

/dev/mapper/fedora-root 44G 33G 11G 76% /
tmpfs 16G 1,6G 15G 11% /tmp

RHEL7 AArch64 is ok:

/dev/mapper/rhel-root 44G 15G 30G 33% /

RHEL8 AArch64 had like 22 GB in /tmp, I removed them. It's now better:

  • before: /dev/mapper/rhel-root 44G 41G 3.3G 93% /
  • after: /dev/mapper/rhel-root 44G 16G 28G 36% /

Fedora Stable ppc64le /tmp contained 1 GB of temporay files. I removed them. Before:

/dev/mapper/fedora-root 45G 29G 17G 63% /
tmpfs 4.0G 1.1G 3.0G 27% /tmp

Fedora Rawhide ppc64le is ok:

/dev/mapper/fedora-root 45G 27G 19G 59% /
tmpfs 4.0G 384K 4.0G 1% /tmp

RHEL7 ppc64le is ok:

/dev/mapper/rhel-root 45G 19G 27G 42% /

RHEL8 ppc64le had 22 GB of old files in /tmp: removed, rebooted. Before:

/dev/mapper/rhel-root 45G 41G 4.2G 91% /

@vstinner
Copy link
Member

27 buildbot workers are detached. They are are detached every minute! Affected workers:

aixtools-aix-power6
billenstein-macos
bolen-ubuntu
bolen-windows10
bolen-windows7
cstratak-RHEL7-aarch64
cstratak-RHEL7-ppc64le
cstratak-RHEL7-x86_64
cstratak-RHEL8-fips-x86_64
cstratak-RHEL8-ppc64le
cstratak-fedora-stable-ppc64le
edelsohn-aix-ppc64
edelsohn-debian-z
edelsohn-fedora-ppc64
edelsohn-fedora-rawhide-z
edelsohn-fedora-z
edelsohn-rhel-z
edelsohn-rhel8-z
edelsohn-sles-z
gps-clang-ubsan
gps-debian-profile-opt
gps-raspbian
isidentical-centos-power8
kloth-win64
koobs-freebsd-564d
ware-alpine
ware-win81-release

Output of: grep Worker.detach twistd.log|sed -e 's!.*(!!g;s!)!!g'|sort -u|uniq

Example of server logs:

2020-08-27 15:33:49+0000 [Broker,33580,10.132.169.156] Worker.detached(billenstein-macos)
2020-08-27 15:34:43+0000 [Broker,33608,10.132.169.156] Worker.detached(billenstein-macos)
2020-08-27 15:35:40+0000 [Broker,33635,10.132.169.157] Worker.detached(billenstein-macos)
(...)

@vstinner vstinner changed the title RHEL and fedora buildbots fail due to disk space error Buildbot: workers detached every minute and "no space left on device" issue Aug 27, 2020
@vstinner vstinner changed the title RHEL and fedora buildbots fail due to disk space error Buildbot: workers detached every minute and "no space left on device" issue Aug 27, 2020
@vstinner
Copy link
Member

I closed bpo-41648 "edelsohn-* buildbot worker failing with: No space left on device" as a duplicate of this issue.

@DavidEdelsohn
Copy link
Mannequin

DavidEdelsohn mannequin commented Aug 27, 2020

I have found a large number of un-removed files in /tmp. Things seem to function better with Buildbots running older 0.x "buildslave" as opposed to newer "builtbot-worker" instances.

@DavidEdelsohn DavidEdelsohn mannequin changed the title Buildbot: workers detached every minute and "no space left on device" issue RHEL and fedora buildbots fail due to disk space error Aug 27, 2020
@DavidEdelsohn DavidEdelsohn mannequin changed the title Buildbot: workers detached every minute and "no space left on device" issue RHEL and fedora buildbots fail due to disk space error Aug 27, 2020
@vstinner
Copy link
Member

I have found a large number of un-removed files in /tmp.

Right. I found many /tmp/ccXXXX.XXX and /tmp/tmpXXXXX files. Around 20 GB of these files! Maybe using passing "-pipe" to gcc/clang would avoid the /tmp/ccXXXX.XXX files when a build is interrupted. For example, I saw assembly files (.s) of around 20 MB.

I don't know what are the /tmp/tmpXXXXX files.

I'm disappointed that in 2020, buildbot has no safe way to ensure that all created files are removed at the end of a build. chroot, containers, etc. are effecient way to ensure that everything is removed at the end of a build.

@vstinner
Copy link
Member

On the worker (client) side, I see many "lost remote step" every 1 to 3 minutes. Example with the PPC64LE Fedora Stable (cstratak-fedora-stable-ppc64le) worker:

2020-08-27 01:30:09-0400 [Broker,client] lost remote step
2020-08-27 01:31:57-0400 [Broker,client] lost remote step
2020-08-27 01:34:34-0400 [Broker,client] lost remote step
2020-08-27 01:36:29-0400 [Broker,client] lost remote step
2020-08-27 01:38:37-0400 [Broker,client] lost remote step
(...)

@vstinner
Copy link
Member

The buildbot server migrated to a new machine and is now behind a load balancer. tcp/80 (buildbot web page, HTTP) and tcp/9020 (used by buildbot workers) are both behind the load balancer.

Maybe the load balancer closes TCP connections which are idle for 60 seconds?

Buildbot workers have a TCP keepalive option of 1 hour (3600 seconds) by default:
https://docs.buildbot.net/latest/manual/configuration/workers.html#master-worker-tcp-keepalive

Pablo told me that his worker uses a keepalive of 2 minutes (120 seconds).

@vstinner
Copy link
Member

The buildbot server migrated to a new machine and is now behind a load balancer. tcp/80 (buildbot web page, HTTP) and tcp/9020 (used by buildbot workers) are both behind the load balancer.
(...)
Buildbot workers have a TCP keepalive option of 1 hour (3600 seconds) by default (...)

Ernest confirmed that there are edge load balancers for the PSF infra in DigitalOcean. He updated the load balancers to offer a full 24 hour timeout on buildbot TCP connections. (Yesterday around 17:30 UTC.)

It seems like it doesn't fix the issue. Example in server logs:

(...)
2020-08-28 08:21:55+0000 [Broker,50831,10.132.169.157] Worker.detached(koobs-freebsd-564d)
2020-08-28 08:23:25+0000 [Broker,50856,10.132.169.157] Worker.detached(koobs-freebsd-564d)
2020-08-28 08:24:55+0000 [Broker,50881,10.132.169.157] Worker.detached(koobs-freebsd-564d)
2020-08-28 08:26:26+0000 [Broker,50906,10.132.169.157] Worker.detached(koobs-freebsd-564d)
2020-08-28 08:27:56+0000 [Broker,50931,10.132.169.156] Worker.detached(koobs-freebsd-564d)
(...)

@vstinner vstinner changed the title RHEL and fedora buildbots fail due to disk space error Buildbot: workers detached every minute and "no space left on device" issue Aug 28, 2020
@vstinner vstinner changed the title RHEL and fedora buildbots fail due to disk space error Buildbot: workers detached every minute and "no space left on device" issue Aug 28, 2020
@vstinner
Copy link
Member

I added keepalive_interval=60 parameter to Worker() in the server configuration:
python/buildmaster-config@2d28a4c

@vstinner
Copy link
Member

There are multiple errors in the buildbot server logs. I'm not sure if it's related or not.

2020-08-28 09:16:25+0000 [-] while invoking <bound method HttpStatusPushBase.buildStarted of <buildbot.reporters.github.GitHubStatusPush object at 0x7f661dd58850>>
	Traceback (most recent call last):
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
	    result = g.send(result)
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/http.py", line 80, in getMoreInfoAndSend
	    yield self.send(build)
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
	    return _cancellableInlineCallbacks(gen)
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
	    _inlineCallbacks(None, g, status)
	--- <exception caught here> 

  File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/http.py", line 80, in getMoreInfoAndSend
    yield self.send(build)
  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in \_inlineCallbacks
    result = g.send(result)
  File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/github.py", line 133, in send
    m = re.search(r"refs/pull/([0-9]*)/merge", branch)
  File "/usr/lib/python3.8/re.py", line 201, in search
    return \_compile(pattern, flags).search(string)
builtins.TypeError: expected string or bytes-like object

Error when stopping the server. It seems like this one is just that a client tries to reconnect whereas the server is down:

2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] BuildMaster is stopped
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] invalid login from unknown user 'ware-win81-release'
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Peer will receive following PB traceback:
2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Unhandled Error
	Traceback (most recent call last):
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 460, in callback
	    self._startRunCallbacks(result)
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
	    self._runCallbacks()
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1475, in gotResult
	    _inlineCallbacks(r, g, status)
	--- <exception caught here> 

  File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in \_inlineCallbacks
    result = g.send(result)
  File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/pbmanager.py", line 217, in requestAvatarId
    eventually(self.master.initLock.release)
builtins.AttributeError: 'NoneType' object has no attribute 'initLock'

@vstinner
Copy link
Member

On the server side, it seems like the "edelsohn-rhel8-z" worker is detached because its TCP connection is closed, only 87 seconds after the worker was attached. I added some debug traces:

2020-08-28 09:44:02+0000 [Broker,2,10.132.169.156] worker 'edelsohn-rhel8-z' attaching from IPv4Address(type='TCP', host='10.132.169.156', port=56234)
2020-08-28 09:44:02+0000 [Broker,2,10.132.169.156] Got workerinfo from 'edelsohn-rhel8-z'
(...)
2020-08-28 09:45:29+0000 [Broker,2,10.132.169.156] @ detached: worker=<Worker 'edelsohn-rhel8-z'>
2020-08-28 09:45:29+0000 [Broker,2,10.132.169.156] @ detached: TB: File "<string>", line 1, in <module>
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/twistd.py", line 31, in run
app.run(runApp, ServerOptions)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 674, in run
runApp(config)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/twistd.py", line 25, in runApp
runner.run()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 385, in run
self.postApplication()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/_twistd_unix.py", line 268, in postApplication
self.startReactor(None, self.oldstdout, self.oldstderr)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 398, in startReactor
runReactorWithLogging(
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 312, in runReactorWithLogging
reactor.run()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/base.py", line 1283, in run
self.mainLoop()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/base.py", line 1295, in mainLoop
self.doIteration(t)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 235, in doPoll
log.callWithLogger(selectable, _drdw, selectable, fd, event)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/log.py", line 103, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/log.py", line 86, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext
return func(*args,**kw)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite
self._disconnectSelectable(selectable, why, inRead)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 252, in _disconnectSelectable
selectable.readConnectionLost(f)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/tcp.py", line 307, in readConnectionLost
self.connectionLost(reason)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/tcp.py", line 327, in connectionLost
protocol.connectionLost(reason)
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/spread/pb.py", line 717, in connectionLost
notifier()
File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/spread/pb.py", line 1572, in maybeLogout
fn()
File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/pbmanager.py", line 190, in <lambda>
return (pb.IPerspective, persp, lambda: persp.detached(mind))
File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/worker/protocols/pb.py", line 155, in detached
import traceback, io; out = io.StringIO(); traceback.print_stack(file=out); tb = out.getvalue()

It doesn't say if the client closed the connection on purpose, or if the load balancer closed an inactive connection.

@DavidEdelsohn
Copy link
Mannequin

DavidEdelsohn mannequin commented Aug 28, 2020

I can provide some information from the logs of one of the buildbots, or change a parameter. Let me know.

@vstinner
Copy link
Member

vstinner commented Sep 2, 2020

I can provide some information from the logs of one of the buildbots, or change a parameter. Let me know.

David: can you please change the buildbot client configuration to use "buildbot-api.python.org" host name? This host name doesn't go through the PSF load balancer. It seems like TCP connection issues are coming from the load balancer. Also, please reduce the keepalive to 60 seconds (keepalive_interval).

@vstinner
Copy link
Member

vstinner commented Sep 2, 2020

Charris, Pablo and me identified that TCP connections are closed by the load balancer on some buildbot workers.

When the "buildbot.python.org" host name is used, TCP connections (tcp port 9020) go through a load balancer.

Ernest exposed the TCP port 9020 directly to the Internet (without the load balancer) using a new host name: "buildbot-api.python.org".

Buildbot workers should be updated to use "buildbot-api.python.org". I also suggest to use a keepalive of 60 seconds, rather than 600 seconds.

If your worker got impacted the this issue, I strongly advice you to clean up manually the temporary directory (/tmp). When a worker was disconnected, the build was interrupted without removing temporary files. On some workers, we got around 20 GB of temporary files in /tmp: "ccXXXX" files and "tmpXXXX" files. I guess that some files are coming from the compiler, some other from the Python test suite.

I updated the buildbot client configuration of the 9 workers operated by Red Hat:

Fedora Rawhide x64-86
Fedora Stable x64-86
RHEL8 x64-86
RHEL7 x64-86
RHEL8 FIPS x86-64
Fedora Rawhide AArch64
Fedora Stable AArch64
RHEL 8 ppc64le
RHEL 7 ppc64le

On our owners, I used the following commands:

systemctl stop buildbot-worker.service
du -sh /tmp; rm -f /tmp/{cc,tmp}*; du -sh /tmp
sed -i -e "s/buildmaster_host = 'buildbot.python.org'/buildmaster_host = 'buildbot-api.python.org'/;s/keepalive = .*/keepalive = 60/" /home/buildbot/buildarea/buildbot.tac; grep -E '(host|keepalive) =' /home/buildbot/buildarea/buildbot.tac
systemctl start buildbot-worker.service
systemctl status buildbot-worker.service

@vstinner
Copy link
Member

vstinner commented Sep 2, 2020

Ah, I also updated:

Fedora Stable ppc64le
Fedora Rawhide ppc64le

@DavidEdelsohn
Copy link
Mannequin

DavidEdelsohn mannequin commented Sep 2, 2020

I have updated

edelsohn-aix-ppc64
edelsohn-debian-z
edelsohn-fedora-ppc64
edelsohn-fedora-rawhide-z
edelsohn-fedora-z
edelsohn-rhel-z
edelsohn-rhel8-z
edelsohn-sles-z
aixtools-aix-power6

@gpshead
Copy link
Member

gpshead commented Sep 2, 2020

The gps-* bots have been updated.

@pablogsal
Copy link
Member

All the pablogsal-* buildbots have been updated

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.10 only security fixes tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants