Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

networkd-test.py test_transient_hostname is flaky #4753

Closed
martinpitt opened this issue Nov 27, 2016 · 5 comments
Closed

networkd-test.py test_transient_hostname is flaky #4753

martinpitt opened this issue Nov 27, 2016 · 5 comments

Comments

@martinpitt
Copy link
Contributor

Sometimes the recently introduced test for setting the transient host name fails:

======================================================================
FAIL: test_transient_hostname (__main__.DnsmasqClientTest)
networkd sets transient hostname from DHCP
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/adttmp/autopkgtest-virt-lxc.shared.xmf9l35n/downtmp/build.leB/systemd-debian/test/networkd-test.py", line 442, in test_transient_hostname
    self.assertIn(b'testgreen', subprocess.check_output(['hostnamectl']))
AssertionError: b'testgreen' not found in b'   Static hostname: n/a\nTransient hostname: autopkgtest-lxc-ppfmgn\n         Icon name: computer-container\n           Chassis: container\n        Machine ID: 3b5423839f2c4e83924a044a361eb013\n           Boot ID: 1743ea26733040018c15774418e0d2c5\n    Virtualization: lxc\n  Operating System: Ubuntu 16.04.1 LTS\n            Kernel: Linux 4.4.0-38-generic\n      Architecture: s390x\n'

full log, also seen on other architectures.

In the above case the transient host name did get set, but is wrong (it's the real host name).

It requires further information to debug this as this is hard to reproduce. It could be a bug in the test or some race condition between networkd and hostnamed. As a starter I'll make the test show journal output on failure.

@martinpitt martinpitt self-assigned this Nov 27, 2016
martinpitt added a commit to martinpitt/systemd that referenced this issue Nov 27, 2016
This test fails sometimes but it is hard to reproduce, so we need more
information what happens. Set journal log level to "debug" for the entirety of
the test script, and show networkd's and hostnamed's journals on failure of
test_transient_hostname().

This should help with tracking down issue systemd#4753.
martinpitt added a commit to martinpitt/systemd that referenced this issue Nov 27, 2016
This test fails sometimes but it is hard to reproduce, so we need more
information what happens. Set journal log level to "debug" for the entirety of
networkd-test.py, and show networkd's and hostnamed's journals on failure of
the two test_transient_hostname* tests.

This should help with tracking down issue systemd#4753.
martinpitt added a commit to martinpitt/systemd that referenced this issue Nov 28, 2016
This test fails sometimes but it is hard to reproduce, so we need more
information what happens. Set journal log level to "debug" for the entirety of
networkd-test.py, and show networkd's and hostnamed's journals and the DHCP
server log on failure of the two test_transient_hostname* tests. Also sync the
journal before querying it to get more precise output.

This should help with tracking down issue systemd#4753.
martinpitt added a commit to martinpitt/systemd that referenced this issue Nov 28, 2016
This test fails sometimes but it is hard to reproduce, so we need more
information what happens. Set journal log level to "debug" for the entirety of
networkd-test.py, and show networkd's and hostnamed's journals and the DHCP
server log on failure of the two test_transient_hostname* tests. Also sync the
journal before querying it to get more precise output.

This should help with tracking down issue systemd#4753.
evverx pushed a commit that referenced this issue Nov 28, 2016
This test fails sometimes but it is hard to reproduce, so we need more
information what happens. Set journal log level to "debug" for the entirety of
networkd-test.py, and show networkd's and hostnamed's journals and the DHCP
server log on failure of the two test_transient_hostname* tests. Also sync the
journal before querying it to get more precise output.

This should help with tracking down issue #4753.
@martinpitt
Copy link
Contributor Author

PR #4754 landed now, so the next time we see this failure we should get some clue what goes wrong.

@evverx
Copy link
Member

evverx commented Nov 29, 2016

@martinpitt ,
https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-pitti-systemd-semaphore/xenial/s390x/s/systemd-upstream/20161129_202059_5d48c@/log.gz

test_transient_hostname (__main__.DnsmasqClientTest)
networkd sets transient hostname from DHCP ... ---- systemd-networkd.service ----
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Trying to enqueue job systemd-networkd.service/start/replace
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Installed new job systemd-networkd.service/start as 1208
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Enqueued job systemd-networkd.service/start as 1208
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: ConditionCapability=CAP_NET_ADMIN succeeded.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Failed to reset devices.list: Operation not permitted
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Passing 0 fds to service
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: About to execute: /lib/systemd/systemd-networkd
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Forked /lib/systemd/systemd-networkd as 6032
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Changed dead -> start
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: Starting Network Service...
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[6032]: systemd-networkd.service: Executing: /lib/systemd/systemd-networkd
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Got notification message from PID 6032 (WATCHDOG=1)
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: test_eth42: IPv6 enabled for interface: Success
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: eth0: Gained IPv6LL
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: Enumeration completed
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Got notification message from PID 6032 (READY=1, STATUS=Processing requests...)
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Changed start -> running
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-networkd.service: Job systemd-networkd.service/start finished, result=done
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: Started Network Service.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: test_eth42: Gained carrier
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: router_eth42: Gained carrier
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd-networkd[6032]: test_eth42: DHCPv4 address 192.168.5.210/24 via 192.168.5.1
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd-networkd[6032]: router_eth42: Gained IPv6LL
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd-networkd[6032]: test_eth42: Gained IPv6LL
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd-networkd[6032]: test_eth42: Configured
---- systemd-hostnamed.service ----
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Trying to enqueue job systemd-hostnamed.service/stop/replace
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Installed new job systemd-hostnamed.service/stop as 1163
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Enqueued job systemd-hostnamed.service/stop as 1163
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Changed running -> stop-sigterm
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: Stopping Hostname Service...
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Child 5776 belongs to systemd-hostnamed.service
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Main process exited, code=killed, status=15/TERM
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Changed stop-sigterm -> dead
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Job systemd-hostnamed.service/stop finished, result=done
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: Stopped Hostname Service.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: cgroup is empty
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Collecting.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Collecting.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Collecting.
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Trying to enqueue job systemd-hostnamed.service/start/replace
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Installed new job systemd-hostnamed.service/start as 1213
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Enqueued job systemd-hostnamed.service/start as 1213
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Passing 0 fds to service
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: About to execute: /lib/systemd/systemd-hostnamed
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Forked /lib/systemd/systemd-hostnamed as 6035
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Changed dead -> start
Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: Starting Hostname Service...
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[6035]: systemd-hostnamed.service: Executing: /lib/systemd/systemd-hostnamed
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Got notification message from PID 6035 (WATCHDOG=1)
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: D-Bus name org.freedesktop.hostname1 changed owner from  to :1.85
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Changed start -> running
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Job systemd-hostnamed.service/start finished, result=done
Nov 29 20:18:26 autopkgtest-lxc-hbhvat systemd[1]: Started Hostname Service.
Nov 29 20:18:26 testgreen systemd-hostnamed[6035]: Changed host name to 'testgreen'


---- dnsmasq log ----
Nov 29 20:18:25 dnsmasq[6029]: started, version 2.75 cachesize 150
Nov 29 20:18:25 dnsmasq[6029]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCP, IP range 192.168.5.10 -- 192.168.5.200, lease time 1h
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCP, sockets bound exclusively to interface router_eth42
Nov 29 20:18:25 dnsmasq[6029]: reading /etc/resolv.conf
Nov 29 20:18:25 dnsmasq[6029]: using nameserver 10.0.3.1#53
Nov 29 20:18:25 dnsmasq[6029]: using nameserver 127.0.0.53#53
Nov 29 20:18:25 dnsmasq[6029]: read /etc/hosts - 7 addresses
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCPDISCOVER(router_eth42) de:ad:be:ef:47:11 
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCPOFFER(router_eth42) 192.168.5.210 de:ad:be:ef:47:11 
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCPREQUEST(router_eth42) 192.168.5.210 de:ad:be:ef:47:11 
Nov 29 20:18:25 dnsmasq-dhcp[6029]: DHCPACK(router_eth42) 192.168.5.210 de:ad:be:ef:47:11 testgreen

------

FAIL

@martinpitt
Copy link
Contributor Author

yay, hit from PR #4694, the s390x one from @evverx above and the amd64 one too.

Unfortunately the networkd journal is definitively cut short in both logs, and the amd64 one even has a truncated hostnamed log. But at least we know that in both cases dnsmasq actually handed out the expected IP and host name, and in the s390x log hostnamed confirmed the Changed host name to 'testgreen'. Apparently there is some async propagation delay involved here? The weird thing is that in both cases hostnamectl does show a transient hostname (normally it doesn't), but it's the wrong (original) one.

So as the next step I could add some timeout loop (say, 5 seconds?) which waits for hostnamectl to show the expected transient hostname. We would then at least see if this is just a race condition (we look too early) or a permanently wrong one. What I'm not sure about is whether the retry loop is actually justified, or would just hide some bug.

martinpitt added a commit to martinpitt/systemd that referenced this issue Nov 29, 2016
Sometimes setting the transient hostname does not happen synchronously, so
retry up to five times. It is not yet clear whether this is legitimate
behaviour or an underlying bug, but this will at least show whether the wrong
transient hostname is just a race condition or permanently wrong.

See issue systemd#4753
@evverx
Copy link
Member

evverx commented Nov 29, 2016

@martinpitt I think this

Nov 29 20:18:25 autopkgtest-lxc-hbhvat systemd[1]: systemd-hostnamed.service: Trying to enqueue job systemd-hostnamed.service/start/replace

means that the systemd-hostnamed was activated by hostnamectl at 20:18:25
And this

Nov 29 20:18:26 testgreen systemd-hostnamed[6035]: Changed host name to 'testgreen'

means that the hostname was changed at 20:18:26

So, yes

So as the next step I could add some timeout loop (say, 5 seconds?) which waits for hostnamectl to show the expected transient hostname

makes sense

@orodbhen
Copy link

Do you know if this fix has been backported to RHEL/CentOS 7?

I'm using networkd on a set of tftpboot systems, and it occasionally fails to set the hostname from DHCP. It get's the IP address, and establishes an active LAN connection, but the hostname remains localhost.

Most of the time, everything works, but not always.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants