
v243 breaks libvirt lxc guest support #13629

Open
cpaelzer opened this issue Sep 23, 2019 · 19 comments

@cpaelzer
Contributor

cpaelzer commented Sep 23, 2019

With recent systemd 243, as in Ubuntu's systemd packages, libvirt LXC guests no longer survive a restart of libvirtd.

Note: This was reported to Ubuntu in bug 1844879

Formerly, an LXC guest of the following style worked:

$ export LIBVIRT_DEFAULT_URI='lxc:///'
$ virsh define smoke-lxc.xml
$ virsh start sl
$ virsh list
 Id     Name   State
------------------------
 2280   sl     running
# At this point we know and have confirmed the guest works
# Now restarting libvirtd breaks it.
$ systemctl restart libvirtd
# Now the guest container is gone

Attached is the guest definition smoke-lxc.xml from the Debian/Ubuntu testcase.
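For illustration, a guest definition of this general shape could look as follows; this is a hypothetical minimal sketch, not the attached file (the name sl matches the test, init /bin/bash matches the payload seen in the logs below, other values are placeholders):

$ cat > smoke-lxc.xml <<'EOF'
<domain type='lxc'>
  <name>sl</name>
  <memory unit='MiB'>64</memory>
  <os>
    <type>exe</type>
    <init>/bin/bash</init>
  </os>
  <devices>
    <console type='pty'/>
  </devices>
</domain>
EOF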

Up until recently this worked, and guests survived a libvirtd restart.
I pinged upstream libvirt (it is possible that LXC guest cgroup management would need to be fixed there or in lxc itself), but it seems this behavior hasn't been reported there yet either.

Note: restarting libvirtd with libvirt-lxc guests running always triggered

Sep 23 12:51:46 autopkgtest systemd[1]: libvirtd.service: Found left-over process 2428 (bash) in control group while starting unit. Ignoring.
Sep 23 12:51:46 autopkgtest systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 23 12:51:46 autopkgtest systemd[1]: libvirtd.service: Found left-over process 2426 (libvirt_lxc) in control group while starting unit. Ignoring.
Sep 23 12:51:46 autopkgtest systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

So far the processes stayed around and libvirt still had a container to manage after the restart.
But since the systemd fix for issue #12386 came in via commit 0219b35, this broke libvirt-lxc.

@berrange
Contributor

The containers that libvirt starts get put into their own control group / systemd scope under /machine.slice. The exception is the libvirt_lxc controller process, which is still in the main libvirtd.service control group. I guess the latter is what's being killed off and causing the containers to go away from libvirt's POV. The libvirtd.service unit file explicitly uses KillMode=process so that these supplementary processes remain after libvirtd is stopped. The same is true of the dnsmasq processes libvirtd starts, btw - are you seeing those killed off too by chance?
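One way to verify that split is to read each helper's cgroup membership directly from /proc (a sketch; PIDs taken from the log excerpt above):

$ cat /proc/2426/cgroup   # libvirt_lxc controller
$ cat /proc/2428/cgroup   # bash payload inside the container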

@cpaelzer
Contributor Author

cpaelzer commented Sep 23, 2019

FTR this is libvirt 5.4 due to the Ubuntu feature freeze. But I have not seen any related change in libvirt, nor did @berrange know about any that would be worth trying.

@cpaelzer
Contributor Author

@berrange the PIDs of the dnsmasq processes associated with libvirt do not change with systemd 243 installed. So the issue seems to be more specific to libvirt-lxc in this case.

@cpaelzer
Contributor Author

I enabled verbose systemd logging, which produces a lot of output, but I was able to gather some logs that might be useful.

I did systemctl restart libvirtd with different configurations:
a) No guest started, just the dnsmasq processes: restart-libvirtd-with-systemd-debug-onlydnsmasq.txt
b) One libvirt-lxc guest started, which is then lost: restart-libvirtd-with-systemd-debug-lxcguest.txt

Comparing (a) and (b), I see the message related to commit 0219b35 showing up at line 425 / 421 respectively.

  systemd[1]: libvirtd.service: Failed to destroy cgroup /system.slice/libvirtd.service, ignoring: Device or resource busy
  systemd[1]: libvirtd.service: Job 701 libvirtd.service/restart finished, result=done                                                                                                        
  systemd[1]: Stopped Virtualization daemon.                                          
  systemd[1]: libvirtd.service: Converting job libvirtd.service/restart -> libvirtd.service/start
  systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/libvirtd_2eservice interface=org.freedesktop.DBus.Properties member
  systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/libvirtd_2eservice interface=org.freedesktop.DBus.Properties member
  systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/libvirtd_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged co
  systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/libvirtd_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged co
  systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/job/701 interface=org.freedesktop.DBus.Properties member=PropertiesChang
  systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/701 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=298 reply_c
  systemd[1]: libvirtd.service: Found left-over process 971 (dnsmasq) in control group while starting unit. Ignoring.
  systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
  systemd[1]: libvirtd.service: Found left-over process 972 (dnsmasq) in control group while starting unit. Ignoring.
  systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
  systemd[1]: libvirtd.service: Found left-over process 1169 (bash) in control group while starting unit. Ignoring.                                                                           
  systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.                                                                           
  systemd[1]: libvirtd.service: Found left-over process 1166 (libvirt_lxc) in control group while starting unit. Ignoring.                                                                    
  systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. 

This is the "removal" of the processes. With the old code the same message was triggered but as we seen in the commit it was an early exit stopping the cleanup at this point.

Interestingly, dnsmasq is reported the same way as the libvirt-lxc guest, but the former "survives" while the latter is gone afterwards.

At the tail of the bad-case log we now see the previously reported libvirt error and the related systemd cleanup:

  libvirtd[1186]: internal error: No valid cgroup for machine sl                                                                                                                              
  libvirtd[1186]: End of file while reading data: Input/output error                                                                                                                          
  systemd[1]: Received SIGCHLD from PID 1166 (libvirt_lxc).                                                                                                                                   
  systemd[1]: Child 1166 (libvirt_lxc) died (code=killed, status=15/TERM)                                                                                                                     
  systemd[1]: libvirtd.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory                                                                    
  systemd[1]: libvirtd.service: Child 1166 belongs to libvirtd.service.                                                                                                                       
  systemd[1]: Received SIGCHLD from PID 1169 (bash).                                                                                                                                          
  systemd[1]: Child 1169 (bash) died (code=exited, status=129/n/a)                                                                                                                            
  systemd[1]: libvirtd.service: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory                                                                    
  systemd[1]: libvirtd.service: Child 1169 belongs to libvirtd.service.

@cpaelzer
Contributor Author

I was trying to check for cgroup/hierarchy differences between dnsmasq and the libvirt-lxc guests.
In this example, PIDs 971/972 are dnsmasq and 1268/1280 are libvirt-lxc related.

They all show up under the same service cgroup:

...
  ├─libvirtd.service
  │ ├─ 971 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_leaseshelper
  │ ├─ 972 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_leaseshelper
  │ ├─1186 /usr/sbin/libvirtd
  │ ├─1268 /usr/lib/libvirt/libvirt_lxc --name sl --console 20 --security=apparmor --handshake 23
  │ └─1280 /bin/bash

And:

$ for i in 971 972 1268 1280; do systemctl status $i | grep CGroup ; done
   CGroup: /system.slice/libvirtd.service
   CGroup: /system.slice/libvirtd.service
   CGroup: /system.slice/libvirtd.service
   CGroup: /system.slice/libvirtd.service

I agree that there is "some" machine slice grouping that lxc guest has on top of dnsmasq:

$ cat /sys/fs/cgroup/cpu,cpuacct/machine/lxc-1268-sl.libvirt-lxc/tasks 
1268
1280
1281

There is no counterpart to that for the dnsmasq processes.

The grouping that the lxc guests have seems to cover the following controllers:

$ find /sys/fs/cgroup/ -name '*1268*' 
/sys/fs/cgroup/memory/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/blkio/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/perf_event/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/freezer/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/cpuset/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/devices/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/net_cls,net_prio/machine/lxc-1268-sl.libvirt-lxc
/sys/fs/cgroup/cpu,cpuacct/machine/lxc-1268-sl.libvirt-lxc

So maybe their grouping in these controllers is what actually makes them susceptible to the cleanup they are hit by with the new systemd version?
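A compact way to compare the two groupings directly (a sketch using the PIDs above; the differing lines show the extra machine/ grouping):

$ diff /proc/971/cgroup /proc/1268/cgroup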

@poettering
Member

What does the systemd-cgls output look like when run on the host while the offending container is running?

@cpaelzer
Contributor Author

Hi @poettering, the first snippet in the comment above was already from systemd-cgls.
Let me attach a full output as a file => systemd-cgls.txt

As mentioned above, they are grouped with the libvirtd.service under system.slice, not, as @berrange expected, under their own machine.slice. The only machine-related (c)grouping I found was what I reported above for lxc-1268-sl.libvirt-lxc.

@cpaelzer
Contributor Author

Since I know that @berrange is usually right :-) I was wondering why the guest isn't in the machine slice he mentioned. So I revisited systemd-cgls and might have found something.
I have gathered data on the same setup for several combinations of libvirt and systemd versions.

Unfortunately I have no systemd 243 for Fedora at the moment to try it there, and F31 didn't want to work for me this morning.

Chances are that we have two issues at once here:

  1. systemd 243 now reaps processes in certain scenarios where it didn't before
  2. Ubuntu's libvirt-lxc leaves containers in exactly the scenario that is now affected

@berrange
Contributor

@cpaelzer it occurred to me that libvirt's support for cgroups has significantly changed over the last 6 months as we've integrated support for cgroups v2. The existing v1 support should not have changed, but there's a non-negligible risk that something was broken by mistake that I don't know about.

Also, can you confirm that you actually have systemd-machined installed, as that can alter the way libvirt deals with cgroups, and the lack of machine.slice in your setup makes me worry it might not be installed.
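A quick check for that (a sketch; the unit and package names are as shipped on Debian/Ubuntu):

$ systemctl is-active systemd-machined.service
$ dpkg -s systemd-container 2>/dev/null | grep '^Status'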

@cpaelzer
Contributor Author

@berrange that cgroupv2 code was exactly why I also added the libvirt 5.0 results above, to check whether the behavior was different back then.

In the meantime I also checked latest Debian:

  • buster libvirt 5.0 / systemd 241
  • sid+experimental libvirt 5.6 / systemd 243

And Debian is affected just like Ubuntu: both versions lack the machine.slice, and systemd 243 breaks the guest on libvirtd restart.

@cpaelzer
Contributor Author

@berrange For your question about machined: I did not find it in the default install of Debian/Ubuntu, but it was installed and active on Fedora.
systemd-machined.service and its siblings are part of the systemd-container package.

Installing it on the affected systems, rebooting, and recreating the case showed that your guess was right. With systemd-container installed the grouping is under machine.slice as expected.
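With machined active, the guest should also show up as a registered machine, roughly like this (a sketch; the exact columns vary by systemd version):

$ machinectl list
MACHINE CLASS     SERVICE
sl      container libvirt-lxc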

@berrange what is libvirt upstream's expectation/recommendation on running libvirt-lxc with/without systemd-machined? Should Debian/Ubuntu just make the package that ships libvirt_driver_lxc.so (or, once we have the split daemons, the package with the lxc-related daemon) recommend or even depend on systemd-container?
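For illustration, such a dependency would be a debian/control hunk roughly like the following (the package name libvirt-daemon-driver-lxc is an assumption; the LXC driver may ship in a differently named package):

Package: libvirt-daemon-driver-lxc
Depends: ${misc:Depends},
         ${shlibs:Depends},
         systemd-container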

@berrange
Contributor

I don't think we've made a clear statement upstream on this matter. libvirt's cgroups code is written so that it will try to work correctly on non-systemd OS distros. So when you have a systemd host without machined present, we fall back to that non-systemd cgroups code. I don't think it makes much sense to support this as an option, though - it just increases the size of the test matrix and introduces new failure modes, as you hit here.

So on balance, I think my suggestion is that the QEMU and LXC drivers should both depend on systemd-container. This is what I did for the Fedora / RHEL RPM spec with:

 commit ffc49e579c14b1d3f24af8d004ded6e3a0e8900f
 Author: Daniel P. Berrange <berrange@redhat.com>
 Date:   Tue Jul 12 15:57:39 2016 +0100
 
     libvirt.spec.in: require systemd-container on >= f24
     
     The systemd-machined tools libvirt uses were split into a
     systemd-container RPM. Without depending on this, libvirt
     may silently fallback to the non-systemd cgroup impl which
     is not desirable.
     
     Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
 

It might be a good idea if we changed libvirt to at least issue a warning if it sees a systemd host without machined.

@cpaelzer
Contributor Author

Ok, I have now tested this on Ubuntu with systemd 243 and systemd-container installed.

I can confirm that the slicing is then correct:

│ ├─libvirtd.service
│ │ ├─751 /usr/sbin/libvirtd
│ │ ├─976 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro -
│ │ ├─977 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro -
│ │ └─993 /usr/lib/libvirt/libvirt_lxc --name sl --console 20 --security=apparmor --handshake 
...
└─machine.slice
  └─machine-lxc\x2d993\x2dsl.scope
    └─994 /bin/bash

But, and that is the non-fun part: the libvirt-lxc container still gets reaped on restart of libvirtd.service.

I can make sure that the dependency on systemd-container gets added to Debian / Ubuntu, which will fix the unexpected grouping. But unfortunately that does not "fix" the issue reported here, namely that the new systemd kills the guests on restart.

I'll re-summarize the logs of this latest setup (with systemd-container) in the next post, so that we can find what might be missing / interesting for this issue.

@cpaelzer
Contributor Author

With the need for systemd-container identified above, I did a new test.
The following logs of systemd-cgls and journalctl (with debug verbosity for systemd) match each other (e.g. PIDs in one match the other).

The initial question remains: since commit 0219b35, either systemd needs to stop reaping these processes, or libvirt needs to manage the processes representing the guest differently so as not to fall victim to it.

@berrange
Contributor

AFAIK, libvirt was already requesting that systemd not reap these processes. In libvirtd.service we set KillMode=process, which is documented as:

If set to process, only the main process itself is killed
...
Processes remaining alive after stop are left in their control group and the control group continues to exist after stop unless it is empty
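That can be double-checked on an affected host (a sketch; the output shown is what the shipped unit is expected to report):

$ systemctl show libvirtd.service -p KillMode
KillMode=process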

@cpaelzer
Contributor Author

cpaelzer commented Oct 8, 2019

Any suggestions on how we should proceed with this?
Ubuntu - for now - has not pulled in v243, so at least "I" have a bit of time.
But this issue will come back as soon as the code lands.

As @berrange outlined, the expectation from libvirt's POV is that nothing should be reaped, but it is. Is there guidance from systemd on how libvirt should further mark/set up the processes/groups/slices so they do not fall victim to the new code?

@poettering
Member

So what is this bug about now? It seems this bug report was initially caused by libvirt not following the documented logic for acquiring a delegated cgroup tree. But now it has turned into something else about killing processes?
Are those processes you don't want killed being killed on start or on stop of the service in question? The general understanding has always been that KillMode controls what to do on stop and that it allows you to leave lingering processes on stop. However, not that those would also be left running during start, because that would mean services wouldn't start in a pristine execution environment anymore...
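This stop-vs-start distinction can be observed without libvirt using a throwaway unit (a minimal sketch; the unit name and payload are hypothetical):

$ sudo tee /etc/systemd/system/leftover-test.service <<'EOF'
[Service]
ExecStart=/bin/sh -c 'sleep 1000 & exec sleep 2000'
KillMode=process
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl start leftover-test
$ sudo systemctl restart leftover-test
# the backgrounded 'sleep 1000' survives the stop, and the subsequent start
# logs "Found left-over process ... Ignoring."
$ pgrep -af 'sleep 1000'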

@poettering
Member

(btw, the delegation concept is documented here: https://systemd.io/CGROUP_DELEGATION.html)
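For reference, the way a payload ends up in its own delegated scope under machine.slice can be approximated with systemd-run (an illustrative sketch of the mechanism, not what libvirt/machined literally executes):

$ sudo systemd-run --scope --slice=machine.slice --property=Delegate=yes /bin/sleep 300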

@cpaelzer
Contributor Author

cpaelzer commented Oct 10, 2019

So what is this bug about now? It seems this bug report was initially caused by libvirt not following the documented logic for acquiring a delegated cgroup tree. But now it has turned into something else about killing processes?

Sorry if that got lost in the earlier discussion. The bug always was (and still is) that a libvirtd service restart unexpectedly reaps processes since 0219b35.
The earlier discussion identified different behavior of libvirt depending on whether systemd-container is installed. Going forward we should only consider the behavior with it installed (that is the deviation from the initial report).

With systemd-container installed the problem looks as @berrange expected; see this comment, and the following one for the systemd-cgls output and the debug-enabled journal.

Are those processes you don't want killed being killed on start or on stop of the service in question?

No, the service start/stop directly only starts/stops the libvirtd daemon itself. Later on, a user might take actions to start a guest, which will make libvirt:

  • add a control process which is part of the libvirtd.service cgroup (in the systemd-cgls output above that is libvirt_lxc)
  • spawn an extra machine scope which contains the process running inside the container context (in the output above that is machine-lxc\x2d1524\x2dsl.scope and the actual /bin/bash process in it)

I think that follows point #4 in the linked design doc. I'm not sure, without checking the code and/or the live system, which of the three scenarios it follows exactly, but maybe @berrange can advise here.

The general understanding has always been that KillMode controls what to do on stop and that it allows you to leave lingering processes on stop. However, not that those would also be left running during start, because that would mean services wouldn't start in a pristine execution environment anymore...

That is interesting; as restart obviously is stop+start, we might run into this then. The expectation (I'd think) would be that libvirtd, being the process actually started directly by the service, is stopped and started, and the rest in the service is left alive. That seems to work; we see e.g. the dnsmasq processes stay. But the elements in the machine scope also get a SIGCHLD when the service is restarted.

In the already attached logs you see two phases. Initially, systemd detects leftover processes that were part of the .service cgroup:

Sep 25 10:52:23 eoan systemd[1]: libvirtd.service: Found left-over process 976 (dnsmasq) in control group while starting unit. Ignoring.
Sep 25 10:52:23 eoan systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 25 10:52:23 eoan systemd[1]: libvirtd.service: Found left-over process 977 (dnsmasq) in control group while starting unit. Ignoring.
Sep 25 10:52:23 eoan systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 25 10:52:23 eoan systemd[1]: libvirtd.service: Found left-over process 1524 (libvirt_lxc) in control group while starting unit. Ignoring.
Sep 25 10:52:23 eoan systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

The control process 1524 is part of it and stays alive at this point (correct).

But later on we see the child process in the machine scope being removed (that is the new behavior since 0219b35).

Sep 25 10:52:24 eoan systemd[1]: Received SIGCHLD from PID 1526 (bash).
Sep 25 10:52:24 eoan systemd[1]: Child 1526 (bash) died (code=exited, status=0/SUCCESS)
Sep 25 10:52:24 eoan systemd[1]: machine.slice: Failed to read oom_kill field of memory.events cgroup attribute: No such file or directory
Sep 25 10:52:24 eoan systemd[1]: machine.slice: Child 1526 belongs to machine.slice.
Sep 25 10:52:24 eoan systemd[1]: Received SIGCHLD.
Sep 25 10:52:24 eoan systemd[1]: Child 1524 (libvirt_lxc) died (code=killed, status=15/TERM)

I'm not sure if it is related, but in between those two sections I see:

Sep 25 10:52:24 eoan systemd[1]: machine-lxc\x2d1524\x2dsl.scope: cgroup is empty
Sep 25 10:52:24 eoan systemd[1]: machine-lxc\x2d1524\x2dsl.scope: Succeeded.
Sep 25 10:52:24 eoan systemd[1]: machine-lxc\x2d1524\x2dsl.scope changed running -> dead
Sep 25 10:52:24 eoan systemd[1]: Failed to trim compat systemd cgroup /machine.slice/machine-lxc\x2d1524\x2dsl.scope: Device or resource busy

This is the scope that contained the /bin/bash process, and I'm not sure why it is considered empty; I saw nothing removing the process from it. With the new code, systemd v243 ignores that the resource is busy and continues the cleanup, and that seems to me to be what eventually really kills PID 1526 (bash).

A user of libvirt-lxc would not want that to happen, so the question is: what would libvirt want/need to do differently so that this process in the machine scope is not killed on restart of the service?
