Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make even more nspawn concepts configurable #9024

Merged
merged 7 commits into from
May 24, 2018

Conversation

poettering
Copy link
Member

This is based on #8940 and includes it. Please review/merge that one first!

@poettering
Copy link
Member Author

I have now force pushed a new, rebased version, that is hence much much shorter, given that #8940 has been merged in the meantime. PTAL!

connectible its static <filename>resolv.conf</filename> file is used, and if not the host's
<filename>/etc/resolv.conf</filename> file is used. In the latter cases the file is copied if the image is
writable, and bind mounted otherwise. Note that if the file is copied the container payload can modify the file
during its runtime. If it is bind mounted that's not possible. Note that both if the file is bind mounted and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may unmount the file, and modify it. So effectively in both cases it can be modified. So maybe it's better to skip those two sentences altogether.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, it only can do that if the container has the perms to. Let me rephrase this though.

I really want to get the message across that if people want the container to be able to deviate from the host DNS defaults they should use "copy", and otherwise "bind", as in the letter case the thing is read-only mounted, and hence very likely most scripts trying to modify it will fails, except if they go out of their way and unmount it first.

[RESOLV_CONF_BIND_STATIC] = "bind-static",
[RESOLV_CONF_DELETE] = "delete",
[RESOLV_CONF_OFF] = "off",
[RESOLV_CONF_AUTO] = "auto",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this in the definition order.

@@ -286,6 +287,7 @@ static void help(void) {
" --link-journal=MODE Link up guest journal, one of no, auto, guest, \n"
" host, try-guest, try-host\n"
" -j Equivalent to --link-journal=try-guest\n"
" --resolv-conf=MODE Select mode of /etc/resolv.conf synchronization\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synchronization → initialization to underline the fact that there is no synchronization actually?

assert(path);

if (access(path, F_OK) < 0) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline not needed.


assert(dest);

if (arg_private_network)
if (arg_resolv_conf == RESOLV_CONF_AUTO) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

log_warning_errno(found, "Failed to resolve /etc/resolv.conf path in container, ignoring: %m");

if (m == RESOLV_CONF_DELETE) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

if (r <= 0)
return r;
if (r < 0)
return log_debug_errno(r, "Failed to check whether resolved bus name is taken: %m");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nicer to just say "Failed to check if org.freedesktop.resolve1 name is taken: %m". This requires less context to understand the log line.

synchronization from host to container) shall be handled. Takes one of <literal>off</literal>,
<literal>copy</literal>, <literal>bind</literal>, <literal>symlink</literal>, <literal>delete</literal> or
<literal>auto</literal>. If set to <literal>off</literal> the <filename>/etc/localtime</filename> file in the
container is left as it is included in the image, and neither modified nor bind mounted over. If set to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is left as it is included in the image, and neither modified nor bind mounted over" → "is left untouched". Shorter but easier to read, and then later on overwriting and bind-mounting is talked about anyway, so no need to repeat this here.

if (r >= 0)
return mount_verbose(LOG_ERR, NULL, resolved, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY|MS_NOSUID|MS_NODEV, NULL);
}

/* If that didn't work, let's copy the file */
r = copy_file("/etc/resolv.conf", where, O_TRUNC|O_NOFOLLOW, 0644, 0, COPY_REFLINK);
r = copy_file(what, where, O_TRUNC|O_NOFOLLOW, 0644, 0, COPY_REFLINK);
if (r < 0) {
/* If the file already exists as symlink, let's suppress the warning, under the assumption that
* resolved or something similar runs inside and the symlink points there.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the log level should be changed to something higher then LOG_DEBUG here? It seems strange to request e.g. --resolv-conf=copy-static and get no message whatsoever if the operation fails because somebody did chattr +i /etc/resolv.conf on the container.

@keszybz
Copy link
Member

keszybz commented May 19, 2018

I tried to test this, and I have the strangest effect. I'd assume that this is some fluke or strange tty bug, but it's entirely repeatable:

$ sudo SYSTEMD_LOG_LEVEL=debug build/systemd-nspawn --link-journal=try-guest -U --resolv-conf=auto -M rawhidex --bind /home -b
Found cgroup2 on /sys/fs/cgroup/unified, unified hierarchy for systemd controller
Setting RLIMIT_CPU to infinity.
...
Spawning container rawhidex on /var/lib/machines/rawhidex.
Press ^] three times within 1s to kill container.
(sd-namespace) succeeded.
Init process invoked as PID 32105
...
Failed to chown "/sys/fs/cgroup/systemd/machine.slice/machine-rawhidex.scope/payload/cgroup.threads", ignoring: No such file or directory
(hangs)
$  pgrep systemd-nspawn|xargs ps uw|cat
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     32103  0.0  0.0  90636  6028 pts/1    S+   14:54   0:00 build/systemd-nspawn --link-journal=try-guest -U --resolv-conf=auto -M rawhidex --bind /home -b
2256404+ 32105  0.0  0.0  69004   884 pts/1    S+   14:54   0:00 build/systemd-nspawn --link-journal=try-guest -U --resolv-conf=auto -M rawhidex --bind /home -b
$  sudo pstack 32105
#0  0x00007f20b0e6f610 in cfsetospeed () from target:/lib64/libc.so.6
#1  0x00007f20b189d0a5 in write_to_console (level=31, error=0, file=0x7f20b19a68a1 "../src/basic/mount-util.c", line=837, func=0x7f20b19a7010 <__func__.10730> "mount_verbose", buffer=0x7ffc225f03a0 "Mounting cgroup2 on /sys/fs/cgroup/unified (MS_NOSUID|MS_NODEV|MS_NOEXEC \"\")...") at ../src/basic/log.c:369
#2  0x00007f20b189df80 in log_dispatch_internal (level=31, error=0, file=0x7f20b19a68a1 "../src/basic/mount-util.c", line=837, func=0x7f20b19a7010 <__func__.10730> "mount_verbose", object_field=0x0, object=0x0, extra_field=0x0, extra=0x0, buffer=0x7ffc225f03a0 "Mounting cgroup2 on /sys/fs/cgroup/unified (MS_NOSUID|MS_NODEV|MS_NOEXEC \"\")...") at ../src/basic/log.c:637
#3  0x00007f20b189e1a2 in log_internalv_realm (level=7, error=0, file=0x7f20b19a68a1 "../src/basic/mount-util.c", line=837, func=0x7f20b19a7010 <__func__.10730> "mount_verbose", format=0x7f20b19a6e50 "Mounting %s on %s (%s \"%s\")...", ap=0x7ffc225f0c20) at ../src/basic/log.c:694
#4  0x00007f20b189e2a2 in log_internal_realm (level=7, error=0, file=0x7f20b19a68a1 "../src/basic/mount-util.c", line=837, func=0x7f20b19a7010 <__func__.10730> "mount_verbose", format=0x7f20b19a6e50 "Mounting %s on %s (%s \"%s\")...") at ../src/basic/log.c:709
#5  0x00007f20b18a46c4 in mount_verbose (error_log_level=3, what=0x557b591850c9 "cgroup", where=0x7ffc225f0df0 "/sys/fs/cgroup/unified", type=0x557b591850c1 "cgroup2", flags=14, options=0x0) at ../src/basic/mount-util.c:836
#6  0x0000557b5915c489 in mount_legacy_cgroup_hierarchy (dest=0x557b59184be0 "", controller=0x557b5918506e "name=unified", hierarchy=0x557b59185214 "unified", read_only=false) at ../src/nspawn/nspawn-mount.c:952
#7  0x0000557b5915cb40 in mount_legacy_cgns_supported (dest=0x557b5918bd07 "", unified_requested=CGROUP_UNIFIED_SYSTEMD, userns=true, uid_shift=225640448, uid_range=65536, selinux_apifs_context=0x0) at ../src/nspawn/nspawn-mount.c:1058
#8  0x0000557b5915db6c in mount_cgroups (dest=0x557b5918bd07 "", unified_requested=CGROUP_UNIFIED_SYSTEMD, userns=true, uid_shift=225640448, uid_range=65536, selinux_apifs_context=0x0, use_cgns=true) at ../src/nspawn/nspawn-mount.c:1219
#9  0x0000557b59177be5 in inner_child (barrier=0x7ffc225f1650, directory=0x557b5b2f23d0 "/var/lib/machines/rawhidex", secondary=false, kmsg_socket=12, rtnl_socket=14, fds=0x0) at ../src/nspawn/nspawn.c:2583
#10 0x0000557b5917a537 in outer_child (barrier=0x7ffc225f1650, directory=0x557b5b2f23d0 "/var/lib/machines/rawhidex", console=0x557b5b2f2400 "/dev/pts/3", dissected_image=0x0, interactive=true, secondary=false, pid_socket=-1, uuid_socket=-1, notify_socket=-1, kmsg_socket=12, rtnl_socket=14, uid_shift_socket=-1, unified_cgroup_hierarchy_socket=-1, fds=0x0, netns_fd=-1) at ../src/nspawn/nspawn.c:3090
#11 0x0000557b5917dc03 in run (master=-1, console=0x557b5b2f2400 "/dev/pts/3", dissected_image=0x0, interactive=true, secondary=false, fds=0x0, veth_name=0x7ffc225f19e0 "", veth_created=0x7ffc225f178e, exposed=0x7ffc225f19c0, pid=0x7ffc225f179c, ret=0x7ffc225f1798) at ../src/nspawn/nspawn.c:3728
#12 0x0000557b59181706 in main (argc=9, argv=0x7ffc225f1b08) at ../src/nspawn/nspawn.c:4491
$ sudo pstack 32103
#0  0x00007f20b0e6da40 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007f20b185effc in barrier_read (b=0x7ffc225f1650, comp=0) at ../src/basic/barrier.c:230
#2  0x00007f20b185f4ac in barrier_sync (b=0x7ffc225f1650) at ../src/basic/barrier.c:401
#3  0x0000557b5916be64 in barrier_place_and_sync (b=0x7ffc225f1650) at ../src/basic/barrier.h:79
#4  0x0000557b5917f127 in run (master=5, console=0x557b5b2f2400 "/dev/pts/3", dissected_image=0x0, interactive=true, secondary=false, fds=0x0, veth_name=0x7ffc225f19e0 "", veth_created=0x7ffc225f178e, exposed=0x7ffc225f19c0, pid=0x7ffc225f179c, ret=0x7ffc225f1798) at ../src/nspawn/nspawn.c:4006
#5  0x0000557b59181706 in main (argc=9, argv=0x7ffc225f1b08) at ../src/nspawn/nspawn.c:4491

This seems to happen only with --resolv-conf=auto. bind-host, copy-host, off, bind-static, copy-static all work fine. So it seems the process of detection is actually the source of problems.

@keszybz
Copy link
Member

keszybz commented May 19, 2018

As expected

--- src/nspawn/nspawn.c
+++ src/nspawn/nspawn.c
@@ -1598,6 +1598,8 @@ static int resolved_listening(void) {
 
         /* Check if resolved is listening */
 
+        return true;
+
         r = sd_bus_open_system(&bus);
         if (r < 0)
                 return log_debug_errno(r, "Failed to open system bus: %m");

makes the hang go away. So this seems to be some strange interaction with dbus.

@poettering poettering mentioned this pull request May 22, 2018
@poettering
Copy link
Member Author

So the hang is a deadlock caused by the debug logging. Basically, our child process logs to stderr, but stderr is already connected to a pty, which the parent process is supposed to process and propagate, but it's waiting for some client operation to finish before it will do so.

So this bug is not introduced by this PR, all that happens is that we now generate so much debug output that the pty buffer will be fully filled.

We should really fix that (for example, by keeping the original log fd open in the child all the way until we are past the the last synchronization point against the parent, so that we never log to the pty on our own).

I'll prep a separate PR for this, I think we should treat this PR independent of that, as this is merely the trigger for the deadlock, not the cause.

@poettering
Copy link
Member Author

I force pushed a new version now. Changes are everything @keszybz pointed out (i reworded the thing about copy/mount, please have a look). i also added three minor commits on top.

@poettering
Copy link
Member Author

Force pushed a new rebased version now. I dropped the three extra commits again and put them in #9065 instead, I figure that should make it easier to review.

PTAL!

poettering added a commit to poettering/systemd that referenced this pull request May 22, 2018
…tderr as long as possible

If we log to the pty that is configured as stdin/stdout/stderr of the
container too early we risk filling it up in full before we start
processing the pty from the parent process, resulting in deadlocks.
Let's hence keep a copy of the original tty we were started on before
setting up stdin/stdout/stderr, so that we can log to it, and keep using
it as long as we can.

Since the kernel's pty internal buffer is pretty small this actually
triggered deadlocks when we debug logged at lot from nspawn's child
processes, see: systemd#9024 (comment)

With this change we won't use the pty at all, only the actual payload we
start will, and hence we won't deadlock on it, ever.
@poettering
Copy link
Member Author

I have prepped a fix for the logging deadlock independently in PR #9068 now. PTAL!

@keszybz
Copy link
Member

keszybz commented May 24, 2018

I can confirm that #9068 fixes the issue.

@keszybz keszybz merged commit 17c1b9a into systemd:master May 24, 2018
Yamakuzure pushed a commit to elogind/elogind that referenced this pull request Aug 23, 2018
…tderr as long as possible

If we log to the pty that is configured as stdin/stdout/stderr of the
container too early we risk filling it up in full before we start
processing the pty from the parent process, resulting in deadlocks.
Let's hence keep a copy of the original tty we were started on before
setting up stdin/stdout/stderr, so that we can log to it, and keep using
it as long as we can.

Since the kernel's pty internal buffer is pretty small this actually
triggered deadlocks when we debug logged at lot from nspawn's child
processes, see: systemd/systemd#9024 (comment)

With this change we won't use the pty at all, only the actual payload we
start will, and hence we won't deadlock on it, ever.
Yamakuzure pushed a commit to elogind/elogind that referenced this pull request Aug 24, 2018
…tderr as long as possible

If we log to the pty that is configured as stdin/stdout/stderr of the
container too early we risk filling it up in full before we start
processing the pty from the parent process, resulting in deadlocks.
Let's hence keep a copy of the original tty we were started on before
setting up stdin/stdout/stderr, so that we can log to it, and keep using
it as long as we can.

Since the kernel's pty internal buffer is pretty small this actually
triggered deadlocks when we debug logged at lot from nspawn's child
processes, see: systemd/systemd#9024 (comment)

With this change we won't use the pty at all, only the actual payload we
start will, and hence we won't deadlock on it, ever.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

None yet

2 participants