This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

"Bad system call" in rkt, works in docker #3820

Closed
pkmiec opened this issue Oct 8, 2017 · 11 comments

pkmiec commented Oct 8, 2017

Environment

rkt Version: 1.29.0
appc Version: 0.8.11
Go Version: go1.8.3
Go OS/Arch: linux/amd64
Features: -TPM +SDJOURNAL
--
Linux 3.10.0-514.21.1.el7.x86_64 x86_64
--
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
--
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

What did you do?
Run a container and execute su:

%> sudo rkt --insecure-options=image run docker://appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b --interactive --exec bash
[root@rkt-1d413c53-5f25-4816-9c97-2bf199c54423 code]# su - afc -c "echo hi"
Bad system call

What did you expect to see?
The su works with docker out of the box, so I expected it to also work with rkt.

%> sudo docker run --rm -it appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b
[root@cf173ad0ef51 code]# su - afc -c "echo hi"
hi

What did you see instead?
I'm looking at switching from docker to rkt. su works in docker, but seems to be blocked by either capabilities or seccomp in rkt. I'm not too familiar with these, and my googling was not helping. How can I figure out why it works with docker and fails with rkt?

Here is the output of the rkt run with --debug,

[pkmiec@combo1.jke.dalin.appfolio.net ~]$ sudo rkt --debug \
> --insecure-options=image run \
> docker://appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b \
> --interactive \
> --exec bash
image: using image from local store for image name coreos.com/rkt/stage1-coreos:1.29.0
image: using image from local store for url docker://appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b
run: disabling overlay support: "unsupported filesystem: missing d_type support"
stage0: Preparing stage1
stage0: Writing image manifest
stage0: Loading image sha512-650c6fe1d53be410f5bd0450a6ab9311b9dea62314645a7992fb490bcc256af3
stage0: Writing image manifest
stage0: Writing pod manifest
stage0: Setting up stage1
stage0: Wrote filesystem to /var/lib/rkt/pods/run/1d413c53-5f25-4816-9c97-2bf199c54423
stage0: Pivoting to filesystem /var/lib/rkt/pods/run/1d413c53-5f25-4816-9c97-2bf199c54423
stage0: Execing [/var/lib/rkt/pods/run/1d413c53-5f25-4816-9c97-2bf199c54423/stage1/rootfs/init --debug --net=default --interactive --local-config=/etc/rkt 1d413c53-5f25-4816-9c97-2bf199c54423]
networking: loading networks from stage1/rootfs/etc/rkt/net.d
networking: loading networks from /etc/rkt/net.d
networking: loading network default with type ptp
stage1: canMachinedRegister true
stage1: args ["stage1/rootfs/usr/lib/ld-linux-x86-64.so.2" "stage1/rootfs/usr/bin/systemd-nspawn" "--boot" "--notify-ready=yes" "--register=true" "--link-journal=try-guest" "--uuid=1d413c53-5f25-4816-9c97-2bf199c54423" "--machine=rkt-1d413c53-5f25-4816-9c97-2bf199c54423" "--directory=stage1/rootfs" "--capability=CAP_AUDIT_WRITE,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FSETID,CAP_FOWNER,CAP_KILL,CAP_MKNOD,CAP_NET_RAW,CAP_NET_BIND_SERVICE,CAP_SETUID,CAP_SETGID,CAP_SETPCAP,CAP_SETFCAP,CAP_SYS_CHROOT" "--" "--default-standard-output=tty"]
stage1: env ["HOSTNAME=combo1.jke.dalin.appfolio.net" "TERM=xterm-256color" "LS_COLORS=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:*.tar=38;5;9:*.tgz=38;5;9:*.arc=38;5;9:*.arj=38;5;9:*.taz=38;5;9:*.lha=38;5;9:*.lz4=38;5;9:*.lzh=38;5;9:*.lzma=38;5;9:*.tlz=38;5;9:*.txz=38;5;9:*.tzo=38;5;9:*.t7z=38;5;9:*.zip=38;5;9:*.z=38;5;9:*.Z=38;5;9:*.dz=38;5;9:*.gz=38;5;9:*.lrz=38;5;9:*.lz=38;5;9:*.lzo=38;5;9:*.xz=38;5;9:*.bz2=38;5;9:*.bz=38;5;9:*.tbz=38;5;9:*.tbz2=38;5;9:*.tz=38;5;9:*.deb=38;5;9:*.rpm=38;5;9:*.jar=38;5;9:*.war=38;5;9:*.ear=38;5;9:*.sar=38;5;9:*.rar=38;5;9:*.alz=38;5;9:*.ace=38;5;9:*.zoo=38;5;9:*.cpio=38;5;9:*.7z=38;5;9:*.rz=38;5;9:*.cab=38;5;9:*.jpg=38;5;13:*.jpeg=38;5;13:*.gif=38;5;13:*.bmp=38;5;13:*.pbm=38;5;13:*.pgm=38;5;13:*.ppm=38;5;13:*.tga=38;5;13:*.xbm=38;5;13:*.xpm=38;5;13:*.tif=38;5;13:*.tiff=38;5;13:*.png=38;5;13:*.svg=38;5;13:*.svgz=38;5;13:*.mng=38;5;13:*.pcx=38;5;13:*.mov=38;5;13:*.mpg=38;5;13:*.mpeg=38;5;13:*.m2v=38;5;13:*.mkv=38;5;13:*.webm=38;5;13:*.ogm=38;5;13:*.mp4=38;5;13:*.m4v=38;5;13:*.mp4v=38;5;13:*.vob=38;5;13:*.qt=38;5;13:*.nuv=38;5;13:*.wmv=38;5;13:*.asf=38;5;13:*.rm=38;5;13:*.rmvb=38;5;13:*.flc=38;5;13:*.avi=38;5;13:*.fli=38;5;13:*.flv=38;5;13:*.gl=38;5;13:*.dl=38;5;13:*.xcf=38;5;13:*.xwd=38;5;13:*.yuv=38;5;13:*.cgm=38;5;13:*.emf=38;5;13:*.axv=38;5;13:*.anx=38;5;13:*.ogv=38;5;13:*.ogx=38;5;13:*.aac=38;5;45:*.au=38;5;45:*.flac=38;5;45:*.mid=38;5;45:*.midi=38;5;45:*.mka=38;5;45:*.mp3=38;5;45:*.mpc=38;5;45:*.ogg=38;5;45:*.ra=38;5;45:*.wav=38;5;45:*.axa=38;5;45:*.oga=38;5;45:*.spx=38;5;45:*.xspf=38;5;45:" "SSH_AUTH_SOCK=/tmp/ssh-HPWD4wCFhI/agent.3583" "PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/dell/srvadmin/bin:/opt/puppetlabs/bin" "LANG=en_US.UTF-8" "SHELL=/bin/bash" "MAIL=/var/mail/root" "LOGNAME=root" "USER=root" "USERNAME=root" "HOME=/root" "SUDO_COMMAND=/bin/rkt --debug --insecure-options=image run docker://appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b --interactive --exec bash" "SUDO_USER=pkmiec" "SUDO_UID=1009" "SUDO_GID=100" "RKT_LOCK_FD=5" "RKT_SELINUX_CONTEXT=" "RKT_SELINUX_MOUNT_CONTEXT=" "LD_LIBRARY_PATH=stage1/rootfs/usr/lib:stage1/rootfs/usr/lib/systemd" "SYSTEMD_NSPAWN_CONTAINER_SERVICE=rkt" "SYSTEMD_NSPAWN_USE_CGNS=no"]
stage1: mounting source "" target "/" fstype "none" flags MS_REC|MS_SLAVE data ""
stage1: mounting source "" target "/" fstype "none" flags MS_REC|MS_SHARED data ""
stage1: unifiedCgroup false
stage1: subcgroup "machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope"
stage1: enabledCgroups map['\x06':["cpu" "cpuacct"] '\v':["freezer"] '\x05':["blkio"] '\a':["hugetlb"] '\n':["pids"] '\b':["cpuset"] '\x02':["devices"] '\t':["net_cls" "net_prio"] '\x04':["perf_event"] '\x03':["memory"]]
stage1: serviceNames ["aggregator.service"]
stage1: mounting source "sysfs" target "stage1/rootfs/sys" fstype "sysfs" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data ""
stage1: mounting source "tmpfs" target "stage1/rootfs/sys/fs/cgroup" fstype "tmpfs" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_STRICTATIME data "mode=755"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/hugetlb" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "hugetlb"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/pids" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "pids"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/cpuset" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "cpuset"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/cpu,cpuacct" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "cpu,cpuacct"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/freezer" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "freezer"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/blkio" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "blkio"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/memory" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "memory"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/devices" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "devices"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/net_cls,net_prio" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "net_cls,net_prio"
stage1: mounting source "cgroup" target "stage1/rootfs/sys/fs/cgroup/perf_event" fstype "cgroup" flags MS_NODEV|MS_NOEXEC|MS_NOSUID data "perf_event"
stage1: mounting source "stage1/rootfs/sys/fs/cgroup" target "stage1/rootfs/sys/fs/cgroup" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_STRICTATIME|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/memory/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/memory/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/memory" target "stage1/rootfs/sys/fs/cgroup/memory" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/devices/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/devices/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/devices" target "stage1/rootfs/sys/fs/cgroup/devices" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/net_cls,net_prio" target "stage1/rootfs/sys/fs/cgroup/net_cls,net_prio" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/perf_event/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/perf_event/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/perf_event" target "stage1/rootfs/sys/fs/cgroup/perf_event" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/hugetlb/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/hugetlb/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/hugetlb" target "stage1/rootfs/sys/fs/cgroup/hugetlb" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/pids/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/pids/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/pids" target "stage1/rootfs/sys/fs/cgroup/pids" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/cpuset/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/cpuset/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/cpuset" target "stage1/rootfs/sys/fs/cgroup/cpuset" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/cpu,cpuacct" target "stage1/rootfs/sys/fs/cgroup/cpu,cpuacct" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/freezer/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/freezer/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/freezer" target "stage1/rootfs/sys/fs/cgroup/freezer" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/blkio/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" target "stage1/rootfs/sys/fs/cgroup/blkio/machine.slice/machine-rkt\\x2d1d413c53\\x2d5f25\\x2d4816\\x2d9c97\\x2d2bf199c54423.scope/system.slice" fstype "" flags MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys/fs/cgroup/blkio" target "stage1/rootfs/sys/fs/cgroup/blkio" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
stage1: mounting source "stage1/rootfs/sys" target "stage1/rootfs/sys" fstype "" flags MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY|MS_REMOUNT|MS_BIND data ""
Spawning container rkt-1d413c53-5f25-4816-9c97-2bf199c54423 on /var/lib/rkt/pods/run/1d413c53-5f25-4816-9c97-2bf199c54423/stage1/rootfs.
Press ^] three times within 1s to kill container.
systemd 233 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN default-hierarchy=legacy)
Detected virtualization rkt.
Detected architecture x86-64.

Welcome to Linux!

Set hostname to <rkt-1d413c53-5f25-4816-9c97-2bf199c54423>.
[  OK  ] Listening on Journal Socket.
[  OK  ] Listening on Journal Socket (/dev/log).
[  OK  ] Created slice system.slice.
         Starting Journal Service...
         Starting Create /etc/passwd and /etc/group...
[  OK  ] Created slice system-prepare\x2dapp.slice.
[  OK  ] Started Pod shutdown.
[  OK  ] Started aggregator Reaper.
[  OK  ] Started Create /etc/passwd and /etc/group.
[  OK  ] Started Journal Service.
         Starting Flush Journal to Persistent Storage...
         Starting Prepare minimum environment for chrooted applications...
[  OK  ] Started Prepare minimum environment for chrooted applications.
[  OK  ] Started Application=aggregator Image=registry-1.docker.io/appfolio/aggregator.
[  OK  ] Reached target rkt apps target.
         Starting rkt supervisor-ready signaling...
[  OK  ] Started rkt supervisor-ready signaling.
[root@rkt-1d413c53-5f25-4816-9c97-2bf199c54423 code]# [  OK  ] Started Flush Journal to Persistent Storage.

[root@rkt-1d413c53-5f25-4816-9c97-2bf199c54423 code]# su - afc -c "echo hi"
Bad system call

lucab (Member) commented Oct 9, 2017

As the docker image doesn't seem to be publicly accessible, you'll need to strace the su invocation to determine where it is failing. Can you try to reproduce this on some common base image (debian/busybox/alpine)?
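
For reference, a possible way to capture such a trace from the host, mirroring the strace invocation used further down in this thread (the output file and the placeholder PID are illustrative):

%> sudo strace -f -o /tmp/su.trace -p <pid of the shell inside the pod>

then run su - afc -c "echo hi" in the container shell and inspect /tmp/su.trace.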

pkmiec (Author) commented Oct 9, 2017

Good idea on the strace.

Here is the rkt trace,

...
[pid  7734] read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 909
[pid  7734] close(3)                    = 0
[pid  7734] munmap(0x7fd4c51a6000, 4096) = 0
[pid  7734] getuid()                    = 0
[pid  7734] getgid()                    = 0
[pid  7734] setregid(31337, 4294967295) = 0
[pid  7734] setreuid(31337, 4294967295) = 0
[pid  7734] +++ killed by SIGSYS +++
<... wait4 resumed> [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSYS}], WSTOPPED|WCONTINUED, NULL) = 26
rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
ioctl(255, SNDRV_TIMER_IOCTL_SELECT or TIOCSPGRP, [6]) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
ioctl(255, SNDCTL_TMR_STOP or SNDRV_TIMER_IOCTL_GINFO or TCSETSW, {B38400 opost isig icanon echo ...}) = 0
ioctl(255, TIOCGWINSZ, {ws_row=41, ws_col=141, ws_xpixel=1410, ws_ypixel=820}) = 0
write(2, "Bad system call\n", 16)       = 16
...

vs the docker trace,

...
[pid 27410] read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 909
[pid 27410] close(3)                    = 0
[pid 27410] munmap(0x7ff754844000, 4096) = 0
[pid 27410] getuid()                    = 0
[pid 27410] getgid()                    = 0
[pid 27410] setregid(31337, 4294967295) = 0
[pid 27410] setreuid(31337, 4294967295) = 0
[pid 27410] setreuid(0, 4294967295)     = 0
[pid 27410] setregid(0, 4294967295)     = 0
[pid 27410] open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
[pid 27410] fstat(3, {st_mode=S_IFREG|0644, st_size=909, ...}) = 0
[pid 27410] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff754844000
[pid 27410] read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 909
[pid 27410] close(3)                    = 0
[pid 27410] munmap(0x7ff754844000, 4096) = 0
...

So it seems the process under rkt receives SIGSYS when trying to execute setreuid(0, 4294967295). My understanding is that setreuid is covered by CAP_SETUID, and CAP_SETUID does appear in the stage1 list of capabilities.

So I'm thinking that either that capability is not inherited by bash or the syscall is blocked by seccomp?

I'm googling to see whether it is possible to list capabilities / seccomp for a process.
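
A minimal sketch of what that looks like from the host, with <pid> standing in for the shell's PID inside the pod (capsh ships with libcap's userspace tools):

%> grep -E 'Cap|Seccomp' /proc/<pid>/status
%> capsh --decode=<CapEff value from the status output>

In /proc/<pid>/status, Seccomp: 2 means a seccomp filter is active for the process.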

squeed (Contributor) commented Oct 9, 2017

One thing you can try is executing rkt with setreuid enabled:

rkt run --seccomp mode=retain,@docker/default-whitelist,setreuid ...
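
Applied to the original invocation, that would look something like this (a sketch, assuming --seccomp is accepted as a per-app flag after the image, like the other per-app flags here):

%> sudo rkt --insecure-options=image run \
     docker://appfolio/aggregator:master_fe5776e727fb360fe1c4bd543637ea7bd479b37b \
     --seccomp mode=retain,@docker/default-whitelist,setreuid \
     --interactive --exec bash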

pkmiec (Author) commented Oct 9, 2017

Here are the capabilities for the bash running inside rkt,

# /proc/<pid>/status
CapInh:	0000000000000000
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
Seccomp:	2

and for bash inside docker,

CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
Seccomp:	2

where 00000000a80425fb seems to be the correct set of capabilities:

%> capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

The main difference between rkt and docker here is CapInh.

@squeed trying the --seccomp now.

pkmiec (Author) commented Oct 9, 2017

I looked at

DockerDefaultSeccompBlacklist = []string{
and, with trial and error, found that this works: --seccomp mode=retain,@docker/default-whitelist,keyctl

keyctl?! I don't see keyctl anywhere in the output of sudo strace -f -p <pid of bash>. I also don't see keyctl anywhere in the moby project (e.g. https://github.com/moby/moby/blob/master/profiles/seccomp/seccomp_default.go).

Any ideas what could be happening?
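
One way to confirm which syscall the filter is killing, assuming kernel audit logging is available: a seccomp RET_KILL emits a type=1326 audit record (in dmesg, or in /var/log/audit/audit.log when auditd is running). The record below is only illustrative:

%> dmesg | grep 'type=1326'
audit: type=1326 audit(...): ... comm="su" exe="/usr/bin/su" sig=31 arch=c000003e syscall=250 compat=0 ...
%> ausyscall 250
keyctl

ausyscall comes with the audit userspace tools; 250 happens to be keyctl on x86_64.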

euank (Member) commented Oct 10, 2017

The difference here can minimally be reproduced with:

$ rkt run --interactive --insecure-options=image docker://alpine --exec=sh -- -c 'apk add --update keyutils && keyctl list @u'                                     
Bad system call

$ docker run alpine sh -c 'apk add --update keyutils && keyctl list @u'
keyctl_read_alloc: Operation not permitted

The list of syscalls that nspawn filters by default looks related here.

(Note that in systemd v235, which isn't used by rkt yet, the structure of that code has changed somewhat)

I gained a little further evidence for my suspicion that the previous comment is related by trying another element on the list, swapoff:

$ docker run -it alpine swapoff -a
swapoff: /dev/xxx: Operation not permitted

$ sudo rkt run --insecure-options=image docker://alpine --exec=sh -- -c 'swapoff -a'  
[86905.172178] alpine[5]: Bad system call

Another difference, which probably doesn't matter here, is that systemd nowadays explicitly creates a new keyring for systemd services, so the application inside your pod probably got its own keyring, while this wouldn't be the case under docker.

I'm not sure rkt should try to override that behaviour by default, since systemd has a reason for not letting it through... though modelling it more accurately in the @docker/default-whitelist could be a usability improvement worth looking into.

euank (Member) commented Oct 10, 2017

Actually, after thinking about this a moment longer, I realized the actual difference. The whole nspawn thing seems to be a total red herring.

Docker defaults its seccomp filter to errno=EPERM, rather than RET_KILL.

If you switch rkt to EPERM via something like --seccomp=mode=retain,@rkt/default-whitelist,errno=EPERM, you'll observe behaviour closer to what docker is doing.

I'd bet the application in question handles that just fine since, well, docker is already giving an EPERM for the keyctl-related calls. The application is resilient to EPERM, but not to getting killed, I guess.
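
For instance, re-running the alpine repro from above with that override should produce the same EPERM behaviour docker shows (a sketch, assuming --seccomp is accepted as a per-app flag in this position):

$ sudo rkt run --interactive --insecure-options=image docker://alpine \
    --seccomp=mode=retain,@rkt/default-whitelist,errno=EPERM \
    --exec=sh -- -c 'apk add --update keyutils && keyctl list @u'
keyctl_read_alloc: Operation not permitted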

I'm filing a follow-up about the usability vs. security considerations here.

pkmiec (Author) commented Oct 10, 2017

@euank Thanks for looking at it. I can confirm that --seccomp=mode=retain,@rkt/default-whitelist,errno=EPERM worked for me. The su is now able to execute and prints out hi.

pkmiec (Author) commented Oct 10, 2017

I'm gonna close this issue in favor of #3823

pkmiec closed this as completed Oct 10, 2017

lucab (Member) commented Oct 11, 2017

I'm still confused though: who is calling keyctl()? Is it su itself, glibc, or something else? @pkmiec which version/base image is that from?

pkmiec (Author) commented Oct 11, 2017

@lucab

The appfolio/aggregator image is derived from the appfolio/ruby_base image, which in turn is derived from centos:7.3.1611.

But I'm not sure who is calling keyctl() either.
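
One way to narrow it down would be to trace just that syscall with stack traces, e.g. inside the docker container, where the call returns EPERM instead of killing the process (a sketch; -k needs an strace built with stack-unwinding support, and assumes strace is present in the container):

[root@<container> code]# strace -f -k -e trace=keyctl su - afc -c "echo hi"

The backtrace printed for the keyctl call should point at the library or PAM module issuing it.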
