-
Notifications
You must be signed in to change notification settings - Fork 886
rkt run: acquire lock on container directory and record pid #244
Conversation
@vcaputo in the future, you can format code snippets on GitHub by doing
but with 3 ticks instead of 2. |
markdown in the commit message? |
I don't know if it'll render in the commit message, but in the PR body it would help. |
Well I had noticed the formatting from the commit message was lost in giithub, but didn't add markdown since I wasn't sure if those changes would be retained in the merged PR commit message if it were merged. |
va_list ap; | ||
long ret; | ||
|
||
if(number != __NR_clone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why return here? won't this stop other syscalls? (or is that the goal?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should probably be an assert actually. This is not intended to be a general syscall() wrapper, it's specific to the clone syscall number, and in the case of systemd-nspawn that's perfectly ok because it's the only way syscall() is used in nspawn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok listen I,do remember going to this site but I really don't have any
ideal what it is or what it's about or what y'all keep texting me or what
it means so please stop thinking me... and again I'm sorry I....
On Dec 8, 2014 11:49 PM, "Vito Caputo" notifications@github.com wrote:
In stage1/mkrootfs.sh:
+int close(int fd)
+{
- if(lock_fd != -1 && fd == lock_fd)
return fcntl(fd, F_SETFD, FD_CLOEXEC);
- return libc_close(fd);
+}
+long syscall(long number, ...)
+{
- unsigned long clone_flags;
- va_list ap;
- long ret;
- if(number != __NR_clone)
It should probably be an assert actually. This is not intended to be a
general syscall() wrapper, it's specific to the clone syscall number, and
in the case of systemd-nspawn that's perfectly ok.—
Reply to this email directly or view it on GitHub
https://github.com/coreos/rocket/pull/244/files#r21507003.
looks awesome and/or scary! |
Nit, could you separate the lock + the pid recording into separate commits? Seems like it might make it a bit easier to trace back should the need arise one day. |
This scares me too much. I was OK relying on the LD_PRELOAD hack until we can fix-up nspawn but this is going a bit too far. What about if we send a "life cycle" FD to systemd-nspawn that is socket activated to some unix domain inside the stage1 instead? |
@philips that may be workable as well, though I don't think this is really all that scary, and what I really wanted to solidify with this was things like: is there a pid flie? where do we put it? do we need to communicate the lock fdnum to stage1? through an environment variable? It's not immediately clear to me how the socket-activated unit would discover PID1's "outside" PID, whereas it's trivially done @ clone() in the systemd-nspawn parent context. We need to write that down somewhere for our nsenter-like operations. Ultimately I see this all vanishing in the not too distant future with a systemd-nspawn replacement, where all these fakesdboot.so intercepts become the replacement's normal implementation. |
stage0 acquires an exclusive advisory lock on the container directory, leaving the fd open when executing stage1/init with the fd value stored in the environment variable RKT_LOCK_FD. stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by replacing the close with a O_CLOEXEC fcntl, retaining the lock handle until nspawn exits. This facilitates simple and reliable container discovery by external processes via flock() attempts on container directories. Consult the flock(2) and flock(1) man pages for information on advisory file locks. With this change some basic lifecycle features are trivial using bash: Identifying which containers are active: shopt -s nullglob for c in /var/lib/rkt/containers/*; do flock --exclusive --nonblock "$c" echo "$c inactive" || echo "$c active" done GC of inactive containers: shopt -s nullglob mkdir -p /var/lib/rkt/gc for c in /var/lib/rkt/containers/*; do [ -e "$c/stage1" ] || continue flock --exclusive --nonblock "$c" mv "$c" /var/lib/rkt/gc done Assuming a periodic task clears out aged contents from /var/lib/rkt/gc
Intercept syscall() used by systemd-nspawn for clone, writing the newly-namespaced child's pid into /var/lib/rkt/containers/$cuuid from the parent before returning to systemd-nspawn. This change combined with locking enables simple status queries of running containers, for example in bash: shopt -s nullglob for c in /var/lib/rkt/containers/*; do [ -e "$c/pid" ] || continue flock --exclusive --nonblock "$c" /bin/true || { echo $c cpid=$(cat "$c/pid") nsenter --mount --uts --ipc --pid --root --wd --target "${cpid}" systemctl status } done
Merging this for now, we'll revisit this, but in the interim we can get all the lifecycle details sorted and built around advisory locks and the $cuuid/pid file. |
rkt run: acquire lock on container directory and record pid
stage0 acquires an exclusive advisory lock on the container directory,
leaving the fd open when executing stage1/init with the fd value stored in
the environment variable RKT_LOCK_FD.
stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by
replacing the close with a O_CLOEXEC fcntl, retaining the lock handle until
nspawn exits.
fakesdboot.so has also been modified to intercept the clone syscall from
nspawn, recording the pid of the container's PID 1 in
"/var/lib/rkt/containers/$cuuid/pid"
With these changes one can do things like:
query container status:
gc dead containers: