rkt run: acquire lock on container directory and record pid #244

vcaputo · 2014-12-08T23:50:48Z

stage0 acquires an exclusive advisory lock on the container directory,
leaving the fd open when executing stage1/init with the fd value stored in
the environment variable RKT_LOCK_FD.
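
The handoff described above can be sketched in plain shell. This is only an illustration under assumptions, not the stage0 code itself: the temporary directory stands in for /var/lib/rkt/containers/$cuuid, fd 9 is an arbitrary choice, and the `sh -c` child stands in for stage1/init.

```shell
#!/bin/sh
# Hypothetical stand-in for a container directory.
cdir=$(mktemp -d)

# Open the directory read-only on fd 9 and take an exclusive, non-blocking
# advisory lock on it (-x -n are the short forms of --exclusive --nonblock).
exec 9<"$cdir"
flock -x -n 9 || { echo "already locked: $cdir" >&2; exit 1; }

# Hand the fd number to the next stage through the environment. The child
# inherits fd 9 because the shell does not mark it close-on-exec.
child_view=$(RKT_LOCK_FD=9 sh -c 'echo "$RKT_LOCK_FD"')

# While fd 9 stays open, an outside observer cannot take the lock.
if flock -x -n "$cdir" true; then
    state=inactive
else
    state=active
fi
echo "RKT_LOCK_FD=$child_view state=$state"

# Clean up: closing the fd releases the lock.
exec 9<&-
rmdir "$cdir"
```

The observer opens the directory itself, so it gets its own open file description and its lock attempt conflicts with the one held on fd 9.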

stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by
replacing the close with an fcntl that sets FD_CLOEXEC, retaining the lock
handle until nspawn exits.
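
Why keeping the fd open in nspawn is enough can be shown with a small shell sketch. It is a simplification under assumptions: `sleep` stands in for nspawn (it never tries to close inherited fds, so no shim is needed here), and the temporary directory stands in for the container directory.

```shell
#!/bin/sh
cdir=$(mktemp -d)   # hypothetical stand-in for the container directory

exec 9<"$cdir"
flock -x -n 9

# Long-lived child standing in for nspawn; it inherits fd 9.
sleep 30 &
child=$!

# Close our copy. The lock belongs to the open file description, which the
# child still references, so the lock stays held.
exec 9<&-
if flock -x -n "$cdir" true; then
    held_while_running=no
else
    held_while_running=yes
fi

# Once the child exits, its fds are closed and the lock is released.
kill "$child"
wait "$child" 2>/dev/null || true
if flock -x -n "$cdir" true; then
    held_after_exit=no
else
    held_after_exit=yes
fi

echo "held while running: $held_while_running, after exit: $held_after_exit"
rmdir "$cdir"
```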

fakesdboot.so has also been modified to intercept the clone syscall from
nspawn, recording the pid of the container's PID 1 in
"/var/lib/rkt/containers/$cuuid/pid"

With these changes one can do things like:

query container status:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/pid" ] || continue
  flock --exclusive --nonblock "$c" /bin/true || {
   cpid=$(cat "$c/pid")
   nsenter --mount --uts --ipc --pid --root --wd --target "${cpid}" systemctl status
  }
 done

gc dead containers:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/stage1" ] || continue
  flock --exclusive --nonblock "$c" rm -Rf "$c"
 done

jzelinskie · 2014-12-09T00:29:50Z

@vcaputo in the future, you can format code snippets on GitHub by doing

``lang
code goes here
``

but with 3 ticks instead of 2.

vcaputo · 2014-12-09T00:36:15Z

markdown in the commit message?

jzelinskie · 2014-12-09T00:55:01Z

I don't know if it'll render in the commit message, but in the PR body it would help.
I guess it's better to have a nice commit message than a vague commit message and a nice PR in case the code ever leaves GitHub.

vcaputo · 2014-12-09T01:08:12Z

Well, I had noticed the formatting from the commit message was lost in GitHub, but didn't add markdown since I wasn't sure whether those changes would be retained in the merged PR commit message.

jonboulle · 2014-12-09T03:07:44Z

stage1/mkrootfs.sh

+	va_list		ap;
+	long		ret;
+
+	if(number != __NR_clone)


why return here? won't this stop other syscalls? (or is that the goal?)

It should probably be an assert actually. This is not intended to be a general syscall() wrapper, it's specific to the clone syscall number, and in the case of systemd-nspawn that's perfectly ok because it's the only way syscall() is used in nspawn.

Ok listen I,do remember going to this site but I really don't have any
ideal what it is or what it's about or what y'all keep texting me or what
it means so please stop thinking me... and again I'm sorry I....
On Dec 8, 2014 11:49 PM, "Vito Caputo" notifications@github.com wrote:

In stage1/mkrootfs.sh:

+int close(int fd)
+{
+	if(lock_fd != -1 && fd == lock_fd)
+		return fcntl(fd, F_SETFD, FD_CLOEXEC);
+
+	return libc_close(fd);
+}
+
+long syscall(long number, ...)
+{
+	unsigned long	clone_flags;
+	va_list		ap;
+	long		ret;
+
+	if(number != __NR_clone)

It should probably be an assert actually. This is not intended to be a
general syscall() wrapper, it's specific to the clone syscall number, and
in the case of systemd-nspawn that's perfectly ok.

—
Reply to this email directly or view it on GitHub
https://github.com/coreos/rocket/pull/244/files#r21507003.

jonboulle · 2014-12-09T03:09:33Z

looks awesome and/or scary!

jonboulle · 2014-12-09T03:10:02Z

Nit, could you separate the lock + the pid recording into separate commits? Seems like it might make it a bit easier to trace back should the need arise one day.

philips · 2014-12-09T03:58:55Z

This scares me too much. I was OK relying on the LD_PRELOAD hack until we can fix-up nspawn but this is going a bit too far. What about if we send a "life cycle" FD to systemd-nspawn that is socket activated to some unix domain inside the stage1 instead?

vcaputo · 2014-12-09T05:03:18Z

@philips that may be workable as well, though I don't think this is really all that scary. What I really wanted to solidify with this were questions like: is there a pid file? Where do we put it? Do we need to communicate the lock fd number to stage1? Through an environment variable?

It's not immediately clear to me how the socket-activated unit would discover PID1's "outside" PID, whereas it's trivially done @ clone() in the systemd-nspawn parent context. We need to write that down somewhere for our nsenter-like operations.
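
As a rough shell analogy of recording the pid from the parent side (a sketch under assumptions, not the fakesdboot.so code): `sleep` stands in for the container's PID 1, `$!` plays the role of the value clone() returns to the parent, and the temporary directory stands in for /var/lib/rkt/containers/$cuuid.

```shell
#!/bin/sh
cdir=$(mktemp -d)   # hypothetical stand-in for the container directory

# Start a child standing in for the container's PID 1.
sleep 30 &

# $! is the child's pid as seen by the parent, i.e. from "outside".
# Record it before doing anything else.
echo "$!" > "$cdir/pid"

# Later, a separate tool can read the file to find its target.
cpid=$(cat "$cdir/pid")
if kill -0 "$cpid" 2>/dev/null; then
    alive=yes
else
    alive=no
fi
echo "recorded pid $cpid, alive: $alive"

# Clean up the stand-in child and directory.
kill "$cpid"
wait "$cpid" 2>/dev/null || true
rm -r "$cdir"
```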

Ultimately I see this all vanishing in the not too distant future with a systemd-nspawn replacement, where all these fakesdboot.so intercepts become the replacement's normal implementation.

stage0 acquires an exclusive advisory lock on the container directory,
leaving the fd open when executing stage1/init with the fd value stored in
the environment variable RKT_LOCK_FD.

stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by
replacing the close with an fcntl that sets FD_CLOEXEC, retaining the lock
handle until nspawn exits.

This facilitates simple and reliable container discovery by external
processes via flock() attempts on container directories. Consult the
flock(2) and flock(1) man pages for information on advisory file locks.

With this change some basic lifecycle features are trivial using bash:

Identifying which containers are active:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  flock --exclusive --nonblock "$c" echo "$c inactive" || echo "$c active"
 done

GC of inactive containers:

 shopt -s nullglob
 mkdir -p /var/lib/rkt/gc
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/stage1" ] || continue
  flock --exclusive --nonblock "$c" mv "$c" /var/lib/rkt/gc
 done

Assuming a periodic task clears out aged contents from /var/lib/rkt/gc.

Intercept syscall() used by systemd-nspawn for clone, writing the
newly-namespaced child's pid into /var/lib/rkt/containers/$cuuid from the
parent before returning to systemd-nspawn.

This change combined with locking enables simple status queries of running
containers, for example in bash:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/pid" ] || continue
  flock --exclusive --nonblock "$c" /bin/true || {
   echo $c
   cpid=$(cat "$c/pid")
   nsenter --mount --uts --ipc --pid --root --wd --target "${cpid}" systemctl status
  }
 done

vcaputo · 2014-12-10T23:21:43Z

Merging this for now, we'll revisit this, but in the interim we can get all the lifecycle details sorted and built around advisory locks and the $cuuid/pid file.


jonboulle · 2014-12-10T23:27:21Z

jonboulle reviewed Dec 9, 2014

jonboulle mentioned this pull request Dec 9, 2014

rkt: basic gc command #35

Closed

Vito Caputo added 2 commits December 9, 2014 12:50

vcaputo mentioned this pull request Dec 9, 2014

rkt: lifecycle management #6

Closed

vcaputo added a commit that referenced this pull request Dec 10, 2014

Merge pull request #244 from vcaputo/lock_dir-record_pid

818a798

rkt run: acquire lock on container directory and record pid

vcaputo merged commit 818a798 into rkt:master Dec 10, 2014
