Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

rkt run: acquire lock on container directory and record pid #244

Merged
merged 2 commits into from
Dec 10, 2014
Merged

rkt run: acquire lock on container directory and record pid #244

merged 2 commits into from
Dec 10, 2014

Conversation

vcaputo
Copy link
Contributor

@vcaputo vcaputo commented Dec 8, 2014

stage0 acquires an exclusive advisory lock on the container directory,
leaving the fd open when executing stage1/init with the fd value stored in
the environment variable RKT_LOCK_FD.

stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by
replacing the close with a O_CLOEXEC fcntl, retaining the lock handle until
nspawn exits.

fakesdboot.so has also been modified to intercept the clone syscall from
nspawn, recording the pid of the container's PID 1 in
"/var/lib/rkt/containers/$cuuid/pid"

With these changes one can do things like:

query container status:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/pid" ] || continue
  flock --exclusive --nonblock "$c" /bin/true || {
   cpid=$(cat "$c/pid")
   nsenter --mount --uts --ipc --pid --root --wd --target "${cpid}" systemctl status
  }
 done

gc dead containers:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/stage1" ] || continue
  flock --exclusive --nonblock "$c" rm -Rf "$c"
 done

@jzelinskie
Copy link
Contributor

@vcaputo in the future, you can format code snippets on GitHub by doing

``lang
code goes here
``

but with 3 ticks instead of 2.

@vcaputo
Copy link
Contributor Author

vcaputo commented Dec 9, 2014

markdown in the commit message?

@jzelinskie
Copy link
Contributor

I don't know if it'll render in the commit message, but in the PR body it would help.
I guess it's better to have a nice commit message than a vague commit message and a nice PR in case the code ever leaves GitHub.

@vcaputo
Copy link
Contributor Author

vcaputo commented Dec 9, 2014

Well I had noticed the formatting from the commit message was lost in giithub, but didn't add markdown since I wasn't sure if those changes would be retained in the merged PR commit message if it were merged.

va_list ap;
long ret;

if(number != __NR_clone)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why return here? won't this stop other syscalls? (or is that the goal?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should probably be an assert actually. This is not intended to be a general syscall() wrapper, it's specific to the clone syscall number, and in the case of systemd-nspawn that's perfectly ok because it's the only way syscall() is used in nspawn.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok listen I,do remember going to this site but I really don't have any
ideal what it is or what it's about or what y'all keep texting me or what
it means so please stop thinking me... and again I'm sorry I....
On Dec 8, 2014 11:49 PM, "Vito Caputo" notifications@github.com wrote:

In stage1/mkrootfs.sh:

+int close(int fd)
+{

  • if(lock_fd != -1 && fd == lock_fd)
  •   return fcntl(fd, F_SETFD, FD_CLOEXEC);
    
  • return libc_close(fd);
    +}

+long syscall(long number, ...)
+{

  • unsigned long clone_flags;
  • va_list ap;
  • long ret;
  • if(number != __NR_clone)

It should probably be an assert actually. This is not intended to be a
general syscall() wrapper, it's specific to the clone syscall number, and
in the case of systemd-nspawn that's perfectly ok.


Reply to this email directly or view it on GitHub
https://github.com/coreos/rocket/pull/244/files#r21507003.

@jonboulle
Copy link
Contributor

looks awesome and/or scary!

@jonboulle
Copy link
Contributor

Nit, could you separate the lock + the pid recording into separate commits? Seems like it might make it a bit easier to trace back should the need arise one day.

@philips
Copy link
Contributor

philips commented Dec 9, 2014

This scares me too much. I was OK relying on the LD_PRELOAD hack until we can fix-up nspawn but this is going a bit too far. What about if we send a "life cycle" FD to systemd-nspawn that is socket activated to some unix domain inside the stage1 instead?

@vcaputo
Copy link
Contributor Author

vcaputo commented Dec 9, 2014

@philips that may be workable as well, though I don't think this is really all that scary, and what I really wanted to solidify with this was things like: is there a pid flie? where do we put it? do we need to communicate the lock fdnum to stage1? through an environment variable?

It's not immediately clear to me how the socket-activated unit would discover PID1's "outside" PID, whereas it's trivially done @ clone() in the systemd-nspawn parent context. We need to write that down somewhere for our nsenter-like operations.

Ultimately I see this all vanishing in the not too distant future with a systemd-nspawn replacement, where all these fakesdboot.so intercepts become the replacement's normal implementation.

@jonboulle jonboulle mentioned this pull request Dec 9, 2014
Vito Caputo added 2 commits December 9, 2014 12:50
stage0 acquires an exclusive advisory lock on the container directory,
leaving the fd open when executing stage1/init with the fd value stored in
the environment variable RKT_LOCK_FD.

stage1's fakesdboot.so shim prevents closure of RKT_LOCK_FD in nspawn by
replacing the close with a O_CLOEXEC fcntl, retaining the lock handle until
nspawn exits.

This facilitates simple and reliable container discovery by external
processes via flock() attempts on container directories.  Consult the
flock(2) and flock(1) man pages for information on advisory file locks.

With this change some basic lifecycle features are trivial using bash:

 Identifying which containers are active:
  shopt -s nullglob
  for c in /var/lib/rkt/containers/*; do
   flock --exclusive --nonblock "$c" echo "$c inactive" || echo "$c active"
  done

 GC of inactive containers:
  shopt -s nullglob
  mkdir -p /var/lib/rkt/gc
  for c in /var/lib/rkt/containers/*; do
   [ -e "$c/stage1" ] || continue
   flock --exclusive --nonblock "$c" mv "$c" /var/lib/rkt/gc
  done

 Assuming a periodic task clears out aged contents from /var/lib/rkt/gc
Intercept syscall() used by systemd-nspawn for clone, writing the
newly-namespaced child's pid into /var/lib/rkt/containers/$cuuid from the
parent before returning to systemd-nspawn.

This change combined with locking enables simple status queries of running
containers, for example in bash:

 shopt -s nullglob
 for c in /var/lib/rkt/containers/*; do
  [ -e "$c/pid" ] || continue
  flock --exclusive --nonblock "$c" /bin/true || {
   echo $c
   cpid=$(cat "$c/pid")
   nsenter --mount --uts --ipc --pid --root --wd --target "${cpid}" systemctl status
  }
 done
@vcaputo vcaputo mentioned this pull request Dec 9, 2014
@vcaputo
Copy link
Contributor Author

vcaputo commented Dec 10, 2014

Merging this for now, we'll revisit this, but in the interim we can get all the lifecycle details sorted and built around advisory locks and the $cuuid/pid file.

vcaputo added a commit that referenced this pull request Dec 10, 2014
rkt run: acquire lock on container directory and record pid
@vcaputo vcaputo merged commit 818a798 into rkt:master Dec 10, 2014
@jonboulle
Copy link
Contributor

img

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants