Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runsv locks up after service's directory is removed and recreated (xbps-install -f) #25

Open
0x5c opened this issue Apr 12, 2023 · 15 comments

Comments

@0x5c
Copy link

0x5c commented Apr 12, 2023

When the service's directory is removed and recreated, like during an xbps package reinstall, runsv will enter a broken state. Once in that state, a command to down the service will cause runsv to terminate the supervised process, but will fail to update the status file and possibly more, soft-locking.

The same bug happens for the log service if present, but presence of a log service is not needed.

Minimal reproducer

This can be done as a normal user

  1. Have a <service> directory containing solely a run file and a supervise -> ../supervise symlink.

    #!/bin/sh
    
    exec 2>&1
    exec <some command here>
  2. From the parent directory, run runsv <service>

  3. Copy delete, and recreate <service>

  4. From the same dir as step 2, run SVDIR=$(pwd) sv stop <service>

  5. The supervised process will stop but the sv command will fail on timeout, reporting the supervised process' PID as the one it had before being terminated.
    Repeating the stop command will continue to fail and report that same invalid PID.
    Trying to start the service will immediately 'succeed', reporting the same PID.

I recommend using as the supervised process a program that outputs frequently, makes it easier to see when the supervised process is running/stopped. I using a simple binary that prints an incremented number each second.

@0x5c
Copy link
Author

0x5c commented Jun 4, 2023

More details on this, which I though I had added when exploring the issue:
runsv essentially gets a fd to the supervise/ directory, that it uses with *at() syscalls for many operations on that dir. So even when the directory is recreated, the fd no longer points to it.

@0x5c
Copy link
Author

0x5c commented Jun 25, 2023

The trigger of this bug is now fixed void-linux/xbps#561, however the bug itself (if it really is one) can still be triggered otherwise. Is this issue worth keeping?

@leahneukirchen
Copy link
Member

I think it could be reasonable to check if any *at() fails with ENOENT if the directory still exists, which can be done with fstat on the fd and testing if st_nlink > 0. If the service directory has been removed, runsv should simply exit and be restarted by runsvdir.

@leahneukirchen
Copy link
Member

Ok, i think the actual bug is that the runsv in this situation ignores SIGTERM, because runsvdir should kill it when it detects the svdir has changed inode.

@0x5c
Copy link
Author

0x5c commented Sep 12, 2023

Ok, i think the actual bug is that the runsv in this situation ignores SIGTERM, because runsvdir should kill it when it detects the svdir has changed inode.

The manual page for runsvdir suggests that this would trigger only if the symlink in /var/services were to be changed. Since the symlink is user-created and not managed by xbps, it wouldn't get changed.
Should the check extend to the inode of the link's destination?

@leahneukirchen
Copy link
Member

runsvdir calls stat (so follows symlinks), tho?

@0x5c
Copy link
Author

0x5c commented Sep 12, 2023

Hum right
But runsv correctly gets sigterm'd when the directory/link gets removed, so what would make it ignore SIGTERM in that specific case of the dir switcheroo? Or rather, is the sigterm actually sent for a very brief moment of the dir not being there.

@0x5c
Copy link
Author

0x5c commented Sep 12, 2023

   At least every five seconds runsvdir checks whether the time of last
   modification, the inode, or the device, of the services directory dir
   has changed.  If so, it re-scans the service directory, and if it sees
   a new subdirectory, or new symlink to a directory, in dir, it starts a
   new runsv(8) process; if runsvdir sees a subdirectory being removed
   that was previously there, it sends the corresponding runsv(8) process
   a TERM signal, stops monitoring this process, and so does not restart
   the runsv(8) process if it exits.

The link's destination changing wouldn't change the link itself, nor the dir the link is in, right? If I get it correctly, the rescan is not even done if the services dir has not changed.

@Duncaen
Copy link
Member

Duncaen commented Sep 12, 2023

There are no signals, the mtime of the /etc/runit/runsvdir/default directory never changes, only the sv directory changes.

@Duncaen
Copy link
Member

Duncaen commented Sep 12, 2023

Not sure what the bug is here then, everything seems to work as expected, runsv is just still running in a directory that doesn't exist and is unable to update the status stuff since its using relative paths from its service directory to the files in supervise.
It still reacts to control messages from sv since the fifo fd stays open and the fifo stays there and is the same because of the supervise symlink.

@0x5c
Copy link
Author

0x5c commented Sep 12, 2023

Not sure what the bug is here then, everything seems to work as expected, runsv is just still running in a directory that doesn't exist and is unable to update the status stuff since its using relative paths from its service directory to the files in supervise.

Removing the dir of a service in the services dir should cause the corresponding runsv to be killed by runsvdir. But since there's the indirection of the symlink, the actual trigger in runsvdir (the services dir changing) never activates.

Now I wonder if we'd get it to actually kill the runsv process by doing an unrelated change to the services dir and letting it rescan.

@Duncaen
Copy link
Member

Duncaen commented Sep 12, 2023

Now I wonder if we'd get it to actually kill the runsv process by doing an unrelated change to the services dir and letting it rescan.

Yes touching the runsvdir will trigger the rescan. It behaves a bit flaky, but works in the end. Since runsvdir will start new runsv processes before it terminates old ones, the first attempt of starting runsv on the "new" service will fail because of the lock in supervise. Eventually when the "old" runsv successfully exited further runs of runsv will successfully start the service.

    2651587- touch exited status=0 time=0.000s
        2651588+ runsv a
        2651588- runsv exited status=111 time=0.001s
        2651589+ runsv a
          2651590+ /bin/sh ./run
          2651590- /bin/sh execed time=0.001s
          2651590+ pause

runsvdir when I touch service:

poll(NULL, 0, 5020)                     = 0 (Timeout)
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
wait4(-1, 0x7ffe0d42775c, WNOHANG, NULL) = 0
newfstatat(AT_FDCWD, "service", {st_mode=S_IFDIR|0775, st_size=60, ...}, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
poll(NULL, 0, 5020)                     = 0 (Timeout)
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
wait4(-1, 0x7ffe0d42775c, WNOHANG, NULL) = 0
newfstatat(AT_FDCWD, "service", {st_mode=S_IFDIR|0775, st_size=60, ...}, 0) = 0

# "woke up" because the mtime of "service" changed
chdir("service")                        = 0
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
newfstatat(4, "", {st_mode=S_IFDIR|0775, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(4, 0x5600a43b32d0 /* 3 entries */, 32768) = 72

# found the "new" service "a" and starting runsv for it
newfstatat(AT_FDCWD, "a", {st_mode=S_IFDIR|0775, st_size=80, ...}, 0) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb7b142aa10) = 2651588
getdents64(4, 0x5600a43b32d0 /* 0 entries */, 32768) = 0
close(4)                                = 0

# detected the old ino being gone, killing it pid of runsv
kill(2651573, SIGTERM)                  = 0
fchdir(3)                               = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
poll(NULL, 0, 1020)                     = 0 (Timeout)

# output of the old runsv process that wants to update its state after killing the service.
runsv a: warning: unable to open supervise/stat.new: file does not exist
runsv a: warning: unable to open supervise/stat.new: file does not exist
runsv a: warning: unable to open supervise/pid.new: file does not exist

# new runsv process that can't start because the supervise directory is locked
runsv a: fatal: unable to lock supervise/lock: temporary failure

rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0

# signal of the new runsv process terminating with 111 due to the lock
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2651588, si_uid=1000, si_status=111, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 2651573
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 111}], WNOHANG, NULL) = 2651588
wait4(-1, 0x7ffe0d42775c, WNOHANG, NULL) = -1 ECHILD (No child processes)

# rescan due to SIGCHLD
newfstatat(AT_FDCWD, "service", {st_mode=S_IFDIR|0775, st_size=60, ...}, 0) = 0
chdir("service")                        = 0
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
newfstatat(4, "", {st_mode=S_IFDIR|0775, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(4, 0x5600a43b32d0 /* 3 entries */, 32768) = 72

# found service with -1 sv pid, starting the runsv agian
newfstatat(AT_FDCWD, "a", {st_mode=S_IFDIR|0775, st_size=80, ...}, 0) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb7b142aa10) = 2651589
getdents64(4, 0x5600a43b32d0 /* 0 entries */, 32768) = 0
close(4)                                = 0
fchdir(3)                               = 0

# back to main loop
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
poll(NULL, 0, 5020)                     = 0 (Timeout)
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
wait4(-1, 0x7ffe0d42775c, WNOHANG, NULL) = 0
newfstatat(AT_FDCWD, "service", {st_mode=S_IFDIR|0775, st_size=60, ...}, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
poll(NULL, 0, 5020)                     = 0 (Timeout)
```

@Duncaen
Copy link
Member

Duncaen commented Sep 12, 2023

I think the proper solution would be for runsv to exit if it detect its directory being gone, but I don't think there is actually a way to poll for that.

runsv could exit when it detects ENOENT, but that would be somewhat random as it would happen as a reaction to something else runsv is doing like updating the status or receiving a command from the user.

@0x5c
Copy link
Author

0x5c commented Sep 12, 2023

runsv gets an fd for the dir, could that be used to detect the dir being gone?

@Duncaen
Copy link
Member

Duncaen commented Sep 12, 2023

Not by polling, right now runsv processes are suspended in the poll syscall and wake up on signals like sigchild or data on the control fd. I don't think there is a way to make this poll wake up if the directory is deleted with posix mechanisms. Waking up at an interval to check the directory sucks, inotify would work, but is not portable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants