Audit CONTINUE dispatches for TOCTOU races on guest memory #40
Problem
Every SECCOMP_USER_NOTIF_FLAG_CONTINUE dispatch path trusts that the
supervisor's pre-validation of guest memory (path strings, flag words, struct
contents) remains valid when the kernel re-executes the syscall. Since CONTINUE
causes the kernel to re-read pointer targets from guest memory, a malicious
guest thread can mutate the memory between process_vm_readv in the supervisor
and the kernel's re-read, bypassing the supervisor's policy decision.
The only safe CONTINUE paths are those where the supervisor's decision depends
solely on seccomp_data register values (syscall number, integer flags).
These are captured atomically by the kernel before notification and cannot be
modified by the guest.
Any CONTINUE path where the supervisor validates a pointer target (path string,
struct contents, buffer data) and then CONTINUEs is vulnerable to TOCTOU.
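The distinction can be illustrated with a minimal sketch of a register-only decision. The `verdict` enum and `decide_fcntl` are hypothetical names, not taken from the project, and choosing `fcntl` is purely illustrative: the issue does not say which syscalls are currently CONTINUEd. The only point is that the function reads nothing except `struct seccomp_data`.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Hypothetical verdicts; the real dispatch return values live in
 * include/kbox/seccomp-dispatch.h and may look different. */
enum verdict { VERDICT_CONTINUE, VERDICT_EMULATE };

/*
 * Decide using only fields of struct seccomp_data, which the kernel
 * copied out before notifying the supervisor. No guest pointer is
 * dereferenced, so a racing guest thread has nothing to mutate.
 */
static enum verdict decide_fcntl(const struct seccomp_data *d)
{
    assert(d != NULL);
    if (d->nr != SYS_fcntl)
        return VERDICT_EMULATE;

    switch ((int)d->args[1]) {   /* cmd is an integer register value */
    case F_GETFD:
    case F_GETFL:
        return VERDICT_CONTINUE; /* these commands take no pointer argument */
    default:
        /* e.g. F_SETLK passes a struct flock pointer in args[2] */
        return VERDICT_EMULATE;
    }
}
```

Anything that would require following `args[2]` into guest memory falls through to emulation, which matches the "false positives are cheaper than false negatives" stance of this issue.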
Proposed Changes
- Audit: enumerate every CONTINUE dispatch in seccomp-dispatch.c.
For each one, classify whether the supervisor's decision depends on:
  - (a) register values only (safe), or
  - (b) pointer-target data read via process_vm_readv (vulnerable).
- Mitigate category (b): convert vulnerable CONTINUE paths to full
emulation. The supervisor performs the operation via LKL or host syscalls
and injects the result, rather than allowing the kernel to re-execute with
potentially mutated arguments.
- Document safe CONTINUE policy: establish a rule that new CONTINUE paths
must only depend on register values. Add a comment at the dispatch entry
point documenting this invariant.
- Test: add guest test binaries with a racing thread that mutates path
buffers between supervisor read and CONTINUE, verifying that the emulated
paths are not bypassable.
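For the test item, one possible shape of the racing guest thread is sketched below. Everything here (`struct flipper`, `flipper_start`, `flipper_stop`) is a hypothetical helper, not existing project code. It assumes GCC or Clang for the `__atomic_store_n` builtin and toggles a single byte so the path alternates between an allowed and a forbidden name of equal length.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical guest-side helper: flips one byte of a path buffer so the
 * string alternates between a benign and a hostile name. */
struct flipper {
    char *byte;            /* byte inside the path buffer to toggle */
    char benign, hostile;  /* the two values it alternates between */
    atomic_bool stop;
    atomic_ulong flips;    /* completed hostile/benign cycles */
    pthread_t tid;
};

static void *flip_loop(void *arg)
{
    struct flipper *f = arg;
    do {
        __atomic_store_n(f->byte, f->hostile, __ATOMIC_RELAXED);
        __atomic_store_n(f->byte, f->benign, __ATOMIC_RELAXED);
        atomic_fetch_add(&f->flips, 1);
    } while (!atomic_load(&f->stop));
    return NULL;           /* the byte is always left in its benign state */
}

static int flipper_start(struct flipper *f, char *byte, char benign, char hostile)
{
    assert(f != NULL && byte != NULL);
    f->byte = byte;
    f->benign = benign;
    f->hostile = hostile;
    atomic_init(&f->stop, false);
    atomic_init(&f->flips, 0);
    return pthread_create(&f->tid, NULL, flip_loop, f);
}

static unsigned long flipper_stop(struct flipper *f)
{
    atomic_store(&f->stop, true);
    pthread_join(f->tid, NULL);
    return atomic_load(&f->flips);
}
```

A guest test would call the syscall under audit in a loop between `flipper_start` and `flipper_stop`, passing the shared buffer, and fail if any call ever succeeds against the forbidden name. Because the race window is narrow, such a test can only show that a bypass exists, never prove that one does not, so many iterations are advisable.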
Considerations
- Converting CONTINUE to emulation has a performance cost (extra context
switch plus LKL syscall instead of native kernel execution). The audit
should quantify which paths are hot and whether the overhead is acceptable.
- /proc, /sys, /dev paths currently use CONTINUE for host kernel
handling. If the supervisor's path classification depends on reading the
path string from guest memory, these are vulnerable. The path string is a
pointer target, not a register value.
- Some CONTINUE paths may be safe in practice because the kernel
re-validates independently (e.g. permission checks). But the security
model should not rely on defense-in-depth at the kernel level. The
supervisor is the policy enforcement point.
- This is a security-critical audit. False negatives (missing a vulnerable
path) are worse than false positives (unnecessarily emulating a safe path).
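For the /proc, /sys, /dev consideration, the sketch below shows the "classify a snapshot, then act on the snapshot" pattern that emulation implies. `is_host_path` and `classify_snapshot` are hypothetical stand-ins for whatever src/path.c actually does. The buffer `snap` is assumed to have been filled by a single process_vm_readv call, and path normalization (`..`, symlinks) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classifier; the project's real logic lives in src/path.c. */
static bool is_host_path(const char *p)
{
    static const char *const roots[] = { "/proc", "/sys", "/dev" };
    for (size_t i = 0; i < sizeof roots / sizeof roots[0]; i++) {
        size_t n = strlen(roots[i]);
        /* match "/dev" and "/dev/..." but not "/device" */
        if (strncmp(p, roots[i], n) == 0 && (p[n] == '/' || p[n] == '\0'))
            return true;
    }
    return false;
}

/*
 * Classify bytes that were already copied out of the guest into
 * supervisor-owned storage. Returns -1 if the copy holds no NUL
 * terminator, 0 otherwise with *host set.
 * The caller must perform the operation on `snap` itself; handing the
 * original guest pointer back to the kernel would reopen the race.
 */
static int classify_snapshot(const char *snap, size_t len, bool *host)
{
    assert(snap != NULL && host != NULL);
    if (memchr(snap, '\0', len) == NULL)
        return -1;
    *host = is_host_path(snap);
    return 0;
}
```

Once the copy is taken, later writes to the guest buffer cannot change either the classification or the string the supervisor acts on, which is the property CONTINUE cannot provide for pointer targets.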
References
- src/seccomp-dispatch.c: all SECCOMP_USER_NOTIF_FLAG_CONTINUE return sites
- src/path.c: path classification (host-escape detection)
- include/kbox/seccomp-dispatch.h: dispatch return value definitions
- seccomp_unotify(2): documents that CONTINUE re-reads from guest memory