Audit CONTINUE dispatches for TOCTOU races on guest memory #40

@jserv

Problem

Every SECCOMP_USER_NOTIF_FLAG_CONTINUE dispatch path that pre-validates guest
memory (path strings, flag words, struct contents) trusts that the validated
contents are still in place when the kernel resumes the syscall. Because
CONTINUE makes the kernel read pointer targets from guest memory itself, a
malicious guest thread can mutate that memory between the supervisor's
process_vm_readv and the kernel's own read, bypassing the supervisor's policy
decision.
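The window can be modelled without any seccomp machinery. The sketch below is a single-threaded simulation, not code from this repository: guest_path, policy_allows and the /tmp/ policy are hypothetical stand-ins, and the memcpy stands in for process_vm_readv. It runs the three steps (supervisor check, guest mutation, kernel fetch) in the order a successful race would produce.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for guest memory holding a path argument. */
static char guest_path[64];

/* Hypothetical policy: only paths under /tmp/ are allowed. */
bool policy_allows(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0;
}

/* Step 1: the supervisor copies the string out of guest memory
 * (memcpy stands in for process_vm_readv) and validates the copy. */
bool supervisor_validate(void)
{
    char snapshot[sizeof(guest_path)];
    memcpy(snapshot, guest_path, sizeof(snapshot));
    snapshot[sizeof(snapshot) - 1] = '\0';
    return policy_allows(snapshot);
}

/* Step 2: a second guest thread rewrites the buffer. Modelled as a
 * plain call so the interleaving is deterministic. */
void guest_racer_mutate(void)
{
    strcpy(guest_path, "/etc/shadow");
}

/* Step 3: after CONTINUE the kernel fetches the string from guest
 * memory itself; report whether what it sees still satisfies policy. */
bool kernel_view_allowed(void)
{
    return policy_allows(guest_path);
}

/* Returns true when the supervisor approved one path but the kernel
 * ended up operating on a path the policy would have rejected. */
bool toctou_bypassed(void)
{
    strcpy(guest_path, "/tmp/ok");
    bool approved = supervisor_validate();
    assert(approved); /* the benign path must pass validation */
    guest_racer_mutate();
    return approved && !kernel_view_allowed();
}
```

In a real guest the mutation happens on another thread, so the outcome is probabilistic rather than guaranteed, but an attacker can retry until the race is won.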

The only safe CONTINUE paths are those where the supervisor's decision depends
solely on seccomp_data register values (syscall number, integer flags).
These are captured atomically by the kernel before notification and cannot be
modified by the guest.
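As an illustration of a register-only decision, the sketch below looks at nothing but struct seccomp_data. The policy itself (CONTINUE for fcntl commands that take no pointer argument) and the function name are assumptions made for the example, not the rules implemented in seccomp-dispatch.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical policy: let the host kernel handle fcntl() only for
 * commands whose arguments are all passed by value. */
bool continue_is_register_only(const struct seccomp_data *d)
{
    assert(d != NULL);
    if (d->nr != SYS_fcntl)
        return false;
    switch ((int)d->args[1]) { /* cmd lives in a register, not in memory */
    case F_GETFD:
    case F_GETFL:
        return true;
    default:
        /* e.g. F_SETLK passes a pointer to struct flock, so the
         * decision would depend on guest memory. */
        return false;
    }
}
```

Nothing here dereferences a guest pointer, so there is no second fetch for a racing thread to interfere with.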

Any CONTINUE path where the supervisor validates a pointer target (path string,
struct contents, buffer data) and then CONTINUEs is vulnerable to TOCTOU.

Proposed Changes

  1. Audit: enumerate every CONTINUE dispatch in seccomp-dispatch.c.
    For each one, classify whether the supervisor's decision depends on:
    • (a) register values only (safe), or
    • (b) pointer-target data read via process_vm_readv (vulnerable).
  2. Mitigate category (b): convert vulnerable CONTINUE paths to full
    emulation. The supervisor performs the operation via LKL or host syscalls
    and injects the result, rather than allowing the kernel to re-execute with
    potentially mutated arguments.
  3. Document safe CONTINUE policy: establish a rule that new CONTINUE paths
    must only depend on register values. Add a comment at the dispatch entry
    point documenting this invariant.
  4. Test: add guest test binaries with a racing thread that mutates path
    buffers between supervisor read and CONTINUE, verifying that the emulated
    paths are not bypassable.
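Steps 1 and 2 could meet in a helper along these lines. This is a minimal sketch: enum continue_class and the fill_* helpers are hypothetical names, and emulated_ret is assumed to be the result the supervisor already obtained by performing the operation itself. Only struct seccomp_notif_resp and the CONTINUE flag come from the kernel UAPI.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical audit classification from step 1. */
enum continue_class {
    CLASS_REGISTER_ONLY,  /* (a) decision uses seccomp_data only */
    CLASS_POINTER_TARGET  /* (b) decision read guest memory */
};

/* Inject an emulated result. ret follows the kernel convention:
 * >= 0 is the return value, negative is -errno. */
void fill_emulated_resp(struct seccomp_notif_resp *resp, __u64 id, long ret)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    if (ret < 0)
        resp->error = (int)ret;
    else
        resp->val = ret;
}

/* Let the kernel run the syscall; val and error must stay zero. */
void fill_continue_resp(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Category (b) never reaches CONTINUE: it always gets the result the
 * supervisor computed itself. */
void fill_resp_for_class(struct seccomp_notif_resp *resp, __u64 id,
                         enum continue_class cls, long emulated_ret)
{
    assert(resp != NULL);
    if (cls == CLASS_REGISTER_ONLY)
        fill_continue_resp(resp, id);
    else
        fill_emulated_resp(resp, id, emulated_ret);
}
```

Funnelling every response through one function like this would also give the invariant comment from step 3 an obvious home.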

Considerations

  • Converting CONTINUE to emulation has a performance cost (extra context
    switch plus LKL syscall instead of native kernel execution). The audit
    should quantify which paths are hot and whether the overhead is acceptable.
  • /proc, /sys, /dev paths currently use CONTINUE for host kernel
    handling. If the supervisor's path classification depends on reading the
    path string from guest memory, these are vulnerable. The path string is a
    pointer target, not a register value.
  • Some CONTINUE paths may be safe in practice because the kernel
    re-validates independently (e.g. permission checks). But the security
    model should not rely on defense-in-depth at the kernel level. The
    supervisor is the policy enforcement point.
  • This is a security-critical audit. False negatives (missing a vulnerable
    path) are worse than false positives (unnecessarily emulating a safe path).
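For the /proc, /sys, /dev case, one way to keep the classification meaningful is to classify a supervisor-owned snapshot of the path and then act on that same snapshot rather than CONTINUE. The sketch below covers only the classification half and is not the logic in src/path.c: both function names are hypothetical, and it deliberately ignores ".." components and symlinks, which a real implementation would have to resolve.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if path equals prefix or continues with '/' right after it,
 * so "/proc/self" matches "/proc" but "/process" does not. */
static bool has_dir_prefix(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(path, prefix, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/* Classify a path that already lives in supervisor memory. The answer
 * only describes these bytes; it says nothing about what guest memory
 * holds by the time the kernel would read it. */
bool snapshot_is_host_virtual(const char *snapshot)
{
    assert(snapshot != NULL);
    return has_dir_prefix(snapshot, "/proc") ||
           has_dir_prefix(snapshot, "/sys") ||
           has_dir_prefix(snapshot, "/dev");
}
```

If the supervisor then opens the snapshot path itself and hands the resulting descriptor to the guest, the bytes that were classified and the bytes that were opened are the same by construction.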

References

  • src/seccomp-dispatch.c: all SECCOMP_USER_NOTIF_FLAG_CONTINUE return sites
  • src/path.c: path classification (host-escape detection)
  • include/kbox/seccomp-dispatch.h: dispatch return value definitions
  • seccomp_unotify(2): documents that CONTINUE re-reads from guest memory
