Summary
The legacy clone(2) path silently ignores CLONE_NEW* namespace flags, while clone3(2) rejects the same flags with EINVAL. The two entry points therefore disagree on identical inputs.
Found while reviewing the OCI image support work (#31, PR #34): container runtimes and tools request namespaces through whichever clone variant their libc picks, so the behavior a guest sees depends on the syscall number rather than on a single, intentional policy.
Details
clone3(2) explicitly rejects unsupported isolation features:
// src/runtime/forkipc.c (sys_clone3)
if (ca.flags & LINUX_CLONE_INTO_CGROUP)
    return -LINUX_EINVAL; /* cgroups not implemented */     // forkipc.c:1572
if (ca.flags & LINUX_CLONE3_NS_FLAGS)
    return -LINUX_EINVAL; /* namespaces not implemented */  // forkipc.c:1574
if (ca.set_tid_size != 0)
    return -LINUX_EINVAL; /* set_tid not implemented */     // forkipc.c:1576
The legacy path does not. sc_clone forwards the raw flags unchanged:
// src/syscall/syscall.c:1543
return sys_clone(current_thread->vcpu, g, x0, x1, 0, 0, x2, x3, x4, ...);
and sys_clone (src/runtime/forkipc.c:1061) contains no CLONE_NEW* / cgroup / set_tid validation anywhere. The CLONE_NEW* bits simply fall through to the normal posix_spawn-based fork path.
Impact
clone(CLONE_NEWPID | CLONE_NEWNS | ...) returns a child PID and appears to succeed, but no isolation is set up. A caller that assumes the namespace was created proceeds on a false premise, where a clean error would have let it handle the failure or fall back.
clone3() with the same flags returns EINVAL, so a runtime that probes clone3 first and falls back to clone will behave differently from one that calls clone directly.
Expected
Both entry points should apply the same policy. Given the project's current "namespaces not implemented" stance, the legacy path should also reject these flags with EINVAL.
Proposed fix
Mirror the clone3 checks in sys_clone before the spawn path, reusing the existing mask:
/* LINUX_CLONE3_NS_FLAGS is already defined for the clone3 path */
if (flags & LINUX_CLONE3_NS_FLAGS)
    return -LINUX_EINVAL; /* namespaces not implemented */
if (flags & LINUX_CLONE_INTO_CGROUP)
    return -LINUX_EINVAL; /* cgroups not implemented */
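One way to keep the two paths from drifting apart again is to route both through a single helper. The following is a minimal standalone sketch of that idea, not the project's code: every DEMO_ name and both function names are hypothetical, and the flag values are the Linux UAPI constants from <linux/sched.h>. It also assumes that the legacy path should ignore the low byte of the flags word, because legacy clone carries the exit signal there and CLONE_NEWTIME (0x80) overlaps that field. Whether LINUX_CLONE3_NS_FLAGS actually includes CLONE_NEWTIME is not stated above, so treat that part as an assumption to verify.

```c
#include <errno.h>
#include <stdint.h>

/* Linux UAPI flag values; the DEMO_ prefix marks these names as hypothetical. */
#define DEMO_CSIGNAL           0x000000ffULL  /* exit-signal field of legacy clone */
#define DEMO_CLONE_NEWTIME     0x00000080ULL  /* overlaps DEMO_CSIGNAL */
#define DEMO_CLONE_NEWNS       0x00020000ULL
#define DEMO_CLONE_NEWCGROUP   0x02000000ULL
#define DEMO_CLONE_NEWUTS      0x04000000ULL
#define DEMO_CLONE_NEWIPC      0x08000000ULL
#define DEMO_CLONE_NEWUSER     0x10000000ULL
#define DEMO_CLONE_NEWPID      0x20000000ULL
#define DEMO_CLONE_NEWNET      0x40000000ULL
#define DEMO_CLONE_INTO_CGROUP 0x200000000ULL

/* Assumed stand-in for the project's namespace mask. */
#define DEMO_NS_FLAGS                                              \
    (DEMO_CLONE_NEWTIME | DEMO_CLONE_NEWNS | DEMO_CLONE_NEWCGROUP | \
     DEMO_CLONE_NEWUTS | DEMO_CLONE_NEWIPC | DEMO_CLONE_NEWUSER |   \
     DEMO_CLONE_NEWPID | DEMO_CLONE_NEWNET)

/* Shared policy: reject namespace and cgroup requests with -EINVAL. */
static int demo_check_clone3_flags(uint64_t flags)
{
    if (flags & (DEMO_NS_FLAGS | DEMO_CLONE_INTO_CGROUP))
        return -EINVAL;
    return 0;
}

/* Legacy clone: strip the exit-signal byte first so that a signal number
 * is never mistaken for CLONE_NEWTIME, then apply the same policy. */
static int demo_check_legacy_clone_flags(uint64_t flags)
{
    return demo_check_clone3_flags(flags & ~DEMO_CSIGNAL);
}
```

With this shape, a plain fork-style call such as demo_check_legacy_clone_flags(SIGCHLD) passes, while any CLONE_NEW* bit above the signal byte is rejected on both paths.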
A regression test asserting that clone() and clone3() return the same result for each CLONE_NEW* flag would lock the two paths together.
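A guest-side version of that regression test could look roughly like the sketch below. It is an illustration under assumptions, not an existing test in the repository: all demo_ names are hypothetical, the raw SYS_clone call relies on the x86_64/aarch64 convention where the flags word is the first argument, and the local struct mirrors only the first 64 bytes (version 0) of the kernel's clone_args layout. CLONE_NEWTIME is left out of the comparison because it cannot be expressed through legacy clone.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435  /* same number on all Linux architectures */
#endif

/* Hypothetical local copy of the version-0 clone_args layout (64 bytes). */
struct demo_clone_args {
    uint64_t flags, pidfd, child_tid, parent_tid;
    uint64_t exit_signal, stack, stack_size, tls;
};

/* Normalise a raw clone result: 0 if a child was created, else errno. */
static int demo_finish(long ret)
{
    if (ret == 0)
        _exit(0);                      /* child: leave immediately */
    if (ret < 0)
        return errno;                  /* failure reported by syscall() */
    waitpid((pid_t)ret, NULL, 0);      /* parent: reap the child */
    return 0;
}

/* Legacy clone(2) with a fork-like setup plus the flag under test. */
static int demo_try_clone(uint64_t ns_flag)
{
    long ret = syscall(SYS_clone, (unsigned long)(ns_flag | SIGCHLD),
                       0UL, 0UL, 0UL, 0UL);
    return demo_finish(ret);
}

/* clone3(2) with the same flag and the same exit signal. */
static int demo_try_clone3(uint64_t ns_flag)
{
    struct demo_clone_args ca;
    memset(&ca, 0, sizeof ca);
    ca.flags = ns_flag;
    ca.exit_signal = SIGCHLD;
    long ret = syscall(SYS_clone3, &ca, sizeof ca);
    return demo_finish(ret);
}

/* Nonzero when both entry points report the same outcome for ns_flag. */
static int demo_paths_agree(uint64_t ns_flag)
{
    return demo_try_clone(ns_flag) == demo_try_clone3(ns_flag);
}
```

A test driver would call demo_paths_agree() once for each CLONE_NEW* value and fail on the first mismatch. Under the policy proposed here, both helpers would be expected to return EINVAL for every namespace flag.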
Refs #31, PR #34.