Problem
The process lifecycle code in main.rs (run_child, lines 848-1040) uses 4 global atomics shared across 3 threads and a signal handler. Three race conditions have been proven deterministically with forced interleavings.
Proven Races
Q1: SIGKILL sent to recycled PID (kills innocent process)
The escalation thread reads CHILD_PID, then sends SIGKILL. Between the read and the kill, the child exits, the main thread reaps it, and the kernel recycles its PID to a new process. The SIGKILL hits the wrong process.
Proven by forcing PID recycling via /proc/sys/kernel/ns_last_pid. The victim process (sleep 600) was killed by SIGKILL intended for the original child.
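One possible way to close this window, sketched here under assumptions: keep the std::process::Child handle behind a Mutex instead of publishing a raw PID, and do the liveness check and the kill under the same lock. A PID cannot be recycled until the process has been reaped, so if reaping only ever happens under that lock, the kill can only reach our own child. The names kill_if_running and wait_polling are hypothetical and do not come from main.rs. The sketch also replaces the blocking child.wait() with try_wait polling, which is a design change and not the current behavior.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Escalation side (hypothetical helper): kill the child only if it has not
/// been reaped yet. The lock is held across the check and the kill, so no
/// other thread can reap the child in between.
fn kill_if_running(child: &Mutex<Child>) -> io::Result<bool> {
    let mut guard = child.lock().unwrap();
    if guard.try_wait()?.is_some() {
        // Already exited and reaped: the PID may belong to someone else now.
        return Ok(false);
    }
    // Not reaped, so the PID still refers to our child (possibly a zombie).
    guard.kill()?;
    Ok(true)
}

/// Main-thread side (hypothetical helper): poll for exit, taking the lock
/// only briefly so the escalation thread can get in between polls.
fn wait_polling(child: &Mutex<Child>, poll: Duration) -> io::Result<ExitStatus> {
    loop {
        if let Some(status) = child.lock().unwrap().try_wait()? {
            return Ok(status);
        }
        thread::sleep(poll);
    }
}
```

The trade-off is polling latency in place of a blocking wait. Both threads would share the handle as an Arc<Mutex<Child>>, which also removes the need for a global CHILD_PID on this path.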
Q3: FORCE_KILLED flag set after main thread reads it
The main thread reads FORCE_KILLED as false and classifies the stop reason as Duration. The escalation thread then sets FORCE_KILLED to true. The "program did not respond to SIGTERM" warning is not printed even though SIGKILL was sent.
Proven with a barrier between the main thread's flag read and the escalation thread's flag write. One run, deterministic.
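A minimal sketch of one way to remove this ordering problem, under assumptions: instead of a shared FORCE_KILLED flag, the escalation thread returns whether it escalated, and the main thread learns the answer by joining it. The join is a synchronization point, so the main thread cannot classify the stop before the escalation decision is final. The names spawn_escalation and classify, the ForceKilled variant, and the channel used to announce the child's exit are hypothetical; only the Duration stop reason appears in this report.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stop reasons; only `Duration` is named in the report.
#[derive(Debug, PartialEq)]
enum StopReason {
    Duration,
    ForceKilled,
}

/// Spawns an escalation thread that waits up to `grace` for an "exited"
/// message. It returns `true` only if the grace period ran out and it
/// invoked `force_kill`.
fn spawn_escalation<F>(grace: Duration, exited: Receiver<()>, force_kill: F) -> JoinHandle<bool>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || match exited.recv_timeout(grace) {
        // The child exited in time, or the sender side went away.
        Ok(()) | Err(RecvTimeoutError::Disconnected) => false,
        // No exit within the grace period: escalate.
        Err(RecvTimeoutError::Timeout) => {
            force_kill();
            true
        }
    })
}

/// Main-thread side: the result is read only after the escalation thread
/// has finished, so there is no late flag write to miss.
fn classify(escalation: JoinHandle<bool>) -> StopReason {
    if escalation.join().expect("escalation thread panicked") {
        StopReason::ForceKilled
    } else {
        StopReason::Duration
    }
}
```

This addresses only the flag ordering. What force_kill does, and whether it can hit a recycled PID, is a separate concern (Q1).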
Q4: SIGINT arrives before CHILD_PID is stored (parent hangs)
SIGINT arrives between signal handler installation (line 940) and the CHILD_PID store (line 948). The handler sees PID 0 and skips the kill, so the child never receives SIGTERM. With kill_timeout == 0, the parent hangs forever on child.wait(). A second Ctrl-C kills the parent, because SA_RESETHAND reset the handler to the default action after the first delivery, and the child is orphaned.
Proven with a barrier between handler installation and spawn. One run, deterministic. The comment at line 921 says "no Ctrl-C gap can orphan the child" but the proof shows the gap exists.
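The gap can be narrowed to nothing with a "pending stop" handshake, sketched below as a model rather than as the project's code. The handler always records the request before looking at the PID, and the main thread always publishes the PID before checking for a recorded request. With sequentially consistent ordering, at least one side observes the other's store, so an early SIGINT is never lost. StopState and its methods are hypothetical, and the sketch returns the PID to signal instead of calling kill so that it stays within the standard library.

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};

/// Hypothetical state shared by the main thread and the SIGINT handler.
struct StopState {
    child_pid: AtomicI32,       // 0 means "child not spawned yet"
    stop_requested: AtomicBool, // set on every SIGINT, even an early one
}

impl StopState {
    const fn new() -> Self {
        StopState {
            child_pid: AtomicI32::new(0),
            stop_requested: AtomicBool::new(false),
        }
    }

    /// Handler side: record the request first, then forward it if the PID
    /// is already known. Returns the PID that should receive SIGTERM.
    fn on_sigint(&self) -> Option<i32> {
        self.stop_requested.store(true, Ordering::SeqCst);
        match self.child_pid.load(Ordering::SeqCst) {
            0 => None,
            pid => Some(pid),
        }
    }

    /// Main-thread side: publish the PID, then deliver a request that
    /// arrived before the PID was visible.
    fn publish_pid(&self, pid: i32) -> Option<i32> {
        self.child_pid.store(pid, Ordering::SeqCst);
        if self.stop_requested.load(Ordering::SeqCst) {
            Some(pid)
        } else {
            None
        }
    }
}
```

If the signal lands exactly between the two steps, both sides may return the PID and the child receives SIGTERM twice, which is assumed to be acceptable here. This is a stopgap that still relies on shared atomics; it closes the Q4 window but does not address the root cause described below.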
Root Cause
Four global atomics coordinate three threads and a signal handler. This is shared mutable state in concurrent code, and it violates Principle 1 of the project's architecture (philosophy.md: "Kill all globals").
Context
Found during CLI interaction contract enumeration when investigating the untested SIGTERM timeout warning messages (CI14/CI17). The warning messages are symptoms. The races are the disease.