Make testRunawayProcess() deterministic #295

Merged: iCharlesHu merged 1 commit into swiftlang:main from broken-circle:test-runaway-process-determinism on Jun 3, 2026

@broken-circle

Contributor

`testRunawayProcess()` fails about 0.05% of the time on macOS.

On cancellation, the test's teardown sent the child `SIGINT`, escalating to `SIGKILL` after 100ms; the `INT` trap was meant to kill the runaway grandchild and `exit 0`, and the child otherwise blocked in `wait "$child_pid"`. It failed intermittently in two ways, both from trapping an asynchronous signal around a blocking `wait`.

First, `signaled(11)`: bash interrupts the `wait` for `SIGINT` via `siglongjmp`, and on Darwin's system bash (3.2, arm64e) that `longjmp` occasionally fails pointer authentication and crashes the child with `SIGSEGV`:

Exception Type:    EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS ... (possible pointer authentication failure)
...
Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_platform.dylib      	       0x18cdfd7cc _longjmp + 72
1   ???                           	     0xc5dc358af08 ???
2   libsystem_platform.dylib      	       0x18cdff744 _sigtramp + 56
3   libsystem_platform.dylib      	       0x18cdfd810 setjmp + 28
4   bash                          	       ...

Second, `signaled(9)`: if the signal lands after bash last checks pending traps but before it blocks in `wait`, the trap is recorded but never run. A `wait` on a child that never exits offers no further safe point, so the child lives until teardown escalates to `SIGKILL`.
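The signal-then-escalate teardown that both failure modes interact with can be sketched in bash as follows. This is only a shell analogue for illustration: the real teardown is performed by the Swift test, and the function name `terminate_with_grace`, its parameters, and the 0.1-second tick are assumptions made for this example.

```shell
# Hypothetical helper: terminate_with_grace PID SIGNAL TICKS
# Sends SIGNAL to PID, then checks every 0.1s (up to TICKS times) whether
# the process is gone. If it is still alive afterwards, escalates to KILL.
# Returns 0 if the process exited within the grace period, 1 if it had to
# be killed. Fractional `sleep` is assumed to be supported (true on macOS
# and GNU coreutils, not guaranteed by POSIX).
terminate_with_grace() {
    local pid=$1 sig=$2 ticks=$3

    # If the signal cannot be delivered, the process is already gone.
    kill -s "$sig" "$pid" 2>/dev/null || return 0

    while [ "$ticks" -gt 0 ]; do
        # `kill -0` sends no signal; it only tests whether PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 0.1
        ticks=$((ticks - 1))
    done

    # Grace period expired: escalate.
    kill -s KILL "$pid" 2>/dev/null
    return 1
}
```

With a grace period this short, a child whose trap never runs is indistinguishable from one that is merely slow, which is how the second failure mode surfaces as `signaled(9)`.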

This PR sends `SIGTERM` instead. Its trap runs at a deferred safe point rather than `longjmp`-ing out of the handler, so the crash cannot occur. The script now polls with `kill -0`/`sleep` rather than blocking in `wait`, so a deferred trap runs within one short interval, and the grace period is widened so that it comfortably exceeds that interval and the trap is serviced before escalation. With these changes, the test passes when run 10,000 times consecutively.
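The polling approach described above can be sketched in bash roughly as follows. This is a minimal illustration under stated assumptions, not the script from the test: the function name `run_child`, the `sleep 1000` stand-in for the runaway grandchild, and the 0.1-second poll interval are all hypothetical.

```shell
# Hypothetical sketch of the child script's shape after the change.
# Intended to run in its own shell process, since the trap calls `exit`.
run_child() {
    # Stand-in for the runaway grandchild: never exits on its own.
    sleep 1000 &
    child_pid=$!

    # On TERM, kill the grandchild and exit cleanly. Double quotes expand
    # the PID now, so the trap does not depend on variable scope later.
    trap "kill $child_pid 2>/dev/null; exit 0" TERM

    # Poll instead of blocking in `wait`. bash defers a trap that arrives
    # while a foreground command is running, so each short `sleep` bounds
    # how long a pending TERM trap can be delayed. Fractional `sleep` is
    # assumed to be supported (macOS and GNU coreutils).
    while kill -0 "$child_pid" 2>/dev/null; do
        sleep 0.1
    done
}
```

Run as `( run_child ) &` and sent `SIGTERM`, the subshell should service the trap within about one poll interval, which is why the grace period before `SIGKILL` needs to be longer than that interval.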

I considered removing bash entirely, but that would require a purpose-built helper for various platforms and didn't seem worth it.

On cancellation, the test's teardown sent the child `SIGINT`, escalating
to `SIGKILL` after 100ms; the `INT` trap was meant to kill the runaway
grandchild and `exit 0`, and the child otherwise blocked in
`wait "$child_pid"`. It failed intermittently in two ways, both from
trapping an asynchronous signal around a blocking `wait`.

First, `signaled(11)`: bash interrupted the `wait` for `SIGINT` via
`siglongjmp`, and on Darwin's system bash (3.2, arm64e) that `longjmp`
occasionally failed pointer authentication and crashed the child with
`SIGSEGV`.

Second, `signaled(9)`: if the signal landed after bash last checked
pending traps, but before it blocked in `wait`, the trap was recorded
but never run. A `wait` on a child that never exited offers no further
safe point, so the child lived until teardown escalated to `SIGKILL`.

Send `SIGTERM` instead. Its trap runs at a deferred safe point rather
than `longjmp`-ing out of the handler, so the crash cannot occur. Poll
with `kill -0`/`sleep` rather than blocking in `wait`, so a deferred
trap runs within one short interval; also widen the grace period so it
comfortably exceeds that interval and the trap is serviced before
escalation.
@broken-circle broken-circle requested a review from iCharlesHu as a code owner June 3, 2026 16:10
@iCharlesHu iCharlesHu merged commit 366b569 into swiftlang:main Jun 3, 2026
44 checks passed
@broken-circle broken-circle deleted the test-runaway-process-determinism branch June 5, 2026 22:27