
Lock-order inversion deadlock in AsyncCurrentValue when cancellation races an update #153

@gqbit

Description


A deadlock occurs in AsyncCurrentValue due to a lock-order inversion between the internal mutex and the Swift runtime's task-status lock. When FlagChangeStream is being consumed and the consuming task is cancelled concurrently with a value update, both threads block waiting for each other's lock indefinitely.

This reliably reproduces within seconds (typically on the first few hundred iterations of the attached test).
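For context, the hazardous shape looks roughly like the sketch below (simplified, illustrative names only, not Vexil's exact declarations): the property observer resumes waiting continuations while the mutex's critical section is still active, so the runtime's task-status lock is acquired with the mutex held.

import Synchronization

// Illustrative sketch of the hazardous shape (simplified names, not Vexil's exact code).
final class Box<Wrapped: Sendable>: Sendable {
    struct State {
        var pendingContinuations: [CheckedContinuation<Wrapped, Never>] = []
        var wrappedValue: Wrapped {
            didSet {
                // Fires inside the Mutex critical section below, so resume() acquires
                // each waiting task's status lock while the mutex is still held.
                for continuation in pendingContinuations {
                    continuation.resume(returning: wrappedValue)
                }
                pendingContinuations.removeAll()
            }
        }
    }

    let mutex: Mutex<State>

    init(_ initialValue: Wrapped) {
        mutex = Mutex(State(wrappedValue: initialValue))
    }

    func update(to newValue: Wrapped) {
        mutex.withLock { state in
            state.wrappedValue = newValue   // didSet runs here, with the lock held
        }
    }
}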

Vexil version: 3.0.0-beta.2 / main branch
Swift version: swift-driver version: 1.148.6 Apple Swift version 6.3.1 (swiftlang-6.3.1.1.2 clang-2100.0.123.102) Target: arm64-apple-macosx26.0
Environment: Xcode 26.4.1 (17E202), macOS 26.4 (25E246)

✅ Checklist

  • If possible, I've reproduced the issue using the main branch of this package
  • I've searched for existing GitHub issues

🔢 Steps to Reproduce

  1. Unzip the attached reproducer project (VexilReproducer.zip)
  2. Run swift test
  3. The test deadlocks within seconds. A watchdog aborts after 55s and writes thread backtraces to /tmp/vexil-deadlock-sample-<PID>.txt

The test races AsyncCurrentValue.update (which fires continuation.resume() inside the mutex via wrappedValue.didSet) against Task.cancel() on a task suspended in FlagChangeStream.AsyncIterator.next(isolation:).
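The racing test has roughly the shape sketched below; makeChangeStream() and bumpFlagValue() are hypothetical stand-ins for the reproducer's actual setup (DeadlockTests.swift in the attached project is the authoritative version):

import Testing

// Rough shape of the race. makeChangeStream() and bumpFlagValue() are
// hypothetical stand-ins for the reproducer's setup.
@Test func noDeadlockOnConcurrentCancelAndUpdate() async {
    for _ in 0 ..< 100_000 {
        // Suspends in FlagChangeStream.AsyncIterator.next(isolation:) waiting for a change.
        let consumer = Task {
            for await _ in makeChangeStream() { break }
        }
        // Drives AsyncCurrentValue.update, which resumes the waiting continuation.
        let updater = Task {
            bumpFlagValue()
        }
        consumer.cancel()          // races the resume that happens inside the mutex
        await consumer.value
        await updater.value
    }
}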

Deadlock diagram
Thread A (update):                        Thread B (cancel):

AsyncCurrentValue.update {                Task.cancel()
  mutex.withLock {                          ↓
    ┌─── HOLDS: mutex                     withStatusRecordLock {
    │                                       ┌─── HOLDS: task-status lock
    │  wrappedValue.didSet {                │
    │    continuation.resume()              │  onCancel: {
    │      ↓                                │    mutex.withLock {
    │    flagAsAndEnqueueOnExecutor         │      ↓
    │      ↓                                │    BLOCKED waiting for mutex
    │    withStatusRecordLock {             │
    │      ↓                                │
    │    BLOCKED waiting for                │
    │    task-status lock                   │
    └───────────────────────────────────────┘

🎯 Expected behavior

The test completes all 100,000 iterations without hanging. Cancelling a task that is suspended on FlagChangeStream should not deadlock regardless of concurrent updates.

🕵️‍♀️ Actual behavior

The test deadlocks within seconds. The sample output shows two threads blocked on _os_unfair_lock_lock_slow (__ulock_wait2):

  • Thread A: AsyncCurrentValue.update → Mutex.withLock → wrappedValue.didSet → continuation.resume → flagAsAndEnqueueOnExecutor → withStatusRecordLock, blocked waiting for the task-status lock
  • Thread B: swift_task_cancelImpl → withStatusRecordLock → onCancel: closure → Mutex.withLock, blocked waiting for the mutex
Relevant sample output
Thread_4432711   DispatchQueue_15: com.apple.root.default-qos.cooperative  (concurrent)
  noDeadlockOnConcurrentCancelAndUpdate()  DeadlockTests.swift:82
    swift_task_cancelImpl
      withStatusRecordLock
        FlagChangeStream.AsyncIterator.next onCancel: closure  FlagChangeStream.swift:66
          Mutex<>.withLock  Lock.swift:37
            _os_unfair_lock_lock_slow
              __ulock_wait2    ← BLOCKED on mutex

Thread_4432713   DispatchQueue_15: com.apple.root.default-qos.cooperative  (concurrent)
  closure #3 in noDeadlockOnConcurrentCancelAndUpdate()  DeadlockTests.swift:81
    AsyncCurrentValue.update  AsyncCurrentValue.swift:72
      Mutex<>.withLock  Lock.swift:37
        AsyncCurrentValue.State.wrappedValue.didset  AsyncCurrentValue.swift:26
          CheckedContinuation.resume
            swift_continuation_throwingResumeImpl
              flagAsAndEnqueueOnExecutor
                updateStatusRecord → withStatusRecordLock
                  _os_unfair_lock_lock_slow
                    __ulock_wait2    ← BLOCKED on task-status lock

💡 Suggested Fix

The root cause is that continuation.resume() is called inside the mutex's critical section (in wrappedValue.didSet). The fix is to collect continuations while holding the lock, then resume them after releasing it:

func update<R: Sendable>(_ body: (inout sending Wrapped) throws -> R) rethrows -> R {
    let continuationsToResume: [(Int, Wrapped, CheckedContinuation<(Int, Wrapped)?, Never>)]
    let result: R

    (result, continuationsToResume) = try allocation.mutex.withLock { state in
        let value = try body(&state.wrappedValue)   // perform the mutation
        let toResume = state.pendingContinuations
        state.pendingContinuations = []
        return (value, toResume.map { (state.generation, state.wrappedValue, $0.1) })
    }

    // Resume AFTER releasing the mutex — safe from lock-order inversion
    for (generation, value, continuation) in continuationsToResume {
        continuation.resume(returning: (generation, value))
    }

    return result
}
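With this ordering the mutex is released before any continuation is resumed, so no thread ever holds the mutex while acquiring the task-status lock. The cancel path can still hold the task-status lock while briefly waiting for the mutex, but because the reverse wait no longer occurs, the cycle shown in the deadlock diagram cannot form.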

VexilReproducer.zip
