Fix TCP RX softirq wake deadlock #400

Merged
ryanbreen merged 1 commit into main from feat/net-rx-stall-fix on May 31, 2026

Conversation

@ryanbreen
Owner

Summary

  • Mask TCP global connection/listener/ISN locks with irq_save/irq_restore so same-CPU NetRx softirq cannot reenter while a thread owns a TCP lock.
  • Keep NetRx wakeups lock-free by using the scheduler ISR wake buffer from TCP softirq wake paths.
  • Extend the ISR wake-buffer drain to also wake legacy ThreadState::Blocked socket waiters, preserving existing syscall blocking semantics without editing syscall/*.
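
As a rough illustration of the second and third bullets, here is a minimal userspace sketch of a wake buffer that softirq code can fill without taking a lock, plus a drain step that moves Blocked waiters to Ready. It is not the Breenix implementation: `IsrWakeBuffer`, `push`, `drain`, `wake_if_blocked`, the `Running` and `Ready` states, and the "thread id 0 means empty" convention are all assumptions made for this example. Only `ThreadState::Blocked` is named in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical thread states; only `Blocked` is named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadState {
    Running,
    Ready,
    Blocked,
}

/// Assumption for this sketch: thread id 0 is reserved to mean "empty slot".
const EMPTY: u64 = 0;

/// Fixed-size wake buffer that can be filled without taking a lock.
pub struct IsrWakeBuffer<const N: usize> {
    slots: [AtomicU64; N],
}

impl<const N: usize> IsrWakeBuffer<N> {
    pub fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| AtomicU64::new(EMPTY)),
        }
    }

    /// Softirq side: claim the first free slot with a compare-and-swap.
    /// Returns false when every slot is occupied.
    pub fn push(&self, tid: u64) -> bool {
        assert_ne!(tid, EMPTY, "thread id 0 is reserved in this sketch");
        for slot in &self.slots {
            if slot
                .compare_exchange(EMPTY, tid, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return true;
            }
        }
        false
    }

    /// Scheduler side: take every pending id and pass it to `wake`.
    pub fn drain(&self, mut wake: impl FnMut(u64)) {
        for slot in &self.slots {
            let tid = slot.swap(EMPTY, Ordering::AcqRel);
            if tid != EMPTY {
                wake(tid);
            }
        }
    }
}

/// Drain step for a legacy waiter: `Blocked` becomes `Ready`,
/// any other state is left untouched.
pub fn wake_if_blocked(state: &mut ThreadState) {
    if *state == ThreadState::Blocked {
        *state = ThreadState::Ready;
    }
}
```

In this sketch the softirq path only ever performs atomic operations on the slots, so it cannot block on a lock held by the thread it interrupted. The scheduler calls `drain` later from a context where changing thread state is safe.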

Root Cause

The RX stall was a same-CPU lock reentry deadlock. A userspace TCP syscall path could hold TCP_CONNECTIONS or TCP_LISTENERS, then a NIC IRQ/NetRx softirq ran on the same CPU and entered handle_tcp(), which tried to take the same TCP lock. The softirq could not acquire the lock until the interrupted owner released it, and the interrupted owner could not resume until the softirq returned, so NetRx stayed in progress with NET_RX_PROCESSING_HELD=1, NET_RX_SOFTIRQ_ENTRY_TOTAL = NET_RX_SOFTIRQ_EXIT_TOTAL + 1, and NET_PCI_RX_USED_IDX ahead of NET_PCI_RX_LAST_USED_IDX.

Negative control artifacts:
/Users/wrb/Downloads/Ralph/breenix-interrupt-io-roadmap-1780056222/turn94-artifacts/negative-control/

Proof

Build gate:

  • cargo build --release --features testing,external_test_bins --bin qemu-uefi passed clean with zero warning/error lines.

Required Parallels proof, same boot, no VM restart:

  • Artifacts: /Users/wrb/Downloads/Ralph/breenix-interrupt-io-roadmap-1780056222/turn94-artifacts/fixed-proof-3/
  • Ran 4 back-to-back cycles of outbound bssh 10.0.1.210 22 wrb --publickey --exec uname followed immediately by inbound host -> Breenix ssh ... 'echo AFTER && uname'.
  • All 4 outbound runs returned real Darwin, rc=0.
  • All 4 inbound runs returned AFTER and Breenix uname, rc=0.

Counter evidence from the fixed run:

  • sample 3: NET_RX_SOFTIRQ_ENTRY_TOTAL=110, NET_RX_SOFTIRQ_EXIT_TOTAL=110, NET_RX_PROCESSING_HELD=0, NET_PCI_RX_USED_IDX=180, NET_PCI_RX_LAST_USED_IDX=180.
  • sample 4: NET_RX_SOFTIRQ_ENTRY_TOTAL=237, NET_RX_SOFTIRQ_EXIT_TOTAL=237, NET_RX_PROCESSING_HELD=0, NET_PCI_RX_USED_IDX=374, NET_PCI_RX_LAST_USED_IDX=374.
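
The balance condition these samples are checked against can be written as a small predicate. This is a sketch only: `NetRxSample`, `RxHealth`, and `health` are hypothetical names, the fields mirror the counter names above, and `u64` is assumed for every counter (the real index counters may be narrower and wrap). A single unbalanced sample can also mean NetRx is merely mid-run, so a stall is only indicated when the imbalance persists across samples.

```rust
/// One snapshot of the NetRx counters quoted in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NetRxSample {
    pub softirq_entry_total: u64,
    pub softirq_exit_total: u64,
    pub processing_held: u64,
    pub pci_rx_used_idx: u64,
    pub pci_rx_last_used_idx: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RxHealth {
    /// Every entry has a matching exit and the ring is fully consumed.
    Balanced,
    /// NetRx is mid-run, or stalled if this persists across samples.
    InProgressOrStalled,
}

impl NetRxSample {
    pub fn health(&self) -> RxHealth {
        let balanced = self.softirq_entry_total == self.softirq_exit_total
            && self.processing_held == 0
            && self.pci_rx_used_idx == self.pci_rx_last_used_idx;
        if balanced {
            RxHealth::Balanced
        } else {
            RxHealth::InProgressOrStalled
        }
    }
}
```

Both samples from the fixed run satisfy the predicate. The pre-fix signature described under Root Cause (entry one ahead of exit, processing held, used index ahead of last-used index) does not.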

Additional check:

  • cargo run -p xtask -- boot-stages was attempted after the proof. It passed 63 stages including the softirq stages, then timed out on the existing x86 ARP gateway marker ([22/252] ARP reply received and gateway MAC resolved). QEMU was killed afterward.

Mask TCP global locks against same-CPU NetRx softirq reentry and let the lock-free ISR wake drain resume legacy Blocked socket waiters.

Validated with four back-to-back outbound bssh -> inbound bsshd SSH cycles on one Parallels boot; RX counters stayed balanced with processing-held clear.

Co-authored-by: Ryan Breen <ryan@breen.com>

Co-authored-by: Claude Code <noreply@anthropic.com>
@ryanbreen ryanbreen merged commit 2482be1 into main May 31, 2026
@ryanbreen ryanbreen deleted the feat/net-rx-stall-fix branch May 31, 2026 15:50