🐛 fix(execute): adopt CPython subprocess stream handling #3715

Merged
gaborbernat merged 4 commits into tox-dev:main from gaborbernat:local-sub
Feb 17, 2026

@gaborbernat gaborbernat commented Feb 16, 2026

Subprocess output reading could deadlock on Windows during parallel test execution or when handling large amounts of output. The root cause was a chicken-and-egg problem where reader threads waited for the subprocess to close its pipes, but the subprocess wouldn't be terminated until after the readers exited. 🔒 This manifested as timeouts in CI when tox tried to clean up long-running backend processes.

The fix adopts CPython's approach to subprocess stream handling. On Unix, we switched from select.select() to the selectors module with proper EOF detection and interrupt handling. On Windows, we use overlapped I/O with non-blocking polling, mirroring CPython's implementation. Most importantly, we reordered the shutdown sequence in pep517_backend.py to terminate subprocesses before stopping reader threads, ensuring pipes close and pending I/O operations complete naturally. ⚡

This eliminates arbitrary timeouts and race conditions in the process cleanup logic. The implementation now handles EINTR (system calls interrupted by a signal) gracefully and reads larger chunks (32KB instead of 1KB) for better performance with high-volume output.
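The reordered shutdown can be illustrated with a minimal sketch, assuming a reader thread that blocks on a pipe until EOF. This is not the code in pep517_backend.py: `read_until_eof`, `run_and_shut_down`, and the stand-in child process are hypothetical names used only for illustration.

```python
import subprocess
import sys
import threading

CHUNK = 32 * 1024  # 32KB per read, as described above


def read_until_eof(stream, sink):
    """Pass chunks to sink until the pipe reports EOF (an empty read)."""
    while True:
        data = stream.read(CHUNK)
        if not data:
            break
        sink(data)


def run_and_shut_down():
    # Hypothetical stand-in for a long-running backend: prints once, then idles.
    code = "import time; print('ready', flush=True); time.sleep(60)"
    child = subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        bufsize=0,  # raw pipe: read() returns as soon as some data is available
    )
    chunks = []
    first_chunk = threading.Event()

    def sink(data):
        chunks.append(data)
        first_chunk.set()

    reader = threading.Thread(
        target=read_until_eof, args=(child.stdout, sink), daemon=True
    )
    reader.start()
    first_chunk.wait(timeout=30)  # the stand-in backend is up and has written

    # 1. Terminate the subprocess first: its end of the pipe closes.
    child.terminate()
    child.wait()
    # 2. Only then wait for the reader, which sees EOF and returns on its own.
    reader.join(timeout=30)
    child.stdout.close()
    return b"".join(chunks), reader.is_alive(), child.returncode
```

Joining the reader before terminating the child would block in this sketch for as long as the child keeps its pipe open, which is the ordering problem the description refers to.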

References:

…adlocks

The previous implementation used a two-phase approach with stop events and
separate drain phases, leading to race conditions where data could arrive
between thread stop and drain execution. This caused occasional deadlocks
and incomplete stream reading, particularly during interrupt handling.

Replaced the custom implementation with CPython's proven approach:

Unix: Use selectors.DefaultSelector (poll/epoll/kqueue) instead of basic
select.select, with 32KB chunks instead of 1KB. Properly handle EINTR for
signal safety during interrupt scenarios.

Windows: Simplified from complex overlapped I/O to straightforward blocking
reads with fh.read(32KB) in a loop. Removed asyncio dependencies and the
error-prone overlapped mechanism that was causing sporadic failures.

Both platforms now read until EOF naturally rather than checking stop events
in the hot path, only consulting the stop flag between select() calls. The
drain phase is conditionally executed based on the drain parameter, allowing
long-running processes like the pep517 backend to skip unnecessary draining.
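
For illustration, the Unix read loop described above might look roughly like the following. It is a sketch under stated assumptions, not the code from this commit: `pump`, `demo`, and the `POLL_TIMEOUT` value are hypothetical, and the loop only works on platforms where `selectors` supports pipes (not Windows).

```python
import os
import selectors
import subprocess
import sys
import threading

CHUNK = 32 * 1024     # 32KB per read instead of 1KB
POLL_TIMEOUT = 0.05   # illustrative value: seconds between stop-flag checks


def pump(streams, stop):
    """Read every pipe in streams until EOF or until stop is set.

    streams maps a name to a pipe-backed file object; returns {name: bytes}.
    """
    output = {name: bytearray() for name in streams}
    with selectors.DefaultSelector() as selector:
        for name, stream in streams.items():
            selector.register(stream, selectors.EVENT_READ, data=name)
        # An empty selector map means every pipe has reached EOF.
        # The stop flag is consulted only between select() calls.
        while selector.get_map() and not stop.is_set():
            for key, _ in selector.select(timeout=POLL_TIMEOUT):
                try:
                    data = os.read(key.fd, CHUNK)
                except InterruptedError:
                    # EINTR. Since Python 3.5 (PEP 475) os.read retries
                    # automatically unless a signal handler raises, so this
                    # branch is defensive; the next pass retries the read.
                    continue
                if data:
                    output[key.data] += data
                else:
                    selector.unregister(key.fileobj)  # EOF on this pipe
    return {name: bytes(buf) for name, buf in output.items()}


def demo():
    # Hypothetical child that writes more than one chunk to stdout.
    code = "import sys; sys.stdout.write('x' * 100_000); sys.stderr.write('err')"
    with subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    ) as proc:
        return pump({"stdout": proc.stdout, "stderr": proc.stderr}, threading.Event())
```

Calling `demo()` returns the captured bytes per stream. Because the stop event is never set there, the loop ends only when both pipes reach EOF.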

This matches the battle-tested implementation from CPython's subprocess.py
which has handled these edge cases correctly for years.

Threads were deadlocking on Windows because `ov.getresult(True)` blocks
indefinitely and cannot respond to stop events. When the main thread set
the stop event to terminate subprocess readers, the I/O threads remained
stuck in the blocking wait, causing test timeouts and preventing graceful
shutdown.

Changed to `getresult(False)` with periodic polling that checks the stop
event every 50ms. This preserves Windows' efficient overlapped I/O mechanism
while allowing threads to exit cleanly when signaled. The polling interval
matches the Unix implementation's timeout for consistency across platforms.

@gaborbernat gaborbernat changed the title from "⚡ perf(execute): adopt CPython's subprocess stream handling to fix deadlocks" to "⚡ perf(execute): adopt CPython subprocess stream handling" on Feb 16, 2026
@gaborbernat gaborbernat changed the title from "⚡ perf(execute): adopt CPython subprocess stream handling" to "🐛 fix(execute): adopt CPython's subprocess stream handling to fix deadlocks" on Feb 16, 2026
@gaborbernat gaborbernat force-pushed the local-sub branch 5 times, most recently from bc10ac5 to dfff2a5 on February 16, 2026 23:57

The polling approach with `getresult(False)` was losing subprocess output data
because the read thread would exit when the stop event was set before data
could be fully captured, causing 29 test failures on Windows CI.

Switched to blocking `getresult(True)` which ensures complete data capture for
each overlapped read operation. The stop event is checked between complete
reads rather than during them, allowing threads to exit cleanly while
guaranteeing all subprocess output is processed.

Reader threads on Windows were deadlocking because pep517_backend.close()
called execute.__exit__() before terminating the subprocess. This created
a chicken-and-egg problem where readers couldn't exit because the subprocess
still had open pipes, but the subprocess wouldn't be killed until readers
exited first.

Reordered operations to terminate the subprocess first, which closes the
pipes and allows overlapped I/O operations to complete naturally. This
eliminates the need for arbitrary timeouts in the Windows reader polling
loop, resulting in cleaner code that relies on proper process lifecycle
management instead of guessing when it's safe to give up.

@gaborbernat gaborbernat marked this pull request as ready for review February 17, 2026 00:40
@gaborbernat gaborbernat changed the title from "🐛 fix(execute): adopt CPython's subprocess stream handling to fix deadlocks" to "🐛 fix(execute): adopt CPython subprocess stream handling" on Feb 17, 2026
@gaborbernat gaborbernat merged commit 96a40f2 into tox-dev:main Feb 17, 2026
28 checks passed
@gaborbernat gaborbernat commented

This is now available in version 4.36.1 🎉
