🐛 fix(test): prevent PowerShell activation test from crashing xdist workers on Windows #3128

Draft
gaborbernat wants to merge 8 commits into pypa:main from gaborbernat:fix/powershell-test-flaky-windows

@gaborbernat gaborbernat commented Apr 19, 2026

The scheduled CI on main has been failing consistently over the past week — 7 out of 9 failures trace back to test_powershell crashing pytest-xdist workers on windows-2025 runners. The crash cascades: after the worker dies, remaining tests can't finish within the 30-minute CI timeout, so the entire job fails.

The root cause is a timeout race. 🏁 The communicate(timeout=120) in the activation test conftest matches the global pytest-timeout of 120 seconds exactly. When PowerShell hangs, the test setup consumes ~10 seconds before communicate() starts, so pytest-timeout fires first. On Windows, that plugin uses _thread.interrupt_main() to raise KeyboardInterrupt in the main thread — but when that thread is blocked in the C-level communicate() call, the interrupt doesn't unwind cleanly in xdist workers, killing the worker process ("node down: Not properly terminated").

The fix redirects stdin to subprocess.DEVNULL on all Popen calls in the activation tests, preventing shells from blocking on stdin in xdist workers where stdin state is undefined. 🔧 The communicate() timeout drops from 120s to 60s so the explicit timeout handler always fires before pytest-timeout, letting the clean process.kill() path run. Previously unbounded communicate() calls (get_version, post-kill cleanup, RaiseOnNonSourceCall) now have timeouts too.
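The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `run_with_deadline`, its return shape, and the 10-second cleanup timeout are assumptions made for the example. Only `stdin=subprocess.DEVNULL`, the bounded `communicate()` calls, and the `process.kill()` path come from the description.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout=60.0, cleanup_timeout=10.0):
    """Run cmd with stdin detached and kill it if it exceeds timeout.

    Hypothetical helper: returns (returncode, stdout, stderr, timed_out).
    """
    process = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,  # the child can never block waiting for input
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        # Keep this well below the global pytest-timeout so this handler wins.
        out, err = process.communicate(timeout=timeout)
        return process.returncode, out, err, False
    except subprocess.TimeoutExpired:
        process.kill()
        try:
            # The post-kill cleanup is bounded as well (value is an assumption).
            out, err = process.communicate(timeout=cleanup_timeout)
        except subprocess.TimeoutExpired:
            out, err = b"", b""
        return process.returncode, out, err, True


if __name__ == "__main__":
    code = "import sys; print(repr(sys.stdin.read()))"
    print(run_with_deadline([sys.executable, "-c", code], timeout=30.0))
```

With stdin pointed at the null device, a child that reads stdin sees end-of-file immediately instead of waiting on whatever the xdist worker inherited.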

Add stdin=subprocess.DEVNULL to all Popen calls in activation
tests to prevent subprocesses blocking on stdin in xdist workers.

Reduce the communicate timeout from 120s to 60s so the explicit
timeout handler fires before pytest-timeout (120s), avoiding a
raw KeyboardInterrupt that crashes xdist workers on Windows. Add
timeouts to previously unbounded communicate calls.
@gaborbernat gaborbernat changed the title Fix flaky PowerShell activation test on Windows 🐛 fix(test): prevent PowerShell activation test from crashing xdist workers on Windows Apr 19, 2026
@gaborbernat gaborbernat enabled auto-merge (squash) April 19, 2026 12:54
PowerShell hangs indefinitely on Windows Server 2025 GHA
runners during activation testing. Mark as xfail(strict=False)
so CI stays green while still running the test.
@gaborbernat gaborbernat marked this pull request as draft April 19, 2026 13:16
auto-merge was automatically disabled April 19, 2026 13:16

Pull request was converted to draft

Add echo markers between each test script command so timeout
failures show which step was last completed. Capture and display
partial output in the failure message for easier diagnosis.
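
As a rough illustration of the marker idea in this commit, the sketch below interleaves echo lines with the script commands and recovers the last completed step from partial output. The marker format `STEP-<n>-done` and both function names are hypothetical, chosen only for this example.

```python
def add_step_markers(commands, prefix="STEP"):
    """Return the script lines with an echo marker after each command."""
    lines = []
    for index, command in enumerate(commands, start=1):
        lines.append(command)
        lines.append(f"echo {prefix}-{index}-done")
    return lines


def last_completed_step(output, prefix="STEP"):
    """Return the highest step number whose marker appears in output, or 0."""
    last = 0
    suffix = "-done"
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(f"{prefix}-") and line.endswith(suffix):
            number = line[len(prefix) + 1 : -len(suffix)]
            if number.isdigit():
                last = max(last, int(number))
    return last


if __name__ == "__main__":
    script = add_step_markers(["python -c \"print('a')\"", "deactivate"])
    print("\n".join(script))
    print(last_completed_step("a\nSTEP-1-done\n"))
```

On a timeout, the captured partial output can be passed to `last_completed_step` and the result included in the failure message.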

Each python -c one-liner inside the activation test script was
inheriting COVERAGE_PROCESS_START/COVERAGE_RUN from the parent
pytest process. This caused every subprocess to start coverage
measurement and attempt to write to the same SQLite coverage
database on exit. On Windows, the file lock contention with
the main pytest/coverage process caused subprocesses to hang
indefinitely, never exiting and blocking the PowerShell script.
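
One way to keep child processes from inheriting these variables is to build a filtered environment and pass it to the subprocess. This is a sketch under the assumption that dropping the two variables is enough; the helper name `env_without_coverage` is hypothetical and the PR may solve this differently.

```python
import os

# The two variables named in the commit message.
COVERAGE_VARS = ("COVERAGE_PROCESS_START", "COVERAGE_RUN")


def env_without_coverage(base=None):
    """Return a copy of the environment without the coverage variables.

    Names are compared in upper case because environment variable names
    are case-insensitive on Windows.
    """
    source = os.environ if base is None else base
    return {key: value for key, value in source.items() if key.upper() not in COVERAGE_VARS}


if __name__ == "__main__":
    clean = env_without_coverage()
    print(sorted(name for name in clean if name.upper().startswith("COVERAGE")))
```

The resulting mapping can be passed as `env=` to `subprocess.Popen` so the one-liners start without coverage measurement.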

The first PowerShell invocation on Windows 2025 GHA runners takes
47-62 seconds due to cold start (.NET JIT, AMSI scanning).
Increase communicate timeout from 60s to 90s to accommodate
this, and set pytest-timeout to 180s on the test to prevent
the timeout race that crashed xdist workers.

The first powershell.exe invocation on Windows 2025 GHA runners
takes 47-62s due to .NET JIT and AMSI cold start. Add a warmup
step in CI so the cost is paid outside the test timeout.