Read stderr from layer subprocesses in a thread

If the test runner runs a layer in a subprocess (because a previous layer could not be torn down), the child process reports the IDs of any failing or erroring tests to its parent using a simple protocol over its stderr. However, the parent doesn't start reading stderr until it has read everything from the child's stdout and seen end-of-file on that pipe, which doesn't happen until the child has finished writing everything to stderr and exited.

This can result in a deadlock if the child process encounters enough failures or errors that their test IDs overflow the capacity of a pipe, which on Linux >= 2.6.11 is 65536 bytes by default.

To avoid this, read the child's stderr in a thread. (On Unix, we could use select instead, but that doesn't work for pipes on Windows, and we're already using threads here anyway.)

Fixes #105.
zopefoundation · Jun 18, 2020 · 1cfd75f
1 parent baf5486

Showing 4 changed files with 116 additions and 2 deletions.
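The pattern described in the commit message can be sketched outside the test runner as follows. This is a minimal illustration, not the runner's actual code: the child script `CHILD_CODE`, the function name `run_child_draining_stderr`, and the 200000-byte payload are all hypothetical, chosen only so that the child writes more to stderr than a typical pipe can hold before it writes to stdout.

```python
import subprocess
import sys
import threading

# Hypothetical child: fills stderr well past a typical pipe capacity
# *before* writing a short line to stdout.
CHILD_CODE = (
    "import sys\n"
    "sys.stderr.write('x' * 200000)\n"
    "sys.stderr.flush()\n"
    "sys.stdout.write('done\\n')\n"
)


def run_child_draining_stderr():
    """Run the child, reading its stderr in a thread while reading stdout."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def reader_thread(f, buf):
        # Blocks until the child closes its end of the pipe.
        buf.append(f.read())

    # Drain stderr in the background so the child never blocks on a
    # full stderr pipe while the parent is busy with stdout.
    stderr_buf = []
    stderr_thread = threading.Thread(
        target=reader_thread, args=(child.stderr, stderr_buf))
    stderr_thread.daemon = True
    stderr_thread.start()

    out = child.stdout.read()  # reaches EOF once the child exits
    stderr_thread.join()
    child.wait()
    child.stdout.close()
    child.stderr.close()
    return out, stderr_buf[0]


if __name__ == "__main__":
    out, err = run_child_draining_stderr()
    print(out.strip(), len(err))
```

If the parent instead called `child.stdout.read()` first and only afterwards `child.stderr.read()`, this child would be expected to block in its stderr write once the pipe filled, and the parent would wait for a stdout end-of-file that never arrives.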
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -7,6 +7,11 @@
 
 - Add support for Python 3.8.
 
+- When a layer is run in a subprocess, read its stderr in a thread to avoid
+  a deadlock if its stderr output (containing failing and erroring test IDs)
+  overflows the capacity of a pipe (`#105
+  <https://github.com/zopefoundation/zope.testrunner/issues/105>`_).
+
 
 5.1 (2019-10-19)
 ================

diff --git a/src/zope/testrunner/runner.py b/src/zope/testrunner/runner.py
@@ -539,6 +539,17 @@ def spawn_layer_in_subprocess(result, script_parts, options, features,
             stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd,
             close_fds=not sys.platform.startswith('win'))
 
+        def reader_thread(f, buf):
+            buf.append(f.read())
+
+        # Start reading stderr in a thread.  This means we don't hang if the
+        # subprocess writes more to stderr than the pipe capacity.
+        stderr_buf = []
+        stderr_thread = threading.Thread(
+            target=reader_thread, args=(child.stderr, stderr_buf))
+        stderr_thread.daemon = True
+        stderr_thread.start()
+
         while True:
             try:
                 while True:
@@ -564,8 +575,9 @@ def spawn_layer_in_subprocess(result, script_parts, options, features,
             else:
                 break
 
-        # Now stderr should be ready to read the whole thing.
-        errlines = child.stderr.read().splitlines()
+        # Now we should be able to finish reading stderr.
+        stderr_thread.join()
+        errlines = stderr_buf[0].splitlines()
         erriter = iter(errlines)
         nfail = nerr = 0
         for line in erriter:

diff --git a/src/zope/testrunner/tests/testrunner-ex/sampletests_many.py b/src/zope/testrunner/tests/testrunner-ex/sampletests_many.py
@@ -0,0 +1,75 @@
+##############################################################################
+#
+# Copyright (c) 2020 Zope Foundation and Contributors.
+# All Rights Reserved.
+#
+# This software is subject to the provisions of the Zope Public License,
+# Version 2.1 (ZPL).  A copy of the ZPL should accompany this distribution.
+# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY AND ALL EXPRESS OR IMPLIED
+# WARRANTIES ARE DISCLAIMED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+# WARRANTIES OF TITLE, MERCHANTABILITY, AGAINST INFRINGEMENT, AND FITNESS
+# FOR A PARTICULAR PURPOSE.
+#
+##############################################################################
+"""A large number of sample tests."""
+
+import unittest
+
+
+class Layer1:
+    """A layer that can't be torn down."""
+
+    @classmethod
+    def setUp(self):
+        pass
+
+    @classmethod
+    def tearDown(self):
+        raise NotImplementedError
+
+
+class Layer2:
+
+    @classmethod
+    def setUp(self):
+        pass
+
+    @classmethod
+    def tearDown(self):
+        pass
+
+
+class TestNoTeardown(unittest.TestCase):
+
+    layer = Layer1
+
+    def test_something(self):
+        pass
+
+
+def make_TestMany():
+    attrs = {'layer': Layer2}
+    # Add enough failing test methods to make the concatenation of all their
+    # test IDs (formatted as "test_foo (sampletests_many.TestMany)")
+    # overflow the capacity of a pipe.  This is system-dependent, but on
+    # Linux since 2.6.11 it defaults to 65536 bytes, so will overflow by the
+    # time we've written 874 of these test IDs.  If the pipe capacity is
+    # much larger than that, then this test might be ineffective.
+    for i in range(1000):
+        attrs['test_some_very_long_test_name_with_padding_%03d' % i] = (
+            lambda self: self.fail())
+    return type('TestMany', (unittest.TestCase,), attrs)
+
+
+TestMany = make_TestMany()
+
+
+def test_suite():
+    suite = unittest.TestSuite()
+    suite.addTest(unittest.makeSuite(TestNoTeardown))
+    suite.addTest(unittest.makeSuite(TestMany))
+    return suite
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/src/zope/testrunner/tests/testrunner-layers-ntd.rst b/src/zope/testrunner/tests/testrunner-layers-ntd.rst
@@ -268,4 +268,26 @@ like it once did).
 
     >>> sys.stderr = real_stderr
 
+When a layer is run in a subprocess, the test IDs of any failures and errors it
+generates are passed to the parent process via the child's stderr.  The parent
+reads these IDs in parallel with reading other output from the child, so this
+works even if there are enough failures to overflow the capacity of the stderr
+pipe.
 
+    >>> argv = [testrunner_script, '--tests-pattern', '^sampletests_many$']
+    >>> testrunner.run_internal(defaults, argv)
+    Running sampletests_many.Layer1 tests:
+      Set up sampletests_many.Layer1 in N.NNN seconds.
+      Ran 1 tests with 0 failures, 0 errors and 0 skipped in N.NNN seconds.
+    Running sampletests_many.Layer2 tests:
+      Tear down sampletests_many.Layer1 ... not supported
+      Running in a subprocess.
+      Set up sampletests_many.Layer2 in N.NNN seconds.
+    <BLANKLINE>
+    <BLANKLINE>
+    Failure in test test_some_very_long_test_name_with_padding_000 (sampletests_many.TestMany)
+    ...
+      Ran 1000 tests with 1000 failures, 0 errors and 0 skipped in N.NNN seconds.
+      Tear down sampletests_many.Layer2 in N.NNN seconds.
+    Total: 1001 tests, 1000 failures, 0 errors and 0 skipped in N.NNN seconds.
+    True
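The figure of 874 test IDs quoted in the comment in `sampletests_many.py` can be checked with a little arithmetic. The sketch below assumes each test ID is written on its own line (one trailing newline byte per ID) and ignores any other lines the stderr protocol may carry. The helper name `ids_needed_to_overflow` is hypothetical.

```python
PIPE_CAPACITY = 65536  # bytes; the Linux default cited in the comment


def ids_needed_to_overflow(test_id, capacity=PIPE_CAPACITY):
    """Smallest number of newline-terminated IDs whose total size exceeds capacity."""
    line_len = len(test_id) + 1  # assumption: one newline byte per ID
    return capacity // line_len + 1


sample_id = (
    "test_some_very_long_test_name_with_padding_000 "
    "(sampletests_many.TestMany)"
)

print(len(sample_id))                     # 74
print(ids_needed_to_overflow(sample_id))  # 874
```

With 75 bytes per line, 873 IDs occupy 65475 bytes and 874 occupy 65550, so the 1000 failing tests generated by `make_TestMany` exceed a 65536-byte pipe under these assumptions.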