Azure Pipeline 3.8 CI: multiple tests hung and timed out on macOS 10.13 #81426
I backported a change to 3.8, and the macOS job of Azure Pipelines failed badly:
0:20:21 load avg: 4.55 [155/423/1] test_importlib crashed (Exit code 1) -- running: test_concurrent_futures (16 min 12 sec), test_functools (13 min 30 sec), test_multiprocessing_spawn (18 min 51 sec)
Thread 0x00007fff96f1a380 (most recent call first):
0:21:30 load avg: 4.76 [160/423/2] test_multiprocessing_spawn crashed (Exit code 1) -- running: test_concurrent_futures (17 min 21 sec), test_functools (14 min 39 sec), test_threading (35 sec 923 ms)
Thread 0x00007fff96f1a380 (most recent call first):
0:24:09 load avg: 4.11 [207/423/3] test_concurrent_futures crashed (Exit code 1) -- running: test_functools (17 min 18 sec), test_timeout (34 sec 14 ms), test_threading (3 min 14 sec)
Thread 0x0000700006141000 (most recent call first):
Thread 0x0000700005c3e000 (most recent call first):
Thread 0x000070000573b000 (most recent call first):
Thread 0x0000700005238000 (most recent call first):
Thread 0x00007fff96f1a380 (most recent call first):
0:26:51 load avg: 5.14 [259/423/4] test_functools crashed (Exit code 1) -- running: test_io (1 min 11 sec), test_threading (5 min 56 sec)
Thread 0x00007fff96f1a380 (most recent call first):
Thread 0x00007fff96f1a380 (most recent call first):
6 tests failed:
-- pythoninfo:
2019-06-12T02:45:41.9759180Z Py_DEBUG: Yes (sys.gettotalrefcount() present)
os.uname: posix.uname_result(sysname='Darwin', nodename='Mac-483.local', release='17.7.0', version='Darwin Kernel Version 17.7.0: Wed Apr 24 21:17:24 PDT 2019; root:xnu-4570.71.45~1/RELEASE_X86_64', machine='x86_64')
FWIW, I tried reproducing with 3.8 at 996e526 (the PR 14000 checkin) on both a current 10.14.5 Mojave system and a 10.13.6 High Sierra system (the version used in the failed Azure test) and did not see any unusual failures. I don't recall seeing a timeout like the one in test_concurrent_futures, at least not recently. But if it is due to some race condition, there might be a more significant difference, like the number of CPUs available, that might precipitate the failure. I'll leave it up to you, Victor, whether and how long to leave this issue open, but I don't see that there is anything practical to do until it can be reproduced.
I'm still seeing this, maybe 1 in 20 builds, so it's semi-random. A new deadlock, maybe?
It seems like only the jobs on Azure are killed by timeout. The jobs on macOS buildbots look fine. Maybe macOS on Azure is running slower and we should just increase the timeout? The bug still occurs: #15651
0:49:27 load avg: 1.41 [419/419/6] test_threading crashed (Exit code 1)
6 tests failed:
The whole job was killed after 57 minutes.
(Aside, why don't the macOS buildbots have a tag saying that? Took me ages to find them...) I doubt it's running 6-7x slower. More likely something is causing one of the workers to crash at a point where the lock remains held instead of being released (I saw this at work the other week in a slightly different context, but same symptoms). Could os._exit() at the wrong time cause it? It also looks like Azure is running tests with 4 processes, but the buildbot (at least the one I'm looking at) is only using 2. So perhaps there are more conflicts from that?
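To illustrate the symptom described above, here is a minimal sketch of a worker that finishes while still holding a lock. It uses threads rather than processes purely to keep the example small, so it is an analogy and not the actual multiprocessing failure path; the function name `lock_left_held` is hypothetical.

```python
import threading


def lock_left_held(timeout=0.2):
    """Return True if the lock could be acquired after the worker finished."""
    lock = threading.Lock()

    def worker():
        # Acquire the lock and return without releasing it, standing in
        # for a worker that exits abruptly in the middle of a critical section.
        lock.acquire()

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    # The worker is gone, but a plain Lock has no owner tracking, so it
    # stays locked. Without a timeout this acquire would block forever.
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return acquired


if __name__ == "__main__":
    print("lock acquired after worker exit:", lock_left_held())
```

Anything waiting on such a lock without a timeout would hang until an outer watchdog kills it, which is at least consistent with tests running for many minutes before being reported as crashed.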
Yeah, I agree that increasing the timeout shouldn't be the answer here. I still have never seen failure modes like this when running my own tests. The idea about CPUs is one worth pursuing, although I usually run with -j3. I also wonder how much memory the VM is configured with. Is there an easy way to find out the number of CPUs and the amount of memory?
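One possible way to answer that from inside a CI job is a short script like the sketch below. It assumes the platform exposes the `SC_PAGE_SIZE` and `SC_PHYS_PAGES` names through `os.sysconf`, and reports memory as unknown when it does not; the helper name `describe_host` is hypothetical.

```python
import os


def describe_host():
    """Return the CPU count and, where available, physical memory in bytes."""
    info = {"cpu_count": os.cpu_count(), "memory_bytes": None}
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on this platform, or it does not know
        # these names: leave memory_bytes as None.
        return info
    if page_size > 0 and page_count > 0:
        info["memory_bytes"] = page_size * page_count
    return info


if __name__ == "__main__":
    print(describe_host())
```

Running this as an extra step in the Azure job and on a buildbot would allow a direct comparison of the two environments.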
It looks like the Azure macOS tests timed out again in the recently opened PR-15688. Specifically, for test_multiprocessing_spawn and test_functools (both of which also timed out in PR-15651, which Victor mentioned earlier):
0:26:41 load avg: 2.89 [418/419/1] test_multiprocessing_spawn crashed (Exit code 1) -- running: test_functools (14 min 38 sec)
0:32:03 load avg: 3.17 [419/419/2] test_functools crashed (Exit code 1)
As far as I can tell, PR-15688 should have had no direct influence on test_multiprocessing_spawn or test_functools.
Since this seems to be affecting multiple PRs, would it be appropriate to increase the timeout duration as a temporary fix and open an issue for further investigation into the cause of the intermittent slowdown in those tests?
I suspect this code is a repro - it certainly locks up the host process reliably enough. Perhaps if we unblock multiprocessing in the context of a crashed worker then it'll show what the underlying errors are?

import os
from multiprocessing import Pool

def f(x):
    # Kill the worker process abruptly, skipping all normal cleanup.
    os._exit(0)
    return "success"  # never reached

if __name__ == '__main__':
    with Pool(1) as p:
        # As reported above, this call locks up the host process.
        print(p.map(f, [1]))
Steve: Would you mind opening a separate issue for the multiprocessing bug? multiprocessing is supposed to handle this case.
Filed as bpo-38084. I recommend not investigating this issue any further until that one is resolved.
It seems like the macOS job passes again on Azure Pipelines. I'm closing the issue.