"dictionary changed size during iteration" error in _ExecutorManagerThread #87664
Comments
Recently several of our Python 3.9 builds froze during Listing .../components/python/python39/build/prototype/sparc/usr/lib/python3.9/lib2to3/tests/data/fixers/myfixes... with this traceback in the log:

Exception in thread Thread-1:
Exception in thread Thread-1:
Traceback (most recent call last):
File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
worker_sentinels = [p.sentinel for p in self.processes.values()]
File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
worker_sentinels = [p.sentinel for p in self.processes.values()]
RuntimeError: dictionary changed size during iteration

After this, the build freezes and never ends (most likely waiting for the broken thread). We see this only in Python 3.9 (3.7 doesn't seem to be affected, and we don't deliver other versions) and only when doing full builds of the entire Userland, so it might be related to heavy utilization of the build machine. That said, it has only happened three or four times, so this might just be a coincidence. A simple fix seems to be this (PR shortly):

--- Python-3.9.1/Lib/concurrent/futures/process.py
+++ Python-3.9.1/Lib/concurrent/futures/process.py
@@ -373,7 +373,7 @@ class _ExecutorManagerThread(threading.T
assert not self.thread_wakeup._closed
wakeup_reader = self.thread_wakeup._reader
readers = [result_reader, wakeup_reader]
- worker_sentinels = [p.sentinel for p in self.processes.values()]
+ worker_sentinels = [p.sentinel for p in self.processes.copy().values()]
ready = mp.connection.wait(readers + worker_sentinels)
cause = None

This is on Oracle Solaris and on both SPARC and Intel machines. |
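To see the underlying race outside the executor, here is a minimal sketch (a toy illustration, not the executor code itself): one thread iterates a shared dict while another thread resizes it, which is the same situation wait_result_broken_or_wakeup() can hit on self.processes, and iterating a snapshot avoids it.

```python
# Toy illustration of the race: one thread resizes a dict while another
# iterates it, then the same loop done over a snapshot.
import threading

shared = {i: i for i in range(1000)}
stop = threading.Event()

def mutator():
    # Stands in for worker processes being added/removed: the dict is
    # briefly resized over and over while another thread may be iterating.
    n = 1000
    while not stop.is_set():
        shared[n] = n          # grow ...
        shared.pop(n, None)    # ... and shrink again
        n += 1

threading.Thread(target=mutator, daemon=True).start()

try:
    for _ in range(10_000):
        # Unprotected iteration, like the original list comprehension:
        # raises "RuntimeError: dictionary changed size during iteration"
        # if a thread switch lands while the mutator has changed the size.
        _ = [v for v in shared.values()]
    print("unprotected iteration survived (timing dependent)")
except RuntimeError as exc:
    print("unprotected iteration failed:", exc)

# Iterating over a snapshot sidesteps the size check, which is what the
# proposed .copy() change (or an equivalent list(...)) does.
for _ in range(10_000):
    _ = [v for v in shared.copy().values()]
print("snapshot iteration finished without error")

stop.set()
```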
I'm seeing the same error with Python 3.9.2 on Fedora 33, with a script that uses ProcessPoolExecutor. |
I investigated a little bit more and found out that this happens when […]. With the following change, I can reproduce this reliably every time:

--- Python-3.9.1/Lib/concurrent/futures/process.py
+++ Python-3.9.1/Lib/concurrent/futures/process.py
@@ -373,7 +373,14 @@ class _ExecutorManagerThread(threading.T
assert not self.thread_wakeup._closed
wakeup_reader = self.thread_wakeup._reader
readers = [result_reader, wakeup_reader]
- worker_sentinels = [p.sentinel for p in self.processes.values()]
+ worker_sentinels = []
+ for p in self.processes.values():
+ time.sleep(1)
+ worker_sentinels.append(p.sentinel)
ready = mp.connection.wait(readers + worker_sentinels)
cause = None

Since […] |
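As I understand it (my gloss, not stated explicitly in the comment above), since Python 3.9 ProcessPoolExecutor spawns worker processes on demand from the thread calling submit(), so the processes dict can be mutated while the manager thread is iterating it; the sleep(1) above just widens that window. A hedged driver sketch that recreates those conditions follows; whether the race actually fires depends on timing, and if it does fire the script hangs (the manager thread dies and the futures never complete), matching the frozen builds reported above.

```python
# Hedged reproducer sketch: keep the executor spawning new workers
# (on-demand spawning in 3.9+) while the manager thread repeatedly builds
# its sentinel list. Not deterministic without the sleep(1) patch above.
import concurrent.futures
import time

def short_task(n):
    # Short-lived work so workers keep finishing while new ones start.
    time.sleep(0.01)
    return n

if __name__ == "__main__":
    for attempt in range(50):
        with concurrent.futures.ProcessPoolExecutor(max_workers=16) as pool:
            futures = [pool.submit(short_task, i) for i in range(32)]
            for fut in concurrent.futures.as_completed(futures):
                fut.result()
    print("completed all attempts without hanging")
```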
Observed this same failure mode on a Raspberry Pi 1 while running 'make install' on Python 3.9.5 with 9 concurrent workers:

Exception in thread Thread-1:
Traceback (most recent call last):
File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
worker_sentinels = [p.sentinel for p in self.processes.values()]
File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
worker_sentinels = [p.sentinel for p in self.processes.values()]
RuntimeError: dictionary changed size during iteration |
I can confirm we are seeing the same issue when building Python 3.9 in the context of Buildroot. See http://autobuild.buildroot.net/results/ae6/ae6c4ab292589a4e4442dfb0a1286349a9bf4d29/build-end.log for an example build result. This has been happening since we added 48-core (96-thread) build machines to our build farm, which dramatically increased build parallelism. |
For the record: we're seeing this issue ~50 times a day on our build infrastructure. |
It was mentioned in bpo-40327 that although copy() makes the situation much better, it doesn't solve the problem entirely, since the memory allocation in the copy() call can release the GIL. I don't know enough to say whether it would be worth adding locking. |
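For illustration only, here is a hypothetical sketch of what "adding locking" could mean: a lock shared by the threads that mutate the processes dict and the thread that snapshots it. The class and names (_SharedProcesses, snapshot_sentinels) are invented for this sketch and are not part of CPython.

```python
# Hypothetical sketch of the locking alternative discussed above; NOT what
# CPython does, just an illustration of guarding the shared dict.
import threading

class _SharedProcesses:
    def __init__(self):
        self._processes = {}
        self._lock = threading.Lock()

    def add(self, pid, process):
        # Called from the submitting thread when a worker is spawned.
        with self._lock:
            self._processes[pid] = process

    def remove(self, pid):
        # Called when a worker exits and is reaped.
        with self._lock:
            self._processes.pop(pid, None)

    def snapshot_sentinels(self):
        # Called from the manager thread before mp.connection.wait(); the
        # lock guarantees the dict cannot be resized mid-iteration.
        with self._lock:
            return [p.sentinel for p in self._processes.values()]
```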
I think that even if copy() doesn't fix it entirely, it's still much better than nothing. I never encountered the issue mentioned in bpo-40327, but I saw this issue several times a week (before applying the proposed patch). |
I'm experiencing the same issue on Python 3.10.0 when I execute code that uses concurrent.futures.ProcessPoolExecutor.

========
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 317, in run
result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
worker_sentinels = [p.sentinel for p in self.processes.values()]
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 376, in <listcomp>
PROCESSING DATAFRAME: AKAM
worker_sentinels = [p.sentinel for p in self.processes.values()]
RuntimeError: dictionary changed size during iteration
========

I also tried to troubleshoot to find the part that causes this exception, but the most difficult part is that it does not happen every time I run my code that uses concurrent.futures.ProcessPoolExecutor (much like Jakub mentioned earlier, it feels like a coincidence). At the same time, I am testing whether the same thing happens on other versions such as Python 3.8.8 (on Rocky Linux 8.5), but I would appreciate it if someone could tell me whether this is a bug, or whether there is anything I should improve in my own code. (I can share sample code if needed, but I honestly do not think my code is at fault: the exception does not happen on every run, so I suspect this is a bug in Python 3.10.0. Since Jakub already reported that it happens on Python 3.9, I am not testing 3.9.) I would appreciate any update or info that can be shared. Thank you! |
Thanks for the report. Atomic copy […]. I doubt that writing a reliable test for this situation is possible; multithreading is hard. I think we can accept a patch without a test, but with an inline comment that describes why the copy is crucial. |
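For illustration, the .copy() change proposed earlier in this thread could carry such an inline comment. This is only a sketch of what that might look like; the exact shape and wording of any merged fix may differ.

```diff
--- Python-3.9.1/Lib/concurrent/futures/process.py
+++ Python-3.9.1/Lib/concurrent/futures/process.py
@@ -373,7 +373,10 @@ class _ExecutorManagerThread(threading.T
         assert not self.thread_wakeup._closed
         wakeup_reader = self.thread_wakeup._reader
         readers = [result_reader, wakeup_reader]
-        worker_sentinels = [p.sentinel for p in self.processes.values()]
+        # Iterate over a snapshot: self.processes can be mutated by another
+        # thread (e.g. when a worker is spawned or reaped), and iterating
+        # the live dict can raise "dictionary changed size during iteration".
+        worker_sentinels = [p.sentinel for p in self.processes.copy().values()]
         ready = mp.connection.wait(readers + worker_sentinels)
         cause = None
```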