
Multiprocessing overhead 100x higher with multiple cores but not if restricted to 1 core #98493

Closed
PhilippWendler opened this issue Oct 20, 2022 · 5 comments
Labels: performance (Performance or resource usage), topic-multiprocessing, type-bug (An unexpected behavior, bug, or error)

Comments

PhilippWendler commented Oct 20, 2022

Bug report

The following simple Python script measures the per-call overhead of using the multiprocessing module and typically prints a value less than 0.001 on my machines:

import multiprocessing
import timeit
count = 100
pool = multiprocessing.Pool(1)

with pool:
    print(timeit.timeit(lambda: pool.apply(int, [42]), number=count) / count)

On one of our machines (which has 256 CPU cores but the same software as the others) it is two orders of magnitude slower and prints a value around 0.05, i.e., one multiprocessing call takes 50 ms. (This is consistently reproducible and independent of how many iterations I let it run.)

The weird thing is that if I restrict the Python interpreter to a single core by starting it with taskset -c 0 (any other single core works as well), the script is suddenly 100x faster and thus as fast as on the other machines. As soon as I allow two or more cores to be used, the performance is bad again. It does not matter how many cores are allowed, as long as it is more than one. The machine is otherwise idle.
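
For reference, the same restriction can also be applied from within the script via os.sched_setaffinity (a minimal sketch, Linux-only, not part of the original report; core 0 is an arbitrary choice, just as in the taskset call):

import multiprocessing
import os
import timeit

# Linux-only: restrict this process (and the pool worker forked from it) to
# CPU core 0, equivalent to launching the interpreter with `taskset -c 0`.
os.sched_setaffinity(0, {0})
print("allowed cores:", os.sched_getaffinity(0))

count = 100
pool = multiprocessing.Pool(1)

with pool:
    print(timeit.timeit(lambda: pool.apply(int, [42]), number=count) / count)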

My guess is that something in multiprocessing is slow if the machine has many cores, but there is some alternative fast path that is used only if a single core is allowed.

I tried to debug and profile this and found out the following:

  • The time of each pool.apply() call is roughly the same (with the usual variations due to noise), i.e., it is not the case that some of the calls block for a long time and all others are fast.
  • The main thread just waits all the time inside the multiprocessing implementation for the threading.Event that signals that the result has arrived. It does not spend significant CPU time.
  • There are three additional threads created by this script, one of which runs at 100% usage of a CPU core and another at roughly 40% (see the sketch after this list for one way to measure per-thread CPU usage).
  • The subprocess that is started by the pool does not consume significant CPU time.
  • Python 3.11.0rc2 is about 10x faster than Python 3.10.8 for this script, but still 10x slower than when the script is restricted to one CPU core. In the single-core case there is no noticeable performance difference between Python 3.10.8 and 3.11.0rc2. So there have already been improvements, but I am still reporting this because a significant slowdown from 1 core to 2+ cores remains even with Python 3.11.
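
One way to attribute CPU time to the individual helper threads is the third-party psutil package (a sketch, not from the original report; it assumes Linux, where psutil's per-thread ids match threading.Thread.native_id):

import multiprocessing
import threading

import psutil  # third-party: pip install psutil

pool = multiprocessing.Pool(1)
with pool:
    for _ in range(100):
        pool.apply(int, [42])

    # Map kernel thread ids to Python thread names and print the accumulated
    # per-thread CPU time (user and system seconds).
    names = {t.native_id: t.name for t in threading.enumerate()}
    for t in psutil.Process().threads():  # pthread(id, user_time, system_time)
        print(names.get(t.id, "<unknown>"), t.user_time, t.system_time)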

I used yappi as a profiler because it supports multiple threads (it slows everything down significantly, but I hope this does not influence the results; a sketch of roughly how such a profile can be collected follows the function stats below). Its thread stats are as follows:

name           id     tid              ttot      scnt        
Thread         3      139744484771392  1.087823  14        
Thread         2      139744467985984  0.103491  10        
_MainThread    0      139744494682112  0.002633  11        
Thread         1      139744476378688  0.002465  10        

Thread 3 was the one that ran at 100% CPU in this case, but sometimes it is Thread 2 instead. The function stats reported by yappi for this thread are (sorted by tsub, the CPU time spent in the function itself, excluding subcalls):

Function stats for (Thread) (3)

Clock type: CPU
Ordered by: tsub, desc

name                                  ncall  tsub      ttot      tavg      
..iprocessing/connection.py:922 wait  12270  0.118058  0.816814  0.000067
..ctors.py:352 PollSelector.register  24533  0.089522  0.427971  0.000017
..ctors.py:235 PollSelector.register  24533  0.087015  0.318705  0.000013
...10/selectors.py:21 _fileobj_to_fd  24533  0.063059  0.130226  0.000005
..lectors.py:403 PollSelector.select  12270  0.046019  0.110461  0.000009
<string>:1 <lambda>                   24533  0.041068  0.060914  0.000002
..y:216 PollSelector._fileobj_lookup  24533  0.040551  0.170776  0.000007
..ctors.py:348 PollSelector.__init__  12271  0.037467  0.083006  0.000007
..onnection.py:173 Connection.fileno  18402  0.032661  0.049131  0.000003
..ctors.py:210 PollSelector.__init__  12271  0.025910  0.036707  0.000003
..electors.py:269 PollSelector.close  12271  0.023426  0.033754  0.000003
../connection.py:258 Connection.poll  6140   0.023338  0.337786  0.000055
..g/pool.py:289 _join_exited_workers  6131   0.021958  0.087539  0.000014
..on.py:139 Connection._check_closed  24551  0.021673  0.021673  0.000001
..rocess.py:247 ForkProcess.sentinel  12262  0.021180  0.030983  0.000003
..ctors.py:203 PollSelector.__exit__  12271  0.020248  0.054003  0.000004
..sing/pool.py:500 _wait_for_updates  6130   0.019741  0.894825  0.000146
..rocess.py:224 ForkProcess.exitcode  6131   0.018399  0.061178  0.000010
..iprocessing/pool.py:286 <listcomp>  6131   0.017134  0.059192  0.000010
..ss.py:99 ForkProcess._check_closed  18393  0.014678  0.014678  0.000001
..connection.py:428 Connection._poll  6140   0.013473  0.304532  0.000050
..essing/popen_fork.py:24 Popen.poll  6131   0.013141  0.037903  0.000006
../pool.py:284 _get_worker_sentinels  6131   0.011803  0.070995  0.000012
..ng/queues.py:352 SimpleQueue.empty  6140   0.011367  0.349153  0.000057
..cessing/pool.py:333 _maintain_pool  6131   0.011109  0.098649  0.000016
..rs.py:64 _SelectorMapping.__init__  12271  0.010797  0.010797  0.000001
..tors.py:200 PollSelector.__enter__  12271  0.008526  0.008526  0.000001
..ssing/connection.py:938 <listcomp>  6140   0.005670  0.005670  0.000001
..s.py:276 PollSelector._key_from_fd  6140   0.005305  0.005305  0.000001
...py:143 Connection._check_readable  6149   0.004729  0.004729  0.000001
..connection.py:379 Connection._recv  18     0.000118  0.000254  0.000014
..sing/queues.py:363 SimpleQueue.get  9      0.000079  0.000591  0.000066
..tion.py:418 Connection._recv_bytes  9      0.000063  0.000337  0.000037
..ction.py:213 Connection.recv_bytes  9      0.000057  0.000416  0.000046
..ng/synchronize.py:97 Lock.__exit__  9      0.000030  0.000039  0.000004
..g/synchronize.py:94 Lock.__enter__  9      0.000021  0.000034  0.000004
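
For context, numbers like the above can be collected roughly as follows (a sketch of typical yappi usage, not the exact commands from the original report; attribute and method names may differ slightly between yappi versions):

import multiprocessing

import yappi  # third-party: pip install yappi

yappi.set_clock_type("cpu")   # measure CPU time, as in the tables above
yappi.start()

pool = multiprocessing.Pool(1)
with pool:
    for _ in range(100):
        pool.apply(int, [42])

yappi.stop()
yappi.get_thread_stats().print_all()
for tstat in yappi.get_thread_stats():
    # Per-thread function stats, sorted by tsub (CPU time in the function itself).
    yappi.get_func_stats(ctx_id=tstat.id).sort("tsub").print_all()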

At this point I am lost and do not know what else I could do to find out more. But I would be glad to provide additional information or assist otherwise if requested! I hope the above provides enough of a starting point.

Your environment

  • CPython versions tested on: 3.10.6 (Ubuntu package), 3.10.8 (official Docker image)
  • Operating system and architecture: Ubuntu 22.04 x64, Linux 5.15.0-48-generic
  • Hardware: 2x AMD EPYC 7713 CPUs (2 × 64 cores plus SMT -> 256 logical cores)
PhilippWendler added the type-bug label Oct 20, 2022
AlexWaygood added the topic-multiprocessing and performance labels Oct 20, 2022
gpshead commented Oct 22, 2022

This is very easily reproduced.

greg@bluesteel:~/oss/b$ taskset -c 0,1 ./python ../mp-speed-issue-98493.py 
0.10913387069995224
greg@bluesteel:~/oss/b$ taskset -c 1 ./python ../mp-speed-issue-98493.py 
0.00034898220001196024
greg@bluesteel:~/oss/b$ taskset -c 1,3 ./python ../mp-speed-issue-98493.py 
0.06422008498004289
greg@bluesteel:~/oss/b$ ./python ../mp-speed-issue-98493.py 
0.07137351215998933

Note that this is Linux, where the multiprocessing start method defaults to "fork". Changing it to "spawn" doesn't alter the performance significantly, but "forkserver" is significantly slower. The slow speed of the multi-core runs varies wildly. Wild guess: it smells like something is contending for a lock without giving the thing that needs it a chance to acquire it. A process sitting there at 100% CPU for this makes no sense.

import multiprocessing
import timeit
count = 100

def f():
    pool = multiprocessing.Pool(1)
    with pool:
        print(timeit.timeit(lambda: pool.apply(int, [42]), number=count) / count)

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # "fork" is the default
    f()
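
For comparison, the three start methods can also be benchmarked in one run via per-method contexts instead of set_start_method (a sketch, not from the original comment; "fork" and "forkserver" are POSIX-only):

import multiprocessing
import timeit

count = 100

def bench(method):
    # Each pool gets its own context so the start methods can be compared
    # side by side within a single process.
    ctx = multiprocessing.get_context(method)
    with ctx.Pool(1) as pool:
        print(method, timeit.timeit(lambda: pool.apply(int, [42]), number=count) / count)

if __name__ == "__main__":
    for method in ("fork", "spawn", "forkserver"):
        bench(method)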

gpshead changed the title from "Multiprocessing overhead 100x higher on machine with 256 cores but not if restricted to 1 core" to "Multiprocessing overhead 100x higher with multiple cores but not if restricted to 1 core" Oct 22, 2022
PhilippWendler (Author) commented:

Good to hear that I am not the only one seeing this. How many cores does your machine have in total? I couldn't reproduce it on my notebook with 16 cores.

PhilippWendler added a commit to sosy-lab/benchexec that referenced this issue Nov 4, 2022
The goal is to avoid the multiprocessing overhead for this method,
which is called often but always returns the same result.
The multiprocessing overhead is typically not large,
but in some situations it is:
python/cpython#98493

On systems that are affected by this, it can save several minutes of
execution time if there are 10000 runs.
gpshead commented Dec 5, 2022

@ambv as FYI

PhilippWendler (Author) commented:

I have tested this again, but cannot reproduce it anymore, on the same system as in my initial post. Nothing on the system has changed except that the kernel is now 5.15.0-91-generic and Python is 3.10.12. I have also tested with the official Docker containers for 3.10.6, 3.10.8, 3.11, and 3.12; none of them shows a difference between running with taskset -c 0 and without.

Can other people confirm this?

gpshead commented Jan 22, 2024

I cannot reproduce this either. The system I had reproduced it on was likely just running vanilla Debian or Ubuntu at that time (late October 2022). Anyway, without an ability to reproduce this, I'm closing. It remains a mystery; possibly an issue with a specific common distro kernel at that time?

gpshead closed this as not planned (won't fix / can't repro / duplicate / stale) Jan 22, 2024