-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
Shared Array Memory Allocation Regression #75102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hello, I have noticed a significant performance regression when allocating a large shared array in Python 3.x versus Python 2.7. The affected module seems to be The function I used for benchmarking: from timeit import timeit
timeit('sharedctypes.Array(ctypes.c_float, 500*2048*2048)', 'from multiprocessing import sharedctypes; import ctypes', number=1) And the results from executing it: Python 3.5.2 ------------------- Python 2.7.12 I will try to provide any information you need. Right now I am looking at callgrind/cachegrind without Debug symbols, and can post that, in the meantime I am building Python with Debug and will re-run the callgrind/cachegrind. Allocating the same-size array with numpy doesn't seem to have a difference between Python versions. The numpy command used was |
This is because instantiating new Arena is much slower in Python 3. ./python -m timeit -r1 -s 'from multiprocessing.heap import Arena' 'Arena(2**26)' Python 3.7: 1 loop, best of 1: 1.18 sec per loop May be the regression was introduced by bpo-8713. |
It can be significantly sped up if replace the writing loop with if size:
os.lseek(self.fd, size, 0)
os.write(self.fd, b'')
os.lseek(self.fd, 0, 0) or just with os.truncate(self.fd, size) (not sure the latter always works). But instantiating an open Arena will still be slower than in 2.7. |
In Python 2.7, multiprocessing.heap.Arena uses an anonymous memory mapping on Unix. Anonymous memory mappings can be shared between processes but only via fork(). But Python 3 supports other ways of starting subprocesses (see bpo-8713 [1]) and so an anonymous memory mapping no longer works. So instead a temporary file is created, filled with zeros to the given size, and mapped into memory (see changeset 3b82e0d83bf9 [2]). It is the zero-filling of the temporary file that takes the time, because this forces the operating system to allocate space on the disk. But why not use ftruncate() (instead of write()) to quickly create a file with holes? POSIX says [3], "If the file size is increased, the extended area shall appear as if it were zero-filled" which would seem to satisfy the requirement. [1] https://bugs.python.org/issue8713 |
Note that some filesystems (e.g. HFS+) don't support sparse files, so creating a large Arena will still be slow on these filesystems even if the file is created using ftruncate(). (This could be fixed, for the "fork" start method only, by using anonymous maps in that case.) |
If I understand correctly, there is no way to force the old behaviour in Python 3.5, i.e. to use an anonymous memory mapping in multiprocessing.heap.Arena so that the memory can be shared between the processes instead of writing to a shared file? The data sizes usually are, on average, 20 GB, and writing it out to a file is not desirable. As I understand from Gareth Rees' comment, ftruncate() will speed up initialisation, however any processing afterwards would be IO capped. To shed more light to the processing going on, the data is handled as a 3D array, so each process gets a 2D array to operate on, and no information needs to be shared between processes. If the anonymous memory mapping cannot be forced, then the multiprocessing module with a shared array becomes unusable for me. Are you aware of any way to use the multiprocessing module to run execution in parallel, that somehow doesn't use a shared array? |
If you need the 2.7 behaviour (anonymous mappings) in 3.5 then you can still do it, with some effort. I think the approach that requires the smallest amount of work would be to ensure that subprocesses are started using fork(), by calling multiprocessing.set_start_method('fork'), and then monkey-patch multiprocessing.heap.Arena.__init__ so that it creates anonymous mappings using mmap.mmap(-1, size). (I suggested above that Python could be modified to create anonymous mappings in the 'fork' case, but now that I look at the code in detail, I see that it would be tricky, because the Arena class has no idea about the Context in which it is going to be used -- at the moment you can create one shared object and then pass it to subprocesses under different Contexts, so the shared objects have to support the lowest common denominator.) |
Nonetheless this is bound to be a nasty performance for many people doing big data processing with NumPy/SciPy/Pandas and multiprocessing and moving from 2 to 3, so even if it can't be fixed, the documentation ought to warn about the problem and explain how to work around it. |
I have looked into your advice of changing multiprocessing.heap.Arena.__init__, I have removed the code that allocated the file and reverted to the old behaviour. I have done some brief runs and it seems to bring back the old behaviour which is allocating the space in RAM, rather than with IO. I am not sure what things this might break, and it might make the other usages of multiprocessing unstable! Can anyone think of anything this change might break? The Arena.__init__ code is the one from Python 2.7: class Arena(object):
def __init__(self, size, fd=-1):
self.size = size
self.fd = fd # still kept but is not used !
self.buffer = mmap.mmap(-1, self.size) There does not seem to be a difference regardless of the start method setting multiprocessing.set_start_method('fork') to be 'fork'. |
I see now that the default start method is 'fork' (except on Windows), so calling set_start_method is unnecessary. Note that you don't have to edit multiprocessing/heap.py, you can "monkey-patch" it in the program that needs the anonymous mapping: from multiprocessing.heap import Arena
def anonymous_arena_init(self, size, fd=-1):
"Create Arena using an anonymous memory mapping."
self.size = size
self.fd = fd # still kept but is not used !
self.buffer = mmap.mmap(-1, self.size)
Arena.__init__ = anonymous_arena_init As for what it will break — any code that uses the 'spawn' or 'forkserver' start methods. |
Thank you, that is indeed the solution I ended up with, along with a large explanation of why it was necessary. Do you think that it's worth updating the |
I propose:
|
The original changeset in Richard's repository is https://hg.python.org/sandbox/sbt/rev/6c8554a7d068. Unless Richard answers otherwise, I think it's likely the performance degradation was an oversight. Given the code we're talking about is POSIX-specific, truncate() (or its sibling os.ftruncate()) is specified to always zero-fill. So it should be safe to use. Another related point is that the arena file is created inside get_temp_dir(), which allocates something in /tmp. On modern Linux systems at least, /tmp is usually backed by persistent storage. Instead, we could use /run/user/{os.getuid()}, which uses a tmpfs and would therefore save on I/O when later accessing the shared array. |
Gareth, Dimitar, you may want to take a look at #2708 |
The size of file system mounted on "/run/user/${uid}" is limited (only 200 MiB on my computer). get_temp_dir() will be successful, but this may be not enough for allocating large shared array. And it is harder to control by user, because the location is not specified by an environment variable. |
Hmm, you're right. Is /dev/shm ok on your computer? |
Yes, /dev/shm is ok on my computer. But there is yet one drawback of this patch. Currently the size of shared array is limited by the free space on the file system for temporary files. If it isn't enough you can easy set TMPDIR to other path. In 2.7 and with hardcoded /dev/shm it is limited by the size of physical memory (maybe plus swap). It is not easy to increase it, especially for unprivileged user. The effect of hardcoding /dev/shm can be achieved by setting TMPDIR to /dev/shm. Using os.ftruncate() adds yet about 20% of speed up. This looks mainly as a documentation issue to me. The regression and the workaround should be documented. |
The main issue here is to restore performance of 2.7 version. Setting TMPDIR sounds too general and might have impact on other libraries. I think it's ok if shared array size is limited by virtual memory size on the computer, since the whole point is to expose a fast communication channel between processes. |
This can break Python 3 applications that work with shared arrays larger than the size of physical memory. If do this change, it should be optional. Setting TMPDIR=/dev/shm restores performance of 2.7 version even on versions that we will decide to not patch. |
Again, setting TMPDIR is a poor workaround as it will impact any library creating temporary files. I haven't found anyone complaining about limited shared array size on 2.7, so likely it's not a common concern. Right now, anyone creating a huge shared array on 3.x will be facing huge performance issues, as the array is backed by disk, so it doesn't sound realistic at all to assume people are doing that. |
Serhiy, I've changed the strategy in PR 2708: now the partition free space is tested before deciding to locate the file there or not. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: