Change PyMem_Malloc to use pymalloc allocator #70437
Comments
The issue bpo-23601 showed a speedup for the dict type by replacing PyMem_Malloc() with PyObject_Malloc() in dictobject.c. When I worked on PEP 445, using the Python fast memory allocator for small memory allocations (<= 512 bytes) was discussed, but I think that nobody benchmarked it. So I'm opening this issue to discuss it.

By the way, we should also benchmark the Windows memory allocator, which limits fragmentation. Maybe we can skip the Python small-object allocator on recent versions of Windows?

The attached patch implements the change. The main question is the speedup on various kinds of memory allocations (we need a benchmark) :-) I will try to run benchmarks. -- If the patch slows Python down, maybe we can investigate whether some Python types (like dict) mostly use "small" memory blocks (<= 512 bytes).
Ok, to avoid confusion, I opened an issue specific to Windows for its "Low-fragmentation Heap": bpo-26251. Other issues related to memory allocators. Merged:
Open:
Hum, the point of PyMem_Malloc() is that it's distinct from PyObject_Malloc(), right? Why would you redirect one to the other?
(of course, we might question why we have two different families of allocation APIs...)
For performance.
That's the real question: why does Python have the PyMem family? Is it still justified in 2016? -- Firefox uses jemalloc to limit fragmentation of the heap memory. I once spent a lot of time trying to understand the principle of fragmentation, and in my tiny benchmarks jemalloc was *much* better than the system allocator. By the way, jemalloc scales well on multiple threads ;-) My notes on heap memory fragmentation: http://haypo-notes.readthedocs.org/heap_fragmentation.html
About heap memory fragmentation, see also my two attached "benchmarks" in Python and C: python_memleak.py and tu_malloc.c.
So, I ran ssh://hg@hg.python.org/benchmarks with my patch. It looks like some benchmarks are up to 4% faster:

$ python3 -u perf.py ../default/python.orig ../default/python.pymem
INFO:root:Automatically selected timer: perf_counter
Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, fastpickle, fastunpickle, json_dump_v2, regex_v8 and tornado_http lost in migration]
The following not significant results are hidden, use -v to show them:
real 19m13.413s
Please use the -r flag for perf.py.
What this says is that some internal uses of PyMem_XXX should be replaced with PyObject_XXX.
FYI, benchmark results comparing Python with and without pymalloc (the fast memory allocator for blocks <= 512 bytes). As expected, no-pymalloc is slower, up to 30% slower (and it's never faster).

Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, json_load, regex_v8 and tornado_http lost in migration]
The following not significant results are hidden, use -v to show them:
Test with jemalloc using the shell script "python.jemalloc".

Memory consumption:
Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, json_load, nbody, regex_v8 and tornado_http lost in migration]

Performance:
Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, chameleon_v2, nbody, regex_v8 and tornado_http lost in migration]
The following not significant results are hidden, use -v to show them:
Why not change PyMem_XXX to use the same fast allocator as PyObject_XXX (as proposed in this issue)? FYI, we now also have the PyMem_RawXXX family :)
On 02/02/2016 at 15:47, STINNER Victor wrote:
These figures are not even remotely believable.
On 02/02/2016 at 15:48, STINNER Victor wrote:
Why have two sets of functions doing exactly the same thing?
To be honest, I didn't try to understand them :-) Are they the number of kB of RSS memory? Maybe perf.py doesn't like my shell script?
I have no idea.
"perf.py -m" doesn't work with such a bash script, but it works using exec:
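A wrapper along these lines is what the comment describes (the jemalloc library path below is an assumption; adjust it for your distribution):

```shell
#!/bin/sh
# Run Python with jemalloc via LD_PRELOAD. "exec" replaces the shell
# process with python3 instead of forking a child, so the wrapper and
# the interpreter share one PID -- which is what lets perf.py -m
# sample the memory usage of the right process.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 exec python3 "$@"
```

Without exec, perf.py would be watching the shell process while Python runs as its child, and the memory figures would be meaningless.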
Hum, it looks like jemalloc uses *more* memory than the libc memory allocators. I don't know if it's a known

Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, json_load, nbody, regex_v8 and tornado_http lost in migration]
(Crap. I sent an incomplete message, sorry about that.)
I don't know if it's a known issue/property of jemalloc.
Yury: "Please use -r flag for perf.py" Oh, I didn't know about this flag. Sure, I can do that. New benchmark using --rigorous to measure the performance of the attached pymem.patch. It always seems faster, never slower.

Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
[per-benchmark results for 2to3, django_v3, fastpickle, fastunpickle, json_dump_v2, regex_v8 and tornado_http lost in migration]
The following not significant results are hidden, use -v to show them:
Hi all, please find below the results from a complete GUPB run on a patched CPython 3.6. On average, an improvement of about 2.1% can be observed. I'm also attaching an implementation of the patch for CPython 2.7 and its benchmark results. On GUPB, the average performance boost is 1.5%. Compared to my proposal in bpo-26382, this patch yields slightly better results for CPython 3.6, gaining an average of +0.36% on GUPB.

Hardware and OS configuration:
BIOS settings: Intel Turbo Boost Technology: false
OS: Ubuntu 14.04.2 LTS
OS configuration: Address Space Layout Randomization (ASLR) disabled to reduce run-to-run variance
Repository info: [lost in migration]

Results:
Table 1: CPython 3 GUPB results [table lost in migration]
Table 2: CPython 2 GUPB results [table lost in migration]
Table 3: OpenStack Swift ssbench results [table lost in migration]
IMHO this change is too young to be backported to Python 2.7. I wrote it for Python 3.6 only. For Python 2.7, I suggest writing patches with a narrow scope, as you did for the patch only modifying the list type. I was surprised to see slow-downs, but I prefer to think that changes smaller than 5% are pure noise. The good news is the long list of benchmarks with speedups larger than 5.0% :-) The 22% on unpickle-list is nice to have too!
I've just posted the results of an OpenStack Swift benchmark run using the patch from my proposal, bpo-26382.
I created the issue bpo-26516 "Add PYTHONMALLOC env var and add support for malloc debug hooks in release mode" to help developers to detect bugs in their code, especially misuse of the PyMem_Malloc() API. |
Patch 3:
In February 2016, I started a thread on the python-dev mailing list. M.-A. Lemburg wrote:

"Yes: You cannot free memory allocated using pymalloc with the [...] It would be better to go through the list of PyMem_() calls [...]"

"The PyMem_*() APIs were needed to have a cross-platform malloc() [...]"

M.-A. Lemburg fears that the PyMem_Malloc() API is misused: "Sometimes, yes, but we also do allocations for e.g. https://docs.python.org/3.6/c-api/arg.html [...] We do document to use PyMem_Free() on those; not sure whether [...]"

M.-A. Lemburg suggested to test the patch of this issue on: "It may also be a good idea to check wrapper generators such [...]"
numpy: good!
Commands run for the numpy tests in a virtual environment:

numpy$ python setup.py install
[...]
OK (KNOWNFAIL=7, SKIP=6)
Victor, why do you insist on this instead of changing internal API calls in CPython?
Antoine Pitrou added the comment:
https://mail.python.org/pipermail/python-dev/2016-February/143097.html "There are 536 calls to the functions PyMem_Malloc(), PyMem_Realloc() [...]"
I'm sure you can use powerful tools such as "sed" ;-)
I guess that PyMem functions are used in third-party C extension modules. I expect a (minor) speedup in these modules too. I don't understand why we should keep a slow allocator when Python has a faster one.
lxml: good!
lxml$ make test
OK
Pillow: good. Note: I had to install the JPEG headers (sudo dnf install -y libjpeg-turbo-devel). Tested version: git commit 555544c5cfc3874deaac9cfa87780822ee714c0d (Mar 8, 2016). --- FAILED (SKIP=124, errors=2) The two errors are "OSError: decoder libtiff not available".
On 09/03/2016 at 18:01, STINNER Victor wrote:
Define "slow". malloc() on Linux should be reasonably fast. Do you think it's reasonable to risk breaking external libraries just [...]? Again, why don't you try simply changing internal calls?
See the first messages of this issue for benchmark results. Some specific benchmarks are faster, none is slower.
Yes. It was discussed in the python-dev thread.
I'm talking about the performance improvement in third-party libraries, not the performance improvement in CPython itself, which can be addressed by replacing the internal API calls.
Oh, OK. I don't know how to measure the performance of third-party libraries. I expect no speedup or a small speedup, but no slow-down.
The question is whether my change really breaks anything in practice. I'm testing some popular C extensions to prepare an answer. The early result is that developers use the Python allocator API correctly :-) I disagree that my change breaks any API. The API doc is clear: for example, you must use PyMem_Free() on memory allocated by PyMem_Malloc(). If you use free(), it fails badly with Python compiled in debug mode. My issue bpo-26516 "Add PYTHONMALLOC env var and add support for malloc debug hooks in release mode" may help developers validate their own applications. I suggest continuing the discussion on python-dev for a wider audience. I will test a few more projects before replying on the python-dev thread.
On 09/03/2016 at 18:27, STINNER Victor wrote:
Does the API doc say anything about the GIL, for example? Or Valgrind?
I have no interest in going back and forth between the Python tracker [...]
cryptography: good
"4 failed, 77064 passed, 3096 skipped in 405.09 seconds". 1 error is related to the version number (probably an issue in how I run the tests), 3 errors are FileNotFoundError related to cryptography_vectors. At least, there is no Python fatal error related to memory allocators ;-) -- Hum, just in case, I checked my venv:

(ENV) haypo@smithers$ python -c 'import _testcapi; _testcapi.pymem_api_misuse()'
(ENV) haypo@smithers$ python -c 'import _testcapi; _testcapi.pymem_buffer_overflow()'

It works ;-)
2016-03-09 18:28 GMT+01:00 Antoine Pitrou <report@bugs.python.org>:
For the GIL, yes, the Python 3 doc is explicit; there is a red and bold warning: "The GIL must be held when using these functions." Hum, sadly it looks like the warning is missing from the Python 2 doc. The GIL was the motivation to introduce the PyMem_RawMalloc() function [...] For Valgrind: using bpo-26516, you will be able to use [...]
I modified Python to add assert(PyGILState_Check()); in PyMem_Malloc() and other functions. Sadly, I found a bug in numpy: numpy releases the GIL for performance but calls PyMem_Malloc() with the GIL released. I proposed a fix. I guess the fix is obvious and will be quickly merged, but it means that other libraries may have the same issue. Using bpo-26516 (PYTHONMALLOC=debug), we can check PyGILState_Check() at runtime, but there is currently an issue related to sub-interpreters: the assertion fails in support.run_in_subinterp(), a function used by test_threading and test_capi for example.
pymalloc.patch: updated patch.
I created bpo-26558 to implement GIL checks in PyMem_Malloc() and PyObject_Malloc().
I created the issue bpo-26563 "PyMem_Malloc(): check that the GIL is hold in debug hooks".
New changeset 68b2a43d8653 by Victor Stinner in branch 'default':
New changeset 104ed24ebbd0 by Victor Stinner in branch 'default':
New changeset 7acad5d8f80e by Victor Stinner in branch 'default':
I documented the change, buildbots are happy, I close the issue. |
68b2a43d8653 introduced a memory leak.

$ ./python -m test.regrtest -uall -R : test_format
Run tests sequentially
0:00:00 [1/1] test_format
beginning 9 repetitions
123456789
.........
test_format leaked [6, 7, 7, 7] memory blocks, sum=27
1 test failed:
test_format
Total duration: 0:00:01
New changeset 090502a0c69c by Victor Stinner in branch 'default':
I was very surprised to see a regression in test_format since I didn't make any change related to bytes, bytearray or str formatting in this issue. In fact, it's much better than that! With PyMem_Malloc() using pymalloc, we benefit for free from the cheap "_Py_AllocatedBlocks" memory leak detector. I introduced the memory leak in bpo-25349 when I optimized bytes%args and bytearray%args using the new _PyBytesWriter API. This memory leak gave me an idea, so I opened bpo-26850: "PyMem_RawMalloc(): update also sys.getallocatedblocks() in debug mode".
There are no more known bugs related to this change, so I close the issue. Thanks for the test_format report, Serhiy; I missed it.