gh-81392: obmalloc: eliminate limit on pool size #13934

Closed · wanted to merge 15 commits from the obmalloc-big-pools branch

Conversation

@tim-one (Member) commented Jun 10, 2019

As described in bpo-37211, this changes address_in_range() to be page-based rather than pool-based, allows pools to span any power-of-2 number of pages, and on 64-bit boxes quadruples the size of both pools and arenas.
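A minimal sketch of the idea (illustrative names, not the PR's actual code): address_in_range() masks a pointer down to a known-aligned base address and checks whether it falls inside one of obmalloc's arenas. Masking at page granularity, instead of pool granularity, is what removes the limit on pool size, since the check no longer depends on the pool's own alignment:

```c
/* Minimal sketch of a page-based address_in_range(); the names and
 * the per-page lookup helper are illustrative, not the PR's code.
 * Masking at page granularity (instead of pool granularity) is what
 * lets a pool span any power-of-2 number of pages. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SYSTEM_PAGE_SIZE 4096u
#define SYSTEM_PAGE_MASK ((uintptr_t)SYSTEM_PAGE_SIZE - 1)

struct arena { uintptr_t address; size_t size; };
extern struct arena arenas[];
extern unsigned int maxarenas;

/* Hypothetical helper: the arena index recorded for the page
 * containing p. */
extern unsigned int arena_index_of_page(uintptr_t page_base);

static bool
address_in_range(void *p)
{
    uintptr_t page_base = (uintptr_t)p & ~SYSTEM_PAGE_MASK;
    unsigned int i = arena_index_of_page(page_base);
    return i < maxarenas
        && (uintptr_t)p - arenas[i].address < arenas[i].size
        && arenas[i].address != 0;
}
```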

It would be great to get feedback from 64-bit apps that do massive amounts of small-object allocations and deallocations.

https://bugs.python.org/issue37211

@pitrou (Member) commented Jun 14, 2019

It sounds irrational to multiply sizes by 4 (why not simply 2?) when going from 32-bit to 64-bit.

More generally, this change will make freed memory less likely to be reclaimed by the system. So there is a tradeoff.

@tim-one (Member, Author) commented Jun 14, 2019

It sounds irrational to multiply sizes by 4 (why not simply 2?) when going from 32-bit to 64-bit.

If someone cares enough to pursue it, they can measure a range of possibilities. The intuition isn't just that we moved from 32- to 64-bit, but also that typical machines have far more RAM now than they did when obmalloc was first written (about 18 years ago). Even if 64-bit machines had never been created, I'd be in favor of at least doubling pool and arena sizes by now for 32-bit machines (but since I never use such machines anymore, nor does anyone I interact with, I'm not proposing to change what they use).

BTW, in related work on a different approach, Neil S saw consistent measurable speed gains from boosting arena size even more. Apparently mmap() and munmap() on Linux are expensive. 16 MiB arenas (16 times again larger than this patch) seem to have been working fine for him so far.

this change will make freed memory less likely to be reclaimed by the system. So there is a tradeoff.

Sure. But in the absence of quantification, I'm not much inclined to care. Whether arenas can be freed is mostly a matter of blind luck regardless of which sizes are used, with only one poke-&-hope heuristic employed to try to increase the likelihood of arenas emptying. I'm not worried about it. That doesn't mean I shouldn't be - but I nevertheless am not 😉.

@pitrou (Member) commented Jun 15, 2019

The intuition isn't just that we moved from 32- to 64-bit, but also that typical machines have far more RAM now than they did when obmalloc was first written (about 18 years ago).

That's true, but typical machines also run many processes at once.

Apparently mmap() and munmap() on Linux are expensive. 16 MiB arenas (16 times again larger than this patch) seem to have been working fine for him so far.

By "working fine", you mean he's fine with all Python processes allocating memory in chunks of 16 MiB and releasing those chunks only when they are perfectly empty? I'm not sure everyone would agree.

mmap/munmap is probably expensive. We should study what other allocators do. IIRC, jemalloc calls madvise() with MADV_FREE. A more sophisticated approach is a free list of arenas on which you call MADV_FREE, and when that free list overflows you just munmap() the extraneous arenas.
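A rough sketch of that two-level scheme, assuming Linux's madvise(2) with MADV_FREE (available since Linux 4.5); everything else here is illustrative:

```c
/* Sketch: retain a small free list of empty arenas whose pages have
 * been MADV_FREE'd (the kernel may reclaim them lazily), and only
 * munmap() once the list overflows.  All names are illustrative. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define ARENA_SIZE    (1024 * 1024)   /* 1 MiB, as in this PR */
#define FREE_LIST_CAP 8

static void *free_arenas[FREE_LIST_CAP];
static int nfree;

static void
release_arena(void *arena)
{
    if (nfree < FREE_LIST_CAP) {
        /* Let the kernel reclaim the pages if it wants, but keep the
         * mapping so the arena can be reused without a new mmap(). */
        madvise(arena, ARENA_SIZE, MADV_FREE);
        free_arenas[nfree++] = arena;
    }
    else {
        /* Free list is full: actually unmap the extraneous arena. */
        munmap(arena, ARENA_SIZE);
    }
}

/* Reuse a retained arena if one is available, else mmap a fresh one. */
static void *
acquire_arena(void)
{
    if (nfree > 0)
        return free_arenas[--nfree];
    void *a = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return a == MAP_FAILED ? NULL : a;
}
```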

The granularity of memory management is a delicate tradeoff. For example, Linus Torvalds still is a proponent of 4 kB pages at the HW and OS level. See this subthread.

And this quote of his is also interesting for our context:

But then when you have effectively 2GB less memory in your machine, your actual real life benchmarks will be worse because you spend more time on IO.

Consuming more memory in Python might make system-level performance worse.

@tim-one (Member, Author) commented Jun 16, 2019

Note that this PR only boosts arena size to 1 MiB - it's Neil who is usually using 16 MiB. If you're a fan of jemalloc, it appears to use the geometric mean (4 MiB) of those as its fundamental size ("chunk"):

Virtual memory is logically partitioned into chunks of size 2^k (4 MiB by default). As a result, it is possible to find allocator metadata for small/large objects (interior pointers) in constant time via pointer manipulations, and to look up metadata for huge objects (chunk-aligned) in logarithmic time via a global red-black tree.

About Torvalds, Python is not the OS. The OS has to worry about actually backing pages with physical RAM. We don't. We merely reserve address space, which doesn't get associated with actual RAM until we actually write into a page. So long as Linux sticks to 4 KiB pages, that's the granularity at which we consume RAM too, regardless of how large our arena address reservations are.

On a 64-bit box, an address reservation of 1 MiB is trivial. Even on the feeblest edition of 64-bit Windows, we can do that 8 million times (2**23) before exhausting a process's user virtual address space.
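That reserve-now, pay-on-first-write behavior is easy to see for yourself. A sketch assuming Linux and anonymous mmap (watch RSS in /proc/self/status while it runs):

```c
/* Sketch: an anonymous mapping reserves address space only; physical
 * RAM is committed page by page on first write.  Assumes Linux. */
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1024 * 1024;              /* a 1 MiB "arena" */
    char *arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return 1;

    /* The full 1 MiB is reserved, but RSS has barely moved.
     * Touching a page is what commits it: */
    memset(arena, 0xAB, 4096);              /* commits exactly one page */

    munmap(arena, size);
    return 0;
}
```

Make `size` gigabytes large and the program still starts instantly; only pages that are actually written cost RAM.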

Nothing in the PR directly changes the amount of RAM we actually use (note that I rearranged the pool header so that its size didn't change despite adding a new member), and for the smallest size classes RAM efficiency directly increases a tiny bit (we can, e.g., fit 6 more 16-byte objects into a 4 KiB pool than in four 1 KiB pools).

Maybe we'll be able to free less arena space, but maybe not - depends on the app. So far I've seen cases go both ways - luck of the draw.

Commit messages from subsequent pushes:

- A bunch of comments need updating. Restored the original pool & arena sizes for 32-bit boxes.
- Got rid of the distinct "page" overhead & quantization stats and folded them into the "pool" stats, since address_in_range() is page-based (rather than pool-based) now. Also other assorted comment changes.
- … inputs and outputs are correctly aligned. This should have been done from day one.
@arhadthedev (Member) commented:

@tim-one What's the fate of this PR?

@arhadthedev arhadthedev changed the title bpo-37211: obmalloc: eliminate limit on pool size gh-81392: obmalloc: eliminate limit on pool size Jun 20, 2023
@tim-one (Member, Author) commented Jun 20, 2023

This is dead. Arenas are 4x larger already now on 64-bit boxes, and so are pools if radix tree tracking is used (which it should be - there's no sane reason to keep the old code around anymore). It would probably be valuable to make both arenas and pools larger still, but I have no intent to pursue it.

@tim-one tim-one closed this Jun 20, 2023
@tim-one tim-one deleted the obmalloc-big-pools branch June 20, 2023 03:57