gh-81392: obmalloc: eliminate limit on pool size #13934

Closed · wanted to merge 15 commits from the obmalloc-big-pools branch

Conversation

@tim-one (Member) commented Jun 10, 2019

As described in bpo-37211, this changes address_in_range() to be page-based rather than pool-based, allows pools to span any power-of-2 number of pages, and on 64-bit boxes quadruples the size of both pools and arenas.
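A minimal sketch of the idea (illustrative names, not the PR's actual code): address_in_range() masks a pointer down to a known-aligned base address and checks whether it falls inside one of obmalloc's arenas. Masking at page granularity, instead of pool granularity, is what removes the limit on pool size, since the check no longer depends on the pool's own alignment:

```c
/* Minimal sketch of a page-based address_in_range(); the names and
 * the per-page lookup helper are illustrative, not the PR's code.
 * Masking at page granularity (instead of pool granularity) is what
 * lets a pool span any power-of-2 number of pages. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SYSTEM_PAGE_SIZE 4096u
#define SYSTEM_PAGE_MASK ((uintptr_t)SYSTEM_PAGE_SIZE - 1)

struct arena { uintptr_t address; size_t size; };
extern struct arena arenas[];
extern unsigned int maxarenas;

/* Hypothetical helper: the arena index recorded for the page
 * containing p. */
extern unsigned int arena_index_of_page(uintptr_t page_base);

static bool
address_in_range(void *p)
{
    uintptr_t page_base = (uintptr_t)p & ~SYSTEM_PAGE_MASK;
    unsigned int i = arena_index_of_page(page_base);
    return i < maxarenas
        && (uintptr_t)p - arenas[i].address < arenas[i].size
        && arenas[i].address != 0;
}
```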

It would be great to get feedback from 64-bit apps that do massive amounts of small-object allocations and deallocations.

https://bugs.python.org/issue37211

@pitrou (Member) commented Jun 14, 2019

It sounds irrational to multiply sizes by 4 (why not simply 2?) when going from 32-bit to 64-bit.

More generally, this change will make freed memory less likely to be reclaimed by the system. So there is a tradeoff.

@tim-one (Member, Author) commented Jun 14, 2019

It sounds irrational to multiply sizes by 4 (why not simply 2?) when going from 32-bit to 64-bit.

If someone cares enough to pursue it, they can measure a range of possibilities. The intuition isn't just that we moved from 32- to 64-bit, but also that typical machines have far more RAM now than they did when obmalloc was first written (about 18 years ago). Even if 64-bit machines had never been created, I'd be in favor of at least doubling pool and arena sizes by now for 32-bit machines (but since I never use such machines anymore, nor does anyone I interact with, I'm not proposing to change what they use).

BTW, in related work on a different approach, Neil S saw consistent measurable speed gains from boosting arena size even more. Apparently mmap() and munmap() on Linux are expensive. 16 MiB arenas (16 times again larger than this patch) seem to have been working fine for him so far.

this change will make freed memory less likely to be reclaimed by the system. So there is a tradeoff.

Sure. But in the absence of quantification, I'm not much inclined to care. Whether arenas can be freed is mostly a matter of blind luck regardless of which sizes are used, with only one poke-&-hope heuristic employed to try to increase the likelihood of arenas emptying. I'm not worried about it. That doesn't mean I shouldn't be - but I nevertheless am not 😉.

@pitrou (Member) commented Jun 15, 2019

The intuition isn't just that we moved from 32- to 64-bit, but also that typical machines have far more RAM now than they did when obmalloc was first written (about 18 years ago).

That's true, but typical machines also run many processes at once.

Apparently mmap() and munmap() on Linux are expensive. 16 MiB arenas (16 times again larger than this patch) seem to have been working fine for him so far.

By "working fine", you mean he's fine with all Python processes allocating memory in chunks of 16 MiB and releasing those chunks only when they are perfectly empty? I'm not sure everyone would agree.

mmap/munmap is probably expensive. We should study what other allocators do. IIRC, jemalloc calls madvise() with MADV_FREE. A more sophisticated approach is a free list of arenas on which you call MADV_FREE, and when that free list overflows you just munmap() the extraneous arenas.
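A rough sketch of that two-level scheme, assuming Linux's madvise(2) with MADV_FREE (available since Linux 4.5); everything else here is illustrative:

```c
/* Sketch: retain a small free list of empty arenas whose pages have
 * been MADV_FREE'd (the kernel may reclaim them lazily), and only
 * munmap() once the list overflows.  All names are illustrative. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define ARENA_SIZE    (1024 * 1024)   /* 1 MiB, as in this PR */
#define FREE_LIST_CAP 8

static void *free_arenas[FREE_LIST_CAP];
static int nfree;

static void
release_arena(void *arena)
{
    if (nfree < FREE_LIST_CAP) {
        /* Let the kernel reclaim the pages if it wants, but keep the
         * mapping so the arena can be reused without a new mmap(). */
        madvise(arena, ARENA_SIZE, MADV_FREE);
        free_arenas[nfree++] = arena;
    }
    else {
        /* Free list is full: actually unmap the extraneous arena. */
        munmap(arena, ARENA_SIZE);
    }
}

/* Reuse a retained arena if one is available, else mmap a fresh one. */
static void *
acquire_arena(void)
{
    if (nfree > 0)
        return free_arenas[--nfree];
    void *a = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return a == MAP_FAILED ? NULL : a;
}
```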

The granularity of memory management is a delicate tradeoff. For example, Linus Torvalds still is a proponent of 4 kB pages at the HW and OS level. See this subthread.

And this quote of his is also interesting for our context:

But then when you have effectively 2GB less memory in your machine, your actual real life benchmarks will be worse because you spend more time on IO.

Consuming more memory in Python might make system-level performance worse.

@tim-one (Member, Author) commented Jun 16, 2019

Note that this PR only boosts arena size to 1 MiB - it's Neil who is usually using 16 MiB. If you're a fan of jemalloc, it appears to use the geometric mean (4 MiB) of those as its fundamental size ("chunk"):

Virtual memory is logically partitioned into chunks of size 2^k (4 MiB by default). As a result, it is possible to find allocator metadata for small/large objects (interior pointers) in constant time via pointer manipulations, and to look up metadata for huge objects (chunk-aligned) in logarithmic time via a global red-black tree.

About Torvalds, Python is not the OS. The OS has to worry about actually backing pages with physical RAM. We don't. We merely reserve address space, which doesn't get associated with actual RAM until we actually write into a page. So long as Linux sticks to 4 KiB pages, that's the granularity at which we consume RAM too, regardless of how large our arena address reservations are.

On a 64-bit box, an address reservation of 1 MiB is trivial. Even on the feeblest edition of 64-bit Windows, we can do that 8 million times (2**23) before exhausting a process's user virtual address space.
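That reserve-now, pay-on-first-write behavior is easy to see for yourself. A sketch assuming Linux and anonymous mmap (watch RSS in /proc/self/status while it runs):

```c
/* Sketch: an anonymous mapping reserves address space only; physical
 * RAM is committed page by page on first write.  Assumes Linux. */
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1024 * 1024;              /* a 1 MiB "arena" */
    char *arena = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return 1;

    /* The full 1 MiB is reserved, but RSS has barely moved.
     * Touching a page is what commits it: */
    memset(arena, 0xAB, 4096);              /* commits exactly one page */

    munmap(arena, size);
    return 0;
}
```

Make `size` gigabytes large and the program still starts instantly; only pages that are actually written cost RAM.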

Nothing in the PR directly changes the amount of RAM we actually use (note that I rearranged the pool header so that its size didn't change despite adding a new member), and for the smallest size classes RAM efficiency directly increases a tiny bit (we can, e.g., fit 6 more 16-byte objects into a 4 KiB pool than in four 1 KiB pools).

Maybe we'll be able to free less arena space, but maybe not - depends on the app. So far I've seen cases go both ways - luck of the draw.

Commit messages from subsequent pushes:

- A bunch of comments need updating. Restored the original pool & arena sizes for 32-bit boxes.
- Got rid of the distinct "page" overhead & quantization stats and folded them into the "pool" stats, since address_in_range() is page-based (rather than pool-based) now. Also other assorted comment changes.
- … inputs and outputs are correctly aligned. This should have been done from day one.
@arhadthedev (Member) commented:

@tim-one What's the fate of this PR?

@arhadthedev arhadthedev changed the title bpo-37211: obmalloc: eliminate limit on pool size gh-81392: obmalloc: eliminate limit on pool size Jun 20, 2023
@tim-one (Member, Author) commented Jun 20, 2023

This is dead. Arenas are 4x larger already now on 64-bit boxes, and so are pools if radix tree tracking is used (which it should be - there's no sane reason to keep the old code around anymore). It would probably be valuable to make both arenas and pools larger still, but I have no intent to pursue it.

@tim-one tim-one closed this Jun 20, 2023
@tim-one tim-one deleted the obmalloc-big-pools branch June 20, 2023 03:57