Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user-defined allocators #538

Closed
pnkfelix opened this issue Dec 22, 2014 · 28 comments
Closed

user-defined allocators #538

pnkfelix opened this issue Dec 22, 2014 · 28 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@pnkfelix
Copy link
Member

pnkfelix commented Dec 22, 2014

We will want a user-defined allocator API in our standard library. It has been moved off of the 1.0 requirements, but it remains an important post 1.0 task.

See for reference:

Also, it is good to be aware of whatever happens with placement new and the box syntax: See e.g. #413 (though that is arguably a draft RFC that has been filed as an issue).

@ghost
Copy link

ghost commented Dec 23, 2014

I've played around with the concept a bit in Jurily/rust-allocator, but many of my use cases would be suboptimal with panic. While it might make sense to kill everything when we run out of global heap, it would be unnecessarily fragile for size-limited pools (sandbox) or allocators that don't implement some operations (arena cannot reallocate).

What exactly is the scope of this API?

@Gankra
Copy link
Contributor

Gankra commented Dec 23, 2014

All Collections and SmahtPtr types should absolutely support the custom allocator API, as far as I'm aware. This results in some... nasty... details as you're finding.

@reem
Copy link

reem commented Dec 23, 2014

I don't think it's reasonable to expect allocation errors to bubble up all the way through to users of collections and smart pointers since in the vast majority of cases the only reasonable thing to do is abort or panic and we'll just end up with a million meaningless and annoying unwraps all over most rust code. Remember, for 99.9% of Rust code which will just use the default allocator, an allocation failure means OOM and handling anything under OOM is basically impossible.

@thestinger
Copy link

an allocation failure means OOM and handling anything under OOM is basically impossible.

There's often a sane way for the application to handle it, and in some niches it must be handled. Rust isn't entirely ruling out those niches by doing this because the alloc and core crates are (mostly) usable with that constraint but it isn't a non-issue.

@reem
Copy link

reem commented Dec 23, 2014

@thestinger The allocator API should definitely support handling OOM - I'm just saying it's unreasonable for most collections and smart pointers to propagate those errors in any way other than panicking or aborting. You could still write custom collections or smart pointers which exposed the error.

@ghost
Copy link

ghost commented Dec 23, 2014

@reem You can do anything under OOM as long as you don't allocate more memory. For example, you can free memory or wait for some other thread to do so.

I find it a perfectly reasonable expectation from a modern systems language to not need to fork the standard library for the cases where that's important. It's not like we don't have enough precedents on how to solve this. We even have default type params now.

The main problem is that bubbling up errors from multiple sources is an unsolved ergonomic problem in libstd, which is just hilarious since it has access to Result, try! and i32.

@pythonesque
Copy link
Contributor

I agree with @reem. Rust can't enforce a constraint like "doesn't allocate memory" in destructors for generic collection types, and panicking can allocate memory too, so if you really need to handle OOM I don't think you can rely on the standard library anyway.

@thestinger
Copy link

Rust can't enforce a constraint like "doesn't allocate memory" in destructors for generic collection types

The inability to enforce error handling in the type system certainly doesn't prevent handling that class of error correctly. We can't enforce that other types of errors are handled correctly either...

and panicking can allocate memory too

This means that panicking is not a solution to handling out-of-memory errors, which is already a given considering that the niches where this needs to be handled are often the same ones where unwinding isn't supported / tolerable.

don't think you can rely on the standard library anyway.

The core crate already works fine, as does most of the alloc crate. The alloc crate could be improved to fully handle this situation via alternative methods, as could other areas of the standard libraries.

@pythonesque
Copy link
Contributor

My point was that for an application that cares about OOM, it won't be able to use any data structures without auditing the code to make sure it can't allocate memory in destructors, since Rust can't enforce that behavior. Once that code is audited, it has to be forked anyway, unless there are actual guarantees that that won't occur in the future. The standard library (outside of libcore and liballoc) could be changed to guarantee it wouldn't allocate in destructors as long as the generic destructors didn't allocate, I guess. I'm pretty sure that's not currently a guarantee.

@thestinger
Copy link

My point was that for an application that cares about OOM, it won't be able to use any data structures without auditing the code to make sure it can't allocate memory in destructors, since Rust can't enforce that behavior.

I don't know why you're getting caught up on destructors. Rust can't enforce effects for any functions but that doesn't mean that effects can't be documented / guaranteed.

Once that code is audited, it has to be forked anyway, unless there are actual guarantees that that won't occur in the future.

No, it just has to make the guarantee that it won't allocate. It's not possible to write any code at all if you're unable to trust APIs to uphold their contracts... this is no different than any other part of the API.

The standard library (outside of libcore and liballoc) could be changed to guarantee it wouldn't allocate in destructors as long as the generic destructors didn't allocate

The core crate doesn't allocate. There's nothing special about destructors when it comes to this. There's nothing stopping any crate from documenting the API surface that's either free of dynamic allocations or that returns errors on allocation failures.

I'm pretty sure that's not currently a guarantee.

It's already a strong guarantee in core. The fact that it isn't documented elsewhere is a flaw to fix, not a reason to avoid working on it...

@thestinger
Copy link

"it's not implemented" makes no sense as the rationale for "it shouldn't be implemented"

@pythonesque
Copy link
Contributor

Okay, you've convinced me, as long as the guarantee only holds if unwinding is turned off.

@rrichardson
Copy link

Are there plans for Box to be able to reference the allocator that allocated it? I am thinking of arena style allocators. Drop would need to be self.myalloc.free(self.ptr) instead of free(self.ptr)

@pnkfelix
Copy link
Member Author

@rrichardson one example from RFC PR 244 shows embedding "the allocator" into the allocated object. (And then the embedded "allocator" is really just a reference to the actual shared state of the allocator itself.)

I don't know how much the eventual allocator design will continue to look like RFC PR 244, but I think we do intend to continue to allow the encoding of such back-references (with the expectation that many allocators will be represented by zero-sized types to avoid unnecessary overhead in the allocated state).

@nmyers12
Copy link

nmyers12 commented Jun 3, 2015

We have a simulation benchmark, with plausible memory usage patterns, that in some circumstances runs eight times faster with custom allocators than with global malloc, on modern hardware. The program spends a negligible amount of time actually allocating and freeing memory; effectively all of the runtime difference is the result of better or worse cache locality resulting from the elements used together being allocated together, despite being allocated at widely different times.

The simulation benchmark models a system composed of subsystems that are activated, one at a time, in random order. When a subsystem is activated, it runs for a while touching memory allocated during previous activations, and doing some further allocating and freeing before returning control to the dispatcher. This usage pattern is far from unusual in real systems: imagine a web service that handles various POSTs that touch different database tables, or a client with multiple (tab or window) workspaces.

Fully integrating allocators into a foundation library has profound architectural costs, particularly as regards safety. E.g., moving an object with parts that live in one allocator into a container constructed to use another to has typically catastrophic consequences. The need to store an allocator pointer in the container and in elements adds space overhead and cache pressure. Sharing an allocator object among different containers or elements in a container presents complex ownership questions. Nonetheless, the profound performance consequences (8x!) justify quite a lot of pain. These consequences are not likely to fade away with future changes in system design.

@rrichardson
Copy link

Out of curiosity, was this in the BDE? :) Threading allocators through object hierarchies was a bit of a PITA, but sometimes useful. It is an interesting design choice, to make most objects take ptrs ot alloc instances at construction time.

Some questions: Were the allocators at each subsystem arena, slab or otherwise? Thread safe, or intended to be use thread-local? Cache locality is nice when you can get it, but if the objects are too large to fit in a couple cache lines that benefit is gone.

Fully integrating allocators into a foundation library has profound architectural costs, particularly as regards safety.
Rust has the advantage of potential compiler support, and already much better support for tracking lifetimes and object safety :) It would be tricky to enforce binding an object to its allocator without explicitly letting the object reference the allocator, which of course mucks with things like cache performance, that's 8 bytes that is best used for something else.

@nmyers12
Copy link

nmyers12 commented Jun 4, 2015

Good guess. Yes, if you clone bloomberg/bde-allocator-benchmarks here, you can see the program, benchmarks/allocators/locality.cc. It's a little program with 5 knobs. Runtimes for a series of runs are in (e.g.) benchmarks/allocators/results/intel-clang-20150521/locality.csv. You can see, in the line 4th from the bottom (arguments "-21 17 1 5 256"), a 6x speedup vs. malloc, i.e. 7.5s vs. 45s. (This is on an Ivybridge Xeon. The 8x result was from a series run on a Haswell desktop box.) It's notable that there are plenty of cases where using separate allocators was actually slower, and you would use the default global allocator in those circumstances.

Trafficking in raw pointers to allocators is admittedly retro, but the CPU doesn't care. The allocators used in the benchmarks were what we call "multipools", which segregate allocated blocks by size, and re-use freed memory only for subsequent allocations of the same size. Each block has a size prefix which Rust wouldn't need, so its gains might be larger. The allocators we used are thread-safe, but the simulation was single-threaded, and anyway time spent actually in the allocator was negligible. Most of the time in the slow runs was spent waiting on cache misses.

The amount of memory involved is a lot more than a few cache lines. We have runs that straddle L1, L2, and L3 cache size boundaries, with visible consequences. The elements allocated are 20 bytes; 28 bytes with the prefix, or two to a cache line. I think malloc overhead nowadays is the same.

There are legitimate reasons to want polymorphic allocators in real programs, so that the same container implementation (i.e. it lives in a ".o" file) may be passed allocators for different strategies in different circumstances. The strategies we find most commonly used for performance tuning are multipool and monotonic. In the latter, free is a no-op and the memory is all reclaimed when the allocator goes away; people make a buffer on the stack that is usually big enough. Polymorphic dispatch turns out to add negligible overhead with modern branch predictors.

Transitively, anything that is put in a container and uses allocation must use identically the same allocator instance as the container itself, or catastrophic things happen at free time. E.g., in a hash (string -> pair(vector, string)), the hash, the strings, and the vector need to be enforced using the same allocator instance, even though the pair does not itself do allocation. Compiler support would go a long way to making that requirement practical. It's hard to get any help from C++ on that point.

@nfrechette
Copy link

The primary usefulness of polymorphic allocators is to avoid code duplication for every allocator type. In C++ it also serves the purpose that it allows you to avoid templating everything that requires an allocator. This is very important and should definitely be considered as such.

In my experience, allocators state comes in three flavors: global (eg: jemalloc like we have now), local (eg: monotonic on the stack), and thread local (eg: stack frame allocator). While global/thread local allocators do not require a pointer to the allocator all the time, it is still required when mixed allocator strategies are used (in a function or structure) or when freeing does not happen on the same thread that allocated.

A pointer is always required when the allocator is hidden from the user via polymorphic allocators which is typically required for things like containers and high level data structures.

While having contained objects having the same allocator as the container object is probably best regarding the TLB/cache, I'm not sure if this is really an issue that needs to be enforced by the language/library. Surely the increased TLB/cache cost of having 2 or more separate allocators involved is similar in cost to polymorphic dispatch or am I missing something? I have never experienced major issues regarding mixed allocators involved with containers/containees. I presume how your containers are implemented might be a large part of the reason. The impact on the TLB is probably much more important than on the cache since data between the container/containee is unlikely to share identical cache lines.

Adding a pointer to every allocation simply to point to the owning allocator is a bit overkill. I see at least 2 other alternatives (which could be mixed): the allocator must be provided when freeing memory (using the wrong allocator is cause for panic, omitting the allocator we might assume the global heap and use that instead) or a management structure performs a look up based on the pointer address to be freed and find which allocator it belongs to. The later approach is how I believe jemalloc functions to find which segment/block of memory owns the allocation. In practice, something like a judy array might be used.

Additionally, ideally containers might not always directly store an allocator pointer inline since in practice it is rarely accessed and it will simply bloat all owner objects that contain such containers. It might be best to hide the pointer as part of the allocated memory either as a header or a footer once the memory is allocated. For example, a vector that is uninitialized has no data and no allocator assossiated, an initialized vector has no data but the data pointer points to the allocator to use (vector capacity is 0), and an allocated vector has data and the pointer is stored at data - sizeof(ptr).

Also having a single allocator interface might be hard. Alignment is a must for embedded platforms and many high performance applications, and simply aligning to natural alignment is not always desirable or possible. Further flags are also required and it is not clear how many such flags are enough or required. For example, embedded platforms can routinely specify which page size to use and other things such as write combining, cpu/gpu memory, etc. There are also the cpu/gpu access flags (which might be separate). An allocator might be created to handle a certain specific set of flags (eg: r/w gpu memory, write combining) at which point adding these in the interface might not make as much sense. Another common flag is one of lifetime. EA STL for example allows you to specify if your allocation is long lived or short lived in which case different heap sections are used to help reduce fragmentation. Adding a typed flags argument to the API might not be feasible but if it is not typed, it is certainly less safe.

To complicate things further, not all memory allocation strategies agree on what freeing memory entails. Heaps and pool type allocators typically support freeing at the pointer granularity while monotonic and frame based allocators do not and instead free in frames (typically in C++ applications, destructors are not even called on objects allocated by these allocators since it is generally not known how many objects are freed when it happens). Whether or not this is an issue is debatable.

I have begun discussing similar issues on my blog and I offer a separate C++ API that differs from jemalloc and bloombergs approach a bit in that I offer polymorphic allocators but avoid using the vtable as much as possible. Details can be found here. It is nowhere near finished nor production ready but it offers different trade-offs.

@nmyers12
Copy link

nmyers12 commented Jun 5, 2015

(I read the blog entry, but Disqus does not let me comment there... Merging allocation, deallocation, and re-sizing into one function is clever, and support for resizing is welcome, but it is probably not necessary to second-guess a virtual-function dispatch mechanism: modern CPUs cache recent destinations, tagged to addresses of branch instructions, eliminating most pipeline stalls in loops where they matter most. The vtbl itself eats just one cache line. Since each operation would, in practice, always be one or the other, merging them adds overhead. Also: "IsOwnerOf" (which seems like a long way to say "Owns") seems like it could be a very expensive operation to be obliged to support universally. Is an allocator that can't tell allowed to return false?)

In C++ we have burnt bridges that Rust has not (yet) such as the need for objects have a recoverable size. In practice making the allocator track size need not add overhead, as multiple small objects in a page can share a size annotation (located by address arithmetic) while larger objects can afford their own annotations, and their own pages. Still, it is better for the user of storage to keep track of where they got it and how much is there--presuming they can be trusted to get it right. Can forcing correctness be automated? E.g. each allocation with a static size is matched to a corresponding deallocation, and each pair of dynamic-sized operations identifies a machine word shared between them that records the size. (Tracking reallocation would need some trickery because there is temporarily a second size that, at least momentarily, lives elsewhere.) Memory allocation seems important and hazardous enough to deserve core language support to verify correctness.

The matter of contained objects obliged to "bloat" themselves with copies of an allocator identifier is difficult. In principle, objects contained could instead point to the outermost container, and keep one copy of the allocator identifier there, but that doesn't seem to save any bloat. It may be that objects which do no re-allocation during their lifetime could avoid keeping such an identifier by requiring somebody else (e.g. the container) to provide it at destruction time.

@nfrechette
Copy link

Very strange about the commenting, I'll have to double check what is up
with that.

The main idea of inlining the realloc function pointer is, as I mention in
the post, to save 1 cache miss (the vtable data) and the corresponding TLB
miss. Most allocators should be able to fit inside a single cache line or
have the hot data near the start of the object. While the branch prediction
will work regarding the code to be executed after the branch, it does
nothing regarding the vtable cache line/tlb entry (hardware prefetching
will not help here). It is perhaps extreme but it optimizes the important
case of a large number of allocations coming from containers which, in the
aggregate, will save a few cache lines and tlb entries. For the other major
portion of allocations, I intend to use placement new with the base
allocator (or typed) and simply call realloc if untyped or directly the
typed function if I can. I am aiming for the vtable to serve as a
convenience for debugging and a few exotic use cases. The cache and TLB are
precious and shouldn't be wasted where not needed (when alternatives exist
that don't have too terrible trade-offs).

Indeed, memory allocation is tricky to get right and there are many edge
cases (such as overflow and OOM) and it isn't always obvious when and where
to handle all of them. Correctness is indeed critical. I think for rust,
since it is aimed at safety and speed, allocators should support high
satefy such as overflow checks as well as disabling those checks when they
aren't needed. Ideally, this should be a template argument which will allow
specialization for particular use cases. A sensible default should be
picked as well.

Dealing with OOM is also a major pain point... I am unsure what the best
approach to this might be. C++ containers do not handle this very well so I
am not sure they are a sound source of inspiration. It might make sense to
add variants to all container functions that might allocate and return an
appropriate result (option or otherwise). In the embedded applications I
programmed, OOM was always fatal as such a simple callback to log the issue
was usually sufficient.

On Fri, Jun 5, 2015 at 7:02 PM, nmyers12 notifications@github.com wrote:

(I read the blog entry, but Disqus does not let me comment there...
Merging allocation, deallocation, and re-sizing into one function is
clever, and support for resizing is welcome, but it is probably not
necessary to second-guess a virtual-function dispatch mechanism: modern
CPUs cache recent destinations, tagged to addresses of branch instructions,
eliminating most pipeline stalls in loops where they matter most. The vtbl
itself eats just one cache line. Since each operation would, in practice,
always be one or the other, merging them adds overhead. Also: "IsOwnerOf"
(which seems like a long way to say "Owns") seems like it could be a very
expensive operation to be obliged to support universally. Is an allocator
that can't tell allowed to return false?)

In C++ we have burnt bridges that Rust has not (yet) such as the need for
objects have a recoverable size. In practice making the allocator track
size need not add overhead, as multiple small objects in a page can share a
size annotation (located by address arithmetic) while larger objects can
afford their own annotations, and their own pages. Still, it is better for
the user of storage to keep track of where they got it and how much is
there--presuming they can be trusted to get it right. Can forcing
correctness be automated? E.g. each allocation with a static size is
matched to a corresponding deallocation, and each pair of dynamic-sized
operations identifies a machine word shared between them that records the
size. (Tracking reallocation would need some trickery because there is
temporarily a second size that, at least momentarily, lives elsewhere.)
Memory allocation seems important and hazardous enough to deserve core
language support to verify correctness.

The matter of contained objects obliged to "bloat" themselves with copies
of an allocator identifier is difficult. In principle, objects contained
could instead point to the outermost container, and keep one copy of the
allocator identifier there, but that doesn't seem to save any bloat. It may
be that objects which do no re-allocation during their lifetime could avoid
keeping such an identifier by requiring somebody else (e.g. the container)
to provide it at destruction time.


Reply to this email directly or view it on GitHub
#538 (comment).

@nmyers12
Copy link

nmyers12 commented Jun 8, 2015

(When we think about performance, we don't worry about the first cache miss or the first pipeline stall. We worry about the following million accesses: how many operations do I get per stall or miss? Runtime dispatching in your pointed-to function guarantees plenty of stalls in not-unusual programs. Second-guessing your caches without comprehensive measurements is always wrong. But let's not get distracted.)

My touchpoint for allocators in containers is map : string -> pair ( string, vector ). It seems to me a meaningful distinction whether an object uses an allocator only at construction and destruction time, vs. possibly other times. In the example, the key string is presumably constant, the value maybe not. The pair doesn't know about allocators, but its components do, and they (or, maybe, whoever mutates them?) need to know the map's allocator. When the key is constructed and when it's destroyed, the container is in control and can supply an allocator. When the vector is mutated, finding and conveying to it the right allocator to use for resizing, if it doesn't store one itself, might be prohibitively tricky.

This suggests to me that mutable objects which may allocate again, after construction is over, need space to store (a pointer to) an allocator, but when the same sort of object is not mutable, it should not need to store an allocator. (This implies it has a different, but related type.) The constructor and destructor of a const object that might allocate during construction (including aggregators of such types) need an allocator argument; if mut, it might allocate at other times and need to keep a copy, too. A container or aggregator of a mutable object (anywhere down the hierarchy) needs to be equipped to ensure, on construction, that the contained objects' allocators match, and that they match whatever container it's moved into.

The type system needs to be able to answer, for any given object, whether any subobject might need an allocator, and when. Types that never use an allocator, then, are easy. Those that do need an allocator, but only at construction and destruction (such as constant objects, but many mut objects too), need to be supplied one at the right times. (Maybe some mut subobjects can share a copy if the definitions of allocating operations are visible.) Separately allocated mut subobjects that might need to allocate again each need their own copy.

C++ is stuck with keeping copies of allocator pointers everywhere. In Rust it might not be too late to do something smarter.

@nfrechette
Copy link

nfrechette commented Jun 9, 2015 via email

@nmyers12
Copy link

nmyers12 commented Jun 9, 2015

All of that sidesteps the difficult problem of ensuring that the same allocator is used throughout the container and all its elements, transitively, including elements that, at top level, appear not to need an allocator at all.

Obviously everything can be left to be done manually, with massive duplication and no help for correctness. That is what we are left with if nothing smart surfaces. The topic here is what a smart design might look like, with help from the core language if necessary.

@Ericson2314
Copy link
Contributor

This can be closed now that the RFC is accepted?

@pnkfelix
Copy link
Member Author

@Ericson2314 I think that the merge of #1398 probably means we can close this, though I want to review the thread here and make sure that other concerns that were not addressed by that RFC (e.g. swapping out the default global allocator) are addressed by either an RFC or a follow up issue.

@xgalaxy
Copy link

xgalaxy commented Dec 4, 2017

Whats the status of this? Its been three years =/

@sfackler
Copy link
Member

sfackler commented Dec 4, 2017

@Centril
Copy link
Contributor

Centril commented Feb 23, 2018

Triage ping @pnkfelix - have you reviewed this so that we can close the issue?

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Feb 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests