Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vec's reallocation strategy needs settling #29931

Open
Gankra opened this issue Nov 19, 2015 · 46 comments

Comments

@Gankra
Copy link
Contributor

commented Nov 19, 2015

Background

Currently, Vec's documentation frequently contains the caveat that "more space may be reserved than requested". This is primarily in response to the fact that jemalloc (or any other allocator) can actually reserve more space than you requested because it relies on fixed size-classes to more effeciently dole out memory. (see the table here)

Jemalloc itself exposes a malloc_usable_size function which can be used to determine how much capacity was actually allocated for a pointer, as well as nallocx which can be used to determine how much capacity will be allocated for a given size and alignment. Vec can in principle query one of these methods and update its capacity field with the result.

The question at hand is: is this ever worth it to check usable_capacity, and is it ever undesirable?

This issue was kicked off by Firefox devs who have experience doing exactly this optimization, and claim it being profitable. Facebook has also claimed excellent dividends in making its allocation strategy more jemalloc friendly, but to my knowledge do not actually query usable_capacity.

Currently our alloction strategy is almost completely naive. One can see the source here, which boils down to:

let new_cap = if self.cap == 0 { 
    4
} else { 
    2 * self.cap 
}

To the best of my knowledge, this is all the capacity logic for FBVector, which boils down to:

let new cap = if self.cap == 0 {
    max(64 / size_of_elem, 1)   // empty
} else if self.cap < 4096 / size_of_elem) {
    self.cap * 2                // small
} else if self.cap > 4096 * 32 / size_of_elem) {
    self.cap * 2                // huge
} else {
    (self.cap * 3 + 1) / 2      // moderate
}

The only major deviation from Rust today being a 1.5 growth factor for "moderate" sized allocations. Note that there is a path in their small_vector type that queries malloc_usable_size.

Unfortunately I've been unable to find good information on what Firefox does here. The best I could find is this discussion which demonstrates big wins -- 6.5% less calls to malloc and 11% less memory usage. However the effort seems to peter out when it is asserted that this is obsoleted by Gecko just using power-of-two capacities. mxr and dxr seem to suggest it's only used for statistics.

Hopefully Firefox and Facebook devs can chime in on any experiences.

Going Forward

I have several outstanding concerns before we push further on this matter.

Rust is not C++, so I am partially suspicious of any benchmarks that demonstrate value in C++. In particular, the proliferation of size_hint and extend could completely blast away any need to be clever with capacities for most programs. I would also expect a large Rust application to frob the allocator less in general just because we default to move semantics, and don't have implicit assignment/copy constructors. Also, Rust programs are much more "reckless" with just passing around pointers into buffers because the borrow checker will always catch misuse. Slices in particular make this incredibly ergonomic. We need Rust-specific benchmarks. I'm sure the Hyper, Servo, Glium, Serde, and Rust devs all have some interesting workloads to look at.

Rust's Vec is basically the community's malloc/calloc, because actual malloc is unstable and unsafe. We explicitly support extracting the Vec's pointer and taking control of the allocation. In that regard I believe it is desirable for it to be maximally well-behaved and performant for "good" cases (knowing exactly the capacity you want, having no use for extra capacity). There's a certain value in requesting a certain capacity, and getting the capacity. Both nallocx and malloc_usable_size are virtual function calls with non-trivial logic, and may have unacceptable overhead for responsible users of Vec.

Note that anything we do must enable Vec and Box<[T]> to reliably roundtrip through each other without reallocating in certain cases. If I invoke shrink_to_fit or with_capacity, it ought to not reallocate when I try to convert to a Box<[T]>. As far as I can tell, this should be possible to uphold even when using malloc_usable_size because jemalloc is "fuzzy" and only requires that the given size is somewhere between the requested one and the usable one.

Anything we do must also work with allocators that aren't jemalloc. This may be as simple as setting usable_size = requested_size for every allocator that isn't jemalloc.

CC @pnkfelix @erickt @reem @seanmonstar @tomaka @SimonSapin @larsberg @pcwalton @Swatinem @nnethercote

CC #29848 #29847 #27627

CC @rust-lang/libs

@aturon

This comment has been minimized.

Copy link
Member

commented Nov 19, 2015

@jdm

This comment has been minimized.

Copy link
Contributor

commented Nov 19, 2015

cc me

@briansmith

This comment has been minimized.

Copy link

commented Nov 19, 2015

The question at hand is: is this ever worth it to check usable_capacity, and is it ever undesirable?

I guess you mean s/usable_capacity/malloc_usable_size/.

If I invoke shrink_to_fit or with_capacity, it ought to not reallocate when I try to convert to a Box<[T]>

I think that's a "nice to have" but not very important. Is real-world code doing shrink_to_fit and then into_boxed_slice? I searched the uses of into_boxed_slice and into_boxed_str in the Rust and Servo repos and it looks like they are rarely used in those repos. I think using those functions is a code smell, actually.

Rust's Vec is basically the community's malloc/calloc,

I disagree with this. Vec is the data structure you use by default, and it should have good performance by default. For example, its the thing I collect iterators into by default. Some people might be using Vec to simulate malloc, but I think that's much more rare. I use with_capacity(N) because I know I'm going to be pushing at least N elements into the vector and I don't want to be wasteful; I'm noever using it because I want to create a boxed slice with N elements in it. It makes sense to have a separate interface for people that want a "safe" malloc; perhaps that separate interface would just be additional stuff added to Vec. But I think the behavior of the common (existing) interface should be geared towards more naive uses.

@SimonSapin

This comment has been minimized.

Copy link
Contributor

commented Nov 19, 2015

Rust's Vec is basically the community's malloc/calloc

That’s a hack to work around the lack of proper stable API, not how it should be.

@bluss

This comment has been minimized.

Copy link
Contributor

commented Nov 19, 2015

Regardless of what we want to do now, we need to remove the "guarantee" in the nightly docs about capacity, which locks us in. We have no reason to promise all kinds of things about Vec internals.

@Ms2ger

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2015

@Swatinem

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2015

Maybe instead of actually using usable_size (which has a non-zero cost as you mentioned), the better thing to do would be to just round up to the nearest power-of-two (in bytes) for small allocations and to the nearest MiB for large ones, like mozillas nsTArray does.

I have also looked a bit at the code, and at least VecDeque rounds up to the nearest pot (in elements), probably because it does a + 1 which is like the most wasteful thing to do related to this issue.
The docs for VecDeque btw say it reserves space for "at least" that many elements, not exactly.

Btw, just grepping through the rustc code for with_capacity reveals a few cases where it does a + 1 as well.

@Gankra

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2015

VecDeque needs to maintain its power of two capacity in order to support efficient modular arithmetic (just a bit mask). HashMap does the same.

@huonw

This comment has been minimized.

Copy link
Member

commented Nov 20, 2015

FWIW, one can use constant-division tricks to do (relatively) efficient modular arithmetic, if the bit pattern is fixed (e.g. if the capacities are guaranteed to be either 1 << n or 3 << n). Of course, branching and then doing the multiplications/shifts for the latter case definitely isn't free.

(I believe x % (3 << n) can be computed via something like:

x - ((x * C) >> (n + 1)) * (3 << n)

for some constant C.)

@Gankra

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2015

Oh and they both keep some slack capacity for similar perf reasons. Hashmap will only use 91% of its cap because it keeps access times down. VecDeque keeps an empty slot so that the "two indices" representation can differentiate between the empty and full state without us maintaining some flag that constantly needs to be checked/maintained.

I would also like to slightly dissent on the point of Vec being a good malloc. It's an excellent memory-safe and exception-safe interface for acquiring, initializing, deinitializing, and releasing an allocation. It's a bad choice for lower-level uses (all the other collections), of course.

@arthurprs

This comment has been minimized.

Copy link
Contributor

commented Nov 21, 2015

It's important to note that jemalloc isn't the only allocation that exhibits this behavior (having some slack), actually many do.

I assumed we had the trio: reserve, reserve_exact and shrink_to_fit; exactly in order to allow reserve to be highly optimized and do smart reallocation, like in this case.

@larsberg

This comment has been minimized.

Copy link

commented Nov 23, 2015

I think you've got the wrong larsberg. I believe you want @larsbergstrom

Swatinem added a commit to Swatinem/rust that referenced this issue Dec 9, 2015

alloc: make RawVec round up allocation sizes to next power-of-two
The system allocator may give more memory than requested, usually rounding
up to the next power-of-two.
This PR uses a similar logic inside RawVec to be able to make use of this
excess memory.

NB: this is *not* used for `with_capacity` and `shrink_to_fit` pinding
discussion in rust-lang#29931

Swatinem added a commit to Swatinem/rust that referenced this issue Dec 10, 2015

alloc: make RawVec round up allocation sizes to next power-of-two
The system allocator may give more memory than requested, usually rounding
up to the next power-of-two.
This PR uses a similar logic inside RawVec to be able to make use of this
excess memory.

NB: this is *not* used for `with_capacity` and `shrink_to_fit` pending
discussion in rust-lang#29931

@aturon aturon removed the P-high label Jan 6, 2016

@kajalsinha

This comment has been minimized.

Copy link

commented Feb 23, 2016

hi,

I am learning rust and when I run the attached programs with parameter 100 million in Point size, Rust takes much more memory in vector. Rust compiler generated program (-O) takes 4.5 GB RAM whereas clang++ generated program (-O) takes 3GB.

I tried a swift program as well which tool 3 GB around (same as C++).

The time command output for rust and c++ respectively is:

[Rust 1.5]
MacBook-Pro:Pointrust kajal$ time cargo run --release 100000000
Running target/release/pointrust 100000000
-13091672652, 50607436937

real 0m25.189s
user 0m15.511s
sys 0m9.144s

[C++ compiled using clang++ with -O flag]
MacBook-Pro:somedir kajal$ time ./a.out 100000000
Average is 1073876718, 1073753699

real 0m21.816s
user 0m18.231s
sys 0m3.167s

Queries:

  1. Why sys time is more in rust and what it indicates?
  2. Can this 1.5 times more memory allocation be reduced to same as C++.

pointrust.zip
PointTestCpp.cpp.zip

@ticki

This comment has been minimized.

Copy link
Contributor

commented Feb 23, 2016

@kajalsinha You might want to specify the capacity when you create the vector (see with_capacity).

@kajalsinha

This comment has been minimized.

Copy link

commented Mar 3, 2016

The memory consumption remains 1.5 times even with_capacity

@hexsel

This comment has been minimized.

Copy link

commented Mar 3, 2016

@kajalsinha your C++ program is using long, Rust is using f64, right?

According to google, a long may be 32 bits, I expect f64 is 64 bits. If this is the case, you should be careful with any comparisons, integers are usually faster than floating point even for same size.

Caveat: I was never a C++ programmer, and google is occasionally wrong.

@notriddle

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2016

The C/C++ standards define long as a number with no less capacity than an int and no more capacity than a long long. That's it. On Mac OS X (which @kajalsinha is using, based on the console paste), it's one word long: 32 bits on 32-bit OSX and 64 bits on 64-bit OSX. For a fair comparison with Rust, you should use an isize.

The fact that you're using integers in C and floating point numbers in Rust also accounts for the performance difference.

@Storyyeller

This comment has been minimized.

Copy link

commented Sep 6, 2016

What is the point of even having reserve_exact(), if it provides no additional guarentees over reserve()?

@ticki

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

@Storyyeller reserve guarantees that cap is greater than or equal to the specified amount. reserve_exact guarantees that cap is equal to the specified amount, that however does not mean the allocator can't give more memory (usable_size) back.

@ollie27

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

that however does not mean the allocator can't give more memory (usable_size) back.

@ticki what does that mean if it doesn't mean that the capacity can be greater than requested?

I believe that reserve and reserve_exact make exactly the same guarantee, namely that you can push the specified number of elements onto the Vec without any reallocation. The only difference is the strategy for reallocation. reserve might try to reserve more space than requested to avoid reallocations if you push more elements than you requested. reserve_exact will only try to reserve enough space for the extra elements. This is useful to reduce memory usage if you know exactly how many more elements you need to push or at least have an upper bound.

@ticki

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

what does that mean if it doesn't mean that the capacity can be greater than requested?

To make bookkeeping of the memory more smooth, most allocators has some fixed sizes you can allocate from. If the given value is not one of these, it will round up to the nearest fitting size.

@ollie27

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

Right, but that doesn't mean that "reserve_exact guarantees that cap is equal to the specified amount". In fact the docs say that that is not the case:

Note that the allocator may give the collection more space than it requests. Therefore capacity can not be relied upon to be precisely minimal.

@ticki

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2016

@ollie27 No, the point is that the capacity is updated with the (uncanonicalised) memory returned by the allocator.

@Mark-Simulacrum

This comment has been minimized.

Copy link
Member

commented May 24, 2017

Actually, I'm going to reopen. This is something that has potential performance impacts and we should probably keep tracking.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2017

I don't see the growth factor being discussed much here.

@gankro mentions FBVector and its growth factor of 1.5. A factor of 1.5 allows reusing previously freed memory after 4 reallocations, 1.45 allows memory reuse after 3 reallocations, and 1.3 allows reuse after 2 reallocations. Which one is better is application dependent, although some claim that 1.5 is close to optimal.

What seems clear is that a growth factor of 2, that is, doubling the size of the vector on reallocation, is the worst possible growth factor that can be chosen because it never allows the vector to reuse any previously freed memory. That is, for a vector that reallocates multiple times, a growth factor of 2 means that the memory freed from these reallocations will never be usable for a future allocation of the vector.

Since Rust currently uses a growth factor of 2, a small incremental step could be to try a smaller growth factor and see if that improves memory usage. Any one will do, although 1.3, 1.45 or 1.5 make sense to me.


From the fbvector docs:

Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don't want to choose the theoretically absolute worst growth factor (2). fbvector uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector cooperates with jemalloc (below).

@steveklabnik

This comment has been minimized.

Copy link
Member

commented Sep 9, 2017

Is this even something we can change at this point?

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2017

Also, I don't know if the code of fbvector.h has changed much, but the part cited by @gankro now looks like this:

  size_type computePushBackCapacity() const {
    if (capacity() == 0) {
      return std::max(64 / sizeof(T), size_type(1));
    }
    if (capacity() < folly::jemallocMinInPlaceExpandable / sizeof(T)) {
      return capacity() * 2;
    }
    if (capacity() > 4096 * 32 / sizeof(T)) {
      return capacity() * 2;
    }
    return (capacity() * 3 + 1) / 2;
  }

It hasn't changed much, but note that this function only computes the "increased capacity on push back". This is then used in:

  size_type computeInsertCapacity(size_type n) {
    size_type nc = std::max(computePushBackCapacity(), size() + n);
    size_type ac = folly::goodMallocSize(nc * sizeof(T)) / sizeof(T);
    return ac;
  }

and in emplace_back:

template <typename T, typename Allocator>
template <class... Args>
void fbvector<T, Allocator>::emplace_back_aux(Args&&... args) {
  size_type byte_sz = folly::goodMallocSize(
    computePushBackCapacity() * sizeof(T));
   // ...
}

to predict the future capacity, but goodMallocSize has the last say on this. ~~~I can't find the source of goodMallocSize, so I don't know if it calls nallocx, usable_capacity, or does something else.~~~ EDIT, found the source of goodMallocSize:

inline size_t goodMallocSize(size_t minSize) noexcept {
  if (minSize == 0) {
    return 0;
  }

  if (!usingJEMalloc()) {
    // Not using jemalloc - no smarts
    return minSize;
  }

  return nallocx(minSize, 0);
}

// We always request "good" sizes for allocation, so jemalloc can
// never grow in place small blocks; they're already occupied to the
// brim.  Blocks larger than or equal to 4096 bytes can in fact be
// expanded in place, and this constant reflects that.
static const size_t jemallocMinInPlaceExpandable = 4096;

So currently, fbvector always calls nallocx. @gankro thoughts?


@steveklabnik neither RawVec nor RawVec::double are stabilized, and std::Vec documentation says:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.

So as long as we still guarantee O(1) amortized push we can continue to improve this.


EDIT: IMO it makes sense to rename double to grow to indicate that we grow the RawVec, and not necessarily double it, and then to switch the default capacity to 1.5, measure, use a better heuristic for small and large allocations while leaving the 1.5 heurisitc for medium sized allocations, measure, pass alignment and capacity information to the allocator to check how much memory we actually get and try to reuse most of that, measure.

@Gankra

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2017

Yes this will always be freely changeable. Someone just needs to put in the legwork to prove it's actually profitable. Keeping in mind many platforms don't use jemalloc, and that number will probably only be going down with time.

@nagisa

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2017

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Oct 16, 2017

I am working on this. If anyone interested and with experience would like to mentor me please let me know.

@mratsim

This comment has been minimized.

Copy link

commented Nov 10, 2017

@gnzlbg The optimal growth factor is probably the golden ratio Φ (1.618...) as discussed here facebook/folly#543 and in depth in this blog post.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 10, 2017

@nnethercote

This comment has been minimized.

Copy link
Contributor

commented Nov 12, 2017

I strongly advocate for a doubling strategy, as opposed to 1.5x, or the golden ratio, or any other number.

As I understand them, the arguments in favour of smaller values are based on a mistaken assumption. Let's assume we use doubling and start at 8 bytes, and then go 16, 32, 64, 128, 256, 512, 1024. For the 8..512 allocations we will have used 1016 bytes, and so we can't reuse those bytes for the 1024 bytes. With a 1.5x growth factor we would theoretically be able to reuse and combine multiple previous freed allocations to satisfy subsequent requests.

But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent, and there will be no chance of combing the for such reuse. Other allocators may behave differently, but it would be a very unusual allocator that would lay out such sequences contiguously.

So with that mistaken argument out of the way, we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 12, 2017

But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent,

Which is why the proposed strategy doubles up the memory up to the page size [*], and then switches to 1.5x.

we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.

A better growth-factor can 1) reduce the number of jemalloc calls, 2) reduce the time spent in these calls, and 3) reduce memory usage. These all can have a measurable performance impact.

But whether computing the growth factor takes 1 cycle or 20 is irrelevant compared to the cost of a single jemalloc call: nallocx alone is already over 50 cycles.

[*] Which is arguably too simple. We could probably do better than this by rounding to the next power of two, so that instead of copying the memory on growth within the same memory pool (such that the older memory can't be reused) jemalloc would instead copy the memory into the next pool freeing the old memory.

@glandium

This comment has been minimized.

Copy link
Contributor

commented Nov 13, 2017

Above the page size, the assumption is still bogus, at least in the case of jemalloc:
If you have a 4K allocation and realloc() it to 8K, and there are 8K available directly next to the original 4K, jemalloc is not going to allocate the 8K, copy the data and free the original 4K: it's just going to expand the original 4K to 8K, because there's 4K free right after it.
But this assumes you're using realloc, which I guess Vec does. Now, the big difference with C++ vectors is that C++ vectors may semantically need an actual malloc/copy/free sequence, where copy uses the copy constructor for the type stored in the vector. And in that case, the assumption does apply.
Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 13, 2017

Above the page size, the assumption is still bogus, at least in the case of jemalloc:

Yes, this is why the proposed strategy only uses 1.5x for medium sized vectors and switches the strategy again for "large" vectors (>= 32 * 4096 bytes), currently back to doubling.

But this assumes you're using realloc, which I guess Vec does.

Yes, Vec currently uses realloc but with a growth factor of 2 which means that this never happens. There is a PR that uses 1.5x for medium sized vectors and switches from realloc to realloc_excess, which in practice means that medium sized vectors will always be using a multiple of the page size.

Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.

I think that we might be able to do better than doubling for "large" vectors by tweaking the growth factor, but not much better. The PR switches back to doubling here because this is what we are currently doing. Is there any literature or recommendations from jemalloc / malloc about this?

Anyways my plan for large vectors is not to tweak the growth factor between choosing either 1.5 or 2. The main problem with "very large" vectors is the cost of allocating new memory, copying memory (they are very large), and freeing the old memory.

On any of the major 64 bit platforms (Linux, Windows, MacOs, ...) we can reduce the cost of this operations to ~0 for very large vectors by, e.g., once a vector becomes "very large" allocating 4 Gb of virtual memory (e.g. using VirtualAlloc on Windows, malloc or mmap /dev/zero on Linux, ...). That way the vector can keep growing in place, using only as much memory as it needs (as long as it doesn't touch a memory page it will not use it), without ever reallocating, copying memory, invalidating pointers, ...

I think that this will have a much larger impact on performance than tweaking the growth factor.

@glandium

This comment has been minimized.

Copy link
Contributor

commented Nov 13, 2017

Actually, C++ vectors always need an actual malloc/copy/free sequence, because

  • C++ doesn't actually use malloc/free, it uses operator new/operator delete, and doesn't actually have a realloc. Practically speaking, in many cases operator new/operator delete are equivalent to malloc/free, and realloc would work on operator new allocated buffers, but there's no guarantee by the standard that this is supposed to work.
  • std::vector doesn't even use operator new/operator delete directly, it goes through std::allocator, which also doesn't have a function to reallocate (obviously).

Yes, Vec currently uses realloc but with a growth factor of 2 which means that this never happens.

There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages). Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 13, 2017

Actually, C++ vectors always need an actual malloc/copy/free sequence, because

I don't follow the point you are trying to make. Yes, C++ vectors need to do this, but Rust's Vec does not, or what am I missing?

There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages).

Maybe we are talking past each other and you meant something different, but:

Suppose you have a contiguous chunk of the address space that is free (before and after this chunk the memory is not free), and that fits the initial content of a vector plus 4x vector regrowths.

Now suppose that after the 4th regrowth, the first part of the buffer is still free.

The difference between a growth-factor of 1.5x and 2x is that the vector with a growth-factor of 1.5x can grow a fifth time inside that buffer while the one with a growth-factor of 2x never can. This is because, for a growth-factor of 1.5, the size of the previously freed allocations in the buffer is exactly 1.5x the size of the 4th vector, while for a growth-factor of 2x, the size of the previously-freed allocations is always smaller than 2x the size of the 4th vector.

Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.

So if the memory following the current buffer is not free, anything better than the worst theoretically-possible growth-factor allows you to grow the vector one extra time before having to go look for a larger buffer.

But even if the memory following the current buffer is free, reusing the previously-freed memory reduces fragmentation, and can potentially reduce the number of jemalloc calls as well (if the free size after the 4th vector is returned as an Excess).

Or where you talking about a completely different situation?

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented May 18, 2018

Now that we are moving to the System allocator by default, I think it makes sense to at least take a look at what the C++ vector implementations do w.r.t the growth factor on the different platforms, since it could make sense that the platforms tune their growth factors for their allocators:

libc++

System implementation on MacOSX and some *BSDs, code:

//  Precondition:  __new_size > capacity()
template <class _Tp, class _Allocator>
inline _LIBCPP_INLINE_VISIBILITY
typename vector<_Tp, _Allocator>::size_type
vector<_Tp, _Allocator>::__recommend(size_type __new_size) const
{
    const size_type __ms = max_size();
    if (__new_size > __ms)
        this->__throw_length_error();
    const size_type __cap = capacity();
    if (__cap >= __ms / 2)
        return __ms;
    return _VSTD::max<size_type>(2*__cap, __new_size);
}

The growth factor is 2. Note that on *BSDs jemalloc is the default system allocator, although FBVector is tuned for jemalloc and does a couple of things differently as mentioned above (2 for tiny and large vectors, and 1.5 for the cases in which the allocation hits jemalloc pools).

libstdc++

System implementation on Linux. The code is here:

// Called by _M_fill_insert, _M_insert_aux etc.
size_type
_M_check_len(size_type __n, const char* __s) const
{
	if (max_size() - size() < __n)
	  __throw_length_error(__N(__s));

	const size_type __len = size() + std::max(size(), __n);
	return (__len < size() || __len > max_size()) ? max_size() : __len;
}

The growth factor is 2 (size + size).

MSVC

Windows. I couldn't find the code only anywhere, but all internet sources point to a factor of 1.5. This factor is also used by FBVector, Boost, etc.


Ok so what is going on here?

Well, Windows is the outlier, and that makes sense! Windows does not have overcommit, so all the memory Vec reserves is physical memory. When STL, a std lib maintainer at Microsoft, tried to changed it people complained that they preferred the lower memory consumption.

On the other hand, for Linux and MacOSX, it might well be that 2 is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?

Hell, if one wouldn't be able to disable overcommit in these systems we could just have allocations of large enough vector on 64-bit systems just jump to reserving XGb of RAM and continue with a 2x factor after that. That basically allows realloc to growth the vector in place without any memcpys.

So IMO, if we are going to change this, I'll say change this to whatever each system does. That is, 1.5 on Windows, and 2x on Linux and MacOSX. For targets that have jemalloc as a system allocator (newer Androids, *BSDs, etc.) we could consider doing what FBVector does, but IMO we should discuss that on a different issue that is specific for those systems.

@leventov

This comment has been minimized.

Copy link

commented Nov 22, 2018

Windows does not have overcommit, so all the memory Vec reserves is physical memory.

I think these are two unrelated things. Windows might not allow to overcommit, but virtual memory is still virtual. According to this article there have been some substantial improvements to how does it manage memory in Windows 8.1 and Windows Server 2012 R2, both released in 2013. The comment by STL:

Yes - but the last time I mumbled about possibly changing this, many people said they preferred the more conservative space consumption (it's not a big difference, but it's there). I'm not terribly inclined to mess with it.

Was made on Aug 30 2014 and the "last time" mentioned could well be several years before that, so likely prior to the improvements.

@leventov

This comment has been minimized.

Copy link

commented Nov 22, 2018

On the other hand, for Linux and MacOSX, it might well be that 2 is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?

The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration. The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off. Compare with FBVector's initial capacity of max(CACHE_LINE_SIZE / size_of_elem, 1).

I think it's as important not to overestimate the amount of thought put into old systems code as not to underestimate it. I might be wrong but I have an impression that it's pretty much impossible to push any changes to those policies such as vector's growth factor, hash table load factor, etc. in C++ standard library implementations because some users may depend on the existing behaviour and the priority is avoiding regressions rather than improving the average. That's probably why companies like Facebook and Google, albeit having significant power in GCC / Clang development, come along with their independent implementations. Rust still has the luxury (but probably not for long) to reconsider the old defaults and I think it should take advantage of it.

That is, I think Rust should pick up the FBVector's strategy (or something similar) on all platforms.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 22, 2018

The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off.

This doesn't really matter much if you always extend your vector to the usable_size.

The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration.

If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.

@leventov

This comment has been minimized.

Copy link

commented Nov 22, 2018

If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.

If it starts with something like FBVector's 64 / sizeof(T), yes. If no (libc++ and libstdc++ implementations that you pointed to don't, as far as I can tell, their capacity chain is always 1 -> 2 -> 4 -> ...), I don't see how does it make sense, if sizeof(T) is not a power of two.

@gnzlbg

This comment has been minimized.

Copy link
Contributor

commented Nov 24, 2018

If it starts with something like FBVector's 64 / sizeof(T), yes [...] I don't see how does it make sense, if sizeof(T) is not a power of two.

It's maybe a bit counter intuitive. Alloc::usable_size will always set the capacity of the vector at BIN_SIZE / sizeof(T), so that the vector always uses at least BIN_SIZE - sizeof(T) + 1 bytes of the allocation (if BIN_SIZE % sizeof(T) == 0 then it uses BIN_SIZE bytes). As the vector grows, BIN_SIZE grows as a function of the capacity, but sizeof(T) + 1 remains constantan

That is, a Vec that uses Alloc::usable_size() grows with factor * capacity ~= factor * BIN_SIZE, so sizeof(T) does not really play a role here.

Maybe it is a bit clearer with an example, let's suppose sizeof(T) == 13 (prime) and that our allocator only has power-of-two sizes. I'll directly jump to the cases that make this obvious:

let mut v = Vec::with_capacity(8); 
// requests 8 * 13 = 104 bytes, allocator bin:  128 bytes
// 128 / 13 = 9
assert_eq!(v.capacity(), 9);

v.extend(0..10); // push 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 9 = 18 => 234 bytes
// allocator bin: 256 bytes => 256 / 13 = 19
assert_eq!(v.capacity(), 19); 

v.extend(0..10); // push another 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 19 = 38 => 494 bytes
// allocator bin: 512 bytes => 512 / 13 = 39
assert_eq!(v.capacity(), 39);

// etc.

So that is what I meant with, if your Vec uses Alloc::usable_size, the exact growth-factor that you use is less "relevant". For this particular case, you can actually exploit this knowledge to don't need a growth factor at all.

Your allocator might not have size-classes, might lack an accurate Alloc::usable_size(), or might even be a real allocator like the one FBVector assumes (with accurate Alloc::usable_size() which FBVector uses, power-of-two size classes in a particular range, ... requiring FBVector to use multiple growth factors).

But even if you don't know anything about your allocator, a power-of-two size class assumptions makes sense. Many allocators use power-of-two size classes, and many types have a power-of-two size due to alignment requirements and fit it perfectly. If your vector starts with one such element and grows, it will do so perfectly, and if it doesn't fit a bin perfectly, it will always grow into the next bin (something that 1.5x growth-factor doesn't give you). A growth factor of 2 is also dirty cheap to compute.

If you have an accurate Alloc::usable_size, then the growth factor for small and medium allocations that fall in the allocators size classes doesn't really matter. As you move to large allocations, Alloc::usable_size just rounds up to the next multiple of a page size, and a factor of 1.5 grows less but more often, while a factor of 2 grows less often but more. Which one is best on that range really depends, but whether the arguments in favor of a 1.5 factor hold on that range (e.g. being able to reuse previously freed memory when growing), really depend on your allocator, and it is unclear to me that any allocator does actually anything clever enough to be helpful here. You might as well use a factor of 2 only with the hope of having to grow one time less than if you were using a factor of 1.5.

@leventov

This comment has been minimized.

Copy link

commented Nov 24, 2018

I should have clarified that I pointed to the lack of sizeof(T) consideration in std::vector implementations solely to make it more evident that those growth policies might be not results of deep thought or super fine tuning for specific allocators, but rather arbitrary decisions that were made decades ago and which nobody was able to or willed to change since. I mean, when you look at the relative complexity of the FBVector's growth policy, you clearly see that authors did think about that. std::vector double growth policy is simple, but it's likely not simplicity with "great wisdom behind it", so the 2.0 growth factor shouldn't receive additional credit because it appears in that std::vector impls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.