
In-place iteration results in too big allocations #120091

Closed
TethysSvensson opened this issue Jan 18, 2024 · 79 comments
Labels
A-collections Area: std::collections. C-discussion Category: Discussion or questions that don't represent real issues. T-libs Relevant to the library team, which will review and decide on the PR/issue.

@TethysSvensson
Contributor

(This bug report was inspired by this blog post https://blog.polybdenum.com/2024/01/17/identifying-the-collect-vec-memory-leak-footgun.html)

After #110353 landed, in-place iteration can reuse allocations in many more places. While this is a good thing, in some cases it can leave the destination vector with an overly large capacity.

Additionally, in some cases this will cause the destination vector to have a very large capacity even for non-shared allocations. In my opinion, this should never happen.

For an example see this code:

fn dbg_vec<T>(v: &Vec<T>) {
    println!(
        "vec data ptr={:?} len={} cap={}",
        v.as_ptr(),
        v.len(),
        v.capacity()
    );
}

fn main() {
    let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
    dbg_vec(&v1);
    let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
    dbg_vec(&v2);
}

On stable this code works as expected, i.e. it produces two different vectors with reasonable lengths and capacities.

On beta, however, v2 will have a capacity of 262144, even though it does not share an allocation with v1. If you remove the as u8 part, the allocations will be shared, but the capacity is still overly large.

My suggested fix is:

  • Do not attempt in-place collection if the source and destination alignments do not match, as reallocation will always be necessary. On my machine, these reallocations do not appear to return the same allocation, so we might as well allocate up front to avoid a memcpy.
  • The behavior regarding capacities in the remaining cases needs to either change or be documented somewhere.

Meta

I am running NixOS with rustup.

$ rustup --version --verbose
rustup 1.26.0 (1980-01-01)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.77.0-nightly (6ae4cfbbb 2024-01-17)`
$ rustc +stable --version --verbose
rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6
$ rustc +beta --version --verbose
rustc 1.76.0-beta.5 (f732c37b4 2024-01-12)
binary: rustc
commit-hash: f732c37b4175158d3af9e2e156142ffb0bff8969
commit-date: 2024-01-12
host: x86_64-unknown-linux-gnu
release: 1.76.0-beta.5
LLVM version: 17.0.6
@TethysSvensson TethysSvensson added the C-bug Category: This is a bug. label Jan 18, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jan 18, 2024
@TethysSvensson TethysSvensson changed the title In-place iteration does results in too big allocations In-place iteration results in too big allocations Jan 18, 2024
@the8472
Member

the8472 commented Jan 18, 2024

On my machine, these reallocations do not appear to return the samme allocation, so we might as well allocate up front to avoid a memcpy.

The pointer not being the same does not necessarily mean the allocation hasn't been reused. The allocator can call mremap to move it to a new virtual address, which has a much lower cost than actually copying the bytes or faulting in new pages. This may also depend on the allocation size: for smaller pool-based allocations it might be an actual memmove, and an mremap for larger ones.

The behavior regarding capacities in the remaining cases needs to either change or be documented somewhere.

Well, this is a non-guaranteed optimization so we can't document the exact behavior. But it probably makes sense to mention that something like this may happen.

Additionally, in some this will cause the destination vector to have a very large capacity even for non-shared allocations. In my opinion, this should never happen.

That's intended to allow vec-allocation recycling even when the type changes, which can mean the excess factor is infinite.

@the8472 the8472 added A-collections Area: std::collections. T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jan 18, 2024
@TethysSvensson
Contributor Author

On my machine this does result in a memmove. This is the strace output of an execution:

29839 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50437bd000
29839 write(1, "vec data ptr=0x7f50437bd010 len="..., 44) = 44
29839 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f504377c000
29839 munmap(0x7f50437bd000, 266240)    = 0
29839 write(1, "vec data ptr=0x7f504377c010 len="..., 47) = 47
29839 munmap(0x7f504377c000, 266240)    = 0

@TethysSvensson
Contributor Author

This extra capacity also has negative performance impacts:

fn test_func() {
    let v1 = (0u16..0x4000).map(|i| [i; 32]).collect::<Vec<_>>();
    let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
    // Prevent the optimizer from removing the allocation
    assert_eq!(v2[0], 0);
}

fn main() {
    let mut now = std::time::Instant::now();
    for _ in 0..10_000 {
        test_func();
    }
    println!("{}", now.elapsed().as_millis());
}

This code takes more than 10 times as long on my machine if it was compiled with the beta compiler rather than the stable one.

@kevincox
Contributor

kevincox commented Jan 18, 2024

It seems that, much like capacity growth has a scaling factor, there should be some sort of limit on the unused capacity that optimizations like this can retain. Maybe there should be a general rule for Vec such as: "The capacity will never automatically grow to more than 2x (or some other number) the maximum number of elements that have been part of the Vec since the last explicit capacity operation."

So the manual capacity operations still work, but automatic capacity growth and allocation will not use more than some constant factor of the in-use memory.

It seems obvious to me that some rule is required here. If the next version of Rust made push allocate a thousand times extra capacity, it would "break" many programs. So there is an implicit contract here, and it may make sense to make it explicit. IDK what number we should promise, maybe 2x or 4x, but it would be nice to write down some guarantee that can be relied on.

If we have that rule spelt out then we can decide if this operation should follow the rule, or have some sort of exception. But I think it should probably follow it. (Or maybe have a slightly relaxed threshold.)

@the8472
Member

the8472 commented Jan 18, 2024

It seems obvious to me that some rule is required here.

Not necessarily. The capacity has already been allocated, so the cost has been paid. And vecs never shrink by themselves, even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.

@kevincox
Contributor

I agree that this is a bit of a weird case, but I would argue that from the user's perspective this is a "new Vec". So it is somewhat surprising that it is holding on to arbitrary amounts of memory from another vector. It would be nice to provide some guarantees.

even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.

Please note that in my rule I said "maximum elements since ...". This would allow the current behaviour of pop not freeing memory. In other words, my rule only controls growth; it never implies that the Vec will shrink except through explicit capacity operations such as shrink_to_fit.

@majaha
Contributor

majaha commented Jan 18, 2024

It seems obvious to me that some rule is required here.

Not necessarily. The capacity has already been allocated, so the cost has been paid. And vecs never shrink by themselves, even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.

The cost of an allocation is paid over its entire lifetime: its memory cannot be used by something else.

I pretty much agree with @kevincox and was about to post something similar: there should be a note in the docs for .collect::<Vec<_>>() stating that the resulting capacity can never be more than twice the length. (Or wherever in the docs would cover any other similar cases.)

@the8472
Member

the8472 commented Jan 18, 2024

"surprising" is a matter of expectations. Different people have different expectations. E.g. some people have expressed surprise when collect doesn't reuse allocations.

And as I have mentioned in a previous comment using a factor-based heuristic doesn't help if you're intentionally doing something like vec_t.filter_map(|_| None).collect::<Vec<U>> to keep the allocation. Allocation recycling has been requested before.

There should be a note in the docs for .collect::<Vec<_>>() that states that the resulting capacity can never be more than twice the length.

Vec does not guarantee that anywhere for any operation. Quite the opposite:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.

@kevincox
Contributor

Small note: I think that collect::<Vec<_>> should probably have some sort of exception for size_hint. For example, if size_hint says a minimum size of 1k elements but the iterator ends after 2, I think it is acceptable for the capacity to be 500x the actual used space. But I would document this as an exception to the general rule, since it only occurs if the implementation of size_hint is broken.

@the8472
Member

the8472 commented Jan 18, 2024

If your size hint is incorrect we could just panic. You're breaking an API contract. It's called "hint" because it's not required to be tight, not because you're allowed to return garbage.

@kevincox
Contributor

I don't think anyone is claiming that there is currently a guarantee. Personally I am claiming that there should be a guarantee because users are relying on some sort of reasonable memory usage, and it would be nice to actually promise that.

@kevincox
Contributor

if you're intentionally doing something like vec_t.filter_map(|_| None).collect::<Vec<U>> to keep the allocation.

Maybe this would be best provided by a dedicated API rather than this weird-looking trick? That would make it explicit and obvious to readers.

@the8472
Member

the8472 commented Jan 18, 2024

Well, there are comments here and there telling people to use just that if they want the recycling behavior... so you could say people also rely on that behavior and have different expectations.

@jhorstmann
Contributor

Looking at the assembly for the following two functions

pub fn test_array_u8(input: Vec<[u8; 8]>) -> Vec<u8> {
    input.into_iter().map(|x| x[0] as u8).collect()
}

pub fn test_array_u16(input: Vec<[u16; 4]>) -> Vec<u8> {
    input.into_iter().map(|x| x[0] as u8).collect()
}

It seems to me that on beta, the first one (where the element type matches the result type) works completely in place, without allocations. So here the resulting capacity being bigger would be expected.

The second function also first calculates the result in place, but then seems to do an allocation and copying of the data. So there might be a bug or missed optimization in the size calculation for this new allocation.

With rust 1.75, both look like they first allocate a new location for the result.

@the8472
Member

the8472 commented Jan 18, 2024

That's unexpected. It should be calling into realloc, not memcpy.

@LegionMammal978
Contributor

LegionMammal978 commented Jan 18, 2024

The pointer not being the same does not necessarily mean the allocation hasn't been reused. The allocator can call mremap to move it to a new virtual address which has a much lower cost than actually copying the bytes or faulting in new pages. This may also depend on the allocation size. For smaller pool-based allocations it might be an actual memmove and a mremap for larger ones.

This is not really relevant, since it's impossible to reuse an allocation with a different alignment in stable Rust. <System as Allocator>::shrink() always falls back to allocate() + copy_nonoverlapping() + deallocate() if the alignments aren't exactly the same. Thus, a custom Allocator implementation would be necessary for shrink() to even plausibly change the alignment.

@the8472
Member

the8472 commented Jan 18, 2024

Yeah I think I was testing with jemalloc.

@Voultapher
Contributor

Voultapher commented Jan 18, 2024

I agree with others in this thread. The behavior is quite surprising to me, which is generally a bad thing for a language that advertises itself as reliable, even more so if you plan on building system abstractions with it. I think at the very least there has to be an upper limit to the reuse; anything above 2x seems like a footgun to me, as the original post's author already encountered.

The property that it will reuse allocations across different alignments also seems quite surprising to me.

@jhorstmann
Contributor

That's unexpected. It should be calling into realloc, not memcpy

Might be the default implementation of the global allocator for shrink. Since the alignment is different, it ends up in the default branch and has to allocate/copy/deallocate.

@Istvan91

Might be the default implementation of the global allocator for shrink. Since the alignment is different it ends in the default branch and has to allocate/copy/deallocate.

Which seems unnecessary if the target type is overaligned? I think this should only be necessary when the target type is under- or mis-aligned, which is never the case if the target type is smaller than the source type.

@LegionMammal978
Contributor

LegionMammal978 commented Jan 18, 2024

Which seems unnecessary if the target type is overaligned? I think this should only be necessary when the target type is under/mis-aligned. Which is never true if the target type is smaller than the source type.

libc allocators generally don't provide an aligned_realloc() to go alongside posix_memalign() or aligned_alloc(). Thus, any allocation with an alignment greater than the guaranteed malloc() alignment has to be manually reallocated via memcpy(). This is implemented as the realloc_fallback() function in library/std/src/sys/common/alloc.rs.

@matthieu-m

matthieu-m commented Jan 18, 2024

@the8472

First of all, I'd like to say that the core idea of this optimization is a good one. In general, mapping in place is much less costly than allocating, mapping across, then deallocating the old allocation. Avoiding a new allocation and avoiding the resulting cache traffic is good.

With that being said, just because the idea is good in general does not mean it is helpful in every case, and the case demonstrated in the blog post, where over 100x the necessary memory is retained, is definitely unhelpful.

The documentation could be improved, but:

  1. Users don't read the documentation.
  2. There's a lot of collect in the wild, written prior to this optimization, which may suddenly start exhibiting such an unhelpful behavior.

So, let's focus on expectations:

  • Is it reasonable for a user to complain if the memory wasn't reused? I'd argue YES. It seems reasonable, and I'm glad this optimization makes the reuse happen.
  • Is it reasonable for a user to complain if the resulting Vec didn't preserve the 100x excess capacity? I'd argue NO. This is way beyond the usual Exponential Growth pattern, and usually only occurs when explicit steps are taken -- reserve, or popping elements out.

You mention that in case of alignment mismatch -- when the new alignment is less than the old -- the implementation calls mremap.

I'd suggest that in case of extreme excess capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit. This has several benefits:

  1. In most cases of no excess allocation, the allocation is reused and untouched. Perfect.
  2. In cases of excess allocation, the allocation is still reused (with an in-place remap) and the likelihood is high that it'll be shrunk in place -- close to a no-op.
  3. In cases where an allocation has been shrunk, and needs to be grown again, chances are that realloc will be able to grow it in place: there was space before, so as long as no other allocation stole it, it's still there.

From a number of users affected point of view, I'd expect that saving memory for the majority of users who'd get bitten by excess capacity is worth it. The tiny fraction of users for whom preserving the entire allocation is absolutely necessary can always ask for a dedicated explicit function, or create a crate in the meantime.

@cuviper
Member

cuviper commented Jan 18, 2024

I'd suggest that in case of too much extreme capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit.

I think this is reasonable. I might even go as far as "more than 2x", since that means we have more capacity than the unspecialized doubling-growth would have produced.

@the8472
Member

the8472 commented Jan 18, 2024

Users don't read the documentation.

This is a very poor argument imo. It may be a fact, but it's not something we should bend over backwards to accommodate.

If everyone programs based on vibes then we can just delete the docs. API contracts exist because people's expectations are not universal and thus won't lead to consistent outcomes. If you code to assumptions, you build on a foundation of sand. Sometimes people assume very silly things; sometimes they assume reasonable things that happen to be false because they conflict with other reasonable things.
In many cases the docs are intentionally vague to preserve implementation flexibility.

There's a lot of collect in the wild, written prior to this optimization, which may suddenly start exhibiting such an unhelpful behavior.

The optimization has existed since 2020; it was only recently extended to a few more cases. I agree that there are some issues with the generalization, but the fundamental approach has already worked nicely for a while.
Now that it's more broadly applicable, someone has encountered an edge case, ok, that happens. But that doesn't mean that just because someone wrote a blog post it now must be changed.

That the system allocator doesn't support alignment-changing realloc is an issue worth considering though.

Is it reasonable for a user to complain if the resulting Vec didn't preserve the 100x excess capacity?

That depends on what counts as excess capacity. If you keep reusing a vec then there is no excess; the capacity will be filled up in the next round. So this is use-dependent.

I'd suggest that in case of too much extreme capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit.

As I have said several times, basing this on a factor (such as 4x) means that clear-and-recycle-allocation uses become impossible with this approach, even though we have told people to do it this way.

We can change that of course, but then we're just breaking a different set of users which currently aren't making noises. Just because one set is visible doesn't mean the other set doesn't exist.

@kevincox
Contributor

kevincox commented Jan 19, 2024

Such an assumption seems reasonable to me, matching the behavior of Vec::retain().

Personally, my expectations of retain are very different than of collect. retain takes &mut self, so I see it as the same object and expect the capacity to be unchanged. collect, in contrast, seems to create a new object. The docs don't explicitly say it creates a new object, but imply it with phrases like "create instances" and "Using collect() to make a String".

For example, it is very surprising that calling this function with different argument types can use hugely different amounts of memory for the return value. Note that the function that creates the vector has no idea whether it will return something with lots of extra capacity.

fn test(i: impl IntoIterator<Item=u32>) -> Vec<u32> {
    i.into_iter().filter(|_| false).collect()
}

fn main() {
    eprintln!("vec: {}", test(vec![1; 1000]).capacity()); // 1000
    eprintln!("vec: {}", test([1; 1000]).capacity()); // 0
}

Playground

Personally, my mental model of collect::<Vec<_>>() looks something like the code below. I expect this is similar to many users' expectations (or maybe they don't even consider the upfront reserve).

fn my_collect<T>(i: impl Iterator<Item=T>) -> Vec<T> {
    let mut out = Vec::with_capacity(i.size_hint().0);
    for e in i {
        out.push(e);
    }
    out
}

And I appreciate that it is nice to add allocation reuse here, even if it is user-perceptible. But, as others have said, holding on to "arbitrary" amounts of extra memory is very surprising.

My preferred outcome would be:

  1. Add a general rule for Vec that automatic allocations will use at most some constant factor of the in-use memory.
  2. Update collect to shrink_to_fit if that rule would be violated due to allocation reuse.
  3. Add an explicit API for attempting to reuse an allocation to a new Vec with a potentially different type.

I think all of these are quite important changes:

  1. This makes implicit promises that users rely on well-defined so that everyone can be in agreement.
  2. Avoids surprising memory usage by "new" Vec.
  3. Allows users who want to reuse allocations do so in a reliable and explicit way.

I think one potentially missing piece is how to handle reliable allocation reuse while changing the type. IIUC this can only be done if the new type's size is <= the original type's size. But this would need a more specific API if we wanted to make it explicit.

@cuviper
Member

cuviper commented Jan 19, 2024

It's just not possible for collect() to know what uses will happen downstream of it. They might use all of the capacity or none of it. If we add an automatism the user can't opt out of it because the cost will already have been paid at that point.

Right, we can't know -- but that's why I think this optimization should tread closer to as-if behavior. Limiting to 2 * len is roughly that, but actually push's growth would have a capacity more like len.next_power_of_two(). (Unless we consider exact-size with_capacity results, then it would be shrink_to_fit.)

@kevincox
Contributor

kevincox commented Jan 19, 2024

I think the "as-if" comparison is a great one! I believe that we should keep the behaviour similar to what is "expected" by the simple model of these combinators. Managing to perform clever optimizations is great when they are all upsides. But when clever optimizations accidentally bite you it really sucks. So I'd rather make sure that the optimizations are improvements in the vast majority of use cases. If we have optimizations that may be helpful or may be hurtful it is probably better to allowing the user to opt-in explicitly, even if generally the benefit is positive. Otherwise seemingly benign changes to the code like replacing a for x in i { vec.push(x) } with a collect will have surprising consequences.

I get that this is definitely a grey area, because it is impossible to define exactly what users expect and because Vec capacity is not exactly defined. However, there are some expectations here that users reasonably depend on (such as push not allocating 1000x more capacity than len), and I think the attention this ticket has received shows that even if this is a rare case, it is surprising to many users. So I would default to the unsurprising behavior, even if it isn't optimal in other cases.

@the8472 the8472 added the I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. label Jan 19, 2024
@matthieu-m

On the topic of explicit APIs, I was thinking that a collect_with_capacity may be missing from the APIs here.

Many collections today allow hinting at the capacity and reserving at least that much. That is, if I know I will keep using a Vec or HashMap I collect into, I'd like to be able to hint the capacity I guess I am going to need, and depending on the use case be conservative or aggressive in the estimate.

Today, ensuring the target capacity is possible in two steps:

let mut v = Vec::with_capacity(x);
v.extend(some_iterator);

A minor issue is that it is two steps, which leads to the major issue that allocation reuse is not possible.

A collect_with_capacity API would, however, address both issues:

  • Single step.
  • Allow allocation reuse -- without undue excess.

Further, because the target capacity is known in advance, inlining collect_with_capacity allows the compiler to optimize the selection of which collection algorithm to use, as it knows upfront whether it's going to need to extend, shrink, or do nothing.

It does still suffer from one disadvantage: there is no guarantee it will reuse the underlying allocation. This is annoying when reuse is necessary for performance reasons.

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Jan 20, 2024
…viper

Remove alignment-changing in-place collect

This removes the alignment-changing in-place collect optimization introduced in rust-lang#110353
Currently stable users can't benefit from the optimization because GlobalAlloc doesn't support alignment-changing realloc and neither do most POSIX allocators. So in practice it has a negative impact on performance.

Explanation from rust-lang#120091 (comment):

> > You mention that in case of alignment mismatch -- when the new alignment is less than the old -- the implementation calls `mremap`.
>
> I was trying to note that this isn't really the case in practice, due to the semantics of Rust's allocator APIs. The only use of the allocator within the `in_place_collect` implementation itself is [a call to `Allocator::shrink()`](https://github.com/rust-lang/rust/blob/db7125f008cfd72e8951c9a863178956e2cbacc3/library/alloc/src/vec/in_place_collect.rs#L299-L303), which per its documentation [allows decreasing the required alignment](https://doc.rust-lang.org/1.75.0/core/alloc/trait.Allocator.html). However, in stable Rust, the only available `Allocator` is [`Global`](https://doc.rust-lang.org/1.75.0/alloc/alloc/struct.Global.html), which delegates to the registered `GlobalAlloc`. Since `GlobalAlloc::realloc()` [cannot change the required alignment](https://doc.rust-lang.org/1.75.0/core/alloc/trait.GlobalAlloc.html#method.realloc), the implementation of [`<Global as Allocator>::shrink()`](https://github.com/rust-lang/rust/blob/db7125f008cfd72e8951c9a863178956e2cbacc3/library/alloc/src/alloc.rs#L280-L321) must fall back to creating a brand-new allocation, `memcpy`ing the data into it, and freeing the old allocation, whenever the alignment doesn't remain exactly the same.
>
> Therefore, the underlying allocator, provided by libc or some other source, has no opportunity to internally `mremap()` the data when the alignment is changed, since it has no way of knowing that the allocation is the same.
@hgomersall

hgomersall commented Jan 20, 2024

On the topic of explicit APIs, I was thinking that a collect_with_capacity may be a missing in the APIs here.

What would that do if the target type doesn't have a concept of capacity? Would this be based on a new trait like FromIteratorWithCapacity?

Edit: I guess I'm still trying to work out if the problem is a Vec problem, or is it more general. Can we allow downstream types to make use of the memory, or is that just too far out of scope?

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Jan 20, 2024
…viper

Remove alignment-changing in-place collect

rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 20, 2024
Rollup merge of rust-lang#120116 - the8472:only-same-alignments, r=cuviper

Remove alignment-changing in-place collect

This removes the alignment-changing in-place collect optimization introduced in rust-lang#110353
Currently stable users can't benefit from the optimization because GlobaAlloc doesn't support alignment-changing realloc and neither do most posix allocators. So in practice it has a negative impact on performance.

Explanation from rust-lang#120091 (comment):

> > You mention that in case of alignment mismatch -- when the new alignment is less than the old -- the implementation calls `mremap`.
>
> I was trying to note that this isn't really the case in practice, due to the semantics of Rust's allocator APIs. The only use of the allocator within the `in_place_collect` implementation itself is [a call to `Allocator::shrink()`](https://github.com/rust-lang/rust/blob/db7125f008cfd72e8951c9a863178956e2cbacc3/library/alloc/src/vec/in_place_collect.rs#L299-L303), which per its documentation [allows decreasing the required alignment](https://doc.rust-lang.org/1.75.0/core/alloc/trait.Allocator.html). However, in stable Rust, the only available `Allocator` is [`Global`](https://doc.rust-lang.org/1.75.0/alloc/alloc/struct.Global.html), which delegates to the registered `GlobalAlloc`. Since `GlobalAlloc::realloc()` [cannot change the required alignment](https://doc.rust-lang.org/1.75.0/core/alloc/trait.GlobalAlloc.html#method.realloc), the implementation of [`<Global as Allocator>::shrink()`](https://github.com/rust-lang/rust/blob/db7125f008cfd72e8951c9a863178956e2cbacc3/library/alloc/src/alloc.rs#L280-L321) must fall back to creating a brand-new allocation, `memcpy`ing the data into it, and freeing the old allocation, whenever the alignment doesn't remain exactly the same.
>
> Therefore, the underlying allocator, provided by libc or some other source, has no opportunity to internally `mremap()` the data when the alignment is changed, since it has no way of knowing that the allocation is the same.
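To illustrate that fallback concretely, here is a sketch (my own illustration using the stable `std::alloc` API, not code from the standard library) of what a shrink that reduces the alignment degenerates into:

```rust
use std::alloc::{alloc, dealloc, Layout};

// Sketch: shrinking an allocation while *reducing* its alignment cannot go
// through `GlobalAlloc::realloc` (which keeps the original layout's
// alignment), so it degenerates to allocate + copy + free.
unsafe fn shrink_changing_align(ptr: *mut u8, old: Layout, new: Layout) -> *mut u8 {
    debug_assert!(new.size() <= old.size());
    let new_ptr = alloc(new);
    if !new_ptr.is_null() {
        // Copy the retained prefix into the freshly aligned buffer.
        std::ptr::copy_nonoverlapping(ptr, new_ptr, new.size());
        dealloc(ptr, old);
    }
    new_ptr
}
```

The underlying allocator only ever sees an unrelated `alloc` and `dealloc` pair here, which is why it has no chance to `mremap()` in place.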
@Istvan91

I am not sure if I'm following the whole collect_with_capacity concept:

To me there are two different variants:

  1. I know exactly what capacity I want in the target type (this is what I interpret as `collect_with_capacity`)
  2. I want to keep the existing allocation, because I will still be appending to the (potential) vec without knowing how many elements. (Which I would call something like `collect_with_current_allocation`.) Calling it capacity in this case could be really confusing if the element size differs between the original T and the target U.

=> To me it seems there are two APIs missing: one where I can define the (minimum) capacity of the target collection, and one where the original collection is kept without any shrinking attempt.

@hgomersall

One where I can define the (minimum) capacity of the target collection and one where the original collection is kept without any shrinking attempt.

I think the point is that with_capacity would use the existing allocation up to capacity. It's still not explicit though. I'd prefer an API that guarantees reuse (or fails). collect_with_current_allocation would still expose the current issue.

@Istvan91

Istvan91 commented Jan 22, 2024

I think the point is that with_capacity would use the existing allocation up to capacity

This only works if the target collection is actually smaller than the capacity. With something like `.iter().flatten().collect_with_capacity(1234)` you could end up with a result that is much bigger than the specified capacity, or even bigger than the original allocation. (E.g. where you have `Vec<Vec>`, the outer vec has just a few items and the inner vecs a few thousand.)
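The unboundedness is easy to see even with plain `collect()` (the `collect_with_capacity` above is a hypothetical API): the flattened length depends only on the inner vectors, so no caller-supplied capacity can bound the result.

```rust
// A small outer vec whose inner vecs dominate the flattened size.
fn flatten_len_demo() -> (usize, usize) {
    let outer: Vec<Vec<u32>> = vec![vec![0u32; 5000]; 3];
    let outer_len = outer.len(); // only 3 elements
    let flat: Vec<u32> = outer.into_iter().flatten().collect();
    (outer_len, flat.len()) // flattened result is 3 * 5000 elements
}
```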

collect_with_current_allocation would still expose the current issue.

except the current "issue" is not a universal issue: I was proposing `collect_with_current_allocation` because I have use cases where I want this behaviour: keep the current allocation wherever possible, or expand it if necessary. (One of the main reasons why we are "here" today is that there was demand for this behaviour in the first place: the change leading to this issue was not done out of sheer boredom, but because people were asking for allocation recycling.)

I'd prefer an API that guarantees reuse (or fails).

So this would be yet another use case, potentially requiring yet another API ;)

EDIT: fix quoting

@hgomersall

hgomersall commented Jan 22, 2024

One of the main reasons why we are "here" today is that there was demand for this behaviour in the first place: the change leading to this issue was not done out of sheer boredom, but because people were asking for allocation recycling

Absolutely. It does feel to me that the problem is that there is no generic way for a type to expose its allocated buffer such that a consuming type can take over that allocation. Instead, the `collect` case on `Vec` is a std special case whose behaviour diverges from what is possible in normal user code (hence, probably, why expectations are not necessarily consistent with the behaviour).

@the8472
Member

the8472 commented Jan 22, 2024

I'd prefer an API that guarantees reuse (or fails).

I doubt that we can expose this as an API anytime soon or generically. In practice it only works on a few (although very important) types, what exactly works changes over time, relies on some unstable and very unsafe features and can't be implemented by user crates. So it's not something that really makes sense for FromIterator.

This will have to remain in the realm of non-guaranteed optimizations for a while.

A Vec::clear_and_try_recycle_allocation would be different because it's far less generic and, at least for A=Global, can already be implemented on stable with just a bit of unsafe.
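For illustration, a rough stable-Rust sketch of that idea (the function name, genericity, and signature are my own guesses at what such a helper could look like, not the actual proposal):

```rust
use std::mem::{align_of, size_of};

// Hypothetical helper: clear `v` and try to reuse its allocation for a
// `Vec<U>`. Falls back to a fresh empty Vec when the element layouts differ,
// since the recycled buffer's layout must match what `Vec<U>` would allocate.
fn clear_and_try_recycle<T, U>(mut v: Vec<T>) -> Vec<U> {
    if size_of::<T>() == size_of::<U>() && align_of::<T>() == align_of::<U>() {
        v.clear(); // drop the elements but keep the capacity
        let cap = v.capacity();
        let ptr = v.as_mut_ptr().cast::<U>();
        std::mem::forget(v); // ownership of the buffer moves to the new Vec
        // SAFETY: the buffer was allocated by the global allocator for `cap`
        // elements of a type with the same size and alignment as `U`, and the
        // new length 0 is <= cap.
        unsafe { Vec::from_raw_parts(ptr, 0, cap) }
    } else {
        Vec::new()
    }
}
```

This only needs `A = Global` and a bit of `unsafe`, which is the point being made above: it is much narrower than a guaranteed-reuse `FromIterator`.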

@Storyyeller
Contributor

Storyyeller commented Jan 26, 2024

It's unclear whether it would affect the thing that motivated the blog-post since that only contains simplified examples, not the original code that triggered this.

It was (u64, u128) -> (u32, u32), so requiring matching alignments would have prevented the optimization in my case.
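That mismatch is easy to check (a small illustrative snippet of mine; exact tuple alignment is platform-dependent, but the source tuple's alignment is at least that of `u64`, which exceeds `(u32, u32)`'s):

```rust
use std::mem::align_of;

// Alignments of the source and target element types from the collect above.
fn alignments() -> (usize, usize) {
    (align_of::<(u64, u128)>(), align_of::<(u32, u32)>())
}
```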

Nadrieril added a commit to Nadrieril/rust that referenced this issue Jan 31, 2024
document `FromIterator for Vec` allocation behaviors

[t-libs discussion](https://rust-lang.zulipchat.com/#narrow/stream/259402-t-libs.2Fmeetings/topic/Meeting.202024-01-24/near/417686526) about rust-lang#120091 didn't reach a strong consensus, but it was agreed that if we keep the current behavior it should at least be documented even though it is an implementation detail.

The language is intentionally non-committal. The previous (non-existent) documentation permits a lot of implementation leeway and we want to retain that. In some cases we even must retain it to be able to rip out some code paths that rely on unstable features.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 31, 2024
Rollup merge of rust-lang#120355 - the8472:doc-vec-fromiter, r=cuviper

document `FromIterator for Vec` allocation behaviors

[t-libs discussion](https://rust-lang.zulipchat.com/#narrow/stream/259402-t-libs.2Fmeetings/topic/Meeting.202024-01-24/near/417686526) about rust-lang#120091 didn't reach a strong consensus, but it was agreed that if we keep the current behavior it should at least be documented even though it is an implementation detail.

The language is intentionally non-committal. The previous (non-existent) documentation permits a lot of implementation leeway and we want to retain that. In some cases we even must retain it to be able to rip out some code paths that rely on unstable features.
@the8472 the8472 added C-discussion Category: Discussion or questions that doesn't represent real issues. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. C-bug Category: This is a bug. I-prioritize Issue: Indicates that prioritization has been requested for this issue. I-libs-nominated The issue / PR has been nominated for discussion during a libs team meeting. labels Jan 31, 2024
@the8472
Member

the8472 commented Jan 31, 2024

This was discussed in a libs meeting but no conclusion on whether the range of possible behaviors should be narrowed was reached. We did agree that it should at least be documented (#120355) and the recent changes warrant a release note (#120004).

Additionally #120116 removes the cases in which the attempted optimization would never have been useful on stable; those cases also caused the regression that led to the blog post.
After the removal there are still some additional cases that were not optimized on previous stable releases. We're going to let those propagate to stable and wait for feedback. The assumption is that only very few crates will need to apply shrinking.
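For crates that do care, the shrinking mentioned here is a one-liner; a sketch based on the example from the issue description (the exact pre-shrink capacity is an implementation detail, so the code only relies on documented `shrink_to_fit` behaviour):

```rust
// Collect from large array elements into small ones of the same alignment,
// then explicitly shrink away any excess capacity the in-place
// optimization may have left behind.
fn collect_then_shrink() -> (usize, usize) {
    let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
    let mut v2: Vec<u16> = v1.into_iter().map(|a| a[0]).collect();
    let before = v2.capacity(); // may be far larger than v2.len()
    v2.shrink_to_fit(); // capacity drops to (approximately) the length
    (before, v2.capacity())
}
```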

@the8472
Member

the8472 commented May 26, 2024

Documentation has been added, the case where the optimization wasn't effective has been removed, and there have been no further reports, so I'll close this for now.

@the8472 the8472 closed this as completed May 26, 2024