Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected/wasteful handling of capacity by vec.split_off(0) #119913

Closed
Zalathar opened this issue Jan 13, 2024 · 6 comments · Fixed by #119917
Closed

Unexpected/wasteful handling of capacity by vec.split_off(0) #119913

Zalathar opened this issue Jan 13, 2024 · 6 comments · Fixed by #119917
Labels
C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@Zalathar
Copy link
Contributor

I was reviewing some code that uses split_off(0) in a loop to take the contents of a buffer, while leaving the buffer's capacity intact to be reused by subsequent iterations. When I double-checked the actual behaviour of split_off(0), I was surprised to find that the vector returned by split_off(0) had a much larger capacity than was necessary to hold its contents.

Consider this example program: (playground link)

fn main() {
    let mut buffer = Vec::with_capacity(100);
    buffer.extend([1u8, 2u8]);
    println!("buffer capacity: {}", buffer.capacity());
    println!();
    
    let split_off_1 = buffer.split_off(1);
    println!("split_off_1 capacity: {}", split_off_1.capacity());
    println!("buffer capacity: {}", buffer.capacity());
    println!();

    let split_off_0 = buffer.split_off(0);
    println!("split_off_0 capacity: {}", split_off_0.capacity());
    println!("buffer capacity: {}", buffer.capacity());
    println!();
}

I expected that the split_off_1 and split_off_0 vectors would have similar capacities, since they each only hold one value.

Instead, this happened:

buffer capacity: 100

split_off_1 capacity: 1
buffer capacity: 100

split_off_0 capacity: 100
buffer capacity: 100

The single-element vector produced by split_off(1) has a tight capacity of 1, but the single-element vector produced by split_off(0) has a wasteful capacity of 100, which is the same as the original vector's capacity.

(The method's documentation doesn't make any explicit promises about the capacity of the returned vector, so this isn't necessarily a bug per se, but it is a surprising behaviour.)

@Zalathar Zalathar added the C-bug Category: This is a bug. label Jan 13, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jan 13, 2024
@Zalathar
Copy link
Contributor Author

This behaviour seems to have been introduced by #76682, as an explicit optimization of the split_at(0) case, to avoid having to move the elements into a new allocation.

That PR notes that the new behaviour is compatible with the existing documentation, and notes some concerns about wastefully reallocating lots of capacity due to an outlier in how long the list gets, but I don't see any discussion of the capacity of the returned vector specifically.

At the very least, this behaviour probably deserves some warning in the docs, since the actual behaviour wastes more capacity than the “obvious” implementation would.

@Zalathar
Copy link
Contributor Author

@rustbot label +T-libs

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jan 13, 2024
@chenyukang chenyukang self-assigned this Jan 13, 2024
@Zalathar
Copy link
Contributor Author

Looking again at the actual documentation, it says this:

Returns a newly allocated vector containing the elements in the range [at, len). After the call, the original vector will be left containing the elements [0, at) with its previous capacity unchanged.

If the returned vector is “newly allocated”, it's natural to assume that the new allocation would be reasonably tight, if not exact. But in the 0 case, the returned vector isn't newly allocated at all, since it steals its allocation from the existing vector.

@chenyukang
Copy link
Member

When at is 0, directly return original vec as other here:

return mem::replace(

it's indeed a issue, but I haven't got a better solution, this change will fix the issue, but will also introduce extra copy:

         if at == 0 {
             // the new vector can take over the original buffer and avoid the copy
-            return mem::replace(
+            let mut other = mem::replace(
                 self,
                 Vec::with_capacity_in(self.capacity(), self.allocator().clone()),
             );
+            other.shrink_to(self.len());
+            return other;
         }

@chenyukang
Copy link
Member

chenyukang commented Jan 13, 2024

split_off in VecDeque don't have this issue:

unsafe {
let (first_half, second_half) = self.as_slices();
let first_len = first_half.len();
let second_len = second_half.len();
if at < first_len {
// `at` lies in the first half.
let amount_in_first = first_len - at;
ptr::copy_nonoverlapping(first_half.as_ptr().add(at), other.ptr(), amount_in_first);
// just take all of the second half.
ptr::copy_nonoverlapping(
second_half.as_ptr(),
other.ptr().add(amount_in_first),
second_len,
);
} else {
// `at` lies in the second half, need to factor in the elements we skipped
// in the first half.
let offset = at - first_len;
let amount_in_second = second_len - offset;
ptr::copy_nonoverlapping(
second_half.as_ptr().add(offset),
other.ptr(),
amount_in_second,
);
}
}

@Zalathar
Copy link
Contributor Author

I think the realistic options are:

  • Document the possibility of unexpected capacity/allocation behaviour when at == 0
  • Remove the special handling for at == 0, and just use the same code path for split_at(0) and split_at(n)

In cases where at is unknown, I think it's surprising for 0 and 1 to behave very differently.

In cases where the user specifically wants the current split_at(0) behaviour, it may be better for the user to write std::mem::replace themselves.

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jan 13, 2024
@chenyukang chenyukang removed their assignment Jan 14, 2024
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 26, 2024
Remove special-case handling of `vec.split_off(0)`

rust-lang#76682 added special handling to `Vec::split_off` for the case where `at == 0`. Instead of copying the vector's contents into a freshly-allocated vector and returning it, the special-case code steals the old vector's allocation, and replaces it with a new (empty) buffer with the same capacity.

That eliminates the need to copy the existing elements, but comes at a surprising cost, as seen in rust-lang#119913. The returned vector's capacity is no longer determined by the size of its contents (as would be expected for a freshly-allocated vector), and instead uses the full capacity of the old vector.

In cases where the capacity is large but the size is small, that results in a much larger capacity than would be expected from reading the documentation of `split_off`. This is especially bad when `split_off` is called in a loop (to recycle a buffer), and the returned vectors have a wide variety of lengths.

I believe it's better to remove the special-case code, and treat `at == 0` just like any other value:
- The current documentation states that `split_off` returns a “newly allocated vector”, which is not actually true in the current implementation when `at == 0`.
- If the value of `at` could be non-zero at runtime, then the caller has already agreed to the cost of a full memcpy of the taken elements in the general case. Avoiding that copy would be nice if it were close to free, but the different handling of capacity means that it is not.
- If the caller specifically wants to avoid copying in the case where `at == 0`, they can easily implement that behaviour themselves using `mem::replace`.

Fixes rust-lang#119913.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2024
Rollup merge of rust-lang#119917 - Zalathar:split-off, r=cuviper

Remove special-case handling of `vec.split_off(0)`

rust-lang#76682 added special handling to `Vec::split_off` for the case where `at == 0`. Instead of copying the vector's contents into a freshly-allocated vector and returning it, the special-case code steals the old vector's allocation, and replaces it with a new (empty) buffer with the same capacity.

That eliminates the need to copy the existing elements, but comes at a surprising cost, as seen in rust-lang#119913. The returned vector's capacity is no longer determined by the size of its contents (as would be expected for a freshly-allocated vector), and instead uses the full capacity of the old vector.

In cases where the capacity is large but the size is small, that results in a much larger capacity than would be expected from reading the documentation of `split_off`. This is especially bad when `split_off` is called in a loop (to recycle a buffer), and the returned vectors have a wide variety of lengths.

I believe it's better to remove the special-case code, and treat `at == 0` just like any other value:
- The current documentation states that `split_off` returns a “newly allocated vector”, which is not actually true in the current implementation when `at == 0`.
- If the value of `at` could be non-zero at runtime, then the caller has already agreed to the cost of a full memcpy of the taken elements in the general case. Avoiding that copy would be nice if it were close to free, but the different handling of capacity means that it is not.
- If the caller specifically wants to avoid copying in the case where `at == 0`, they can easily implement that behaviour themselves using `mem::replace`.

Fixes rust-lang#119913.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants