Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reuse RHS allocation for vec.extend(vec.into_iter()) when they do not fit into the LHS #77496

Closed
wants to merge 12 commits into from

Conversation

the8472
Copy link
Member

@the8472 the8472 commented Oct 3, 2020

I tried a broader version of this optimization in #70793 but it regressed compile times because it emitted more IR for every single use of extend() and thus had to be removed from that PR. This attempt is narrower since it only applies to cases where we're extending/appending directly from another vec.

 name                             extend-rec-baseline.b ns/iter  extend-append.b ns/iter  diff ns/iter   diff %  speedup 
 vec::bench_extend_recycle        243                            26                               -217  -89.30%   x 9.35 

The aim is to improve runtime behavior, not necessarily compile time but I still want a perf-run to make sure it's not regressing significantly unlike the original attempt.

@rust-highfive
Copy link
Collaborator

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 3, 2020
@jyn514 jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 3, 2020
@jyn514
Copy link
Member

jyn514 commented Oct 3, 2020

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 3, 2020

⌛ Trying commit b0a249415002259095836413843f80f033c3d782 with merge 097a1aa29aaea51d74064470a76793b00bb9ec11...

@the8472
Copy link
Member Author

the8472 commented Oct 3, 2020

CC @pickfire since you wanted to review vec changes

@bors
Copy link
Contributor

bors commented Oct 3, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 097a1aa29aaea51d74064470a76793b00bb9ec11 (097a1aa29aaea51d74064470a76793b00bb9ec11)

@rust-timer
Copy link
Collaborator

Queued 097a1aa29aaea51d74064470a76793b00bb9ec11 with parent 738d4a7, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (097a1aa29aaea51d74064470a76793b00bb9ec11): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@jyn514
Copy link
Member

jyn514 commented Oct 3, 2020

Nice, moderate improvements across the board :)

@jyn514
Copy link
Member

jyn514 commented Oct 3, 2020

Oh wait I was looking at bootstrap - the instruction counts seem to be about the same.

library/alloc/src/vec.rs Outdated Show resolved Hide resolved
library/alloc/src/vec.rs Outdated Show resolved Hide resolved
@the8472
Copy link
Member Author

the8472 commented Oct 4, 2020

  • Removed the part that needs further discussion
  • generalized the optimization to also apply when the LHS is not empty

@the8472 the8472 changed the title reuse allocations for vec.append(vec) and vec.extend(vec.into_iter()) when the LHS is empty reuse RHS allocation for vec.extend(vec.into_iter()) when they do not fit into the LHS Oct 4, 2020
library/alloc/src/vec.rs Outdated Show resolved Hide resolved
library/alloc/src/vec.rs Outdated Show resolved Hide resolved
/// * `offset == 0` is always valid
/// * `offset` must be positive
unsafe fn into_vec(self, offset: isize) -> Vec<T> {
let dst = unsafe { self.buf.as_ptr().offset(offset) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably use add instead of offset for usize, but I don't know if the compiler will optimize out the offset if it is 0? Does it or should we have two functions, one with offset and one without?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 is a constant in the other callsite, so it should be easy to optimize for llvm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @lzutao to see if he's interested to check this out

Comment on lines +2391 to +2392
&& self.capacity() - self.len() < iterator.len()
&& iterator.cap - iterator.len() >= self.len()
Copy link
Contributor

@pickfire pickfire Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be simplified as?

Suggested change
&& self.capacity() - self.len() < iterator.len()
&& iterator.cap - iterator.len() >= self.len()
&& iterator.len() - self.len() < iterator.cap - self.capacity()

I wonder if this will always grow the vec if the iterator is larger.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change would underflow if self is larger than iterator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underflow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically still an overflow, but that iterator.len() - self.len() would panic or wrap if, say, self.capacity() == 20_000 and self.len() == 19_950, and iterator.len() == 100.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the case where self.capacity() > iterator.cap . The subtraction would underflow the usize result and thus lead to the inequality unexpectedly evaluating to true which would then violate the safety constraints of into_vec_with_uninit_prefix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then how about?

Suggested change
&& self.capacity() - self.len() < iterator.len()
&& iterator.cap - iterator.len() >= self.len()
&& iterator.len().saturating_sub(self.len()) < iterator.cap.saturating_sub(self.capacity())

I

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a self with len == 2 && cap == 2 and an iterator len == 2 && cap == 3 that would evaluate to true and attempt to store 4 elements into an allocation of 3. 💣💥

@the8472
Copy link
Member Author

the8472 commented Oct 6, 2020

Should we try rerunning the benchmarks? I think a notable part may be gone.

Significant parts were changed since then, yes.

// ² memcpy
// ³ memmove
// ⁴ free
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That line was intentionally left blank.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think following the rest of the codebase convention here would have this blank line not have the comment //, and just be empty.

// ³ into_vec ____------BBBB____-------------- Vec(0x00, 0, 4) Vec(0x0a, 4, 8)
// ⁴ *self = v ----------BBBB____-------------- Vec(0x0a, 4, 8)
//
// ## empty self, pristine iterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even check for this case below?

By the way, nice diagram.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into_vec_with_uninit_prefix is pretty much just a struct conversion when there's nothing to move.

// ## insufficient capacity
//
// [initial] AAAAA-----BBBBBB__-------------- Vec(0x00, 5, 5) IntoIter(0x0a, 0x0a, 0x0f, 8)
// ¹² reserve(6) ----------BBBBBB__--AAAAA______- Vec(0x14, 5, 11) IntoIter(0x0a, 0x0a, 0x0f, 8)
Copy link
Contributor

@pickfire pickfire Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is this correct or maybe I think wrongly. Shouldn't reserve only have one malloc (realloc) to reallocate the memory when the alignment is the same which in this case should be the same, why does it needs a memmove?

// SAFETY: `new_size` is non-zero as `old_size` is greater than or equal to `new_size`
// as required by safety conditions. Other conditions must be upheld by the caller
old_size if old_layout.align() == new_layout.align() => unsafe {
let new_size = new_layout.size();
// `realloc` probably checks for `new_size >= old_layout.size()` or something similar.
intrinsics::assume(new_size >= old_layout.size());
let raw_ptr = GlobalAlloc::realloc(self, ptr.as_ptr(), old_layout, new_size);
let ptr = NonNull::new(raw_ptr).ok_or(AllocError)?;
if zeroed {
raw_ptr.add(old_size).write_bytes(0, new_size - old_size);
}
Ok(NonNull::slice_from_raw_parts(ptr, new_size))
},

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many cases realloc is just malloc+memcpy+free. Only in some cases it can extend the allocation in place.

}
iterator.ptr = iterator.end;
iterator.move_into(self);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cooler if the code are linked to the cool diagram.

Suggested change
iterator.move_into(self);
// Insufficient capacity
iterator.move_into(self);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes and no.

With the optimization present that is indeed all it covers. But it is the general codepath that also works without the optimization. So I don't want to give the impression that it can only handle that case.

Copy link
Contributor

@pickfire pickfire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me but I think we should run the timer again.

@the8472
Copy link
Member Author

the8472 commented Oct 15, 2020

@jyn514 poke for another perf run.

@jyn514
Copy link
Member

jyn514 commented Oct 15, 2020

@bors try @rust-timer queue

Thanks for the ping, I'd forgotten about this.

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Contributor

bors commented Oct 15, 2020

⌛ Trying commit 1885f38 with merge d99d46fbeabf815d807dea479ae158f7ae9041c2...

@bors
Copy link
Contributor

bors commented Oct 15, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: d99d46fbeabf815d807dea479ae158f7ae9041c2 (d99d46fbeabf815d807dea479ae158f7ae9041c2)

@rust-timer
Copy link
Collaborator

Queued d99d46fbeabf815d807dea479ae158f7ae9041c2 with parent b5c9e24, future comparison URL.

library/alloc/src/vec.rs Outdated Show resolved Hide resolved
@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (d99d46fbeabf815d807dea479ae158f7ae9041c2): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@the8472
Copy link
Member Author

the8472 commented Oct 16, 2020

Slight improvements in the bootstrap timings, otherwise not much change. Not sure if it's worth it.

@pickfire
Copy link
Contributor

pickfire commented Oct 16, 2020

Slight change but the memory did improve, less memory used.

@Mark-Simulacrum
Copy link
Member

I am pretty sure that the memory usage here is just noise. Noise in the 20% range is not unusual for max-rss.

I think the improvements on the bootstrap timing is also likely noise, though that's less clear.

@KodrAus
Copy link
Contributor

KodrAus commented Oct 23, 2020

@the8472 Hmm, was there a usecase you had in mind originally when optimizing this case?

@the8472
Copy link
Member Author

the8472 commented Oct 23, 2020

@KodrAus just trying to reduce allocations and memcpys. But looks like it didn't work out.

@the8472 the8472 closed this Oct 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants