Support vec zero-alloc optimization for tuples and byte arrays #97581

AngelicosPhosphoros · 2022-05-31T12:26:38Z

Implement IsZero trait for tuples up to 8 IsZero elements;
Implement IsZero for u8/i8, leading to implementation of it for arrays of them too;
Add more codegen tests for this optimization.
Lower size of array for IsZero trait because it fails to inline checks
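
For readers unfamiliar with the optimization, the following is a minimal standalone sketch of the idea, not the code from this PR. The trait here is a hypothetical stand-in for the internal `IsZero` trait, and `from_elem_zeroed` is an invented helper name; the real logic lives inside `alloc` and is reached through `vec![elem; n]`.

```rust
use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};

/// Hypothetical stand-in for the internal `IsZero` trait.
///
/// Safety: `is_zero` may only return `true` when the value is represented
/// entirely by zero bytes.
pub unsafe trait IsZero {
    fn is_zero(&self) -> bool;
}

unsafe impl IsZero for u8 {
    fn is_zero(&self) -> bool {
        *self == 0
    }
}

unsafe impl IsZero for u32 {
    fn is_zero(&self) -> bool {
        *self == 0
    }
}

// Implements `IsZero` for tuples of 1 to 8 `IsZero` elements by recursion.
macro_rules! impl_is_zero_for_tuples {
    () => {};
    ($first:ident $(, $rest:ident)*) => {
        unsafe impl<$first: IsZero, $($rest: IsZero,)*> IsZero for ($first, $($rest,)*) {
            fn is_zero(&self) -> bool {
                // Local bindings reuse the generic parameter names.
                #[allow(non_snake_case)]
                let ($first, $($rest,)*) = self;
                $first.is_zero() $(&& $rest.is_zero())*
            }
        }
        impl_is_zero_for_tuples!($($rest),*);
    };
}
impl_is_zero_for_tuples!(A, B, C, D, E, F, G, H);

/// Hypothetical helper: requests zeroed memory when the element is all zeros
/// and falls back to the ordinary clone loop otherwise.
pub fn from_elem_zeroed<T: IsZero + Clone>(elem: T, n: usize) -> Vec<T> {
    let layout = Layout::array::<T>(n).expect("capacity overflow");
    if !elem.is_zero() || layout.size() == 0 {
        return vec![elem; n];
    }
    unsafe {
        let ptr = alloc_zeroed(layout) as *mut T;
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // The buffer holds `n` zeroed elements, each a valid `T` per `IsZero`.
        Vec::from_raw_parts(ptr, n, n)
    }
}
```

With this sketch, `from_elem_zeroed((0u8, 0u32), 4)` takes the zeroed-allocation path, while `from_elem_zeroed((3u8, 0u32), 2)` falls back to the regular loop.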

rust-highfive · 2022-05-31T12:26:41Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with r? rust-lang/libs-api @rustbot label +T-libs-api -T-libs to request review from a libs-api team reviewer. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

Stabilizing library features
Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
Changing public documentation in ways that create new stability guarantees
Changing observable runtime behavior of library APIs

rust-highfive · 2022-05-31T12:26:42Z

r? @m-ou-se

(rust-highfive has picked a reviewer for you, use r? to override)

AngelicosPhosphoros · 2022-05-31T12:27:13Z

r? @Mark-Simulacrum because you reviewed similar PRs in that area before.

AngelicosPhosphoros · 2022-05-31T12:31:17Z

src/test/codegen/vec-calloc.rs

+// CHECK-LABEL: @vec_zero_bytes
+#[no_mangle]
+pub fn vec_zero_bytes(n: usize) -> Vec<u8> {
+    // CHECK-NOT: call alloc::vec::from_elem


Added more checks to catch cases where inlining doesn't happen.

AngelicosPhosphoros · 2022-05-31T12:42:27Z

What should be done about the fact that constant aggregates (tuples and arrays) larger than 8 bytes fail to fold when there is more than one invocation of the vec! macro with them in the compilation unit? Should I lower the threshold or just leave it be?
proof

P.S. Maybe adding #[inline] here would help:

rust/library/alloc/src/vec/mod.rs

Line 2422 in dcbd5f5

pub fn from_elem<T: Clone>(elem: T, n: usize) -> Vec<T> {

P.P.S. This doesn't affect it actually.
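
As an illustration of the situation described in this comment, the source shape in question might look like the sketch below: two `vec!` invocations with constant zero aggregates larger than 8 bytes in one compilation unit. The function names are hypothetical, and whether the zero check is folded away depends on the compiler version and inlining decisions, which cannot be observed from the program's behavior.

```rust
// Hypothetical functions; each element type is 16 bytes wide.
pub fn zero_tuples(n: usize) -> Vec<(u64, u64)> {
    // Constant all-zero tuple element.
    vec![(0, 0); n]
}

pub fn zero_arrays(n: usize) -> Vec<[u8; 16]> {
    // Constant all-zero byte array element.
    vec![[0; 16]; n]
}
```

Both functions return the same values whether or not the check is folded; only the generated IR differs, which is why the PR relies on codegen tests.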

jyn514 · 2022-05-31T15:30:48Z

@bors try @rust-timer queue

rust-timer · 2022-05-31T15:30:50Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2022-05-31T15:30:57Z

⌛ Trying commit 5a78f2889322806560cf87844ffd14042041d15e with merge 5b2f32200628769756745f20d6a0aea4e3bee040...

bors · 2022-05-31T17:22:23Z

☀️ Try build successful - checks-actions
Build commit: 5b2f32200628769756745f20d6a0aea4e3bee040 (5b2f32200628769756745f20d6a0aea4e3bee040)

rust-timer · 2022-05-31T17:22:25Z

Queued 5b2f32200628769756745f20d6a0aea4e3bee040 with parent 16a0d03, future comparison URL.

rust-timer · 2022-05-31T20:09:40Z

Finished benchmarking commit (5b2f32200628769756745f20d6a0aea4e3bee040): comparison url.

Instruction count

Primary benchmarks: no relevant changes found
Secondary benchmarks: 😿 relevant regressions found

	mean¹	max	count²
Regressions 😿 (primary)	N/A	N/A	0
Regressions 😿 (secondary)	1.1%	1.1%	3
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	N/A	N/A	0
All 😿🎉 (primary)	N/A	N/A	0

Max RSS (memory usage)

Results

Primary benchmarks: 😿 relevant regression found
Secondary benchmarks: 😿 relevant regression found

	mean¹	max	count²
Regressions 😿 (primary)	1.1%	1.1%	1
Regressions 😿 (secondary)	3.1%	3.1%	1
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	N/A	N/A	0
All 😿🎉 (primary)	1.1%	1.1%	1

Cycles

This benchmark run did not return any relevant results for this metric.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

¹ the arithmetic mean of the percent change
² number of relevant changes

AngelicosPhosphoros · 2022-06-01T08:57:33Z

Do the secondary tests measure the performance of the compiler itself? I didn't find anything relevant in the ctfe-stress-5 benchmark code.

Mark-Simulacrum · 2022-06-01T13:17:00Z

library/alloc/src/vec/is_zero.rs

 impl_is_zero!(i16, |x| x == 0);
 impl_is_zero!(i32, |x| x == 0);
 impl_is_zero!(i64, |x| x == 0);
 impl_is_zero!(i128, |x| x == 0);
 impl_is_zero!(isize, |x| x == 0);

+impl_is_zero!(u8, |x| x == 0);


We can add comments -- but u8/i8 are already done by https://github.com/rust-lang/rust/blob/master/library/alloc/src/vec/spec_from_elem.rs#L20-L48, I think.

Those impls seem more general than these, so I'd leave them in place rather than adding these.

Is there any reason why we want to specialize for i8/u8 instead of specializing on Copy types and probably checking for size_of::<T>() == 1? The compiler manages to replace the iteration with memset (https://godbolt.org/z/rzYYYhTKj). There is a small difference in the generated code, but it is almost the same.

The implementation in this file allows [u8; N] and [i8; N] to be IsZero too and needs less repetition than implementing IsZero for byte arrays directly.
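
To make the point about byte arrays concrete, here is a hedged sketch of how per-integer impls can combine with a generic array impl. The trait is a hypothetical stand-in for the internal one, and `MAX_CHECKED_LEN` with its value of 16 is an assumption chosen only to illustrate a bounded check.

```rust
/// Hypothetical stand-in for the internal trait; not the real `alloc` item.
pub unsafe trait IsZero {
    fn is_zero(&self) -> bool;
}

// Same call shape as the `impl_is_zero!` lines in the diff above.
macro_rules! impl_is_zero {
    ($t:ty, $is_zero:expr) => {
        unsafe impl IsZero for $t {
            #[inline]
            fn is_zero(&self) -> bool {
                $is_zero(*self)
            }
        }
    };
}

impl_is_zero!(u8, |x| x == 0);
impl_is_zero!(i8, |x| x == 0);

// Assumed bound: longer arrays are treated as "not zero" so that the
// runtime check stays cheap.
const MAX_CHECKED_LEN: usize = 16;

// One generic impl covers [u8; N], [i8; N] and nested arrays of them.
unsafe impl<T: IsZero, const N: usize> IsZero for [T; N] {
    #[inline]
    fn is_zero(&self) -> bool {
        N <= MAX_CHECKED_LEN && self.iter().all(IsZero::is_zero)
    }
}
```

Returning `false` for long arrays is always safe here: it only means the slower, non-zeroed path is taken.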

Just because it's size-1 & Copy doesn't mean it's legal to branch on the value, though.

@scottmcm I mean something like this:

impl<T: Copy> SpecFromElem for T {
    fn from_elem(elem: T, n: usize) -> Vec<T> {
        if core::mem::size_of::<T>() == 1 {
            let mut v = Vec::with_capacity(n);
            unsafe {
                // Reinterpret the single byte of `elem` as a u8.
                let byte_val: u8 = ptr::read(&elem as *const T as *const u8);
                ptr::write_bytes(v.as_mut_ptr(), byte_val, n);
                v.set_len(n);
            }
            return v;
        }
        // Default impl ...
    }
}

And probably IsZero variant for u8/i8.

It's not generally valid to read a 1-byte value as u8 (I guess maybe with MaybeUninit<u8> that could be OK?) But I'm not sure this is worth trying to optimize for ourselves; I'd hope that LLVM can lower our standard init loop. (Or can be convinced to do so). I think slice::fill for example does pretty OK without such shenanigans.

Valid, didn't think about uninit values before.
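
The `MaybeUninit<u8>` idea floated above could be sketched roughly as follows. This is an illustration under the assumption that copying the byte through `MaybeUninit<u8>` is acceptable for one-byte `Copy` types; it is not something the thread settled on, and `from_elem_one_byte` is a hypothetical name.

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Hypothetical helper: fills a Vec with copies of a one-byte `Copy` value
/// without ever reading that value as a plain `u8`.
pub fn from_elem_one_byte<T: Copy>(elem: T, n: usize) -> Vec<T> {
    assert_eq!(size_of::<T>(), 1, "sketch only handles one-byte types");
    let mut v: Vec<T> = Vec::with_capacity(n);
    unsafe {
        // `MaybeUninit<u8>` may hold a possibly uninitialized byte,
        // which a plain `u8` may not.
        let byte: MaybeUninit<u8> = ptr::read(&elem as *const T as *const MaybeUninit<u8>);
        let dst = v.as_mut_ptr() as *mut MaybeUninit<u8>;
        for i in 0..n {
            dst.add(i).write(byte);
        }
        // All `n` slots now hold bitwise copies of `elem`.
        v.set_len(n);
    }
    v
}
```

Note that this gives up `ptr::write_bytes`, which needs a real `u8`, so it ends up as an ordinary loop that the optimizer may or may not turn into a memset.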

AngelicosPhosphoros · 2022-07-12T01:07:49Z

@Mark-Simulacrum
I don't have access to any Apple device, so it seems that I won't be able to fix that failure.

Any suggestions on what I should do? Maybe just not run these tests on Apple?

bors · 2022-07-13T00:21:27Z

⌛ Testing commit fd488ccb9ad2b238eda4a318f3ceb81cbdd86eac with merge a4b83d324d0f049ad5b48fe32d13ba3c624a9c38...

bors · 2022-07-13T01:38:36Z

💔 Test failed - checks-actions

scottmcm · 2022-07-13T03:44:12Z

Since it looks like this failed twice,
@bors r-

AngelicosPhosphoros · 2022-07-24T19:23:06Z

@Mark-Simulacrum
I just added // ignore-macos to the failing test. Is that OK?

@rustbot label: +S-waiting-on-review -S-waiting-on-author

* Implement IsZero trait for tuples up to 8 IsZero elements;
* Implement IsZero for u8/i8, leading to implementation of it for arrays of them too;
* Add more codegen tests for this optimization.
* Lower size of array for IsZero trait because it fails to inline checks

Mark-Simulacrum · 2022-07-24T20:01:38Z

I dropped the calloc-2 test entirely -- I'm not sure there's much value in checking that we're eliminating the zero comparison for a constant element. The whole point of bounding its length is that it's relatively cheap. I don't know why macOS has slightly different behavior, though I would suspect CGU differences or something -- probably not worth a deep investigation.

I'm not sure the other added tests here are really necessary either, but they seem more or less OK to leave for now.

@bors r+ rollup=never

bors · 2022-07-24T20:01:40Z

📌 Commit 86d445e has been approved by Mark-Simulacrum

It is now in the queue for this repository.

bors · 2022-07-25T00:20:46Z

⌛ Testing commit 86d445e with merge babff22...

bors · 2022-07-25T02:46:29Z

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing babff22 to master...

rust-timer · 2022-07-25T04:02:16Z

Finished benchmarking commit (babff22): comparison url.

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

Primary benchmarks: no relevant changes found
Secondary benchmarks: 🎉 relevant improvement found

	mean¹	max	count²
Regressions 😿 (primary)	N/A	N/A	0
Regressions 😿 (secondary)	N/A	N/A	0
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	-3.2%	-3.2%	1
All 😿🎉 (primary)	N/A	N/A	0

Cycles

Results

Primary benchmarks: no relevant changes found
Secondary benchmarks: 🎉 relevant improvements found

	mean¹	max	count²
Regressions 😿 (primary)	N/A	N/A	0
Regressions 😿 (secondary)	N/A	N/A	0
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	-5.0%	-5.7%	4
All 😿🎉 (primary)	N/A	N/A	0

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

¹ the arithmetic mean of the percent change
² number of relevant changes

rust-highfive assigned m-ou-se May 31, 2022

rustbot added the T-libs label May 31, 2022

rust-highfive added the S-waiting-on-review label May 31, 2022

rust-highfive assigned Mark-Simulacrum and unassigned m-ou-se May 31, 2022