Optimize large array creation in const-eval #120069

Mark-Simulacrum · 2024-01-17T19:41:08Z

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration.

For an 8GB array (static SLICE: [u8; SIZE] = [0u8; 1 << 33];) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays.

Closes #55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to time pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).

rustbot · 2024-01-17T19:41:17Z

r? @cjgillot

(rustbot has picked a reviewer for you, use r? to override)

rustbot · 2024-01-17T19:41:19Z

Some changes occurred to the CTFE / Miri engine

cc @rust-lang/miri

Mark-Simulacrum · 2024-01-17T19:44:01Z

compiler/rustc_const_eval/src/interpret/memory.rs

+            // For particularly large arrays (where this is perf-sensitive) it's common that
+            // we're writing a single byte repeatedly. So, optimize that case to a memset.
+            if size_in_bytes == 1 && num_copies >= 1 {
+                let value = *src_bytes;


I think this is OK. We've checked above that the bytes are initialized (if !init.no_bytes_init() {), so reading the byte should be safe.

Mark-Simulacrum · 2024-01-17T19:44:15Z

@bors try @rust-timer queue

Optimize large array creation in const-eval This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration. For an 8GB array (`static SLICE: [u8; SIZE] = [0u8; 1 << 33];`) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays. Closes rust-lang#55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to *time* pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).

bors · 2024-01-17T19:45:24Z

⌛ Trying commit 4106d71 with merge 7e0886b...

bors · 2024-01-17T21:10:26Z

☀️ Try build successful - checks-actions
Build commit: 7e0886b (7e0886b50c05ca6ece6a7f27e7ce7adf54c50b37)

rust-timer · 2024-01-17T22:25:49Z

Finished benchmarking commit (7e0886b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-1.0%	[-1.1%, -0.9%]	6
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-1.2%	[-1.6%, -0.8%]	3
All ❌✅ (primary)	-	-	0

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	7.0%	[7.0%, 7.0%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-	-	0

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.5%	[-0.5%, -0.5%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.5%	[-0.5%, -0.5%]	1

Bootstrap: 663.454s -> 664.806s (0.20%)
Artifact size: 308.30 MiB -> 308.35 MiB (0.02%)

saethlin · 2024-01-17T23:10:08Z

compiler/rustc_const_eval/src/interpret/memory.rs

+            let size_in_bytes = size.bytes_usize();
+            // For particularly large arrays (where this is perf-sensitive) it's common that
+            // we're writing a single byte repeatedly. So, optimize that case to a memset.
+            if size_in_bytes == 1 && num_copies >= 1 {


Can you generalize this to handle arrays of other types by extending this size_in_bytes == 1 to any value where all bytes are equal? It would be nice if large arrays of usize or u32, or even an enum could also benefit.

Most data-less enums would already benefit (since they are 1 byte in size)... and if they have data it seems super unlikely the bytes are all the same.

For the cases where this could work (e.g., 0u32), I'm wary of #120069 (comment) -- not sure if there's a good, cheap way to check "is there padding in there?"

It might also be worth noting that anything significantly fancier here probably merits looking at replicating the algorithm in slice::repeat -- this was pointed out in earlier optimization work in this code, but not implemented at the time. For larger types I think that generalizes much better than trying to detect memset being possible...

not sure if there's a good, cheap way to check "is there padding in there?"

🤷 That probably deserves a comment that we've checked in a way that only works if the value is a single byte. If I'm doing the logic right, as-written, it's "are any bytes init" not "are all bytes init". I don't think I'd argue against doing that specifically because this is one byte but it's subtle.

InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.

Yeah, a comment seems like a good idea. I've added that.

it's "are any bytes init" not "are all bytes init"

Yes, but if there's only one byte, that seems like the same thing, right?

InitMask::is_range_initialized seems like the better function here, but the API has a gnarly error path... that only one caller uses. It might be worth it to split the function into two, so that the one error path that wants to know where there are uninit bytes can call some new function.

Yeah, that seems like the way to go for a more complex change here. I'm still worried that adding extra checks is going to lead to regressions -- or at least more code to ship for little value. But 🤷 , I could see it being a win. This PR already makes all cases a very tight loop around the memcpy/memmove call for each value, which is probably pretty fast for not excessively large things, and those are almost always byte-like I suspect.

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory.

oli-obk · 2024-01-18T08:52:29Z

@bors r+ rollup=never

yea let's try follow up optimizations in follow up PRs, not this one

bors · 2024-01-18T08:52:32Z

📌 Commit e68f303 has been approved by oli-obk

It is now in the queue for this repository.

Optimize large array creation in const-eval This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory. It also optimizes the element-by-element copies to have a tighter loop; I'm pretty sure the old code was actually doing a multiply within each loop iteration. For an 8GB array (`static SLICE: [u8; SIZE] = [0u8; 1 << 33];`) this takes us from ~23 seconds to ~6 seconds locally, which is spent roughly 50/50 in (a) memset to zero and (b) memcpy of the original place into a new place, when popping stack frame. The latter seems hard to avoid but is a big memcpy (since we're copying the type rather than initializing a region, so it's pretty fast), and the first is as good as it's going to get without special casing constant-valued arrays. Closes rust-lang#55795. (That issue's references to lint checking don't appear true anymore, but I think this closes that case as something that is slow due to *time* pretty fully. An 8GB array taking only 6 seconds feels reasonable enough to not merit further tracking).

bors · 2024-01-19T04:33:13Z

⌛ Testing commit e68f303 with merge 2617746...

bors · 2024-01-19T04:42:12Z

💔 Test failed - checks-actions

rust-log-analyzer · 2024-01-19T04:42:35Z

A job failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

saethlin · 2024-01-19T04:45:42Z

Error: fatal: unable to access 'https://github.com/rust-lang-ci/rust/': Could not resolve host: github.com

@bors retry

bors · 2024-01-19T06:44:22Z

⌛ Testing commit e68f303 with merge 16fadb3...

bors · 2024-01-19T08:41:10Z

☀️ Test successful - checks-actions
Approved by: oli-obk
Pushing 16fadb3 to master...

rust-timer · 2024-01-19T09:59:53Z

Finished benchmarking commit (16fadb3): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	5.3%	[5.3%, 5.3%]	1
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.7%	[-0.7%, -0.7%]	1
All ❌✅ (primary)	-	-	0

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 664.049s -> 663.017s (-0.16%)
Artifact size: 308.24 MiB -> 308.29 MiB (0.01%)

RalfJung · 2024-01-22T08:21:08Z

compiler/rustc_const_eval/src/interpret/memory.rs

+            if size_in_bytes == 1 && num_copies >= 1 {
+                // SAFETY: `src_bytes` would be read from anyway by copies below (num_copies >= 1).
+                // Since size_in_bytes = 1, then the `init.no_bytes_init()` check above guarantees
+                // that this read at type `u8` is OK -- it must be an initialized byte.


This comment seems confused. SAFETY comments are about the safety of the interpreter itself. The bytes in question are always initialized in that sense. The check above is about whether they are initialized in the emulated interpreter state, i.e., whether the interpreted program initialized them.

Happy to edit it! I didn't obviously find any guarantees that the interpreter always initializes backing allocations before passing them around (seems like at minimum based on profiling that's not happening with the destination buffer, so it would be bad to read that at type u8 here?).

I would have expected that the emulated interpreter state having the bytes initialized implies the interpreter's state also has them initialized.

The u8 buffer backing interpreter allocations is always initialized. It has type u8 rather than MaybeUninit<u8>, so documenting that seems redundant.

Profiling seems to be misleading then. I think we are calling calloc (via try_new_zeroed_slice) which might do the zero-init via efficient copy-on-write tricks or so.

Perhaps! I'm 99% sure we're not touching every byte until we hit this code. I'll look into changing some things here then.

New allocations get created here:

rust/compiler/rustc_middle/src/mir/interpret/allocation.rs

Lines 293 to 333 in 2ea7a37

fn uninit_inner<R>(size: Size, align: Align, fail: impl FnOnce() -> R) -> Result<Self, R> {

// We raise an error if we cannot create the allocation on the host.

// This results in an error that can happen non-deterministically, since the memory

// available to the compiler can change between runs. Normally queries are always

// deterministic. However, we can be non-deterministic here because all uses of const

// evaluation (including ConstProp!) will make compilation fail (via hard error

// or ICE) upon encountering a `MemoryExhausted` error.

let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

Ok(Allocation {

bytes,

provenance: ProvenanceMap::new(),

init_mask: InitMask::new(size, false),

align,

mutability: Mutability::Mut,

extra: (),

})

}

/// Try to create an Allocation of `size` bytes, failing if there is not enough memory

/// available to the compiler to do so.

pub fn try_uninit<'tcx>(size: Size, align: Align) -> InterpResult<'tcx, Self> {

Self::uninit_inner(size, align, || {

ty::tls::with(|tcx| tcx.dcx().delayed_bug("exhausted memory during interpretation"));

InterpError::ResourceExhaustion(ResourceExhaustionInfo::MemoryExhausted).into()

})

}

/// Try to create an Allocation of `size` bytes, panics if there is not enough memory

/// available to the compiler to do so.

///

/// Example use case: To obtain an Allocation filled with specific data,

/// first call this function and then call write_scalar to fill in the right data.

pub fn uninit(size: Size, align: Align) -> Self {

match Self::uninit_inner(size, align, || {

panic!("Allocation::uninit called with panic_on_fail had allocation failure");

}) {

Ok(x) => x,

Err(x) => x,

}

}

That ends up calling this:

rust/compiler/rustc_middle/src/mir/interpret/allocation.rs

Lines 48 to 53 in 2ea7a37

fn zeroed(size: Size, _align: Align) -> Option<Self> {

let bytes = Box::<[u8]>::try_new_zeroed_slice(size.bytes_usize()).ok()?;

// SAFETY: the box was zero-allocated, which is a valid initial value for Box<[u8]>

let bytes = unsafe { bytes.assume_init() };

Some(bytes)

}

I am 100% sure that all these u8 are initialized.

compiler/rustc_const_eval/src/interpret/memory.rs

RalfJung · 2024-01-26T16:55:31Z

I filed a follow-up to fix the comments: #120387

rustbot assigned cjgillot Jan 17, 2024

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 17, 2024

Mark-Simulacrum force-pushed the fast-memcpy branch from 036954b to 4106d71 Compare January 17, 2024 19:42

Mark-Simulacrum commented Jan 17, 2024

View reviewed changes

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 17, 2024

This comment has been minimized.

Sign in to view

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 17, 2024

saethlin reviewed Jan 17, 2024

View reviewed changes

Optimize large array creation in const-eval

e68f303

This changes repeated memcpy's to a memset for the case that we're propagating a single byte into a region of memory.

Mark-Simulacrum force-pushed the fast-memcpy branch from 4106d71 to e68f303 Compare January 18, 2024 01:24

oli-obk assigned oli-obk and unassigned cjgillot Jan 18, 2024

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 18, 2024

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jan 19, 2024

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 19, 2024

bors added the merged-by-bors This PR was explicitly merged by bors. label Jan 19, 2024

bors merged commit 16fadb3 into rust-lang:master Jan 19, 2024
12 checks passed

rustbot added this to the 1.77.0 milestone Jan 19, 2024

RalfJung reviewed Jan 22, 2024

View reviewed changes

compiler/rustc_const_eval/src/interpret/memory.rs Show resolved Hide resolved

Mark-Simulacrum deleted the fast-memcpy branch January 22, 2024 12:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize large array creation in const-eval #120069

Optimize large array creation in const-eval #120069

Mark-Simulacrum commented Jan 17, 2024

rustbot commented Jan 17, 2024

rustbot commented Jan 17, 2024

Mark-Simulacrum Jan 17, 2024

Mark-Simulacrum commented Jan 17, 2024

This comment has been minimized.

bors commented Jan 17, 2024

bors commented Jan 17, 2024

This comment has been minimized.

rust-timer commented Jan 17, 2024

saethlin Jan 17, 2024 •

edited

Mark-Simulacrum Jan 17, 2024

saethlin Jan 17, 2024

Mark-Simulacrum Jan 18, 2024

oli-obk commented Jan 18, 2024

bors commented Jan 18, 2024

bors commented Jan 19, 2024

bors commented Jan 19, 2024

rust-log-analyzer commented Jan 19, 2024

saethlin commented Jan 19, 2024

bors commented Jan 19, 2024

bors commented Jan 19, 2024

rust-timer commented Jan 19, 2024

RalfJung Jan 22, 2024 •

edited

Mark-Simulacrum Jan 22, 2024

RalfJung Jan 22, 2024

RalfJung Jan 22, 2024

Mark-Simulacrum Jan 22, 2024

RalfJung Jan 22, 2024

RalfJung commented Jan 26, 2024

	fn uninit_inner<R>(size: Size, align: Align, fail: impl FnOnce() -> R) -> Result<Self, R> {
	// We raise an error if we cannot create the allocation on the host.
	// This results in an error that can happen non-deterministically, since the memory
	// available to the compiler can change between runs. Normally queries are always
	// deterministic. However, we can be non-deterministic here because all uses of const
	// evaluation (including ConstProp!) will make compilation fail (via hard error
	// or ICE) upon encountering a `MemoryExhausted` error.
	let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

	Ok(Allocation {
	bytes,
	provenance: ProvenanceMap::new(),
	init_mask: InitMask::new(size, false),
	align,
	mutability: Mutability::Mut,
	extra: (),
	})
	}

	/// Try to create an Allocation of `size` bytes, failing if there is not enough memory
	/// available to the compiler to do so.
	pub fn try_uninit<'tcx>(size: Size, align: Align) -> InterpResult<'tcx, Self> {
	Self::uninit_inner(size, align, \|\| {
	ty::tls::with(\|tcx\| tcx.dcx().delayed_bug("exhausted memory during interpretation"));
	InterpError::ResourceExhaustion(ResourceExhaustionInfo::MemoryExhausted).into()
	})
	}

	/// Try to create an Allocation of `size` bytes, panics if there is not enough memory
	/// available to the compiler to do so.
	///
	/// Example use case: To obtain an Allocation filled with specific data,
	/// first call this function and then call write_scalar to fill in the right data.
	pub fn uninit(size: Size, align: Align) -> Self {
	match Self::uninit_inner(size, align, \|\| {
	panic!("Allocation::uninit called with panic_on_fail had allocation failure");
	}) {
	Ok(x) => x,
	Err(x) => x,
	}
	}

	fn zeroed(size: Size, _align: Align) -> Option<Self> {
	let bytes = Box::<[u8]>::try_new_zeroed_slice(size.bytes_usize()).ok()?;
	// SAFETY: the box was zero-allocated, which is a valid initial value for Box<[u8]>
	let bytes = unsafe { bytes.assume_init() };
	Some(bytes)
	}

Optimize large array creation in const-eval #120069

Optimize large array creation in const-eval #120069

Conversation

Mark-Simulacrum commented Jan 17, 2024

rustbot commented Jan 17, 2024

rustbot commented Jan 17, 2024

Choose a reason for hiding this comment

Mark-Simulacrum commented Jan 17, 2024

This comment has been minimized.

bors commented Jan 17, 2024

bors commented Jan 17, 2024

This comment has been minimized.

rust-timer commented Jan 17, 2024

Overall result: ✅ improvements - no action needed

saethlin Jan 17, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

oli-obk commented Jan 18, 2024

bors commented Jan 18, 2024

bors commented Jan 19, 2024

bors commented Jan 19, 2024

rust-log-analyzer commented Jan 19, 2024

saethlin commented Jan 19, 2024

bors commented Jan 19, 2024

bors commented Jan 19, 2024

rust-timer commented Jan 19, 2024

Overall result: no relevant changes - no action needed

RalfJung Jan 22, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RalfJung commented Jan 26, 2024

saethlin Jan 17, 2024 •

edited

RalfJung Jan 22, 2024 •

edited