
Performance regression between v1.76.0 and v1.77.2 #125543

Open
jmillikin opened this issue May 25, 2024 · 10 comments
Labels
A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression.

Comments

@jmillikin
Contributor

Code

I'm working on some low-level bit-manipulation code, and discovered that v1.77 generates code that runs at about half the speed of v1.76's output. Since the v1.77 release notes don't mention an LLVM upgrade, I figure something must be going wrong in rustc's LLVM IR generation.
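One way to compare the emitted IR between the two toolchains is to ask rustc for it directly; this is just a suggestion, not part of the original report, and the output location can vary with the build configuration:

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 rustc --release -- --emit=llvm-ir
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 rustc --release -- --emit=llvm-ir

The resulting .ll files are typically written under target/release/deps/.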

Compilable reproduction:

The inner loop is decode_u32(), which is sort of ugly but isn't doing anything especially complex.

use core::ptr;

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	if buf[0] < 0x80 {
		return (buf[0] as u32, 1);
	}
	if buf[0] < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});
		if buf[0] < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if buf[0] < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((buf[0] & 0x0F) + 2) as usize)
}

The exact timing varies depending on the distribution of input values. The attached reproduction uses a distribution for which the new code is about twice as slow.

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s

Version it worked on

It most recently worked on: Rust v1.76.0

Version with regression

rustc --version --verbose:

$ rustc +1.77.2 --version --verbose
rustc 1.77.2 (25ef9e3d8 2024-04-09)
binary: rustc
commit-hash: 25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04
commit-date: 2024-04-09
host: x86_64-unknown-linux-gnu
release: 1.77.2
LLVM version: 17.0.6
@jmillikin jmillikin added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels May 25, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 25, 2024
@blyxyas
Member

blyxyas commented May 25, 2024

Here are the contents of main.rs:

/*

```
[package]
name = "perf-regression-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = { version = "0.8.5", features = ["alloc", "small_rng"] }

[[bin]]
name = "main"
path = "main.rs"
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release
$ time ./target/release/main
real    0m7.634s
user    0m7.628s
sys     0m0.005s
```

*/

use core::hint::black_box;
use core::ptr;

use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};

const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;

#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	if buf[0] < 0x80 {
		return (buf[0] as u32, 1);
	}
	if buf[0] < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});
		if buf[0] < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if buf[0] < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((buf[0] & 0x0F) + 2) as usize)
}

#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
	for buf in bufs {
		black_box(decode_u32(black_box(buf)));
	}
}

const U32_MASKS: [u32; 5] = [
	u32::MAX >> 25,
	u32::MAX >> 24,
	u32::MAX >> 16,
	u32::MAX >> 8,
	u32::MAX,
];

fn main() {
	let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
		.unwrap()
		.map(|ii| U32_MASKS[ii]);

	let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
	let encoded_values: Vec<[u8; 5]> =
		std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
			.map(|value| {
				let mut buf = [0u8; 5];
				encode_u32(&mut buf, value);
				buf
			})
			.take(RAND_VEC_LEN)
			.collect();

	for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
		benchmark_decode_u32(&encoded_values);
	}
}

fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
	if x & 0b01111111 == x {
		buf[0] = x as u8;
		return 1;
	}
	if x < 0x10000000 {
		if x < 0x00004000 {
			x <<= 2;
			buf[0] = 0b10000000 | ((x as u8) >> 2);
			buf[1] = (x >> 8) as u8;
			return 2;
		}
		if x < 0x00200000 {
			x <<= 3;
			buf[0] = 0b11000000 | ((x as u8) >> 3);
			buf[1] = (x >> 8) as u8;
			buf[2] = (x >> 16) as u8;
			return 3;
		}
		x <<= 4;
		buf[0] = 0b11100000 | ((x as u8) >> 4);
		buf[1] = (x >> 8) as u8;
		buf[2] = (x >> 16) as u8;
		buf[3] = (x >> 24) as u8;
		return 4;
	}
	let bytes = x.to_le_bytes();
	buf[0] = 0b11110011;
	buf[1] = bytes[0];
	buf[2] = bytes[1];
	buf[3] = bytes[2];
	buf[4] = bytes[3];
	5
}

@workingjubilee
Contributor

@jmillikin How does the performance go on 1.78.0, the current stable?

@jmillikin
Contributor Author

v1.78.0 generates the same assembly for {benchmark_,}decode_u32 as v1.77.2: https://godbolt.org/z/a5hcnh9az.

@workingjubilee
Contributor

wow, and that's the version that actually has the LLVM upgrade.

@jmillikin
Contributor Author

I tried using cargo-bisect-rustc and it pointed to 8d76d07, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.

cc @cjgillot, who may have some ideas about whether the new code is working as intended.

My suspicion is that unifying the buf[0] load with the buf[0..4]-as-u32 load is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.


searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01
regressed nightly: nightly-2023-12-31
searched commit range: 3cdd004...2a3e635
regressed commit: 8d76d07

bisected with cargo-bisect-rustc v0.6.8

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --start=1.76.0 --end=1.77.0 --script=./test.sh 

@blyxyas
Member

blyxyas commented May 27, 2024

@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh? 😅

@jmillikin
Contributor Author

It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main"  --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?

A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" in a subshell and then check if the resulting wall-clock time measurement is too large.
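A rough sketch of that timing-based variant (untested; it assumes GNU time and picks an arbitrary ~6 second threshold between the fast and slow runs above):

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
# GNU time writes '%e' (elapsed wall-clock seconds) to stderr; capture it.
elapsed=$(/bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" 2>&1 >/dev/null)
# Exit 0 (fast, not regressed) only if the run finished under the threshold.
awk -v t="$elapsed" 'BEGIN { exit !(t < 6.0) }'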

@jmillikin
Contributor Author

Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	let prefix_0 = buf[0];
	if prefix_0 < 0x80 {
		return (prefix_0 as u32, 1);
	}

	if prefix_0 < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});

		let prefix_1 = buf[0];
		// let prefix_1 = prefix_0;
		if prefix_1 < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if prefix_1 < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((prefix_0 & 0x0F) + 2) as usize)
}

This version (which re-reads buf[0] into prefix_1) compiles to the same assembly as the original on both v1.76.0 and v1.77.2.

If let prefix_1 = buf[0]; is replaced with the commented-out let prefix_1 = prefix_0;, then both compiler versions emit the slow assembly.

@apiraino
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 28, 2024
@saethlin saethlin added A-mir-opt Area: MIR optimizations and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 31, 2024
@saethlin
Member

saethlin commented May 31, 2024

Disabling GVN via -Zmir-enable-passes=-GVN produces the fast version on nightly. The MIR inliner is not relevant, in case anyone was wondering (I was).
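For reference, one possible invocation (a guess at the exact command, since the comment above only gives the flag; -Z flags require a nightly toolchain):

$ RUSTFLAGS="-Copt-level=3 -Zmir-enable-passes=-GVN" cargo +nightly build --release
$ time ./target/release/main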
