
Performance regression between v1.76.0 and v1.77.2 #125543

Open
jmillikin opened this issue May 25, 2024 · 10 comments
Labels
A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression.

Comments

@jmillikin
Contributor

Code

I'm working on some low-level bit-manipulation code, and discovered that v1.77 generates code that runs at about half the speed of v1.76's output. Since the v1.77 release notes don't mention an LLVM upgrade, I figure something must be going wrong in rustc's LLVM IR generation.
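One way to compare the emitted IR between the two toolchains is to ask rustc for it directly; this is just a suggestion, not part of the original report, and the output location can vary with the build configuration:

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 rustc --release -- --emit=llvm-ir
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 rustc --release -- --emit=llvm-ir

The resulting .ll files are typically written under target/release/deps/.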

Compilable reproduction:

The inner loop is decode_u32(), which is sort of ugly but isn't doing anything especially complex.

use core::ptr;

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	if buf[0] < 0x80 {
		return (buf[0] as u32, 1);
	}
	if buf[0] < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});
		if buf[0] < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if buf[0] < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((buf[0] & 0x0F) + 2) as usize)
}

The exact timing varies depending on the distribution of input values. The attached reproduction uses a distribution for which the new code is about twice as slow.

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s

Version it worked on

It most recently worked on: Rust v1.76.0

Version with regression

rustc --version --verbose:

$ rustc +1.77.2 --version --verbose
rustc 1.77.2 (25ef9e3d8 2024-04-09)
binary: rustc
commit-hash: 25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04
commit-date: 2024-04-09
host: x86_64-unknown-linux-gnu
release: 1.77.2
LLVM version: 17.0.6
@jmillikin jmillikin added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels May 25, 2024
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 25, 2024
@blyxyas
Member

blyxyas commented May 25, 2024

Here are the contents of main.rs:

/*

```
[package]
name = "perf-regression-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = { version = "0.8.5", features = ["alloc", "small_rng"] }

[[bin]]
name = "main"
path = "main.rs"
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s
```

```
$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release
$ time ./target/release/main
real    0m7.634s
user    0m7.628s
sys     0m0.005s
```

*/

use core::hint::black_box;
use core::ptr;

use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};

const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;

#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	if buf[0] < 0x80 {
		return (buf[0] as u32, 1);
	}
	if buf[0] < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});
		if buf[0] < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if buf[0] < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((buf[0] & 0x0F) + 2) as usize)
}

#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
	for buf in bufs {
		black_box(decode_u32(black_box(buf)));
	}
}

const U32_MASKS: [u32; 5] = [
	u32::MAX >> 25,
	u32::MAX >> 24,
	u32::MAX >> 16,
	u32::MAX >> 8,
	u32::MAX,
];

fn main() {
	let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
		.unwrap()
		.map(|ii| U32_MASKS[ii]);

	let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
	let encoded_values: Vec<[u8; 5]> =
		std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
			.map(|value| {
				let mut buf = [0u8; 5];
				encode_u32(&mut buf, value);
				buf
			})
			.take(RAND_VEC_LEN)
			.collect();

	for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
		benchmark_decode_u32(&encoded_values);
	}
}

fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
	if x & 0b01111111 == x {
		buf[0] = x as u8;
		return 1;
	}
	if x < 0x10000000 {
		if x < 0x00004000 {
			x <<= 2;
			buf[0] = 0b10000000 | ((x as u8) >> 2);
			buf[1] = (x >> 8) as u8;
			return 2;
		}
		if x < 0x00200000 {
			x <<= 3;
			buf[0] = 0b11000000 | ((x as u8) >> 3);
			buf[1] = (x >> 8) as u8;
			buf[2] = (x >> 16) as u8;
			return 3;
		}
		x <<= 4;
		buf[0] = 0b11100000 | ((x as u8) >> 4);
		buf[1] = (x >> 8) as u8;
		buf[2] = (x >> 16) as u8;
		buf[3] = (x >> 24) as u8;
		return 4;
	}
	let bytes = x.to_le_bytes();
	buf[0] = 0b11110011;
	buf[1] = bytes[0];
	buf[2] = bytes[1];
	buf[3] = bytes[2];
	buf[4] = bytes[3];
	5
}

@workingjubilee
Contributor

@jmillikin How does the performance go on 1.78.0, the current stable?

@jmillikin
Contributor Author

v1.78.0 generates the same assembly for {benchmark_,}decode_u32 as v1.77.2: https://godbolt.org/z/a5hcnh9az.

@workingjubilee
Contributor

wow, and that's the version that actually has the LLVM upgrade.

@jmillikin
Contributor Author

I tried using cargo-bisect-rustc and it pointed to 8d76d07, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.

cc @cjgillot, who may have some ideas about whether the new code is working as intended.

My suspicion is that unifying the buf[0] load with the buf[0..4]-as-u32 load is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.


searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01
regressed nightly: nightly-2023-12-31
searched commit range: 3cdd004...2a3e635
regressed commit: 8d76d07

bisected with cargo-bisect-rustc v0.6.8

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --start=1.76.0 --end=1.77.0 --script=./test.sh 

@blyxyas
Member

blyxyas commented May 27, 2024

@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh? 😅

@jmillikin
Contributor Author

It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main"  --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?

A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" in a subshell and then check if the resulting wall-clock time measurement is too large.
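A rough sketch of that timing-based variant (untested; it assumes GNU time and picks an arbitrary ~6 second threshold between the fast and slow runs above):

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
# GNU time writes '%e' (elapsed wall-clock seconds) to stderr; capture it.
elapsed=$(/bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" 2>&1 >/dev/null)
# Exit 0 (fast, not regressed) only if the run finished under the threshold.
awk -v t="$elapsed" 'BEGIN { exit !(t < 6.0) }'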

@jmillikin
Contributor Author

Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
	let prefix_0 = buf[0];
	if prefix_0 < 0x80 {
		return (prefix_0 as u32, 1);
	}

	if prefix_0 < 0xF0 {
		let x = u32::from_le(unsafe {
			ptr::from_ref(buf).cast::<u32>().read_unaligned()
		});

		let prefix_1 = buf[0];
		// let prefix_1 = prefix_0;
		if prefix_1 < 0b11000000 {
			let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
			return (decoded, 2);
		}
		if prefix_1 < 0b11100000 {
			let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
			return (decoded, 3);
		}
		let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
		return (decoded, 4);
	}
	let decoded = u32::from_le(unsafe {
		ptr::from_ref(buf)
			.cast::<u8>()
			.add(1)
			.cast::<u32>()
			.read_unaligned()
	});
	(decoded, ((prefix_0 & 0x0F) + 2) as usize)
}

This version (which re-reads buf[0] into prefix_1) compiles to the same assembly as the original on both v1.76.0 and v1.77.2.

If let prefix_1 = buf[0]; is replaced with the commented-out let prefix_1 = prefix_0;, then both compiler versions emit the slow assembly.

@apiraino
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 28, 2024
@saethlin saethlin added A-mir-opt Area: MIR optimizations and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 31, 2024
@saethlin
Member

saethlin commented May 31, 2024

Disabling GVN via -Zmir-enable-passes=-GVN produces the fast version on nightly. The MIR inliner is not relevant, in case anyone was wondering (I was).
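For reference, one possible invocation (a guess at the exact command, since the comment above only gives the flag; -Z flags require a nightly toolchain):

$ RUSTFLAGS="-Copt-level=3 -Zmir-enable-passes=-GVN" cargo +nightly build --release
$ time ./target/release/main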
