-
Notifications
You must be signed in to change notification settings - Fork 13.7k
Always const-evaluate the GCD in slice::align_to_offsets
#111296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
(rustbot has picked a reviewer for you, use r? to override) |
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit b5ee324 with merge 2a2eb58ca22fcee5651c02f204da3818f000d4fc... |
r=me with perf results OK cc @scottmcm, I think you've worked on the GCD code and may be interested. |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
let gcd: usize = const { gcd(mem::size_of::<T>(), mem::size_of::<U>()) }; | ||
let ts: usize = mem::size_of::<U>() / gcd; | ||
let us: usize = mem::size_of::<T>() / gcd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe push even a bit more into the const block?
let gcd: usize = const { gcd(mem::size_of::<T>(), mem::size_of::<U>()) }; | |
let ts: usize = mem::size_of::<U>() / gcd; | |
let us: usize = mem::size_of::<T>() / gcd; | |
let (ts, us) = const { | |
let gcd: usize = gcd(mem::size_of::<T>(), mem::size_of::<U>()); | |
(mem::size_of::<U>() / gcd, mem::size_of::<T>() / gcd) | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing that results in more instructions being emitted in debug mode for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also compute ts * us
in const
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we should move more operations into the const
-block if it increases the instruction count though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing that results in more instructions being emitted in debug mode for some reason.
Caveat: small changes in instruction count in a debug build is a subpar proxy for estimating performance of such a build. LLVM can still select instruction sequences that execute in fewer cycles, but require more instructions in a debug mode, so long as it is able to match on IR sequences that trigger those instruction sequences. A prime example of that is e.g. dividing by a constant that produces a strength reduction that involves a multiplication and a shift. These two operations are still way faster to execute compared to a full division. Something similar may be happening here.
More importantly, though, this is all LLVM version dependent. It might do better in versions V and V+1, worse in versions V+2 and V+3 and even better in V+4... In this case I'd say, as long as the difference in machine instruction count isn’t like 10-20% or more, then we should strive to ensure that rustc does as much work as possible before handing it off to LLVM’s consideration.
EDIT: The actually correct thing would be to actually put the generated code through llvm-mca
and have it compute the actual execution cost in cycles.
EDIT (part 2): Another thing that might be happening is that moving code into the const
block enables LLVM to use FastIsel
or somesuch mechanism which might have been not used before for whatever reason, which is just much faster in terms of generating code (even if the results aren't as good execution-time wise.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If both T
and U
are ZSTs then doing the divisions in the const-block causes a compile time divide-by-zero error. align_to
has an early return if at least one of the types is a ZST, so the divisor will never be zero at runtime, but it looks like the const-block still gets evaluated even though it's technically dead code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, sounds like we have no other choice right now, then. We could look into moving the check for ts == 0 || us == 0
into the const block too and bail in some other reasonable way, but it probably something that warrants a separate PR anyhow.
Nope, wasn't me. Maybe @nagisa ? Great change, though! A really nice demonstration of a |
Yeah, I “own” anything |
} | ||
a << k | ||
const fn gcd(a: usize, b: usize) -> usize { | ||
if b == 0 { a } else { gcd(b, a % b) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn’t this need to check for a == 0
too? There's no guarantee that either of the types won’t be ZST (in isolation.) Or, add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a == 0
then it returns gcd(b, 0 % b)
which will then return b
.
Finished benchmarking commit (2a2eb58ca22fcee5651c02f204da3818f000d4fc): comparison URL. Overall result: no relevant changes - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis benchmark run did not return any relevant results for this metric. Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesThis benchmark run did not return any relevant results for this metric. Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 653.81s -> 653.024s (-0.12%) |
@bors r=nagisa,Mark-Simulacrum |
☀️ Test successful - checks-actions |
Finished benchmarking commit (90c02c1): comparison URL. Overall result: ✅ improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 658.017s -> 656.527s (-0.23%) |
Use an inline
const
-block to force the compiler to calculate the GCD at compile time, even in debug mode. This shouldn't affect the behavior of the program at all, but it drastically cuts down on the number of instructions emitted with optimizations disabled.With the current implementation, a single
slice::align_to
instantiation (specifically<[u8]>::align_to::<u128>()
) generates 676 instructions (on x86-64). Forcing the GCD computation to be const cuts it down to 327 instructions, so just over 50% less. This is obviously not representative of actual runtime gains, but I still see it as a significant win as long as it doesn't degrade compile times.Not having to worry about LLVM const-evaluating the GCD function also allows it to use the textbook recursive euclidean algorithm instead of a much more complicated iterative implementation with multiple
unsafe
-blocks.