Manipulating slice through &mut parameter not optimized well #27130
Comments
The first example is much harder for LLVM to optimize because each iteration of the loop stores to a value which would be globally visible if the loop panics. I think LLVM would figure it out with the right set of optimization passes, though. The other examples are much easier to optimize because LLVM can trivially eliminate the stack traffic using scalarrepl.
According to the LLVM IR, the …
The slice "escapes" like so:

```rust
use std::sync::Mutex;
use std::sync::Arc;

pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[2..];
    }
}

fn main() {
    static X: &'static [u8] = &[42, 0, 42];
    let m = Arc::new(Mutex::new(X));
    let m2 = m.clone();
    let _ = std::thread::spawn(move || {
        let mut r = m.lock().unwrap();
        trim_in_place(&mut *r)
    }).join();
    let r = match m2.lock() {
        Ok(r) => r,
        Err(r) => r.into_inner()
    };
    assert_eq!(*r, &[42]);
}
```
The mentioned transformation that leads to better code is what @reinerp suggested here: #26494 (comment)
steveklabnik added the I-slow label on Jul 20, 2015
I'm preparing a PR that will eliminate the extra null check. For the loop itself, I don't have a fix for rustc (yet?), but this optimizes better:

```rust
pub fn trim_in_place(a: &mut &[u8]) {
    let mut x = *a;
    while x.first() == Some(&42) {
        x = &x[1..];
    }
    *a = x;
}
```

You get the tight loop with that, and with the PR I'm preparing, the result is this:

```asm
_ZN13trim_in_place20h5d93414c92587a1aeaaE:
    .cfi_startproc
    movq (%rdi), %rax
    movq 8(%rdi), %rdx
    xorl %ecx, %ecx
    testq %rdx, %rdx
    je .LBB0_5
    xorl %ecx, %ecx
    .align 16, 0x90
.LBB0_2:
    movzbl (%rax), %esi
    cmpl $42, %esi
    jne .LBB0_3
    incq %rax
    decq %rdx
    jne .LBB0_2
    jmp .LBB0_5
.LBB0_3:
    movq %rdx, %rcx
.LBB0_5:
    movq %rax, (%rdi)
    movq %rcx, 8(%rdi)
    retq
```

Still slightly weird WRT its usage of …
dotdash added a commit to dotdash/rust that referenced this issue on Jul 21, 2015
dotdash referenced this issue on Jul 21, 2015: Employ non-null metadata for loads from fat pointers #27180 (Closed)
@dotdash yeah, LLVM register usage can be a little weird sometimes. Its live-range analysis can produce some odd results at times, meaning that it won't re-use a register even when it can.
dotdash added a commit to dotdash/rust that referenced this issue on Jul 22, 2015
rkruppe referenced this issue in rkruppe/rust on Aug 6, 2015
pnkfelix referenced this issue in rkruppe/rust on Aug 6, 2015
brson added the P-low, T-compiler, A-codegen labels on Jan 26, 2017
@rkruppe still a problem?
@brson still as bad in stable and beta, but it even regressed from there in nightly :-/

```asm
_ZN8rust_out13trim_in_place17ha198dc1ee798e461E:
    .cfi_startproc
    pushq %rax
.Ltmp0:
    .cfi_def_cfa_offset 16
    movq 8(%rdi), %rax
    jmp .LBB0_1
    .p2align 4, 0x90
.LBB0_5:
    incq (%rdi)
    movq %rax, 8(%rdi)
.LBB0_1:
    decq %rax
    cmpq $-1, %rax
    setne %cl
    je .LBB0_3
    movq (%rdi), %rcx
    cmpb $42, (%rcx)
    sete %cl
.LBB0_3:
    testb %cl, %cl
    je .LBB0_6
    cmpq $-1, %rax
    jne .LBB0_5
    movl $1, %edi
    xorl %esi, %esi
    callq _ZN4core5slice22slice_index_order_fail17h9eb379df958d4186E@PLT
.LBB0_6:
    popq %rax
    retq
.Lfunc_end0:
    .size _ZN8rust_out13trim_in_place17ha198dc1ee798e461E, .Lfunc_end0-_ZN8rust_out13trim_in_place17ha198dc1ee798e461E
    .cfi_endproc
```
dotdash self-assigned this on Jan 26, 2017
OK, so the regression here is interesting. Thanks to a change in #38854, which makes us avoid FCA loads/stores in favor of memcpys, LLVM got a bit better at understanding this code and sees that …
I've found a few places where tweaking things in LLVM would prevent this from regressing, the most promising one being a fix for folding zexts into PHIs with only two incoming values, a special case that is currently not handled. A patch on top of the current LLVM trunk at least gets rid of the regression, though the original problem still remains.
The patch to handle the "regression" has landed in LLVM trunk.
Could we backport it?
@arielb1 we probably could (can't check right now), but I'm not sure how useful it is in real-world code. The commit is llvm-mirror/llvm@7c4d39c
Mark-Simulacrum added the C-enhancement label on Jul 22, 2017
The "regression" is no more, but the original issue still persists. |
Looks like the bounds check is being optimized away since rustc 1.25.
Do we have any perf tests that tell us when we regress it?
oli-obk added the E-needstest label on Dec 25, 2018
I would assume that a regression would show up somewhere in lolbench, but there are a lot of events showing up there currently with no effort yet to triage or reproduce them, so I doubt we'll know based on those until the tooling is further along.
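(Editor's sketch, not from the thread: another way to guard against this kind of regression would be a codegen test in the usual src/test/codegen FileCheck style. The file path, function body, and CHECK lines below are hypothetical and only illustrate the idea.)

```rust
// Hypothetical codegen test sketch; an assumed path such as
// src/test/codegen/trim-in-place.rs, not an existing test.
// compile-flags: -O

#![crate_type = "lib"]

// The trimming loop should compile without any slice bounds-check or
// panic calls once the optimization works as intended.
// CHECK-LABEL: @trim_in_place
// CHECK-NOT: slice_index_order_fail
// CHECK-NOT: panic
#[no_mangle]
pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}
```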
rkruppe commented on Jul 19, 2015
When passing a &mut &[u8] to a function that trims elements from the start/end of the slice, the code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect. This is a minimal example (playground of all following snippets):
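(The snippet itself did not survive this copy of the thread. Based on the a[1..] slicing mentioned below and the variants quoted in the comments above, it was presumably along these lines; treat this as a hedged reconstruction, not the author's exact code.)

```rust
// Reconstruction: trim leading 42s through the &mut &[u8] parameter,
// re-slicing with a[1..] on every iteration.
pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}
```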
I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte other than 42 is encountered.
Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, trying to panic on the second check (which presumably comes from the a[1..] slicing). The .LBB0_9 branch can never be taken, and is indeed optimized out when the function is written differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant (reconstructed below) compiles to better code:
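(The second snippet is also missing from this copy. Judging by the { trim_in_place(&mut a); a } body quoted just below, it takes the slice by value and returns the trimmed slice; the name trim is assumed.)

```rust
// Reconstruction of the by-value variant: the loop updates a local slice,
// so nothing is stored through a &mut pointer while it runs. This is the
// same shape as the local-copy workaround posted by dotdash above.
pub fn trim(mut a: &[u8]) -> &[u8] {
    while a.first() == Some(&42) {
        a = &a[1..];
    }
    a
}
```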
It does all the loop calculations in registers. Also, marking the first function as inline(always) and defining the second function in terms of it ({ trim_in_place(&mut a); a }) optimizes to the same code. Both variants check whether the slice is empty before entering the loop and return immediately in that case.
Background
I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, and the slices are stored in a struct. Applying inline(always) to that function gives a small but measurable performance improvement across the board, but increases code size by 10 KiB (Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly code (far too much code), but I suspect this means the problem, and the solution, carry over to real code.
Meta
Reproduces on the nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local rustc has the same problem. The exact commit is meaningless because it only exists in my rust.git fork; it's based on f9f580936d81ed527115ae86375f69bb77723b1c.