Manipulating slice through &mut parameter not optimized well #27130
Comments
The first example is much harder for LLVM to optimize because each iteration of the loop stores to a value which would be globally visible if the loop panics. I think LLVM would figure it out with the right set of optimization passes, though. The other examples are much easier to optimize because LLVM can trivially eliminate the stack traffic using scalarrepl.
According to the LLVM IR, the …
The slice "escapes" like so:

```rust
use std::sync::Mutex;
use std::sync::Arc;

pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[2..];
    }
}

fn main() {
    static X: &'static [u8] = &[42, 0, 42];
    let m = Arc::new(Mutex::new(X));
    let m2 = m.clone();
    let _ = std::thread::spawn(move || {
        let mut r = m.lock().unwrap();
        trim_in_place(&mut *r)
    }).join();
    let r = match m2.lock() {
        Ok(r) => r,
        Err(r) => r.into_inner()
    };
    assert_eq!(*r, &[42]);
}
```
The mentioned transformation that leads to better code is what @reinerp suggested here: #26494 (comment)
steveklabnik added the I-slow label on Jul 20, 2015
I'm preparing a PR that will eliminate the extra null check. For the loop itself, I don't have a fix for rustc (yet?), but this optimizes better:

```rust
pub fn trim_in_place(a: &mut &[u8]) {
    let mut x = *a;
    while x.first() == Some(&42) {
        x = &x[1..];
    }
    *a = x;
}
```

You get the tight loop with that, and with the PR I'm preparing, the result is this:

```asm
_ZN13trim_in_place20h5d93414c92587a1aeaaE:
    .cfi_startproc
    movq (%rdi), %rax
    movq 8(%rdi), %rdx
    xorl %ecx, %ecx
    testq %rdx, %rdx
    je .LBB0_5
    xorl %ecx, %ecx
    .align 16, 0x90
.LBB0_2:
    movzbl (%rax), %esi
    cmpl $42, %esi
    jne .LBB0_3
    incq %rax
    decq %rdx
    jne .LBB0_2
    jmp .LBB0_5
.LBB0_3:
    movq %rdx, %rcx
.LBB0_5:
    movq %rax, (%rdi)
    movq %rcx, 8(%rdi)
    retq
```

Still slightly weird WRT its usage of …
dotdash added a commit to dotdash/rust that referenced this issue on Jul 21, 2015
dotdash referenced this issue on Jul 21, 2015: Employ non-null metadata for loads from fat pointers #27180 (Closed)
@dotdash yeah, LLVM register usage can be a little weird sometimes. Its live-range analysis can produce some odd results at times, meaning that it won't re-use a register even when it can.
dotdash added a commit to dotdash/rust that referenced this issue on Jul 22, 2015
rkruppe referenced this issue in rkruppe/rust on Aug 6, 2015
pnkfelix referenced this issue in rkruppe/rust on Aug 6, 2015
brson added the P-low, T-compiler, A-codegen labels on Jan 26, 2017
@rkruppe still a problem?
@brson still as bad in stable and beta, but it even regressed from there in nightly :-/

```asm
_ZN8rust_out13trim_in_place17ha198dc1ee798e461E:
    .cfi_startproc
    pushq %rax
.Ltmp0:
    .cfi_def_cfa_offset 16
    movq 8(%rdi), %rax
    jmp .LBB0_1
    .p2align 4, 0x90
.LBB0_5:
    incq (%rdi)
    movq %rax, 8(%rdi)
.LBB0_1:
    decq %rax
    cmpq $-1, %rax
    setne %cl
    je .LBB0_3
    movq (%rdi), %rcx
    cmpb $42, (%rcx)
    sete %cl
.LBB0_3:
    testb %cl, %cl
    je .LBB0_6
    cmpq $-1, %rax
    jne .LBB0_5
    movl $1, %edi
    xorl %esi, %esi
    callq _ZN4core5slice22slice_index_order_fail17h9eb379df958d4186E@PLT
.LBB0_6:
    popq %rax
    retq
.Lfunc_end0:
    .size _ZN8rust_out13trim_in_place17ha198dc1ee798e461E, .Lfunc_end0-_ZN8rust_out13trim_in_place17ha198dc1ee798e461E
    .cfi_endproc
```
dotdash self-assigned this on Jan 26, 2017
OK, so the regression here is interesting. Thanks to a change in #38854, which makes us avoid FCA loads/stores in favor of memcpys, LLVM got a bit better at understanding this code and sees that …
I've found a few places where tweaking things in LLVM would prevent this from regressing, the most promising one being a fix for folding zexts into PHIs with only two incoming values, a special case that is currently not handled. A patch on top of the current LLVM trunk at least gets rid of the regression, though the original problem still remains.
The patch to handle the "regression" has landed in LLVM trunk.
Could we backport it?
@arielb1 we probably could (can't check right now), but I'm not sure how useful it is in real-world code. The commit is llvm-mirror/llvm@7c4d39c
Mark-Simulacrum added the C-enhancement label on Jul 22, 2017
The "regression" is no more, but the original issue still persists. |
Looks like the bounds check is being optimized away since rustc 1.25.
Do we have any perf tests that tell us when we regress it?
oli-obk added the E-needstest label on Dec 25, 2018
I would assume that a regression would show up somewhere in lolbench, but there are a lot of events showing up there currently with no effort yet to triage or reproduce them, so I doubt we'll know based on those until the tooling is further along.
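(Editor's sketch, not from the thread: another way to guard against this kind of regression would be a codegen test in the usual src/test/codegen FileCheck style. The file path, function body, and CHECK lines below are hypothetical and only illustrate the idea.)

```rust
// Hypothetical codegen test sketch; an assumed path such as
// src/test/codegen/trim-in-place.rs, not an existing test.
// compile-flags: -O

#![crate_type = "lib"]

// The trimming loop should compile without any slice bounds-check or
// panic calls once the optimization works as intended.
// CHECK-LABEL: @trim_in_place
// CHECK-NOT: slice_index_order_fail
// CHECK-NOT: panic
#[no_mangle]
pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}
```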
rkruppe commented on Jul 19, 2015
When passing a &mut &[u8] to a function that trims elements from the start/end of the slice, the code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect. This is a minimal example (playground of all following snippets):
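(The snippet itself did not survive this copy of the thread. Based on the a[1..] slicing mentioned below and the variants quoted in the comments above, it was presumably along these lines; treat this as a hedged reconstruction, not the author's exact code.)

```rust
// Reconstruction: trim leading 42s through the &mut &[u8] parameter,
// re-slicing with a[1..] on every iteration.
pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}
```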
I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte other than 42 is encountered.
Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, trying to panic on the second check (which presumably comes from the a[1..] slicing). The .LBB0_9 branch can never be taken, and is indeed optimized out when the function is written differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant (reconstructed below) compiles to better code:
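(The second snippet is also missing from this copy. Judging by the { trim_in_place(&mut a); a } body quoted just below, it takes the slice by value and returns the trimmed slice; the name trim is assumed.)

```rust
// Reconstruction of the by-value variant: the loop updates a local slice,
// so nothing is stored through a &mut pointer while it runs. This is the
// same shape as the local-copy workaround posted by dotdash above.
pub fn trim(mut a: &[u8]) -> &[u8] {
    while a.first() == Some(&42) {
        a = &a[1..];
    }
    a
}
```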
It does all the loop calculations in registers. Also, marking the first function as inline(always) and defining the second function in terms of it ({ trim_in_place(&mut a); a }) optimizes to the same code. Both variants check whether the slice is empty before entering the loop and return immediately in that case.
Background
I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, and the slices are stored in a struct. Applying inline(always) to that function gives a small but measurable performance improvement across the board, but increases code size by 10 KiB (Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly code (far too much code), but I suspect this means the problem, and the solution, carry over to real code.
Meta
Reproduces on the nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local rustc has the same problem. The exact commit is meaningless because it only exists in my rust.git fork; it's based on f9f580936d81ed527115ae86375f69bb77723b1c.