
Manipulating slice through &mut parameter not optimized well #27130

Closed
hanna-kruppe opened this issue Jul 19, 2015 · 20 comments
Assignees: dotdash
Labels:
- A-codegen: Area: Code generation
- C-enhancement: Category: An issue proposing an enhancement or a PR with one.
- E-needs-test: Call for participation: An issue has been fixed and does not reproduce, but no test has been added.
- I-slow: Issue: Problems and improvements with respect to performance of generated code.
- P-low: Low priority
- T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@hanna-kruppe (Contributor)

When passing a &mut &[u8] to a function that trims elements from the start/end of the slice, the generated code contains pointless length checks and stack traffic. Writing the function slightly differently, or inlining it, does produce the code I'd expect.

This is a minimal example (playground of all following snippets):

pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}

I expect this to compile to a tight loop that simply decrements the slice length and increments the data pointer in registers, until the length reaches 0 or a byte value of 42 is encountered.

Instead, the loop writes to the stack on every iteration and checks twice whether the length is zero, with a panic path for the second check (which presumably comes from the a[1..] slicing).

.LBB0_5:
    movzx   edx, byte ptr [rax - 1]
    cmp edx, 42
    jne .LBB0_8 # exit function
    cmp rcx, -1
    je  .LBB0_9 # panic, never taken
    mov qword ptr [rdi], rax
    mov qword ptr [rdi + 8], rcx
    dec rcx
    inc rax
    cmp rcx, -1
    jne .LBB0_5

The .LBB0_9 branch can never be taken, and is indeed optimized out when writing the function differently. The stack traffic also disappears, though that probably has an entirely different cause. This variant compiles to better code:

pub fn trim_functionally(mut a: &[u8]) -> &[u8] {
    while a.first() == Some(&42) {
        a = &a[1..];
    }
    a
}

It does all the loop calculations in registers. Also, marking the first function as #[inline(always)] and defining the second function in terms of it ({ trim_in_place(&mut a); a }) optimizes to the same code; a sketch of that wrapper appears below.

.LBB1_2:
    movzx   eax, byte ptr [rdi]
    cmp eax, 42
    jne .LBB1_3 # exit function
    inc rdi
    dec rsi
    jne .LBB1_2

Both variants check if the slice is empty before entering the loop and return immediately in that case.
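
For completeness, here is a minimal sketch of the #[inline(always)] wrapper described above. This is my reconstruction of the playground snippet, and the name trim_via_in_place is mine:

#[inline(always)]
pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}

// Defining the functional variant in terms of the in-place one; with the
// forced inlining this optimizes to the same tight loop as trim_functionally.
pub fn trim_via_in_place(mut a: &[u8]) -> &[u8] {
    trim_in_place(&mut a);
    a
}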

Background

I encountered this problem in the context of my dec2flt work, in a larger and more complicated function that trims two slices on both ends, where the slices are stored in a struct. Applying #[inline(always)] to that function gives a small but measurable performance improvement across the board, but increases code size by 10 KiB (Edit: I just didn't measure properly. Size is about the same.). I haven't checked the assembly (far too much code), but I suspect this means the problem, and the solution, carry over to real code.

Meta

Reproduces on the nightly playground. Stable is worse for all variants, presumably because it doesn't have #26411 (yay, progress!). My local rustc has the same problem. The exact commit is meaningless because it only exists in my rust.git fork; it's based on f9f580936d81ed527115ae86375f69bb77723b1c.

@eefriedman (Contributor)

The first example is much harder for LLVM to optimize because each iteration of the loop stores to a value which would be globally visible if the loop panics. I think LLVM would figure it out with the right set of optimization passes, though.

The other examples are much easier to optimize because LLVM can trivially eliminate the stack traffic using scalarrepl.

@hanna-kruppe (Contributor, Author)

According to the LLVM IR, the &mut &[u8] parameter is noalias and it's not passed to the panic function as an argument. Rust-semantics-wise, I also don't see how the slice could escape in case of a panic. I'm not an expert on either, but I don't see how LLVM could consider it to be escaping.

@eefriedman (Contributor)

The slice "escapes" like so:

use std::sync::Mutex;
use std::sync::Arc;

pub fn trim_in_place(a: &mut &[u8]) {
    while a.first() == Some(&42) {
        *a = &a[2..];
    }
}

fn main() {
    static X: &'static [u8] = &[42, 0, 42];
    let m = Arc::new(Mutex::new(X));
    let m2 = m.clone();
    let _ = std::thread::spawn(move || {
        let mut r = m.lock().unwrap();
        // Trims to &[42] on the first iteration, then panics while
        // evaluating &a[2..] on the second, poisoning the mutex.
        trim_in_place(&mut *r)
    }).join();
    let r = match m2.lock() {
        Ok(r) => r,
        Err(r) => r.into_inner()
    };
    // The store that happened before the panic is observable here.
    assert_eq!(*r, &[42]);
}

@dotdash (Contributor) commented Jul 20, 2015

The mentioned transformation that leads to better code is what @reinerp suggested here: #26494 (comment)

@steveklabnik added the I-slow label Jul 20, 2015
@dotdash (Contributor) commented Jul 20, 2015

I'm preparing a PR that will eliminate the extra null check. For the loop itself, I don't have a fix for rustc (yet?), but this optimizes better:

pub fn trim_in_place(a: &mut &[u8]) {
    let mut x = *a;
    while x.first() == Some(&42) {
        x = &x[1..];
    }
    *a = x;
}

You get the tight loop with that, and with the PR I'm preparing, the result is this:

_ZN13trim_in_place20h5d93414c92587a1aeaaE:
    .cfi_startproc
    movq    (%rdi), %rax
    movq    8(%rdi), %rdx
    xorl    %ecx, %ecx
    testq   %rdx, %rdx
    je  .LBB0_5
    xorl    %ecx, %ecx
    .align  16, 0x90
.LBB0_2:
    movzbl  (%rax), %esi
    cmpl    $42, %esi
    jne .LBB0_3
    incq    %rax
    decq    %rdx
    jne .LBB0_2
    jmp .LBB0_5
.LBB0_3:
    movq    %rdx, %rcx
.LBB0_5:
    movq    %rax, (%rdi)
    movq    %rcx, 8(%rdi)
    retq

Still slightly weird with respect to its usage of rcx, but at least the loop looks good.

@Aatch (Contributor) commented Jul 22, 2015

@dotdash yeah, LLVM register usage can be a little weird sometimes. Its live-range analysis can produce some odd results at times, meaning that it won't reuse a register even when it can.

dotdash added a commit to dotdash/rust that referenced this issue Jul 22, 2015
We currently copy fat pointers using memcpy intrinsics, which SROA will
split up into load/store pairs. Unfortunately, nonnull metadata on loads
from said fat pointers might get lost in the process. So instead of
using memcpy we can go ahead and use load/store pairs right away
including the appropriate nonnull metadata on the loads. That way the
metadata doesn't get lost and LLVM has better chances of eliminating
null checks.

One example for this is the code in rust-lang#27130 that currently has an extra
null check that disappears after this change.
@brson added the P-low, T-compiler, and A-codegen labels Jan 26, 2017
@brson (Contributor) commented Jan 26, 2017

@rkruppe still a problem?

@dotdash (Contributor) commented Jan 26, 2017

@brson still as bad in stable and beta, but it even regressed from there in nightly :-/

_ZN8rust_out13trim_in_place17ha198dc1ee798e461E:
	.cfi_startproc
	pushq	%rax
.Ltmp0:
	.cfi_def_cfa_offset 16
	movq	8(%rdi), %rax
	jmp	.LBB0_1
	.p2align	4, 0x90
.LBB0_5:
	incq	(%rdi)
	movq	%rax, 8(%rdi)
.LBB0_1:
	decq	%rax
	cmpq	$-1, %rax
	setne	%cl
	je	.LBB0_3
	movq	(%rdi), %rcx
	cmpb	$42, (%rcx)
	sete	%cl
.LBB0_3:
	testb	%cl, %cl
	je	.LBB0_6
	cmpq	$-1, %rax
	jne	.LBB0_5
	movl	$1, %edi
	xorl	%esi, %esi
	callq	_ZN4core5slice22slice_index_order_fail17h9eb379df958d4186E@PLT
.LBB0_6:
	popq	%rax
	retq
.Lfunc_end0:
	.size	_ZN8rust_out13trim_in_place17ha198dc1ee798e461E, .Lfunc_end0-_ZN8rust_out13trim_in_place17ha198dc1ee798e461E
	.cfi_endproc

@dotdash self-assigned this Jan 26, 2017
@dotdash (Contributor) commented Jan 29, 2017

OK, so the regression here is interesting. Thanks to a change in #38854, which makes us avoid FCA (first-class aggregate) loads/stores in favor of memcpys, LLVM got a bit better at understanding this code and sees that <Option<&u8> as PartialEq>::eq is only called with a single unique value for the RHS, so it specializes the function. Unfortunately this happens before some other optimizations can kick in, and while LLVM could optimize the generic code, it fails with the specialized version.
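
For illustration only (this is my paraphrase, not code from the thread): the loop condition boils down to a call like the following, whose right-hand side is the same constant on every call, which is what invites the specialization.

// Roughly what `a.first() == Some(&42)` desugars to: a call to the
// PartialEq impl for Option<&u8> with a constant right-hand side.
fn loop_condition(a: &[u8]) -> bool {
    <Option<&u8> as PartialEq>::eq(&a.first(), &Some(&42))
}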

@dotdash (Contributor) commented Jan 29, 2017

I've found a few places where tweaking things in LLVM would prevent this from regressing, the most promising one being a fix to folding zexts into PHIs with only two incoming values, a special case that is currently not handled.
Since I'm not fully aware of the implications this might have, I asked about it on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2017-January/109631.html

A patch on top of the current LLVM trunk at least gets rid of the regression, though the original problem still remains.

@dotdash (Contributor) commented Feb 4, 2017

The patch to handle the "regression" has landed in LLVM trunk.

@arielb1 (Contributor) commented Feb 4, 2017

@dotdash

Could we backport it?

@dotdash (Contributor) commented Feb 4, 2017

@arielb1 we probably could (can't check right now), but I'm not sure how useful it is in real-world code. The commit is llvm-mirror/llvm@7c4d39c

@Mark-Simulacrum added the C-enhancement label Jul 22, 2017
@nox (Contributor) commented Apr 2, 2018

The "regression" is no more, but the original issue still persists.

https://godbolt.org/g/i6iMJa

Cc @rust-lang/wg-codegen

@nikic (Contributor) commented Dec 24, 2018

Looks like the bounds check is being optimized away since rustc 1.25.

@oli-obk (Contributor) commented Dec 25, 2018

Do we have any perf tests that tell us when we regress it?

@oli-obk added the E-needs-test label Dec 25, 2018
@anp (Member) commented Dec 27, 2018

I would assume that a regression would show up somewhere in lolbench, but there are a lot of events showing up there currently with no effort yet to triage or reproduce them, so I doubt we'll know based on those until the tooling is further along.

@bugadani (Contributor) commented Aug 1, 2020

I don't know how helpful this is, but it looks like rustc really likes match patterns: https://rust.godbolt.org/z/TP9zKc
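
A rough sketch of the kind of pattern-matching formulation this presumably refers to; this is my guess at the shape, not the code behind the godbolt link:

pub fn trim_in_place_match(a: &mut &[u8]) {
    // Same loop as before, but the comparison is written as a pattern
    // match instead of an Option equality check.
    while let Some(&42) = a.first() {
        *a = &a[1..];
    }
}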

@bugadani (Contributor)

This issue seems to be fixed in nightly.

bugadani added a commit to bugadani/rust that referenced this issue Aug 25, 2020
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Aug 27, 2020
Add test for issue rust-lang#27130

rust-lang#27130 seems to be fixed by the llvm 11 update. The issue is marked with needs-test, so here it is. As some historical context, the generated code was fine until 1.38, and remained unoptimized from 1.38 up until the current nightly.

I've also added a pattern matching version that was fine on 1.45.2.
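
For reference, a hedged sketch of what such a codegen test generally looks like in the rustc test suite; the file path, CHECK lines, and exact contents here are my guesses, not necessarily what rust-lang#75910 actually added:

// (hypothetical) src/test/codegen/issue-27130.rs
// compile-flags: -O

#![crate_type = "lib"]

// CHECK-LABEL: @trim_in_place
#[no_mangle]
pub fn trim_in_place(a: &mut &[u8]) {
    // CHECK-NOT: slice_index_order_fail
    while a.first() == Some(&42) {
        *a = &a[1..];
    }
}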
pietroalbini added a commit to pietroalbini/rust that referenced this issue Aug 28, 2020 (same commit message as above)
Dylan-DPC-zz pushed further commits to Dylan-DPC-zz/rust that referenced this issue Aug 29, 2020 (same commit message as above)
bors added a commit to rust-lang-ci/rust that referenced this issue Aug 30, 2020
Rollup of 14 pull requests

Successful merges:

 - rust-lang#75832 (Move to intra-doc links for wasi/ext/fs.rs, os_str_bytes.rs…)
 - rust-lang#75852 (Switch to intra-doc links in `core::hash`)
 - rust-lang#75874 (Shorten liballoc doc intra link while readable)
 - rust-lang#75881 (Expand rustdoc theme chooser x padding)
 - rust-lang#75885 (Fix another clashing_extern_declarations false positive.)
 - rust-lang#75892 (Fix typo in TLS Model in Unstable Book)
 - rust-lang#75910 (Add test for issue rust-lang#27130)
 - rust-lang#75917 (Move to intra doc links for core::ptr::non_null)
 - rust-lang#75975 (Allow --bless'ing expect-tests in tools)
 - rust-lang#75990 (Add __fastfail for Windows on arm/aarch64)
 - rust-lang#76015 (Fix loading pretty-printers in rust-lldb script)
 - rust-lang#76022 (Clean up rustdoc front-end source code)
 - rust-lang#76029 (Move to intra-doc links for library/core/src/sync/atomic.rs)
 - rust-lang#76057 (Move retokenize hack to save_analysis)

Failed merges:

r? @ghost
@JohnTitor (Member)

The test was added in #75910, closing.
