
Weird copy happening during initialization #56191

Closed
jrmuizel opened this issue Nov 24, 2018 · 14 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@jrmuizel
Contributor

This code:

#[inline(never)]
pub fn do_ants() {
    do_item(&Outer {
                item: SpecificDisplayItem2::PopStackingContext,
                info: LayoutPrimitiveInfo2::new(),
            });
}

#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2 {
    pub rect: [f32; 7],
    pub is_backface_visible: u16,
}

impl LayoutPrimitiveInfo2 {
    fn new() -> Self {
        Self {
            rect: [0.; 7],
            is_backface_visible:0,
        }
    }
}

pub enum SpecificDisplayItem2 {
    PopStackingContext,
    Other([f64; 22]),
}

struct Outer {
    item: SpecificDisplayItem2,
    info: LayoutPrimitiveInfo2
}

fn do_item(di: &Outer) { unsafe { ext(di) } }
extern {
    fn ext(di: &Outer);
}

compiles to:

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 256
 xorps   xmm0, xmm0
 movaps  xmmword ptr [rbp - 32], xmm0
 mov     qword ptr [rbp - 10], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 248], 0
 mov     rax, qword ptr [rbp - 32]
 mov     rcx, qword ptr [rbp - 24]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], rcx
 mov     rax, qword ptr [rbp - 16]
 mov     qword ptr [rbp - 48], rax
 mov     rax, qword ptr [rbp - 8]
 mov     qword ptr [rbp - 40], rax
 lea     rdi, [rbp - 248]
 call    _ext
 add     rsp, 256
 pop     rbp
 ret

When the u16 is changed to a u32, we get the saner:

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 224
 mov     qword ptr [rbp - 216], 0
 mov     qword ptr [rbp - 8], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 24], 0
 mov     qword ptr [rbp - 32], 0
 lea     rdi, [rbp - 216]
 call    _ext
 add     rsp, 224
 pop     rbp
 ret
@eddyb
Member

eddyb commented Nov 24, 2018

You should be able to reproduce any of these with C++ code in clang and report them as LLVM bugs.

But also, it helps to explain what is weird here. In this case, my guess is the reads back out from the stack? Also, we should ignore the lack of SSE usage in the saner example, at least for now, right?

cc @rust-lang/wg-codegen Can someone look at the LLVM IR here and try to figure out if there's anything we could tell LLVM that would make it not do this?

@nagisa
Member

nagisa commented Nov 24, 2018

The code we generate for the structure and its initialization looks something like this:

%LayoutPrimitiveInfo2 = type { [0 x i32], [7 x float], [0 x i16], i16, [1 x i16] }

; in a function
; loop to initialize all floats to 0
; then...
%9 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 3
store i16 0, i16* %9, align 2
; and implicitly
; %10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
; store [1 x i16] undef, [1 x i16]* %10, align 2

This undef seems to be the thing that prevents proper optimisation from taking place. Explicitly adding initialization of the padding to 0:

%10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
store [1 x i16] zeroinitializer, [1 x i16]* %10, align 2

makes GVN generate a memset for the whole structure rather than just a part of it, which then makes instcombine happier and better able to optimize the rest away.

@eddyb
Member

eddyb commented Nov 24, 2018

Hmm, can we indicate to LLVM that it could put anything in the padding?
I'd rather not initialize padding ourselves since that can be a deoptimization.

@nagisa
Member

nagisa commented Nov 24, 2018

EDIT: my bad, was looking at the wrong thing. LLVM is able to reach the code below with is_backface_visible: u32.

Wrong stuff

FWIW we do not reach the "ideal" codegen in either scenario. Ideally the code would look like this:

do_ants:
# %bb.0:
	subq	$216, %rsp
	movq	$0, (%rsp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, 200(%rsp)
	movups	%xmm0, 184(%rsp)
	movq	%rsp, %rdi
	callq   ext
	addq	$216, %rsp
	retq

which can be achieved by replacing the initialization with unsafe { ::std::mem::zeroed() }.


Hmm, can we indicate to LLVM that it could put anything in the padding?

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see. I cannot really tell the exact underlying analysis/optimisation that falls short, and I’m probably not going to spend time on this any time soon.

@nikic might be interested, though.

@eddyb
Member

eddyb commented Nov 24, 2018

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see.

Does it know it could write there, though? That is, undef vs "value is unmodified".
Or is it optimizing all of this on an alloca?

@nagisa
Member

nagisa commented Nov 24, 2018

is it optimizing all of this on an alloca?

Yes.

@nikic
Contributor

nikic commented Nov 24, 2018

I don't think that memset formation is the right place to optimize here. In the general case extending a partial memset to a memset of the full structure may not necessarily be advantageous.

It would be preferable to handle this during memset-memcpy forwarding. I have something like https://gist.github.com/nikic/3ca2f08c70bf8d9c0a634d42d5533f17 in mind, which would be enough to cover this case. This will fall short though if there is interior padding.

@jrmuizel
Contributor Author

The equivalent C++ generates saner code regardless of the type of is_backface_visible:

#include <stdint.h>
#include <cstring>

struct LayoutPrimitiveInfo2 {
        LayoutPrimitiveInfo2() {
                memset(rect, 0, sizeof(rect));
                is_backface_visible = 0;
        }
        float rect[7];
        uint16_t is_backface_visible;
};

struct SpecificDisplayItem2 {
    static SpecificDisplayItem2 PopStackingContext() {
            SpecificDisplayItem2 ret;
            ret.p.a.disc = 0;
            return ret;
    }
    union {
            struct A {
                    uint64_t disc;
            } a;
            struct B {
                    uint64_t disc;
                    double f[22];
            } b;
    } p;
};

struct Outer {
    SpecificDisplayItem2 item;
    LayoutPrimitiveInfo2 info;
};

extern int do_item(Outer &o);

void do_ants() {
    Outer o =  {
                SpecificDisplayItem2::PopStackingContext(),
                LayoutPrimitiveInfo2(),
            };
    do_item(o);
}
do_ants(): # @do_ants()
  sub rsp, 216
  mov qword ptr [rsp], 0
  xorps xmm0, xmm0
  movups xmmword ptr [rsp + 198], xmm0
  movups xmmword ptr [rsp + 184], xmm0
  mov rdi, rsp
  call do_item(Outer&)
  add rsp, 216
  ret

@nikic
Contributor

nikic commented Nov 30, 2018

I've submitted https://reviews.llvm.org/D55120.

Ultimately this is just papering things over though. I think a more general solution would be to instead check if a memcpy can be hoisted upwards past a store/memset chain and absorbed into an alloca. Basically what call slot optimization does, just not for calls.

@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 1, 2018
@nikic
Contributor

nikic commented Dec 8, 2018

Patch landed upstream, so this is waiting on an LLVM update now.

@jrmuizel
Contributor Author

jrmuizel commented Dec 8, 2018 via email

@nikic
Contributor

nikic commented Dec 9, 2018

Yes, I think it would be reasonable to cherry-pick this change. I'd like to wait for a few miscompile fixes to land first, so we can batch them together.

@nikic nikic self-assigned this Dec 13, 2018
@jrmuizel
Contributor Author

@nikic the LLVM patch has baked for a while now. Do you think you could cherry-pick it into Rust's LLVM?

@jrmuizel
Contributor Author

Fixed by #57675

We now get:

playground::do_ants: # @playground::do_ants
# %bb.0:
	subq	$216, %rsp
	movq	$0, (%rsp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, 198(%rsp)
	movups	%xmm0, 184(%rsp)
	movq	%rsp, %rdi
	callq	*ext@GOTPCREL(%rip)
	addq	$216, %rsp
	retq
