
Weird copy happening during initialization #56191

Closed
jrmuizel opened this issue Nov 24, 2018 · 14 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@jrmuizel
Contributor

This code:

#[inline(never)]
pub fn do_ants() {
    do_item(&Outer {
                item: SpecificDisplayItem2::PopStackingContext,
                info: LayoutPrimitiveInfo2::new(),
            });
}

#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2 {
    pub rect: [f32; 7],
    pub is_backface_visible: u16,
}

impl LayoutPrimitiveInfo2 {
    fn new() -> Self {
        Self {
            rect: [0.; 7],
            is_backface_visible:0,
        }
    }
}

pub enum SpecificDisplayItem2 {
    PopStackingContext,
    Other([f64; 22]),
}

struct Outer {
    item: SpecificDisplayItem2,
    info: LayoutPrimitiveInfo2
}

fn do_item(di: &Outer) { unsafe { ext(di) } }
extern {
    fn ext(di: &Outer);
}

compiles to:

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 256
 xorps   xmm0, xmm0
 movaps  xmmword ptr [rbp - 32], xmm0
 mov     qword ptr [rbp - 10], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 248], 0
 mov     rax, qword ptr [rbp - 32]
 mov     rcx, qword ptr [rbp - 24]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], rcx
 mov     rax, qword ptr [rbp - 16]
 mov     qword ptr [rbp - 48], rax
 mov     rax, qword ptr [rbp - 8]
 mov     qword ptr [rbp - 40], rax
 lea     rdi, [rbp - 248]
 call    _ext
 add     rsp, 256
 pop     rbp
 ret

When the u16 is changed to a u32, we get the saner:

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 224
 mov     qword ptr [rbp - 216], 0
 mov     qword ptr [rbp - 8], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 24], 0
 mov     qword ptr [rbp - 32], 0
 lea     rdi, [rbp - 216]
 call    _ext
 add     rsp, 224
 pop     rbp
 ret
@eddyb
Member

eddyb commented Nov 24, 2018

You should be able to reproduce any of these with C++ code in clang and report them as LLVM bugs.

But also, it helps to explain what is weird here. In this case, my guess is the reads back out from the stack? Also, we should ignore the lack of SSE usage in the saner example, at least for now, right?

cc @rust-lang/wg-codegen Can someone look at the LLVM IR here and try to figure out if there's anything we could tell LLVM that would make it not do this?

@nagisa
Member

nagisa commented Nov 24, 2018

The code we generate for the structure and its initialization looks something like this:

%LayoutPrimitiveInfo2 = type { [0 x i32], [7 x float], [0 x i16], i16, [1 x i16] }

; in a function
; loop to initialize all floats to 0
; then...
%9 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 3
store i16 0, i16* %9, align 2
; and implicitly
; %10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
; store [1 x i16] undef, [1 x i16]* %10, align 2

This undef seems to be the thing that prevents proper optimisation from taking place. Explicitly adding initialization of the padding to 0:

%10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
store [1 x i16] zeroinitializer, [1 x i16]* %10, align 2

makes GVN generate a memset for the whole structure rather than just a part of it, which then makes instcombine happier and better able to optimize the rest away.

@eddyb
Member

eddyb commented Nov 24, 2018

Hmm, can we indicate to LLVM that it could put anything in the padding?
I'd rather not initialize padding ourselves since that can be a deoptimization.

@nagisa
Member

nagisa commented Nov 24, 2018

EDIT: my bad, was looking at the wrong thing. LLVM is able to reach the code below with is_backface_visible: u32.

Wrong stuff

FWIW we do not reach the "ideal" codegen in either scenario. Ideally the code would look like this:

do_ants:
# %bb.0:
	subq	$216, %rsp
	movq	$0, (%rsp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, 200(%rsp)
	movups	%xmm0, 184(%rsp)
	movq	%rsp, %rdi
	callq   ext
	addq	$216, %rsp
	retq

which can be achieved by replacing the initialization with unsafe { ::std::mem::zeroed() }.


Hmm, can we indicate to LLVM that it could put anything in the padding?

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see. I cannot really tell the exact underlying analysis/optimisation that falls short, and I’m probably not going to spend time on this any time soon.

@nikic might be interested, though.

@eddyb
Member

eddyb commented Nov 24, 2018

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see.

Does it know it could write there, though? That is, undef vs "value is unmodified".
Or is it optimizing all of this on an alloca?

@nagisa
Member

nagisa commented Nov 24, 2018

is it optimizing all of this on an alloca?

Yes.

@nikic
Contributor

nikic commented Nov 24, 2018

I don't think that memset formation is the right place to optimize here. In the general case extending a partial memset to a memset of the full structure may not necessarily be advantageous.

It would be preferable to handle this during memset-memcpy forwarding. I have something like https://gist.github.com/nikic/3ca2f08c70bf8d9c0a634d42d5533f17 in mind, which would be enough to cover this case. This will fall short though if there is interior padding.

@jrmuizel
Contributor Author

The equivalent C++ generates saner code regardless of the type of is_backface_visible:

#include <stdint.h>
#include <cstring>

struct LayoutPrimitiveInfo2 {
        LayoutPrimitiveInfo2() {
                memset(rect, 0, sizeof(rect));
                is_backface_visible = 0;
        }
        float rect[7];
        uint16_t is_backface_visible;
};

struct SpecificDisplayItem2 {
    static SpecificDisplayItem2 PopStackingContext() {
            SpecificDisplayItem2 ret;
            ret.p.a.disc = 0;
            return ret;
    }
    union {
            struct A {
                    uint64_t disc;
            } a;
            struct B {
                    uint64_t disc;
                    double f[22];
            } b;
    } p;
};

struct Outer {
    SpecificDisplayItem2 item;
    LayoutPrimitiveInfo2 info;
};

extern int do_item(Outer &o);

void do_ants() {
    Outer o =  {
                SpecificDisplayItem2::PopStackingContext(),
                LayoutPrimitiveInfo2(),
            };
    do_item(o);
}
do_ants(): # @do_ants()
  sub rsp, 216
  mov qword ptr [rsp], 0
  xorps xmm0, xmm0
  movups xmmword ptr [rsp + 198], xmm0
  movups xmmword ptr [rsp + 184], xmm0
  mov rdi, rsp
  call do_item(Outer&)
  add rsp, 216
  ret

@nikic
Contributor

nikic commented Nov 30, 2018

I've submitted https://reviews.llvm.org/D55120.

Ultimately this is just papering things over though. I think a more general solution would be to instead check if a memcpy can be hoisted upwards past a store/memset chain and absorbed into an alloca. Basically what call slot optimization does, just not for calls.

@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 1, 2018
@nikic
Contributor

nikic commented Dec 8, 2018

Patch landed upstream, so this is waiting on an LLVM update now.

@jrmuizel
Contributor Author

jrmuizel commented Dec 8, 2018 via email

@nikic
Contributor

nikic commented Dec 9, 2018

Yes, I think it would be reasonable to cherry-pick this change. I'd like to wait for a few miscompile fixes to land first, so we can batch them together.

@nikic nikic self-assigned this Dec 13, 2018
@jrmuizel
Contributor Author

@nikic the LLVM patch has baked for a while now. Do you think you could cherry-pick it into Rust's LLVM?

@jrmuizel
Contributor Author

Fixed by #57675

We now get:

playground::do_ants: # @playground::do_ants
# %bb.0:
	subq	$216, %rsp
	movq	$0, (%rsp)
	xorps	%xmm0, %xmm0
	movups	%xmm0, 198(%rsp)
	movups	%xmm0, 184(%rsp)
	movq	%rsp, %rdi
	callq	*ext@GOTPCREL(%rip)
	addq	$216, %rsp
	retq
