
stage2: Miscellaneous fixes to vector arithmetic and copy elision #13074

Merged: 6 commits into ziglang:master on Nov 11, 2022

Conversation

@topolarity (Contributor) commented on Oct 5, 2022:

  • [Sema] Change array initialization to scan for stores the same way struct initialization does
  • [Sema/Value] Rename compare (and family) to compareAll (etc.), and fix many places where the wrong predicate was used for vectors
  • [Sema] Ensure that when vector arithmetic yields a compile-time-known constant, the result is still a vector (see the sketch after this list)
  • [Liveness] Add a peephole optimization for safety checks to Liveness, enabling copy/load elision
  • [LLVM] Fix up load elision to account for "byref pass-through" consumers like .array_elem_val, etc., which do not make copies
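
A minimal user-level sketch of the vector-arithmetic guarantee from the list above (illustrative only, not taken from this PR's tests):

    const std = @import("std");

    test "comptime-known vector arithmetic still produces a vector" {
        const a: @Vector(4, i32) = .{ 2, 2, 2, 2 };
        const b: @Vector(4, i32) = .{ 3, 3, 3, 3 };
        // Both operands are comptime-known, so the sum is a comptime-known
        // constant, but its type must remain @Vector(4, i32), not a scalar.
        const c = a + b;
        try std.testing.expect(@TypeOf(c) == @Vector(4, i32));
        try std.testing.expect(c[0] == 5);
    }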

P.S. As @Vexu pointed out to me, we still need a way to ensure that if, e.g., foo.arr[x][y].z performs a copy, it only copies the target object rather than all of foo. This is (usually) handled correctly already when the result is a non-by-ref type, but it needs some tightening, esp. for by-ref types.

Resolves #13064, Resolves #13065, Resolves #13069 + others mentioned in the commits

@topolarity changed the title from "stage2 llvm/sema: Miscellaneous fixes" to "stage2 llvm/sema: Miscellaneous fixes to vector arithmetic and copy elision" on Oct 5, 2022
@topolarity changed the title from "stage2 llvm/sema: Miscellaneous fixes to vector arithmetic and copy elision" to "stage2: Miscellaneous fixes to vector arithmetic and copy elision" on Oct 5, 2022
@andrewrk (Member) commented:

Could you rebase against the master branch, please?

@topolarity force-pushed the stage2-opt branch 2 times, most recently from bb319a1 to 5e4a560, on October 13, 2022 17:26
@topolarity (Contributor, Author) commented on Oct 13, 2022:

I think cef0f11 might be exposing a bug in how we've been handling by_ref types in .struct_field_val

If I can't fix the issue, I'll have to revert this and make field/element accesses for unions and arrays perform a copy as well (meaning performance gets worse, unfortunately).

I'm going to convert this to a draft while I investigate.

All resolved now

@andrewrk (Member) commented:

That Liveness change looks like it might deserve its own PR, I'd like to take a close look at it.

@topolarity (Contributor, Author) commented:

> That Liveness change looks like it might deserve its own PR, I'd like to take a close look at it.

Aye aye, I'll spin one up

@topolarity (Contributor, Author) commented:

Made this dependent on #13221, which resolves the prior CI failures. CI passing is unfortunately still blocked on #13232, which is present on master but seems to be passing by the luck of the (alignment) draw ♠️

@topolarity (Contributor, Author) commented:

Going to revisit this now that #13221 has landed.

@topolarity force-pushed the stage2-opt branch 2 times, most recently from 90aae5a to 8ddda8d, on November 1, 2022 16:24
@billzez (Contributor) commented on Nov 1, 2022:

Is there an ETA on this? I was surprised these fixes did not go into 0.10.

@topolarity force-pushed the stage2-opt branch 3 times, most recently from ee6e525 to acab0a2, on November 2, 2022 15:07
@topolarity marked this pull request as ready for review on November 2, 2022 22:07
@andrewrk self-assigned this on Nov 3, 2022
@andrewrk (Member) left a comment:

I would like to suggest a different way of doing the LLVM changes that looks like this:

diff --git a/src/codegen/llvm.zig b/src/codegen/llvm.zig
index 8daffaefe..8f3af4c66 100644
--- a/src/codegen/llvm.zig
+++ b/src/codegen/llvm.zig
@@ -4555,7 +4555,7 @@ pub const FuncGen = struct {
                 .fptrunc        => try self.airFptrunc(inst),
                 .fpext          => try self.airFpext(inst),
                 .ptrtoint       => try self.airPtrToInt(inst),
-                .load           => try self.airLoad(inst, body, i + 1),
+                .load           => try self.airLoad(body[i..]),
                 .loop           => try self.airLoop(inst),
                 .not            => try self.airNot(inst),
                 .ret            => try self.airRet(inst),
@@ -4614,7 +4614,7 @@ pub const FuncGen = struct {
                 .atomic_store_seq_cst   => try self.airAtomicStore(inst, .SequentiallyConsistent),
 
                 .struct_field_ptr => try self.airStructFieldPtr(inst),
-                .struct_field_val => try self.airStructFieldVal(inst),
+                .struct_field_val => try self.airStructFieldVal(body[i..]),
 
                 .struct_field_ptr_index_0 => try self.airStructFieldPtrIndex(inst, 0),
                 .struct_field_ptr_index_1 => try self.airStructFieldPtrIndex(inst, 1),
@@ -4623,7 +4623,7 @@ pub const FuncGen = struct {
 
                 .field_parent_ptr => try self.airFieldParentPtr(inst),
 
-                .array_elem_val     => try self.airArrayElemVal(inst),
+                .array_elem_val     => try self.airArrayElemVal(body[i..]),
                 .slice_elem_val     => try self.airSliceElemVal(inst),
                 .slice_elem_ptr     => try self.airSliceElemPtr(inst),
                 .ptr_elem_val       => try self.airPtrElemVal(inst),
@@ -5633,7 +5633,8 @@ pub const FuncGen = struct {
         return self.builder.buildInBoundsGEP(llvm_elem_ty, base_ptr, &indices, indices.len, "");
     }
 
-    fn airArrayElemVal(self: *FuncGen, inst: Air.Inst.Index) !?*llvm.Value {
+    fn airArrayElemVal(self: *FuncGen, body_tail: []const Air.Inst.Index) !?*llvm.Value {
+        const inst = body_tail[0];
         if (self.liveness.isUnused(inst)) return null;
 
         const bin_op = self.air.instructions.items(.data)[inst].bin_op;
@@ -5646,7 +5647,11 @@ pub const FuncGen = struct {
             const elem_ptr = self.builder.buildInBoundsGEP(array_llvm_ty, array_llvm_val, &indices, indices.len, "");
             const elem_ty = array_ty.childType();
             if (isByRef(elem_ty)) {
-                return elem_ptr;
+                if (canElideLoad(self, body_tail)) {
+                    return elem_ptr;
+                } else {
+                    return self.loadByRef(elem_ptr, elem_ty, 0, false);
+                }
             } else {
                 const elem_llvm_ty = try self.dg.lowerType(elem_ty);
                 return self.builder.buildLoad(elem_llvm_ty, elem_ptr, "");
@@ -5723,7 +5728,8 @@ pub const FuncGen = struct {
         return self.fieldPtr(inst, struct_ptr, struct_ptr_ty, field_index);
     }
 
-    fn airStructFieldVal(self: *FuncGen, inst: Air.Inst.Index) !?*llvm.Value {
+    fn airStructFieldVal(self: *FuncGen, body_tail: []const Air.Inst.Index) !?*llvm.Value {
+        const inst = body_tail[0];
         if (self.liveness.isUnused(inst)) return null;
 
         const ty_pl = self.air.instructions.items(.data)[inst].ty_pl;
@@ -5782,7 +5788,15 @@ pub const FuncGen = struct {
                 const struct_llvm_ty = try self.dg.lowerType(struct_ty);
                 const field_ptr = self.builder.buildStructGEP(struct_llvm_ty, struct_llvm_val, llvm_field_index, "");
                 const field_ptr_ty = Type.initPayload(&ptr_ty_buf.base);
-                return self.load(field_ptr, field_ptr_ty);
+                if (isByRef(field_ty)) {
+                    if (canElideLoad(self, body_tail)) {
+                        return field_ptr;
+                    } else {
+                        return self.loadByRef(field_ptr, field_ty, ptr_ty_buf.data.alignment(target), false);
+                    }
+                } else {
+                    return self.load(field_ptr, field_ptr_ty);
+                }
             },
             .Union => {
                 const union_llvm_ty = try self.dg.lowerType(struct_ty);
@@ -5792,7 +5806,11 @@ pub const FuncGen = struct {
                 const llvm_field_ty = try self.dg.lowerType(field_ty);
                 const field_ptr = self.builder.buildBitCast(union_field_ptr, llvm_field_ty.pointerType(0), "");
                 if (isByRef(field_ty)) {
-                    return field_ptr;
+                    if (canElideLoad(self, body_tail)) {
+                        return field_ptr;
+                    } else {
+                        return self.loadByRef(field_ptr, field_ty, layout.payload_align, false);
+                    }
                 } else {
                     return self.builder.buildLoad(llvm_field_ty, field_ptr, "");
                 }
@@ -8030,35 +8048,35 @@ pub const FuncGen = struct {
         return null;
     }
 
-    fn airLoad(
-        self: *FuncGen,
-        inst: Air.Inst.Index,
-        body: []const Air.Inst.Index,
-        body_i: usize,
-    ) !?*llvm.Value {
-        const ty_op = self.air.instructions.items(.data)[inst].ty_op;
-        const ptr_ty = self.air.typeOf(ty_op.operand);
+    /// As an optimization, we want to avoid unnecessary copies of isByRef=true
+    /// types. Here, we scan forward in the current block, looking to see if
+    /// this load dies before any side effects occur. In such case, we can
+    /// safely return the operand without making a copy.
+    fn canElideLoad(fg: *FuncGen, body_tail: []const Air.Inst.Index) bool {
+        for (body_tail) |body_inst| {
+            switch (fg.liveness.categorizeOperand(fg.air, body_inst, body_tail[0])) {
+                .none => continue,
+                .write, .noret, .complex => return false,
+                .tomb => return true,
+            }
+        } else unreachable;
+    }
+
+    fn airLoad(fg: *FuncGen, body_tail: []const Air.Inst.Index) !?*llvm.Value {
+        const inst = body_tail[0];
+        const ty_op = fg.air.instructions.items(.data)[inst].ty_op;
+        const ptr_ty = fg.air.typeOf(ty_op.operand);
+        const ptr_info = ptr_ty.ptrInfo().data;
         elide: {
-            const ptr_info = ptr_ty.ptrInfo().data;
             if (ptr_info.@"volatile") break :elide;
-            if (self.liveness.isUnused(inst)) return null;
+            if (fg.liveness.isUnused(inst)) return null;
             if (!isByRef(ptr_info.pointee_type)) break :elide;
-
-            // It would be valid to fall back to the code below here that simply calls
-            // load(). However, as an optimization, we want to avoid unnecessary copies
-            // of isByRef=true types. Here, we scan forward in the current block,
-            // looking to see if this load dies before any side effects occur.
-            // In such case, we can safely return the operand without making a copy.
-            for (body[body_i..]) |body_inst| {
-                switch (self.liveness.categorizeOperand(self.air, body_inst, inst)) {
-                    .none => continue,
-                    .write, .noret, .complex => break :elide,
-                    .tomb => return try self.resolveInst(ty_op.operand),
-                }
-            } else unreachable;
+            if (!canElideLoad(fg, body_tail)) break :elide;
+            const ptr = try fg.resolveInst(ty_op.operand);
+            return ptr;
         }
-        const ptr = try self.resolveInst(ty_op.operand);
-        return self.load(ptr, ptr_ty);
+        const ptr = try fg.resolveInst(ty_op.operand);
+        return fg.load(ptr, ptr_ty);
     }
 
     fn airBreakpoint(self: *FuncGen, inst: Air.Inst.Index) !?*llvm.Value {
@@ -9522,6 +9540,32 @@ pub const FuncGen = struct {
         return self.llvmModule().getIntrinsicDeclaration(id, types.ptr, types.len);
     }
 
+    fn loadByRef(
+        fg: *FuncGen,
+        ptr: *llvm.Value,
+        pointee_type: Type,
+        ptr_alignment: u32,
+        is_volatile: bool,
+    ) !?*llvm.Value {
+        const pointee_llvm_ty = try fg.dg.lowerType(pointee_type);
+        const target = fg.dg.module.getTarget();
+        const result_align = pointee_type.abiAlignment(target);
+        const max_align = @max(result_align, ptr_alignment);
+        const result_ptr = fg.buildAlloca(pointee_llvm_ty, max_align);
+        const llvm_ptr_u8 = fg.context.intType(8).pointerType(0);
+        const llvm_usize = fg.context.intType(Type.usize.intInfo(target).bits);
+        const size_bytes = pointee_type.abiSize(target);
+        _ = fg.builder.buildMemCpy(
+            fg.builder.buildBitCast(result_ptr, llvm_ptr_u8, ""),
+            max_align,
+            fg.builder.buildBitCast(ptr, llvm_ptr_u8, ""),
+            max_align,
+            llvm_usize.constInt(size_bytes, .False),
+            is_volatile,
+        );
+        return result_ptr;
+    }
+
     /// This function always performs a copy. For isByRef=true types, it creates a new
     /// alloca and copies the value into it, then returns the alloca instruction.
     /// For isByRef=false types, it creates a load instruction and returns it.
@@ -9533,24 +9577,10 @@ pub const FuncGen = struct {
         const ptr_alignment = info.alignment(target);
         const ptr_volatile = llvm.Bool.fromBool(ptr_ty.isVolatilePtr());
         if (info.host_size == 0) {
-            const elem_llvm_ty = try self.dg.lowerType(info.pointee_type);
             if (isByRef(info.pointee_type)) {
-                const result_align = info.pointee_type.abiAlignment(target);
-                const max_align = @max(result_align, ptr_alignment);
-                const result_ptr = self.buildAlloca(elem_llvm_ty, max_align);
-                const llvm_ptr_u8 = self.context.intType(8).pointerType(0);
-                const llvm_usize = self.context.intType(Type.usize.intInfo(target).bits);
-                const size_bytes = info.pointee_type.abiSize(target);
-                _ = self.builder.buildMemCpy(
-                    self.builder.buildBitCast(result_ptr, llvm_ptr_u8, ""),
-                    max_align,
-                    self.builder.buildBitCast(ptr, llvm_ptr_u8, ""),
-                    max_align,
-                    llvm_usize.constInt(size_bytes, .False),
-                    info.@"volatile",
-                );
-                return result_ptr;
+                return self.loadByRef(ptr, info.pointee_type, ptr_alignment, info.@"volatile");
             }
+            const elem_llvm_ty = try self.dg.lowerType(info.pointee_type);
             const llvm_inst = self.builder.buildLoad(elem_llvm_ty, ptr, "");
             llvm_inst.setAlignment(ptr_alignment);
             llvm_inst.setVolatile(ptr_volatile);

The difference in the resulting LLVM IR between your branch and mine, for the following test, looks like this:

    var x: [10][10]u32 = undefined;

    x[0][1] = 0;
    const a = x[0];
    x[0][1] = 15;

    try expect(a[1] == 0);

 ; Function Attrs: nounwind
 define internal fastcc i16 @test.doTheTest() unnamed_addr #0 {
 Entry:
-  %0 = alloca [10 x [10 x i32]], align 4
+  %0 = alloca [10 x i32], align 4
   %1 = alloca [10 x [10 x i32]], align 4
   call void @llvm.memset.p0.i64(ptr align 4 %1, i8 -86, i64 400, i1 false)
   %2 = getelementptr inbounds [10 x [10 x i32]], ptr %1, i32 0, i64 0
   %3 = getelementptr inbounds [10 x i32], ptr %2, i32 0, i64 1
   store i32 0, ptr %3, align 4
-  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %0, ptr align 4 %1, i64 400, i1 false)
-  %4 = getelementptr inbounds [10 x [10 x i32]], ptr %0, i32 0, i64 0
+  %4 = getelementptr inbounds [10 x [10 x i32]], ptr %1, i32 0, i64 0
+  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %0, ptr align 4 %4, i64 40, i1 false)
   %5 = getelementptr inbounds [10 x [10 x i32]], ptr %1, i32 0, i64 0
   %6 = getelementptr inbounds [10 x i32], ptr %5, i32 0, i64 1
   store i32 15, ptr %6, align 4
-  %7 = getelementptr inbounds [10 x i32], ptr %4, i32 0, i64 1
+  %7 = getelementptr inbounds [10 x i32], ptr %0, i32 0, i64 1
   %8 = load i32, ptr %7, align 4
   %9 = icmp eq i32 %8, 0
   %10 = call fastcc i16 @testing.expect(i1 %9)

This LLVM IR passes the respective test while copying 40 bytes instead of 400 bytes.

In my opinion, these changes to llvm.zig are easier to maintain, since they do not add messy logic to the load instruction but instead build on the observation that array_elem_val is itself a load instruction (as are slice_elem_val, ptr_elem_val, and struct_field_val; these will need to be updated with similar logic).

@andrewrk removed their assignment on Nov 3, 2022
@topolarity (Contributor, Author) commented:

I like that approach! Seems like a nice improvement.

My sis got hitched last weekend 🎉 so I haven't gotten to fix this one up, but I should have it ready for review in a day or two.

This is a follow-up to 9dc98fb, which made comptime initialization
patterns for union/struct more robust, especially when storing to
comptime-known pointers (and globals).

Resolves ziglang#13063.

This change makes each of the `*_val` instructions check whether it is
safe to elide copies for by-ref types, rather than performing the
elision blindly.

AIR instructions fixed:
 - .array_elem_val
 - .struct_field_val
 - .unwrap_errunion_payload
 - .try
 - .optional_payload

These now all respect value semantics, as expected.

P.S. Thanks to Andrew for the new way to approach this. Many of the
lines here are from his recommended change, which comes with the
significant advantage that loads are now as small as the intervening
memory access allows.

Co-authored by: Andrew Kelley <andrew@ziglang.org>
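
A minimal user-level sketch of the value-semantics guarantee described above (the struct and test are illustrative, not taken from this PR's test suite):

    const std = @import("std");

    const S = struct { arr: [10]u32 };

    test "field access has value semantics" {
        var s = S{ .arr = [_]u32{0} ** 10 };
        const copy = s.arr; // lowered via struct_field_val; must behave as a copy
        s.arr[1] = 15; // later writes to the source...
        try std.testing.expect(copy[1] == 0); // ...must not show up in the copy
    }
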
Adds optimizations for by-ref types to:
  - .struct_field_val
  - .slice_elem_val
  - .ptr_elem_val

I would have expected LLVM to be able to optimize away these
temporaries since we don't leak pointers to them and they are fed
straight from def to use, but empirically it does not.

Resolves ziglang#12713
Resolves ziglang#12638

This bug is already resolved; this commit just makes sure we don't lose
the test case. Closes ziglang#12043.

These functions have a very error-prone API. They are essentially
`all(cmp(op, ...))` but that's not reflected in the name.

This renames these functions to `compareAllAgainstZero...` etc.
for clarity and fixes >20 locations where the predicate was
incorrect.

In the future, the scalar `compare` should probably be split off
from the vector comparison. Rank-polymorphic programming is great,
but a proper implementation in Zig would decouple comparison and
reduction, which then needs a way to fuse ops at comptime.
@andrewrk (Member) left a comment:

Looks great! I think we will have some happy users when this lands.

@andrewrk merged commit 892fb0f into ziglang:master on Nov 11, 2022
@billzez (Contributor) commented on Nov 11, 2022:

Thanks @topolarity !

@andrewrk added this to the 0.10.1 milestone on Jan 9, 2023
andrewrk added a commit that referenced this pull request Jan 9, 2023
stage2: Miscellaneous fixes to vector arithmetic and copy elision