
Nested array initialization is not optimized #58625

Open
upsuper opened this issue Feb 21, 2019 · 7 comments
Labels
A-mir-opt - Area: MIR optimizations
I-slow - Issue: Problems and improvements with respect to performance of generated code.
WG-llvm - Working group: LLVM backend code generation

Comments

upsuper (Contributor) commented Feb 21, 2019

See the following code:

pub fn foo() -> [[i32; 1000]; 1000] {
    [[0; 1000]; 1000]
}

pub fn bar() -> [i32; 1_000_000] {
    [0; 1_000_000]
}

Ideally they should generate the same code, since [[i32; 1000]; 1000] is essentially just [i32; 1_000_000]. However, the first function does one memset to initialize a [i32; 1000] on the stack and then repeatedly memcpys it to build the big array, while the second uses a single memset.

The code generated for the first is far larger than for the second, and I would also expect it to be much slower given that it repeatedly invokes memcpy.
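
For comparison, a formulation that does reduce to a single memset is zero-initializing the whole value at once. The function foo_flat below is a hypothetical illustration added here, not code from the issue; the expectation (under that assumption) is that it lowers to one memset over the 4,000,000-byte return slot, just like bar:

pub fn foo_flat() -> [[i32; 1000]; 1000] {
    // The all-zero bit pattern is a valid value for a nested i32 array,
    // so zeroed() is sound here; the whole return slot can be filled in
    // one pass rather than memset-ing one row and memcpy-ing it 1000 times.
    unsafe { std::mem::zeroed() }
}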

upsuper (Contributor, Author) commented Feb 21, 2019

Similarly, if I do [[unsafe { std::mem::uninitialized() }; 1000]; 1000], the initialization of the inner array is skipped, but the uninitialized array contents are still memcpy'd to initialize the outer array.
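
Spelled out as a full function, the case being described is presumably something like the following (baz is a name introduced here for illustration; std::mem::uninitialized was later deprecated in favour of MaybeUninit):

pub fn baz() -> [[i32; 1000]; 1000] {
    // The inner array is left uninitialized, but the outer repeat expression
    // still copies that uninitialized row 1000 times instead of skipping the
    // initialization of the whole value.
    [[unsafe { std::mem::uninitialized() }; 1000]; 1000]
}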

@Centril Centril added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Feb 21, 2019
@scottmcm scottmcm added the WG-llvm Working group: LLVM backend code generation label Feb 22, 2019
@oli-obk oli-obk self-assigned this Feb 22, 2019
@oli-obk oli-obk removed their assignment Mar 12, 2019
nikic (Contributor) commented Mar 12, 2021

No changes: https://godbolt.org/z/o1xxd1

Also for the uninitialized case: https://godbolt.org/z/EEh9ae
This one is surprising to me.

@nikic nikic self-assigned this Mar 12, 2021
nikic (Contributor) commented Mar 12, 2021

The uninitialized case seems to be due to a bug in the location-aware MSSA clobber walker, which currently bails when the starting access is a phi.

nikic (Contributor) commented Mar 12, 2021

Upstream patch: https://reviews.llvm.org/D98557

nikic (Contributor) commented Mar 13, 2021

Upstream fix: llvm/llvm-project@b2f933a

Assuming no phase ordering issues, we should be generating optimal code for both the zero-initialization and the uninitialized case after this.

@oli-obk oli-obk added the A-mir-opt Area: MIR optimizations label Mar 13, 2021
nikic (Contributor) commented Aug 22, 2021

After #87570 the uninitialized case results in:

define void @_ZN5test23foo17he3396acf5b2d9080E([1000 x [1000 x i32]]* noalias nocapture sret([1000 x [1000 x i32]]) dereferenceable(4000000) %0) unnamed_addr #0 {
start:
  %1 = getelementptr inbounds [1000 x [1000 x i32]], [1000 x [1000 x i32]]* %0, i64 0, i64 0
  %2 = getelementptr inbounds [1000 x [1000 x i32]], [1000 x [1000 x i32]]* %0, i64 0, i64 1000
  br label %repeat_loop_body2

repeat_loop_body2:                                ; preds = %repeat_loop_body2, %start
  %3 = phi [1000 x i32]* [ %1, %start ], [ %4, %repeat_loop_body2 ]
  %4 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 25
  %.not5.24 = icmp eq [1000 x i32]* %4, %2
  br i1 %.not5.24, label %repeat_loop_next3, label %repeat_loop_body2

repeat_loop_next3:                                ; preds = %repeat_loop_body2
  ret void
}

The memcpy is eliminated, but a dead loop is left behind: the loop only becomes dead once MemCpyOpt removes the memcpy, and the LoopDeletion pass runs before MemCpyOpt, so it never gets a chance to clean it up.
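
Once that dead loop is deleted, nothing in the function has any effect, so it should collapse to a plain return. A hand-written sketch of the expected result (not actual compiler output):

define void @_ZN5test23foo17he3396acf5b2d9080E([1000 x [1000 x i32]]* noalias nocapture sret([1000 x [1000 x i32]]) dereferenceable(4000000) %0) unnamed_addr #0 {
start:
  ; The uninitialized case needs no stores at all once the loop is gone.
  ret void
}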

And the zero-initialized results in:

define void @_ZN5test23foo17he3396acf5b2d9080E([1000 x [1000 x i32]]* noalias nocapture sret([1000 x [1000 x i32]]) dereferenceable(4000000) %0) unnamed_addr #0 {
start:
  %1 = getelementptr inbounds [1000 x [1000 x i32]], [1000 x [1000 x i32]]* %0, i64 0, i64 0
  %2 = getelementptr inbounds [1000 x [1000 x i32]], [1000 x [1000 x i32]]* %0, i64 0, i64 1000
  br label %repeat_loop_body

repeat_loop_body:                                 ; preds = %repeat_loop_body, %start
  %3 = phi [1000 x i32]* [ %1, %start ], [ %53, %repeat_loop_body ]
  %4 = bitcast [1000 x i32]* %3 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %4, i8 0, i64 4000, i1 false)
  %5 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 1
  %6 = bitcast [1000 x i32]* %5 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %6, i8 0, i64 4000, i1 false)
  %7 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 2
  %8 = bitcast [1000 x i32]* %7 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %8, i8 0, i64 4000, i1 false)
  %9 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 3
  %10 = bitcast [1000 x i32]* %9 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %10, i8 0, i64 4000, i1 false)
  %11 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 4
  %12 = bitcast [1000 x i32]* %11 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %12, i8 0, i64 4000, i1 false)
  %13 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 5
  %14 = bitcast [1000 x i32]* %13 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %14, i8 0, i64 4000, i1 false)
  %15 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 6
  %16 = bitcast [1000 x i32]* %15 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %16, i8 0, i64 4000, i1 false)
  %17 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 7
  %18 = bitcast [1000 x i32]* %17 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %18, i8 0, i64 4000, i1 false)
  %19 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 8
  %20 = bitcast [1000 x i32]* %19 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %20, i8 0, i64 4000, i1 false)
  %21 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 9
  %22 = bitcast [1000 x i32]* %21 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %22, i8 0, i64 4000, i1 false)
  %23 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 10
  %24 = bitcast [1000 x i32]* %23 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %24, i8 0, i64 4000, i1 false)
  %25 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 11
  %26 = bitcast [1000 x i32]* %25 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %26, i8 0, i64 4000, i1 false)
  %27 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 12
  %28 = bitcast [1000 x i32]* %27 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %28, i8 0, i64 4000, i1 false)
  %29 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 13
  %30 = bitcast [1000 x i32]* %29 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %30, i8 0, i64 4000, i1 false)
  %31 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 14
  %32 = bitcast [1000 x i32]* %31 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %32, i8 0, i64 4000, i1 false)
  %33 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 15
  %34 = bitcast [1000 x i32]* %33 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %34, i8 0, i64 4000, i1 false)
  %35 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 16
  %36 = bitcast [1000 x i32]* %35 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %36, i8 0, i64 4000, i1 false)
  %37 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 17
  %38 = bitcast [1000 x i32]* %37 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %38, i8 0, i64 4000, i1 false)
  %39 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 18
  %40 = bitcast [1000 x i32]* %39 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %40, i8 0, i64 4000, i1 false)
  %41 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 19
  %42 = bitcast [1000 x i32]* %41 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %42, i8 0, i64 4000, i1 false)
  %43 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 20
  %44 = bitcast [1000 x i32]* %43 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %44, i8 0, i64 4000, i1 false)
  %45 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 21
  %46 = bitcast [1000 x i32]* %45 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %46, i8 0, i64 4000, i1 false)
  %47 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 22
  %48 = bitcast [1000 x i32]* %47 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %48, i8 0, i64 4000, i1 false)
  %49 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 23
  %50 = bitcast [1000 x i32]* %49 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %50, i8 0, i64 4000, i1 false)
  %51 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 24
  %52 = bitcast [1000 x i32]* %51 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000) %52, i8 0, i64 4000, i1 false)
  %53 = getelementptr inbounds [1000 x i32], [1000 x i32]* %3, i64 25
  %.not.24 = icmp eq [1000 x i32]* %53, %2
  br i1 %.not.24, label %repeat_loop_next, label %repeat_loop_body

repeat_loop_next:                                 ; preds = %repeat_loop_body
  ret void
}

The memcpy is gone here as well, but we still perform memsets in a loop, instead of one large memset. I believe LoopIdiomRecognize would handle this, but it also runs before MemCpyOpt.
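
For reference, the hoped-for output for the zero-initialized case is a single memset covering all 4,000,000 bytes of the return slot. This is a hand-written sketch in the same style as the dumps above, not actual compiler output:

define void @_ZN5test23foo17he3396acf5b2d9080E([1000 x [1000 x i32]]* noalias nocapture sret([1000 x [1000 x i32]]) dereferenceable(4000000) %0) unnamed_addr #0 {
start:
  ; One memset over the whole array instead of 1000 row-sized memsets in a loop.
  %1 = bitcast [1000 x [1000 x i32]]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* noundef nonnull align 4 dereferenceable(4000000) %1, i8 0, i64 4000000, i1 false)
  ret void
}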

nikic (Contributor) commented Aug 30, 2022

The uninit case looks good now: https://godbolt.org/z/vfo9PM1ca
The init case still has memsets that could be combined: https://godbolt.org/z/eba6GT5xq

@nikic nikic removed their assignment Apr 3, 2023