
Linear stack blowup with multiple Vec::push of big struct with destructor #40883

Closed
arielb1 opened this issue Mar 28, 2017 · 12 comments
Labels
A-codegen (Area: Code generation)
A-mir (Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
P-medium (Medium priority)
regression-from-stable-to-stable (Performance or correctness regression from one stable version to another.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

arielb1 (Contributor) commented Mar 28, 2017

Meta

Checked on rustc 1.12.0 with MIR on, 1.15.1, 1.16 and 1.18. Does not affect 1.12.0 with MIR off.

STR

#![crate_type="rlib"]

pub struct Big {
    drop_me: [Option<Box<u8>>; 64],
}

pub fn supersize_me(meal: fn() -> Big, out: &mut Vec<Big>) {
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal());
    out.push(meal()); // 16 calls to `push`
}

Expected Result

The function should use a small amount of stack space, definitely less than 2 kilobytes (Big is 512 bytes per copy); 1.12.0 with -Z orbit=off uses 1088 bytes of stack.

Actual Result

When compiled, the function uses more than 16384 = 8*64*16*2 bytes of stack space, as is evident from subq $16384, %rsp in the assembly - 2 copies of Big for every call to push.
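
For reference, a minimal sketch that double-checks the arithmetic (assuming a 64-bit target, where Option<Box<u8>> is a single pointer thanks to the null-pointer/niche optimization):

pub struct Big {
    drop_me: [Option<Box<u8>>; 64],
}

fn main() {
    // One niche-optimized Option<Box<u8>> is an 8-byte pointer.
    assert_eq!(std::mem::size_of::<Option<Box<u8>>>(), 8);
    // So each copy of Big is 64 * 8 = 512 bytes.
    assert_eq!(std::mem::size_of::<Big>(), 512);
    // Two live copies per `push`, 16 pushes: 2 * 512 * 16 = 16384 bytes.
    assert_eq!(2 * 512 * 16, 16384);
}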

Notes

This is the root cause for #40573. It is not new, however - it was probably always present in MIR.

arielb1 added the A-codegen, A-mir, I-slow, regression-from-stable-to-stable, and T-compiler labels on Mar 28, 2017
arielb1 (Contributor, Author) commented Mar 29, 2017

Another example:

#![crate_type="rlib"]
#![feature(rustc_private)]
extern crate rustc;

use rustc::mir::*;

pub fn biggie2(basic_blocks: &mut Vec<BasicBlockData<'static>>,
               mk: fn() -> BasicBlockData<'static>,
               may_panic: fn())
{
    {
        let value = mk();
        may_panic();
        basic_blocks.push(value);
    }

    {
        let value = mk();
        may_panic();
        basic_blocks.push(value);
    }

    {
        let value = mk();
        may_panic();
        basic_blocks.push(value);
    }
}
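
The snippet above needs a nightly toolchain because of #![feature(rustc_private)]. A self-contained sketch of the same shape, with BasicBlockData swapped for a hypothetical large struct that needs drop glue, would be:

#![crate_type="rlib"]

// Hypothetical stand-in for BasicBlockData<'static>: large and has a destructor.
pub struct Data {
    drop_me: [Option<Box<u8>>; 64],
}

pub fn biggie2(blocks: &mut Vec<Data>, mk: fn() -> Data, may_panic: fn()) {
    {
        let value = mk();
        may_panic();
        blocks.push(value);
    }

    {
        let value = mk();
        may_panic();
        blocks.push(value);
    }

    {
        let value = mk();
        may_panic();
        blocks.push(value);
    }
}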

arielb1 (Contributor, Author) commented Mar 29, 2017

And then there's this case, which is probably the big granddaddy of them all, and which probably requires help on the LLVM side to fix:

#![crate_type="rlib"]

pub fn foo(get: fn() -> [u64; 128], sink: fn(u32),
           may_panic: fn([u64; 128]) -> u32,
           something_random_with_a_dtor: Box<u32>) {
    sink(may_panic(get()));
    sink(may_panic(get()));
}

The LLVM IR ends up like this:

define void @_ZN8rust_out3foo17hb77808e3fc28588aE(void ([128 x i64]*)* nocapture, void (i32)* nocapture, i32 ([128 x i64]*)* nocapture, i32* noalias dereferenceable(4)) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
entry-block:
  %_19 = alloca [128 x i64], align 8
  %_13 = alloca [128 x i64], align 8
  %4 = bitcast [128 x i64]* %_13 to i8*
  call void @llvm.lifetime.start(i64 1024, i8* %4)
  invoke void %0([128 x i64]* noalias nocapture nonnull sret dereferenceable(1024) %_13)
          to label %bb3 unwind label %bb1

bb1:                                              ; preds = %entry-block, %bb3, %bb4, %bb5, %bb6, %bb7
  %5 = landingpad { i8*, i32 }
          cleanup
  %6 = bitcast i32* %3 to i8*
  tail call void @__rust_deallocate(i8* %6, i64 4, i64 4) #1
  resume { i8*, i32 } %5

bb3:                                              ; preds = %entry-block
  %7 = invoke i32 %2([128 x i64]* noalias nocapture nonnull dereferenceable(1024) %_13)
          to label %bb4 unwind label %bb1

bb4:                                              ; preds = %bb3
  call void @llvm.lifetime.end(i64 1024, i8* %4)
  invoke void %1(i32 %7)
          to label %bb5 unwind label %bb1

bb5:                                              ; preds = %bb4
  %8 = bitcast [128 x i64]* %_19 to i8*
  call void @llvm.lifetime.start(i64 1024, i8* %8)
  invoke void %0([128 x i64]* noalias nocapture nonnull sret dereferenceable(1024) %_19)
          to label %bb6 unwind label %bb1

bb6:                                              ; preds = %bb5
  %9 = invoke i32 %2([128 x i64]* noalias nocapture nonnull dereferenceable(1024) %_19)
          to label %bb7 unwind label %bb1

bb7:                                              ; preds = %bb6
  call void @llvm.lifetime.end(i64 1024, i8* %8)
  invoke void %1(i32 %9)
          to label %bb9 unwind label %bb1

bb9:                                              ; preds = %bb7
  %10 = bitcast i32* %3 to i8*
  tail call void @__rust_deallocate(i8* %10, i64 4, i64 4) #1
  ret void
}

The problem is that either alloca can be live at bb1, so LLVM thinks they can be simultaneously live. We can't solve this without splitting bb1, which creates lots of landing pads that also degrade performance/stack usage.

ranma42 (Contributor) commented Mar 30, 2017

@arielb1 Since the lifetimes of the allocas are disjoint, wouldn't it be possible to just use one alloca?

arielb1 (Contributor, Author) commented Mar 30, 2017

@ranma42

As a MIR optimization? Certainly. However, that won't help with LLVM inlining (LLVM has the alloca-merging-on-inlining thing, which sometimes improves matters, but not always).
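
(At the source level, a rough analogue of the single-slot idea is to funnel every push through one call site, e.g. a loop, so MIR only ever materializes one temporary. A workaround sketch for the original STR, not the proposed optimization:)

#![crate_type="rlib"]

pub struct Big {
    drop_me: [Option<Box<u8>>; 64],
}

// One `push` call site means one temporary slot for `Big`, no matter
// how many elements get pushed, so stack usage stays roughly constant.
pub fn supersize_me(meal: fn() -> Big, out: &mut Vec<Big>) {
    for _ in 0..16 {
        out.push(meal());
    }
}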

michaelwoerister added the P-medium (Medium priority) label on Mar 30, 2017
michaelwoerister (Member) commented:

According to @arielb1, this is another example of #35408. It will need LLVM changes to really be fixed. He'll look into it some more.

michaelwoerister (Member) commented:

Possibly related LLVM bug: https://bugs.llvm.org//show_bug.cgi?id=25776

arielb1 (Contributor, Author) commented Apr 2, 2017

https://bugs.llvm.org//show_bug.cgi?id=32488 & https://reviews.llvm.org/D31583

brson (Contributor) commented Apr 4, 2017

Thanks @arielb1.

nikomatsakis (Contributor) commented:

@arielb1 can you say a bit more about this "granddaddy example"? The problem seems to be (if I understand correctly) that we unwind to bb1 so that we can drop the something_random_with_a_dtor variable, and at that point LLVM considers the allocas alive. Would it be possible for us to issue calls to the llvm.lifetime.end intrinsic on the unwind path? Is that not allowed in LLVM for some reason? (It seems like it'd be valid for us to do so, no?)

dotdash (Contributor) commented Jun 13, 2017

Patch has landed upstream! Thanks for keeping it on track @arielb1 🥇

arielb1 (Contributor, Author) commented Jun 13, 2017

Yay!

brson (Contributor) commented Jun 15, 2017

Does the pending LLVM upgrade fix this?
