
Generators are too big #52924

Closed
cramertj opened this issue Jul 31, 2018 · 29 comments
Labels: A-async-await, A-coroutines, AsyncAwait-Triaged, C-enhancement, F-coroutines, I-heavy, P-medium, T-compiler

Comments

@cramertj
Member

cramertj commented Jul 31, 2018

Currently, generators won't ever reuse storage slots, causing them to take up more space than is necessary:

#![feature(generators)]

fn main() {
    let a = || { 
        {
            let x: i32 = 5;
            yield;
            println!("{:?}", x);
        }
        {
            let x: i32 = 5;
            yield;
            println!("{:?}", x);
        }
    };
    println!("{}", std::mem::size_of_val(&a)); // prints "12", could print "8" (4 for state, 4 for `x`)
}

Finding optimal solutions seems like a difficult packing problem, but it should be easy to do quite a bit better.
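
For illustration, a hand-written type with the smaller layout might look like the sketch below (just the idea, not what the compiler actually emits): the two `x` slots share one payload, and a discriminant tracks which state the generator is in.

#[allow(dead_code)]
enum TwoYieldStates {
    Unresumed,
    Suspend0 { x: i32 }, // suspended at the first yield; `x` is live
    Suspend1 { x: i32 }, // suspended at the second yield; `x` is live
    Returned,
}

fn main() {
    // One shared payload slot plus the discriminant: prints 8 on
    // typical targets, matching the "4 for state, 4 for `x`" goal.
    println!("{}", std::mem::size_of::<TwoYieldStates>());
}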

Also see the example case in #59087.

@RalfJung
Member

RalfJung commented Nov 20, 2018

I am wondering if thinking of this as "slots of a struct" is the right model anyway. The content of the generator type (aside from the state flag) is basically the call frame for this generator, so shouldn't this be treated much more like the stack? There probably shouldn't even be types, just a large enough opaque "blob of bytes" that can be used for generator execution. If you think of this in terms of fields, you are not going to be able to reuse the space of two `i16` for an `i32` later, but there is no reason you shouldn't do that -- just like in `{ let a = 0i16; let b = 1i16; /* ... */ } let c = 2i32;`, `c` can reuse the space that was previously occupied by `a` and `b`.
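
In generator terms, that byte-level reuse would look something like the following sketch (hypothetical; the measured size depends on the compiler version):

#![feature(generators)]

fn main() {
    let g = || {
        let a = 0i16;
        let b = 1i16;
        yield;              // `a` and `b` (4 bytes total) live across this yield
        let _ = (a, b);     // last use: `a` and `b` are dead from here on
        let c = 2i32;
        yield;              // `c` (4 bytes) live across this yield
        let _ = c;
    };
    // With byte-level reuse, `c` could occupy the same 4 bytes that
    // `a` and `b` used, so the frame would need 4 payload bytes plus
    // the discriminant rather than 8 plus the discriminant.
    println!("{}", std::mem::size_of_val(&g));
}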

@cramertj
Member Author

Yes, it's just a stack of bytes. This optimization was referring to the possibility to overlap some of the bytes to hold data that is currently stored in separate "slots of a struct".

@withoutboats
Contributor

My intuition about generator layout is that it would be:

  1. A discriminant large enough to hold a value for each yield point.
  2. The maximum number of bytes needed to hold the live stack variables at any one yield point.

I expect that this is complicated by alignment and wanting to avoid moving values around inside of the generator as it iterates through states, but clearly we should be doing better than we are if the size is currently the sum of all yield point sizes instead of the max. If benchmarks are bad enough this may be a pre-stabilization priority. :-\

@cramertj
Member Author

It's the sum of the sizes of all variables that are live across a yield point (one slot for every variable, not one slot for every yield point). We can do better by tracking which variables are live across non-intersecting sets of yield points and using the same space to store both.
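
Concretely, the criterion looks like this sketch: `x` and `y` are live across non-intersecting sets of yield points, so one slot could serve both, while `z` is live across both yields and needs its own slot.

#![feature(generators)]

fn main() {
    let g = || {
        let z = 0u64; // live across both yields: needs its own slot
        let x = 0u32;
        yield;        // `x` and `z` live here
        drop(x);
        let y = 0u32;
        yield;        // `y` and `z` live here
        drop(y);
        drop(z);
    };
    // `x` and `y` could share one 4-byte slot; `z` keeps its 8 bytes.
    dbg!(std::mem::size_of_val(&g));
}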

@nikomatsakis added the A-async-await label Feb 22, 2019
@nikomatsakis added the P-medium, AsyncAwait-Triaged, I-nominated, T-compiler, and AsyncAwait-Polish labels and removed the I-nominated and AsyncAwait-Triaged labels Mar 5, 2019
@nikomatsakis
Contributor

Marking as blocking for async/await stabilization. This may be too strong, but this issue certainly prevents us from having the performant sort of futures that we want.

@cramertj had advocated for a simple strategy, vaguely described here, and covered in some detail in the video

@nikomatsakis
Contributor

nikomatsakis commented Mar 5, 2019

Created a topic on Zulip which contains some notes on this.

@cramertj
Member Author

cramertj commented Mar 5, 2019

I think @tmandry is intending to work on this, so tentatively assigning; feel free to let me know if I should unassign.

@cramertj changed the title from "Reuse generator slots after StorageDead" to "Generators are too big" Mar 11, 2019
@tmandry
Member

tmandry commented Mar 11, 2019

#59087 was merged into this, but I'm not sure it would be fixed by the optimization we were discussing here. Taking this (simplified) segment:

    let composed_2 = async {
        let inner = i_am_1kb();
        println!("");
        await!(inner);
    };

Here, removing the println!(""); statement does not change which variables are live across a yield point, but it does cut down the size of composed_2 by half. So it seems something else is going on here. Am I missing something?

@petrochenkov
Contributor

I haven't followed the Rust async story closely, but I encountered this paper in the most recent C++ committee mailing.
"Coroutines: Language and Implementation Impact" (compares several implementation strategies for stackless coroutines aka #![feature(generators)]).

Apparently the size issue was considered a deal-breaker for adopting the same strategy that's implemented in Rust, and the type-erased version was chosen instead.

Where are Rust coroutines exactly in the "trade-offs" tables from the paper?

@cramertj
Member Author

@petrochenkov We currently have some optimizations that are performed before the generator type is created, and these are allowed to affect the size_of of the generated type, which can only be calculated after generator layout has been solved.

As for the requirements around being called recursively, being placed on an ABI boundary, or being virtual: these can all be met by choosing to type-erase via the .boxed() combinator in the futures crate and returning BoxFuture<'lt, OutputTy>. This lines up with the "library based type-erasure" rather than "compiler based type-erasure" set of properties in the doc (flexible wrapping, fewer customization points, harder to optimize out heap allocation, harder to devirtualize indirect calls, harder to inline).

The header-file bits don't apply to Rust for obvious reasons (though it is interesting to consider that these optimizations are a dependency of crate metadata due to their effects on the generator size).
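
For concreteness, a minimal sketch of that library-based erasure using the futures 0.3 API (the function and values here are just illustrative):

use futures::future::{BoxFuture, FutureExt};

// Every async block has its own anonymous type; `.boxed()` erases it
// behind one nameable type, at the cost of a heap allocation and an
// indirect call when the future is polled.
fn erased(n: u32) -> BoxFuture<'static, u32> {
    async move { n + 1 }.boxed()
}

fn main() {
    let out = futures::executor::block_on(erased(1));
    assert_eq!(out, 2);
}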

@Matthias247
Contributor

Matthias247 commented May 19, 2019

I played around a bit with @tmandry's optimization from #60187 in my project and wanted to share the results here instead of spamming the PR:

Here are my results (for the top-level Future in my project):

Old nightly:

[project/src/main.rs:107] std::mem::size_of_val(&x) = 74216
[project/src/main.rs:103] std::mem::size_of_val(&everything) = 37096
[project/src/main.rs:88] std::mem::size_of_val(&starter) = 96
[project/src/main.rs:97] std::mem::size_of_val(&conn_fut) = 11840
[project/src/main.rs:98] std::mem::size_of_val(&send_task_futs) = 16
[project/src/main.rs:99] std::mem::size_of_val(&lifecycle_fut) = 648

tmandry:generator-optimization:

[project/src/main.rs:107] std::mem::size_of_val(&x) = 41608
[project/src/main.rs:103] std::mem::size_of_val(&everything) = 20792
[project/src/main.rs:88] std::mem::size_of_val(&starter) = 96
[project/src/main.rs:97] std::mem::size_of_val(&conn_fut) = 6576
[project/src/main.rs:98] std::mem::size_of_val(&send_task_futs) = 16
[project/src/main.rs:99] std::mem::size_of_val(&lifecycle_fut) = 392

And there is the relevant code which put those things together:

let x = async {
    let everything = async {
        let (ctrl, starter) = connection_builder.build().unwrap();
        dbg!(std::mem::size_of_val(&starter));
        let conn_fut = async move {
            let conn_res = starter.start_connection().await;
            trace!("Res: {:?}", conn_res);
        };

        let send_task_futs = join_all((0..8).map(|_| send_msgs(&ctrl)));
        let lifecycle_fut = lifecycle_task(&ctrl);

        dbg!(std::mem::size_of_val(&conn_fut));
        dbg!(std::mem::size_of_val(&send_task_futs));
        dbg!(std::mem::size_of_val(&lifecycle_fut));

        join!(conn_fut, send_task_futs, lifecycle_fut);
    };
    dbg!(std::mem::size_of_val(&everything));
    everything.await
};

dbg!(std::mem::size_of_val(&x));

block_on(x);

So it definitely seems to help quite a bit. The inner futures (conn_fut, lifecycle_fut) shrank, which led everything else to shrink too. I haven't yet looked in detail into how those inner tasks changed their size.

However, it seems like the main issue is still the exponential growth caused by some operations. E.g., here size_of_val(&everything) should equal size_of_val(&conn_fut) + size_of_val(&send_task_futs) + size_of_val(&lifecycle_fut) = 6984, but it is 20792 bytes (3x as much).

Afterwards I was curious how much join! contributes to the problem, since I had reported it at rust-lang/futures-rs#1473. To check, I implemented a join future for this use case myself so that I could measure its size.

The code then looks like:

let x = async {
    let everything = async {
        let (ctrl, starter) = connection_builder.build().unwrap();
        dbg!(std::mem::size_of_val(&starter));
        let conn_fut = async move {
            let conn_res = starter.start_connection().await;
            trace!("Res: {:?}", conn_res);
        };

        let send_task_futs = join_all((0..8).map(|_| send_msgs(&ctrl)));
        let lifecycle_fut = lifecycle_task(&ctrl);

        dbg!(std::mem::size_of_val(&conn_fut));
        dbg!(std::mem::size_of_val(&send_task_futs));
        dbg!(std::mem::size_of_val(&lifecycle_fut));

        let joiner = Joiner {
                a: Some(conn_fut),
                b: Some(send_task_futs),
                c: Some(lifecycle_fut),
                a_res: None,
                b_res: None,
                c_res: None,
        };

        dbg!(std::mem::size_of_val(&joiner));
        joiner.await;
    };
    dbg!(std::mem::size_of_val(&everything));
    everything.await
};

dbg!(std::mem::size_of_val(&x));

block_on(x);

The sizes with that change are:

[project/src/main.rs:205] std::mem::size_of_val(&x) = 42472
[project/src/main.rs:201] std::mem::size_of_val(&everything) = 21224
[project/src/main.rs:174] std::mem::size_of_val(&starter) = 96
[project/src/main.rs:183] std::mem::size_of_val(&conn_fut) = 6576
[project/src/main.rs:184] std::mem::size_of_val(&send_task_futs) = 16
[project/src/main.rs:185] std::mem::size_of_val(&lifecycle_fut) = 392
[project/src/main.rs:196] std::mem::size_of_val(&joiner) = 7032

These are in the same ballpark as with join!, which indicates that there is no huge overhead in joining the subtasks (6576 + 16 + 392 = 6984, i.e. 48 bytes of overhead for my Joiner struct).

The interesting thing here: there is only a single .await point in this function, right at the end (joiner.await). Therefore the only thing that needs to be persisted across await points is joiner, and the resulting future's size should be exactly those 7032 bytes. Yet it gets blown up to triple that size.

Edit: The last sentence was not quite true: ctrl also needs to be persisted, since references to it are captured by the subtasks. However, it's only 56 bytes.

Maybe these are the issues described in #59087 and #59123.
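
(The Joiner type itself isn't shown above. Purely for reference, a hand-rolled three-way join could look like the sketch below; it assumes all three futures and their outputs are Unpin, which async blocks are not, so the actual code must have handled pinning differently.)

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct Joiner<A: Future, B: Future, C: Future> {
    a: Option<A>,
    b: Option<B>,
    c: Option<C>,
    a_res: Option<A::Output>,
    b_res: Option<B::Output>,
    c_res: Option<C::Output>,
}

impl<A, B, C> Future for Joiner<A, B, C>
where
    A: Future + Unpin, A::Output: Unpin,
    B: Future + Unpin, B::Output: Unpin,
    C: Future + Unpin, C::Output: Unpin,
{
    type Output = (A::Output, B::Output, C::Output);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // All fields are Unpin, so the whole Joiner is Unpin and we can
        // safely unwrap the Pin.
        let this = Pin::into_inner(self);
        // Poll each child that hasn't completed yet; stash its output
        // and drop the child future as soon as it finishes.
        if let Some(f) = this.a.as_mut() {
            if let Poll::Ready(v) = Pin::new(f).poll(cx) {
                this.a_res = Some(v);
                this.a = None;
            }
        }
        if let Some(f) = this.b.as_mut() {
            if let Poll::Ready(v) = Pin::new(f).poll(cx) {
                this.b_res = Some(v);
                this.b = None;
            }
        }
        if let Some(f) = this.c.as_mut() {
            if let Poll::Ready(v) = Pin::new(f).poll(cx) {
                this.c_res = Some(v);
                this.c = None;
            }
        }
        if this.a_res.is_some() && this.b_res.is_some() && this.c_res.is_some() {
            Poll::Ready((
                this.a_res.take().unwrap(),
                this.b_res.take().unwrap(),
                this.c_res.take().unwrap(),
            ))
        } else {
            Poll::Pending
        }
    }
}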

@tmandry
Member

tmandry commented May 22, 2019

@Matthias247 This is really helpful, thanks! I'm not sure what the root cause here is, but if you put that future in a crate I can check out and build somewhere, I can probably find out.

bors added a commit that referenced this issue May 22, 2019
…mertj

Preserve local scopes in generator MIR

Part of #52924, depended upon by the generator layout optimization #60187.

This PR adds `StorageDead` statements in more places in generators, so we can see when non-`Drop` locals have gone out of scope and recover their storage.

The reason this is only done for generators is compiler performance. See #60187 (comment) for what happens when we do this for all functions.

For `Drop` locals, we modify the `MaybeStorageLive` analysis to use `drop` to indicate that storage is no longer live for the local. Once `drop` returns or unwinds to our function, we implicitly assume that the local is `StorageDead`.

Instead of using `drop`, it is possible to emit more `StorageDead` statements in the MIR for `Drop` locals so we can handle all locals the same. I am fine with doing it that way, but this was the simplest approach for my purposes. It is also likely to be more performant.

r? @Zoxc (feel free to reassign)
cc @cramertj @eddyb @RalfJung @rust-lang/wg-async-await
@tmandry
Member

tmandry commented Jun 12, 2019

So I've taken a close look at @Matthias247's example, and believe that the solution here is the same as the one in #59123 (comment).

@cramertj
Member Author

cramertj commented Jun 14, 2019

Removing the "blocking" label here, as I believe the most urgent issue has been resolved. These types are still bigger than we'd like them to be and there are lots of opportunities for improvement, but I think we're now at a point where I'd feel comfortable stabilizing WRT size.

@cramertj added the AsyncAwait-Triaged label and removed the AsyncAwait-Polish label Jun 14, 2019
@PvdBerg1998

This issue caused stack overflows for me in some cases. I think it should be a blocker.

@tmandry
Member

tmandry commented Jun 14, 2019

@PvdBerg1998 A major optimization PR (#60187) landed the other day, have you tested with that?

That said, I think there's still a huge improvement left to do in #59123. I'm hoping to have a fix for that up soon.

@PvdBerg1998

@tmandry I didn't, that's great! I'll take a look some time in the future.

@bstrie
Contributor

bstrie commented Jun 16, 2019

Is generator size only a stability hazard because of the performance implications, or are there concerns about adding future generator size optimizations backward-compatibly? I was under the impression that we only guarantee size_of stability for repr(C) types.

@bstrie
Contributor

bstrie commented Jun 16, 2019

Once we have @tmandry's patch for #59123, it might be nice to put out a call on internals asking people who have reported overlarge generators to re-test their code, so we can get more real-world before/after confidence.

@Centril added the F-coroutines label Jul 28, 2019
@petrochenkov
Contributor

petrochenkov commented Oct 1, 2019

I recently had to deal with a special-purpose implementation of mini stackless coroutines, based on LLVM IR and used for running GPU shaders with barriers on a CPU.
Some approaches I noticed there that could be used to minimize state saving/restoring at suspension points:

  • Some variables that are live at a suspension point are cheaper to recalculate than to save and restore. In that case the code computing them can be duplicated and rerun after the suspension point.
  • Some passes like CFG simplification, dead code elimination, and mem2reg can be run before building the coroutine states, to reduce the amount of saving/restoring.

Not sure how applicable these ideas are to our MIR, but I'll leave them here anyway.
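
The first idea, in generator terms, might look like this sketch (hypothetical; the compiler does not currently do this):

#![feature(generators)]

fn main() {
    let g = || {
        let seed = 7u8;
        let big = [seed; 1024]; // cheap to recompute from `seed`
        yield;
        // Persisting `big` costs 1024 bytes in the generator frame;
        // rerunning `[seed; 1024]` after resuming would only require
        // persisting the single byte of `seed`.
        println!("{}", big[0]);
    };
    dbg!(std::mem::size_of_val(&g));
}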

@DustinByfuglien

The resulting size of the generator _gen is 4100 bytes.
Can it be optimized down to a much smaller size? I believe it can.

#![feature(generators, generator_trait)]

fn main() {
    let _gen = move || {
        let x = [1u8; 1024];
        yield;
        drop(x);
        let x = [1u8; 1024];
        yield;
        drop(x);
        let x = [1u8; 1024];
        yield;
        drop(x);
        let x = [1u8; 1024];
        yield;
        drop(x);
    };
    dbg!(std::mem::size_of_val(&_gen));
}

(Playground)

Output:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 0.69s
     Running `target/debug/playground`
[src/main.rs:18] std::mem::size_of_val(&_gen) = 4100

@cramertj
Member Author

@DustinByfuglien Yes, note that adding manual scopes around each variable keeps the size at 1028 no matter how many are added:

#![feature(generators, generator_trait)]

fn main() {
    let _gen = move || {
        {
            let x = [1u8; 1024];
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            drop(x);
        }
    };
    dbg!(std::mem::size_of_val(&_gen));
}

Output:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 0.66s
     Running `target/debug/playground`
[src/main.rs:26] std::mem::size_of_val(&_gen) = 1028

@DustinByfuglien

But adding one more yield to each scope gives 4100 again:

#![feature(generators, generator_trait)]

fn main() {
    let _gen = || {
        {
            let x = [1u8; 1024];
            yield;
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            yield;
            drop(x);
        }
        {
            let x = [1u8; 1024];
            yield;
            yield;
            drop(x);
        }

    };
    dbg!(std::mem::size_of_val(&_gen));
}

(Playground)

@DustinByfuglien

@tmandry Maybe you can take some time to look at this. These examples may be useful for improving the optimization algorithm.

@cramertj
Member Author

Yes, that's because we only consider temporaries to be eligible for overlap if they're live across exactly one yield point.

@bstrie
Contributor

bstrie commented Jan 28, 2020

The original impetus for this issue (getting generators to ever reuse storage slots) has been addressed to some degree; shall we close this? Alternatively, should we repurpose this issue into a tracking issue for all generator-size-related issues?

@nikomatsakis
Contributor

I feel like it would likely be better to create a new tracking issue, rather than inherit all the history from this one.

@jonas-schievink added the I-heavy label Feb 18, 2020
@jonas-schievink added the C-enhancement label Mar 8, 2020
@jonas-schievink
Contributor

Closing in favor of #69826
