Discard LLVM modules earlier when performing ThinLTO #56487

Merged
merged 6 commits into rust-lang:master from nikic:discard-modules-earlier on Dec 7, 2018

@nikic
Contributor

nikic commented Dec 4, 2018

Currently ThinLTO is performed by first compiling all modules (and keeping them in memory), and then serializing them into ThinLTO buffers in a separate, synchronized step. Modules are later read back from ThinLTO buffers when running the ThinLTO optimization pipeline.

We can also find the following comment in lto.rs:

    // FIXME: right now, like with fat LTO, we serialize all in-memory
    //        modules before working with them and ThinLTO. We really
    //        shouldn't do this, however, and instead figure out how to
    //        extract a summary from an in-memory module and then merge that
    //        into the global index. It turns out that this loop is by far
    //        the most expensive portion of this small bit of global
    //        analysis!

I don't think that what is suggested here is the right approach: One of the primary benefits of using ThinLTO over ordinary LTO is that it's not necessary to keep all the modules (merged or not) in memory for the duration of the linking step.

However, we currently don't really make use of this (at least for crate-local ThinLTO), because we keep all modules in memory until the start of the LTO step. This PR changes the implementation to instead perform the serialization into ThinLTO buffers directly after the initial optimization step.

Most of the changes here are plumbing to separate out fat and thin LTO handling in write.rs, as these now use different intermediate artifacts: for fat LTO these are in-memory modules, while for thin LTO they are ThinLTO buffers.
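
To make that split concrete, here is a minimal sketch of the two kinds of intermediate artifact, assuming hypothetical names (InMemoryModule, SerializedThinBuffer, and LtoInput are illustrative stand-ins, not the types actually used in write.rs):

    // Illustrative only: these names are hypothetical stand-ins, not rustc's API.

    /// An LLVM module kept alive in memory; the fat LTO path still hands this
    /// on to the LTO step.
    struct InMemoryModule {
        name: String,
    }

    /// A serialized ThinLTO buffer (bitcode plus summary); the thin LTO path
    /// hands this on instead, so the LLVM module itself can be freed right away.
    struct SerializedThinBuffer {
        data: Vec<u8>,
    }

    /// Per-codegen-unit intermediate artifact passed from the initial
    /// optimization step to the LTO step.
    enum LtoInput {
        Fat(InMemoryModule),
        Thin(SerializedThinBuffer),
    }

The point is that the Thin variant owns only plain bytes, so the backing LLVM module can be freed as soon as the buffer has been produced.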

r? @alexcrichton

@alexcrichton

Member

alexcrichton commented Dec 4, 2018

@bors: try

Whoa, this is a great idea! This seems like the correct strategy for incremental too, although I forget whether that takes different paths.

This also makes me think that we should start linking in fat LTO as soon as possible, instead of synchronizing and then linking, as that would ideally allow a bit of pipelining. In any case, that's a patch for another time!

I'll take a closer look later, but I'm curious about the perf impact here too; in theory it should both make builds (slightly) faster and decrease peak memory usage.

@bors

Contributor

bors commented Dec 4, 2018

⌛️ Trying commit 94131eb with merge 360659b...

bors added a commit that referenced this pull request Dec 4, 2018

Auto merge of #56487 - nikic:discard-modules-earlier, r=<try>
Discard LLVM modules earlier when performing ThinLTO

@bors

Contributor

bors commented Dec 4, 2018

☀️ Test successful - status-travis
State: approved= try=True

@alexcrichton

Member

alexcrichton commented Dec 4, 2018

@rust-timer

rust-timer commented Dec 4, 2018

Success: Queued 360659b with parent 0c999ed, comparison URL.

@rust-timer

rust-timer commented Dec 4, 2018

Finished benchmarking try commit 360659b

@nikic

Contributor

nikic commented Dec 4, 2018

The max-rss results basically look like noise to me. The wall-time results seem to be a minor win for opt builds (a few percent for clean/baseline-incremental).

I guess that means that LLVM memory usage is dominated by rustc memory usage, at least for the build types used here (opt without debuginfo), so it has no impact on max-rss. Unfortunately I was not able to get massif working with rustc; it always segfaults early on :(

@nikic

Contributor

nikic commented Dec 4, 2018

I've got massif to run (you need to call the jemalloc-free rustc binary directly, not rustc +stage2). Here's a comparison of two runs for a crate where I see a minor max-rss reduction of about 3% for a release/debuginfo build. Note that these vary a lot between runs, so exact numbers are not meaningful:

Before:

    MB
217.0^                         #
     |                       ::#                       :::
     |                  :@@@:::#                 :::::::::
     |               @@@:@ @:::#               ::::: ::::::
     |              @@@ :@ @:::#              :::::: :::::::  ::::
     |         :::::@@@ :@ @:::#              :::::: ::::::::::::: :@: : :
     |        ::::: @@@ :@ @:::#      ::     ::::::: :::::::::::::::@:::::
     |      ::::::: @@@ :@ @:::#::    :      ::::::: :::::::::::::::@::::::
     |      ::::::: @@@ :@ @:::#: :::::      ::::::: :::::::::::::::@::::::
     |    ::::::::: @@@ :@ @:::#: ::: :     :::::::: :::::::::::::::@::::::
     |    : ::::::: @@@ :@ @:::#: ::: :    @:::::::: :::::::::::::::@::::::
     |    : ::::::: @@@ :@ @:::#: ::: :    @:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: :   :@:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: :  ::@:::::::: :::::::::::::::@::::::@
     |   @: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@:
     |  @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |  @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |::@@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |: @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
     |: @@: ::::::: @@@ :@ @:::#: ::: : :::@:::::::: :::::::::::::::@::::::@::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   133.8

After:

    MB
217.5^                                                    ##
     |                                                @@::#                   
     |                   @                           :@ : # ::::              
     |           :::@@:::@:::                     ::::@ : # :::               
     |         :::: @@:: @:: :                ::::: ::@ : # ::: @  
     |       ::: :: @@:: @:: ::             ::::: : ::@ : # ::: @:::
     |       : : :: @@:: @:: ::  @         :: ::: : ::@ : # ::: @:: :@@       
     |    :::: : :: @@:: @:: ::::@         :: ::: : ::@ : # ::: @:: :@     
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ :::::
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ ::::   
     |    :: : : :: @@:: @:: ::: @        ::: ::: : ::@ : # ::: @:: :@ ::::   
     |    :: : : :: @@:: @:: ::: @      ::::: ::: : ::@ : # ::: @:: :@ ::::
     |    :: : : :: @@:: @:: ::: @      : ::: ::: : ::@ : # ::: @:: :@ :::: 
     |    :: : : :: @@:: @:: ::: @::    : ::: ::: : ::@ : # ::: @:: :@ :::: :
     |  :::: : : :: @@:: @:: ::: @: ::  : ::: ::: : ::@ : # ::: @:: :@ :::: : 
     |  : :: : : :: @@:: @:: ::: @: : ::: ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
     | @: :: : : :: @@:: @:: ::: @: : : : ::: ::: : ::@ : # ::: @:: :@ :::: ::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   133.8

The first hump is peak optimization; the second hump is peak LTO. So in this case peak memory usage moves from the optimization stage to the LTO stage, but ultimately it does not make much of a difference.

@alexcrichton

Member

alexcrichton commented Dec 4, 2018

Ah, bummer! In any case this looks good to me, and the graphs do confirm a shift in peaks, so this seems good to land.

r=me with a rebase!

nikic added some commits Dec 3, 2018

Refactor LTO type determination
Instead of only determining whether some form of LTO is necessary, determine whether thin, fat or no LTO is necessary. I've rewritten the conditions in a way that I think is more obvious, i.e. specified LTO type + additional preconditions.

Separate codepaths for fat and thin LTO in write.rs
These are going to have different intermediate artifacts, so create separate codepaths for them.

Remove unnecessary parts of run_fat_lto signature
Fat LTO merges into one module, so only return one module.
@nikic nikic force-pushed the nikic:discard-modules-earlier branch from 94131eb to 96cc381 Dec 4, 2018

@nikic

Contributor

nikic commented Dec 4, 2018

@bors r=alexcrichton

@bors

Contributor

bors commented Dec 4, 2018

📌 Commit 96cc381 has been approved by alexcrichton

@rust-highfive

This comment was marked as outdated.

Collaborator

rust-highfive commented Dec 4, 2018

The job x86_64-gnu-llvm-5.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

travis_time:end:06d8bcdd:start=1543936815307815953,finish=1543936816856564926,duration=1548748973
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#Pull-Requests-and-Security-Restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-5.0
---
[00:01:12] 
######################################################################## 100.0%
[00:01:12] extracting /checkout/obj/build/cache/2018-10-30/cargo-beta-x86_64-unknown-linux-gnu.tar.gz
[00:01:13]     Updating crates.io index
[00:01:19] error: the lock file /checkout/Cargo.lock needs to be updated but --locked was passed to prevent this
[00:01:19] Build completed unsuccessfully in 0:00:18
[00:01:19] make: *** [prepare] Error 1
[00:01:19] Makefile:81: recipe for target 'prepare' failed
[00:01:20] Command failed. Attempt 2/5:
[00:01:20] Command failed. Attempt 2/5:
[00:01:21] error: the lock file /checkout/Cargo.lock needs to be updated but --locked was passed to prevent this
[00:01:21] Build completed unsuccessfully in 0:00:00
[00:01:21] Makefile:81: recipe for target 'prepare' failed
[00:01:21] make: *** [prepare] Error 1
[00:01:23] Command failed. Attempt 3/5:
[00:01:23] Command failed. Attempt 3/5:
[00:01:23] error: the lock file /checkout/Cargo.lock needs to be updated but --locked was passed to prevent this
[00:01:23] Build completed unsuccessfully in 0:00:00
[00:01:23] Makefile:81: recipe for target 'prepare' failed
[00:01:23] make: *** [prepare] Error 1
[00:01:26] Command failed. Attempt 4/5:
[00:01:26] Command failed. Attempt 4/5:
[00:01:26] error: the lock file /checkout/Cargo.lock needs to be updated but --locked was passed to prevent this
[00:01:26] Build completed unsuccessfully in 0:00:00
[00:01:26] make: *** [prepare] Error 1
[00:01:26] Makefile:81: recipe for target 'prepare' failed
[00:01:30] Command failed. Attempt 5/5:
[00:01:30] Command failed. Attempt 5/5:
[00:01:30] error: the lock file /checkout/Cargo.lock needs to be updated but --locked was passed to prevent this
[00:01:30] Build completed unsuccessfully in 0:00:00
[00:01:30] Makefile:81: recipe for target 'prepare' failed
[00:01:30] make: *** [prepare] Error 1
[00:01:30] The command has failed after 5 attempts.
---
travis_time:end:0c63b4b0:start=1543936917912404299,finish=1543936917916915405,duration=4511106
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:03f99cfc
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:0c88f47e
travis_time:start:0c88f47e
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:022e3788
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

Serialize modules into ThinBuffer after initial optimization
Instead of keeping all modules in memory until thin LTO and only
serializing them then, serialize the module immediately after
it finishes optimizing.
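
A minimal sketch of the flow described here, assuming hypothetical stand-in types rather than the real write.rs/lto.rs items:

    // Hypothetical stand-ins; the real code lives in write.rs/lto.rs and
    // calls into LLVM through rustc's FFI wrappers.

    struct LlvmModule;          // stand-in for an in-memory LLVM module
    struct ThinBuffer(Vec<u8>); // stand-in for a serialized ThinLTO buffer

    fn run_optimization_passes(_module: &mut LlvmModule) {
        // the per-module optimization pipeline would run here
    }

    fn serialize_to_thin_buffer(_module: &LlvmModule) -> ThinBuffer {
        // in reality this would emit bitcode plus the ThinLTO summary
        ThinBuffer(Vec::new())
    }

    fn optimize_codegen_unit(mut module: LlvmModule) -> ThinBuffer {
        run_optimization_passes(&mut module);
        let buffer = serialize_to_thin_buffer(&module);
        // `module` is dropped when this function returns, so the LLVM module
        // is freed right after optimization instead of staying alive until
        // the start of the LTO step.
        buffer
    }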

@nikic nikic force-pushed the nikic:discard-modules-earlier branch from 96cc381 to 8128d0d Dec 4, 2018

@nikic

Contributor

nikic commented Dec 4, 2018

@bors r=alexcrichton

Rebase mistake with submodules...

@bors

Contributor

bors commented Dec 4, 2018

📌 Commit 8128d0d has been approved by alexcrichton

@bors

Contributor

bors commented Dec 7, 2018

⌛️ Testing commit 8128d0d with merge f504d3f...

bors added a commit that referenced this pull request Dec 7, 2018

Auto merge of #56487 - nikic:discard-modules-earlier, r=alexcrichton
Discard LLVM modules earlier when performing ThinLTO

@bors

Contributor

bors commented Dec 7, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing f504d3f to master...

@bors bors merged commit 8128d0d into rust-lang:master Dec 7, 2018

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
homu: Test successful