
save LTO import info and check it when trying to reuse build products #67020

Merged

Conversation

@pnkfelix
Member

pnkfelix commented Dec 4, 2019

Fix #59535

LTO optimization runs from a previous incremental build may have imported larger portions of the dependence graph into a codegen unit than the current compilation run chooses to import. We need to take that into account when we choose to reuse PostLTO-optimization object files from previous compiler invocations.

This PR accomplishes that by serializing the LTO import information on each incremental build. We load up the previous LTO import data as well as the current LTO import data. Then as we decide whether to reuse previous PostLTO objects or redo LTO optimization, we check whether the LTO import data matches. After we finish with this decision process for every object, we write the LTO import data back to disk.
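As a rough sketch (the names and types here are illustrative only, not rustc's actual serialization or codegen-unit machinery), the per-object check described above boils down to an equality comparison between the previous and current import sets for each codegen unit:

```rust
// Hypothetical sketch of the reuse check: an "import map" is modeled
// here as codegen-unit name -> set of imported module names.
use std::collections::{BTreeMap, BTreeSet};

type ImportMap = BTreeMap<String, BTreeSet<String>>;

/// A codegen unit's PostLTO object may be reused only if its import set
/// is identical in the previous and current import maps.
fn may_reuse_post_lto(cgu: &str, prev: &ImportMap, curr: &ImportMap) -> bool {
    let empty = BTreeSet::new();
    prev.get(cgu).unwrap_or(&empty) == curr.get(cgu).unwrap_or(&empty)
}

fn main() {
    let mut prev = ImportMap::new();
    // Previously, both B and D were imported into A.
    prev.insert("A".to_string(), ["B", "D"].iter().map(|s| s.to_string()).collect());
    let mut curr = ImportMap::new();
    // Now only B is imported into A.
    curr.insert("A".to_string(), ["B"].iter().map(|s| s.to_string()).collect());
    // A's import set changed, so its PostLTO object must not be reused.
    assert!(!may_reuse_post_lto("A", &prev, &curr));
    println!("reuse denied for A");
}
```

After every unit has been through this decision, the current map is written back to disk for the next session.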


What is the scenario where comparing against past LTO import information is necessary?

I've tried to capture it in the comments in the regression test, but here's yet another attempt from me to summarize the situation:

  1. Consider a call-graph like `[A] -> [B -> D] <- [C]` (where the letters are functions and the modules are enclosed in `[]`).
  2. In our specific instance, the earlier compilations were inlining the call to `B` into `A`; thus `A` ended up with an external reference to the symbol `D` in its object code, to be resolved at subsequent link time. The LTO import information provided by LLVM for those runs reflected that: it explicitly said that, during those runs, the definition of `B` and the declaration of `D` were imported into `[A]`.
  3. The change between incremental builds was that the call `D <- C` was removed.
  4. That change, coupled with other decisions within `rustc`, led the compiler to make `D` an internal symbol (since it was no longer accessed from other codegen units, this makes sense locally). The definition of `D` was then inlined into `B`, and `D` itself was eliminated entirely.
  5. The current LTO import information reported that `B` alone is imported into `[A]` for the *current compilation*. So when the Rust compiler surveyed the dependence graph, it determined that nothing `[A]` imports had changed since the last build (and `[A]` itself had not changed either), so it chose to reuse the object code generated during the previous compilation.
  6. But that previous object code has an unresolved reference to `D`, and that causes a link-time failure!
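The steps above can be rendered, very loosely, as Rust source (module and function names here are purely illustrative; as noted below, the actual reproduction depends on LLVM's inlining heuristics and is far harder to trigger):

```rust
// A minimal rendering of the call graph [A] -> [B -> D] <- [C].
mod m_bd {
    pub fn b() -> u32 { d() + 1 }        // B calls D (same module)
    pub fn d() -> u32 { 41 }             // D: later internalized and inlined away
}

mod m_c {
    // Step 3's change: deleting this call is what allows rustc to make
    // `d` internal in the next incremental build.
    pub fn c() -> u32 { crate::m_bd::d() }
}

fn a() -> u32 { m_bd::b() }              // A calls B; inlining B into A can
                                         // leave A's object referencing D.

fn main() {
    assert_eq!(a(), 42);
    assert_eq!(m_c::c(), 41);
}
```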

The interesting thing is that it's quite hard to actually observe the above scenario arising, which is probably why no one noticed this bug in the year or so since incremental LTO support landed (PR #53673).

I've literally spent days trying to observe the bug on my local machine, but haven't managed to find the magic combination of factors to get LLVM and rustc to make just the right set of inlining and internal-reclassification choices that cause this particular problem to arise.


Also, I have tried to be careful about injecting new bugs with this PR. Specifically, I was/am worried that we could get into a scenario where overwriting the current LTO import data with past LTO import data would cause us to "forget" a current import. ~~To guard against this, the PR as currently written always asserts, at overwrite time, that the past LTO import-set is a *superset* of the current LTO import-set. This way, the overwriting process should always be safe to run.~~

  • The previous note was written based on the first version of this PR. It has since been revised to use a simpler strategy, where we never attempt to merge the past LTO import information into the current one. We just compare them, and act accordingly.
  • Also, as you can see from the comments on the PR itself, I was quite right to be worried about forgetting past imports; that scenario was observable via a trivial transformation of the regression test I had devised.
@rust-highfive

Collaborator

rust-highfive commented Dec 4, 2019

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@pnkfelix pnkfelix added the T-compiler label Dec 4, 2019
@pnkfelix

Member Author

pnkfelix commented Dec 5, 2019

I was/am worried that we could get into a scenario where overwriting the current LTO import data with past LTO import data would cause us to "forget" a current import. To guard against this, the PR as currently written always asserts, at overwrite time, that the past LTO import-set is a superset of the current LTO import-set. This way, the overwriting process should always be safe to run.

I've been thinking more about this assertion this morning. I think I should just remove it.

Here is my thinking: assuming the LTO import set computation is always based on the current state of the code, and we observe when going from revision₁ to revision₂ that the import set for (green) module A has gone from set₁ to set₂ where set₂ ⊊ set₁, then logically one should be able to reverse the direction of the change (going from revision₂ to revision₁), and LLVM may (or even should) compute a corresponding change in the import set for that same (still green) module A where the import set now goes from set₂ to set₁ (where we still have set₁ ⊋ set₂). That would cause the assertion to fail.

Instead, I should just trust the underlying logic: the serialized imports for codegen unit A correspond to whatever information the LTO used for that compilation. If we reuse previously compiled A, then we keep the serialized LTO imports. If we recompile (or re-run LTO optimization) for A, then we use its newly computed LTO imports.
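This rule can be sketched as follows (a hypothetical fragment with illustrative types, not rustc's actual code): whichever object we end up with on disk, the imports we serialize for it are the ones that describe that object.

```rust
// Hypothetical bookkeeping for which import set to record per codegen unit.
#[derive(Clone, Debug, PartialEq)]
struct Imports(Vec<String>);

enum Decision { ReusePostLto, RedoLto }

/// The serialized imports must always describe the object actually on disk.
fn imports_to_record(decision: &Decision, prev: &Imports, curr: &Imports) -> Imports {
    match decision {
        // Reused the previous object: keep the serialized imports describing it.
        Decision::ReusePostLto => prev.clone(),
        // Re-ran LTO optimization: record the freshly computed imports.
        Decision::RedoLto => curr.clone(),
    }
}

fn main() {
    let prev = Imports(vec!["B".to_string(), "D".to_string()]);
    let curr = Imports(vec!["B".to_string()]);
    assert_eq!(imports_to_record(&Decision::ReusePostLto, &prev, &curr), prev);
    assert_eq!(imports_to_record(&Decision::RedoLto, &prev, &curr), curr);
    println!("ok");
}
```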

@pnkfelix

Member Author

pnkfelix commented Dec 5, 2019

logically one should be able to reverse the direction of the change (going from revision₂ to revision₁)

eek! Yes: One can trivially construct a test showing this with slight modification of the test that I already have on this PR.

Fixing that up now.

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

I'm wondering, do we even need to "overwrite" anything? Or could we make the decision once, before ThinLTO, based on the previous import map? Then save the current map to disk as is, without any merging. It should be correct and deterministic as it is always generated from the entire set of modules, including the cached ones.

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

That is an excellent description of the error scenario btw ❤️ ❤️ ❤️

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

Here is a version that makes re-use decisions based solely on the previous import map:
https://github.com/rust-lang/rust/pull/52309/files#diff-7495f8d1204a6939d7aa2d406839e509R949-R1008
(EDIT: Clicking the link doesn't scroll down to the function in question, it seems. Look for fn determine_cgu_reuse)

It's from #52309, which was never merged.

@pnkfelix

Member Author

pnkfelix commented Dec 5, 2019

I'm wondering, do we even need to "overwrite" anything? Or could we make the decision once, before ThinLTO, based on the previous import map? Then save the current map to disk as is, without any merging. It should be correct and deterministic as it is always generated from the entire set of modules, including the cached ones.

My understanding of how the current import map is generated leads me to think that we cannot get by using the previous imports and storing the current ones. Otherwise, you'll still end up in the same scenario, just one incremental compile-cycle later.

I can try to spell this out in more detail, if need be.

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

I think the problem arises from making decisions based on the previous state of the object files in combination with the current import map. If the two things match up, i.e. we make decisions about the previous object files based on the previous import map (which always accurately describes them), then the problem would not arise.

I'm generally very uneasy about keeping around data from sessions N-x where x > 1. That could add a kind of indeterminism unless you re-create the entire version history of compiling a crate (which would make for a really nasty debugging experience).

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

If you want, I can try to sketch an alternative PR tomorrow morning, so we'd see if the bug gets fixed.

@pnkfelix

Member Author

pnkfelix commented Dec 5, 2019

I think the problem arises from making decisions based on the previous state of the object files in combination with the current import map. If the two things match up, i.e. we make decisions about the previous object files based on the previous import map (which always accurately describes them), then the problem would not arise.

I'm generally very uneasy about keeping around data from sessions N-x where x > 1. That could add a kind of indeterminism unless you re-create the entire version history of compiling a crate (which would make for a really nasty debugging experience).

But that is indeed the problem: the so-called "previous object files" could have originated from arbitrarily distant compilation sessions, not just the immediately preceding session. So I do not see how we can avoid accumulating information from an arbitrary number of sessions.

@pnkfelix pnkfelix force-pushed the pnkfelix:issue-59535-accumulate-past-lto-imports branch from 001cf76 to f29e4bd Dec 5, 2019
@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

But that is indeed the problem: the so-called "previous object files" can come from arbitrarily distant compilation sessions.

Yes, but the import map is always re-computed from scratch from the entire set of modules, including the older ones. That will make it up-to-date.

@michaelwoerister

Contributor

michaelwoerister commented Dec 5, 2019

In other words: The import map should always be the same, regardless of what your cache looks like and even if there is no cache. That makes it a function of the current state of the source code. (<= that describes more accurately what I mean).

@pnkfelix

Member Author

pnkfelix commented Dec 5, 2019

Okay. I think I see what you are saying. It seems like the import map can change in non-local ways though, right? I mean, the `[A] -> [B -> D] <- [C]` example seems to illustrate that a change to `C` can cause the import map for `[A]` to change.

This non-locality property doesn't contradict what you said. I just want to have a clear mental model.

Update: the non-locality property is what makes me worry that we would need to accumulate imports from arbitrarily far back. However, on Zulip, @michaelwoerister and I have sketched out a variant approach to fixing this that we're both happy with.

@michaelwoerister

Contributor

michaelwoerister commented Dec 6, 2019

Hurray! I was able to construct a pretty small test case for this:

// ./src/test/incremental/thinlto/import_removed.rs

// revisions: cfail1 cfail2
// compile-flags: -O -Zhuman-readable-cgu-names -Cllvm-args=-import-instr-limit=10
// build-pass

// TODO: Add proper description.

fn main() {
    foo::foo();
    bar::baz();
}

mod foo {

    // In cfail1, foo() gets inlined into main.
    // In cfail2, ThinLTO decides that foo() does not get inlined into main, and
    // instead bar() gets inlined into foo(). But faulty logic in our incr.
    // ThinLTO implementation thought that `main()` is unchanged and thus reused
// the object file still containing a call to the now non-existent bar().
    pub fn foo(){
        bar()
    }

    // This function needs to be big so that it does not get inlined by ThinLTO
    // but *does* get inlined into foo() once it is declared `internal` in
    // cfail2.
    pub fn bar(){
        println!("quux1");
        println!("quux2");
        println!("quux3");
        println!("quux4");
        println!("quux5");
        println!("quux6");
        println!("quux7");
        println!("quux8");
        println!("quux9");
    }
}

mod bar {

    #[inline(never)]
    pub fn baz() {
        #[cfg(cfail1)]
        {
            crate::foo::bar();
        }
    }
}

This test case fails with the expected error on a recent nightly. I have not tested whether the fix in this PR makes it pass, but it should.

@pnkfelix

Member Author

pnkfelix commented Dec 6, 2019

(okay this is ready for re-review; I incorporated all feedback and adapted the test case, thanks @michaelwoerister !)

@JohnCSimon

Member

JohnCSimon commented Dec 14, 2019

Ping from triage:
@michaelwoerister - can you please review this pr?

// are doing the ThinLTO in this current compilation cycle.)
//
// See rust-lang/rust#59535.
if let (Some(prev_import_map), true) =

@michaelwoerister

michaelwoerister Dec 16, 2019

Contributor

Interesting pattern :)

@michaelwoerister

Contributor

michaelwoerister commented Dec 16, 2019

Thanks for updating the PR, @pnkfelix!
I see you left the big test case in. I can imagine that it is a bit brittle, with linker scripts and such. I'm going to r+ but if the test case breaks on any platform, I think we should just remove it.

@bors r+

@bors

Contributor

bors commented Dec 16, 2019

📌 Commit aecb511 has been approved by michaelwoerister

@bors

Contributor

bors commented Dec 16, 2019

🌲 The tree is currently closed for pull requests below priority 100, this pull request will be tested once the tree is reopened

Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this pull request Dec 19, 2019
…t-lto-imports, r=michaelwoerister

save LTO import info and check it when trying to reuse build products

bors added a commit that referenced this pull request Dec 19, 2019
Rollup of 3 pull requests

Successful merges:

 - #66716 (Implement `DebugStruct::non_exhaustive`.)
 - #67020 (save LTO import info and check it when trying to reuse build products)
 - #67127 (Use structured suggestion for disambiguating method calls)

Failed merges:

r? @ghost
@Centril

Member

Centril commented Dec 19, 2019

Failed in #67412 (comment), @bors r- rollup=never p=1

@michaelwoerister

Contributor

michaelwoerister commented Dec 19, 2019

@pnkfelix, how do you feel about removing the run-make test case?

@pnkfelix

Member Author

pnkfelix commented Dec 20, 2019

yeah okay, let's kill the run-make test case, sure

pnkfelix added 2 commits Nov 29, 2019
…roducts.

adopts simple strategy devised with assistance from mw: Instead of accumulating
(and acting upon) LTO import information over an unbounded number of prior
compilations, just see if the current import set matches the previous import set.
If they don't match, then you cannot reuse the PostLTO build product for that
module.

In either case (of a match or a non-match), we can (and must) unconditionally
emit the current import set as the recorded information in the incremental
compilation cache, ready to be loaded during the next compiler run for use in
the same check described above.

resolves issue 59535.
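The strategy in this commit message can be sketched as a single pass over the current import map (illustrative names and types, not rustc's actual code): deny PostLTO reuse for any module whose import set changed, and serialize the current map verbatim, with no merging.

```rust
// Hypothetical sketch of the adopted strategy: compare per-module import
// sets, deny reuse on mismatch, and write the current map back as-is.
use std::collections::{BTreeMap, BTreeSet};

type ImportMap = BTreeMap<String, BTreeSet<String>>;

/// Returns (modules that must redo LTO, map to serialize for next session).
fn plan_reuse(prev: &ImportMap, curr: &ImportMap) -> (Vec<String>, ImportMap) {
    let empty = BTreeSet::new();
    let mut redo = Vec::new();
    for (cgu, curr_imports) in curr {
        // Imports changed since the previous session: cannot reuse.
        if prev.get(cgu).unwrap_or(&empty) != curr_imports {
            redo.push(cgu.clone());
        }
    }
    // The current map is recorded unconditionally, with no merging.
    (redo, curr.clone())
}

fn main() {
    let mut prev = ImportMap::new();
    prev.insert("A".to_string(), ["B", "D"].iter().map(|s| s.to_string()).collect());
    let mut curr = ImportMap::new();
    curr.insert("A".to_string(), ["B"].iter().map(|s| s.to_string()).collect());
    let (redo, to_write) = plan_reuse(&prev, &curr);
    assert_eq!(redo, vec!["A".to_string()]);
    assert_eq!(to_write, curr);
    println!("must redo LTO for: {:?}", redo);
}
```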
@pnkfelix pnkfelix force-pushed the pnkfelix:issue-59535-accumulate-past-lto-imports branch from aecb511 to 42b00a4 Dec 20, 2019
@pnkfelix

Member Author

pnkfelix commented Dec 20, 2019

(I went ahead and squashed since I was rebasing anyway to remove the run-make based test.)

@pnkfelix

Member Author

pnkfelix commented Dec 20, 2019

@bors r=mw

@bors

Contributor

bors commented Dec 20, 2019

📌 Commit 42b00a4 has been approved by mw

@bors

Contributor

bors commented Dec 20, 2019

🌲 The tree is currently closed for pull requests below priority 100, this pull request will be tested once the tree is reopened

@Centril

Member

Centril commented Dec 20, 2019

@bors p=200

@bors

Contributor

bors commented Dec 20, 2019

⌛️ Testing commit 42b00a4 with merge ccd2383...

bors added a commit that referenced this pull request Dec 20, 2019
…ts, r=mw

save LTO import info and check it when trying to reuse build products

@bors

Contributor

bors commented Dec 21, 2019

☀️ Test successful - checks-azure
Approved by: mw
Pushing ccd2383 to master...

@bors bors added the merged-by-bors label Dec 21, 2019
@bors bors merged commit 42b00a4 into rust-lang:master Dec 21, 2019
5 checks passed:
  homu: Test successful
  pr: Build #20191220.4 succeeded
  pr (Linux mingw-check): succeeded
  pr (Linux x86_64-gnu-llvm-7): succeeded
  pr (Linux x86_64-gnu-tools): succeeded