
RFC: Support Incremental Compilation #594

Closed

Conversation

michaelwoerister (Contributor) commented Jan 18, 2015

This RFC proposes an incremental compilation strategy for rustc that allows for translation, codegen, and parts of static analysis to be done in an incremental fashion, without precluding the option of later expanding incrementality to parsing, macro expansion, and resolution.

This RFC is purely about the architecture and implementation of the Rust compiler; it does not propose any changes to the language. I also don't expect it to be acted on any time before 1.0 is out of the door, but I wanted to get this out into the open, so that it can be discussed as part of the RustC Architecture Improvement Initiative (that's right, RAII) that I invented just now, and that will begin to discuss how the Rust compiler can get as good as possible once the language has become a more stable target.

Rendered

Gankra commented on text/0000-incremental-compilation.md in 3f55ba6 Jan 18, 2015

A subtle point here, not sure if it's important: there are things that affect the semantics of the caller but cannot be explicitly stated by the programmer in the function signature. In particular, if the body of the function stores any references it is given, this will affect borrow checking, even though the signature makes no reference to this.

e.g.

```rust
fn foo1(data: &Data) {
  vec1.push(data);
}

fn foo2(data: &Data) {
  vec2.push(data.clone());
}

fn bar() {
  let mut data = get_data();
  foo2(&data);
  data.update(); // ok, foo2 dropped the ref to the data
  foo1(&data);
  data.update(); // not ok, foo1 stored a ref to the data
}
```

michaelwoerister (Author) replied Jan 18, 2015

I don't think that this would affect borrow checking. In both cases `data` is usable after the call. In other words, everything relevant to borrow checking is part of the function signature, and the borrow checker explicitly does not look at the implementation of a function `f` when it checks a call to that function `f`. But I'm not an expert on the topic, please correct me if I'm wrong.

whataloadofwhat replied Jan 18, 2015

Isn't that just down to lifetime elision? Like those functions would have to have lifetimes attached, and they'd have to be something like:

```rust
fn foo1(data: &'v1 Data)    // where the type of `vec1` is `Vec<&'v1 Data>`
fn foo2<'a>(data: &'a Data)
```
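To make this point concrete, here is a small compilable sketch (the names `Store` and `foo1` are invented for illustration, replacing the free-standing `vec1` of the original example): for a function to store a reference it receives, the borrow has to be tied to a lifetime in the signature, which is exactly the information the caller's borrow checker consults.

```rust
// Illustrative sketch: the container holding the references is made
// explicit, so the signature of `foo1` must name the lifetime `'v`.
struct Store<'v> {
    vec1: Vec<&'v String>,
}

impl<'v> Store<'v> {
    // `data: &'v String` tells the borrow checker that the reference
    // outlives the call; the function body never needs to be inspected.
    fn foo1(&mut self, data: &'v String) {
        self.vec1.push(data);
    }
}

fn main() {
    let data = String::from("hello");
    let mut store = Store { vec1: Vec::new() };
    store.foo1(&data);
    println!("stored {} reference(s)", store.vec1.len());
}
```

With this signature, mutating `data` while `store` is still live is rejected by the caller's borrow check, using only the signature of `foo1`.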

comex replied Jan 19, 2015

If it's true that type checking a call to a function can depend on the function's body, it sounds like an urgent bug to address before 1.0... (I don't think it is, though?)

pnkfelix replied Jan 19, 2015

@gankro isn't there an obvious counter-argument to your claim? Namely, all information for borrow checking does have to be encoded in the function's signature, because you need to be able to borrow-check code like this:

```rust
fn bar(foo1: fn(data: &Data), foo2: fn(data: &Data)) { ... }
```

(Your example is only a sketch; if you continue to make this claim about the language, I think it behooves you to make an example that one can actually feed to the compiler; i.e. you need the lifetimes etc. that one needs to be able to do things like `vec1.push(data)` where `data: &T` for some `T`.)

Gankra replied Jan 19, 2015

@pnkfelix @whataloadofwhat Yes you're right, there's no way to give these functions equal non-elided lifetimes and have the behaviour I suggested. I had a bad mental model. 🐫

It would be possible to make dependency tracking aware of the kind of reference one item makes to another. If an item `A` mentions another item `B` only via some reference type (e.g. `&T`), then item `A` only needs to be updated if `B` is removed or `B` changes its 'sized-ness'. This is comparable to how forward declarations in C are handled. In the dependency graph this would mean that there are different kinds of edges that trigger for different kinds of changes to items.
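A hedged sketch of such typed dependency edges (all names invented for illustration): each edge kind fires only for the kinds of change it is sensitive to, so a mention of `B` behind a reference survives changes to `B`'s fields, much like a C forward declaration.

```rust
// Toy model of edge kinds in the dependency graph.
#[derive(Clone, Copy)]
enum Change { Removed, SizeChanged, FieldsChanged }

#[derive(Clone, Copy)]
enum EdgeKind {
    FullDependency,  // `A` contains `B` by value: any change to `B` matters
    ByReferenceOnly, // `A` mentions `B` only as `&B`: like a C forward decl
}

// Decide whether a change to the target item invalidates the source item.
fn edge_triggers(kind: EdgeKind, change: Change) -> bool {
    match kind {
        EdgeKind::FullDependency => true,
        EdgeKind::ByReferenceOnly =>
            matches!(change, Change::Removed | Change::SizeChanged),
    }
}

fn main() {
    // Changing `B`'s fields does not invalidate an `&B`-only mention of it.
    println!("{}", edge_triggers(EdgeKind::ByReferenceOnly, Change::FieldsChanged));
}
```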

### Global Switches Influencing Codegen
There are many compiler flags that change what the generated code looks like, e.g. optimization and debuginfo levels. A simple strategy for dealing with this would be to store the set of compiler flags used for building the cache and to clear the cache completely if a different set of flags is used. Another option is to keep multiple caches, one per set of compiler flags (e.g. keeping both a 'debug build cache' and a 'release build cache' on disk).

ghost commented Jan 18, 2015

Hash the relevant flags for the subdir name? I'd expect a lot of -C options affect the cache, and only storing one set wouldn't help at all for some usage patterns.
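A minimal sketch of that suggestion (the helper name `cache_subdir` is invented): hash the sorted set of codegen-relevant flags into a cache subdirectory name, so each flag combination gets its own cache.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a set of compiler flags to a cache subdirectory name.
fn cache_subdir(flags: &[&str]) -> String {
    let mut sorted: Vec<&str> = flags.to_vec();
    sorted.sort(); // flag order should not change the cache location
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    format!("incr-cache-{:016x}", hasher.finish())
}

fn main() {
    // '-C opt-level=3 -g' and '-g -C opt-level=3' map to the same subdir.
    println!("{}", cache_subdir(&["-C opt-level=3", "-g"]));
}
```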

michaelwoerister (Author) replied Jan 19, 2015

Yeah, something like that. I'd like to see how big such a cache gets.

nrc (Member) commented Jan 18, 2015

cc @epdtry who implemented most of a very similar scheme last summer.

We talked about this as incremental codegen (as opposed to proper incremental compilation). He only kept around object files, not LLVM IR as well.

It would be great if @epdtry could link to his WIP branch and explain the concepts, etc. here.

It should not be too hard to let the compiler keep track of which parts of the program change infrequently and then let it speculatively build object files with more than one function in them. For these aggregate object files inter-function LLVM optimizations could then be enabled, yielding faster object code at little additional cost. Other strategies for controlling cache granularity can be implemented in a similar fashion.

### Parallelization
If some care is taken in implementing the above concepts it should be rather easy to do translation and codegen in parallel for all items, since by design we already have (or can deterministically compute) all the information we need.

nrc (Member) commented Jan 18, 2015

We already can do codegen in parallel, although there is a bug preventing most use atm.

# Unresolved questions

## Dependency Graph Construction before Type Inference
I'm not sure whether it would be possible to construct valid dependency graphs *before* type inference, or whether that would miss some dependency edges. Or, more generally, how much per-item work can be deferred until after the cache has been consulted.

nrc (Member) commented Jan 18, 2015

@epdtry found that he needed type information for constructing the dependency graph, although I don't recall why, exactly.

spernsteiner commented Jan 26, 2015

As I recall, type information wasn't strictly required, but it let the analysis obtain more precise dependency information for calls to trait methods. If a function contains `x + y`, knowing the types of `x` and `y` lets you find the precise implementation of `add` that's being called. Without type information, you must conservatively assume that it could be a call to any `add` implementation in scope.

I think this design would have less trouble operating without type information because (correct me if I'm wrong) the + would constitute a reference to the generic Add::add interface, not a reference to any specific implementation body. My design did not distinguish bodies from interfaces because inlining can happen anywhere, causing the body of one function to depend on another.
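The precision trade-off can be sketched as follows (a toy model with string type names, not the actual analysis): with type information the dependency set for `x + y` collapses to one impl, without it every `Add` impl in scope must be assumed.

```rust
// `impls` lists the self types of every `Add` impl in scope (illustrative).
fn add_impl_deps(known_type: Option<&str>, impls: &[&str]) -> Vec<String> {
    match known_type {
        // Type info available: `x + y` depends on exactly one impl body.
        Some(t) => impls
            .iter()
            .filter(|i| **i == t)
            .map(|i| format!("impl Add for {}", i))
            .collect(),
        // No type info: conservatively depend on every impl in scope.
        None => impls
            .iter()
            .map(|i| format!("impl Add for {}", i))
            .collect(),
    }
}

fn main() {
    println!("{:?}", add_impl_deps(Some("i32"), &["i32", "f64"]));
}
```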

michaelwoerister (Author) replied Jan 29, 2015

Thanks for the comment, @epdtry!

It occurred to me that the node template graph, together with the set of visible traits, forms the dependency graph for type inference. So, while it may not be possible to accurately compute the dependencies of a function body without doing type inference, it should be possible to cache type inference results just like other compilation artefacts.


# Alternatives

I'd definitely like to hear about them.

nrc (Member) commented Jan 18, 2015

An alternative I have been thinking about as a long term solution is full incremental compilation (as opposed to incremental codegen) where we could compile a single span, from parsing all the way to codegen. This is more useful for IDEs and similar tools, but would also give us more scope to incrementalise normal compilation. I envisaged generating much more thorough metadata for crates, enough that, for example, when a function is modified, that function could be compiled independently of the rest of the crate. This would require having all the type information we currently use for type checking in the metadata. (We would of course also need the object files and so forth used for incremental codegen).

The only way for this to be sane to implement would be if we had better representations of the compiler's intermediate representations, and these could be serialised to make the metadata, rather than the current ad hoc approach (but this seems like a win from an architectural point of view too).

michaelwoerister (Author) replied Jan 19, 2015

I definitely think of this RFC as just a first step towards an architecture that does even more things incrementally. For example, the concept of 'interfaces' that I use in the RFC seems very fruitful to me in terms of thinking about what is really needed where, and using that understanding to improve the whole compilation process. Once you have extracted these 'interfaces', almost everything else should be doable at a per-item level and thus in parallel (not just codegen, but also type checking and many other parts of static analysis).

Anyway, I'd regard a fully incremental solution as the long-term goal too.

Ericson2314 (Contributor) commented Jan 25, 2015

In the long run, both rustc and cargo will benefit from a general purpose caching and dependency management framework. I'm inclined to go big or go home on these things, so perhaps that could be developed separately as a library from the get-go? Between this proposal, and the new IR ones, sounds like rustc is basically going to be rewritten.


In this example `transmogrify<Kid, Tiger>` will have a different dependency graph than `transmogrify<Dinosaur, Gastropod>`. In other words, the monomorphized implementation of `transmogrify<Kid, Tiger>` is not affected if the definition of `Dinosaur` or `Gastropod` changes and the dependency graph should reflect this.

One way to model this behavior is to create, for generic `program items`, not dependency graph nodes but `node templates`, which---like generic items---have type parameters and yield a concrete, monomorphic dependency graph node once all type parameters are substituted with concrete arguments. When the need arises to check whether a particular monomorphized function implementation from the cache can be re-used, the dependency graph for the function can be constructed on demand from the given `node template` and the parameter substitutions.
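The `node template` idea could be sketched like this (a toy model with string IDs standing in for real item IDs, not the proposed implementation): instantiation substitutes concrete type arguments into the template's dependency list, yielding the dependencies of one monomorphic instance.

```rust
use std::collections::HashMap;

// A dependency-graph template for a generic item such as `transmogrify<S, T>`.
struct NodeTemplate {
    type_params: Vec<&'static str>,
    deps: Vec<&'static str>, // dependencies, possibly naming a type param
}

impl NodeTemplate {
    // Substitute concrete arguments for the type parameters, yielding the
    // dependency list of one monomorphic instance.
    fn instantiate(&self, args: &[&'static str]) -> Vec<String> {
        let subst: HashMap<&str, &str> = self
            .type_params
            .iter()
            .copied()
            .zip(args.iter().copied())
            .collect();
        self.deps
            .iter()
            .map(|d| subst.get(d).copied().unwrap_or(*d).to_string())
            .collect()
    }
}

fn main() {
    // Sketch of `transmogrify<S, T>` whose body mentions `S`, `T`, and `Helper`.
    let template = NodeTemplate {
        type_params: vec!["S", "T"],
        deps: vec!["S", "T", "Helper"],
    };
    println!("{:?}", template.instantiate(&["Kid", "Tiger"]));
}
```

Under this model, `transmogrify<Kid, Tiger>` depends on `Kid` and `Tiger` but not on `Dinosaur` or `Gastropod`, matching the behavior described above.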

comex commented Jan 19, 2015

If you explicitly stored the list of dependencies of each cache item in the cache, then it wouldn't be necessary to construct anything on demand - whether generic or not, you wouldn't have to go through the function to resolve types and such to determine what it actually depends on, which sounds like an improvement.

In any case, if you're caching object code, you need to store the list of functions that were inlined or otherwise had their behavior consulted by LLVM for codegen of the current function, or else you have to recompile a function whenever any of its transitive dependencies change in any way. (Unless, based on one of the notes below, you're giving up on combining optimization and incrementality? I guess it doesn't have to work in the initial version.)

Other than that I don't have much to say, but I will be eagerly watching any work on this, because I really hate slow compiles. :p

michaelwoerister (Author) replied Jan 19, 2015

Re: storing explicit lists of dependencies:
I've thought about that too and in the end I think it doesn't make that much of a difference. It might well be that an actual implementation would do it that way. However, I for one have learned a lot about the problem at hand by trying to come up with a formal model to describe dependencies within a Rust program.

Re: inlining:
Yes, that is a problem. From a theoretical point of view it's not that hard: you just add an edge to the body-node of the inlined function in the dependency graph. The hard part is finding out what will be inlined (or what was inlined, if you are "recording" dependencies), because that's all LLVM turf.
But maybe it's not a big problem, because you only cache unoptimized LLVM IR anyway, and for object code it won't be possible to do any inlining, since implementations are not available. Special support could be added for `#[inline(always)]` functions. If you want a more optimized build, you would have to use the granularity optimization from the 'Miscellaneous' section, and there you could be more conservative with adding dependency edges, so you would catch inlining occurrences anyway. But that's definitely an interesting point that would need a lot of testing to get right, I think.

dhardy (Contributor) commented Jan 19, 2015

Please don't make the acronym RAII more confusing by adding a competing definition. Arguably Resource Allocation Is Initialisation is the wrong name anyway, but that's no reason to make the term more difficult to explain to new-comers.

bstrie (Contributor) commented Jan 19, 2015

I agree with @dhardy. As an alternative, may I propose "So Far, Incrementalism Necessitates An Exegesis"?

michaelwoerister (Author) commented Jan 19, 2015

@dhardy Sorry for the confusion :) That suggestion wasn't meant entirely seriously.

@bstrie I am in awe. Now do TANSTAAFL !

Diggsey (Contributor) commented Jan 20, 2015

@dhardy
You're right that is the wrong name :P, it's "Resource Acquisition Is Initialisation"
(also we have the same first name...)

nrc self-assigned this Jan 22, 2015

The dependency tracking system as described above contains `node templates` for `program item` definitions on a syntactic level; that is, for each `struct`, `enum`, `type`, and `trait` there is one `node template`, and for each `fn`, `static`, and `const` there are two (one for the interface, one for the body). However, as seen in the section on generics, the codebase can refer to monomorphized instances of program items that cannot be identified by a single identifier as described above. A reference like `Option<String>` is a composite of multiple `program item` IDs, a tree of program item IDs in the general case:
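Such a tree of item IDs could be modeled as follows (an illustrative toy, with string IDs standing in for real `program item` IDs): the root names the outer item and its children name the type arguments, and collecting the IDs in the tree yields the dependency edges of the item containing the reference.

```rust
// A composite reference like `Option<String>` as a tree of item IDs.
#[derive(Debug)]
struct ItemRef {
    id: &'static str,
    args: Vec<ItemRef>,
}

// Collect every item ID the reference mentions; each one becomes a
// dependency edge of the item containing the reference.
fn mentioned_ids(r: &ItemRef, out: &mut Vec<&'static str>) {
    out.push(r.id);
    for arg in &r.args {
        mentioned_ids(arg, out);
    }
}

fn main() {
    // `Option<String>` -> root `Option` with one child `String`.
    let option_string = ItemRef {
        id: "Option",
        args: vec![ItemRef { id: "String", args: vec![] }],
    };
    let mut ids = Vec::new();
    mentioned_ids(&option_string, &mut ids);
    println!("{:?}", ids); // ["Option", "String"]
}
```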

spernsteiner commented Jan 26, 2015

On the subject of monomorphized identifiers: you'll probably need to do something about symbol naming for monomorphizations of functions. Right now the name includes the hash of the pointers to the `Ty`s representing the type arguments (which is random, thanks to ASLR). This does fine at preventing collisions, but it means you'll need to either record the mapping of (polymorphic function, type arguments) -> (symbol name) for use in later incremental builds, or fix symbol naming to produce something consistent. I tried to do the latter, but it wound up being a little more complicated than I expected (ADT `Ty`s reference the struct/enum definition by its `DefId`, which is not stable) and I don't remember if I ever got it working.
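A hedged sketch of the "consistent naming" option (the helper name and inputs are invented for illustration): hash stable inputs, here the function's path and the printed type arguments, rather than pointer values, so the same monomorphization always gets the same symbol.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a monomorphization's symbol suffix from stable, printable inputs
// instead of `Ty` pointer addresses, so it is reproducible across runs.
fn stable_symbol(fn_path: &str, type_args: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    fn_path.hash(&mut hasher);
    type_args.hash(&mut hasher);
    format!("{}::h{:016x}", fn_path, hasher.finish())
}

fn main() {
    println!("{}", stable_symbol("transmogrify", &["Kid", "Tiger"]));
}
```

(`DefaultHasher::new()` is deterministic within one std version, which suffices to illustrate the idea; a real scheme would need a hash that is stable across compiler versions too.)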

michaelwoerister (Author) replied Jan 29, 2015

Yes, that's a problem. I'd probably try to find a more stable symbol naming scheme.

nrc added the T-compiler label May 15, 2015
nikomatsakis (Contributor) commented Aug 16, 2015

I am expanding and adapting this RFC. After some discussion with @michaelwoerister we decided to close this existing PR for the time being.
