[RFC] Introduce a mid-level IR (MIR) in the compiler that will drive borrowck, trans #1211

Merged
merged 1 commit into from Aug 14, 2015

Conversation

@nikomatsakis
Contributor

nikomatsakis commented Jul 14, 2015

This proposal describes a mid-level IR that I believe we should use in the compiler. This is purely an implementation detail and should not affect the language, though it may make many language extensions and analyses easier to implement; the most notable of these is non-lexical lifetimes.

Rendered.


@arielb1
Contributor

arielb1 commented Jul 14, 2015
Sounds nice. However, as written this makes even rvalues be non-SSA - we may want to be smarter on that front.

We would want to do at least these optimizations on the MIR, to prevent codegen regressions:
* RVO (of course)
* NRVO (because we essentially do it in our codegen for non-nested return-s).
* some kind of constant-propagation
* some kind of move-elimination, like we do in match today
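As an illustrative sketch of the NRVO case mentioned above (hypothetical names, not compiler code): a function with a single, non-nested `return` of a named local is the pattern where codegen can construct the value directly in the caller-provided return slot instead of copying it out at the end.

```rust
// A source-level sketch of the NRVO-friendly shape: one named local,
// one non-nested return, no other exit path aliasing `out`.
struct Big {
    data: [u64; 16],
}

fn make_big(seed: u64) -> Big {
    let mut out = Big { data: [0; 16] };
    for (i, slot) in out.data.iter_mut().enumerate() {
        *slot = seed + i as u64;
    }
    out // NRVO candidate: can be built in the return slot directly
}

fn main() {
    let b = make_big(10);
    assert_eq!(b.data[0], 10);
    assert_eq!(b.data[15], 25);
}
```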


of it to make quality error messages.
3. This representation should encode drops, panics, and other
scope-dependent items explicitly.
4. This representation does not have to be well-typed Rust, though it

@arielb1

arielb1 Jul 14, 2015

Contributor

well-typed Rust? The representation allows for some unsafe operations (e.g. unrestricted downcasts, unchecked indexing, calling unsafe functions) but should type-check.


@eddyb
Member

eddyb commented Jul 14, 2015
@arielb1 I expect pure constant expressions to have a single value in the MIR, modulo associated constant projections.


| [LVALUE...LVALUE]
| CONSTANT
| LEN(LVALUE) // load length from a slice, see section below
| BOX // malloc for builtin box, see section below

@arielb1

arielb1 Jul 14, 2015

Contributor

shouldn't this also need the adjustments?


| BYTES
| STATIC_STRING
| ITEM<SUBSTS> // reference to an item or constant etc
| <P0 as TRAIT<P1...Pn>> // projection

@arielb1

arielb1 Jul 14, 2015

Contributor

That's UFCS I assume (<T0 as TRAIT<T1...Tn>>::item or <T as Inherent(DefId)>::item).


// call LVALUE1 with LVALUE2... as arguments. Write
// result into LVALUE0. Branch to BB0 if it returns
// normally, BB1 if it is unwinding.
| DIVERGE // return to caller, unwinding

@arielb1

arielb1 Jul 14, 2015

Contributor

DIVERGE? When would that be emitted? panic is a lang-item, not a keyword.


@eddyb

eddyb Jul 14, 2015

Member

I thought it was the landing-pad terminator.


@nikomatsakis

nikomatsakis Jul 14, 2015

Contributor

As @eddyb said, DIVERGE is the landing-pad terminator. PANIC is for initiating a panic.


@nikomatsakis
Contributor

nikomatsakis commented Jul 14, 2015
@arielb1

However, as written this makes even rvalues be non-SSA - we may want to be smarter on that front.

I think what you mean by this is that simple things like foo(4) would introduce a (non-SSA) temporary? This is true. I don't think it's worth having a separate class of "SSA" temporaries -- I'd personally rather just do our optimizations in the older style, with kill sets. This simplifies the IR by not having more than one kind of lvalue. However, I could be persuaded otherwise. (The truth is, this is kind of a minor detail in the end. I expect us to evolve the MIR over time, and if we find that distinguishing spilled, mutable temporaries from other rvalues is worthwhile, that's fine.)
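A source-level sketch of the temporary being discussed (ordinary Rust standing in for the MIR, with a hypothetical `foo`): `foo(4)` lowers to roughly `tmp0 = 4; x = foo(tmp0)`, and `tmp0` is a plain non-SSA lvalue in the proposed IR.

```rust
fn foo(x: i32) -> i32 {
    x * 2
}

fn main() {
    // Direct form:
    let direct = foo(4);

    // Explicit-temporary form, mirroring the MIR lowering
    // `tmp0 = 4; x = foo(tmp0)`:
    let tmp0 = 4;
    let lowered = foo(tmp0);

    assert_eq!(direct, lowered); // both are 8
}
```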


One thing the current MIR does not make as explicit as it
could is when something is *moved*. For by-value uses of a value, the
code must still consult the type of the value to decide if that is a
move or not. This could be made more explicit in the IR.

@eefriedman

eefriedman Jul 15, 2015

Contributor

In a world with drop calls explicitly encoded into the MIR, whether something is moved as opposed to copied doesn't matter at all; either the value will be explicitly dropped, or it won't. This is true with either the current embedded drop flags or explicit stack-based drop flags. Or am I missing something?


@eddyb

eddyb Jul 15, 2015

Member

Eliding memcpy calls for moves, perhaps.


@nikomatsakis

nikomatsakis Jul 17, 2015

Contributor

@eefriedman

In a world with drop calls explicitly encoded into the MIR, whether something is moved as opposed to copied doesn't matter at all; either the value will be explicitly dropped, or it won't. This is true with either the current embedded drop flags or explicit stack-based drop flags. Or am I missing something?

For one thing, the MIR as I've described it thus far is allowed to DROP things that may have been moved. I'm assuming a later pass that determines precisely what needs to be dropped and inserts code to prevent double drops; this will be a type-based, control-flow-sensitive analysis, and hence it makes sense to do it after the MIR is built.
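A small runnable sketch of why that elaboration pass matters (illustrative types, not compiler code): whether a value is dropped at scope end depends on control flow, so a naive unconditional DROP would double-drop on the moved-out path. Counting drops shows each value must be dropped exactly once on every path.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Tracked;

impl Drop for Tracked {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn maybe_move(cond: bool) {
    let v = Tracked;
    if cond {
        drop(v); // moved out on this path
    }
    // On the `cond == false` path, `v` is still live here and is
    // dropped at scope end; the elaboration pass (or a drop flag)
    // must ensure exactly one drop on every path.
}

fn main() {
    maybe_move(true);
    maybe_move(false);
    // One drop per call, no double-drops:
    assert_eq!(DROPS.load(Ordering::SeqCst), 2);
}
```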


@dotdash

dotdash Jul 26, 2015

@eddyb Right. If we explicitly encode moves, a pass could look for copies where the source is not used anymore after the copy and turn that into a move. LLVM doesn't do that for calls, most likely because it sees the address as significant when you pass a pointer to a function.


@Aatch
Contributor

Aatch commented Jul 15, 2015
Looks good to me. SSA probably isn't worth it at this level: it's brilliant for lower-level optimisations, but it's also more complex to build, and whatever we want to do can probably be handled with dataflow analysis and similar. As this is an internal thing, I'm not too bothered as long as we get something in this direction. The details can be changed later.

I think that unsafe blocks are inappropriate for an MIR. I also think the property that unsafe doesn't actually change the way Rust works, it merely allows some otherwise-disallowed operations is something worth maintaining.


to figure out on its own how to do unwinding at that point. Because
the MIR doesn't "desugar" fat pointers, we include a special rvalue
`LEN` that extracts the length from an array value whose type matches
`[T]` or `[T;n]` (in the latter case, it yields a constant). Using

@eefriedman

eefriedman Jul 15, 2015

Contributor

Allowing LEN on fixed-size arrays seems like it just pointlessly complicates the MIR at the expense of possibly making it slightly easier to construct.


@Aatch

Aatch Jul 15, 2015

Contributor

I don't think it complicates the MIR at all. If anything it's simpler as it doesn't require an extra rule for fixed-size arrays. Also, we still need to bounds-check fixed-size arrays, so this would have to be a separate path for them for no obvious reason.
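A source-level illustration of this point (hypothetical helper, not the actual MIR): indexing both `[T; n]` and `[T]` goes through the same "load the length, compare, then index" shape, so a single LEN-style operation covers both; for a fixed-size array it simply folds to a constant.

```rust
fn checked_get(xs: &[u32], i: usize) -> Option<u32> {
    // Conceptually: tmp = LEN(xs); if i < tmp { xs[i] } else { fail }
    if i < xs.len() {
        Some(xs[i])
    } else {
        None
    }
}

fn main() {
    let fixed: [u32; 3] = [10, 20, 30]; // LEN folds to the constant 3
    let slice: &[u32] = &fixed[..];     // LEN loads from the fat pointer

    assert_eq!(checked_get(&fixed, 2), Some(30));
    assert_eq!(checked_get(slice, 3), None); // out of bounds
}
```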


| PANIC(BB) // initiate unwinding, branching to BB for cleanup
| IF(LVALUE, BB0, BB1) // test LVALUE and branch to BB0 if true, else BB1
| SWITCH(LVALUE, BB...) // load discriminant from LVALUE (which must be an enum),
// and branch to BB... depending on which variant it is

@eefriedman

eefriedman Jul 15, 2015

Contributor

Any particular reason to choose SWITCH over some sort of RVALUE which just extracts the discriminant of an enum?


@Aatch

Aatch Jul 15, 2015

Contributor

I think the SWITCH here is too limited. From the description, it looks like you need to account for every single variant in the enum, even if the match only had (for example) one matching arm and a default. I also think it should be usable for non-enum values too, at least scalar values.

I think pairs of value + block, and an optional default makes more sense.
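As a source-level sketch of the suggested shape (illustrative enum, not compiler code): explicit (value, target) pairs plus an optional default, rather than one exit per variant, is exactly the shape of a Rust `match` with a catch-all arm.

```rust
enum Shape {
    Circle,
    Square,
    Triangle,
}

fn classify(s: &Shape) -> &'static str {
    match s {
        // One explicit (value, block) pair...
        Shape::Circle => "round",
        // ...and a single default edge covering every other variant.
        _ => "angular",
    }
}

fn main() {
    assert_eq!(classify(&Shape::Circle), "round");
    assert_eq!(classify(&Shape::Square), "angular");
    assert_eq!(classify(&Shape::Triangle), "angular");
}
```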


@nikomatsakis

nikomatsakis Jul 17, 2015

Contributor

@eefriedman

Any particular reason to choose SWITCH over some sort of RVALUE which just extracts the discriminant of an enum?

Not really. Mainly that the range of values that switch operates over is easily computed and if we want to check downcasts that are guarded by a switch it is very straightforward. But it's easily possible in either case, and it'd be consistent with LEN.

@Aatch

I think the SWITCH here is too limited. From the description, it looks like you need to account for every single variant in the enum, even if the match only had (for example) one matching arm and a default. I also think it should be usable for non-enum values too, at least scalar values.

Yes, I'm amenable to tweaking this part. The important thing I wanted to get at is that we deconstruct matches -- the precise set of blocks we use for that (and how we do it) is less important. Regarding the SWITCH I described, it does require N exits, but then you can direct them all to one place if you choose. And yes, matching against scalar values would be nice, right now the matches deconstruct to a series of ifs (though LLVM, it seems, is likely to optimize those later?)


@bkoropoff

bkoropoff Jul 19, 2015

This could be made safe by construction by specifying a list of (BB, var) pairs, where var receives the downcasted lvalue on entry to the block.


@eefriedman
Contributor

eefriedman commented Jul 15, 2015
Have you thought about how serialization for MIR will work?


its contents (it is not yet initialized).
Note that having this kind of builtin box code is a legacy thing. The
more generalized protocol that [RFC 809][809] specifies works in

@erickt

erickt Jul 15, 2015

I think you meant to link to RFC 809 here.


tmp0 = foo;
tmp1 = 3
x = tmp(tmp1)

@erickt

erickt Jul 15, 2015

I think you mean x = tmp0(tmp1) here.


@michaelwoerister

michaelwoerister commented Jul 16, 2015
I'm all for giving this a try. The direction of the proposed design makes sense to me. Many of the details will become more clear when working on the implementation. Only the HIR trait thing from the prototype sounds a bit too clever for my taste but since it's not even part of the RFC really ...

Regarding debuginfo, the things that come to mind in the context of the MIR are source locations, scope information, and memory locations of local variables/arguments.

Source Locations
For every LLVM IR statement, we want to know which piece of source code it originated from. So far, trans has read this information from the AST. I imagine that there will be some way to find out about the span of a given MIR statement. One thing that warrants special consideration in this respect is the spans of compiler-generated instructions, especially drop calls. We have to assign some span to them (LLVM crashes otherwise) and currently we are using a heuristic that tries to find the closing brace of the enclosing block. This is something that would best be taken care of during the lowering step to MIR.

Scope Information
LLVM and debuginfo not only want to know about the source location of every machine instruction, they also need to know about the scope that instruction is part of (so the debugger knows which variables are visible when stopped at a given position in the program). This is what we do so far: When starting to translate a function, we build a "scope map" by walking the AST of the function. This map maps every NodeId in the function to the corresponding debuginfo descriptor for the scope the node is contained in. The scope descriptor tree is built up as the AST is traversed, also taking care of implicit scopes introduced by let-statements.
Again, if it is possible to map from an MIR statement back to the node that introduced it, there's no need to do things differently. But the scope tree could also be built before lowering and then linking each MIR statement to the scope tree node it belongs to.

Local Variables and Arguments
For these we need to know the alloca that stores them. The current, non-SSA setup indeed seems to be a good match for this. Let LLVM worry about this stuff :)
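A minimal sketch of the scope-map idea described above (all names here, such as `NodeId` and `ScopeId`, are illustrative stand-ins, not the compiler's actual types): a scope tree with parent links plus a map from each node's id to its enclosing scope, which a debugger walks outward to find the visible variables.

```rust
use std::collections::HashMap;

type NodeId = u32;
type ScopeId = usize;

struct ScopeTree {
    // parents[s] = the scope enclosing scope s (None for the root).
    parents: Vec<Option<ScopeId>>,
    // Maps each node to the scope it is contained in.
    scope_of: HashMap<NodeId, ScopeId>,
}

impl ScopeTree {
    fn new() -> Self {
        ScopeTree { parents: vec![None], scope_of: HashMap::new() }
    }

    fn push_child(&mut self, parent: ScopeId) -> ScopeId {
        self.parents.push(Some(parent));
        self.parents.len() - 1
    }

    fn assign(&mut self, node: NodeId, scope: ScopeId) {
        self.scope_of.insert(node, scope);
    }

    // Walk from a node's scope out to the root, innermost first.
    fn enclosing_chain(&self, node: NodeId) -> Vec<ScopeId> {
        let mut chain = Vec::new();
        let mut cur = self.scope_of.get(&node).copied();
        while let Some(s) = cur {
            chain.push(s);
            cur = self.parents[s];
        }
        chain
    }
}

fn main() {
    let mut tree = ScopeTree::new();
    let fn_body = tree.push_child(0);         // function body scope
    let let_scope = tree.push_child(fn_body); // implicit scope from a `let`
    tree.assign(42, let_scope);
    assert_eq!(tree.enclosing_chain(42), vec![let_scope, fn_body, 0]);
}
```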


@nikomatsakis
Contributor

nikomatsakis commented Jul 17, 2015
@Aatch

I think that unsafe blocks are inappropriate for an MIR. I also think the property that unsafe doesn't actually change the way Rust works, it merely allows some otherwise-disallowed operations is something worth maintaining.

The main question was whether it'd be worth CHECKING that property (that is, checking what operations are disallowed) on MIR. The reason to consider doing that is that it would be easier, since all derefs and calls are made very explicit.


@nikomatsakis
Contributor

nikomatsakis commented Jul 17, 2015
@eefriedman

Have you thought about how serialization for MIR will work?

Not deeply. I don't foresee any particular difficulties. It should be much easier than serializing the AST, since there are no side-tables to be concerned with, and all internal links are, well, internal. That said, I'd like to define a canonical textual format for testing purposes (in an ideal world, we'd be able to supply MIR inputs directly to the compiler so we can skip early stages of the pipeline when testing).


@nikomatsakis
Contributor

nikomatsakis commented Jul 17, 2015
@arielb1

We would want to do at least these optimizations on the MIR, to prevent codegen regressions:

  • RVO (of course)

The existence of the "ReturnValue" lvalue allows us to do RVO, modulo the next bullet.

  • NRVO (because we essentially do it in our codegen for non-nested return-s).

So, the main problem here that I see is aggregates. That is, if you have

v = Struct { x: ..., y: ... }

it gets converted into:

tmpx = ...;
tmpy = ...;
v = Struct { x: tmpx, y: tmpy }

which is obviously not what trans would produce. However, there are a lot of advantages to starting out with this form. But after safety checks are done, as I describe in the RFC, it is pretty easy to convert this to:

v.x = ...;
v.y = ...;

I'm assuming this would run after safety analyses but also after drops are rewritten to be more minimal, since I think there are some cases where you might wind up with double-frees if you're not careful.

  • some kind of constant-propagation

This is why I separated out constants into their own thing. We can simplify constants and also rewrite MIR expressions as we choose.

  • some kind of move-elimination, like we do in match today

This can conceivably be expressed by rewriting to reference the original lvalue.

(Overall, I'm not sure how much optimization it makes sense to do on the MIR vs leaving it to LLVM -- we'll have to work out that trade-off. Certainly though we've found that doing optimizations in trans can be quite helpful for execution and compilation time so it's easy to see that the same will be true of the MIR. And I'm trying to think beyond LLVM as well, in which case doing more in the MIR would be helpful for portability -- especially Rust-specific things that would require custom LLVM passes or code anyway.)


@arielb1
Contributor

arielb1 commented Jul 17, 2015
@nikomatsakis

That would convert into something like

v.x = ...;
v.y = ...;
v = Struct { x: v.x, y: v.y };

Your last example wasn't valid IR (v wasn't initialized). This can be handled the right way in translation anyway.


@nikomatsakis
Contributor

nikomatsakis commented Jul 17, 2015
@arielb1

Your last example wasn't valid IR (v wasn't initialized). This can be handled the right way in translation anyway.

I was assuming that past a certain point we would enforce looser restrictions on what's valid or invalid.


@bkoropoff

bkoropoff commented Jul 20, 2015
This looks very clean and a lot easier to work with. I'm definitely in favor. I had a few thoughts about the kind of desugaring we might want to do at this level and how it would interact with region and borrow checking:

Closures

Closures seem like a natural candidate for desugaring, since they are nearly equivalent to an anonymous struct with a trait impl. One subtlety is that assignment to a non-mut by-value upvar ought to be rejected, even though this would be translated into an assignment through mut self or &mut self, which would be accepted. We'd need to track this one way or another.
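As a hand-desugared sketch of that equivalence (illustrative struct and method names; real closures implement the `Fn` traits, which can't be implemented manually on stable, so a plain method stands in here): the closure becomes an anonymous struct holding its upvars, and calling it becomes a method taking `&mut self`.

```rust
// The anonymous struct of upvars that the closure desugars to.
struct CounterClosure {
    count: u32, // by-value upvar
}

impl CounterClosure {
    // After desugaring, the upvar is mutated through `&mut self`, which
    // a MIR-level check would accept even if the source declared the
    // captured variable without `mut` -- hence the need for extra tracking.
    fn call_mut(&mut self) -> u32 {
        self.count += 1;
        self.count
    }
}

fn main() {
    // The source-level closure...
    let mut count = 0;
    let mut closure = move || { count += 1; count };
    assert_eq!(closure(), 1);
    assert_eq!(closure(), 2);

    // ...behaves the same as its desugared form.
    let mut desugared = CounterClosure { count: 0 };
    assert_eq!(desugared.call_mut(), 1);
    assert_eq!(desugared.call_mut(), 2);
}
```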

CPS and friends

If we ever want Rust to support generators, async/await, coroutines, etc., this seems like the right place to do it. I've played around with writing a CPS transformation with pure macro rules and found several constructs that would be sound when doing region/borrow analysis in direct style but are not expressible in safe Rust after translation. Doing it at the MIR level after performing region/borrow checking would solve the issue nicely. On the other hand, the transformation also introduces trait bounds (e.g. Send for async/await) and moves that are not present in the source. And, of course, any non-trivial transformation complicates good error reporting. What kind of IR to IR transformations can we reasonably accommodate here?

Lints

Do we allow pluggable lints at this level? It seems like some of the ones used by Servo (e.g. checking that GC roots are used properly) would need to operate on the MIR. Maybe I'm wrong and the HIR is enough.


@nikomatsakis

nikomatsakis (Contributor) commented Jul 24, 2015

@bkoropoff

Regarding upvars, the MIR actually has a richer type system than the source language, and it includes &uniq pointers, which cover the case of non-mutable upvars.

Regarding CPS transform, I agree this is the place to do it, and we'll have to do some work to produce good error messages. I think we'll gain some more experience in that regard with mapping closures etc (we've made some progress, but we definitely produce some suboptimal error messages in borrowck today, such as those that talk about "borrowing" when the borrow is implicit in the syntax today).

Regarding lints, I think it might make sense for some of them to operate on the MIR, but that's a long way off. @brson has also expressed interest in being able to write front-ends that generate MIR directly. So it seems plausible to me that we might sometime want to standardize a lowered Rust representation that can be consumed externally.

@nikomatsakis

nikomatsakis (Contributor) commented Jul 24, 2015

Hear ye, hear ye. This RFC is entering final comment period.

@bkoropoff

bkoropoff commented Jul 24, 2015

@nikomatsakis

&uniq helps in some cases, but move captures are still a problem, since an upvar may not be behind a reference that can be marked uniq independently of the others. Obviously it's nothing a little hidden metadata plumbing can't fix. I also vaguely recall some special-case handling of Fn traits in trait selection that allows picking the auto-generated impls as candidates in spite of ununified type variables that would otherwise cause problems. There are probably other edge cases where closures don't quite behave exactly like a struct + impl that we'll need to be wary of.

@nikomatsakis

nikomatsakis (Contributor) commented Jul 25, 2015

@bkoropoff The special unification logic in trait selection stuff is independent of the mir (which doesn't really touch on trait selection), but you're right I was forgetting about the rules to prevent assignments to moved upvars. What a pain. I should have pushed harder for mutpocalypse. :) In any case, to actually model that properly does require just a bit more extension of the type system: basically marking fields that cannot be directly assigned, even when reached uniquely (I've thought about proposing something similar from time to time -- obviously now it'd have to be more of a lint). As you say, not a big deal, but you're right that it has to be handled.

@bkoropoff

bkoropoff commented Jul 25, 2015

@nikomatsakis Would trait selection still occur before desugaring closures to a struct + impl? I guess that's fine then, I was just hoping we could eliminate as much special case handling as possible.

@arielb1

arielb1 (Contributor) commented Jul 25, 2015

@bkoropoff

The biggest complications closures bring are the type-system issues, and most of these (e.g. consider_unification_despite_ambiguity) occur only during typeck (i.e. before the MIR).

Assignments to non-mut locals are already special-cased, and that's not something the MIR can really help with. The TyTuple/TyStruct/TyEnum/TyClosure distinction will remain in the MIR - we should try to handle these as uniformly as possible, though.

@nikomatsakis

nikomatsakis (Contributor) commented Jul 29, 2015

On Fri, Jul 24, 2015 at 06:47:58PM -0700, Brian Koropoff wrote:

@nikomatsakis Would trait selection still occur before desugaring closures to a struct + impl? I guess that's fine then, I was just hoping we could eliminate as much special case handling as possible.

Yes, trait selection still occurs before desugaring. Trait selection
is pretty orthogonal to the MIR really, but yes it will still require
some amount of special case handling. That said, I'm getting very
excited lately about the idea of an internal "type IR" that should
play a similar role of formalizing and simplifying the type system.
More on that soon.

@arielb1

arielb1 (Contributor) commented Jul 29, 2015

Don't we already have a type IR?

@nikomatsakis

nikomatsakis (Contributor) commented Jul 31, 2015

@arielb1 I'll try to write up what I'm talking about :) It's pretty orthogonal to this proposal.

@qwertie

qwertie commented Jul 31, 2015

I'd suggest using some kind of standard format as a text representation - either a subset of Rust itself, or LES. That way nobody has to go to the trouble of designing a new syntax.

@RalfJung

RalfJung (Member) commented Aug 5, 2015

I like this a lot! From a formal verification standpoint, this language is much better suited than the original AST. Fewer constructs, and more things explicit: it's almost like what I dreamed of ;-)

Now, from a purely practical perspective, there's one thing I do not understand: What is the relationship to the recently accepted HIR? I'm surprised that the only relationship mentioned is that the HIR trait here is not related. Skimming over the HIR RFC, the goals also seem to be fairly similar: Lowering of high-level sugar to fewer primitives, to ease processing. My impression is that the final pipeline will be "AST -> HIR -> MIR -> LLVM", with some desugaring happening on the first arrow, and other things waiting for the second arrow. Will there be anything that works on the HIR directly? Or will it be the case that the HIR is only constructed to be immediately lowered to MIR?

@eddyb

eddyb (Member) commented Aug 5, 2015

@RalfJung It's possible that everything outside of function bodies may be kept around in HIR form.
A strategy for constants that can be used by both the MIR and [T; N], allowing proper handling of associated constants in polymorphic contexts, is yet to be chosen, but one of the options involves holding a HIR expression tree (or an ID to one) and some type bindings for it.

@arielb1

arielb1 (Contributor) commented Aug 5, 2015

@RalfJung @eddyb

HIR is compiled to tables/metadata and MIR. MIR is mostly supposed to replace ast::Block. We will also need some form of ConstExpr and are still deliberating on the best way to implement it.

Type checking works on HIR + tables/metadata.

@RalfJung

RalfJung (Member) commented Aug 5, 2015

So what's the reason not to compile the AST directly to MIR+tables? Is there anything interesting happening on that intermediate stage?

(I'm not trying to suggest that HIR has no place in this world; I'm just trying to figure out the reason behind your design decisions here.)

@arielb1

arielb1 (Contributor) commented Aug 5, 2015

@RalfJung

The HIR is supposed to abstract over macros and name resolution. The new process should be:

  • parse: text -> AST
  • expansion: AST -> expanded AST + hygiene-info
  • resolution: expanded AST + hygiene-info -> def-map
  • HIR creation: expanded AST + def-map -> HIR
  • type checking: HIR -> tcx tables + MIR
  • late analysis: tcx tables + MIR -> more tcx tables
  • translation: tcx tables + MIR -> LLVM IR

Type checking is a big enough step to deserve its own IR.

@yazaddaruvala

yazaddaruvala commented Aug 11, 2015

@arielb1

Thanks, that's a pretty simple but thorough list for someone who's curious but completely new to rustc development.

Similarly I was hoping you could expand on it a bit. I've heard in the past that one way to improve code-gen speed is for rustc to optimize the amount of LLVM IR it creates. I'm not at all suggesting it happen in this implementation but this seems like a great refactor to help with that, so I'm sure you guys are keeping it in mind.

I'm just kinda curious where these IR reductions could/would take place in your list above? or if it will be more piece-meal and happen in small increments at every level as appropriate?

@arielb1

arielb1 (Contributor) commented Aug 12, 2015

@yazaddaruvala

  • parse: text -> AST
    Implemented in syntax::parse, the generated AST is in syntax::ast. A standard context-sensitive linear scan tokenizer and LL(k) parser (IIRC k<5).
  • expansion: AST -> expanded AST + hygiene-info
    Macro expansion (the rest of syntax). I don't actually understand this very well (I think @nrc understands it best). At the end of this phase, the fully-formed program AST is generated.
  • resolution: expanded-AST + hygiene-info -> def-map
    This creates a map from paths in code (e.g. mem::transmute, ::std::fmt::Display, Vec, local variables, even Trait::method) to their definition. Trait items, fields, and methods (foo.bar, <T as Trait>::method, T::Item) are handled during type-checking instead.
  • HIR creation
    Not implemented yet. This should create an HIR that abstracts over syntactic distinctions (e.g. constants vs. local variables).
  • type checking
    rustc_typeck. The most complicated phase. This determines the type of every expression in a program and ensures that traits can always be satisfied. It also resolves trait item/field accesses and method calls. We are planning on emitting a MIR after this phase is over, that contains a concrete CFG.
  • late analysis
    These are various analyses run over the code to ensure soundness and gather information required for translation. For example, borrow checking (rustc_borrowck) ensures that non-Copy values are indeed not copied and &mut references not aliased, while match checking (rustc::middle::check_match) ensures there are no missing corner cases in match expressions. Lints are also run here - this is why you don't get them if your program contains a type error. Because these are run on a program known to be essentially intact, these can do rather deep analysis relatively simply.
  • translation
    rustc_trans. This pass creates LLVM IR representing the program. It also monomorphizes (expands) generics into concrete instances. Also, it does do some basic optimizations (e.g. RVO). Because of these and the combined complexities of the AST and LLVM (and Rust control flow), this pass is rather more complicated than it should be. The main purpose of the MIR is to simplify it and allow the optimizations to be more general.
@nikomatsakis

nikomatsakis (Contributor) commented Aug 14, 2015

It's official. The compiler subteam has decided to accept this RFC. (As of this writing, there are a few missing votes, but @Aatch has expressed support in thread, and @pnkfelix has expressed support in person.)

@nikomatsakis nikomatsakis referenced this pull request Aug 14, 2015

Closed

Tracking issue for MIR (RFC #1211) #27840

8 of 16 tasks complete

@nikomatsakis nikomatsakis merged commit bd7f40c into rust-lang:master Aug 14, 2015

nikomatsakis added a commit that referenced this pull request Aug 14, 2015
