
Please Support Arbitrary Labels and Gotos. #796

Open
oridb opened this issue Sep 8, 2016 · 169 comments

Comments

@oridb

oridb commented Sep 8, 2016

I'd like to point out that I haven't been involved in the web assembly effort,
and I'm not maintaining any large or widely used compilers (just my own
toy-ish language, minor contributions to the QBE compiler backend, and an
internship on IBM's compiler team), but I ended up getting a bit ranty, and
was encouraged to share more widely.

So, while I'm a bit uncomfortable jumping in and suggesting major changes
to a project I haven't been working on... here goes:

My Complaints:

When I'm writing a compiler, the first thing I do with the high level
structure -- loops, if statements, and so on -- is validate it for semantics
and do type checking. The second thing I do is throw it all out and flatten
to basic blocks, and possibly to SSA form. In some other parts of the compiler
world, a popular format is continuation passing style. I'm not an expert on
compiling with continuation passing style, but it doesn't seem to be a good
fit for the loops and scoped blocks that web assembly has embraced either.

I'd like to argue that a flatter, goto based format would be far more useful as
a target for compiler developers, and would not significantly hinder the
writing of a usable polyfill.

Personally, I'm also not a big fan of nested complex expressions. They're a bit
clunkier to consume, especially if inner nodes can have side effects, but I
don't strongly object to them as a compiler implementer -- the web assembly
JIT can consume them, and I can ignore them and generate the instructions that
map to my IR. They don't make me want to flip tables.

The bigger problem comes down to loops, blocks, and other syntactic elements
that, as an optimizing compiler writer, you try very hard to represent as a
graph with branches representing edges; The explicit control flow constructs
are a hindrance. Reconstructing them from the graph once you've actually done
the optimizations you want is certainly possible, but it's quite a bit of
complexity to work around a more complex format. And that annoys me: Both the
producer and the consumer are working around entirely invented problems
which would be avoided by simply dropping complex control flow constructs
from web assembly.

In addition, the insistence on higher level constructs leads to some
pathological cases. For example, Duff's Device ends up with horrible web
assembly output, as can be seen by messing around in the Wasm Explorer.
However, the inverse is not true: everything that can be expressed
in web assembly can be trivially converted to an equivalent in some
unstructured, goto based format.
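
For reference, this is the classic shape of Duff's Device in C (a sketch of the well-known idiom, using a copying body and assuming count > 0): the switch jumps straight into the middle of the do-while, which is exactly the kind of branch into a block that structured control flow cannot express without duplication or a dispatch variable.

/* Duff's Device: the switch cases land inside the do-while body,
 * so control can enter the loop at an arbitrary point.
 * Assumes count > 0. */
void duff_copy(short *to, const short *from, int count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}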

So, at the very least, I'd like to suggest that the web assembly team add
support for arbitrary labels and gotos. If they choose to keep the higher
level constructs, it would be a bit of wasteful complexity, but at least
compiler writers like me would be able to ignore them and generate output
directly.

Polyfilling:

One of the concerns I have heard when discussing this is that the loop
and block based structure allows for easier polyfilling of web assembly.
While this isn't entirely false, I think that a simple polyfill solution
for labels and gotos is possible. While it might not be quite as optimal,
I think that it's worth a little bit of ugliness in the bytecode in order
to avoid starting a new tool with built in technical debt.

If we assume an LLVM (or QBE) like syntax for web assembly, then some code
that looks like:

int f(int x) {
    if (x == 42)
        return 123;
    else
        return 666;
}

might compile to:

 func @f(%x : i32) {
    %1 = test %x 42
    jmp %1 L0 L1

 L0:
    %r =i 123
    jmp Lret
 L1:
    %r =i 666
    jmp Lret
 Lret:
    ret %r
 }

This could be polyfilled to Javascript that looks like:

function f(x) {
    var L0 = 0, L1 = 1, L2 = 2, LRet = 3; // label constants for the basic blocks
    var __label = L0;
    var __ret;

    while (__label != LRet) {
        switch (__label) {
        case L0:
            var _v1 = (x == 42);
            if (_v1) { __label = L1; } else { __label = L2; }
            break;
        case L1:
            __ret = 123;
            __label = LRet;
            break;
        case L2:
            __ret = 666;
            __label = LRet;
            break;
        default:
            assert(false);
            break;
        }
    }
    return __ret;
}

Is it ugly? Yeah. Does it matter? Hopefully, if web assembly takes off,
not for long.

And if not:

Well, if I ever got around to targeting web assembly, I guess I'd generate code
using the approach I mentioned in the polyfill, and do my best to ignore all of
the high level constructs, hoping that the compilers would be smart enough to
catch on to this pattern.

But it would be nice if we didn't need to have both sides of the code generation
work around the format specified.

@oridb oridb changed the title Please Support Labels and Gotos. Please Support Arbitrary Labels and Gotos. Sep 8, 2016
@ghost

ghost commented Sep 8, 2016

@oridb Wasm is somewhat optimized for the consumer to be able to quickly convert to SSA form, and the structure does help here for common code patterns, so the structure is not necessarily a burden for the consumer. I disagree with your assertion that 'both sides of the code generation work around the format specified'. Wasm is very much about a slim and fast consumer, and if you have some proposals to make it slimmer and faster then that might be constructive.

Blocks that can be ordered into a DAG can be expressed in the wasm blocks and branches, such as your example. The switch-loop is the style used when necessary, and perhaps consumers might do some jump threading to help here. Perhaps have a look at binaryen which might do much of the work for your compiler backend.

There have been other requests for more general CFG support, and some other approaches using loops mentioned, but perhaps the focus is elsewhere at present.

I don't think there are any plans to support 'continuation passing style' explicitly in the encoding, but there has been mention of blocks and loops popping arguments (just like a lambda) and supporting multiple values (multiple lambda arguments) and adding a pick operator to make it easier to reference definitions (the lambda arguments).

@oridb
Author

oridb commented Sep 8, 2016

the structure does help here for common code patterns

I'm not seeing any common code patterns that are harder to represent in terms of branches to arbitrary labels than in the restricted loops and blocks subset that web assembly enforces. I could see a minor benefit if there was an attempt to make the code closely resemble the input code for certain classes of language, but that doesn't seem to be a goal -- and the constructs seem a bit bare if that were the purpose.

Blocks that can be ordered into a DAG can be expressed in the wasm blocks and branches, such as your example.

Yes, they can be. However, I'd strongly prefer not to add the extra work of determining which ones can be represented this way and which ones need extra handling. Realistically, I'd skip doing the extra analysis and always just generate the switch-loop form.

Again, my argument isn't that loops and blocks make things impossible; It's that everything they can do is simpler and easier for a machine to write with goto, goto_if, and arbitrary, unstructured labels.

Perhaps have a look at binaryen which might do much of the work for your compiler backend.

I already have a serviceable backend that I'm fairly happy with, and plans to fully bootstrap the entire compiler in my own language. I'd rather not add in a rather large extra dependency simply to work around the enforced use of loops/blocks. If I simply use switch loops, emitting the code is pretty trivial. If I try to actually use the features present in web assembly effectively, instead of doing my damndest to pretend they don't exist, it becomes a good deal more unpleasant.

There have been other requests for more general CFG support, and some other approaches using loops mentioned, but perhaps the focus is elsewhere at present.

I'm still not convinced that loops have any benefits -- anything that can be represented with a loop can be represented with a goto and label, and there are fast and well known conversions to SSA from flat instruction lists.

As far as CPS goes, I don't think that there needs to be explicit support -- it's popular in FP circles because it's fairly easy to convert to assembly directly, and gives similar benefits to SSA in terms of reasoning (http://mlton.org/pipermail/mlton/2003-January/023054.html). Again, I'm not an expert on it, but from what I remember, the invocation continuation gets lowered to a label, a few movs, and a goto.

@ghost

ghost commented Sep 8, 2016

@oridb 'there are fast and well known conversions to SSA from flat instruction lists'

It would be interesting to know how they compare with wasm SSA decoders -- that is the important question.

Wasm makes use of a values stack at present, and some of the benefits of that would be gone without the structure; it would hurt decoder performance. Without the values stack the SSA decoding would have more work to do, too. I've tried a register-based encoding and decoding was slower (not sure how significant that is).

Would you keep the values stack, or use a register based design? If keeping the values stack then perhaps it becomes a CIL clone, and perhaps wasm performance could be compared to CIL -- has anyone actually checked this?

@oridb
Author

oridb commented Sep 8, 2016

Would you keep the values stack, or use a register based design?

I don't actually have any strong feelings on that end. I'd imagine compactness of the encoding would be one of the biggest concerns; A register design may not fare that well there -- or it may turn out to compress fantastically over gzip. I don't actually know off the top of my head.

Performance is another concern, although I suspect that it might be less important given the ability to cache binary output, plus the fact that download time may outweigh the decoding by orders of magnitude.

It would be interesting to know how they compare with wasm SSA decoders -- that is the important question.

If you're decoding to SSA, that implies that you'd also be doing a reasonable amount of optimization. I'd be curious to benchmark how significant decoding performance is in the first place. But, yes, that's definitely a good question.

@titzer

titzer commented Sep 8, 2016

Thanks for your questions and concerns.

It's worth noting that many of the designers and implementors of
WebAssembly have backgrounds in high performance, industrial JITs, not only
for JavaScript (V8, SpiderMonkey, Chakra, and JavaScriptCore), but also in
LLVM and other compilers. I personally have implemented two JITs for Java
bytecode and I can attest that a stack machine with unrestricted gotos
introduces quite some complexity in decoding, verifying, and constructing a
compiler IR. In fact, there are many patterns that can be expressed in Java
bytecode that will cause high-performance JITs, including both C1 and C2 in
HotSpot, to simply give up and relegate the code to running only in the
interpreter. In contrast, constructing a compiler IR from something like an
AST from JavaScript or another language is something I've also done. The
extra structure of an AST makes some of this work far simpler.

The design of WebAssembly's control flow constructs simplifies consumers by
enabling fast, simple verification, easy, one pass conversion to SSA form
(even a graph IR), effective single-pass JITs, and (with postorder and the
stack machine) relatively simple in-place interpretation. Structured
control makes irreducible control flow graphs impossible, which eliminates
a whole class of nasty corner cases for decoders and compilers. It also
nicely sets the stage for exception handling in WASM bytecode, for which V8
is already developing a prototype in concert with the production
implementation.

We've had a lot of internal discussion between members about this very
topic, since, for a bytecode, it's one thing that is most different from
other machine-level targets. However, it's not any different than targeting
a source language like JavaScript (which many compilers do these days) and
requires only minor reorganization of blocks to achieve structure. There
are known algorithms to do this, and tools. We'd like to provide some
better guidance for those producers that start with an arbitrary CFG to
communicate this better. For languages targeting WASM directly from an AST
(which is actually something V8 does now for asm.js code--directly
translating a JavaScript AST to WASM bytecode), there is no restructuring
step necessary. We expect this to be the case for many language tools
across the spectrum that don't have sophisticated IRs inside.


@qwertie

qwertie commented Sep 8, 2016

Thanks @titzer, I was developing a suspicion that Wasm's structure had a purpose beyond just similarity to asm.js. I wonder though: Java bytecode (and CIL) don't model CFGs or the value stack directly, they have to be inferred by the JIT. But in Wasm (especially if block signatures are added) the JIT can easily figure out what's going on with the value stack and control flow, so I wonder, if CFGs (or irreducible control flow specifically) were modeled explicitly like loops and blocks are, might that avoid most of the nasty corner cases you're thinking of?

There's this neat optimization that interpreters use that relies on irreducible control flow to improve branch prediction...

@kripken
Member

kripken commented Sep 8, 2016

@oridb

I'd like to argue that a flatter, goto based format would be far more useful as
a target for compiler developers

I agree that gotos are very useful for many compilers. That's why tools like Binaryen let you generate arbitrary CFGs with gotos, and they can convert that very quickly and efficiently into WebAssembly for you.

It might help to think of WebAssembly as a thing optimized for browsers to consume (as @titzer pointed out). Most compilers should probably not generate WebAssembly directly, but rather use a tool like Binaryen, so that they can emit gotos, get a bunch of optimizations for free, and don't need to think about low-level binary format details of WebAssembly (instead you emit an IR using a simple API).

Regarding polyfilling with the while-switch pattern you mention: in emscripten we started out that way before we developed the "relooper" method of recreating loops. The while-switch pattern is around 4x slower on average (but in some cases significantly less or more, e.g. small loops are more sensitive). I agree with you that in theory jump-threading optimizations could speed that up, but performance will be less predictable as some VMs will do it better than others. It is also significantly larger in terms of code size.

@oridb
Author

oridb commented Sep 8, 2016

It might help to think of WebAssembly as a thing optimized for browsers to consume (as @titzer pointed out). Most compilers should probably not generate WebAssembly directly, but rather use a tool like Binaryen...

I'm still not convinced that this aspect is going to matter that much - again, I suspect the cost of fetching the bytecode would dominate the delay the user sees, with the second biggest cost being the optimizations done, and not the parsing and validation. I'm also assuming/hoping that the bytecode would be tossed out, and the compiled output is what would be cached, making the compilation effectively a one-time cost.

But if you were optimizing for web browser consumption, why not simply define web assembly as SSA, which seems to me both more in line with what I'd expect, and less effort to 'convert' to SSA?

@kripken
Member

kripken commented Sep 8, 2016

You can start to parse and compile while downloading, and some VMs might not do a full compile up front (they might just use a simple baseline for example). So download and compile times can be smaller than expected, and as a result parsing and validation can end up a significant factor in the total delay the user sees.

Regarding SSA representations, they tend to have large code sizes. SSA is great for optimizing code, but not for serializing code compactly.

@ghost

ghost commented Sep 9, 2016

@oridb See the comment by @titzer 'The design of WebAssembly's control flow constructs simplifies consumers by enabling fast, simple verification, easy, one pass conversion to SSA form ...' - it can generate verified SSA in one pass. Even if wasm used SSA for the encoding, it would still have the burden of verifying it and of computing the dominator structure, which is easy with the wasm control flow restrictions.

Much of the encoding efficiency of wasm appears to come from being optimized for the common code pattern in which definitions have a single use that is used in stack order. I expect that an SSA encoding could do so too, so it could be of similar encoding efficiency. Operators such as if_else for diamond patterns also help a lot. But without the wasm structure it looks like all basic blocks would need to read definitions from registers and write results to registers, and that might not be so efficient. For example, I think wasm can do even better with a pick operator that could reference scoped stack values up the stack and across basic block boundaries.

I think wasm is not too far from being able to encode most code in SSA style. If definitions were passed up the scope tree as basic block outputs then it might be complete. The SSA encoding might be orthogonal to the CFG matter: e.g. there could be an SSA encoding with the wasm CFG restrictions, or a register based VM with the CFG restrictions.

A goal for wasm is to move the optimization burden out of the runtime consumer. There is strong resistance to adding complexity in the runtime compiler, as it increases the attack surface. So much of the design challenge is to ask what can be done to simplify the runtime compiler without hurting performance, and much debate!

@comex

comex commented Dec 7, 2016

Well, it's probably too late now, but I'd like to question the idea that the relooper algorithm, or variants thereof, can produce "good enough" results in all cases. They clearly can in most cases, since most source code doesn't contain irreducible control flow to start with, optimizations don't usually make things too hairy, and if they do, e.g. as part of merging duplicate blocks, they can probably be taught not to. But what about pathological cases? For example, what if you have a coroutine which a compiler has transformed to a regular function with structure like this pseudo-C:

void transformed_coroutine(struct autogenerated_context_struct *ctx) {
	int arg1, arg2; // function args
	int var1, var2, var3, …; // all vars used by the function
	switch (ctx->current_label) { // restore state
	case 0:
		// initial state, load function args caller supplied and proceed to start
		arg1 = ctx->arg1;
		arg2 = ctx->arg2;
		break;
	case 1: 
		// restore all vars which are live at label 1, then jump there
		var2 = ctx->var2; 
		var3 = ctx->var3;
		goto resume_1;
	[more cases…]
	}
	
	[main body goes here...]
	[somewhere deep in nested control flow:]
		// originally a yield/await/etc.
		ctx->var2 = var2;
		ctx->var3 = var3;
		ctx->current_label = 1;
		return;
		resume_1:
		// continue on
}

So you have mostly normal control flow, but with some gotos pointed at the middle of it. This is roughly how LLVM coroutines work.

I don't think there's any nice way to reloop something like that, if the 'normal' control flow is complex enough. (Could be wrong.) Either you duplicate massive parts of the function, potentially needing a separate copy for every yield point, or you turn the whole thing into a giant switch, which according to @kripken is 4x slower than relooper on typical code (which itself is probably somewhat slower than not needing relooper at all).

The VM could reduce the overhead of a giant switch with jump threading optimizations, but surely it's more expensive for the VM to perform those optimizations, essentially guessing how the code reduces to gotos, than to just accept explicit gotos. As @kripken says, it's also less predictable.

Maybe doing that kind of transformation is a bad idea to start with, since afterward nothing dominates anything so SSA-based optimizations can't do much… maybe it's better done at the assembly level, maybe wasm should eventually get native coroutine support instead? But the compiler can perform most optimizations before doing the transformation, and it seems that at least the designers of LLVM coroutines didn't see an urgent need to delay the transformation until code generation. On the other hand, since there's a fair amount of variety in the exact semantics people want from coroutines (e.g. duplication of suspended coroutines, ability to inspect 'stack frames' for GC), when it comes to designing a portable bytecode (rather than a compiler), it's more flexible to properly support already-transformed code than to have the VM do the transformation.

Anyway, coroutines are just one example. Another example I can think of is implementing a VM-within-a-VM. While a more common feature of JITs is side exits, which don't require goto, there are situations that call for side entries - again, requiring goto into the middle of loops and such. Another would be optimized interpreters: not that interpreters targeting wasm can really match those targeting native code, which at minimum can improve performance with computed gotos, and can dip into assembly for more… but part of the motivation for computed gotos is to better leverage the branch predictor by giving each case its own jump instruction, so you might be able to replicate some of the effect by having a separate switch after each opcode handler, where the cases would all just be gotos. Or at least have an if or two to check for specific instructions that commonly come after the current one. There are some special cases of that pattern that might be representable with structured control flow, but not the general case. And so on…
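
To make the interpreter discussion concrete, here is a minimal sketch of the indirect-threaded dispatch pattern mentioned above, written in GNU C using the labels-as-values extension (GCC/Clang); the opcode set is invented purely for illustration. Each handler ends with its own computed goto, giving the branch predictor one prediction site per handler instead of a single shared switch:

/* Indirect-threaded dispatch: every handler performs its own
 * indirect jump to the next handler. */
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_PRINT, OP_HALT };

static int run(const unsigned char *pc) {
    void *dispatch[] = { &&op_inc, &&op_dec, &&op_print, &&op_halt };
    int acc = 0;

#define NEXT() goto *dispatch[*pc++]    /* per-handler indirect jump */

    NEXT();
op_inc:   acc++;                NEXT();
op_dec:   acc--;                NEXT();
op_print: printf("%d\n", acc);  NEXT();
op_halt:  return acc;
#undef NEXT
}

int main(void) {
    const unsigned char prog[] = { OP_INC, OP_INC, OP_PRINT, OP_DEC, OP_HALT };
    return run(prog) == 1 ? 0 : 1;
}

With the control flow wasm offers today, every one of those NEXT() jumps has to be funneled back through a single br_table inside a loop, which is the pattern being discussed here.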

Surely there's some way to allow arbitrary control flow without making the VM do a lot of work. Straw man idea, might be broken: you could have a scheme where jumps to child scopes are allowed, but only if the number of scopes you have to enter is less than a limit defined by the target block. The limit would default to 0 (no jumps from parent scopes), which preserves the current semantics, and a block's limit can't be greater than the parent block's limit + 1 (easy to check). And the VM would change its dominance heuristic from "X dominates Y if it is a parent of Y" to "X dominates Y if it is a parent of Y with distance greater than Y's child jump limit". (This is a conservative approximation, not guaranteed to represent the exact dominator set, but the same is true for the existing heuristic - it's possible for an inner block to dominate the bottom half of an outer one.) Since only code with irreducible control flow would need to specify a limit, it wouldn't increase code size in the common case.

Edit: Interestingly, that would basically make the block structure into a representation of the dominance tree. I guess it would be much simpler to express that directly: a tree of basic blocks, where a block is allowed to jump to a sibling, ancestor, or immediate child block, but not to a further descendant. I'm not sure how that best maps onto the existing scope structure, where a "block" can consist of multiple basic blocks with sub-loops in between.

@ghost

ghost commented Dec 14, 2016

FWIW: Wasm has a particular design, which is explained in just a few very significant words "except that the nesting restriction makes it impossible to branch into the middle of a loop from outside the loop".

If it were just a DAG then validation could just check that branches were forward, but with loops this would allow branching into the middle of the loop from outside the loop, hence the nested block design.

The CFG is only part of this design, the other part being data flow: there is a stack of values, and blocks can also be organized to unwind the values stack, which very usefully communicates live ranges to the consumer and saves work converting to SSA.

It is possible to extend wasm to be an SSA encoding (add pick, allow blocks to return multiple values, and have loop entries pop values), so interestingly the constraints demanded for efficient SSA decoding might not be necessary (because it could already be SSA encoded)! This leads to a functional language (which might have a stack style encoding for efficiency).

If this were extended to handle an arbitrary CFG then it might look like the following. This is an SSA style encoding, so values are constants. It seems to still fit the stack style to a large extent; I'm just not certain of all the details. Within blocks, branches could be made to any other labeled block in that set, or some other convention could be used to transfer control to another block. The code within a block might still usefully reference values higher up the values stack to save passing them all in.

(func f1 (arg1)
  (let ((c1 10)) ; Some values up the stack.
    (blocks ((b1 (a1 a2 a3)
                   ... (br b3)
               (br b2 (+ a1 a2 a3 arg1 c1)))
             (b2 (a1)
                 ... (br b1 ...))
             (b3 ()
                 ...))
   .. regular structured wasm ..
   (br b2 ...)
   ....
   (br b3)
    ...
   ))

But would web browsers ever handle this efficiently internally?

Would someone with a stack machine background recognize the code pattern and be able to match it to a stack encoding?

@ghost

ghost commented Dec 14, 2016

There is some interesting discussion on irreducible loops here http://bboissin.appspot.com/static/upload/bboissin-thesis-2010-09-22.pdf

I did not follow it all on a quick pass, but it mentions converting irreducible loops to reducible loops by adding an entry node. For wasm it sounds like adding a defined input to loops that is specifically for dispatching within the loop, similar to the current solution but with a defined variable for this. The above mentions this is virtualized, optimized away, in processing. Perhaps something like this could be an option?

If this is on the horizon, and given that producers already need to use a similar technique but using a local variable, then might it be worth considering now so that wasm produced early has potential to run faster on more advanced runtimes? This might also create an incentive for competition between the runtimes to explore this.

This would not exactly be arbitrary labels and gotos but something that these might be transformed into that has some chance of being efficiently compiled in future.
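
As a concrete illustration of the local-variable technique referred to above, here is a minimal C sketch (the function, names, and placeholder bodies are invented) of how a producer can turn a loop with a second entry point into a reducible loop with a single header by adding a dispatch variable:

extern int cond(void);   /* hypothetical loop condition */

/* Irreducible original:
 *
 *     if (enter_mid) goto mid;
 *     // A
 *     while (cond()) {
 *         // B
 *     mid:
 *         // C
 *     }
 *
 * Reducible rewrite: a local flag selects the side entry on the
 * first iteration, so the loop regains a single header.
 */
void example(int enter_mid) {
    int first = enter_mid;        /* the dispatch variable */
    if (!enter_mid) {
        /* A */
    }
    for (;;) {
        if (!first) {
            if (!cond()) break;   /* the original while (cond()) test */
            /* B */
        }
        first = 0;
        /* C */
    }
}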

@flagxor flagxor added this to the Future Features milestone Feb 3, 2017
@darkuranium

For the record, I am strongly with @oridb and @comex on this issue.
I think this is a critical issue that should be addressed before it is too late.

Given the nature of WebAssembly, any mistakes you make now are likely to stick for decades to come (look at Javascript!). That's why the issue is so critical; avoid supporting gotos now for whatever reason it is (e.g. to ease optimization, which is --- quite frankly --- a specific implementation's influence over a generic thing, and honestly, I think it's lazy), and you'll end up with problems in the long run.

I can already see future (or current, but in the future) WebAssembly implementations trying to special-case recognize the usual while/switch patterns to implement labels in order to handle them properly. That's a hack.

WebAssembly is clean slate, so now is the time to avoid dirty hacks (or rather, the requirements for them).

@jfbastien
Member

jfbastien commented Apr 13, 2017

@darkuranium :

WebAssembly as currently specified is already shipping in browsers and toolchains, and developers have already created code which takes the form laid out in that design. We therefore cannot change the design in a breaking manner.

We can, however, add to the design in a backward-compatible manner. I don't think any of those involved think goto is useless. I suspect we all regularly use goto, and not just in syntactic toy manners.

At this point in time, someone with motivation needs to come up with a proposal which makes sense and implement it. I don't see such a proposal being rejected if it provides solid data.

Given the nature of WebAssembly, any mistakes you make now are likely to stick for decades to come (look at Javascript!). That's why the issue is so critical; avoid supporting gotos now for whatever reason it is (e.g. to ease optimization, which is --- quite frankly --- a specific implementation's influence over a generic thing, and honestly, I think it's lazy), and you'll end up with problems in the long run.

So I'll call your bluff: I think having the motivation you show, and not coming up with a proposal and implementation as I detail above, is quite frankly lazy.

I'm being cheeky of course. Consider that we've got folks banging on our doors for threads, GC, SIMD, etc—all making passionate and sensible arguments for why their feature is most important—it would be great if you could help us tackle one of these issues. There are folks doing so for the other features I mention. None for goto thus far. Please acquaint yourself with this group's contributing guidelines and join the fun.

Otherwise I think goto is a great future feature. Personally I'd probably tackle others first, such as JIT code generation. That's my personal interest after GC and threads.

@cheery

cheery commented Apr 13, 2017

Hi. I am in the middle of writing a translation from webassembly to IR and back to webassembly, and I've had a discussion about this subject with people.

It's been pointed out to me that irreducible control flow is tricky to represent in webassembly. It can prove to be troublesome for optimizing compilers that occasionally write out irreducible control flow. It might be something like the loop below, which has multiple entry points:

if (x) goto inside_loop;
// banana
while(y) {
    // things
    inside_loop:
    // do things
}

EBB compilers would produce the following:

entry:
    cjump x, inside_loop
    // banana
    jump loop

loop:
    cjump y, exit
    // things
    jump inside_loop

inside_loop:
    // do things
    jump loop
exit:
    return

Next we get to translating this to webassembly. The problem is that although we figured out decompilers ages ago, they always had the option of emitting a goto for irreducible flows.

Before it gets translated, the compiler is going to do tricks on this. But eventually you get to scan through the code and position the beginnings and endings of the structures. You end up with the following candidates after you eliminate the fall-through jumps:

<inside_loop, if(x)>
    // banana
<loop °>
<exit if(y)>
    // things
</inside_loop, if(x)>
    // do things
</loop ↑>
</exit>

Next you need to build a stack out of these. Which one goes to the bottom? It is either the 'inside_loop' or the 'loop'. We can't decide, so we have to cut the stack and copy things around:

if
    // do things
else
    // banana
end
loop
  br out
    // things
    // do things
end

Now we can translate this to webassembly. Pardon me, I'm not yet familiar with how these loop constructs work out.

This is not a particular problem if we think about old software. It is likely that new software will be translated to web assembly. But the problem is in how our compilers work. They have been doing control flow with basic blocks for decades and assume anything goes.

Technically the language is translated in, then translated out. We only need a mechanism that allows values to flow across the boundaries neatly, without drama. Structured flow is only useful for people intending to read the code.

But for example, the following would work just as fine:

    cjump x, label(1)
    // banana
0: label
    cjump y, label(2)
    // things
1: label
    // do things
    jump label(0)
2: label
    // exit as usual, picking the values from the top of the stack.

The numbers would be implicit; that is, when the compiler sees a 'label', it knows that it starts a new extended block and gives it a new index number, incrementing from 0.

To produce a static stack, you could track how many items are on the stack when you encounter a jump to a label. If the stack ends up inconsistent after a jump to a label, the program is invalid.

If you find the above bad, you can also try adding an explicit stack length to each label (perhaps as a delta from the last indexed label's stack size, if the absolute value is bad for compression), and a marker on each jump saying how many values it copies from the top of the stack during the jump.

I could bet that you can't outsmart gzip in any way by how you represent the control flow, so you could choose the representation that's nice for the people who have the hardest work here. (I can illustrate the 'outsmarting gzip' point with my flexible compiler toolchain if you like -- just send me a message and let's put up a demo!)

@cheery

cheery commented Apr 14, 2017

I feel like a shatterhead right now. I just re-read the WebAssembly spec and picked up that irreducible control flow is intentionally left out of the MVP, perhaps because emscripten had to solve the problem in its early days.

How to handle irreducible control flow in WebAssembly is explained in the paper "Emscripten: An LLVM-to-JavaScript Compiler". The relooper reorganizes the program into something like this:

_b_ = bool(x)
_b_ == 0 if
  // banana
end
block loop
  _b_ if
    // do things
    _b_ = 0
  else
    y br_if 2
    // things
    _b_ = 1
  end
  br 0
end end

The rationale was that the structured control flow helps with reading the source code dump, and I guess it is believed to help the polyfill implementations.

The people compiling from webassembly will probably adapt to handle and separate the collapsed control flow.

@comex

comex commented Apr 16, 2017

So:

  • As mentioned, WebAssembly is now stable, so the time is past for any total rewrite of how control flow is expressed.
    • In one sense, that's unfortunate, because nobody actually tested whether a more directly SSA-based encoding could have achieved the same compactness as the current design.
    • However, when it comes to spec-ing out goto, that makes the job much easier! The block-based instructions are already beyond bikeshedding, and it's not a big deal to expect production compilers targeting wasm to express reducible control flow using them - the algorithm is not that hard. The main problem is that a small fraction of control flow cannot be expressed using them without a performance cost. If we solve that by adding a new goto instruction, we don't have to worry nearly as much about encoding efficiency as we would with a total redesign. Code using goto should still be reasonably compact, of course, but it doesn't have to compete with other constructs for compactness; it's only for irreducible control flow and should be used rarely.
  • Reducibility is not particularly useful.
    • Most compiler backends use an SSA representation based on a graph of basic blocks and branches between them. The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start.
      • I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern. (V8 is more complicated - some kind of "sea of nodes" representation rather than basic blocks - but also throws away the nesting structure.)
    • Exception: Loop analysis can be useful, and all three of those implementations pass through information to the IR about which basic blocks are the starts of loops. (Compare to LLVM which, as a 'heavyweight' backend designed for AOT compilation, throws it away and recalculates it in the backend. This is more robust, since it can find things that don't look like loops in the source code but do after a bunch of optimizations, but slower.)
      • Loop analysis works on "natural loops", which forbid branches into the middle of the loop that don't go through the loop header.
      • WebAssembly should continue to guarantee that loop blocks are natural loops.
      • But loop analysis doesn't require that the whole function be reducible, nor even the inside of the loop: it just forbids branches from outside to inside. The base representation is still an arbitrary control flow graph.
    • Irreducible control flow does make it harder to compile WebAssembly to JavaScript (polyfilling), as the compiler would have to run the relooper algorithm itself.
      • But WebAssembly already makes multiple decisions that add significant runtime overhead to any compile-to-JS approach (including unaligned memory access support and trapping on out-of-bounds accesses), suggesting that it's not considered very important.
      • Compared to that, making the compiler a little more complex is not a big deal.
    • Therefore, I don't think there's a good reason not to add some kind of support for irreducible control flow.
  • The main information needed to build a SSA representation (which, by design, should be possible in one pass) is the dominator tree.
    • Currently a backend can estimate dominance based on structured control flow. If I understand the spec correctly, the following instructions end a basic block:
      • block:
        • The BB starting the block is dominated by the previous BB.*
        • The BB following the corresponding end is dominated by the BB starting the block, but not by the BB before end (because it'll be skipped if there was a br out).
      • loop:
        • The BB starting the block is dominated by the previous BB.
        • The BB after end is dominated by the BB before end (since you can't get to the instruction after end except by executing end).
      • if:
        • The if side, the else side, and the BB after end are all dominated by the BB before if.
      • br, return, unreachable:
        • (The BB immediately after br, return, or unreachable is unreachable.)
      • br_if, br_table:
        • The BB before br_if/br_table dominates the one after it.
    • Notably, this is only an estimate. It can't produce false positives (saying A dominates B when it actually doesn't) because it only says so when there's no way to get to B without going through A, by construction. But it can produce false negatives (saying A doesn't dominate B when it actually does), and I don't think a single-pass algorithm can detect those (could be wrong).
      • Example false negative:
        block $outer
          loop
            br $outer ;; since this unconditionally breaks, it secretly dominates the end BB
          end
        end
        
      • But that's OK, AFAIK.
        • False positives would be bad, because e.g. if basic block A is said to dominate basic block B, the machine code for B can use a register set in A (if nothing in between overwrites that register). If A doesn't actually dominate B, the register might have a garbage value.
        • False negatives are essentially ghost branches that never occur. The compiler assumes that those branches could occur, but not that they must, so the generated code is just more conservative than necessary.
    • Anyway, think about how a goto instruction should work in terms of the dominator tree. Suppose that A dominates B, which dominates C.
      • We can't jump from A to C because that would skip B (violating the assumption of dominance). In other words, we can't jump to non-immediate descendants. (And on the binary producer's end, if they calculated the true dominator tree, there will never be such a jump.)
      • We could safely jump from A to B, but goto'ing to an immediate descendant is not that useful. It's basically equivalent to an if or switch statement, which we can do already (using the if instruction if there's only a binary test, or br_table if there are multiple).
      • Also safe, and more interesting, is jumping to a sibling or an ancestor's sibling. If we jump to our sibling, we preserved the guarantee that our parent dominates our sibling, because we must have already executed our parent to get here (since it dominates us too). Similarly for ancestors.
      • In general, a malicious binary could produce false negatives in dominance this way, but as I said, those are (a) already possible and (b) acceptable.
  • Based on that, here's a strawman proposal:
    • One new block-type instruction:

      • labels resulttype N instr* end
    • There must be exactly N immediate children instructions, where "immediate child" means either a block-type instruction (loop, block, or labels) and everything up to the corresponding end, or a single non-block instruction (which must not affect the stack).

    • Instead of creating a single label like other block-type instructions, labels creates N+1 labels: N pointing to the N children, and one pointing to the end of the labels block. In each of the children, label indices 0 to N-1 refer to the children, in order, and label index N refers to the end.

      In other words, if you have

      loop ;; outer
        labels 3
            block ;; child 0
                br X
            end
            nop ;; child 1
            nop ;; child 2
        end
      end
      

      Depending on X, the br refers to:

      X Target
      0 end of the block
      1 child 0 (beginning of the block)
      2 child 1 (nop)
      3 child 2 (nop)
      4 end of labels
      5 beginning of outer loop
    • Execution starts at the first child.

    • If execution reaches the end of one of the children, it continues to the next. If it reaches the end of the last child, it goes back to the first child. (This is for symmetry, because the order of the children is not meant to be significant.)

    • Branching to one of the children unwinds the operand stack to its depth at the start of labels.

    • So does branching to the end, but if resulttype is nonempty, branching to the end pops an operand and pushes it after unwinding, similar to block.

    • Dominance: The basic block before the labels instruction dominates each of the children, as well as the BB after the end of labels. The children don't dominate each other or the end.

    • Design notes:

      • N is specified up front so that the code can be validated in one pass. It would be weird to have to get to the end of the labels block, to know the number of children, before knowing the targets of the indices in it.

      • Not sure if there should eventually be a way to pass values on the operand stack between labels, but by analogy with the inability to pass values into a block or loop, that can be unsupported to start with.

@qwertie

qwertie commented Apr 16, 2017

It would be really nice if it were possible to jump into a loop though, wouldn't it? IIUC, if that case were accounted for then the nasty loop+br_table combo would never be needed...

Edit: oh, you can make a loop without loop by jumping upward in labels. Can't believe I missed that.

@comex

comex commented Apr 16, 2017

@qwertie If a given loop is not a natural loop, the wasm-targeting compiler should express it using labels instead of loop. It should never be necessary to add a switch to express control flow, if that's what you're referring to. (After all, at worst you could just use one giant labels block with a label for every basic block in the function. This doesn't let the compiler know about dominance and natural loops, so you may miss out on optimizations. But labels is only required in cases where those optimizations aren't applicable.)

@lukewagner
Member

The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start. [...] I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern.

Not quite: at least in SM, the IR graph is not a fully general graph; we assume certain graph invariants that follow from being generated from a structured source (JS or wasm) and often simplify and/or optimize the algorithms. Supporting a fully general CFG would either require auditing/changing many of the passes in the pipeline to not assume these invariants (either by generalizing them or pessimizing them in case of irreducibility) or node-splitting duplication up front to make the graph reducible. This is certainly doable, of course, but it's not true that this is simply a matter of wasm being an artificial bottleneck.

Also, the fact that there are many options and different engines will do different things suggests that having the producer deal with irreducibility up front will produce somewhat more predictable performance in the presence of irreducible control flow.

When we've discussed backwards-compatible paths for extending wasm with arbitrary goto support in the past, one big question is what's the use case here: is it "make producers simpler by not having to run a relooper-type algorithm" or is it "allow more efficient codegen for actually-irreducible control flow"? If it's just the former, then I think we probably would want some scheme of embedding arbitrary labels/gotos (that is both backwards compatible and also composes with future block-structured try/catch); it's just a matter of weighing cost/benefit and the issues mentioned above.

But for the latter use case, one thing we've observed is that, while you do every now and then see a Duff's device case in the wild (which isn't actually an efficient way to unroll a loop...), often where you see irreducibility pop up where performance matters is interpreter loops. Interpreter loops also benefit from indirect threading which needs computed goto. Also, even in beefy offline compilers, interpreter loops tend to get the worst register allocation. Since interpreter loop performance can be pretty important, one question is whether what we really need is a control flow primitive that allows the engine to perform indirect threading and do decent regalloc. (This is an open question to me.)

@comex

comex commented Apr 17, 2017

@lukewagner
I'd like to hear more detail about which passes are depending on invariants. The design I proposed, using a separate construct for irreducible flow, should make it relatively easy for optimization passes like LICM to steer clear of that flow. But if there are other types of breakage I'm not thinking of, I'd like to understand their nature better so I can get a better idea of whether and how they can be avoided.

When we've discussed backwards-compatible paths for extending wasm with arbitrary goto support in the past, one big question is what's the use case here: is it "make producers simpler by not having to run a relooper-type algorithm" or is it "allow more efficient codegen for actually-irreducible control flow"?

For me it's the latter; my proposal expects producers to still run a relooper-type algorithm to save the backend the work of identifying dominators and natural loops, falling back to labels only when necessary. However, this would still make producers simpler. If irreducible control flow has a large penalty, an ideal producer should work very hard to avoid it, using heuristics to determine whether it's more efficient to duplicate code, the minimal amount of duplication that can work, etc. If the only penalty is potentially giving up loop optimizations, this isn't really necessary, or at least is no more necessary than it would be with a regular machine code backend (which has its own loop optimizations).

I really should gather more data on how common irreducible control flow is in practice…

However, my belief is that penalizing such flow is essentially arbitrary and unnecessary. In most cases, the effect on overall program runtime should be small. However, if a hotspot happens to include irreducible control flow, there will be a severe penalty; in the future, WebAssembly optimization guides might include this as a common gotcha, and explain how to identify and avoid it. If my belief is correct, this is an entirely unnecessary form of cognitive overhead for programmers. And even when the overhead is small, WebAssembly already has enough overhead compared to native code that it should seek to avoid any extra.

I'm open to persuasion that my belief is incorrect.

Since interpreter loop performance can be pretty important, one question is whether what we really need is a control flow primitive that allows the engine to perform indirect threading and do decent regalloc.

That sounds interesting, but I think it would be better to start with a more general-purpose primitive. After all, a primitive tailored for interpreters would still require backends to deal with irreducible control flow; if you're going to bite that bullet, may as well support the general case too.

Alternately, my proposal might already serve as a decent primitive for interpreters. If you combine labels with br_table, you get the ability to point a jump table directly at arbitrary points in the function, which is not that different from a computed goto. (As opposed to a C switch, which at least initially directs control flow to points within the switch block; if the cases are all gotos, the compiler should be able to optimize away the extra jump, but it might also coalesce multiple 'redundant' switch statements into one, ruining the benefit of having a separate jump after each instruction handler.) I'm not sure what the issue with register allocation is, though...

@lukewagner
Member

@comex I guess one could simply turn off whole optimization passes at the function level in the presence of irreducible control flow (although SSA generation, regalloc, and probably a few others would still be needed and thus require work), but I was assuming we wanted to actually generate quality code for functions with irreducible control flow, and that involves auditing each algorithm that previously assumed a structured graph.

@rossberg
Member

rossberg commented Apr 19, 2017 via email

@stoklund

Loop optimizations in LLVM will generally ignore irreducible control flow and not attempt to optimize it. The loop analysis they're based on will only recognize natural loops, so you just have to be aware that there can be CFG cycles that are not recognized as loops. Of course, other optimizations are more local in nature and work just fine with irreducible CFGs.

From memory, and probably wrong, SPEC2006 has a single irreducible loop in 401.bzip2 and that's it. It's quite rare in practice.

Clang will only emit a single indirectbr instruction in functions using computed goto. This has the effect of turning threaded interpreters into natural loops with the indirectbr block as a loop header. After leaving LLVM IR, the single indirectbr is tail-duplicated in the code generator to reconstruct the original tangle.

@titzer

titzer commented May 1, 2017 via email

@titzer

titzer commented May 1, 2017 via email

@aardappel

@conrad-watt as you just mentioned in the CG meeting, I think we'd be very interested in seeing details on what your multi-loop would look like.

@conrad-watt
Contributor

@aardappel yeah, life has come at me fast, but I should do this in the next meeting. Just to emphasise that the idea isn't mine since @rossberg originally sketched it in response to the first draft of funclets.

@titzer

titzer commented Mar 3, 2021

One reference that might be instructive is a bit dated, but it generalizes familiar notions of loops to handle irreducible ones using DJ graphs.

conrad-watt added a commit to WebAssembly/meetings that referenced this issue Mar 7, 2021
I'm carving out a fairly large time slot for discussion, but if too many other items come along this should probably be reduced. Conversely, I'm sure the [debate](WebAssembly/design#796) has the potential to take up all available time, so this should probably be timetabled last. To be clear, not planning to poll anything.
dtig pushed a commit to WebAssembly/meetings that referenced this issue Mar 8, 2021
I'm carving out a fairly large time slot for discussion, but if too many other items come along this should probably be reduced. Conversely, I'm sure the [debate](WebAssembly/design#796) has the potential to take up all available time, so this should probably be timetabled last. To be clear, not planning to poll anything.
@conrad-watt
Contributor

We've had a couple of discussion sessions about this in the CG, and I've written up a summary and follow-up document. Because of the length I've made it a separate gist.

https://gist.github.com/conrad-watt/6a620cb8b7d8f0191296e3eb24dffdef

I think the two immediate actionable questions (see the follow-up section for more details) are:

  • Can we find "wild" programs which are currently suffering and would benefit performance-wise from multiloop? These may be programs for which LLVM transformations introduce irreducible control flow even if it does not exist in the source program.
  • Is there a world where multiloop is implemented producer-side first, with some linking/translation deployment layer for "Web" Wasm?

There is probably also a more free-wheeling discussion to be had on the consequences of the exception handling issues I discuss in the follow-up document, and of course standard bikeshedding about semantic details if we move forward with anything concrete.

Because these discussions may branch somewhat, it may be appropriate to spin some of them into issues in the funclets repository.

@neelance

neelance commented Apr 7, 2021

I am very happy to see progress on this issue. A huge "Thank you" to all people involved!

Can we find "wild" programs which are currently suffering and would benefit performance-wise from multiloop? These may be programs for which LLVM transformations introduce irreducible control flow even if it does not exist in the source program.

I'd like to caution a bit against circular reasoning: Programs that currently have bad performance are less likely to occur "in the wild" for exactly this reason.

I think most Go programs should benefit a lot. The Go compiler either needs WebAssembly coroutines or multiloop to be able to emit efficient code that supports Go's goroutines.

@RossTate

RossTate commented Apr 8, 2021

Precompiled regular-expression matchers, along with other precompiled state machines, often result in irreducible control flow. It's hard to say whether or not the "fusion" algorithm for Interface Types will result in irreducible control flow.

@aardappel

aardappel commented Apr 8, 2021

  • Agree this discussion should be moved to issues on the funclets (or a new) repo.
  • Agree that finding programs that would benefit from it is hard to quantify without having LLVM (and Go, and others) actually emit the most optimal control flow (which may be irreducible). The inefficiency caused by FixIrreducibleControlFlow and friends may be a "death by a thousand cuts" problem across a large binary.
  • While I would welcome a tools-only implementation as the absolute minimum progress coming out of this discussion, it would still not be optimal, as producers now have the tough choice of making use of this functionality for convenience (but then face unpredictable performance regressions/cliffs), or do the hard work to wrangle their output to standard wasm to have things be predictable.
  • If it were decided that "gotos" are at best a tools-only feature, I'd argue that you could probably get away with an even simpler feature than multiloop, since all you care about is producer convenience. At the absolute minimum, a goto <function_byte_offset> would be all that needs to be inserted into regular Wasm function bodies to allow either WABT or Binaryen to transform it into legal Wasm. Things like type signatures are useful if engines need to verify a multiloop quickly, but if it's a convenience tool, might as well make it maximally convenient to emit.
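To illustrate the "thousand cuts" point above, here is roughly what such a transformation does; a made-up C sketch of the general loop-plus-dispatch shape that relooper/FixIrreducibleControlFlow-style passes produce, not LLVM's literal output:

```c
#include <stdio.h>
#include <stdbool.h>

/* Basic-block bodies, stubbed out for illustration (hypothetical). */
static void f(void) { printf("f\n"); }
static void g(void) { printf("g\n"); }

/* Original, irreducible shape: the loop formed by L1 and L2 has two entry
   points (both reachable from the function entry), so it cannot be
   expressed directly with Wasm's block/loop/br constructs. */
static void irreducible(bool p, int q, int r) {
    if (p) goto L1; else goto L2;
L1: f(); if (q-- > 0) goto L2; return;
L2: g(); if (r-- > 0) goto L1; return;
}

/* The general shape such a pass produces instead: one dispatch loop over a
   label variable. The result is reducible, but every transfer between the
   original blocks now takes an extra branch through the dispatcher, which
   is the per-transfer cost mentioned above. */
static void made_reducible(bool p, int q, int r) {
    int label = p ? 1 : 2;
    for (;;) {
        switch (label) {
        case 1: f(); if (q-- > 0) { label = 2; break; } return;
        case 2: g(); if (r-- > 0) { label = 1; break; } return;
        }
    }
}

int main(void) {
    irreducible(true, 1, 1);     /* prints f g f */
    made_reducible(true, 1, 1);  /* same trace: f g f */
    return 0;
}
```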

@kripken
Copy link
Member

kripken commented Apr 8, 2021

Agree that finding programs that would benefit from it is hard to quantify without having LLVM (and Go, and others) actually emit the most optimal control flow (which may be irreducible).

I agree that testing on modified toolchains + VMs would be optimal. But we can compare current wasm builds to native builds which do have optimal control flow. Not So Fast and others have looked at this in various ways (performance counters, direct investigation) and have not found irreducible control flow to be a significant factor.

@RossTate
Copy link

RossTate commented Apr 8, 2021

More specifically, they didn't find it to be a significant factor for C/C++. That might have more to do with C/C++ than with the performance of irreducible control flow. (I honestly don't know.) It sounds like @neelance has reason to believe the same would not be true for Go.

My sense is that there are multiple facets to this problem, and it's worthwhile tackling it from multiple directions.

First, it sounds like there's a general issue with the generatability of WebAssembly. Much of that is caused by WebAssembly's constraint to have a compact binary with efficient type-checking and streaming compilation. We could address this issue at least partly by developing a standardized "pre"-WebAssembly that is easier to generate but which is guaranteed to be translatable to "true" WebAssembly, ideally through just code duplication and insertion of "erasable" instructions/annotations, with at least some tool providing such translation.

Second, we can consider what features of "pre"-WebAssembly are worth directly incorporating into "true" WebAssembly. We can do this in an informed manner because we will have "pre"-WebAssembly modules that we can analyze before they have been contorted into "true" WebAssembly modules.

@jfmc
Copy link

jfmc commented Apr 9, 2021

Some years ago I tried compiling a particular bytecode emulator for a dynamic language (https://github.com/ciao-lang/ciao) to WebAssembly, and the performance was far from optimal (sometimes 10 times slower than the native version). The main execution loop contained a large bytecode dispatch switch, and the engine has been finely tuned for decades to run on actual hardware, making heavy use of labels and gotos. I wonder whether this kind of software would benefit from support for irreducible control flow, or whether the problem was something else. I didn't have time to investigate further, but I'd be happy to try again if things are known to have improved. Of course I understand that compiling other languages' VMs to wasm is not the main use case, but it would be good to know whether this will eventually be feasible, especially since universal binaries that run efficiently everywhere are one of the promised advantages of wasm. (Thanks, and apologies if this particular topic has been discussed in some other issue.)

@RossTate
Copy link

RossTate commented Apr 9, 2021

@jfmc My understanding is that, if the program is realistic (i.e. not contrived in order to be pathological) and you care about its performance, then it is a perfectly valid use case. WebAssembly aims to be a good general-purpose target. So I think it would be great to gain an understanding of why you saw such significant slowdown. If that happens to be due to restrictions on control flow, then that would be very useful to know in this discussion. If it happens to be due to something else, then that would still be useful to know for how to improve WebAssembly in general.

@d4tocchini
Copy link

TinyCC with a WebAssembly backend would be awesome... a fast in-browser C compiler

https://lists.gnu.org/archive/html/tinycc-devel/2020-02/msg00017.html

Another motivating reminder of what could be, if & hopefully when.

@tiran
Copy link

tiran commented May 16, 2022

The Python port to WebAssembly would benefit greatly from computed gotos. When we added computed goto support to CPython's ceval loop (the core bytecode VM loop) many years ago, we saw an overall performance improvement of approximately 15 to 20% on platforms that use GCC and Clang. I assume that adding computed goto support to the WASM compiler toolchain would result in similar speedups.

Eli Bendersky's old blog post explains how Python's ceval loop benefits from the more efficient machine code that computed gotos produce.
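For illustration, here is a minimal sketch of the two dispatch styles with a made-up three-opcode bytecode; it is not CPython's actual ceval code, and the second function relies on the GCC/Clang "labels as values" extension (&&label, goto *ptr):

```c
#include <stdio.h>

/* Toy bytecode: OP_PUSH1 pushes 1, OP_ADD pops two values and pushes their
   sum, OP_HALT returns the top of the stack. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Switch-based dispatch: one shared indirect branch for every opcode. */
static int run_switch(const unsigned char *code) {
    int stack[64], sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH1: stack[sp++] = 1; break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT:  return stack[sp - 1];
        }
    }
}

/* Computed-goto dispatch: one indirect jump per handler, so the branch
   predictor gets a separate prediction site for each opcode pair. */
static int run_goto(const unsigned char *code) {
    static void *dispatch[] = { &&do_push1, &&do_add, &&do_halt };
    int stack[64], sp = 0;
    goto *dispatch[*code++];
do_push1: stack[sp++] = 1;                  goto *dispatch[*code++];
do_add:   sp--; stack[sp - 1] += stack[sp]; goto *dispatch[*code++];
do_halt:  return stack[sp - 1];
}

int main(void) {
    const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    printf("%d %d\n", run_switch(prog), run_goto(prog)); /* prints "2 2" */
    return 0;
}
```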

@hoodmane @rth @mdboom

@taralx
Copy link

taralx commented May 16, 2022

I read the blog post, and it seems to be a result of their version of C not having an "unreachable" statement.

However, in wasm you won't see those improvements since even a computed goto has to have validity checks unless the engine can infer them statically.

@hoodmane
Copy link

hoodmane commented May 16, 2022

@taralx I think the more important part is the branch prediction:

for each jump, the branch predictor keeps a prediction of where it will jump next. If there's a jump per opcode, this is equivalent to predicting the second opcode in an opcode pair, which actually has some chance of success from time to time. On the other hand, if there's just a single jump, the prediction is shared between all opcodes and they keep stepping on each other's toes with each iteration.

I can't say for sure which one of the two factors weighs more in the speed difference between the switch and the computed goto, but if I had to guess I'd say it's the branch prediction.

Also:

in wasm you won't see those improvements since even a computed goto has to have validity checks unless the engine can infer them statically.

This is the point of the funclet proposal, which would be perfect for our use case. The entire table is statically determined so all of the validity checks are static.

@elfeiin
Copy link

elfeiin commented Aug 28, 2022

I think I may have a solution, although it does involve changing the way WebAssembly works.

Instead of compiling to IR that might be JIT'd, executed, or what have you, change the WASM standard to be a proper assembly, i.e. with jumps and all that. This makes it hard to recompile to a faster assembly, but some people wouldn't need to recompile if it already had jumps.

For those that want to use it as an IR, there should be a way to make a pointer manifest to be included with the assembly somehow, to aid in recompiling to client code. This pointer manifest would do the same job as blocks would in defining spans.

Amendment A

Reading through, I have seen comments that this is not possible given the fact that control flow is stable. This is a niche that many would like to see filled, however.

@enderger
Copy link

enderger commented Jan 18, 2023

If you're making a language (like an esolang based on early BASICs), goto is almost a necessity for the backend. It's currently looking like the project I'm working on will need to sacrifice some accuracy to avoid deeply nesting everything in blocks. A goto-based format also seems to work better for multi-backend compilers that might already compile everything into a load of jump instructions emitted as assembly, GNU Lightning IR, or some other unstructured format.

@eira-fransham
Copy link

I think I may have a solution, although it does involve changing the way WebAssembly works.

Instead of compiling to IR that might be JIT'd, executed, or what have you, change the WASM standard to be a proper assembly, i.e. with jumps and all that. This makes it hard to recompile to a faster assembly, but some people wouldn't need to recompile if it already had jumps.

For those that want to use it as an IR, there should be a way to make a pointer manifest to be included with the assembly somehow, to aid in recompiling to client code. This pointer manifest would do the same job as blocks would in defining spans.

Amendment A

Reading through, I have seen comments that this is not possible given the fact that control flow is stable. This is a niche that many would like to see filled, however.

To clarify, since I think there is a persistent misunderstanding here: arbitrary jumps would not make recompiling to faster assembly more difficult. It is reasonably simple to detect irreducible control flow, and compiling reducible control flow expressed with simple jumps is arguably easier than compiling higher-level, nested control flow. That's why every compiler apart from V8 converts the source language's high-level control flow to jumps before optimisation (and well before translating to assembly). Irreducible control flow could make some regalloc constraint solving a bit more of a hassle, but in the worst case it would only make irreducible control flow a little slower than reducible control flow, instead of the current situation where irreducible control flow is entirely inexpressible and therefore must be emulated by compilers targeting Wasm.

Let me reiterate: arbitrary jumps which happen to express reducible control flow are just as easy to compile as high-level control flow constructs. V8 is the only compiler to my knowledge that would find it particularly difficult to compile irreducible control flow, and the relooper algorithm which LLVM uses when generating Wasm could simply be integrated into V8 itself if gotos were added to the Wasm standard.

Not to mention that my knowledge of V8's inability to handle irreducible control flow is years old; they might have updated the internals in the past 3+ years, in which case there would be no compiler, to my knowledge, that finds gotos harder to handle than nested control flow.
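To make "reasonably simple to detect" concrete, here is a sketch of the textbook T1/T2 reducibility test (Hecht and Ullman), using an adjacency matrix for brevity; this is just the classical algorithm, not what V8 or any other engine actually does:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAXN 16

static bool edge[MAXN][MAXN];   /* edge[u][v]: CFG edge u -> v */
static bool alive[MAXN];        /* node still present in the reduced graph */

/* T1 deletes a self-loop u -> u; T2 merges a non-entry node with exactly
   one predecessor into that predecessor. A CFG is reducible iff repeated
   application collapses it to the single entry node. */
static bool is_reducible(int n, int entry) {
    bool changed = true;
    while (changed) {
        changed = false;
        /* T1: remove self-loops. */
        for (int u = 0; u < n; u++)
            if (alive[u] && edge[u][u]) { edge[u][u] = false; changed = true; }
        /* T2: merge single-predecessor nodes into their predecessor. */
        for (int v = 0; v < n; v++) {
            if (!alive[v] || v == entry) continue;
            int pred = -1, count = 0;
            for (int u = 0; u < n; u++)
                if (alive[u] && edge[u][v]) { pred = u; count++; }
            if (count == 1 && pred != v) {
                for (int w = 0; w < n; w++)
                    if (edge[v][w]) edge[pred][w] = true;  /* pred inherits v's successors */
                edge[pred][v] = false;
                alive[v] = false;
                changed = true;
            }
        }
    }
    int remaining = 0;
    for (int u = 0; u < n; u++) remaining += alive[u];
    return remaining == 1;
}

int main(void) {
    /* The textbook irreducible shape: the entry branches to both 1 and 2,
       which form a loop with two entry points. */
    memset(edge, 0, sizeof edge);
    for (int i = 0; i < 3; i++) alive[i] = true;
    edge[0][1] = edge[0][2] = true;
    edge[1][2] = edge[2][1] = true;
    printf("reducible: %s\n", is_reducible(3, 0) ? "yes" : "no");  /* prints "no" */
    return 0;
}
```

A production compiler would run this (or a dominator-based equivalent) on its own CFG representation, but the point stands that the check itself is not the hard part.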

@kripken
Copy link
Member

kripken commented Apr 2, 2023

the relooper algorithm which LLVM uses when generating Wasm could just be integrated into V8 itself if gotos were implemented into the Wasm standard

In theory yes, but as mentioned 6 years ago in this discussion, that would move a significant amount of work from the toolchain side to the client. In general wasm tries to do the opposite (in order to achieve fast startup, and to reduce the risk of client-side bugs).

Meanwhile, in the years since the earlier discussions here, wasm has added more dependencies on structured control flow, like non-nullable locals (quick overview) and exceptions. All those things would either not be compatible with arbitrary gotos or would require work to figure out how to support them there. Overall it's not impossible to support irreducibility, but I think it is getting harder over time.

@danielzgtg
Copy link

I used to think the feature requested in this issue was necessary. Then I had the idea of an inverse relooper algorithm. Every compiler would run the relooper, and V8 could run that output directly and call it a day. Runtimes that support irreducible CFGs could detect from the giant loop-and-switch pattern that the relooper was used, and invert the algorithm to get back the original CFG.

@titzer
Copy link

titzer commented Apr 3, 2023

@danielzgtg This is what I was suggesting in the comment linked above. The algorithm to undo the loop over the switch is a relatively straightforward generalization of tail duplication, a well-known compiler optimization that can be done by the engine after it's created a CFG. If that CFG is in SSA form, it will look like this:

loop block S // predecessors A, B, ....
  x = phi(0, 1, ...)
  switch (x)
    0 => block B
    1 => block A
    ...

block A
 ...
 goto S // corresponds to setting x = 0

block B
 ...
 goto S // corresponds to setting x = 1

Block S switches on a phi defined in S. Blocks A and B both branch to S, and their corresponding phi inputs are constants. This means that if S were tail-duplicated into A and B separately, each copy would fold x to 0 or 1 respectively, which leads to constant folding of the switch and yields a direct goto.

Then we end up with:

block A
  ...
  goto B

block B
  ...
  goto A

Which is your classic irreducible loop.

All this is to say, it's always been within engines' power to do this optimization and reintroduce irreducible loops, if they are prepared to handle them throughout the rest of their backends.

For the record, as far as I can tell, all the web engines I am aware of have exactly the same limitation as TurboFan in their optimizing tiers: none support irreducible control flow, but they could, given the above.
