
Excessive compile time for "simple" module #10576

Open
@alexcrichton

OSS-Fuzz has had an open bug for this for quite some time but I'm only just now getting around to filing an issue about this. The gist of the fuzz bug is that Wasmtime times out in the compile test target where the only thing that target does is compile a module. A timeout here means that the 60s time limit is exceeded, and the time limit here is with sanitizers and parallelism disabled on OSS-Fuzz infrastructure. This can be roughly approximated by running locally in release mode with -Cparallel-compilation=n and multiplying the result by ~30.

The module in question is:

(module 
    (func)
    (func)
    ;; ... 119248 times ...
    (func)
)

In other words, this is just a giant module consisting of a lot of empty functions. foo.wasm.gz is the compressed version of this module.
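For anyone who wants to regenerate a module of this shape rather than download the attachment, here is a minimal sketch. The helper name `empty_funcs_wat` is made up for illustration, and the function count is a parameter because the exact count is only approximately recoverable from the elided listing above.

```rust
/// Hypothetical helper: builds the text (WAT) form of a module that
/// contains `count` empty functions and nothing else.
fn empty_funcs_wat(count: usize) -> String {
    // Each function line is 11 bytes ("    (func)\n"); reserve up front
    // so the string does not reallocate while it grows.
    let mut wat = String::with_capacity(16 + count * 11);
    wat.push_str("(module\n");
    for _ in 0..count {
        wat.push_str("    (func)\n");
    }
    wat.push_str(")\n");
    wat
}
```

Writing the result of `empty_funcs_wat(119_251)` to a file gives a text module of roughly the same shape. It may need converting to binary first (for example with a tool such as wat2wasm) if the build of Wasmtime in use does not accept the text format directly.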

Locally I see:

$ time wasmtime compile -C parallel-compilation=n foo.wasm
wasmtime compile -C parallel-compilation=n foo.wasm  1.05s user 0.15s system 99% cpu 1.205 total

which, for a bunch of empty functions, is quite a lot! There are a lot of functions in this module, but 10 microseconds per empty function feels a bit excessive regardless. While optimizing this probably won't help much in the long term, it's perhaps still worthwhile to try to improve it, if only for OSS-Fuzz timeouts and fuzzing.
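The per-function figure comes from dividing the measured time by the number of functions. A small sketch of that arithmetic follows; the function name is hypothetical and the count of 119,251 is an assumption read off the module listing (two leading functions, 119248 elided, one trailing).

```rust
/// Hypothetical helper: average compile cost per function, in microseconds.
fn micros_per_function(total_seconds: f64, functions: u64) -> f64 {
    total_seconds * 1e6 / functions as f64
}
```

With the 1.205s wall-clock time above and about 119k functions this works out to roughly 10 microseconds per function.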

A profile of the compilation looks like this, which notably spends the most "self" time in memmove. There's also a fair amount of allocation traffic. I'm not sure how to improve the regalloc2 parts myself, but memmove is something I do know how to improve.

Moving MachBuffer less

The basic problem I've seen is that we're pretty liberal in Cranelift about moving data structures by ownership between phases, notably the MachBuffer<T>. This type is very large (lots of SmallVec) and is created/moved quite a lot throughout a compilation. This I believe adds up to quite a lot of memmove costs.

The movements I've seen are:

In general I don't think rustc/LLVM are even capable of eliding most of these copies, which means that for each function we're copying this very large structure ~6 times (ish). Multiply that by ~100k functions and by the size of the structure and that's a lot of memory moving around, which can probably explain at least a good portion of the second of compile time for this module.
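To put a rough scale on that multiplication, here is a sketch. `BigBuffer` is a hypothetical stand-in with made-up field sizes, not the real MachBuffer layout; it only mimics the idea of a struct whose storage lives inline (as with SmallVec), so that moving it by value copies every byte.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a buffer with lots of inline storage.
/// The field sizes are invented for illustration only.
#[allow(dead_code)]
struct BigBuffer {
    data: [u8; 1024],
    relocs: [u64; 16],
    traps: [u64; 16],
    len: usize,
}

/// Size in bytes of one by-value move of the stand-in struct.
fn big_buffer_size() -> usize {
    size_of::<BigBuffer>()
}

/// Total bytes copied if a struct of `struct_size` bytes is moved
/// `moves_per_function` times for each of `functions` functions.
fn bytes_moved(struct_size: usize, moves_per_function: usize, functions: usize) -> usize {
    struct_size * moves_per_function * functions
}
```

For a struct in the low kilobytes, ~6 moves per function over ~100k functions adds up to hundreds of megabytes of copying, all of it for functions whose generated code is tiny.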

Ideally we would refactor Cranelift to require much less movement of the MachBuffer type. In an ideal world we could even reuse MachBuffer structures between compiling functions too. In any case we can probably get a long-ish way by restructuring how the MachBuffer is owned and passed around.
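As a sketch of the reuse idea only: the types and functions below are hypothetical and are not Cranelift's API. The point is that one buffer is borrowed mutably by each per-function step and cleared between uses, so it is never moved by value and its allocations are kept.

```rust
/// Hypothetical stand-in for a code buffer; not the real MachBuffer.
#[derive(Default)]
struct CodeBuffer {
    bytes: Vec<u8>,
    relocs: Vec<u32>,
}

impl CodeBuffer {
    /// Empties the buffer but keeps its allocated capacity for reuse.
    fn clear(&mut self) {
        self.bytes.clear();
        self.relocs.clear();
    }
}

/// Hypothetical per-function step: emits into a borrowed buffer
/// instead of creating one and returning it by value.
fn compile_empty_function(buf: &mut CodeBuffer) {
    buf.clear();
    buf.bytes.push(0xC3); // placeholder body: x86-64 `ret`
}

/// Compiles `count` empty functions with a single reused buffer.
/// Returns the total bytes emitted and the buffer's final capacity.
fn compile_all(count: usize) -> (usize, usize) {
    let mut buf = CodeBuffer::default();
    let mut total = 0;
    for _ in 0..count {
        compile_empty_function(&mut buf);
        total += buf.bytes.len();
    }
    (total, buf.bytes.capacity())
}
```

Whether the real pipeline can be restructured this way depends on who needs to own the finished code after each function, which this sketch ignores.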

Other compiler structures

I've seen other compiler structures in the profile that are also quite large, such as CompilerContext, and that are moved around a lot. Ideally we could Box up some of these contexts and/or otherwise make moving them cheaper.
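A minimal sketch of why boxing helps, using a hypothetical `LargeContext` with an invented size (this is not the real CompilerContext): moving the struct itself copies all of its bytes, while moving a `Box` of it copies only a pointer.

```rust
use std::mem::size_of;

/// Hypothetical large context; the size is invented for illustration.
struct LargeContext {
    scratch: [u64; 512],
}

/// Size of the context when moved by value.
fn inline_move_size() -> usize {
    size_of::<LargeContext>()
}

/// Size of what is actually moved when the context is boxed.
fn boxed_move_size() -> usize {
    size_of::<Box<LargeContext>>()
}

/// Passing the box between phases moves only the pointer;
/// the 4 KiB payload stays where it is on the heap.
fn next_phase(ctx: Box<LargeContext>) -> Box<LargeContext> {
    ctx
}
```

The trade-off is one heap allocation per context, which matters less if the boxed context is itself reused across functions.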


I'm sure there's other parts of the profile to dig in to as well, but I wanted to at least file an issue in case anyone's interested in chipping away at some pieces here.

Labels: cranelift:goal:compile-time (Focus area: how fast Cranelift can compile or how much memory it uses), fuzz-bug (Bugs found by a fuzzer)
