Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wazeroir: avoid allocations per Operation #1202

Closed
mathetake opened this issue Mar 6, 2023 · 9 comments · Fixed by #1342
Closed

wazeroir: avoid allocations per Operation #1202

mathetake opened this issue Mar 6, 2023 · 9 comments · Fixed by #1342
Assignees
Labels
help wanted Extra attention is needed tech debt

Comments

@mathetake
Copy link
Member

mathetake commented Mar 6, 2023

Currently, we result in allocating an wazeroir.Operation object each time without caching or whatever. This is one of the bottlenecks during compilation (!= execution or instantiation) besides #1060:

image

memory profiling result on the compilation of GOOS=js binary

The root cause of this is that we emit a concrete type for Operation interface which is a translation for Wasm instruction.
For example, each global.get 1 allocates wazeroir.OperationGlobalGet{index: 1}:

case wasm.OpcodeGlobalSet:
c.emit(
OperationGlobalSet{Index: index},
)

And we store them in the slice of Operation interfaces

// Operations holds wazeroir operations compiled from Wasm instructions in a Wasm function.
Operations []Operation

One possible solution to this is to make it streaming given that our current compiler/interpreter does stream compilation. Instead of translating the Wasm instructions all at once, just make wazeroir compiler stateful and use the fixed-length bytes array (the length could be the maximum size of concrete operation struct type which is statically known) to store the IR info from compilers/interpreters.

There might be an easier and better way to optimize this, so I'm open to discussion!

@mathetake mathetake added help wanted Extra attention is needed tech debt labels Mar 6, 2023
@mathetake
Copy link
Member Author

mathetake commented Mar 6, 2023

another idea could be to serialize Operation as bytes and store it in Operations []byte, and decodes them via unsafe.Pointer and reflect.SliceHeader by compiler/interpreter.

@achille-roussel
Copy link
Collaborator

Have you considered changing Operation from an interface to a type with a structure like this?

type Operation struct {
  Kind OperationKind
  // all possible field types
  // ...
}

Go doesn't have union so you can either use a representation that is very memory greedy and declare each parameters as struct fields (this is the approach used in https://pkg.go.dev/go.jpap.org/zydis#DecodedInstruction), for example:

type Operation struct {
  Kind OperationKind
  Label struct { // fields from OperationLabel
    Label Label
  }
  Br struct { // fields from OperationBr
    Target Label
  }
}

To reduce the memory footprint you can emulate the union by having a single struct field using enough memory to hold the largest argument set. I think this is somewhat similar to what you're suggesting with Operations []byte. You don't necessarily have to use unsafe tho, encoding/decoding with binary.LittleEndian usually works well as a safer alternative.

@mathetake
Copy link
Member Author

actually the interpreter uses that quasi union type effectively https://github.com/tetratelabs/wazero/blob/main/internal/engine/interpreter/interpreter.go#L207-L209

so we could just lift it into wazeroir package and reuse it in compiler package as well!

@achille-roussel
Copy link
Collaborator

achille-roussel commented Mar 8, 2023

Ah nice! Yes that sounds like a good strategy.

There are two slices in that type that may cause a lot of heap allocations as well if the backing array end up escaping to the heap (tho it's much more nuanced than with interfaces).

Here are a few approaches I've used for this:

  • use output parameter instead of return values (e.g. func(op *Operation) instead of func() Operation), this lets the caller pass the same pre-allocated Operation value and the callee can reuse the backing arrays of slices via append. This works well in iterator or reader-like contexts
  • if you know the maximum possible length of the slices you can turn them into a fixed-length array + length field
  • a more advanced strategy can also be to create an array-like type which inlines memory allocation for the first N elements as well, if lists with many elements are rare you can amortize the cost of allocations quite effectively

@mathetake
Copy link
Member Author

@evacchi wanna take a shot? 😎

@evacchi
Copy link
Contributor

evacchi commented Mar 25, 2023

Alrighty 😎

@evacchi
Copy link
Contributor

evacchi commented Mar 27, 2023

so if I understand this correctly, this is kind of a large refactoring, so let me make sure I have understood correctly:

  • we want to avoid allocating an Operation while compiling
  • in this context, "compilation" means turning a wasm native opcode to a wazero IR struct (i.e. an Operation type)
  • the interpreter turns each wazero IR struct into a quasi union (interpreterOp)
  • the compiler translates each wazero IR struct into the corresponding asm instructions

so the proposed solution is:

  • lift interpreterOp to package wazeroir (renamed)
  • avoid the intermediate translation from wasm opcodes to wazeroir.Operation
  • instead, translate wasm opcodes directly to interpreterOp
  • the interpreter works on interpreterOp (so as it does already)
  • the native compiler translates the quasi-union to native asm, instead of translating Operations to native asm

then, because interpreterOp is flattened, we can also preallocate and/or apply other optimizations.

Is this the idea?

In this case, I think a (hopefully practical) way to proceed is to do this in phases, i.e. move one opcode at a time and make sure all tests still pass.

Maybe I am overthinking it, let me know 😬

@achille-roussel
Copy link
Collaborator

@evacchi you gave a very good description of what was my understanding of the work to do 😊

@mathetake
Copy link
Member Author

Yeah, #1296 is a good starting point. In the subsequent PRs, you should be able to refactor the interpreter/compiler engine so that it will do streaming rather than bulk production of the union type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed tech debt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants