
Optimize program address translation #23482

Closed
jackcmay opened this issue Mar 4, 2022 · 10 comments
Labels
runtime (Issues related to runtime, BPF, and LLVM) · stale [bot only] (added to stale content; results in auto-close after a week)

Comments

@jackcmay
Contributor

jackcmay commented Mar 4, 2022

Problem

[Alexander Meißner]

I have been thinking about how we could reduce our address-translation overhead in the VM. Currently, we do a full pointer translation and validation on every single memory access (load / store). Ideally we would like to reuse / share the intermediate results from similar pointers. I have an idea, but it will probably require a different instruction encoding, or some heavy hinting which might bloat the executable file too much. So it might not be feasible in SBFv2 and may be more of an SBFv3 feature (together with other ideas such as SIMD-based parallel VMs, register reallocation for ARM, etc.).

Anyway, the approach would be to have the front-end compiler emit some kind of type info, hints, or distinct register sets to distinguish normal integer values from pointer values in the bytecode. Then we can have a uint-to-ptr conversion instruction which performs the address translation, and all accesses through the same pointer can reuse that result. One can extend this even further with fat pointers or ranges, which would give the user an expensive range-construction instruction (taking a pointer, a length, and a can-write flag) plus cheap sub-range / offset and memory-access instructions. This way loops and structure accesses could also benefit, by translating a common range only once and then accessing all elements of an array or struct through it.

What is also nice about this approach is that it does not need any static analysis in the validators, and can thus shift the computational burden from the compilers running in every validator to the single compiler run by the application developer. Furthermore, we might be able to fold the normal Rust array bounds checks into the range translation as well, instead of emitting extra instructions.

Proposed Solution

  • Add a register set for address ranges, each containing:
    • u64 host base pointer
    • u32 range length
    • bool is range writable
  • Add an instruction to create new ranges (this performs the guest-to-host translation)
    • Input: u64 guest base pointer
    • Input: u32 range length
    • Output: Address range register to overwrite
  • Add an instruction to create sub ranges
    • Input: Address range register to copy from
    • Input: u32 sub range offset
    • Input: u32 sub range length
    • Output: Address range register to overwrite
  • Modify all memory access instructions to take address range registers instead of pointers from data registers
  • Implement non-uniform costs for different instruction types, making range creation more expensive
  • Have the JIT perform the bounds checks of (sub) address ranges with constant offset and length at compile-time
  • Let the Rust frontend compiler use these address range instructions to omit explicit array bounds checks
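The bullets above can be sketched in plain Rust. This is a minimal illustration, not RBPF code; all names (`AddressRange`, `Region`, `create_range`, etc.) are hypothetical. The point is that the expensive guest-to-host lookup happens once per range, while sub-ranges and accesses only do cheap arithmetic on the cached result:

```rust
/// Hypothetical address-range register: caches one guest-to-host translation.
#[derive(Clone, Copy)]
struct AddressRange {
    host_base: u64, // u64 host base pointer
    len: u32,       // u32 range length
    writable: bool, // bool is range writable
}

/// Illustrative stand-in for one region of the VM's memory map.
struct Region {
    guest_base: u64,
    host_base: u64,
    len: u32,
    writable: bool,
}

/// Expensive instruction: translate a guest pointer + length into a range register.
fn create_range(regions: &[Region], guest_ptr: u64, len: u32) -> Option<AddressRange> {
    let r = regions.iter().find(|r| {
        guest_ptr >= r.guest_base
            && guest_ptr + len as u64 <= r.guest_base + r.len as u64
    })?;
    Some(AddressRange {
        host_base: r.host_base + (guest_ptr - r.guest_base),
        len,
        writable: r.writable,
    })
}

/// Cheap instruction: derive a sub-range; only a bounds check, no re-translation.
fn sub_range(parent: AddressRange, offset: u32, len: u32) -> Option<AddressRange> {
    if offset.checked_add(len)? > parent.len {
        return None;
    }
    Some(AddressRange {
        host_base: parent.host_base + offset as u64,
        len,
        writable: parent.writable,
    })
}

/// Cheap memory access: a store only checks the cached length and write flag.
fn check_store(range: AddressRange, offset: u32, size: u32) -> bool {
    range.writable && offset.saturating_add(size) <= range.len
}
```

When `offset` and `len` of a `sub_range` are compile-time constants, a JIT could fold the bounds check away entirely, which is exactly the point of the second-to-last bullet.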
@Lichtso
Contributor

Lichtso commented Mar 4, 2022

I added a "Proposed Solution".

@dmakarov @alessandrod Thoughts?

@alessandrod
Contributor

Proposed Solution

I think something like this is going to work, but we should consider other options too.

Today jitted programs are executed in the same process. In order to ensure that programs are logically separated, we implement manual address translation; to ensure they can't exploit the JIT, we have a number of mitigations; and so on. All those things have a cost, and arguably don't offer the best kind of isolation: it's technically still possible for a program to crash or take control of the whole executing process.

Have we considered running each jitted program in a separate, lightweight process sandbox? If we did that, we could configure the address space up front and then let a program run without manual memory translation. We could also relax some of the JIT-spraying mitigations we do, which would decrease compile times and improve runtime performance. As an example, when I was looking into speeding up constant blinding, I saw that Firefox doesn't do any constant blinding anymore, and instead focuses on running in an airgapped sandbox.

Our programs are self-hosted; they don't link to anything, so memory usage would be about the same. We'd have to find a smart way to set up the sandbox processes so that they don't use a lot of (non-shared) memory and are cheap to create. This is largely a solved problem when running under Linux. We could have something like the zygote process in Android: a template process always in ready state that is extremely cheap to clone and start.

This seems to be the direction most VMs are going these days. The JVM has some multi-tenant instructions that are mostly a relic of the applet days. Taken to the extreme, AWS Lambda does what I'm proposing, but spawning a whole micro OS VM for each request. ChromeOS does something similar. Browsers run JITs in external process sandboxes too, etc.

We've also briefly discussed recently how we should implement some kind of strategy to run small programs using the interpreter and only JIT larger programs for which we know the compilation cost is worth it. We could easily fit sandboxing into that scheme too, where we don't sandbox interpreted programs but switch to sandbox + JIT only for large or hot programs.
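The tiering policy described here could be as simple as a size/hotness check. A minimal sketch, with entirely made-up names and thresholds (nothing here corresponds to an existing runtime API):

```rust
/// Hypothetical execution tiers (names and thresholds purely illustrative).
#[derive(Debug, PartialEq)]
enum ExecutionTier {
    Interpret,    // small / cold programs: no compile cost, no sandbox setup
    SandboxedJit, // large or hot programs: compile once, run in a process sandbox
}

/// Pick a tier from the program's size and how often it has been invoked.
fn pick_tier(instruction_count: usize, invocations: u64) -> ExecutionTier {
    const SIZE_THRESHOLD: usize = 4096; // "large": JIT cost amortizes over the run
    const HOT_THRESHOLD: u64 = 100;     // "hot": invoked often enough to compile
    if instruction_count >= SIZE_THRESHOLD || invocations >= HOT_THRESHOLD {
        ExecutionTier::SandboxedJit
    } else {
        ExecutionTier::Interpret
    }
}
```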

@dmakarov
Contributor

dmakarov commented Mar 8, 2022

I really like the lightweight process isolation that @alessandrod suggested.

@dmakarov
Contributor

dmakarov commented Mar 8, 2022

We've also briefly discussed recently how we should implement some kind of strategy to run small programs using the interpreter and only JIT larger programs for which we know the compilation cost is worth it. We could easily fit sandboxing into that scheme too, where we don't sandbox interpreted programs but switch to sandbox + JIT only for large or hot programs.

It would be good to measure and compare the overhead of the simple binary translation the JIT performs vs. the interpreter.

@Lichtso
Contributor

Lichtso commented Mar 8, 2022

When I started working on RBPF, I had proposed hardware-accelerated address translation by making use of the MMU, either via process isolation or even a unikernel + hypervisor approach, which I think is essentially what you are suggesting here.

Back then we identified the following problems IIRC:

  • Our VMs are extremely short-lived, as our programs are so short: on the order of a single scheduler quantum of the outer operating system. So the cost of creating and tearing down a VM has to be far less than what other VMs can get away with. This simply needs to be prototyped and measured across multiple host operating systems.
  • The approach only works by completely isolating each VM, which in turn eliminates any hope of parallelizing the execution of multiple similar VMs (same program, different data) via ILP and SIMD. I think this cannot be reconciled: either you have one virtual address space per VM, or you can run multiple in one.

In other words, I fear that going with the MMU-based approach pushes us into a local minimum: we get somewhat better address translation immediately with relatively few changes to the interfaces, but we block our path to more parallel execution, supporting more execution bandwidth, in the future.
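For concreteness, the ILP / SIMD parallelization being traded away here could look roughly like the following lockstep interpretation sketch (purely illustrative; none of this exists in RBPF): one decoded instruction stream drives several VM lanes that share a program but hold different data, which is only possible when all lanes live in one host address space.

```rust
/// Illustrative mini-ISA: same program, different data per VM lane.
#[derive(Clone, Copy)]
enum Insn {
    AddImm(usize, i64), // regs[r] += imm
    Mul(usize, usize),  // regs[a] *= regs[b]
}

/// One VM lane: its own registers, but no private address space.
struct Lane {
    regs: [i64; 4],
}

/// Lockstep interpreter: decode each instruction once, apply it to every lane.
/// Because the lanes share one host address space, this inner loop could be
/// auto-vectorized or written with SIMD; one process per VM forbids that.
fn run_lockstep(program: &[Insn], lanes: &mut [Lane]) {
    for insn in program {
        for lane in lanes.iter_mut() {
            match *insn {
                Insn::AddImm(r, imm) => lane.regs[r] += imm,
                Insn::Mul(a, b) => lane.regs[a] *= lane.regs[b],
            }
        }
    }
}
```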

@alessandrod
Contributor

When I started working on RBPF, I had proposed hardware-accelerated address translation by making use of the MMU, either via process isolation or even a unikernel + hypervisor approach, which I think is essentially what you are suggesting here.

No, I wouldn't go as far as booting a full-blown VM; I'm thinking locked-down processes.

Back then we identified the following problems IIRC:

Our VMs are extremely short-lived, as our programs are so short: on the order of a single scheduler quantum of the outer operating system. So the cost of creating and tearing down a VM has to be far less than what other VMs can get away with. This simply needs to be prototyped and measured across multiple host operating systems.

This is what firecracker does https://github.com/firecracker-microvm/firecracker. But again to be clear, I'm not suggesting we do this.

The approach only works by completely isolating each VM, which in turn eliminates any hope of parallelizing the execution of multiple similar VMs (same program, different data) via ILP and SIMD. I think this cannot be reconciled: either you have one virtual address space per VM, or you can run multiple in one.

It is true that it eliminates the possibility of parallelizing via ILP and SIMD, but it still allows parallelism by running parallel instances of the same program. I'd argue that, given our current ISA and APIs, sandboxing and process parallelism are the simpler, more realistic path to better perf and increased security.

And if we do decide to go crazy and crank up parallelism, I suspect we should look into SPIR-V or similar instead of trying to grow BPF into it.

In other words I fear that going with the MMU based approach pushes us into a local minimum where we get some better address translation immediately with relatively little changes to the interfaces, but we block our path to a more parallel execution, supporting more execution bandwidth, in the future.

I see what you mean, but I'm a bit skeptical of this argument. If we had a clear idea of how to unlock more parallel execution given the current ISA, APIs and programming model, I think it would be a valid argument. But since I think that doing anything that unlocks significantly higher parallelism will require changes to the programming model, the APIs and the whole stack anyway, I don't think that moving execution to separate processes today would hinder future progress in practice.

@alessandrod
Contributor

Replying to my own comment because it's late and...

Our VMs are extremely short-lived, as our programs are so short: on the order of a single scheduler quantum of the outer operating system. So the cost of creating and tearing down a VM has to be far less than what other VMs can get away with. This simply needs to be prototyped and measured across multiple host operating systems.

This is what firecracker does https://github.com/firecracker-microvm/firecracker. But again to be clear, I'm not suggesting we do this.

...oops, sorry, I misread your paragraph. Do still check out firecracker, because it's a cool way to do hypervisor + (OS) VM 😊

As I said, the problem of latency in spinning up the SBF VM can be solved on operating systems that implement a cheap way to clone an existing process. That means Linux and probably the good BSDs (😜). I haven't done any serious Windows programming in a long time, so I don't know if it could be made to work on Windows too; but then again, is anyone really going to run validators on Windows? So many things already don't work on Windows today.

For development on Windows and other platforms where creating a process is slow, we could either disable sandboxing or use process pools, etc.

@Lichtso
Contributor

Lichtso commented Mar 8, 2022

I'd argue that, given our current ISA and APIs, sandboxing and process parallelism are the simpler, more realistic path to better perf and increased security.

Exactly, that is my argument. It is the easier solution short term, but might leave us with significant code complexity to maintain in the long term in case we later switch to something else entirely.

I suspect we should look into SPIR-V or similar instead of trying to grow BPF into it.

Has been proposed here as well: #20323


I guess we need to prototype both the process-sandboxed VMs and the ILP / SIMD parallel VMs in order to compare their upsides and downsides, and to see if the second is feasible at all.

@sakridge
Member

sakridge commented Mar 9, 2022

But since I think that doing anything that unlocks significantly higher parallelism will require changes to the programming model, the APIs and the whole stack anyway, I don't think that moving execution to separate processes today would hinder future progress in practice.

Why does it require changes to the programming model? If you generated a vectorized version of the program, couldn't you emit a SIMD version of it that the runtime can then call with multiple instances of the program in parallel?

@Lichtso
Contributor

Lichtso commented Mar 9, 2022

Why does it require changes to the programming model? If you generated a vectorized version of the program, couldn't you emit a SIMD version of it that the runtime can then call with multiple instances of the program in parallel?

You could, but it would be sub-optimal. In other words, there is a lot of potential in fitting the ISA to the problem; this way the need for static analysis, or for decompilation and recompilation, can be reduced. For example:

  • Using a stack machine or queue machine instead of a register machine based model
  • Having type info to distinguish between general data, pointers, boolean condition flags, etc.
  • Keeping the jump labels or extended basic blocks for sorting in thread pools
  • A simpler and more symmetric instruction encoding to reduce edge cases

And for the programming model as in the way the developer uses the software stack, that is still completely up in the clouds / remains to be seen.
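To make the type-info bullet above concrete, here is a purely hypothetical sketch of tagged bytecode values, where a dedicated uint-to-ptr instruction is the only point that pays for address translation (none of these names exist in RBPF):

```rust
/// Hypothetical tagged value classes a typed ISA could distinguish, so that
/// a translation result can be cached alongside the pointer that produced it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Value {
    Int(u64),                      // general data
    Ptr { guest: u64, host: u64 }, // pointer with cached translation
    Cond(bool),                    // boolean condition flag
}

/// Hypothetical uint-to-ptr conversion instruction: the only place where the
/// (expensive) address translation runs; later accesses reuse the cached `host`.
fn uint_to_ptr(v: Value, translate: impl Fn(u64) -> u64) -> Option<Value> {
    match v {
        Value::Int(guest) => Some(Value::Ptr { guest, host: translate(guest) }),
        _ => None, // type error: operand is already a pointer or a flag
    }
}
```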

@pgarg66 added the runtime label Oct 26, 2022
@github-actions bot added the stale label Oct 27, 2023
@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 3, 2023

6 participants