Optimize program address translation #23482
I added a "Proposed Solution". @dmakarov @alessandrod Thoughts?
I think something like this is going to work, but we should consider other options too.

Today jitted programs are executed in the same process. To ensure that programs are logically separate we implement manual address translation; to ensure they don't exploit the JIT we have a number of mitigations; etc. All those things have a cost, and arguably don't offer the best kind of isolation: it's technically still possible for a program to crash or take control of the whole executing process.

Have we considered running each jitted program in a separate, lightweight process sandbox? If we did that, we could configure the address space and then let a program run without manual memory translation. We could also relax some of the JIT-spraying mitigations we do, which would decrease compile times and improve runtime performance. As an example, when I was looking into speeding up constant blinding, I saw that Firefox doesn't do any constant blinding anymore and instead focuses on running in an airgapped sandbox.

Our programs are self-hosted and don't link to anything, so memory usage would be about the same. We'd have to find a smart way to set up the sandbox processes so that they wouldn't use a lot of (non-shared) memory and would be cheap to create. This is largely a solved problem when running under Linux. We could have something like the zygote process in Android: a template process always in ready state that is extremely cheap to clone and start.

This seems to be the direction most VMs are going in these days. The JVM has some multi-tenant features that are mostly a relic of applet days. Taken to the extreme, AWS Lambda does what I'm proposing by spawning a whole micro OS VM for each request. ChromeOS does something similar. Browsers run JITs in external process sandboxes too, etc.
We've also briefly discussed recently how we should implement some kind of strategy to run small programs using the interpreter, and only JIT larger programs for which we know the compilation cost is worth it. We could easily fit sandboxing into that scheme too: we don't sandbox interpreted programs, but we switch to sandbox + JIT for large or hot programs only.
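A minimal sketch of how such a tiered policy could look. The struct, field names, and thresholds here are purely illustrative assumptions, not existing rbpf or solana APIs:

```rust
// Hypothetical tiered-execution policy: small, cold programs stay on the
// interpreter; large or hot programs are promoted to sandbox + JIT.
#[derive(Debug, PartialEq)]
enum ExecutionTier {
    Interpreter,
    SandboxedJit,
}

struct TierPolicy {
    max_interpreted_len: usize, // promote programs larger than this (bytes)
    hot_call_count: u64,        // or invoked more often than this
}

impl TierPolicy {
    fn select(&self, program_len: usize, call_count: u64) -> ExecutionTier {
        if program_len > self.max_interpreted_len || call_count > self.hot_call_count {
            ExecutionTier::SandboxedJit
        } else {
            ExecutionTier::Interpreter
        }
    }
}
```

The thresholds would presumably be tuned from measurements of compile cost vs. interpretation overhead.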
I really like the lightweight process isolation that @alessandrod suggested.
It would be good to measure and compare the overhead of the simple binary translation the JIT performs vs. the interpreter.
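For illustration, a toy harness along these lines could time an interpreted opcode loop against the equivalent native computation. Everything here is a stand-in (the opcode set, the workload, the function names), not the real interpreter or JIT:

```rust
use std::time::Instant;

// Tiny "interpreter" over a byte-coded program: opcode 0 increments the
// accumulator, opcode 1 doubles it, anything else is a no-op.
fn interpret(ops: &[u8], mut acc: i64) -> i64 {
    for &op in ops {
        match op {
            0 => acc += 1,
            1 => acc *= 2,
            _ => {}
        }
    }
    acc
}

// The same work done natively, without dispatch overhead.
fn native(mut acc: i64, n: usize) -> i64 {
    for _ in 0..n {
        acc += 1; // equivalent of opcode 0
    }
    acc
}

// Run both and report wall-clock times; returns the results so the
// caller can check they computed the same thing.
fn measure() -> (i64, i64) {
    let ops = vec![0u8; 1_000];
    let t0 = Instant::now();
    let a = interpret(&ops, 0);
    let interp_time = t0.elapsed();

    let t1 = Instant::now();
    let b = native(0, 1_000);
    let native_time = t1.elapsed();

    eprintln!("interpreted: {:?}, native: {:?}", interp_time, native_time);
    (a, b)
}
```

A real comparison would of course run actual BPF programs under both backends and use a proper benchmarking harness (warmup, many iterations) rather than a single timing.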
When I started working on RBPF, I had proposed having hardware-accelerated address translation by making use of the MMU, either via process isolation or even a unikernel + hypervisor approach, which I think is essentially what you are suggesting here. Back then we identified the following problems, IIRC:
In other words, I fear that going with the MMU-based approach pushes us into a local minimum: we get somewhat better address translation immediately with relatively few changes to the interfaces, but we block our path to more parallel execution, and thus more execution bandwidth, in the future.
No, I wouldn't go as far as booting a full-blown VM; I'm thinking locked-down processes.
This is what Firecracker does: https://github.com/firecracker-microvm/firecracker. But again, to be clear, I'm not suggesting we do this.
It is true that it eliminates the possibility of parallelizing via ILP and SIMD, but it still allows parallelism by running parallel instances of the same program. I'd argue that given our current ISA and APIs, sandboxing plus process parallelism is the simpler, more realistic path to better performance and increased security. And if we do decide to go crazy and crank up parallelism, I suspect we should look into SPIR-V or similar instead of trying to grow BPF into it.
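As a sketch of that kind of parallelism: the same (pure) program logic is run over independent inputs concurrently. Threads stand in here for the separate sandbox processes being discussed, and the closure is a placeholder for a real program, so this only illustrates the shape of the approach:

```rust
use std::thread;

// Run one "program instance" per input in parallel and collect results.
// The closure below is a placeholder computation, not a real BPF program.
fn run_parallel(inputs: Vec<u64>) -> Vec<u64> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|input| thread::spawn(move || input * 2 + 1)) // placeholder "program"
        .collect();
    // Join in spawn order so outputs line up with inputs.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With process sandboxes, the spawn/join pair would become clone-from-zygote plus an IPC result channel, but the parallelism structure would be the same.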
I see what you mean, but I'm a bit skeptical of this argument. If we had a clear idea of how to unlock more parallel execution given the current ISA, APIs and programming model, I think it would be a valid argument. But since I think that doing anything that unlocks significantly higher parallelism will require changes to the programming model, the APIs and the whole stack anyway, I don't think that moving execution to separate processes today would hinder future progress in practice.
Replying to my own comment because it's late and...
...oops sorry, I misread your paragraph. Do still check out Firecracker, because it's a cool way to do hypervisor + (OS) VM 😊 As I said, the problem of latency in spinning up the SBF VM can be solved in operating systems that implement a cheap way to clone an existing process. That means Linux and probably the good BSDs (😜). I haven't done any serious Windows programming in a long time, so I don't know if it could be made to work on Windows too, but then again, is anyone really going to run validators on Windows? So many things already don't work on Windows today. For development on Windows and other places where creating a process is slow, we could either disable sandboxing or use process pools, etc.
Exactly, that is my argument. It is the easier solution short term, but it might leave us with significant code complexity to maintain in the long term if we switch to something else entirely.
This has been proposed here as well: #20323. I guess we need to prototype both the process-sandboxed VMs and the ILP/SIMD parallel VMs in order to compare their upsides and downsides, and to see if the second is feasible at all.
Why does it require changes to the programming model? If you generated a vectorized version of the program, couldn't you emit a SIMD version of it that the runtime can then invoke to run multiple instances of the program in parallel?
You could, but it would be sub-optimal. Or, in other words, there is a lot of potential in fitting the ISA to the problem. That way the need for static analysis, or for decompilation and recompilation, can be reduced.
And as for the programming model, in the sense of the way the developer uses the software stack, that is still completely up in the clouds / remains to be seen.
Problem
[Alexander Meißner]
Proposed Solution
- u64 host base pointer
- u32 range length
- bool is range writable
- u64 guest base pointer
- u32 range length
- u32 sub range offset
- u32 sub range length
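Assuming the fields above describe a single memory-mapping entry, a sketch of the region and its translation check might look like the following. The field and function names are mine; only the types and their meanings come from the list, and the bounds-checking logic is an assumption about intent, not the actual proposal:

```rust
// Hypothetical memory region carrying the proposed fields.
#[derive(Debug, Clone)]
struct MemoryRegion {
    host_base: u64,    // u64 host base pointer
    host_len: u32,     // u32 range length (host side)
    is_writable: bool, // bool is range writable
    guest_base: u64,   // u64 guest base pointer
    guest_len: u32,    // u32 range length (guest side)
    sub_offset: u32,   // u32 sub range offset
    sub_len: u32,      // u32 sub range length
}

impl MemoryRegion {
    /// Translate a guest access of `len` bytes at `guest_addr` into a host
    /// address, or return None if it is out of bounds or not permitted.
    fn translate(&self, guest_addr: u64, len: u32, write: bool) -> Option<u64> {
        if write && !self.is_writable {
            return None;
        }
        // The accessible window is [guest_base + sub_offset,
        // guest_base + sub_offset + sub_len).
        let start = self.guest_base.checked_add(self.sub_offset as u64)?;
        let end = start.checked_add(self.sub_len as u64)?;
        let access_end = guest_addr.checked_add(len as u64)?;
        if guest_addr < start || access_end > end {
            return None;
        }
        // The access must also stay within the full guest range...
        if access_end > self.guest_base.checked_add(self.guest_len as u64)? {
            return None;
        }
        // ...and the mapped bytes must fit in the host range.
        let host_off = guest_addr - self.guest_base;
        if host_off + len as u64 > self.host_len as u64 {
            return None;
        }
        Some(self.host_base + host_off)
    }
}
```

The sub-range fields would allow exposing only a window of a larger mapping to the guest while keeping a single host allocation behind it.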