
Improve Address Translation Performance without Breaking Changes #196

Merged: 9 commits merged into main from feature/address_translation_cleanup on Jul 23, 2021

Conversation

@Lichtso commented Jul 12, 2021

  • Concatenates all read-only sections (including the text section) into one
  • Requires exactly one heap section
  • This way there are always exactly 5 memory regions:
    • 0: NULL
    • 1: Program & Constants (read-only)
    • 2: Stack
    • 3: Heap
    • 4: Input
  • Enforces the memory regions' virtual addresses to be aligned.
  • Makes address translation constant-time by interpreting the upper half of a pointer as the region index (see the sketch after this list).
  • Replaces the Rust call in the JIT with a native x86 implementation.
  • Makes stack frame gaps in VM address space optional / configurable.
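
Because every region's virtual start address is aligned to a 4 GiB boundary, the region index can be read straight out of the pointer. A minimal sketch of the idea; the start-address constants are illustrative, derived from the fixed layout listed above rather than copied from the crate:

```rust
/// Fixed virtual memory layout: region i starts at i << 32.
/// Illustrative constants matching the region list above.
const MM_PROGRAM_START: u64 = 0x1_0000_0000; // region 1: program & constants
const MM_STACK_START: u64 = 0x2_0000_0000;   // region 2: stack
const MM_HEAP_START: u64 = 0x3_0000_0000;    // region 3: heap
const MM_INPUT_START: u64 = 0x4_0000_0000;   // region 4: input

/// The upper half of a VM pointer is the region index, turning the
/// lookup into a constant-time table access instead of a region scan.
fn region_index(vm_addr: u64) -> usize {
    (vm_addr >> 32) as usize
}

fn main() {
    assert_eq!(region_index(MM_PROGRAM_START), 1);
    assert_eq!(region_index(MM_STACK_START), 2);
    assert_eq!(region_index(MM_HEAP_START), 3);
    assert_eq!(region_index(MM_INPUT_START), 4);
}
```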

@Lichtso added the bug and enhancement labels Jul 12, 2021
@Lichtso force-pushed the feature/address_translation_cleanup branch 2 times, most recently from 497c274 to 957e295 on July 14, 2021 19:25
@Lichtso (author) commented Jul 15, 2021

This is the code that is now generated for the address translation of a store (write):

pushq  %r11
pushq  %rax
pushq  %rcx
    pushq  %rdx
movq   %r11, %rax
shrq   $0x20, %rax
cmpq   %rax, -0x4da80037(%r10)
jbe    0x100d018e5
shlq   $0x5, %rax
addq   -0x4da8003f(%r10), %rax
cmpl   $0x0, 0x19(%rax)
je     0x100d018e5
movabsq $0xffffffff, %rcx
andq   %rcx, %r11
    movzbl 0x18(%rax), %ecx
    movq   %r11, %rdx
    shrq   %cl, %rdx
    testq  $0x1, %rdx
    jne    0x100d018e5
    movq   $-0x1, %rdx
    shlq   %cl, %rdx
    movq   %rdx, %rcx
    notq   %rcx
    andq   %r11, %rcx
    andq   %rdx, %r11
    shrq   $0x1, %r11
    orq    %rcx, %r11
leaq   0x8(%r11), %rcx
cmpl   %ecx, 0x10(%rax)
jb     0x100d018e5
addq   (%rax), %r11
    popq   %rdx
popq   %rcx
popq   %rax
addq   $0x8, %rsp
retq   

The lines indented to the right are optional and can be turned off by setting `config.enable_stack_frame_gaps = false`.
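
For readers who prefer Rust over x86, here is a minimal sketch of what the generated code above does; the `Region` struct, its field names, and `translate_store` are illustrative assumptions (annotated with the corresponding operands from the disassembly), not the crate's exact types:

```rust
/// Illustrative region descriptor; the offset comments refer to the
/// disassembly above. Not rbpf's exact memory region layout.
struct Region {
    host_addr: u64,    // (%rax):      host base address of the region
    len: u64,          // 0x10(%rax):  region length in bytes
    gap_shift: u8,     // 0x18(%rax):  position of the stack frame gap bit
    is_writable: bool, // 0x19(%rax):  must be non-zero for a store
}

/// Mirrors the store translation above for an 8-byte access.
fn translate_store(regions: &[Region], vm_addr: u64) -> Option<u64> {
    let index = (vm_addr >> 32) as usize;      // shrq $0x20, %rax
    let region = regions.get(index)?;          // cmpq/jbe check + shlq $0x5/addq lookup
    if !region.is_writable {                   // cmpl $0x0, 0x19(%rax); je fail
        return None;
    }
    let offset = vm_addr & 0xffff_ffff;        // movabsq $0xffffffff; andq
    // Optional stack frame gap logic (the indented lines). A region
    // without gaps presumably uses a gap_shift whose bit is never set.
    if (offset >> region.gap_shift) & 1 != 0 { // shrq %cl; testq $0x1; jne fail
        return None;                           // the address falls inside a gap
    }
    let mask = u64::MAX << region.gap_shift;   // movq $-0x1; shlq %cl
    // Squeeze the gap bit out so the stack frames become contiguous.
    let offset = ((offset & mask) >> 1) | (offset & !mask);
    if offset + 8 > region.len {               // leaq 0x8(%r11); cmpl; jb fail
        return None;
    }
    Some(region.host_addr + offset)            // addq (%rax), %r11
}

fn main() {
    // One illustrative writable region with 4 KiB frames separated by gaps.
    let regions = [Region { host_addr: 0x10_0000, len: 64 * 1024, gap_shift: 12, is_writable: true }];
    assert_eq!(translate_store(&regions, 0x100), Some(0x10_0100));
}
```

To disable the gap handling (and drop the indented instructions from the emitted code), something like `Config { enable_stack_frame_gaps: false, ..Config::default() }` should work, assuming `Config` implements `Default`.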

@jon-chuang commented Jul 17, 2021

rbpf systematic benchmarks

memcpy (builtins). size: 15 (8 ld/st pairs), iter: 1 << 16

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time per ld/st |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 13325998ns | 12.708ns |
| new | true | true | true | true | 4532151ns | 4.322ns |
| new | true | true | false | true | 3151466ns | 3.005ns |
| new | false | false | false | true | 2295472ns | 2.189ns |

ristretto. 1000 field mul

Reference native time: 20ns. Slowdown for best time: 28.2x.
insn count: 1455581
insn/ns for best time: 2.580

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time per op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 2741706ns | 2741.706ns |
| new | true | true | true | true | 1081769ns | 1081.769ns |
| new | true | true | false | true | 971221ns | 971.221ns |
| new | false | false | true | true | 728362ns | 728.362ns |
| new | false | false | false | true | 564458ns | 564.458ns |

Analysis for Field Mul

Using the following two pieces of code to measure the ld/st time without loop overhead (the first loop contains one ld/st pair per iteration, the second is identical without it):

    and r1, 64917
    mov r3, 4
    lsh r3, 32
    add r1, r3
    ldxdw r3, [r1]
    stxdw [r1], r1
    add r1, r3
    add r2, 1
    jlt r2, 100000, -9
    exit

    and r1, 64917
    mov r3, 4
    lsh r3, 32
    add r1, r3
    add r1, r3
    add r2, 1
    jlt r2, 100000, -7
    exit
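
The two loops are identical except for the `ldxdw`/`stxdw` pair, so their runtime difference divided by the iteration count isolates the cost of one ld/st pair while the shared loop overhead cancels out. A minimal sketch of that arithmetic (the function and the raw timings in `main` are hypothetical, chosen to reproduce the 9.112ns figure below):

```rust
/// Both loops execute the same bookkeeping instructions; the first adds
/// one ld/st pair per iteration. Differencing cancels the shared overhead.
fn time_per_ldst_pair_ns(with_ldst_ns: f64, without_ldst_ns: f64, iterations: f64) -> f64 {
    (with_ldst_ns - without_ldst_ns) / iterations
}

fn main() {
    // Hypothetical raw timings for the two 100000-iteration loops above.
    let per_pair = time_per_ldst_pair_ns(1_500_000.0, 588_800.0, 100_000.0);
    println!("{per_pair:.3} ns per ld/st pair"); // 9.112 ns per ld/st pair
}
```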

we obtain:

Time per ld/st for old,true,true,true,true: 9.112ns
Time per ld/st for new,true,true,true,true: 2.538ns
Time per ld/st for new,false,false,false,true: 1.617ns
Total insn count: 1455581
Total ld/st pairs/op: 213.989

Time per ld/st for old,true,true,true,true: 9.112ns
Time taken for ld/st: 1.949µs
non-ld/st time: 765.0ns
ld/st time / total time: 0.718

Time per ld/st for new,true,true,true,true: 2.538ns
Time taken for ld/st: 543.104ns
non-ld/st time: 427.896ns
ld/st time / total time: 0.559

Time per ld/st for new,false,false,false,true: 1.617ns
Time taken for ld/st: 346.02ns
non-ld/st time: 217.98ns
ld/st time / total time: 0.6134

ristretto. 1000 edwards add

Reference native time: 200ns. Slowdown for best time: 32x
insn count: 14507538
insn/ns for best time: 2.219

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time/op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 30497511ns | 30497.511ns |
| new | true | true | false | true | 9260928ns | 9260.928ns |
| new | false | false | false | true | 6535862ns | 6535.862ns |

ristretto. 1000 edwards mul

Reference native time: 60µs. Slowdown for best time: 28x.
comments: increased call depth to 40.
insn count: 3367152664
insn/ns for best time: 1.991

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time/op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 8132858551ns | 8.132ms |
| new | true | true | true | true | 2318464022ns | 2.318ms |
| new | false | false | false | true | 1690834376ns | 1.691ms |

My guess is that:

  1. WASM still does better for ld/st, which is a significant part of the workload; we can probably do 2x better.
  2. The other difference is mainly algorithmic, probably explaining about 5-10x. This means a hand-optimised implementation could get within 5-6x of native.

If a u128 primitive type were exposed that hand-optimised code could leverage, or offered as a capability to the LLVM codegen, one could probably get much closer to native: about 2x, given WASM-like ld/st overhead.

@Lichtso force-pushed the feature/address_translation_cleanup branch from 5e99071 to a45e9d2 on July 23, 2021 08:16
@Lichtso merged commit 7091677 into main Jul 23, 2021
@Lichtso deleted the feature/address_translation_cleanup branch July 23, 2021 11:42
@Lichtso changed the title from Feature/address translation cleanup to Improve Address Translation Performance without Breaking Changes Jul 23, 2021