
Improve Address Translation Performance without Breaking Changes #196

Merged: 9 commits merged into main from feature/address_translation_cleanup on Jul 23, 2021

Conversation

@Lichtso commented Jul 12, 2021

  • Concatenates all read-only sections (including the text section) into one
  • Requires exactly one heap section
  • This way there are always exactly 5 memory regions:
    • 0: NULL
    • 1: Program & Constants (read-only)
    • 2: Stack
    • 3: Heap
    • 4: Input
  • Enforces the memory regions' virtual addresses to be aligned.
  • Makes address translation constant-time by interpreting the upper half of a pointer as the region index (see the sketch after this list).
  • Replaces the Rust call in the JIT with a native x86 implementation.
  • Makes stack frame gaps in VM address space optional / configurable.
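
Because every region's virtual start address is aligned to a 4 GiB boundary, the region index can be read straight out of the pointer. A minimal sketch of the idea; the start-address constants are illustrative, derived from the fixed layout listed above rather than copied from the crate:

```rust
/// Fixed virtual memory layout: region i starts at i << 32.
/// Illustrative constants matching the region list above.
const MM_PROGRAM_START: u64 = 0x1_0000_0000; // region 1: program & constants
const MM_STACK_START: u64 = 0x2_0000_0000;   // region 2: stack
const MM_HEAP_START: u64 = 0x3_0000_0000;    // region 3: heap
const MM_INPUT_START: u64 = 0x4_0000_0000;   // region 4: input

/// The upper half of a VM pointer is the region index, turning the
/// lookup into a constant-time table access instead of a region scan.
fn region_index(vm_addr: u64) -> usize {
    (vm_addr >> 32) as usize
}

fn main() {
    assert_eq!(region_index(MM_PROGRAM_START), 1);
    assert_eq!(region_index(MM_STACK_START), 2);
    assert_eq!(region_index(MM_HEAP_START), 3);
    assert_eq!(region_index(MM_INPUT_START), 4);
}
```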

@Lichtso added the bug and enhancement labels Jul 12, 2021
@Lichtso force-pushed the feature/address_translation_cleanup branch 2 times, most recently from 497c274 to 957e295 on July 14, 2021 19:25
@Lichtso (author) commented Jul 15, 2021

This is the code that is now generated for the address translation of a store (write):

pushq  %r11
pushq  %rax
pushq  %rcx
    pushq  %rdx
movq   %r11, %rax
shrq   $0x20, %rax
cmpq   %rax, -0x4da80037(%r10)
jbe    0x100d018e5
shlq   $0x5, %rax
addq   -0x4da8003f(%r10), %rax
cmpl   $0x0, 0x19(%rax)
je     0x100d018e5
movabsq $0xffffffff, %rcx
andq   %rcx, %r11
    movzbl 0x18(%rax), %ecx
    movq   %r11, %rdx
    shrq   %cl, %rdx
    testq  $0x1, %rdx
    jne    0x100d018e5
    movq   $-0x1, %rdx
    shlq   %cl, %rdx
    movq   %rdx, %rcx
    notq   %rcx
    andq   %r11, %rcx
    andq   %rdx, %r11
    shrq   $0x1, %r11
    orq    %rcx, %r11
leaq   0x8(%r11), %rcx
cmpl   %ecx, 0x10(%rax)
jb     0x100d018e5
addq   (%rax), %r11
    popq   %rdx
popq   %rcx
popq   %rax
addq   $0x8, %rsp
retq   

The lines indented to the right are optional and can be turned off by setting `config.enable_stack_frame_gaps = false`.
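
For readers who prefer Rust over x86, here is a minimal sketch of what the generated code above does; the `Region` struct, its field names, and `translate_store` are illustrative assumptions (annotated with the corresponding operands from the disassembly), not the crate's exact types:

```rust
/// Illustrative region descriptor; the offset comments refer to the
/// disassembly above. Not rbpf's exact memory region layout.
struct Region {
    host_addr: u64,    // (%rax):      host base address of the region
    len: u64,          // 0x10(%rax):  region length in bytes
    gap_shift: u8,     // 0x18(%rax):  position of the stack frame gap bit
    is_writable: bool, // 0x19(%rax):  must be non-zero for a store
}

/// Mirrors the store translation above for an 8-byte access.
fn translate_store(regions: &[Region], vm_addr: u64) -> Option<u64> {
    let index = (vm_addr >> 32) as usize;      // shrq $0x20, %rax
    let region = regions.get(index)?;          // cmpq/jbe check + shlq $0x5/addq lookup
    if !region.is_writable {                   // cmpl $0x0, 0x19(%rax); je fail
        return None;
    }
    let offset = vm_addr & 0xffff_ffff;        // movabsq $0xffffffff; andq
    // Optional stack frame gap logic (the indented lines). A region
    // without gaps presumably uses a gap_shift whose bit is never set.
    if (offset >> region.gap_shift) & 1 != 0 { // shrq %cl; testq $0x1; jne fail
        return None;                           // the address falls inside a gap
    }
    let mask = u64::MAX << region.gap_shift;   // movq $-0x1; shlq %cl
    // Squeeze the gap bit out so the stack frames become contiguous.
    let offset = ((offset & mask) >> 1) | (offset & !mask);
    if offset + 8 > region.len {               // leaq 0x8(%r11); cmpl; jb fail
        return None;
    }
    Some(region.host_addr + offset)            // addq (%rax), %r11
}

fn main() {
    // One illustrative writable region with 4 KiB frames separated by gaps.
    let regions = [Region { host_addr: 0x10_0000, len: 64 * 1024, gap_shift: 12, is_writable: true }];
    assert_eq!(translate_store(&regions, 0x100), Some(0x10_0100));
}
```

To disable the gap handling (and drop the indented instructions from the emitted code), something like `Config { enable_stack_frame_gaps: false, ..Config::default() }` should work, assuming `Config` implements `Default`.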

@jon-chuang commented Jul 17, 2021

rbpf systematic benchmarks

memcpy (builtins). size: 15 (8 ld/st pairs), iter: 1 << 16

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time per ld/st |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 13325998ns | 12.708ns |
| new | true | true | true | true | 4532151ns | 4.322ns |
| new | true | true | false | true | 3151466ns | 3.005ns |
| new | false | false | false | true | 2295472ns | 2.189ns |

ristretto. 1000 field mul

Reference native time: 20ns. Slowdown for best time: 28.2x.
insn count: 1455581
insn/ns for best time: 2.580

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time per op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 2741706ns | 2741.706ns |
| new | true | true | true | true | 1081769ns | 1081.769ns |
| new | true | true | false | true | 971221ns | 971.221ns |
| new | false | false | true | true | 728362ns | 728.362ns |
| new | false | false | false | true | 564458ns | 564.458ns |

Analysis for Field Mul

Using the following two pieces of code to measure the ld/st time without loop overhead (the first loop contains one ld/st pair per iteration, the second is identical without it):

    and r1, 64917
    mov r3, 4
    lsh r3, 32
    add r1, r3
    ldxdw r3, [r1]
    stxdw [r1], r1
    add r1, r3
    add r2, 1
    jlt r2, 100000, -9
    exit

    and r1, 64917
    mov r3, 4
    lsh r3, 32
    add r1, r3
    add r1, r3
    add r2, 1
    jlt r2, 100000, -7
    exit
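
The two loops are identical except for the `ldxdw`/`stxdw` pair, so their runtime difference divided by the iteration count isolates the cost of one ld/st pair while the shared loop overhead cancels out. A minimal sketch of that arithmetic (the function and the raw timings in `main` are hypothetical, chosen to reproduce the 9.112ns figure below):

```rust
/// Both loops execute the same bookkeeping instructions; the first adds
/// one ld/st pair per iteration. Differencing cancels the shared overhead.
fn time_per_ldst_pair_ns(with_ldst_ns: f64, without_ldst_ns: f64, iterations: f64) -> f64 {
    (with_ldst_ns - without_ldst_ns) / iterations
}

fn main() {
    // Hypothetical raw timings for the two 100000-iteration loops above.
    let per_pair = time_per_ldst_pair_ns(1_500_000.0, 588_800.0, 100_000.0);
    println!("{per_pair:.3} ns per ld/st pair"); // 9.112 ns per ld/st pair
}
```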

we obtain:

Time per ld/st for old,true,true,true,true: 9.112ns
Time per ld/st for new,true,true,true,true: 2.538ns
Time per ld/st for new,false,false,false,true: 1.617ns
Total insn count: 1455581
Total ld/st pairs/op: 213.989

Time per ld/st for old,true,true,true,true: 9.112ns
Time taken for ld/st: 1.949µs
non-ld/st time: 765.0ns
ld/st time / total time: 0.718

Time per ld/st for new,true,true,true,true: 2.538ns
Time taken for ld/st: 543.104ns
non-ld/st time: 427.896ns
ld/st time / total time: 0.559

Time per ld/st for new,false,false,false,true: 1.617ns
Time taken for ld/st: 346.02ns
non-ld/st time: 217.98ns
ld/st time / total time: 0.6134

ristretto. 1000 edwards add

Reference native time: 200ns. Slowdown for best time: 32x
insn count: 14507538
insn/ns for best time: 2.219

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time/op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 30497511ns | 30497.511ns |
| new | true | true | false | true | 9260928ns | 9260.928ns |
| new | false | false | false | true | 6535862ns | 6535.862ns |

ristretto. 1000 edwards mul

Reference native time: 60µs. Slowdown for best time: 28x.
comments: increased call depth to 40.
insn count: 3367152664
insn/ns for best time: 1.991

| addr trans | enc reg | sanitise imm | gap shift | instr meter | time | time/op |
| --- | --- | --- | --- | --- | --- | --- |
| old | true | true | true | true | 8132858551ns | 8.132ms |
| new | true | true | true | true | 2318464022ns | 2.318ms |
| new | false | false | false | true | 1690834376ns | 1.691ms |

My guess is that:

  1. WASM still does better for ld/st, which is a significant part of the workload; we can probably do 2x better.
  2. The other difference is mainly algorithmic, probably explaining about 5-10x. This means a hand-optimised implementation could get within 5-6x of native.

If a u128 primitive type were exposed that hand-optimised code could leverage, or offered as a capability to the LLVM codegen, one could probably get much closer to native: about 2x, given WASM-like ld/st overhead.

@Lichtso force-pushed the feature/address_translation_cleanup branch from 5e99071 to a45e9d2 on July 23, 2021 08:16
@Lichtso merged commit 7091677 into main Jul 23, 2021
@Lichtso deleted the feature/address_translation_cleanup branch July 23, 2021 11:42
@Lichtso changed the title from Feature/address translation cleanup to Improve Address Translation Performance without Breaking Changes Jul 23, 2021