Rewrite HTTP strings processing in assembler #1249
Conversation
1. remove tls/dummy_headers
2. replace dynamic table initialization with a static one
3. shrink table sizes
4. small assembly optimizations

Introduce test_debug_relax() to avoid debug message drops in unit tests built with debugging enabled.
The best time for the VM on my i7-6500U host is 80ms.
is suggested by the Intel optimization manual as a good practice). Actually, GCC does exactly the same: since -mavx implies -msse2avx, it generates VEX-prefixed instructions, i.e. the AVX versions of the SSE instructions. So according to the optimization manual there are no AVX-to-SSE transition penalties. I checked str_avx2.S as well as lib/str_simd.S, and neither of them uses plain SSE instructions. Other kernel SIMD code works with AVX (e.g. the crypto routines) and/or takes care of the vzeroupper call on its own. Since SIMD isn't so widely used in kernel code, there is no sense in following the recommended good practice. The best time for the str benchmark is 84ms, which is 5% worse than before, so I'll remove vzeroupper with the next patch.
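For readers unfamiliar with the convention being discussed, here is a hedged C intrinsics sketch of one 32-byte tolower step ending in the vzeroupper the Intel manual recommends. This is an illustration only, not the PR's assembly; the function name and block size are assumptions:

```c
#include <immintrin.h>

/* Illustration only: one 32-byte tolower step with AVX2 intrinsics.
 * Bytes in ['A', 'Z'] get the 0x20 case bit ORed in; everything
 * else is copied unchanged. */
__attribute__((target("avx2")))
static void tolower32(unsigned char *dst, const unsigned char *src)
{
	__m256i v  = _mm256_loadu_si256((const __m256i *)src);
	/* Signed per-byte range test: 'A' - 1 < b  and  b < 'Z' + 1. */
	__m256i gt = _mm256_cmpgt_epi8(v, _mm256_set1_epi8('A' - 1));
	__m256i lt = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), v);
	__m256i m  = _mm256_and_si256(gt, lt);
	__m256i d  = _mm256_and_si256(m, _mm256_set1_epi8(0x20));

	_mm256_storeu_si256((__m256i *)dst, _mm256_or_si256(v, d));
	/* Clear the upper YMM halves before returning to code that may
	 * use legacy SSE encodings, per the Intel optimization manual. */
	_mm256_zeroupper();
}
```

Since neither str_avx2.S nor lib/str_simd.S mixes in legacy-encoded SSE instructions, the zeroupper step here is exactly the part the comment above argues can be dropped.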
Overall looks good. Here are some minor comments:
	cmpq	$8, %rdx
	ja	.str2low_test_len128
	movq	.str2low_switch(,%rdx,8), %rax
	jmpq	*%rax
These indirect jumps make objtool generate a warning: "warning: objtool: __tfw_strtolower_avx2()+0x46: indirect jump found in RETPOLINE build". To suppress it, we need to either annotate these jumps like this:
diff --git a/tempesta_fw/str_avx2.S b/tempesta_fw/str_avx2.S
index 5f890fe6..df3e2e2f 100644
--- a/tempesta_fw/str_avx2.S
+++ b/tempesta_fw/str_avx2.S
@@ -24,6 +24,7 @@
#include <linux/linkage.h>
#include <asm/alternative-asm.h>
#include <asm/export.h>
+#include <asm/nospec-branch.h>
#define CASE 0x2020202020202020
@@ -476,6 +477,7 @@ ENTRY(__tfw_strtolower_avx2)
cmpq $8, %rdx
ja .str2low_test_len128
movq .str2low_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
@@ -653,6 +655,7 @@ ENTRY(__tfw_stricmp_avx2)
/* Process short strings below 8 bytes in length. */
movq .stricmp_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
@@ -1009,6 +1012,7 @@ ENTRY(__tfw_stricmp_avx2_2lc)
ja .sic2lc_short
movq .sic2lc_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
or enable Spectre mitigation:
diff --git a/tempesta_fw/str_avx2.S b/tempesta_fw/str_avx2.S
index 5f890fe6..d5d2334c 100644
--- a/tempesta_fw/str_avx2.S
+++ b/tempesta_fw/str_avx2.S
@@ -24,6 +24,7 @@
#include <linux/linkage.h>
#include <asm/alternative-asm.h>
#include <asm/export.h>
+#include <asm/nospec-branch.h>
#define CASE 0x2020202020202020
@@ -476,7 +477,7 @@ ENTRY(__tfw_strtolower_avx2)
cmpq $8, %rdx
ja .str2low_test_len128
movq .str2low_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.str2low_switch:
@@ -653,7 +654,7 @@ ENTRY(__tfw_stricmp_avx2)
/* Process short strings below 8 bytes in length. */
movq .stricmp_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.stricmp_switch:
@@ -1009,7 +1010,7 @@ ENTRY(__tfw_stricmp_avx2_2lc)
ja .sic2lc_short
movq .sic2lc_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.sic2lc_switch:
We check the jump array bounds against a constant encoded in the instruction opcode, so no cache miss can occur on the bounds check, which a Spectre attack requires. JMP_NOSPEC gives about a 15% performance impact in comparison with ANNOTATE_RETPOLINE_SAFE: 101ms vs 85ms in the tfw_str microbenchmark.
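The argument maps naturally onto C: a dense switch over a small, constant-bounded length is compiled to exactly the pattern under discussion. A hedged sketch (the function name is made up; this is not the PR's code):

```c
#include <ctype.h>
#include <stddef.h>

/* Illustration only: a dense switch on a constant-bounded length
 * compiles to "cmpq $8, %rdx; ja ...; jmpq *table(,%rdx,8)".
 * The bound 8 is an immediate in the cmpq opcode, so the bounds
 * check never waits on a memory load, leaving no long speculation
 * window for a Spectre-v1-style out-of-bounds table read. */
static void str2low_short(unsigned char *dst, const unsigned char *src,
			  size_t len)
{
	if (len > 8)
		len = 8;	/* cmpq $8, %rdx ; ja .str2low_test_len128 */
	switch (len) {		/* jmpq *.str2low_switch(,%rdx,8) */
	case 8: dst[7] = tolower(src[7]); /* fall through */
	case 7: dst[6] = tolower(src[6]); /* fall through */
	case 6: dst[5] = tolower(src[5]); /* fall through */
	case 5: dst[4] = tolower(src[4]); /* fall through */
	case 4: dst[3] = tolower(src[3]); /* fall through */
	case 3: dst[2] = tolower(src[2]); /* fall through */
	case 2: dst[1] = tolower(src[1]); /* fall through */
	case 1: dst[0] = tolower(src[0]); /* fall through */
	case 0: break;
	}
}
```

The fall-through cases mirror how the jump table enters the copy loop at the offset matching the string length.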
	movq	%rbx, %r11
	jmp	.str2low_test_len8

.str2low_tolower8:
Use of volatile in the C source code resulted in obviously redundant memory stores and loads. As far as I understand, the code below is equivalent to:
movq (%rsi,%r11), %rax
movq %r9, %mm2
movq %r8, %mm1
movq %rax, %mm0
movq %rax, %r12
psubb %mm2, %mm0
pcmpgtb %mm0, %mm1
movq %mm1, %rax
andq %rcx, %rax
orq %r12, %rax
movq %rax, (%rdi,%r11)
movslq %ebx, %rax
jmp .str2low_small_len
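For readers less fluent in MMX, a hedged scalar C sketch of what that 8-byte step computes per byte (the function name is made up; the SIMD version performs the same range test with a per-byte subtract and a signed compare, then ANDs the compare mask with the CASE constant 0x2020202020202020):

```c
#include <stddef.h>

/* Scalar equivalent of the psubb/pcmpgtb sequence above: for each of
 * the 8 bytes, OR in the 0x20 case bit iff the byte is in ['A', 'Z']. */
static void tolower8(unsigned char *dst, const unsigned char *src)
{
	size_t i;

	for (i = 0; i < 8; i++) {
		/* Unsigned wraparound lets one compare cover both ends
		 * of the range: (b - 'A') < 26 iff 'A' <= b <= 'Z'. */
		unsigned char b = src[i];
		dst[i] = b | ((unsigned char)(b - 'A') < 26 ? 0x20 : 0);
	}
}
```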
Use `req` notation for macro arguments. Use ANNOTATE_RETPOLINE_SAFE as we're safe against Spectre attacks. __tfw_strtolower_avx2() optimizations. Small cleanups.
Rewrite HTTP strings processing in assembler:
Introduce test_debug_relax() to avoid debug message drop on unit tests with debug.