Rewrite HTTP strings processing in assembler #1249
Conversation
1. remove tls/dummy_headers
2. replace dynamic table initialization with a static one
3. shrink table sizes
4. small assembly optimizations

Introduce test_debug_relax() to avoid debug message drops in unit tests built with debugging enabled.
The best time for the VM on my i7-6500U host is 80ms.
is suggested by the Intel optimization manual as a good practice). Actually, GCC does exactly the same: since -mavx implies -msse2avx, it generates VEX-prefixed instructions, i.e. the AVX versions of the SSE instructions. So according to the optimization manual there are no AVX-to-SSE transition penalties. I checked str_avx2.S as well as lib/str_simd.S, and neither of them uses plain SSE instructions. Other kernel SIMD code works with AVX (e.g. the crypto routines) and/or takes care of the vzeroupper call on its own. Since SIMD isn't so widely used in kernel code, there is no sense in following the recommended good practice. The best time for the str benchmark is 84ms, which is 5% worse than before, so I'll remove vzeroupper with the next patch.
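For readers unfamiliar with the convention being discussed, here is a hedged C intrinsics sketch of one 32-byte tolower step ending in the vzeroupper the Intel manual recommends. This is an illustration only, not the PR's assembly; the function name and block size are assumptions:

```c
#include <immintrin.h>

/* Illustration only: one 32-byte tolower step with AVX2 intrinsics.
 * Bytes in ['A', 'Z'] get the 0x20 case bit ORed in; everything
 * else is copied unchanged. */
__attribute__((target("avx2")))
static void tolower32(unsigned char *dst, const unsigned char *src)
{
	__m256i v  = _mm256_loadu_si256((const __m256i *)src);
	/* Signed per-byte range test: 'A' - 1 < b  and  b < 'Z' + 1. */
	__m256i gt = _mm256_cmpgt_epi8(v, _mm256_set1_epi8('A' - 1));
	__m256i lt = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), v);
	__m256i m  = _mm256_and_si256(gt, lt);
	__m256i d  = _mm256_and_si256(m, _mm256_set1_epi8(0x20));

	_mm256_storeu_si256((__m256i *)dst, _mm256_or_si256(v, d));
	/* Clear the upper YMM halves before returning to code that may
	 * use legacy SSE encodings, per the Intel optimization manual. */
	_mm256_zeroupper();
}
```

Since neither str_avx2.S nor lib/str_simd.S mixes in legacy-encoded SSE instructions, the zeroupper step here is exactly the part the comment above argues can be dropped.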
Overall looks good. Here are some minor comments:
	cmpq	$8, %rdx
	ja	.str2low_test_len128
	movq	.str2low_switch(,%rdx,8), %rax
	jmpq	*%rax
These indirect jumps make objtool generate a warning: "warning: objtool: __tfw_strtolower_avx2()+0x46: indirect jump found in RETPOLINE build". To suppress it, we need to either annotate these jumps like this:
diff --git a/tempesta_fw/str_avx2.S b/tempesta_fw/str_avx2.S
index 5f890fe6..df3e2e2f 100644
--- a/tempesta_fw/str_avx2.S
+++ b/tempesta_fw/str_avx2.S
@@ -24,6 +24,7 @@
#include <linux/linkage.h>
#include <asm/alternative-asm.h>
#include <asm/export.h>
+#include <asm/nospec-branch.h>
#define CASE 0x2020202020202020
@@ -476,6 +477,7 @@ ENTRY(__tfw_strtolower_avx2)
cmpq $8, %rdx
ja .str2low_test_len128
movq .str2low_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
@@ -653,6 +655,7 @@ ENTRY(__tfw_stricmp_avx2)
/* Process short strings below 8 bytes in length. */
movq .stricmp_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
@@ -1009,6 +1012,7 @@ ENTRY(__tfw_stricmp_avx2_2lc)
ja .sic2lc_short
movq .sic2lc_switch(,%rdx,8), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmpq *%rax
.section .rodata
.align 8
or enable Spectre mitigation:
diff --git a/tempesta_fw/str_avx2.S b/tempesta_fw/str_avx2.S
index 5f890fe6..d5d2334c 100644
--- a/tempesta_fw/str_avx2.S
+++ b/tempesta_fw/str_avx2.S
@@ -24,6 +24,7 @@
#include <linux/linkage.h>
#include <asm/alternative-asm.h>
#include <asm/export.h>
+#include <asm/nospec-branch.h>
#define CASE 0x2020202020202020
@@ -476,7 +477,7 @@ ENTRY(__tfw_strtolower_avx2)
cmpq $8, %rdx
ja .str2low_test_len128
movq .str2low_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.str2low_switch:
@@ -653,7 +654,7 @@ ENTRY(__tfw_stricmp_avx2)
/* Process short strings below 8 bytes in length. */
movq .stricmp_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.stricmp_switch:
@@ -1009,7 +1010,7 @@ ENTRY(__tfw_stricmp_avx2_2lc)
ja .sic2lc_short
movq .sic2lc_switch(,%rdx,8), %rax
- jmpq *%rax
+ JMP_NOSPEC %rax
.section .rodata
.align 8
.sic2lc_switch:
We check the jump array bounds against a constant encoded in the instruction opcode, so no cache miss can occur on the bounds check, which a Spectre attack requires. JMP_NOSPEC gives about a 15% performance impact in comparison with ANNOTATE_RETPOLINE_SAFE: 101ms vs 85ms in the tfw_str microbenchmark.
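The argument maps naturally onto C: a dense switch over a small, constant-bounded length is compiled to exactly the pattern under discussion. A hedged sketch (the function name is made up; this is not the PR's code):

```c
#include <ctype.h>
#include <stddef.h>

/* Illustration only: a dense switch on a constant-bounded length
 * compiles to "cmpq $8, %rdx; ja ...; jmpq *table(,%rdx,8)".
 * The bound 8 is an immediate in the cmpq opcode, so the bounds
 * check never waits on a memory load, leaving no long speculation
 * window for a Spectre-v1-style out-of-bounds table read. */
static void str2low_short(unsigned char *dst, const unsigned char *src,
			  size_t len)
{
	if (len > 8)
		len = 8;	/* cmpq $8, %rdx ; ja .str2low_test_len128 */
	switch (len) {		/* jmpq *.str2low_switch(,%rdx,8) */
	case 8: dst[7] = tolower(src[7]); /* fall through */
	case 7: dst[6] = tolower(src[6]); /* fall through */
	case 6: dst[5] = tolower(src[5]); /* fall through */
	case 5: dst[4] = tolower(src[4]); /* fall through */
	case 4: dst[3] = tolower(src[3]); /* fall through */
	case 3: dst[2] = tolower(src[2]); /* fall through */
	case 2: dst[1] = tolower(src[1]); /* fall through */
	case 1: dst[0] = tolower(src[0]); /* fall through */
	case 0: break;
	}
}
```

The fall-through cases mirror how the jump table enters the copy loop at the offset matching the string length.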
	movq	%rbx, %r11
	jmp	.str2low_test_len8

.str2low_tolower8:
Use of volatile in the C source code resulted in obviously redundant memory stores and loads. As far as I understand, the code below is equivalent to:
movq (%rsi,%r11), %rax
movq %r9, %mm2
movq %r8, %mm1
movq %rax, %mm0
movq %rax, %r12
psubb %mm2, %mm0
pcmpgtb %mm0, %mm1
movq %mm1, %rax
andq %rcx, %rax
orq %r12, %rax
movq %rax, (%rdi,%r11)
movslq %ebx, %rax
jmp .str2low_small_len
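For readers less fluent in MMX, a hedged scalar C sketch of what that 8-byte step computes per byte (the function name is made up; the SIMD version performs the same range test with a per-byte subtract and a signed compare, then ANDs the compare mask with the CASE constant 0x2020202020202020):

```c
#include <stddef.h>

/* Scalar equivalent of the psubb/pcmpgtb sequence above: for each of
 * the 8 bytes, OR in the 0x20 case bit iff the byte is in ['A', 'Z']. */
static void tolower8(unsigned char *dst, const unsigned char *src)
{
	size_t i;

	for (i = 0; i < 8; i++) {
		/* Unsigned wraparound lets one compare cover both ends
		 * of the range: (b - 'A') < 26 iff 'A' <= b <= 'Z'. */
		unsigned char b = src[i];
		dst[i] = b | ((unsigned char)(b - 'A') < 26 ? 0x20 : 0);
	}
}
```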
Use `req` notation for macro arguments. Use ANNOTATE_RETPOLINE_SAFE as we're safe against Spectre attacks. __tfw_strtolower_avx2() optimizations. Small cleanups.
Rewrite HTTP strings processing in assembler:
Introduce test_debug_relax() to avoid debug message drop on unit tests with debug.