
Huge penalties for unaligned memory access like "* (int *) p" #695

Closed
sysprg opened this issue Mar 9, 2017 · 4 comments

@sysprg commented Mar 9, 2017

If anywhere in our code we read data from memory in whole machine words (rather than individual bytes), we need to make sure that the pointer we read from is always aligned to the machine-word boundary, because Intel x86 has monstrous penalties for access to unaligned data.

This is especially true when a string is processed in a loop. Usually in such cases it is enough to process the first bytes of the string in a special way, until the pointer reaches the alignment boundary. You can use a dumb byte-by-byte loop for the leading bytes, or something trickier, but there is almost always a faster solution than unaligned memory access; a rough sketch is given below.
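
For example, a minimal sketch of the head/body/tail split (a hypothetical helper, not code from the attached archive):

#include <stddef.h>
#include <stdint.h>

/* Count non-zero bytes: byte-by-byte head until the pointer is word-aligned,
 * aligned whole-word reads in the body, byte-by-byte tail. */
size_t
count_nonzero_bytes(const unsigned char *p, size_t len)
{
	size_t n = 0;

	/* Head: leading bytes until p hits the machine-word boundary. */
	while (len && ((uintptr_t)p & (sizeof(unsigned long) - 1))) {
		n += *p++ != 0;
		--len;
	}

	/* Body: aligned whole-word reads. */
	while (len >= sizeof(unsigned long)) {
		unsigned long w = *(const unsigned long *)p;
		for (size_t i = 0; i < sizeof(w); ++i)
			n += ((w >> (i * 8)) & 0xff) != 0;
		p += sizeof(unsigned long);
		len -= sizeof(unsigned long);
	}

	/* Tail: whatever is left. */
	while (len--)
		n += *p++ != 0;

	return n;
}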

As an illustration of the huge penalties, see the test program attached to this issue. Compile it, for example, via:

gcc -Wall -Wextra -Wstrict-aliasing=5 -O3 -fstrict-aliasing alignment.c alignment_f.c

Just run it and then look at the "alignment_f.c" file, where there are four different memory-read functions that do the same thing - they read 4 bytes from an address offset by one byte from the word boundary - but with radically different running times.

The second file, "alignment.c", is just a driver that generates random garbage (to confuse the optimizer in the compiler or the loader and prevent false results) and then calls the functions f1()...f4() in an absolutely identical way.

alignment.zip
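
For reference, a rough sketch of the kind of access variants such a test compares (illustrative only, not the code from alignment.zip):

#include <stdint.h>

/* Plain unaligned word read. */
unsigned int
read_unaligned(const void *p)
{
	return *(const unsigned int *)p;
}

/* Checked read: aligned word load when possible, byte assembly otherwise. */
unsigned int
read_checked(const void *p)
{
	const unsigned char *b = p;

	if (!((uintptr_t)b & 3))
		return *(const unsigned int *)b;
	return (unsigned int)b[0]
		| ((unsigned int)b[1] << 8)
		| ((unsigned int)b[2] << 16)
		| ((unsigned int)b[3] << 24);
}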

sysprg added this to the 0.6 WebOS milestone Mar 9, 2017
@krizhanovsky (Contributor)

Unaligned matching is used very frequently in our HTTP parser: https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/http_parser.c#L344 . The parser is one of the most performance-critical pieces of Tempesta FW's code. Meanwhile, Julius's test gives the following numbers on my Skylake CPU:

    $ ./a.out 
    Unaligned access = 5.36163 (z: E94A350C)
    Aligned access = 1.70513 (z: E94A350C)
    Checked access = 2.01529 (z: E94A350C)
    Read four bytes = 2.10906 (z: E94A350C)
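
For context, the parser's trick is to compare several input bytes against a known token with a single (possibly unaligned) word load instead of a chain of byte compares. A minimal sketch of the idea (hypothetical helper, not the actual http_parser.c macros):

#include <stdbool.h>

/* Compare 4 input bytes against a 4-byte token with one word load each.
 * Both loads may be unaligned; x86 permits unaligned loads. */
static inline bool
match4(const char *p, const char *tok)
{
	return *(const unsigned int *)p == *(const unsigned int *)tok;
}

/* Usage: match4(data, "GET ") tests for the GET method in one compare. */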

keshonok changed the title Huge penalties for unaligned memory acceess like "* (int *) p" Huge penalties for unaligned memory access like "* (int *) p" Mar 10, 2017
krizhanovsky modified the milestones: backlog, 0.7 HTTP/2 Jan 9, 2018
krizhanovsky self-assigned this Jul 1, 2018
krizhanovsky modified the milestones: 1.0 Beta, 0.7 HTTP/2 Jul 15, 2018
@aleksostapenko (Contributor) commented Oct 2, 2018

The crucial performance role of the HTTP parser is confirmed by the test from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html on bare-metal servers. The perf top results are:
Before patch 5577d2a:

2.42%  [tempesta_fw]        [k] tfw_http_parse_resp
1.81%  [kernel]             [k] skb_release_data
1.26%  [kernel]             [k] tcp_ack
1.23%  [mlx4_en]            [k] mlx4_en_process_rx_cq
1.18%  [kernel]             [k] _raw_spin_lock
1.14%  [tempesta_fw]        [k] tfw_http_parse_req
1.11%  [kernel]             [k] __inet_lookup_established

After patch 5577d2a:

2.53%  [tempesta_fw]       [k] tfw_http_parse_resp
1.76%  [kernel]            [k] skb_release_data
1.30%  [mlx4_en]           [k] mlx4_en_process_rx_cq
1.28%  [kernel]            [k] tcp_ack
1.25%  [kernel]            [k] _raw_spin_lock
1.12%  [kernel]            [k] __inet_lookup_established
1.12%  [kernel]            [.] syscall_return_via_sysret

Hardware:

  1. Tempesta FW (and nginx backend) server: Intel Xeon E3-1240v5 (4 cores, 8 ht), Mellanox MT26448 (10Gbps) (8 RX, 8 TX queues), 32GB RAM;
  2. Load generator: Intel Xeon E5-1650v3 (6 cores, 12 ht), Mellanox MT26448 (10Gbps) (8 RX, 8 TX queues), 64GB RAM.

Load generator command:

./wrk -c 4096 -t 8 -d 30 http://192.168.0.1:80/

Tempesta FW configuration:

listen 192.168.0.1:80;
cache 0;
srv_group base {
        server 127.0.0.1:9090 conns_n=128;
}
vhost vh_base {
        proxy_pass base;
}
http_chain {
        -> vh_base;
}

Backend nginx configuration:

worker_processes auto;
worker_cpu_affinity auto;
events {
       worker_connections 65536;
       use                epoll;
       multi_accept       on;
       accept_mutex       off;
}
worker_rlimit_nofile   1000000;
http {
       keepalive_timeout  600;
       keepalive_requests 10000000;
       sendfile           on;
       tcp_nopush         on;
       tcp_nodelay        on;
       open_file_cache    max=1000 inactive=3600s;
       open_file_cache_valid 3600s;
       open_file_cache_min_uses 2;
       open_file_cache_errors off;
       error_log /dev/null emerg;
       access_log         off;
       server {
              listen 9090 backlog=131072 deferred reuseport fastopen=4096;
              location / { root /var/www/html; }
     }
}

The served index.html file is 3 bytes in size.

krizhanovsky added a commit to tempesta-tech/blog that referenced this issue Mar 9, 2019
Original test is by @sysprg
tempesta-tech/tempesta#695

This version of the test confuses the BPU so that it doesn't guess the alignment
and the test becomes more realistic.
@krizhanovsky (Contributor) commented Mar 9, 2019

The original benchmark is invalid: all the functions use a fixed offset of 1 byte, so it's quite trivial for the BPU to predict the branch and we see almost no overhead from the alignment checking. If we try to confuse the BPU with 57 iterations (the Intel optimization manual suggests keeping loops under roughly 10 iterations for good prediction, so I used 57, not a power of 2) over offsets of 1-3 bytes, the test shows a smaller, but still significant, difference:

$ ./int_align 
Unaligned access = 6.21043 (z: D2848518)
Checked access = 2.75292 (z: D2848518)
Checked access (simple) = 2.96297 (z: D2848518)
Read four bytes = 2.45077 (z: D2848518)
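
A rough sketch of what such a driver loop can look like, assuming a read function f_checked() compiled in a separate unit as in the original test (all names here are hypothetical):

#include <stdio.h>
#include <stdlib.h>

/* Assumed read function living in another compilation unit. */
extern unsigned int f_checked(void *p);

#define N_OFFS	57	/* deliberately neither small nor a power of 2 */

static unsigned char buf[64];

int
main(void)
{
	unsigned long offs[N_OFFS];
	unsigned int z = 0;

	/* Pseudo-random offsets 1..3 from the word boundary, so the branch
	 * predictor cannot simply learn a constant alignment. */
	for (int i = 0; i < N_OFFS; ++i)
		offs[i] = 1 + rand() % 3;

	for (unsigned long i = 0; i < 100000000UL; ++i)
		z ^= f_checked(buf + offs[i % N_OFFS]);

	printf("z: %X\n", z);
	return 0;
}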

Note that the benchmark, for all the access patterns, calls functions residing in another compilation unit, so the compiler cannot optimize the access away. However, if we try different access patterns (see the P() function in https://github.com/tempesta-tech/blog/blob/master/http_benchmark/http_goto.c), then there is no reason to place the function outside; instead it should live in the same compilation unit as the parser. I defined it as a function, not a macro, to let the compiler either inline it or keep it as a function. The results are the opposite:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	236ms
	goto_request_line:	234ms
	goto_request_line:	231ms
	goto_request_line:	231ms
	goto_request_line:	236ms

This run used a simplified alignment check with only one branch (a sketch is given below); the original check behaves even worse.
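
A sketch of what such a one-branch check can look like; the actual P() helper in http_goto.c may differ in details:

#include <stdint.h>

#ifndef likely
#define likely(x)	__builtin_expect(!!(x), 1)
#endif

/* One branch: aligned fast path, byte assembly otherwise. */
static inline unsigned int
P(const unsigned char *p)
{
	if (likely(!((uintptr_t)p & 3)))
		return *(const unsigned int *)p;
	return (unsigned int)p[0]
		| ((unsigned int)p[1] << 8)
		| ((unsigned int)p[2] << 16)
		| ((unsigned int)p[3] << 24);
}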

Note that for this test I had to increase the loop counter and comment out the other tests, as well as return from the request line parsing function right after the method parsing, otherwise the difference is simply unobservable. This can be done by building the test with the UNALIGNED=1 make command. The unaligned (i.e. current) access shows better results:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	215ms
	goto_request_line:	221ms
	goto_request_line:	219ms
	goto_request_line:	214ms
	goto_request_line:	214ms

Simple byte ORing shows somewhat better results than the checked access, but still worse than the unaligned one:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	216ms
	goto_request_line:	219ms
	goto_request_line:	217ms
	goto_request_line:	219ms
	goto_request_line:	218ms

The aligned version is always slower than the unaligned one. However, it doesn't always produce more instruction cache misses, branch mispredictions, or other penalties caused by the branching. The binary code size is almost equal, but the code is organized very differently, and the performance sampling profiles also differ significantly between the two versions. It seems GCC generates less efficient code for the aligned version, in the sense of more expensive instructions, less efficient jumps, and so on.

All in all, while an unaligned word access by itself is indeed much slower than an aligned one, there is not much we can do about it in the real parser.

@krizhanovsky (Contributor) commented Mar 10, 2019

Interestingly enough, f5() in the benchmark was written inaccurately, and with the following patch

--- a/int_align/alignment_f.c
+++ b/int_align/alignment_f.c
@@ -41,7 +41,7 @@ f5(void *buffer)
 {
        unsigned char *b = (unsigned char *)buffer;
        if ((unsigned long)b & 3)
-               return *b + ((unsigned int)b[1] << 8)
+               return (unsigned int)*b | ((unsigned int)b[1] << 8)
                        | ((unsigned int)b[2] << 16)
                        | ((unsigned int)b[3] << 24);
        return *(unsigned int *)b;

the compiler generates

0000000000000070 <f5>:
  70:   8b 07                   mov    (%rdi),%eax
  72:   c3                      retq
  73:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  7a:   00 00 00 00 
  7e:   66 90                   xchg   %ax,%ax

which is essentially just the unaligned access.

An assembly version of the aligned access doesn't improve performance either:

	unsigned int r;
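	/* Read 4 bytes at p: a single word load when p is 4-byte aligned,
	 * otherwise assemble the word from four byte loads, shifts and ORs. */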
	__asm__ __volatile__(
		"	test $0x3, %1\n"
		"	jne 1f\n"
		"	mov (%1), %0\n"
		"	jmp 2f\n"
		"1:	movzbl (%1), %0\n"
		"	movzbl 0x1(%1), %%r8d\n"
		"	movzbl 0x2(%1), %%ecx\n"
		"	movzbl 0x3(%1), %%edx\n"
		"	shl $8, %%r8d\n"
		"	shl $16, %%ecx\n"
		"	shl $24, %%edx\n"
		"	or %%r8d, %0\n"
		"	or %%ecx, %0\n"
		"	or %%edx, %0\n"
		"2:\n"
	   : "=A"(r)
	   : "D"(p)
	   : "cc", "r8", "ecx", "edx");
	return r;
