
Huge penalties for unaligned memory access like "* (int *) p" #695

Closed
sysprg opened this issue Mar 9, 2017 · 4 comments

@sysprg commented Mar 9, 2017

If anywhere in our code we read data from memory in whole machine words (rather than individual bytes), we need to make sure that the pointer we read from is always aligned to the machine-word boundary, because Intel x86 has monstrous penalties for access to unaligned data.

This is especially true when a string is processed in a loop. Usually in such cases it is enough to process the first bytes of the string in a special way, until the pointer reaches the alignment boundary. You can use a dumb byte-by-byte loop for the leading bytes, or something trickier, but there is almost always a faster solution than unaligned memory access; a rough sketch is given below.
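
For example, a minimal sketch of the head/body/tail split (a hypothetical helper, not code from the attached archive):

#include <stddef.h>
#include <stdint.h>

/* Count non-zero bytes: byte-by-byte head until the pointer is word-aligned,
 * aligned whole-word reads in the body, byte-by-byte tail. */
size_t
count_nonzero_bytes(const unsigned char *p, size_t len)
{
	size_t n = 0;

	/* Head: leading bytes until p hits the machine-word boundary. */
	while (len && ((uintptr_t)p & (sizeof(unsigned long) - 1))) {
		n += *p++ != 0;
		--len;
	}

	/* Body: aligned whole-word reads. */
	while (len >= sizeof(unsigned long)) {
		unsigned long w = *(const unsigned long *)p;
		for (size_t i = 0; i < sizeof(w); ++i)
			n += ((w >> (i * 8)) & 0xff) != 0;
		p += sizeof(unsigned long);
		len -= sizeof(unsigned long);
	}

	/* Tail: whatever is left. */
	while (len--)
		n += *p++ != 0;

	return n;
}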

As an illustration of the huge penalties, see the test program attached to this issue. Compile it, for example, via:

gcc -Wall -Wextra -Wstrict-aliasing=5 -O3 -fstrict-aliasing alignment.c alignment_f.c

Just run it and then look at the "alignment_f.c" file, where there are four different memory-read functions that do the same thing - they read 4 bytes from an address offset by one byte from the word boundary - but with radically different running times.

The second file, "alignment.c", is just a driver that generates random garbage (to confuse the optimizer in the compiler or the loader and prevent false results) and then calls the functions f1()...f4() in an absolutely identical way.

alignment.zip
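
For reference, a rough sketch of the kind of access variants such a test compares (illustrative only, not the code from alignment.zip):

#include <stdint.h>

/* Plain unaligned word read. */
unsigned int
read_unaligned(const void *p)
{
	return *(const unsigned int *)p;
}

/* Checked read: aligned word load when possible, byte assembly otherwise. */
unsigned int
read_checked(const void *p)
{
	const unsigned char *b = p;

	if (!((uintptr_t)b & 3))
		return *(const unsigned int *)b;
	return (unsigned int)b[0]
		| ((unsigned int)b[1] << 8)
		| ((unsigned int)b[2] << 16)
		| ((unsigned int)b[3] << 24);
}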

sysprg added this to the 0.6 WebOS milestone Mar 9, 2017
@krizhanovsky (Contributor)

Unaligned matching is used very frequently in our HTTP parser: https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/http_parser.c#L344 . The parser is one of the most performance-critical pieces of Tempesta FW's code. Meanwhile, Julius's test gives the following numbers on my Skylake CPU:

    $ ./a.out 
    Unaligned access = 5.36163 (z: E94A350C)
    Aligned access = 1.70513 (z: E94A350C)
    Checked access = 2.01529 (z: E94A350C)
    Read four bytes = 2.10906 (z: E94A350C)
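
For context, the parser's trick is to compare several input bytes against a known token with a single (possibly unaligned) word load instead of a chain of byte compares. A minimal sketch of the idea (hypothetical helper, not the actual http_parser.c macros):

#include <stdbool.h>

/* Compare 4 input bytes against a 4-byte token with one word load each.
 * Both loads may be unaligned; x86 permits unaligned loads. */
static inline bool
match4(const char *p, const char *tok)
{
	return *(const unsigned int *)p == *(const unsigned int *)tok;
}

/* Usage: match4(data, "GET ") tests for the GET method in one compare. */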

keshonok changed the title Huge penalties for unaligned memory acceess like "* (int *) p" Huge penalties for unaligned memory access like "* (int *) p" Mar 10, 2017
krizhanovsky modified the milestones: backlog, 0.7 HTTP/2 Jan 9, 2018
krizhanovsky self-assigned this Jul 1, 2018
krizhanovsky modified the milestones: 1.0 Beta, 0.7 HTTP/2 Jul 15, 2018
@aleksostapenko (Contributor) commented Oct 2, 2018

The crucial performance role of the HTTP parser is confirmed by the test from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html on bare-metal servers. The perf top results are:
Before patch 5577d2a:

2.42%  [tempesta_fw]        [k] tfw_http_parse_resp
1.81%  [kernel]             [k] skb_release_data
1.26%  [kernel]             [k] tcp_ack
1.23%  [mlx4_en]            [k] mlx4_en_process_rx_cq
1.18%  [kernel]             [k] _raw_spin_lock
1.14%  [tempesta_fw]        [k] tfw_http_parse_req
1.11%  [kernel]             [k] __inet_lookup_established

After patch 5577d2a:

2.53%  [tempesta_fw]       [k] tfw_http_parse_resp
1.76%  [kernel]            [k] skb_release_data
1.30%  [mlx4_en]           [k] mlx4_en_process_rx_cq
1.28%  [kernel]            [k] tcp_ack
1.25%  [kernel]            [k] _raw_spin_lock
1.12%  [kernel]            [k] __inet_lookup_established
1.12%  [kernel]            [.] syscall_return_via_sysret

Hardware:

  1. Tempesta FW (and nginx backend) server: Intel Xeon E3-1240v5 (4 cores, 8 ht), Mellanox MT26448 (10Gbps) (8 RX, 8 TX queues), 32GB RAM;
  2. Load generator: Intel Xeon E5-1650v3 (6 cores, 12 ht), Mellanox MT26448 (10Gbps) (8 RX, 8 TX queues), 64GB RAM.

Load generator command:

./wrk -c 4096 -t 8 -d 30 http://192.168.0.1:80/

Tempesta FW configuration:

listen 192.168.0.1:80;
cache 0;
srv_group base {
        server 127.0.0.1:9090 conns_n=128;
}
vhost vh_base {
        proxy_pass base;
}
http_chain {
        -> vh_base;
}

Backend nginx configuration:

worker_processes auto;
worker_cpu_affinity auto;
events {
       worker_connections 65536;
       use                epoll;
       multi_accept       on;
       accept_mutex       off;
}
worker_rlimit_nofile   1000000;
http {
       keepalive_timeout  600;
       keepalive_requests 10000000;
       sendfile           on;
       tcp_nopush         on;
       tcp_nodelay        on;
       open_file_cache    max=1000 inactive=3600s;
       open_file_cache_valid 3600s;
       open_file_cache_min_uses 2;
       open_file_cache_errors off;
       error_log /dev/null emerg;
       access_log         off;
       server {
              listen 9090 backlog=131072 deferred reuseport fastopen=4096;
              location / { root /var/www/html; }
     }
}

The served index.html file is 3 bytes in size.

krizhanovsky added a commit to tempesta-tech/blog that referenced this issue Mar 9, 2019
Original test is by @sysprg
tempesta-tech/tempesta#695

This version of the test confuses the BPU so that it doesn't guess the alignment
and the test becomes more realistic.
@krizhanovsky (Contributor) commented Mar 9, 2019

The original benchmark is invalid: all the functions use a fixed offset of 1 byte, so it's quite trivial for the BPU to predict the branch and we see almost no overhead from the alignment checking. If we try to confuse the BPU with 57 iterations (the Intel optimization manual suggests keeping loops under roughly 10 iterations for good prediction, so I used 57, not a power of 2) over offsets of 1-3 bytes, the test shows a smaller, but still significant, difference:

$ ./int_align 
Unaligned access = 6.21043 (z: D2848518)
Checked access = 2.75292 (z: D2848518)
Checked access (simple) = 2.96297 (z: D2848518)
Read four bytes = 2.45077 (z: D2848518)
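
A rough sketch of what such a driver loop can look like, assuming a read function f_checked() compiled in a separate unit as in the original test (all names here are hypothetical):

#include <stdio.h>
#include <stdlib.h>

/* Assumed read function living in another compilation unit. */
extern unsigned int f_checked(void *p);

#define N_OFFS	57	/* deliberately neither small nor a power of 2 */

static unsigned char buf[64];

int
main(void)
{
	unsigned long offs[N_OFFS];
	unsigned int z = 0;

	/* Pseudo-random offsets 1..3 from the word boundary, so the branch
	 * predictor cannot simply learn a constant alignment. */
	for (int i = 0; i < N_OFFS; ++i)
		offs[i] = 1 + rand() % 3;

	for (unsigned long i = 0; i < 100000000UL; ++i)
		z ^= f_checked(buf + offs[i % N_OFFS]);

	printf("z: %X\n", z);
	return 0;
}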

Note that the benchmark, for all the access patterns, calls functions residing in another compilation unit, so the compiler cannot optimize the access away. However, if we try different access patterns (see the P() function in https://github.com/tempesta-tech/blog/blob/master/http_benchmark/http_goto.c), then there is no reason to place the function outside; instead it should live in the same compilation unit as the parser. I defined it as a function, not a macro, to let the compiler either inline it or keep it as a function. The results are the opposite:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	236ms
	goto_request_line:	234ms
	goto_request_line:	231ms
	goto_request_line:	231ms
	goto_request_line:	236ms

This run used a simplified alignment check with only one branch (a sketch is given below); the original check behaves even worse.
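
A sketch of what such a one-branch check can look like; the actual P() helper in http_goto.c may differ in details:

#include <stdint.h>

#ifndef likely
#define likely(x)	__builtin_expect(!!(x), 1)
#endif

/* One branch: aligned fast path, byte assembly otherwise. */
static inline unsigned int
P(const unsigned char *p)
{
	if (likely(!((uintptr_t)p & 3)))
		return *(const unsigned int *)p;
	return (unsigned int)p[0]
		| ((unsigned int)p[1] << 8)
		| ((unsigned int)p[2] << 16)
		| ((unsigned int)p[3] << 24);
}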

Note that for this test I had to increase the loop counter and comment out the other tests, as well as return from the request line parsing function right after the method parsing, otherwise the difference is simply unobservable. This can be done by building the test with the UNALIGNED=1 make command. The unaligned (i.e. current) access shows better results:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	215ms
	goto_request_line:	221ms
	goto_request_line:	219ms
	goto_request_line:	214ms
	goto_request_line:	214ms

Simple byte ORing shows somewhat better results than the checked access, but still worse than the unaligned one:

$ for i in `seq 1 5`; do taskset 0x2 ./http_benchmark ; done
	goto_request_line:	216ms
	goto_request_line:	219ms
	goto_request_line:	217ms
	goto_request_line:	219ms
	goto_request_line:	218ms

The aligned version is always slower than the unaligned one. However, it doesn't always produce more instruction cache misses, branch mispredictions, or other penalties caused by the branching. The binary code size is almost equal, but the code is organized very differently, and the performance sampling profiles also differ significantly between the two versions. It seems GCC generates less efficient code for the aligned version, in the sense of more expensive instructions, less efficient jumps, and so on.

All in all, while an unaligned word access by itself is indeed much slower than an aligned one, there is not much we can do about it in the real parser.

@krizhanovsky (Contributor) commented Mar 10, 2019

Interestingly enough, f5() in the benchmark was written inaccurately, and with the following patch

--- a/int_align/alignment_f.c
+++ b/int_align/alignment_f.c
@@ -41,7 +41,7 @@ f5(void *buffer)
 {
        unsigned char *b = (unsigned char *)buffer;
        if ((unsigned long)b & 3)
-               return *b + ((unsigned int)b[1] << 8)
+               return (unsigned int)*b | ((unsigned int)b[1] << 8)
                        | ((unsigned int)b[2] << 16)
                        | ((unsigned int)b[3] << 24);
        return *(unsigned int *)b;

the compiler generates

0000000000000070 <f5>:
  70:   8b 07                   mov    (%rdi),%eax
  72:   c3                      retq
  73:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  7a:   00 00 00 00 
  7e:   66 90                   xchg   %ax,%ax

which is essentially just the unaligned access.

An assembly version of the aligned access doesn't improve performance either:

	unsigned int r;
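	/* Read 4 bytes at p: a single word load when p is 4-byte aligned,
	 * otherwise assemble the word from four byte loads, shifts and ORs. */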
	__asm__ __volatile__(
		"	test $0x3, %1\n"
		"	jne 1f\n"
		"	mov (%1), %0\n"
		"	jmp 2f\n"
		"1:	movzbl (%1), %0\n"
		"	movzbl 0x1(%1), %%r8d\n"
		"	movzbl 0x2(%1), %%ecx\n"
		"	movzbl 0x3(%1), %%edx\n"
		"	shl $8, %%r8d\n"
		"	shl $16, %%ecx\n"
		"	shl $24, %%edx\n"
		"	or %%r8d, %0\n"
		"	or %%ecx, %0\n"
		"	or %%edx, %0\n"
		"2:\n"
	   : "=A"(r)
	   : "D"(p)
	   : "cc", "r8", "ecx", "edx");
	return r;
