Huge penalties for unaligned memory access like "*(int *)p" #695
Unaligned matching is used very frequently in our HTTP parser: https://github.com/tempesta-tech/tempesta/blob/master/tempesta_fw/http_parser.c#L344 . The parser is one of the most performance-critical pieces of Tempesta FW's code. Meanwhile, the Julius test gives the following numbers on my Skylake CPU:
The crucial performance role of the HTTP parser is confirmed by the test from http://natsys-lab.blogspot.ru/2018/03/http-requests-proxying.html on bare-metal servers.
After patch 5577d2a:
Hardware:
Load generator command:
Backend:
The original test is by @sysprg in tempesta-tech/tempesta#695. This version of the test confuses the BPU so that it cannot guess the alignment, which makes the test more realistic.
The original benchmark is invalid: all the functions use a fixed offset of 1 byte, so it is quite trivial for the BPU (branch prediction unit) to predict the branch, and we see almost no overhead from the alignment check. If we instead confuse the BPU by cycling through offsets of 1-3 bytes over 57 iterations (the Intel optimization manual suggests not to use loops longer than about 10 iterations if you want good prediction, so I used 57 iterations, not a power of 2, to confuse the BPU), the test shows a smaller difference, but it is still significant:
Note that for all the access patterns the benchmark calls functions residing in another compilation unit, so the compiler cannot optimize the access away. However, if we try different access patterns (see
This used a simplified alignment check with only one branch; the original one behaves even worse. Note that for the test I had to increase the loop counter and comment out the other tests, as well as to exit the request-line parsing function right after the method parsing, otherwise the difference is simply unobservable. This can be done with
Simple byte ORing shows somewhat better results, but is still worse than the unaligned access:
The aligned version is always slower than the unaligned one. However, it doesn't always produce more instruction cache misses, branch mispredictions, or other penalties caused by the branching. The binary code size is almost equal, but it is organized very differently, and performance sampling also differs significantly between the two versions. It seems GCC generates less efficient code in the sense of more expensive instructions, less efficient jumps, and so on. All in all, while an unaligned word access really is much slower than an aligned one, there is not much we can do in the real parser.
Interestingly enough, for the patch

```diff
--- a/int_align/alignment_f.c
+++ b/int_align/alignment_f.c
@@ -41,7 +41,7 @@ f5(void *buffer)
 {
 	unsigned char *b = (unsigned char *)buffer;
 	if ((unsigned long)b & 3)
-		return *b + ((unsigned int)b[1] << 8)
+		return (unsigned int)*b | ((unsigned int)b[1] << 8)
 			| ((unsigned int)b[2] << 16)
 			| ((unsigned int)b[3] << 24);
 	return *(unsigned int *)b;
```

the compiler generates

```
0000000000000070 <f5>:
  70:	8b 07                	mov    (%rdi),%eax
  72:	c3                   	retq
  73:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
  7a:	00 00 00 00
  7e:	66 90                	xchg   %ax,%ax
```

which is essentially just the unaligned access. An assembly version of the aligned access doesn't improve performance either:

```c
	unsigned int r;

	__asm__ __volatile__(
		"	test	$0x3, %1\n"
		"	jne	1f\n"
		"	mov	(%1), %0\n"
		"	jmp	2f\n"
		"1:	movzbl	(%1), %0\n"
		"	movzbl	0x1(%1), %%r8d\n"
		"	movzbl	0x2(%1), %%ecx\n"
		"	movzbl	0x3(%1), %%edx\n"
		"	shl	$8, %%r8d\n"
		"	shl	$16, %%ecx\n"
		"	shl	$24, %%edx\n"
		"	or	%%r8d, %0\n"
		"	or	%%ecx, %0\n"
		"	or	%%edx, %0\n"
		"2:\n"
		: "=a"(r)	/* "=A" (the edx:eax pair) would conflict
				 * with the edx clobber below */
		: "D"(p)
		: "cc", "r8", "ecx", "edx");
	return r;
```
If somewhere in our code we read data from memory in whole machine words (rather than byte by byte), then we need to check that the pointer we read from is always aligned to a machine word boundary, because Intel x86 imposes monstrous penalties for access to unaligned data.
This is especially true when a string is processed in a loop. Usually in such cases it is enough to process the first bytes of the string specially, until the pointer reaches the alignment boundary. You can use a simple loop over the leading bytes, or do something even trickier. But there is almost always a faster solution than unaligned memory access.
As an illustration of the huge penalties, see the test program attached to this task (compile it, for example, via
`gcc -Wall -Wextra -Wstrict-aliasing=5 -O3 -fstrict-aliasing alignment.c alignment_f.c`).
It is enough to run it and then look at the "alignment_f.c" file, where there are four different memory-read functions that do the same thing: they read 4 bytes from an address offset by one byte relative to the word boundary. But they do this with radically different running times.
The second file, "alignment.c", is just a driver that generates random garbage (to confuse the optimizer in the compiler or loader and prevent false results). The driver then calls these functions f1()...f4() in an absolutely identical way.
alignment.zip