GCC 10.2.1 Results #6

Open
victorstewart opened this issue Oct 18, 2020 · 6 comments

Comments

@victorstewart

gcc version 10.2.1 20201007 releases/gcc-10.2.0-350-g136256c32d (Clear Linux OS for Intel Architecture)

./FastMemcpy
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=48ms memcpy=35 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=34 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst aligned, src unalign): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=34 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=52 ms
result(dst unalign, src aligned): memcpy_fast=93ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=51 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=91ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=90ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=20 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=20 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=20 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=43ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=33 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=34 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=43 ms
result(dst aligned, src unalign): memcpy_fast=55ms memcpy=44 ms
result(dst unalign, src aligned): memcpy_fast=55ms memcpy=47 ms
result(dst unalign, src unalign): memcpy_fast=55ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=70 ms
result(dst aligned, src unalign): memcpy_fast=88ms memcpy=78 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=74 ms
result(dst unalign, src unalign): memcpy_fast=91ms memcpy=75 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=96ms memcpy=90 ms
result(dst aligned, src unalign): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=92 ms

benchmark random access:
memcpy_fast=802ms memcpy=662ms

./FastMemcpy_Avx
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=64ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=64ms memcpy=51 ms
result(dst unalign, src aligned): memcpy_fast=66ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=66ms memcpy=52 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=43ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=20ms memcpy=19 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=21 ms
result(dst unalign, src aligned): memcpy_fast=21ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=21ms memcpy=21 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=21ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=22ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=22ms memcpy=33 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=90ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=90ms memcpy=45 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=48 ms
result(dst unalign, src unalign): memcpy_fast=88ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=72 ms
result(dst aligned, src unalign): memcpy_fast=92ms memcpy=79 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=76 ms
result(dst unalign, src unalign): memcpy_fast=87ms memcpy=77 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst aligned, src unalign): memcpy_fast=98ms memcpy=92 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=95 ms

benchmark random access:
memcpy_fast=796ms memcpy=687ms

@alexey-milovidov

alexey-milovidov commented Nov 7, 2020

The benchmark is not quite correctly implemented for the following reasons:

  1. The compiler can easily constant-propagate the size parameter and then replace memcpy with a builtin for small sizes.
    The benchmark function should be marked as noinline. Beyond that, the "function cloning" optimization should be disabled.
  2. Testing only power-of-two sizes is not enough, because the processing of "tails" is not taken into account.
  3. When you use the original memcpy, the code from glibc is used. It is compiled separately by the OS maintainers and does not depend on your compiler. But it does depend on your machine (dynamic dispatch on the supported instruction set is performed), and you did not provide any information about your machine. Actually, it should be tested on a multitude of different CPUs.
  4. Testing in a loop with the same size is unrepresentative, because the branches will be predictable.
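
As an illustration of points 1, 2 and 4, here is a minimal sketch of harness helpers, not the benchmark that ships with FastMemcpy. All names (`NO_SPECIALIZE`, `fill_sizes`, `run_copies`, `copy_fn`) are hypothetical, and the attributes assume GCC or Clang. The copy loop is kept out of line, the sizes come from a small pseudo-random generator so that odd lengths and tails are exercised, and consecutive calls use different lengths so the branches are harder to predict. Timing is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper macro: keep the benchmarked loop out of line so the
   compiler cannot specialize it for a constant size. "noclone" is a GCC
   attribute; Clang does not accept it, so only "noinline" is used there. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SPECIALIZE __attribute__((noinline, noclone))
#else
#define NO_SPECIALIZE __attribute__((noinline))
#endif

/* Any function with memcpy's signature can be plugged in. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Small deterministic PRNG (xorshift32); the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill sizes[] with lengths in [1, max_size]. Most of them are not powers
   of two, and neighbouring entries differ. */
void fill_sizes(size_t *sizes, size_t count, size_t max_size, uint32_t seed)
{
    assert(max_size > 0);
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < count; i++)
        sizes[i] = 1 + xorshift32(&state) % max_size;
}

/* Perform one copy per entry of sizes[] and return the total bytes copied.
   The empty asm statement is a compiler barrier (GCC/Clang syntax) that
   keeps the copies from being optimized away. */
NO_SPECIALIZE
size_t run_copies(copy_fn fn, unsigned char *dst, const unsigned char *src,
                  const size_t *sizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        fn(dst, src, sizes[i]);
        __asm__ __volatile__("" : : "r"(dst) : "memory");
        total += sizes[i];
    }
    return total;
}
```

A caller would fill the size table once, then time `run_copies(memcpy, ...)` and `run_copies(memcpy_fast, ...)` over the same table, so both functions see an identical sequence of lengths.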

@alexey-milovidov

Bottom line: the library is probably OK, but the benchmark is nonsense.

@zhanglistar

> Bottom line: the library is probably OK, but the benchmark is nonsense.

Just post the right benchmark code?

@victorstewart
Author

victorstewart commented Nov 16, 2020

Another variable here is that it's a false assumption (at least, one I had myself) that standard libraries aren't using vector instructions.

I read some of the libc source code, and they use handwritten AVX2 for memcpy, memcmp and a few others when the architecture supports it. And I tested this on a machine that maxed out at AVX2 instructions, so that could easily explain these results.

(And they had comments in there that they don't implement AVX512 because they've experimented and determined that the frequency downgrade is detrimental to overall application performance.)
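
Since the results depend on which instruction sets the machine supports, it helps to record that alongside the numbers. The sketch below is a hypothetical helper (`widest_vector_isa` is not part of FastMemcpy or glibc) that reports the widest vector extension the CPU advertises, using the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86. It only describes the CPU; it does not show which memcpy variant glibc actually selected.

```c
/* Hypothetical helper: name the widest vector extension this CPU reports.
   Uses GCC/Clang x86 builtins; on other targets it returns "unknown". */
const char *widest_vector_isa(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* safe to call more than once */
    if (__builtin_cpu_supports("avx512f"))
        return "avx512f";
    if (__builtin_cpu_supports("avx2"))
        return "avx2";
    if (__builtin_cpu_supports("sse2"))
        return "sse2";
    return "scalar";
#else
    return "unknown";
#endif
}
```

Printing this string next to the compiler version at the top of a benchmark run would make posted results easier to compare across machines.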

Also, even if the benchmark might not be ideal, it's still legitimate and shows head-to-head performance in at least one subset of all possible implementations (whether it captures a realistic pattern or not, I don't know). But yeah, as @zhanglistar basically said, we'd all love to flip the tables on libc again!

@alexey-milovidov

alexey-milovidov commented Nov 17, 2020

I have run the ClickHouse performance test and can confirm that glibc's memcpy is better than FastMemcpy (at least on one machine):

https://clickhouse-test-reports.s3.yandex.net/17111/213266b80cbc1489b411929568bd9cc8c8173c8d/performance_comparison/report.html#fail1

Although the mean difference is very small: 0.5%.

The maximum speedup (that I'm confident in) is about 16%, on the following query:
SELECT count() FROM zeros(1000000) WHERE NOT ignore(materialize('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') AS s, concat(s,s,s,s,s,s,s,s,s,s) AS t, concat(t,t,t,t,t,t,t,t,t,t) AS u) SETTINGS max_block_size = 1000
that is very memcpy-heavy (see these concats).

We have to continue using custom memcpy instead of glibc's to maintain compatibility with old glibc.

@victorstewart
Author

Ice Lake is bringing us these goodies (less frequency downscaling)!

image
