GCC 10.2.1 Results #6

victorstewart · 2020-10-18T09:30:05Z

gcc version 10.2.1 20201007 releases/gcc-10.2.0-350-g136256c32d (Clear Linux OS for Intel Architecture)

./FastMemcpy
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=48ms memcpy=35 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=34 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst aligned, src unalign): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=34 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=52 ms
result(dst unalign, src aligned): memcpy_fast=93ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=51 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=91ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=90ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=20 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=20 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=20 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=43ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=33 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=34 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=43 ms
result(dst aligned, src unalign): memcpy_fast=55ms memcpy=44 ms
result(dst unalign, src aligned): memcpy_fast=55ms memcpy=47 ms
result(dst unalign, src unalign): memcpy_fast=55ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=70 ms
result(dst aligned, src unalign): memcpy_fast=88ms memcpy=78 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=74 ms
result(dst unalign, src unalign): memcpy_fast=91ms memcpy=75 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=96ms memcpy=90 ms
result(dst aligned, src unalign): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=92 ms

benchmark random access:
memcpy_fast=802ms memcpy=662ms

./FastMemcpy_Avx
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=64ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=64ms memcpy=51 ms
result(dst unalign, src aligned): memcpy_fast=66ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=66ms memcpy=52 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=43ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=20ms memcpy=19 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=21 ms
result(dst unalign, src aligned): memcpy_fast=21ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=21ms memcpy=21 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=21ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=22ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=22ms memcpy=33 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=90ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=90ms memcpy=45 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=48 ms
result(dst unalign, src unalign): memcpy_fast=88ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=72 ms
result(dst aligned, src unalign): memcpy_fast=92ms memcpy=79 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=76 ms
result(dst unalign, src unalign): memcpy_fast=87ms memcpy=77 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst aligned, src unalign): memcpy_fast=98ms memcpy=92 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=95 ms

benchmark random access:
memcpy_fast=796ms memcpy=687ms
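The raw millisecond figures are easier to compare across sizes when converted to throughput. The helper below is a small sketch of that arithmetic. It assumes each result line reports the wall time for `times` copies of `size` bytes, which the benchmark headers suggest but this thread does not confirm; the function name `throughput_gbps` is made up for illustration.

```c
#include <stddef.h>

/* Convert one benchmark line to GB/s (1 GB = 1e9 bytes).
   Assumption: `ms` is the time for `times` copies of `size` bytes. */
static double throughput_gbps(size_t size, size_t times, double ms) {
    double total_bytes = (double)size * (double)times;
    double seconds = ms / 1000.0;
    return total_bytes / seconds / 1e9;
}
```

For the first line of the first run, 32 bytes × 16777216 copies is 536870912 bytes, so `throughput_gbps(32, 16777216, 35.0)` gives roughly 15.3 GB/s for memcpy and `throughput_gbps(32, 16777216, 48.0)` roughly 11.2 GB/s for memcpy_fast. The timings are whole milliseconds, so the small-size figures carry a few percent of rounding error.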


alexey-milovidov · 2020-11-07T22:06:56Z

The benchmark is not implemented quite correctly, for the following reasons:

1. The compiler can easily constant-propagate the size parameter and then replace memcpy with a builtin for small sizes.
2. The benchmark function should be marked noinline. Moreover, the "function cloning" optimization should be disabled.
3. Testing only power-of-two sizes is not enough, because "tail" processing is not taken into account.
4. When you use the original memcpy, the code from glibc is used. It is compiled separately by the OS maintainers and does not depend on your compiler, but it does depend on your machine (it dispatches dynamically on the supported instruction set). You did not provide any information about your machine. Really, it should be tested on a multitude of different CPUs.
5. Testing in a loop with the same size is unrepresentative, because the branches will be predictable.
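A rough sketch of a harness that addresses the points above might look like the following. It is not the FastMemcpy benchmark and not a finished tool: the names `bench_copy`, `fill_sizes` and `copy_fn`, the buffer size and the table length are all made up for illustration, and `noinline`/`noclone` are GCC-specific attributes. The copy routine is passed as a function pointer and the sizes come from a pseudo-random table, so the size is not a compile-time constant and is not restricted to powers of two.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Same signature as memcpy, so any candidate routine can be plugged in. */
typedef void *(*copy_fn)(void *, const void *, size_t);

enum { BUF_SIZE = 1 << 16, NUM_SIZES = 4096 };

static unsigned char src_buf[BUF_SIZE];
static unsigned char dst_buf[BUF_SIZE];
static size_t sizes[NUM_SIZES];

/* Small xorshift generator so the size sequence is reproducible. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill the table with sizes in [1, max_size], not just powers of two. */
static void fill_sizes(size_t max_size, uint32_t seed) {
    assert(seed != 0);
    assert(max_size >= 1 && max_size <= BUF_SIZE);
    for (size_t i = 0; i < NUM_SIZES; i++)
        sizes[i] = 1 + xorshift32(&seed) % max_size;
}

/* noinline + noclone ask GCC not to specialize this function for one
   particular copy routine or a constant size. Returns elapsed ms. */
__attribute__((noinline, noclone))
static double bench_copy(copy_fn fn, size_t rounds) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t r = 0; r < rounds; r++)
        for (size_t i = 0; i < NUM_SIZES; i++)
            fn(dst_buf, src_buf, sizes[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

A caller would run something like `fill_sizes(1000, 12345u)` followed by `bench_copy(memcpy, 10000)` and the same call with the routine under test. The size table repeats every 4096 copies, so a branch predictor may still learn part of the pattern; a longer table or real traces from an application would be closer to realistic use.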

alexey-milovidov · 2020-11-07T22:08:19Z

Bottom line: the library is probably OK, but the benchmark is nonsense.

zhanglistar · 2020-11-16T07:09:30Z

> Bottom line: the library is probably OK, but the benchmark is nonsense.

Could you just post the right benchmark code?

victorstewart · 2020-11-16T16:05:03Z

Another variable here is the false assumption (at least one I had myself) that standard libraries aren't using vector instructions.

I read some of the libc source code, and they use handwritten AVX2 for memcpy, memcmp and a few others when the architecture supports it. And I tested this on a machine that maxed out at AVX2 instructions. So that could easily explain these results.

(And they had comments in there that they don't implement AVX512 because they've experimented and determined that the frequency downgrade is detrimental to overall application performance.)
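Since the results depend on which instruction sets the machine supports, it helps to post that alongside the numbers. A minimal sketch, assuming GCC or Clang on x86, using the `__builtin_cpu_supports` builtin; the function names `widest_vector_isa` and `print_cpu_report` are made up for illustration. It only reports what the CPU advertises, not which memcpy variant glibc actually selected at load time.

```c
#include <assert.h>
#include <stdio.h>

/* Returns a label for the widest vector extension the CPU reports.
   Only meaningful on x86 with GCC/Clang; otherwise returns "unknown". */
static const char *widest_vector_isa(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return "avx512f";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    if (__builtin_cpu_supports("avx"))     return "avx";
    if (__builtin_cpu_supports("sse2"))    return "sse2";
    return "baseline";
#else
    return "unknown";
#endif
}

/* Print one line that can be pasted next to benchmark results. */
static void print_cpu_report(void) {
    const char *isa = widest_vector_isa();
    assert(isa != NULL);
    printf("widest vector ISA reported by CPU: %s\n", isa);
}
```

Calling `print_cpu_report()` at the start of a benchmark run would make results from different machines easier to compare.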

Also, even if the benchmark might not be ideal, it's still legitimate and shows head-to-head performance in at least one subset of all possible usage patterns (whether it captures a realistic pattern or not, I don't know). But yeah, what @zhanglistar basically said: we'd all love to flip the tables on libc again!

alexey-milovidov · 2020-11-17T20:07:34Z

I have run the ClickHouse performance tests and can confirm that glibc's memcpy is better than FastMemcpy (at least on one machine):

https://clickhouse-test-reports.s3.yandex.net/17111/213266b80cbc1489b411929568bd9cc8c8173c8d/performance_comparison/report.html#fail1

Although the mean difference is very small: 0.5%.

The maximum speedup (that I'm confident in) is about 16%, on the following query:
SELECT count() FROM zeros(1000000) WHERE NOT ignore(materialize('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') AS s, concat(s,s,s,s,s,s,s,s,s,s) AS t, concat(t,t,t,t,t,t,t,t,t,t) AS u) SETTINGS max_block_size = 1000
which is very memcpy-heavy (see the concat calls).

We have to continue using a custom memcpy instead of glibc's to maintain compatibility with old glibc versions.

victorstewart · 2020-11-19T11:56:06Z

Ice Lake is bringing us these goodies (less frequency downscaling)!

This was referenced Nov 7, 2020

gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) results #7

Closed

Slower on later GCC #4

Open

alexey-milovidov mentioned this issue Nov 16, 2020

Experiment with disabling fast memcpy ClickHouse/ClickHouse#17111

Closed
