[SOLVED] Compiled binaries executing significantly slower #36705

Closed
mmstick opened this Issue Sep 25, 2016 · 31 comments


@mmstick
mmstick commented Sep 25, 2016

As I've developed this Parallel application, I've noticed that everyone who runs it on an AMD processor reports significantly slower runtimes than those on Intel processors, as in my own benchmark linked below. The difference is drastic: my AMD FX-8120 is orders of magnitude slower than my mobile Intel i5, to the point that the original Perl implementation is much faster than the Rust implementation on AMD hardware, while on Intel hardware the Perl version is 40x slower than the Rust one.

@Shnatsel

I can confirm this on rustc 1.11.0 (9b21dcd6a 2016-08-15)

Sysprof recording showing where the time is spent is attached at mmstick/parallel#22

@Shnatsel

Also can confirm this on rustc 1.13.0-nightly (4f9812a59 2016-09-21)

@mmstick
mmstick commented Sep 25, 2016

sysprofs-intel.zip — another sysprof recording, this time from my Intel laptop (ignore Chrome streaming in the background). Looking at the sysprof @Shnatsel provided, there's certainly something wonky happening on AMD hardware.

screenshot from 2016-09-24 23-58-31

@gnzlbg
Contributor
gnzlbg commented Sep 25, 2016

I would need to know:

  • the results of perf on both systems — are the same functions hot on both?
  • perf shows which part of the assembly in the hottest functions is "hot"; give us this info as well,
  • how are you compiling your application? Are you using -C target-cpu=native? (It may well be that the assembly generated for LLVM's generic target happens to favor Intel in your particular case.)
@mmstick
mmstick commented Sep 25, 2016

Here's the result of perf record -F 99 with both the debug and release build on my Intel laptop. Command executed was seq 1 10000 | parallel echo > /dev/null.

@mmstick
mmstick commented Sep 25, 2016

Here's results from my AMD FX-8120 desktop:

Compiling with -C target-cpu=native makes no difference to performance.

@mmstick
mmstick commented Sep 25, 2016

Compilation is simply being done by cargo build --release

@bstrie bstrie added the I-slow label Sep 25, 2016
@mmstick
mmstick commented Sep 25, 2016

And these are measurements taken using just four threads instead of eight on the FX-8120.

@cuviper
Contributor
cuviper commented Sep 26, 2016

You should also use perf archive to collect the objects used in your run, because perf.data alone doesn't contain any symbol information.

Anyway, even without symbols I can see that your AMD machine is mostly busy within just one particular function in the kernel:

Overhead  Command   Shared Object       Symbol                
  82.29%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8c37
   1.23%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8c3a
   0.58%  parallel  libpthread-2.24.so  [.] 0x0000000000009c0b
   0.41%  parallel  [kernel.kallsyms]   [k] 0xffffffff810fdf43
   0.41%  parallel  [kernel.kallsyms]   [k] 0xffffffff812f8adc
   0.33%  parallel  [kernel.kallsyms]   [k] 0xffffffff811e48bb
   0.29%  sh        [kernel.kallsyms]   [k] 0xffffffff810c3cba
   0.27%  parallel  [kernel.kallsyms]   [k] 0xffffffff8117a771
   0.26%  parallel  [kernel.kallsyms]   [k] 0xffffffff81301729
   0.26%  parallel  [kernel.kallsyms]   [k] 0xffffffff815d7337

Whereas the Intel machine is much more spread out:

Overhead  Command   Shared Object       Symbol                
  10.70%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0378e4
  10.24%  parallel  libpthread-2.22.so  [.] 0x0000000000009a47
   9.28%  sh        ld-2.22.so          [.] 0x0000000000017273
   5.82%  sh        [kernel.vmlinux]    [k] 0xffffffffaa49df6a
   5.24%  parallel  ld-2.22.so          [.] 0x0000000000015d68
   4.99%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa105044
   4.93%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0b9057
   4.85%  parallel  [kernel.vmlinux]    [k] 0xffffffffaa0d760f
   4.68%  sh        libc-2.22.so        [.] 0x000000000007c55a
   4.59%  sh        [kernel.vmlinux]    [k] 0xffffffffaa0ae454

But even the full Intel report doesn't show any samples in the parallel object itself, so it doesn't seem your benchmark is even stressing Rust-generated code. With so much time in the kernel and ld.so, I think you're mostly just hammering on process creation. If I understand right, you're launching 10,000 distinct echo processes, so that probably dwarfs the actual parallel work.

(And note that if you're doing this through sh -c, then it will actually get the shell's echo builtin, which would explain why the report only shows sh. Try with /bin/echo if you want that to be a truly distinct process too.)
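The builtin-vs-process distinction is easy to verify from any shell (a minimal sketch; nothing here is specific to parallel, and `sh` is whatever /bin/sh resolves to on your distro):

```shell
# Under `sh -c`, a bare `echo` resolves to the shell builtin, so no new
# process is spawned per invocation:
sh -c 'type echo'
# Naming the full path forces a real fork/exec for every call instead:
sh -c 'type /bin/echo'
```

If the first command reports a builtin, the benchmark's per-iteration cost is dominated by spawning `sh` itself, not `echo`.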

Perhaps that's hitting a pathological case in the AMD case with something like page table setup, for instance. But we shouldn't just guess -- run perf report yourself to see what that 82% symbol is.

@mmstick
mmstick commented Sep 26, 2016

Does that mean that it's an issue with the Linux kernel being poorly optimized for AMD processors with that specific function?

@cuviper
Contributor
cuviper commented Sep 26, 2016

Possibly, but let the data guide you. See what that symbol is first. Then you can also hit a in perf report to annotate that function with the actual instructions that recorded samples.

Have you tried the same benchmark with GNU parallel? I'll bet it will face the same problem, and if not you can try to figure out what it's doing differently.

@mmstick
mmstick commented Sep 26, 2016

Using the same benchmark, GNU Parallel takes 30s compared to the 36s spent with the Rust implementation on my AMD FX-8120 using four cores. With my i5 2410M laptop, GNU Parallel takes 54 seconds to do what the Rust implementation does in 5 seconds.

@mmstick
mmstick commented Sep 26, 2016

Not exactly sure what I'm looking at, but I have a new perf data with the archive data alongside it.
amd_perf_data.zip

@cuviper
Contributor
cuviper commented Sep 26, 2016

Hmm, that new profile doesn't have the same 82% cluster. Is it still slow compared to Intel?

Overhead  Command   Shared Object       Symbol                            
  14.20%  parallel  libpthread-2.24.so  [.] pthread_mutex_lock            
  10.69%  sh        libc-2.24.so        [.] __strchr_sse2                 
   5.04%  sh        libc-2.24.so        [.] strlen                        
   4.61%  parallel  libpthread-2.24.so  [.] __pthread_mutex_unlock_usercnt
   4.13%  parallel  [kernel.vmlinux]    [k] entry_SYSCALL_64              
   3.60%  sh        ld-2.24.so          [.] strcmp                        
   3.49%  sh        ld-2.24.so          [.] do_lookup_x                   
   3.47%  parallel  [kernel.vmlinux]    [k] page_fault                    
   3.13%  parallel  parallel            [.] sdallocx                      
   2.74%  sh        libc-2.24.so        [.] malloc                        

I can also see your kernel updated, from 4.7.3-2-ARCH before to 4.7.5-1-ARCH now. (Hit i in perf report.) If there was a kernel problem, maybe they've fixed it...

@mmstick
mmstick commented Sep 26, 2016 edited

Indeed, it's still slow compared to Intel. This is an eight-core FX @ 4 GHz taking 36s to do what an i5 2410M @ 2.3 GHz does in 5s, and it even runs slower than GNU Parallel, which does the same job in 30s. The wall time would need to be about 3s to match the improvement Intel sees (from 54s with GNU Parallel to 5s with Rust).

@Shnatsel

Sadly I have no clue how to use perf (links to howtos would be appreciated), but sysprof shows a 70% cluster in copy_page in the kernel:

screenshot at 2016-09-27 00 28 29

Here's the sysprof capture so you can inspect it yourself

I am on Linux 4.4.0-38-generic #57-Ubuntu

@Shnatsel

For whatever reason the perf archive subcommand seems to be missing on Ubuntu despite being listed in perf --help; I get the error perf: 'archive' is not a perf-command. See 'perf --help'.

However, I could get perf report to work, and it points at copy_page as well: 81,58% parallel [kernel.kallsyms] [k] copy_page and everything else is under 1%

screenshot at 2016-09-27 00 53 32

@cuviper
Contributor
cuviper commented Sep 26, 2016

Your sysprof shows that copy_page is occurring under do_huge_pmd_page, which is related to the kernel's transparent huge pages. You might cat /sys/kernel/mm/transparent_hugepage/enabled and compare that between your Intel and AMD machines. On Fedora 24 I see always [madvise] never, meaning it's currently only using huge pages if advised to. Try never and see what happens.

# echo never >/sys/kernel/mm/transparent_hugepage/enabled

(run that as root)
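A read-only way to compare that setting across machines, without root (a sketch; the [unavailable] fallback is my own, for kernels built without THP support):

```shell
# The kernel brackets the mode currently in effect, e.g. "always [madvise] never";
# extract just that bracketed word.
grep -o '\[[a-z]*\]' /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
  || echo '[unavailable]'
```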

If that helps you, it doesn't necessarily mean that you need to leave that disabled all the time. Spawning so many processes in such a short time is a pretty unrealistic workload. Do some performance testing with more typical use cases before deciding how to tune your system.

We're pretty far afield from anything Rust can do about this, unless you find evidence of something else that Rust may be doing suboptimally. This doesn't look like a codegen issue to me.

@Shnatsel

[always] madvise never for me by default.

# echo never >/sys/kernel/mm/transparent_hugepage/enabled helped, it is fast even on AMD now. Thanks!

@mcpherrinm
Contributor

To confirm that the CPU is a red herring: If you echo always >/sys/kernel/mm/transparent_hugepage/enabled on an Intel system, do you see an equivalent slow down?

@Shnatsel

Tried it in "madvise" mode and it's still fast, so Rust is probably doing everything right here, i.e. not advising the kernel to use huge pages.

@Aatch
Contributor
Aatch commented Sep 26, 2016

Seems that the issue was unrelated to rustc. Closing.

@Aatch Aatch closed this Sep 26, 2016
@ssokolow

I started looking into it and apparently this isn't the first case of a jemalloc-using application interacting poorly with transparent huge pages. Here's one from a little over a year ago:

https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/

(In the linked case, it was leak-like behaviour because THP was forcing 2MiB pages on jemalloc and it was assuming that applying MADV_DONTNEED to ranges smaller than 2MiB was enough for them to be returned to the OS.)

@mmstick
mmstick commented Sep 27, 2016

I compiled my kernel with transparent huge page support and tried out always on my Intel laptop. The result was that my Intel processor became just as slow, so this isn't specific to AMD. It would also appear that Solus OS already knows about the problems with always, as it has defaulted to madvise for a while now.

@mmstick mmstick changed the title from Compiled binaries executing significantly slower on AMD hardware to [SOLVED] Compiled binaries executing significantly slower Sep 27, 2016
@mmstick
mmstick commented Sep 27, 2016

@Shnatsel You can use perf by running sudo perf record -F 99 <COMMAND> followed by sudo perf archive. You'll get a perf.data file from record and symbols from archive. Then you can use perf script to view it.

@Shnatsel
Shnatsel commented Oct 4, 2016

I can confirm that this is a bug with jemalloc.

Forcing the use of the system default allocator as described in https://doc.rust-lang.org/nightly/book/custom-allocators.html gives good fork performance even with transparent huge pages enabled.

Too bad it is an unstable language feature. Is there any way to get jemalloc off my lawn without using unstable language features, like a compiler switch?

@alexcrichton
Member

cc @sfackler, another data point for jemalloc...

@gnzlbg
Contributor
gnzlbg commented Oct 4, 2016

@Shnatsel is there a tracking issue in jemalloc's repo for this?

@mmstick
mmstick commented Oct 4, 2016

I think you should report a bug against your Linux distribution that it should be using madvise, not always.

@Shnatsel
Shnatsel commented Oct 4, 2016

@gnzlbg I have not opened one. I have no clue where the jemalloc bug tracker is.
