wtpoa-cns not using all requested threads #19

Closed
lh3 opened this issue Sep 24, 2018 · 5 comments

@lh3
Collaborator

lh3 commented Sep 24, 2018

I asked wtpoa-cns to use 16 threads. However, on average, it only uses 500% CPU on my machine. I changed the default memory allocator and it seems to improve the multi-threaded performance.

I am using the E. coli example from PBcR:

http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz

The command lines I was using:

wtdbg2 -i ecoli.fa.gz -t 16 -fo test -L5000 -e2
wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

You can override the system allocator with LD_PRELOAD:

LD_PRELOAD=libtcmalloc.so wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa

Here are some results:

| Library | Real time (sec) | User time (sec) | Sys time (sec) | Max RSS (kb) |
|---|---|---|---|---|
| glibc-2.12 | 285.901 | 848.230 | 575.720 | 1660412.0 |
| jemalloc | 75.703 | 814.820 | 41.580 | 3274516.0 |
| tcmalloc | 72.275 | 1023.740 | 26.120 | 1765996.0 |
| lockless | 100.658 | 953.020 | 102.220 | 4018172.0 |

You can see that the default glibc allocator (I am using CentOS 6) is quite bad, spending lots of system time on thread scheduling. tcmalloc is much better. You get almost a 4-fold speedup. jemalloc is good, too, but its peak RSS is about twice that of glibc.
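As a rough cross-check of these numbers: dividing user + sys time by real time gives the average number of busy cores. For the glibc row this comes out at about 5.0, which matches the 500% CPU mentioned at the top, while the tcmalloc row comes out at about 14.5 of the 16 requested threads. The sketch below only redoes that arithmetic; the function names are made up for illustration and are not part of wtdbg2.

```c
#include <stdio.h>

/* Average number of busy cores: total CPU seconds over wall-clock seconds. */
double busy_cores(double user_s, double sys_s, double real_s)
{
    return (user_s + sys_s) / real_s;
}

/* Wall-clock speedup of one run relative to a baseline run. */
double speedup(double baseline_real_s, double real_s)
{
    return baseline_real_s / real_s;
}

/* Print the derived numbers for the glibc and tcmalloc rows of the table. */
void print_allocator_summary(void)
{
    printf("glibc:    %.1f cores busy\n",
           busy_cores(848.230, 575.720, 285.901));
    printf("tcmalloc: %.1f cores busy, %.2fx faster than glibc\n",
           busy_cores(1023.740, 26.120, 72.275),
           speedup(285.901, 72.275));
}
```

Note that with glibc about 40% of the CPU time is system time, whereas with tcmalloc it is under 3%.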

Typically, you see the effect of memory allocators when you frequently malloc/free in each thread. Bwa suffers from this problem, too. I think there are two ways to fix this:

  1. Use a custom memory allocator. tcmalloc has been quite good for the few examples I have tried. This solution doesn't require you to modify the C source code. However, it is a little difficult for general users to build performant binaries.

  2. Reorganize malloc/free calls. You allocate a buffer before spawning the workers and try to avoid frequent malloc/free in each worker. Minimap2 takes this approach with a thread-local buffer. With this buffer disabled, minimap2 will become noticeably slower on many threads.
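To make option 2 concrete, here is a minimal sketch of the general idea, assuming pthreads: each worker owns a scratch buffer that is allocated before the thread starts and reused for every task, so the hot loop never calls malloc/free. This is not minimap2's or wtdbg2's actual code; `scratch_t`, `scratch_reserve`, `worker_t` and `run_workers` are hypothetical names, and the "work" is a dummy memset.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread scratch buffer; grows on demand, never shrinks. */
typedef struct {
    char  *data;
    size_t cap;
} scratch_t;

/* Return at least `need` bytes, reallocating only when the current
 * capacity is too small. Returns NULL on allocation failure. */
void *scratch_reserve(scratch_t *s, size_t need)
{
    if (need > s->cap) {
        size_t cap = s->cap ? s->cap : 4096;
        while (cap < need) cap *= 2;
        char *p = realloc(s->data, cap);
        if (p == NULL) return NULL;
        s->data = p;
        s->cap  = cap;
    }
    return s->data;
}

typedef struct {
    scratch_t     scratch;   /* owned by exactly one worker: no locking */
    int           n_tasks;   /* hypothetical number of tasks per worker */
    size_t        task_size; /* hypothetical scratch bytes per task */
    unsigned long checksum;  /* dummy result so the work is observable */
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    for (int i = 0; i < w->n_tasks; ++i) {
        /* Reuse the same buffer instead of malloc/free per task. */
        unsigned char *buf = scratch_reserve(&w->scratch, w->task_size);
        if (buf == NULL) return NULL;
        memset(buf, i & 0xff, w->task_size);
        w->checksum += buf[w->task_size - 1];
    }
    return NULL;
}

/* Run n_threads workers; returns the summed checksum, or 0 on failure. */
unsigned long run_workers(int n_threads, int n_tasks, size_t task_size)
{
    if (n_threads <= 0 || task_size == 0) return 0;
    pthread_t *tid = calloc((size_t)n_threads, sizeof *tid);
    worker_t  *w   = calloc((size_t)n_threads, sizeof *w);
    unsigned long total = 0;
    int started = 0;
    if (tid && w) {
        for (int t = 0; t < n_threads; ++t) {
            w[t].n_tasks   = n_tasks;
            w[t].task_size = task_size;
            /* One up-front allocation per worker, before it starts. */
            scratch_reserve(&w[t].scratch, task_size);
            if (pthread_create(&tid[t], NULL, worker_main, &w[t]) != 0) break;
            ++started;
        }
        for (int t = 0; t < started; ++t) {
            pthread_join(tid[t], NULL);
            total += w[t].checksum;
        }
        for (int t = 0; t < n_threads; ++t) free(w[t].scratch.data);
    }
    free(tid);
    free(w);
    return total;
}
```

Because each buffer belongs to a single thread, the allocator's internal locks are only touched once per worker (plus the rare regrow) rather than once per task. Compile with `-pthread`.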

@lh3
Collaborator Author

lh3 commented Sep 24, 2018

PS: I should add that the issue may depend on available RAM and other processes running on the same machine. On a just rebooted machine or on a machine with lots of free RAM, the issue may be alleviated.

@lh3
Collaborator Author

lh3 commented Sep 24, 2018

PPS: I have updated the precompiled binaries on the release page. wtdbg2, wtdbg-cns and wtpoa-cns in the binary tarball are now statically linked to TCMalloc. They are faster on my machine. Not sure if you can see the difference on your side.

@ruanjue
Owner

ruanjue commented Sep 25, 2018

Dear Heng,

wtpoa-cns and wtdbg-cns use the same parallelization scheme. They take one contig and break the task into multiple parallel parts, naturally following the edges in wtdbg, then merge the consensus edge sequences into the contig sequence.

If we parallelized over contigs instead of over edges within a contig, CPU usage should reach what -t 16 asks for. But my concern was that the varied contig lengths might leave some threads running much longer than others.

Another way might be separating 'edge consensus sequence' from 'merging edges'. For all contigs, we first build the consensus edge sequences and write them to a prefix.ctg.lay.edges.fa file, which should use all CPUs. Then the merging step should also use all CPUs. Let me try it.
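For the load-balancing concern with per-contig parallelism, one common mitigation is dynamic scheduling: workers claim contigs one at a time from a shared counter, so a thread that finishes a short contig immediately takes the next one. The sketch below assumes pthreads and C11 atomics; `pool_t`, `pool_worker` and `process_contigs` are hypothetical names, and adding up contig lengths stands in for the real consensus work. It does not remove the tail when a single contig dominates the total.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical shared state: contig lengths stand in for per-contig work. */
typedef struct {
    const int  *contig_len; /* length of each contig */
    int         n_contigs;
    atomic_int  next;       /* index of the next unclaimed contig */
    atomic_long processed;  /* total bases "processed" by all threads */
} pool_t;

static void *pool_worker(void *arg)
{
    pool_t *p = arg;
    for (;;) {
        /* Claim one contig at a time from the shared counter. */
        int i = atomic_fetch_add(&p->next, 1);
        if (i >= p->n_contigs) break;
        atomic_fetch_add(&p->processed, (long)p->contig_len[i]);
    }
    return NULL;
}

/* Process every contig exactly once with n_threads workers; returns the
 * total length processed, or -1 on failure. */
long process_contigs(const int *contig_len, int n_contigs, int n_threads)
{
    if (n_threads <= 0) return -1;
    pthread_t *tid = malloc((size_t)n_threads * sizeof *tid);
    if (tid == NULL) return -1;

    pool_t p;
    p.contig_len = contig_len;
    p.n_contigs  = n_contigs;
    atomic_init(&p.next, 0);
    atomic_init(&p.processed, 0);

    int started = 0;
    for (int t = 0; t < n_threads; ++t) {
        if (pthread_create(&tid[t], NULL, pool_worker, &p) != 0) break;
        ++started;
    }
    for (int t = 0; t < started; ++t) pthread_join(tid[t], NULL);
    free(tid);
    return started > 0 ? atomic_load(&p.processed) : -1;
}
```

Sorting contigs by decreasing length before handing them out usually shortens the tail further, since the longest jobs then start first.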

@lh3
Collaborator Author

lh3 commented Sep 25, 2018

On my machine, the problem is not caused by some threads running longer. Using tcmalloc wouldn't help in that case. I believe the slowdown is due to frequent malloc calls. I have seen similar behaviors a few times before.

@ruanjue
Owner

ruanjue commented Oct 5, 2018

Hi Heng,

You are right, the frequent malloc calls slowed down the multi-threaded run. I have located the code that causes it.

Thanks very much!
Jue

@ruanjue ruanjue closed this as completed Oct 5, 2018