wtpoa-cns not using all requested threads #19
PS: I should add that the issue may depend on available RAM and on other processes running on the same machine. On a freshly rebooted machine or on a machine with lots of free RAM, the issue may be alleviated.
PPS: I have updated the precompiled binaries on the release page.
Dear Heng,
If parallelizing over contigs instead of over edges within a contig, the CPU usage will be expected as
Another way might be separating the 'edge consensus sequence' and 'merging edges' steps. For all contigs, we first build consensus edge sequences and write a
On my machine, the problem is not caused by some threads running longer. Using tcmalloc wouldn't help in that case. I believe the slowdown is due to frequent malloc calls. I have seen similar behavior a few times before.
Hi Heng, you are right: the frequent malloc calls slowed down the multi-threaded run. I have located the code responsible. Thanks very much!
I asked wtpoa-cns to use 16 threads. However, on average, it only uses 500% CPU on my machine. Changing the default memory allocator seems to improve multi-threaded performance.
I am using the E. coli example from PBcR:
http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz
The command lines I was using:
wtdbg2 -i ecoli.fa.gz -t 16 -fo test -L5000 -e2
wtpoa-cns -i test.ctg.lay -t 16 -fo test.ctg.fa
You can override the system allocator with LD_PRELOAD.
Here are some results:
You can see that the default glibc allocator (I am using CentOS 6) is quite bad, spending lots of system time on thread scheduling. tcmalloc is much better. You get almost a 4-fold speedup. jemalloc is good, too, but it takes too much extra memory.
Typically, you see the effect of memory allocators when you frequently malloc/free in each thread. Bwa suffers from this problem, too. I think there are two ways to fix this:
Use a custom memory allocator. tcmalloc has been quite good for the few examples I have tried. This solution doesn't require you to modify the C source code. However, it is a little difficult for general users to build performant binaries.
Reorganize malloc/free calls. You allocate a buffer before spawning the workers and try to avoid frequent malloc/free in each worker. Minimap2 takes this approach with a thread-local buffer. With this buffer disabled, minimap2 will become noticeably slower on many threads.
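The second approach can be sketched as follows. This is a minimal illustration, not code from wtdbg2 or minimap2: the names worker_t, worker and run_workers, the dummy per-task work and the round-robin task split are all assumptions made for the example. The point it shows is that each worker's scratch buffer is allocated once, before any thread is spawned, and is then reused for every task, so the hot loop never calls malloc/free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64

/* Hypothetical per-worker state (illustrative names only). */
typedef struct {
    int tid, n_threads, n_tasks;
    size_t buf_size;
    unsigned char *buf; /* scratch buffer, allocated once by the caller */
    uint64_t sum;       /* per-worker partial result */
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    /* Tasks are split round-robin: worker t handles t, t + n_threads, ... */
    for (int i = w->tid; i < w->n_tasks; i += w->n_threads) {
        /* Reuse the same buffer for every task; no malloc/free in this loop. */
        memset(w->buf, i & 0xff, w->buf_size);
        w->sum += (uint64_t)w->buf[0] + w->buf[w->buf_size - 1];
    }
    return NULL;
}

/* Runs n_tasks dummy tasks on n_threads workers and returns a checksum. */
uint64_t run_workers(int n_threads, int n_tasks, size_t buf_size)
{
    pthread_t tid[MAX_THREADS];
    worker_t w[MAX_THREADS];
    uint64_t total = 0;

    assert(n_threads > 0 && n_threads <= MAX_THREADS && buf_size > 0);

    /* Allocate every worker's buffer up front, before spawning threads. */
    for (int t = 0; t < n_threads; ++t) {
        w[t].tid = t;
        w[t].n_threads = n_threads;
        w[t].n_tasks = n_tasks;
        w[t].buf_size = buf_size;
        w[t].sum = 0;
        w[t].buf = (unsigned char *)malloc(buf_size);
        if (w[t].buf == NULL) abort();
    }
    for (int t = 0; t < n_threads; ++t)
        if (pthread_create(&tid[t], NULL, worker, &w[t]) != 0) abort();

    /* Join, collect partial results, and free each buffer exactly once. */
    for (int t = 0; t < n_threads; ++t) {
        pthread_join(tid[t], NULL);
        total += w[t].sum;
        free(w[t].buf);
    }
    return total;
}
```

Calling run_workers(16, n_tasks, buf_size) performs 16 allocations in total regardless of n_tasks, whereas allocating inside the loop would perform one per task. With GCC or Clang the sketch should build with the -pthread flag.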