samtools sort number of threads in reading phase #891

Open
bernt-matthias opened this issue Jul 11, 2018 · 6 comments

Comments

@bernt-matthias

bernt-matthias commented Jul 11, 2018

Is your feature request related to a problem? Please specify.

  1. When using mapper | samtools sort -, it is difficult to specify the number of threads for the mapper and for samtools.
  2. Until all data has been read, samtools seldom uses the available CPUs efficiently (CPU usage seldom exceeds 100%).

Describe the solution you would like.

I suggest allowing the number of CPUs that samtools uses while reading the data (and producing pre-sorted chunks) to be specified separately. This would simplify specifying the number of threads used by both programs. Until the mapper has finished, samtools could for instance use a single thread for reading and chunking, and then use the full number of threads afterwards. Thereby:

  • the CPU usage could be limited more reliably (in shared environments you need to specify the number of cores, and sometimes admins really check)
  • the currently suboptimal performance of samtools sort during reading would be nicely hidden.
  • I guess the single thread for the first phase could nicely fill the gaps in the mapper's CPU utilization.
@jkbonfield
Contributor

Sort could certainly be more efficient. Ideally it would be using asynchronous I/O too.

However this particular problem is perhaps one of expectation. Over-specifying the number of threads is not a catastrophically bad thing to do, and you can use cgroups or hwloc-bind to govern how many cores the entire process can take up too.

Also, I don't think it's true to say that samtools sort uses only one CPU until the mapper has finished. It uses one thread until it's read enough data and then it uses multiple threads to sort and write that temporary data to disk, repeatedly. On finishing (no more stdin) it then has a separate merge stage. If your mapper is the slow part, then yes samtools will likely be stuck at under 100% CPU, but that's not really a samtools issue I think.

Note there is more or less a way to handle what you want already (untested, but I think it's equivalent), eg:

mapper | samtools sort -l 0 -O bam -@2 | samtools view -O bam -@16 -o out.bam

The second merge stage only starts when the mapper has finished, and this will be I/O bound and won't be threading on output as there are no lengthy bgzf compression steps. The samtools view command will only start consuming cpu after the mapper has finished so both mapper and view can be given the same cores to work on.
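For anyone driving this from a script rather than a shell, the three-stage pipeline above can be wired up roughly as follows. This is only a sketch: build_stages and run_pipeline are hypothetical helper names, the mapper command line is whatever you would normally run, and the samtools flags are exactly the ones quoted in the command above, nothing more.

```python
import shlex
import subprocess


def build_stages(mapper_cmd, sort_threads=2, view_threads=16, out_path="out.bam"):
    """Return one argv list per stage of `mapper | samtools sort | samtools view`."""
    return [
        shlex.split(mapper_cmd),
        # Level-0 (uncompressed) BAM out of sort, as in the command quoted above.
        ["samtools", "sort", "-l", "0", "-O", "bam", "-@", str(sort_threads)],
        # The compressing stage; it gets the larger thread count.
        ["samtools", "view", "-O", "bam", "-@", str(view_threads), "-o", out_path],
    ]


def run_pipeline(stages):
    """Chain the stages with pipes, like `a | b | c` in a shell."""
    procs = []
    upstream = None
    for i, argv in enumerate(stages):
        is_last = i == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=upstream,
            stdout=None if is_last else subprocess.PIPE,
        )
        if upstream is not None:
            # Drop our copy of the pipe so an early exit downstream is
            # noticed by the upstream process.
            upstream.close()
        upstream = proc.stdout
        procs.append(proc)
    # Exit codes, in stage order.
    return [p.wait() for p in procs]
```

Calling run_pipeline(build_stages("mapper ...")) would then start all three processes at once; the mapper and the final view stage can be given the same core count if, as suggested above, they are not busy at the same time.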

Finally, you may have more luck using mapper | mbuffer | samtools with some systems and/or aligners. This can avoid issues with small pipe sizes.

@bernt-matthias
Author

bernt-matthias commented Jul 11, 2018

Thanks for the info and suggestions.

On finishing (no more stdin) it then has a separate merge stage. If your mapper is the slow part, then yes samtools will likely be stuck at under 100% CPU, but that's not really a samtools issue I think.

Actually (in my case the mapper is hisat2), CPU usage is approx. 100% most of the time and then spikes briefly to approx. x*100%, where x is the number of threads given to samtools. But these spikes are really short.

Note there is more or less a way to handle what you want already (untested, but I think it's equivalent), eg:

mapper | samtools sort -l 0 -O bam -@2 | samtools view -O bam -@16 -o out.bam

The second merge stage only starts when the mapper has finished, and this will be I/O bound and won't be threading on output as there are no lengthy bgzf compression steps. The samtools view command will only start consuming cpu after the mapper has finished so both mapper and view can be given the same cores to work on.

Sounds like a cool idea. The result should be equivalent.

Efficiency depends a bit on how sort merges the temporary files. If it is done in a tree-like fashion, then it would only start writing output at the top level of the merge tree. But if all temporary files are merged at once, then it would start writing output immediately (which would start view earlier). For the suggested solution the latter would be better, I guess.
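To make the two merge strategies concrete, here is a toy comparison on in-memory lists. It is an illustration only and says nothing about how samtools is implemented; merge_all_at_once and merge_as_tree are hypothetical names, and "rewritten" counts the records that would have to go through intermediate files before the consumer sees anything.

```python
import heapq


def merge_all_at_once(runs):
    """One k-way merge over every sorted run; lazy, so output can start at once."""
    return heapq.merge(*runs)


def merge_as_tree(runs, fan_in=2):
    """Merge `fan_in` runs at a time, materialising every intermediate level.

    Only the final (top-level) merge is lazy.
    Returns (iterator over merged records, records rewritten below the top level).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    rewritten = 0
    level = [list(run) for run in runs]
    while len(level) > fan_in:
        next_level = []
        for i in range(0, len(level), fan_in):
            merged = list(heapq.merge(*level[i:i + fan_in]))
            rewritten += len(merged)  # stands in for writing an intermediate file
            next_level.append(merged)
        level = next_level
    return heapq.merge(*level), rewritten


runs = [[1, 5], [2, 6], [3, 7], [4, 8]]
flat = list(merge_all_at_once(runs))
tree_iter, rewritten = merge_as_tree(runs)
tree = list(tree_iter)
```

Both give the same sorted output, but with four runs and a fan-in of two the tree variant has already rewritten every record once before the top-level merge produces its first record.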

@jkbonfield
Contributor

Sadly sort is pretty noddy. It simply reads until hitting the memory limit, sorts, writes to a temporary file, and repeats. At the end it then opens ALL files and merges. This isn't particularly efficient and can cause major I/O bottlenecks and/or running out of file descriptors if you've set the memory limit too low.
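As a rough illustration of that read/sort/spill/merge loop (not the samtools code itself), the sketch below sorts plain text lines. external_sort is a hypothetical function, max_in_memory counts records rather than bytes, and unlike a real tool it always spills, even when everything would fit in memory.

```python
import heapq
import tempfile


def external_sort(records, max_in_memory):
    """Sort newline-free strings while holding at most `max_in_memory` of them.

    Returns (sorted_records, number_of_temporary_files).
    """
    temp_files = []

    def spill(chunk):
        # Sort the in-memory chunk and write it out as one sorted run.
        chunk.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        temp_files.append(f)

    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:  # the "memory limit" has been hit
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # Final stage: every temporary file is open at the same time.
    merged = [line.rstrip("\n") for line in heapq.merge(*temp_files)]
    for f in temp_files:
        f.close()
    return merged, len(temp_files)


reads = ["r5", "r1", "r4", "r2", "r3"]
small_limit = external_sort(reads, 2)
large_limit = external_sort(reads, 5)
```

With the smaller limit the same input ends up in three temporary files instead of one, which is the effect described above: the lower the limit, the more files have to be open together in the final merge.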

It can perhaps be sped up by adjusting the block size to be larger than the file system hints at (fstat) via the --input-fmt-option block_size=10000000 option, for example. This would use more memory (but probably still less than you used for sorting), but will perhaps thrash the system less.

@kemin711

I have computers with 500G of memory; could sort use more memory to speed things up? I was using the pipe
bwa | samtools sort -t 6. I saw that bwa had finished, but sort was still working hard using only about 50% of one CPU. Maybe it was in the merging stage, which does not need more CPU.

@kemin711

I saw that samtools sort has a -m option (e.g. -m 10G); I will explore using it to speed up sorting.

@alexjacobsCDS

@kemin711 curious if you had luck with increasing the memory per thread to speed up sorting?
