Sam multi-threading #786

jkbonfield · 2018-10-26T14:21:26Z

This adds multi-threaded reading and writing for the SAM format.

SAM output has been tested with full and partial (region) based work flows. SAM input has been tested with full files, but it's hard to evaluate seek handling as this needs PR #718 merging in order to be able to do index and random access on SAMs.

Some metrics on ./test/test_view -@16 /tmp/10m.sam > /dev/null

Dev 11.41sec elapsed
This PR: 2.0sec elapsed (~18s cpu)

It still has bottlenecks somewhere, although maybe it is I/O. It doesn't gain much performance beyond 10 threads. Note this faster elapsed time than decoding from 10m.u.BAM and outputting to u.BAM (3.1sec elapsed, 10.3s cpu), possibly due to CRC32 calculations.

NB: this needs squashing, but rebase including resolving all the SAM speed changes has been done.

This function is now around double the speed of develop.

Benchmarks: Best of 3 runs, elapsed time measured by perf. File is first 3 million records from 9827_2#49.bam. Built using --with-libdeflate ===================== Read speed: (view -c) gcc8 clang50 icc Orig CRAM.u 2.72 3.14 3.19 Orig CRAM.ux 2.22 2.48 2.53 Orig CRAM 4.36 4.79 5.26 Orig BAM.u 0.37 0.38 0.44 Orig BAM 2.07 2.08 2.15 Orig SAM 2.54 2.63 3.08 Orig SAM.gz 4.59 4.55 5.31 New SAM 1.20 1.39 1.65 New SAM.gz 3.26 3.36 3.95 CRAM.u and BAM.u are uncompressed CRAM/BAM. CRAM.ux is uncompressed and referenceless. SAM.gz is bgzipped SAM. So reading speed is 87%(icc) to 112%(gcc) faster for uncompressed SAM, and up to 40% faster for compressed SAM. Compressed SAM is now 57(gcc)-84%(icc) slower than compressed BAM. ===================== Write speed (convert BAM.u to new format, so includes ~0.4 BAM decode time): gcc8 clang50 icc Orig CRAM.u 3.97 4.24 4.17 Orig CRAM 9.97 10.30 10.23 Orig BAM.u 0.63 0.65 0.79 Orig BAM 12.27 12.27 12.52 Orig SAM 2.49 2.62 3.35 Orig SAM.gz 14.82 14.79 14.71 New SAM 1.40 1.46 2.14 New SAM.gz 14.73 14.86 14.73 NB: Writing SAM.gz is samtools view | bgzip. Writing uncompressed SAM is 56(icc) to 79% faster. Writing compressed data is dominated by libdeflate speeds, and differences between new/orig are likely in the noise rather than meaningful. It's clear though that writing gzipped SAM (albeit via a pipe) is about 15% slower than BAM. Uncompressed BAM is still obviously the best format for piping, as expected. Not shown here, but level 1 BAM took about 8s to write and is 5% larger than the default BAM. (That doesn't sound like a bad tradeoff to make.) Level 9 BAM took 57s and was only 3% smaller than default. ===================== -rw-r--r-- 1 jkb team117 866008213 Jun 28 11:20 3m.u.bam -rw-r--r-- 1 jkb team117 350739851 Jun 28 11:20 3m.bam -rw-r--r-- 1 jkb team117 1121385629 Jun 28 11:20 3m.sam -rw-r--r-- 1 jkb team117 333746006 Jun 28 11:28 3m.sam.gz -rw-r--r-- 1 jkb team117 809884391 Jun 28 11:38 3m.ux.cram -rw-r--r-- 1 jkb team117 485074200 Jun 28 11:37 3m.u.cram -rw-r--r-- 1 jkb team117 173363126 Jun 28 11:36 3m.cram

Ditched the (non-working due to lack of m4_include) autoconf check for __builtin_clz as it doesn't work well anyway, given it's in a public header file.

Bug fix to kputw and negative values. It wasn't checking the size needed for '-' char. Silence valgrind warnings in SAM parser as it's minimal speed impact to do so. Dropped the now unneeded HAVE___BUILTIN_CLZ from Makefile. (The code still checks for it incase your compiler supports this and isn't gcc or clang, permitting users to manually set it).

Needs tidy up as it is using global memory and hard-coded on at the moment. No thread clean up code either.

Use a fixed sized buffer (checking to come!) instead of repeated hts_getline. Benchmarks main+10 thread: (NB faster than 10 thread on test.u.bam) real 0m0.978s user 0m8.137s sys 0m1.048s main+1 thread: real 0m5.918s user 0m6.964s sys 0m0.688s main only (unthreaded): real 0m6.291s user 0m5.600s sys 0m0.676s

@10

TODO: handle exiting early (test_view -N 100000 -@10 -B foo.sam)

This follows a similar strategy to the multi-threaded BAM writer with a thread dedicated to pulling things off the result queue and writing to the hfile. ABI BREAKING: The sam_write1 prototype has changed. The bam_hdr_t pointer is no longer const, so we can record the use of this header by incrementing the reference count. (It may be possible we can avoid this, but some tools destroy the header prior to closing the file, which causes crashes instead.)

jkbonfield added 15 commits October 26, 2018 15:11

Further speed ups to sam_parse1.

90be7bb

This function is now around double the speed of develop.

Fix configure script.

3082516

Ditched the (non-working due to lack of m4_include) autoconf check for __builtin_clz as it doesn't work well anyway, given it's in a public header file.

SAM decode multi-threading trial (HACK!)

cb40c5f

.

e13892e

Mostly working multi-threaded decoded.

fad1f0e

Needs tidy up as it is using global memory and hard-coded on at the moment. No thread clean up code either.

Fixes to MT sam decode

6a88df8

Tidyup

dc3d6a6

Removal of globals.

ba47df1

TODO: handle exiting early (test_view -N 100000 -@10 -B foo.sam)

Fixed early shutdown of MT SAM read.

c70883c

Trivial code tidyups

341f647

Bug fix to handle extra long SAM lines

7f85374

jkbonfield added the breaks-ABI label Oct 26, 2018

jkbonfield changed the title ~~Sam speed+thread~~ Sam multi-threading Nov 27, 2018

jkbonfield mentioned this pull request Aug 20, 2019

Multi-threaded SAM reading and writing. #916

Merged

jkbonfield closed this Aug 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sam multi-threading #786

Sam multi-threading #786

jkbonfield commented Oct 26, 2018 •

edited

Loading

Sam multi-threading #786

Sam multi-threading #786

Conversation

jkbonfield commented Oct 26, 2018 • edited Loading

jkbonfield commented Oct 26, 2018 •

edited

Loading