Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header merge efficiency #419

Merged
merged 2 commits into from Aug 20, 2015
Merged

Conversation

smowton
Copy link
Contributor

@smowton smowton commented Jul 1, 2015

Hi all,

I was working with some BAM files that had a large number of contigs (they were aligned against parts of the NCBI NT database, which has thousands of short contigs). The input files had disjoint contigs, having been aligned to disjoint subsets of the database. This turned out to make 'samtools merge' prohibitively expensive, spending hours in trans_tbl_init.

This patch improves the situation by (a) switching from a quadratic search for @SQ lines in translate->text to a linear pass over it, (b) switching from using regexec et al to plain old string.h operations, making a pass over a single file much faster, and (c) allowing the hashtable used to build the global translation table to live between reading each input, avoid quadratic cost of repeatedly building and expanding the table when there are more than a few input files involved.

This is only lightly tested with the particular workload that was causing trouble, but if you think the use case (many small contigs) is interesting, this might be useful to incorporate.

Christopher Smowton added 2 commits July 1, 2015 16:41
…s and switch from expensive regexes to simple string.h operations.
@SamStudio8
Copy link
Contributor

Very useful! See also #337, #260 and #203.
Some of the fixes you present may have been replicated by PR #337 which was
recently merged (
https://github.com/legumeinfo/samtools/commit/f0c42fee1d73a12919f09b4eaf3b66b3e22b4a52)
in to develop.
On 1 Jul 2015 5:02 pm, "Chris Smowton" notifications@github.com wrote:

Hi all,

I was working with some BAM files that had a large number of contigs (they
were aligned against parts of the NCBI NT database, which has thousands of
short contigs). The input files had disjoint contigs, having been aligned
to disjoint subsets of the database. This turned out to make 'samtools
merge' prohibitively expensive, spending hours in trans_tbl_init.

This patch improves the situation by (a) switching from a quadratic search
for @sq lines in translate->text to a linear pass over it, (b) switching
from using regexec et al to plain old string.h operations, making a pass
over a single file much faster, and (c) allowing the hashtable used to
build the global translation table to live between reading each input,
avoid quadratic cost of repeatedly building and expanding the table when
there are more than a few input files involved.

This is only lightly tested with the particular workload that was causing
trouble, but if you think the use case (many small contigs) is interesting,

this might be useful to incorporate.

You can view, comment on, or merge this pull request online at:

#419
Commit Summary

  • Amend SAM header merge strategy to avoid two quadratic-cost
    operations and switch from expensive regexes to simple string.h operations.
  • Merge branch 'develop' of https://github.com/samtools/samtools into
    header_merge_efficiency

File Changes

Patch Links:


Reply to this email directly or view it on GitHub
#419.

@smowton
Copy link
Contributor Author

smowton commented Jul 1, 2015

There isn't duplication, actually-- that recent fix targets the prettify-header stage which comes after building the header in the first place. As I see it, the four PRs target similar problems but don't overlap:

#203 avoids quadratic-cost prettify-header stage
#260 switches from linear search for a contig to using a hash table, but still does linear search over the plaintext header
#337 further improves prettify-header by avoiding use of the libc regex functons
...and finally here we use a hash table for the plaintext header too, and persist the hashtable introduced in #260 across several files.

@SamStudio8
Copy link
Contributor

I didn't say there definitely was duplication, I just wanted to link in
other merge related issues!
On 1 Jul 2015 6:07 pm, "Chris Smowton" notifications@github.com wrote:

There isn't duplication, actually-- that recent fix targets the
prettify-header stage which comes after building the header in the first
place. As I see it, the four PRs target similar problems but don't overlap:

#203 #203 avoids
quadratic-cost prettify-header stage
#260 #260 switches from
linear search for a contig to using a hash table, but still does linear
search over the plaintext header
#337 #337 further improves
prettify-header by avoiding use of the libc regex functons
...and finally here we use a hash table for the plaintext header too, and
persist the hashtable introduced in #260
#260 across several files.


Reply to this email directly or view it on GitHub
#419 (comment).

@smowton
Copy link
Contributor Author

smowton commented Jul 1, 2015

Sure, no defensiveness intended -- just noting that they are distinct issues

daviesrob added a commit to daviesrob/samtools that referenced this pull request Aug 18, 2015
Also fixed tabs that came in with samtools#419.

Conflicts:
	bam_sort.c
@daviesrob daviesrob mentioned this pull request Aug 18, 2015
@jmarshall jmarshall merged commit a181403 into samtools:develop Aug 20, 2015
jmarshall added a commit that referenced this pull request Aug 20, 2015
Finish removal of test_pretty_header.

[NEWS]
* Merging files with millions of headers now completes in a reasonable
  amount of time (#337, #373, #419, #453; thanks to Nathan Weeks,
  Chris Smowton, Martin Pollard, Rob Davies)
@jmarshall
Copy link
Member

Thanks for the patch and for pushing us to look at this again. Rob extended this in #453 and the regexps are now gone from bam_sort.c… hopefully we've all finally squashed the algorithmic badness in here!

charles-plessy added a commit to Debian/samtools that referenced this pull request Apr 26, 2016
Samtools release 1.3: many improvements, fixes, new commands

* The obsolete "samtools sort in.bam out.prefix" usage has been removed.
  If you are still using -f, -o, or out.prefix, convert to use -T PREFIX
  and/or -o FILE instead.  (samtools#295, samtools#349, samtools#356, samtools#418, PR samtools#441; see also
  discussions in samtools#171, samtools#213.)

* The "bamshuf" command has been renamed to "collate" (hence the term
  bamshuf no longer appears in the documentation, though it still works
  on the command line for compatibility with existing scripts).

* The mpileup command now outputs the unseen allele in VCF/BCF as <*>
  rather than X or <X> as previously, and now has AD, ADF, ADR, INFO/AD,
  INFO/ADF, INFO/ADR --output-tags annotations that largely supersede
  the existing DV, DP4, DPR annotations.

* The mpileup command now applies BAQ calculations at all base positions,
  regardless of which -l or -r options are used (previously with -l it was
  not applied to the first few tens of bases of each chromosome, leading
  to different mpileup results with -l vs. -r; samtools#79, samtools#125, samtools#286, samtools#407).

* Samtools now has a configure script which checks your build environment
  and facilitates choosing which HTSlib to build against.  See INSTALL
  for details.

* Samtools's Makefile now fully supports the standard convention of
  allowing CC/CPPFLAGS/CFLAGS/LDFLAGS/LIBS to be overridden as needed.
  Previously it listened to $(LDLIBS) instead; if you were overriding
  that, you should now override LIBS rather than LDLIBS.

* A new addreplacerg command that adds or alters @rg headers and RG:Z
  record tags has been added.

* The rmdup command no longer immediately aborts (previously it always
  aborted with "bam_get_library() not yet implemented"), but remains
  not recommended for most use (samtools#159, samtools#252, samtools#291, samtools#393).

* Merging files with millions of headers now completes in a reasonable
  amount of time (samtools#337, samtools#373, samtools#419, samtools#453; thanks to Nathan Weeks,
  Chris Smowton, Martin Pollard, Rob Davies).

* Samtools index's optional index output path argument works again (samtools#199).

* Fixed calmd, targetcut, and potential mpileup segfaults when given broken
  alignments with POS far beyond the end of their reference sequences.

* If you have source code using bam_md.c's bam_fillmd1_core(), bam_cap_mapQ(),
  or bam_prob_realn_core() functions, note that these now take an additional
  ref_len parameter.  (The versions named without "_core" are unchanged.)

* The tview command's colour scheme has been altered to be more suitable
  for users with colour blindness (samtools#457).

* Samtools depad command now handles CIGAR N operators and accepts
  CRAM files (samtools#201, samtools#404).

* Samtools stats now outputs separate "N" and "other" columns in the
  ACGT content per cycle section (samtools#376).

* Added -a option to samtools depth to show all locations, including
  zero depth sites (samtools#374).

* New samtools dict command, which creates a sequence dictionary
  (as used by Picard) from a FASTA reference file.

* Samtools stats --target-regions option works again.

* Added legacy API sam.h functions sam_index_load() and samfetch() providing
  bam_fetch()-style iteration over either BAM or CRAM files.  (In general
  we recommend recoding against the htslib API directly, but this addition
  may help existing libbam-using programs to be CRAM-enabled easily.)

* Fixed legacy API's samopen() to write headers only with "wh" when writing
  SAM files.  Plain "w" suppresses headers for SAM file output, but this
  was broken in 1.2.

* "samtools fixmate - -" works in pipelines again; with 1.0 to 1.2,
  this failed with "[bam_mating] cannot determine output format".

* Restored previous "samtools calmd -u" behaviour of writing compression
  level 0 BAM files.  Samtools 1.0 to 1.2 incorrectly wrote raw non-BGZF
  BAM files, which cannot be read by most other tools.  (Samtools commands
  other than calmd were unaffected by this bug.)

* Restored bam_nt16_nt4_table[] to legacy API header bam.h.

* Fixed bugs samtools#269, samtools#305, samtools#320, samtools#328, samtools#346, samtools#353, samtools#365, samtools#392, samtools#410, samtools#445,
  samtools#462, samtools#475, and samtools#495.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants