New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Header merge efficiency #419
Conversation
…s and switch from expensive regexes to simple string.h operations.
…eader_merge_efficiency
Very useful! See also #337, #260 and #203.
|
There isn't duplication, actually-- that recent fix targets the prettify-header stage which comes after building the header in the first place. As I see it, the four PRs target similar problems but don't overlap: #203 avoids quadratic-cost prettify-header stage |
I didn't say there definitely was duplication, I just wanted to link in
|
Sure, no defensiveness intended -- just noting that they are distinct issues |
Also fixed tabs that came in with samtools#419. Conflicts: bam_sort.c
Thanks for the patch and for pushing us to look at this again. Rob extended this in #453 and the regexps are now gone from |
Samtools release 1.3: many improvements, fixes, new commands * The obsolete "samtools sort in.bam out.prefix" usage has been removed. If you are still using -f, -o, or out.prefix, convert to use -T PREFIX and/or -o FILE instead. (samtools#295, samtools#349, samtools#356, samtools#418, PR samtools#441; see also discussions in samtools#171, samtools#213.) * The "bamshuf" command has been renamed to "collate" (hence the term bamshuf no longer appears in the documentation, though it still works on the command line for compatibility with existing scripts). * The mpileup command now outputs the unseen allele in VCF/BCF as <*> rather than X or <X> as previously, and now has AD, ADF, ADR, INFO/AD, INFO/ADF, INFO/ADR --output-tags annotations that largely supersede the existing DV, DP4, DPR annotations. * The mpileup command now applies BAQ calculations at all base positions, regardless of which -l or -r options are used (previously with -l it was not applied to the first few tens of bases of each chromosome, leading to different mpileup results with -l vs. -r; samtools#79, samtools#125, samtools#286, samtools#407). * Samtools now has a configure script which checks your build environment and facilitates choosing which HTSlib to build against. See INSTALL for details. * Samtools's Makefile now fully supports the standard convention of allowing CC/CPPFLAGS/CFLAGS/LDFLAGS/LIBS to be overridden as needed. Previously it listened to $(LDLIBS) instead; if you were overriding that, you should now override LIBS rather than LDLIBS. * A new addreplacerg command that adds or alters @rg headers and RG:Z record tags has been added. * The rmdup command no longer immediately aborts (previously it always aborted with "bam_get_library() not yet implemented"), but remains not recommended for most use (samtools#159, samtools#252, samtools#291, samtools#393). * Merging files with millions of headers now completes in a reasonable amount of time (samtools#337, samtools#373, samtools#419, samtools#453; thanks to Nathan Weeks, Chris Smowton, Martin Pollard, Rob Davies). * Samtools index's optional index output path argument works again (samtools#199). * Fixed calmd, targetcut, and potential mpileup segfaults when given broken alignments with POS far beyond the end of their reference sequences. * If you have source code using bam_md.c's bam_fillmd1_core(), bam_cap_mapQ(), or bam_prob_realn_core() functions, note that these now take an additional ref_len parameter. (The versions named without "_core" are unchanged.) * The tview command's colour scheme has been altered to be more suitable for users with colour blindness (samtools#457). * Samtools depad command now handles CIGAR N operators and accepts CRAM files (samtools#201, samtools#404). * Samtools stats now outputs separate "N" and "other" columns in the ACGT content per cycle section (samtools#376). * Added -a option to samtools depth to show all locations, including zero depth sites (samtools#374). * New samtools dict command, which creates a sequence dictionary (as used by Picard) from a FASTA reference file. * Samtools stats --target-regions option works again. * Added legacy API sam.h functions sam_index_load() and samfetch() providing bam_fetch()-style iteration over either BAM or CRAM files. (In general we recommend recoding against the htslib API directly, but this addition may help existing libbam-using programs to be CRAM-enabled easily.) * Fixed legacy API's samopen() to write headers only with "wh" when writing SAM files. Plain "w" suppresses headers for SAM file output, but this was broken in 1.2. * "samtools fixmate - -" works in pipelines again; with 1.0 to 1.2, this failed with "[bam_mating] cannot determine output format". * Restored previous "samtools calmd -u" behaviour of writing compression level 0 BAM files. Samtools 1.0 to 1.2 incorrectly wrote raw non-BGZF BAM files, which cannot be read by most other tools. (Samtools commands other than calmd were unaffected by this bug.) * Restored bam_nt16_nt4_table[] to legacy API header bam.h. * Fixed bugs samtools#269, samtools#305, samtools#320, samtools#328, samtools#346, samtools#353, samtools#365, samtools#392, samtools#410, samtools#445, samtools#462, samtools#475, and samtools#495.
Hi all,
I was working with some BAM files that had a large number of contigs (they were aligned against parts of the NCBI NT database, which has thousands of short contigs). The input files had disjoint contigs, having been aligned to disjoint subsets of the database. This turned out to make 'samtools merge' prohibitively expensive, spending hours in
trans_tbl_init
.This patch improves the situation by (a) switching from a quadratic search for
@SQ
lines intranslate->text
to a linear pass over it, (b) switching from usingregexec
et al to plain oldstring.h
operations, making a pass over a single file much faster, and (c) allowing the hashtable used to build the global translation table to live between reading each input, avoid quadratic cost of repeatedly building and expanding the table when there are more than a few input files involved.This is only lightly tested with the particular workload that was causing trouble, but if you think the use case (many small contigs) is interesting, this might be useful to incorporate.