samtools merge very slow with many files #203

Closed
jrandall opened this issue Apr 16, 2014 · 5 comments

@jrandall (Contributor)

samtools merge appears to work fine for a small number of files (up to 50 or so, it runs fairly fast). However, when merging larger numbers of files, the work of examining the input headers and building the merged header (before the actual data merge even begins) starts to dominate the runtime.

Each additional input file is processed more and more slowly as it is opened. I think this is because the regexes are searching through all of the header data loaded so far (this appears to be a particular problem for the PG ID searches).

By the 100th file, it takes around a second to process each additional file. By the 200th, each file open takes a few seconds. By the 500th file, it takes around 50 s each, and by the 1200th, around 200 s. We had to stop testing there, but extrapolating, opening the 5000th file would take nearly an hour, and opening all of the files and processing all of the headers would take several months to complete.

I have refactored some of the code in bam_sort.c to assist with profiling (07e333e), and I believe the issue comes down to a few of the regexes dealing with PG lines.

If we had a header parser, we could replace the regex implementation with one that built sensible data structures to keep track of the needed merge information.
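
As a rough illustration of what such a parser could keep track of, here is a minimal sketch (hypothetical; not the code samtools actually adopted) that stores the PG IDs already present in the merged header in a khash string set, so checking whether an incoming ID clashes becomes a constant-time lookup instead of a regex scan over the whole accumulated header text. The names `merge_pg_state` and `pg_id_add` are made up for this example.

```c
/* Hypothetical sketch only -- not the actual bam_sort.c merge code.
 * Build with: cc pg_ids.c -I/path/to/htslib */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "htslib/khash.h"

/* Hash set of @PG ID strings already emitted into the merged header. */
KHASH_SET_INIT_STR(pgids)

typedef struct {
    khash_t(pgids) *seen;
} merge_pg_state;

/* Returns 1 if `id` was newly added, 0 if it was already present
 * (i.e. the caller would need to pick a new, unique ID for the
 * incoming @PG line before appending it to the merged header). */
static int pg_id_add(merge_pg_state *st, const char *id)
{
    int ret;
    khint_t k = kh_put(pgids, st->seen, id, &ret);
    if (ret == 0) return 0;               /* clash: ID already present */
    kh_key(st->seen, k) = strdup(id);     /* keep our own copy of the key */
    return 1;
}

int main(void)
{
    merge_pg_state st = { kh_init(pgids) };
    const char *incoming[] = { "bwa", "samtools", "bwa" };

    for (int i = 0; i < 3; i++) {
        if (!pg_id_add(&st, incoming[i]))
            printf("PG ID '%s' clashes; would rename it, e.g. to '%s-1'\n",
                   incoming[i], incoming[i]);
    }

    /* Free the duplicated keys and the hash itself. */
    for (khint_t k = kh_begin(st.seen); k != kh_end(st.seen); ++k)
        if (kh_exist(st.seen, k)) free((char *)kh_key(st.seen, k));
    kh_destroy(pgids, st.seen);
    return 0;
}
```

A lookup like this stays O(1) per incoming ID no matter how many headers have already been merged, which is the step where scanning the accumulated header text per file makes the overall cost grow quadratically.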

@SamStudio8 (Contributor)

@jrandall I think this issue has been largely solved by your recent commit (9dff552). Running pretty_header just once, rather than once per file, seems far more efficient for large groups of files, and per-file processing time appears more uniform for each additional input.

@SamStudio8 (Contributor)

@jrandall @jmarshall As a follow-up, I've been able to merge ~4500 BAM files in about 2.8 hours.
Prior to 9dff552, merely opening the files for this job took longer than the 48-hour maximum execution time LSF would allow!

@SamStudio8 (Contributor)

Following up once more: with samtools using htslib header parsing, this job now completes in 1.1 hours, although it uses around 1.5x to 2x the memory. That makes sense given the duplication: the RG/PG fields are stored as parsed structures as well as in the whole header data line, and the latter will be deprecated in future.
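
For context, here is a small sketch of the kind of structured lookup the htslib header API provides (assuming htslib >= 1.10's `sam_hdr_*` functions; this is illustrative, not the actual merge code, and the PG ID "bwa" is just an example value). Lines are counted and probed by record type and ID rather than by regex searches over the raw header text.

```c
/* Illustrative only. Build with: cc hdr_lookup.c -lhts */
#include <stdio.h>
#include "htslib/sam.h"
#include "htslib/kstring.h"

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s in.bam\n", argv[0]); return 1; }

    samFile *fp = sam_open(argv[1], "r");
    if (!fp) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    sam_hdr_t *hdr = sam_hdr_read(fp);
    if (!hdr) { fprintf(stderr, "failed to read header\n"); sam_close(fp); return 1; }

    /* Structured queries instead of regexes over the header text. */
    printf("@PG lines: %d\n", sam_hdr_count_lines(hdr, "PG"));

    kstring_t line = KS_INITIALIZE;
    if (sam_hdr_find_line_id(hdr, "PG", "ID", "bwa", &line) == 0)
        printf("found: %s\n", line.s);
    else
        printf("no @PG line with ID:bwa\n");
    ks_free(&line);

    sam_hdr_destroy(hdr);
    sam_close(fp);
    return 0;
}
```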

@SamStudio8 (Contributor)

See #481

@daviesrob (Member)

Closing as this was made more efficient a while back.
