samtools merge very slow with many files #203

Closed
jrandall opened this issue Apr 16, 2014 · 5 comments

@jrandall (Contributor)

samtools merge appears to work fine for a small number of files (up to 50 or so, it runs fairly fast). However, when merging larger numbers of files, the work of examining the input headers and building the merged header (before the actual data merge even begins) starts to dominate the runtime.

Each additional input file is processed more and more slowly as it is opened. I think this is because the regexes are searching through all of the header data loaded so far (this appears to be a particular problem for the PG ID searches).

By the 100th file, it takes around a second to process each additional file. By the 200th, each file open takes a few seconds. By the 500th file, it takes around 50 s each, and by the 1200th, around 200 s. We had to stop testing there, but extrapolating, opening the 5000th file would take nearly an hour, and opening all of the files and processing all of the headers would take several months to complete.

I have refactored some of the code in bam_sort.c to assist with profiling (07e333e), and I believe the issue comes down to a few of the regexes dealing with PG lines.

If we had a header parser, we could replace the regex implementation with one that built sensible data structures to keep track of the needed merge information.
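
As a rough illustration of what such a parser could keep track of, here is a minimal sketch (hypothetical; not the code samtools actually adopted) that stores the PG IDs already present in the merged header in a khash string set, so checking whether an incoming ID clashes becomes a constant-time lookup instead of a regex scan over the whole accumulated header text. The names `merge_pg_state` and `pg_id_add` are made up for this example.

```c
/* Hypothetical sketch only -- not the actual bam_sort.c merge code.
 * Build with: cc pg_ids.c -I/path/to/htslib */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "htslib/khash.h"

/* Hash set of @PG ID strings already emitted into the merged header. */
KHASH_SET_INIT_STR(pgids)

typedef struct {
    khash_t(pgids) *seen;
} merge_pg_state;

/* Returns 1 if `id` was newly added, 0 if it was already present
 * (i.e. the caller would need to pick a new, unique ID for the
 * incoming @PG line before appending it to the merged header). */
static int pg_id_add(merge_pg_state *st, const char *id)
{
    int ret;
    khint_t k = kh_put(pgids, st->seen, id, &ret);
    if (ret == 0) return 0;               /* clash: ID already present */
    kh_key(st->seen, k) = strdup(id);     /* keep our own copy of the key */
    return 1;
}

int main(void)
{
    merge_pg_state st = { kh_init(pgids) };
    const char *incoming[] = { "bwa", "samtools", "bwa" };

    for (int i = 0; i < 3; i++) {
        if (!pg_id_add(&st, incoming[i]))
            printf("PG ID '%s' clashes; would rename it, e.g. to '%s-1'\n",
                   incoming[i], incoming[i]);
    }

    /* Free the duplicated keys and the hash itself. */
    for (khint_t k = kh_begin(st.seen); k != kh_end(st.seen); ++k)
        if (kh_exist(st.seen, k)) free((char *)kh_key(st.seen, k));
    kh_destroy(pgids, st.seen);
    return 0;
}
```

A lookup like this stays O(1) per incoming ID no matter how many headers have already been merged, which is the step where scanning the accumulated header text per file makes the overall cost grow quadratically.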

@SamStudio8 (Contributor)

@jrandall I think this issue has been largely solved by your recent commit (9dff552). Running pretty_header just once, rather than once per file, seems far more efficient for large groups of files, and per-file processing time appears more uniform for each additional input.

@SamStudio8 (Contributor)

@jrandall @jmarshall As a follow-up, I've been able to merge ~4500 BAM files in about 2.8 hours.
Prior to 9dff552, merely opening the files for this job took longer than the 48-hour maximum execution time LSF would allow!

@SamStudio8 (Contributor)

Following up once more: with samtools using htslib header parsing, this job now completes in 1.1 hours, although it uses around 1.5x to 2x the memory. That makes sense given the duplication: the RG/PG fields are stored as parsed structures as well as in the whole header data line, and the latter will be deprecated in future.
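
For context, here is a small sketch of the kind of structured lookup the htslib header API provides (assuming htslib >= 1.10's `sam_hdr_*` functions; this is illustrative, not the actual merge code, and the PG ID "bwa" is just an example value). Lines are counted and probed by record type and ID rather than by regex searches over the raw header text.

```c
/* Illustrative only. Build with: cc hdr_lookup.c -lhts */
#include <stdio.h>
#include "htslib/sam.h"
#include "htslib/kstring.h"

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s in.bam\n", argv[0]); return 1; }

    samFile *fp = sam_open(argv[1], "r");
    if (!fp) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    sam_hdr_t *hdr = sam_hdr_read(fp);
    if (!hdr) { fprintf(stderr, "failed to read header\n"); sam_close(fp); return 1; }

    /* Structured queries instead of regexes over the header text. */
    printf("@PG lines: %d\n", sam_hdr_count_lines(hdr, "PG"));

    kstring_t line = KS_INITIALIZE;
    if (sam_hdr_find_line_id(hdr, "PG", "ID", "bwa", &line) == 0)
        printf("found: %s\n", line.s);
    else
        printf("no @PG line with ID:bwa\n");
    ks_free(&line);

    sam_hdr_destroy(hdr);
    sam_close(fp);
    return 0;
}
```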

@SamStudio8 (Contributor)

See #481

@daviesrob (Member)

Closing as this was made more efficient a while back.
