samtools merge very slow with many files #203
Comments
@jrandall @jmarshall As a follow-up, I've been able to merge ~4500 BAM files in about 2.8 hours.
Following up once more: using samtools with htslib header parsing, this job now completes in 1.1 hours, although it uses around 1.5x-2x the memory. That makes sense given the duplication between the RG/PG fields being stored as well as the whole header text; the latter will be deprecated in future.
See #481
Closing as this was made more efficient a while back.
samtools merge appears to work fine for a small number of files (up to 50 or so it goes fairly fast). However, when merging larger numbers of files, the operations that examine and build the header (before the actual data merge even begins) start to dominate the run time.
Each additional input file that is opened is processed more and more slowly. I think this is because the regexes are searching through all of the data that has been loaded previously (this appears to especially be a problem for the PG ID searches).
By the 100th file, it takes around a second to process each additional file. By the 200th, each file open takes a few seconds. By the 500th file, it takes around 50s each. By the 1200th file, it is around 200s. We had to stop testing beyond that, but we expect that opening the 5000th file would take over an hour to process, and that getting all files opened and headers processed would take several months to complete.
I have refactored some of the code in bam_sort.c to assist with profiling (07e333e), and I believe the issue comes down to a few of the regexes that deal with PG lines.
If we had a header parser, we could replace the regex implementation with one that built sensible data structures to keep track of the needed merge information.