-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BAM index generated with samtools merge --write-index missing a contig #1197
Comments
Does this happen if you disable multi-threading? I can see there are subtle differences to indices when running |
No, if I disable multithreading I don't see the effect (i.e., I see all of
the contigs in the generated index).
In case this is relevant, the merging step is not required. If I take the
merged bam and simply do:
samtools view -b -o out.bam -@8 --write-index in.bam
then the resulting index will have the missing contig.
If it helps I should be able to share the bam file - it is human data but
it is for a benchmarking dataset so we are meant to be able to share the
data.
Simon
…On Mon, Feb 17, 2020 at 1:24 PM James Bonfield ***@***.***> wrote:
Does this happen if you disable multi-threading?
I can see there are subtle differences to indices when running --write-index
***@***.*** vs --write-index although so far I haven't spotted any entire
chromosomes going missing. The differences are limited to virtual offsets
being the end of the previous block or the start of the next block, which
make no real difference I think. That was GRCh37 alignments. I'll try again
on GRCh38 as it has many more chromosomes to trigger issues with.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1197?email_source=notifications&email_token=AAY465YQ3KCUXBXSIS4WPG3RDJ6XZA5CNFSM4KVW5LSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL6HJFA#issuecomment-586970260>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAY465ZLLYVZVAHI74LP36TRDJ6XZANCNFSM4KVW5LSA>
.
|
I can reproduce this exact problem with a smaller data set, so thanks for the offer but I have my own data to play with now. |
This adds the analogue of the hts_idx_amend_last function for bgzf. This is necessary when multi-threading output using --write-index. Fixes samtools/samtools#1197 In theory the change should have no impact as the only difference is whether our virtual offset points to the end of a block or the start of the next block. Either way the two offsets are essentially the same locaiton on disk. However due to a bug elsewhere (see next commit) this lead to unreported bgzf_read failures.
This adds the analogue of the hts_idx_amend_last function for bgzf. This is necessary when multi-threading output using --write-index. Fixes samtools/samtools#1197 In theory the change should have no impact as the only difference is whether our virtual offset points to the end of a block or the start of the next block. Either way the two offsets are essentially the same locaiton on disk. However due to a bug elsewhere (see next commit) this lead to unreported bgzf_read failures.
This adds the analogue of the hts_idx_amend_last function for bgzf. This is necessary when multi-threading output using --write-index. Fixes samtools/samtools#1197 In theory the change should have no impact as the only difference is whether our virtual offset points to the end of a block or the start of the next block. Either way the two offsets are essentially the same locaiton on disk. However due to a bug elsewhere (see next commit) this lead to unreported bgzf_read failures.
Are you using the latest version of [samtools]
Yes
samtools 1.10
Using htslib 1.10.1
Please describe your environment.
OS
Linux 5.5.2-arch1-1
machine architecture
x86_64
compiler
gcc (GCC) 9.2.0
Please specify the steps taken to generate the issue, the command you are running and the relevant output.
Important Note
This is part of a benchmarking exercise for an analysis pipeline. The pipeline gives the same results across at least 6 different sites with diverse architectures so it is not limited to the envrionment listed above. There is more than enough memory for the operation (the tested systems have between 64GB and 256GB of RAM). While there is a simple workaround (don't use --write-index and use samtools index to generate index), it is clearly worrying when certain contigs are not accessible using the index when no error message was generated during the generation of the index.
Generated merged BAM from 3 individual BAM files using samtools merge. Human data aligned to GRCh38 (GCA_000001405.15_GRCh38_no_alt_analysis_set). If I generate the index using --write-index I get a .csi index that appears to work. However the command:
samtools view testfile.bam chrUn_KI270330v1
gives no output, despite chrUn_KI270330v1 existing in the BAM header and almost 10000 alignments for the contig being present in the BAM file (verified using samtools view without giving a location). If I create a csi index explicitly using samtools index -c testfile.bam I do not see this issue (i.e., I can extract alignments for all contigs).
I have also generated a merged CRAM file from the 3 input BAM files, using --write-index to create a .csi index on the fly, and this index works correctly with all contigs being accessible.
The text was updated successfully, but these errors were encountered: