Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAM index generated with samtools merge --write-index missing a contig #1197

Closed
heathsc opened this issue Feb 15, 2020 · 3 comments · Fixed by samtools/htslib#1028
Closed
Labels

Comments

@heathsc
Copy link

heathsc commented Feb 15, 2020

Are you using the latest version of [samtools]

Yes
samtools 1.10
Using htslib 1.10.1

Please describe your environment.

  • OS
    Linux 5.5.2-arch1-1

  • machine architecture
    x86_64

  • compiler
    gcc (GCC) 9.2.0

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

Important Note
This is part of a benchmarking exercise for an analysis pipeline. The pipeline gives the same results across at least 6 different sites with diverse architectures so it is not limited to the envrionment listed above. There is more than enough memory for the operation (the tested systems have between 64GB and 256GB of RAM). While there is a simple workaround (don't use --write-index and use samtools index to generate index), it is clearly worrying when certain contigs are not accessible using the index when no error message was generated during the generation of the index.

Generated merged BAM from 3 individual BAM files using samtools merge. Human data aligned to GRCh38 (GCA_000001405.15_GRCh38_no_alt_analysis_set). If I generate the index using --write-index I get a .csi index that appears to work. However the command:

samtools view testfile.bam chrUn_KI270330v1

gives no output, despite chrUn_KI270330v1 existing in the BAM header and almost 10000 alignments for the contig being present in the BAM file (verified using samtools view without giving a location). If I create a csi index explicitly using samtools index -c testfile.bam I do not see this issue (i.e., I can extract alignments for all contigs).

I have also generated a merged CRAM file from the 3 input BAM files, using --write-index to create a .csi index on the fly, and this index works correctly with all contigs being accessible.

@jkbonfield jkbonfield added the bug label Feb 17, 2020
@jkbonfield
Copy link
Contributor

Does this happen if you disable multi-threading?

I can see there are subtle differences to indices when running --write-index -@8 vs --write-index although so far I haven't spotted any entire chromosomes going missing. The differences are limited to virtual offsets being the end of the previous block or the start of the next block, which make no real difference I think. That was GRCh37 alignments. I'll try again on GRCh38 as it has many more chromosomes to trigger issues with.

@heathsc
Copy link
Author

heathsc commented Feb 17, 2020 via email

@jkbonfield
Copy link
Contributor

I can reproduce this exact problem with a smaller data set, so thanks for the offer but I have my own data to play with now.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Feb 19, 2020
This adds the analogue of the hts_idx_amend_last function for bgzf.
This is necessary when multi-threading output using --write-index.

Fixes samtools/samtools#1197

In theory the change should have no impact as the only difference is
whether our virtual offset points to the end of a block or the start
of the next block.  Either way the two offsets are essentially the
same locaiton on disk.  However due to a bug elsewhere (see next
commit) this lead to unreported bgzf_read failures.
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Feb 19, 2020
This adds the analogue of the hts_idx_amend_last function for bgzf.
This is necessary when multi-threading output using --write-index.

Fixes samtools/samtools#1197

In theory the change should have no impact as the only difference is
whether our virtual offset points to the end of a block or the start
of the next block.  Either way the two offsets are essentially the
same locaiton on disk.  However due to a bug elsewhere (see next
commit) this lead to unreported bgzf_read failures.
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Feb 19, 2020
This adds the analogue of the hts_idx_amend_last function for bgzf.
This is necessary when multi-threading output using --write-index.

Fixes samtools/samtools#1197

In theory the change should have no impact as the only difference is
whether our virtual offset points to the end of a block or the start
of the next block.  Either way the two offsets are essentially the
same locaiton on disk.  However due to a bug elsewhere (see next
commit) this lead to unreported bgzf_read failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants