An empty BGZF block elsewhere in the file stream causes premature EOF. #45

Closed
jkbonfield opened this issue Dec 16, 2013 · 3 comments

@jkbonfield
Contributor

Adding ^_\213^H^D^@^@^@^@^@\377^F^@BC^B^@^[^@^C^@^@^@^@^@^@^@^@^@ (the 28-byte empty BGZF block) anywhere in a BAM file other than the end causes samtools view to stop prematurely. This either produces a decoding error or, if the block happens to coincide with the genuine end of a sequence record, silently truncates the output. This is visible if we add it as the first block after the SAM header.
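For readers who want to reproduce this, the caret-notation bytes quoted above correspond to the 28-byte BGZF EOF marker from the SAM specification. The sketch below writes it out in hex and checks, using only the Python standard library, that it is a valid gzip member with an empty payload. The name `BGZF_EOF` is just an illustrative label, not something taken from samtools.

```python
import gzip
import struct

# The empty BGZF block, field by field (name BGZF_EOF is illustrative only).
BGZF_EOF = bytes.fromhex(
    "1f8b0804"   # gzip magic, CM=8 (deflate), FLG=4 (extra field present)
    "00000000"   # MTIME
    "00ff"       # XFL, OS (255 = unknown)
    "0600"       # XLEN = 6
    "42430200"   # subfield 'B' 'C', SLEN = 2
    "1b00"       # BSIZE = 27, i.e. total block size minus one
    "0300"       # deflate stream containing no data
    "00000000"   # CRC32 of the empty payload
    "00000000"   # ISIZE = 0
)

# BSIZE lives at byte offset 16 and stores (block length - 1).
block_len = struct.unpack_from("<H", BGZF_EOF, 16)[0] + 1
payload = gzip.decompress(BGZF_EOF)
print(len(BGZF_EOF), block_len, payload)
```

Since the block is an ordinary gzip member that happens to decompress to nothing, a reader has no structural reason to stop when it meets one in mid-stream.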

Picard continues past this and ignores it as an end of file identifier unless it actually occurs at the end. The proposed amendment to the specification states "should by convention finish with a specific empty BGZF block". Note it does not say anything about the behaviour of empty blocks anywhere else, so we should not be interpreting them as a special indicator for EOF.
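A minimal sketch of the behaviour argued for here, assuming an in-memory byte string rather than a real file handle: the reader walks block by block, treats an empty block as a no-op, and stops only when the bytes run out. This is not htslib or Picard code; `make_bgzf_block` and `iter_bgzf_payloads` are hypothetical helpers written for this illustration, and payloads are assumed to be small enough to fit in one block.

```python
import struct
import zlib

BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200" "1b00" "0300"
    "00000000" "00000000"
)

def make_bgzf_block(payload: bytes) -> bytes:
    """Wrap a small payload in a single BGZF block (hypothetical helper)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # raw deflate, no zlib header
    cdata = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1              # total block length minus one
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer

def iter_bgzf_payloads(data: bytes):
    """Yield each non-empty block payload; empty blocks are skipped, not EOF."""
    pos = 0
    while pos < len(data):                           # only end of input stops us
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<BBBBIBBH", data, pos)
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            raise ValueError(f"not a BGZF block at offset {pos}")
        extra = data[pos + 12 : pos + 12 + xlen]
        bsize, off = None, 0
        while off + 4 <= len(extra):                 # scan extra subfields for 'BC'
            si1, si2, slen = struct.unpack_from("<BBH", extra, off)
            if (si1, si2, slen) == (ord("B"), ord("C"), 2):
                bsize = struct.unpack_from("<H", extra, off + 4)[0]
            off += 4 + slen
        if bsize is None:
            raise ValueError(f"missing BC subfield at offset {pos}")
        block_len = bsize + 1
        cdata = data[pos + 12 + xlen : pos + block_len - 8]
        payload = zlib.decompress(cdata, wbits=-15)
        pos += block_len
        if payload:                                  # empty block: nothing to emit
            yield payload

# An empty block sits between two data blocks and again at the very end.
stream = (make_bgzf_block(b"first") + BGZF_EOF
          + make_bgzf_block(b"second") + BGZF_EOF)
payloads = list(iter_bgzf_payloads(stream))
print(payloads)
```

With this loop the embedded empty block contributes no data and the second payload is still read, which matches the Picard behaviour described above.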

@ghost ghost assigned jmarshall Dec 18, 2013
@jmarshall jmarshall added this to the 1.1 milestone Mar 24, 2014
@jmarshall jmarshall modified the milestones: 1.2, 1.1 Sep 19, 2014
@jmarshall jmarshall modified the milestones: 1.3, 1.2 Jan 30, 2015
@jmarshall
Member

Background: the desired effect of embedded EOF trailer blocks was discussed in December 2013 as a result of this issue, leading to a spec clarification (see samtools/hts-specs@c02ad4c) noting that they should not be interpreted as EOF.

See [Samtools-help] tabix bug on cat'ed vcf.gz for another use case.

@isthisthat

Thanks for looking into this, John. This bug came up when parallelizing variant calling by chromosome, so it is likely to become more common in the future.

@Lenbok

Lenbok commented May 14, 2015

RTG uses the same trick internally when doing mapping - individual sub BAM-files are written without header or termination block as appropriate in order to permit a fast concatenation at the end.
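As a rough illustration of that kind of concatenation (not RTG's actual code), the sketch below joins already-compressed BGZF byte strings so that only the final output carries the empty terminating block. The function name `concat_bgzf` is hypothetical, and the sketch assumes any trailing 28 bytes equal to the marker really are the terminator. It handles the terminator only; for BAM, every part after the first would also have to be written without a header.

```python
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200" "1b00" "0300"
    "00000000" "00000000"
)

def concat_bgzf(parts):
    """Join BGZF byte strings, leaving one EOF marker at the very end."""
    out = bytearray()
    for part in parts:
        # Drop a terminator that would otherwise end up in mid-stream.
        if part.endswith(BGZF_EOF):
            part = part[: -len(BGZF_EOF)]
        out += part
    out += BGZF_EOF
    return bytes(out)

# Placeholder bytes stand in for real compressed blocks in this demo.
merged = concat_bgzf([b"PART-A" + BGZF_EOF, b"PART-B", b"PART-C" + BGZF_EOF])
print(merged.count(BGZF_EOF))
```

A plain `cat` of complete files skips the stripping step, which is how empty blocks end up in the middle of a stream in the first place.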
