Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop samtools cat from outputting multiple CRAM EOF markers. #1422

Closed
wants to merge 2 commits into from

Conversation

jkbonfield
Copy link
Contributor

Samtools cat naively copies all incoming containers, and then writes an empty EOF container when closing the file. This leads to multiple internal EOF markers as they're implemented in CRAM as an empty container.

Oddly there was code explicitly dealing with this, which seems redundant. Perhaps it served a purpose in early versions. Regardless internal EOF markers is permitted and are ignored, so this change is just for tidyness sake and to improve robustness incase bad implementations stop on the first EOF.

Samtools cat naively copies all incoming containers, and then writes
an empty EOF container when closing the file.  This leads to multiple
internal EOF markers as they're implemented in CRAM as an empty
container.

Oddly there was code explicitly dealing with this, which seems
redundant.  Perhaps it served a purpose in early versions.  Regardless
internal EOF markers is permitted and are ignored, so this change is
just for tidyness sake and to improve robustness incase bad
implementations stop on the first EOF.
@valeriuo
Copy link
Contributor

I am investigating some large memory allocations ~1 GB. It seems that a cram_read_container(in_c) call after a true cram_container_is_empty(in_c) might lead to some uninitialized container fields, especially num_landmarks, which results in a large allocation.

Amend previous commit following review comments.  (As a distinct
commit temporarily so the change we discussed is visible.)

Empty containers still have a CRAM compression header block.  So EOF
blocks internal to a file need to have their block read even if we're
not going to write it and the container back out again.

This was visible by the following experiment.

// Create a CRAM with internal EOF markers
samtools-1.12 cat in.bam in.bam -o out1.bam

// Concatenate such files together, with code prior to this commit
samtools-previous-commit cat out1.bam out1.bam -o out2.bam
@valeriuo
Copy link
Contributor

Merged into develop as 1ea60ad

@valeriuo valeriuo closed this Apr 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants