Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cram_dependent_data_series when FN is used. #1314

Closed

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Aug 4, 2021

If we get into the primary loop of cram_decode_seq, iterating over feature (count FN) then we have unguarded code that always checks for sequence overlapping the reference. To do this, seq_pos must be set. Some data series we know how seq_pos is adjusted irrespective of whether we decode, eg BS is always +1, but others are strings and the only way we know how to update seq_pos is to decode them.

Hence added FN as having a dependency on SC, IN and BB.

Fixes samtools/samtools#1475

It's more questionable though in that bug report as to why FLAGS (BF data series) ever propogated to needing FN (cigar feature number). It's due to the CRAM_CIGAR check, but I'm unsure why it's there at present. It's obviously a bit overly cautious in this case. That said, it's simply a missing optimisation rather than an error.

Edit: it's also harmless in all modern CRAMs. The cigar check means whenever we ask for BF we get RL added in for free, which is maybe wasted effort, but normally nothing for Illumina and one minor int data series for long read data. The only reason it expands so much on this data file is RL is stored in the CORE block, along with half a dozen other data series. This causes a dependency explosion, but really it's the file that is the most questionable here (it's likely old).

@valeriuo valeriuo self-assigned this Aug 5, 2021
If we get into the primary loop of cram_decode_seq, iterating over
feature (count FN) then we have unguarded code that always checks for
sequence overlapping the reference.  To do this, seq_pos must be set.
Some data series we know how seq_pos is adjusted irrespective of
whether we decode, eg BS is always +1, but others are strings and the
only way we know how to update seq_pos is to decode them.

Hence added FN as having a dependency on SC, IN and BB.

Fixes samtools/samtools#1475
@jkbonfield jkbonfield force-pushed the cram_dependent_data_series_fix branch from 5184f81 to 81cb0dd Compare August 5, 2021 10:43
@valeriuo
Copy link
Contributor

valeriuo commented Aug 5, 2021

Merged as c5aa158

@valeriuo valeriuo closed this Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Erroneous "CRAM CIGAR extends beyond slice reference extents"
2 participants