SAM and zero length Z, H, and B fields #135
Comments
And similarly for the |
I'd think so. Thanks for reminding me I need to work on this action... after threading hell! :) |
Thinking more on this, H is an interesting one. It's just implemented as a string in the format, so it can go from SAM->BAM->SAM just fine. However, practically speaking, how is someone going to decode this? It's likely to be 2 chars at a time until the end of the aux tag. In which case I wonder who uses this, and do tools really deal correctly with hex strings containing an odd number of digits? (By correct I mean prepending a leading 0 so the nibbles end up in the right place, in case it was generated by a naive printf with %x.) I assume the original rationale behind H was to encode binary data or strings with tabs in. It was there prior to the addition of B, which actually encodes using bytes rather than string form and so hopefully is the method everyone now uses. Edit: samtools is happy to read such data, but I now see Picard fails with a complaint about not having an even number of digits. It does this for both SAM and BAM, even with silent validation stringency. Therefore I think we can conclude that such data doesn't commonly exist in the wild, or it would break too many tools. I'll add this change to my PR.
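To make the "2 chars at a time" decoding concrete, here is a minimal Python sketch. The function name `decode_hex_tag` and the `pad_odd` switch are hypothetical, not taken from samtools or Picard; by default it rejects an odd number of digits (the behaviour reported for Picard above), and optionally it prepends a leading 0 instead.

```python
import re

# Hex digits only; an empty string is allowed here.
_HEX_RE = re.compile(r"[0-9A-Fa-f]*")


def decode_hex_tag(value: str, pad_odd: bool = False) -> bytes:
    """Decode the text of an H-type aux value into bytes, two digits per byte.

    pad_odd is a hypothetical option: when True, an odd-length string gets a
    leading "0" so the nibbles line up; when False it is rejected.
    """
    if _HEX_RE.fullmatch(value) is None:
        raise ValueError("H value contains non-hex characters")
    if len(value) % 2 != 0:
        if not pad_odd:
            raise ValueError("H value has an odd number of hex digits")
        value = "0" + value
    # Consume two hex digits per output byte.
    return bytes(int(value[i:i + 2], 16) for i in range(0, len(value), 2))
```

With this sketch the spec footnote's example string 1AE301 decodes to the bytes 0x1a, 0xe3, 0x01, and an empty string decodes to an empty byte string.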
It also now forces H type (hex) to be an even number of bytes. Implements samtools/hts-specs#135
supported in BAM implementations but not permitted in SAM. Also changed H to require an even number of bytes as this was (probably) an oversight and is enforced by some implementations (htsjdk, maybe more). Fixes samtools#135
I believe the only extant H was Also we can pretend that the footnote (For example, a byte array [0x1a,0xe3,0x1] corresponds to a Hex string ‘1AE301’) implies this. 😄 |
True, we have no standard H tags, but I think H existed before FZ anyway. It may be in use for non-standard tags. As you say though, it'd require substantial effort to cope with an odd number of nibbles. I also looked at the footnote and realised it would have been so much more useful had it chosen 011AE3 instead of 1AE301, but never mind. As such it doesn't really imply anything. Incidentally, in my investigations I see this (H) is yet another SAM->BAM->SAM round-trip that fails to validate. Picard ViewSam always converts H in BAM to a B array in SAM instead. So I expect there are fewer and fewer uses of H out there in the wild given the automatic type conversion.
…ts-spec * Added a test to show that htsJDK can handle empty strings as tag values. * - fixed the text codec to enable empty strings - testing that we can code and decode into various formats - Enables the reading and writing of sam/bam/cram records with empty string tags as required by samtools/hts-specs#135
The SAM description of B arrays is
That final Especially as HTSJDK likes to rewrite H tags to B, we may need to relax B to allow zero-length arrays too. (I have not yet reviewed the mailing list archives to see whether this was considered back when B was introduced, or looked at what current tools allow.) |
The introduction of B arrays was discussed in this epic samtools-devel thread, and the question of 0‑length arrays was not considered. |
This may be a little late to the party, but for someone parsing a BAM themselves (e.g. not with samtools), how does one determine the length of type H or type Z "arrays"? The B-type array explicitly says this; the others do not.
They're both C-strings, so NUL-terminated. You're correct that this doesn't appear to be explicitly stated anywhere. Thanks for reporting this. Regarding the original issue, @jmarshall, you closed it and then reopened it. Is this because, while Z and H now permit zero length, B does not? If so, is the only change needed to replace "+" with "*" in the regexp in the section 1.5 table? (Along with some cautionary checking to see whether it actually works with tools and whether it is desirable.)
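For anyone hand-parsing BAM, a rough Python sketch of the difference follows: Z and H values run up to a terminating NUL byte, while B carries a subtype letter and an explicit little-endian element count. The function name `parse_aux` is hypothetical, and the input is assumed to be only the aux block of a single, already-decompressed record.

```python
import struct

# struct format letters for the fixed-size BAM aux value types.
_FIXED = {"A": "c", "c": "b", "C": "B", "s": "h", "S": "H",
          "i": "i", "I": "I", "f": "f"}
_B_SUBTYPES = "cCsSiIf"


def parse_aux(buf: bytes):
    """Yield (tag, type, value) for each aux field in one record's aux block."""
    pos = 0
    while pos < len(buf):
        tag = buf[pos:pos + 2].decode("ascii")
        typ = chr(buf[pos + 2])
        pos += 3
        if typ in ("Z", "H"):
            # C string: the length is implied by the terminating NUL.
            end = buf.index(b"\x00", pos)
            value = buf[pos:end].decode("ascii")  # may be the empty string
            pos = end + 1
        elif typ == "B":
            # Array: subtype letter, then a 32-bit count, then the elements.
            sub = chr(buf[pos])
            if sub not in _B_SUBTYPES:
                raise ValueError("unknown B subtype %r" % sub)
            (count,) = struct.unpack_from("<I", buf, pos + 1)
            fmt = "<%d%s" % (count, _FIXED[sub])
            value = list(struct.unpack_from(fmt, buf, pos + 5))  # [] if count == 0
            pos += 5 + struct.calcsize(fmt)
        else:
            fmt = "<" + _FIXED[typ]
            (value,) = struct.unpack_from(fmt, buf, pos)
            if typ == "A":
                value = value.decode("ascii")
            pos += struct.calcsize(fmt)
        yield tag, typ, value
```

Note that in this layout a zero-length Z or H is just a lone NUL byte and a zero-length B is just a count of 0, which is why the binary side has no trouble representing them.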
A SAM file with an aux tag of Is this still something we wish to resolve? If yes the fix is a trivial 1 character modification (+ to *). If not we can close this issue. IMO it should be permitted and we should bug fix htslib. @yfarjoun, @jmarshall? |
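As a quick illustration of the one-character change, here is a Python sketch comparing the two forms of the B value regexp. The pattern is written from memory of the section 1.5 table, so treat its exact text as an assumption; the names `B_STRICT`, `B_RELAXED` and `b_value_ok` are made up for this example.

```python
import re

# Assumed text of the SAM B-array value regexp, before and after the change.
_NUM = r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"
B_STRICT = re.compile(r"[cCsSiIf](,%s)+" % _NUM)   # one or more entries
B_RELAXED = re.compile(r"[cCsSiIf](,%s)*" % _NUM)  # zero or more entries


def b_value_ok(value: str, allow_empty: bool = True) -> bool:
    """Check the text after "B:" in a SAM optional field."""
    pattern = B_RELAXED if allow_empty else B_STRICT
    return pattern.fullmatch(value) is not None
```

Under the relaxed pattern a value consisting of the subtype letter alone is accepted, while a subtype letter followed by a bare trailing comma still fails, because every comma must be followed by a number.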
Yes, I reopened this because I think B needs to allow zero length too, as explained in my comment of the time. There is a small question of syntax in SAM. Studying the code suggests that if asked to print an alignment record containing a zero-length array (in their in-memory representation), both HTSlib and HTSJDK would print it as A trailing comma like I reckon the latter.
Good point on not matching the regex, so a blank array should indeed be I tried it, and sadly neither htslib nor htsjdk handle it:
(Neither accepts it with a trailing comma either, which seems to dispute what I found before. Maybe that was a ValidateSamFile (accepts it) vs ViewSam difference.) I'm happy to make changes to htslib to support blank B fields if the htsjdk devs want to do the same to their code?
Fixes samtools#135 (for B fields; already fixed for Z and H).
Is there a use-case of an empty array as opposed to a missing tag? Otherwise, I'm not sure what we are trying to solve here. |
It's a corner case, the same as an empty string would be. This issue exists because empty strings were encountered in BAM files in the wild, despite the spec claiming they were forbidden especially in SAM. Empty strings/arrays are otherwise valid values for these items, so it seems a shame for files that happen to include them to fail because the spec over-constrains these items. In particular, I seem to recall HTSJDK converts H to B — and empty H has been valid since PR #155, so empty B needs to be valid too. Suppose you have an array tag that records some special event that happens to occur to a few of the bases in most of your reads, so the tag lists the indices 1…readlength of the bases in the read on which the event occurs. (Methylation, say. 😄) On some reads it happens that no bases experienced the event, and on those reads it's reasonable for the array to be present but zero length. (As listed above, I have a commit to fix this on my misc-sam-fixes branch, which is probably overdue to become a PR.) |
Permit no numeric entries in SAM B array regexp. Fixes samtools#135 (for B fields; already fixed for Z and H). Describe the BAM representation of SEQ '*'. Fixes samtools#49. Also avoid odd "odd numbered length" phrasing. Further clarify that only @HD/SQ/RG/PG/CO are valid headers. Explicitly list the header types in the regexp too. Apparently some were still focussing on the [A-Z][A-Z] regexp and ignoring the surrounding text -- see samtools-help July 2017, 'additional header lines "@ga"', <https://sourceforge.net/p/samtools/mailman/message/35933210/>. Remove CIGAR N intron/skip operator history note. The N operator was in the original SAM document; its description was merely improved in Dec 2010 (51a762b).
I see HTSJDK converting H to B has recently been reported as a bug: broadinstitute/picard#1199. |
Empty B arrays are now addressed by PR #326. |
"HTSJDK converting H to B has recently been reported as a bug" because it is a bug:
The SAM spec does not prohibit it either, and it does explicitly condone conversions between c/C/s/S/i/I so one can see how an implementation might extrapolate. Anyway, it's irrelevant on this issue. If you feel that you want a ruling from the spec maintainers, please create a new issue. But I would suggest waiting for some resolution either way from the maintainers of the implementation in question first. The spec describes The reality is that |
Yah, I went back through that "epic" thread where byte array semantics were discussed, and I realize now that the history of the H type is interwoven with that of the B type, i.e., it was brought into the discussion as a way of representing an arbitrary array of bytes rather than a hexadecimal value. Oh, well. The H type would have been useful if it represented a binary value. But I agree that as an alternative method of representing a linear list of byte values, it's "ripe for deprecation." I certainly won't use it in the future. Thanks for taking the time to respond! |
We occasionally find a tag like
MC:Z:
with no string after it, i.e. the zero-length string. This tag in BAM can be read by both samtools and picard, but in SAM both tools emit an error.
The SAM specification states Z has to take the form
[ !-~]+
, i.e. 1 or more characters, so the above MC tag is illegal. However, I can't really see any justification for it except accidental over-specification. (It's always been this way though.)
Is this something we care about improving, or do we just accept that the tag was bugged and that it indeed shouldn't be possible to specify empty strings?
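To show the effect of the Z rule on the tag above, here is a small Python sketch that checks a SAM optional field against the current pattern and against a relaxed variant that uses * instead of +. The relaxed pattern and the helper name `z_field_ok` are assumptions for illustration, not spec text.

```python
import re

Z_CURRENT = re.compile(r"[ !-~]+")  # as the spec currently reads: 1 or more characters
Z_RELAXED = re.compile(r"[ !-~]*")  # hypothetical relaxation: 0 or more characters


def z_field_ok(field: str, allow_empty: bool = False) -> bool:
    """Validate a TAG:Z:VALUE optional field from a SAM line."""
    parts = field.split(":", 2)  # the value itself may contain ':'
    if len(parts) != 3 or parts[1] != "Z":
        return False
    pattern = Z_RELAXED if allow_empty else Z_CURRENT
    return pattern.fullmatch(parts[2]) is not None
```

Here `z_field_ok("MC:Z:")` is rejected under the current pattern and accepted once `allow_empty=True` is passed, while a non-empty value such as `MC:Z:76M` passes either way.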