Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM and zero length Z, H, and B fields #135

Closed
jkbonfield opened this issue Mar 10, 2016 · 20 comments · Fixed by #326
Closed

SAM and zero length Z, H, and B fields #135

jkbonfield opened this issue Mar 10, 2016 · 20 comments · Fixed by #326
Labels

Comments

@jkbonfield
Copy link
Contributor

We occasionally find a tag like MC:Z: with no string after it, ie the zero length string.
This tag in BAM can be read by both samtools and picard, but in SAM both tools emit an error.

The SAM specification states Z has to take the form [ !-~]+, ie 1 or more characters, so the above MC tag is illegal.

However I can't really see any justification for it except accidental over specification. (It's always been this way though.)

Is this something we care about improving, or just accept that the tag was bugged and indeed it shouldn't be possible to specify empty strings.

@jmarshall jmarshall added the sam label Mar 11, 2016
@jmarshall jmarshall added this to the June 2016 meeting milestone Jun 6, 2016
@jmarshall
Copy link
Member

And similarly for the H tag type, perhaps?

@jkbonfield
Copy link
Contributor Author

I'd think so. Thanks for reminding me I need to work on this action... after threading hell! :)

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 17, 2016

Thinking more on this, H is an interesting one. It's just implemented as a string in the format, so it can go from SAM->BAM->SAM just fine.

However practically speaking, how is someone going to decode this? It's likely to be 2 chars at a time until the end of the aux tag. In which case 0123 should be XX:H:0123 and not XX:H:123. I'd like to change the regex to ([0-9A-F][0-9A-F])*, but this is a tightening of the spec and so not backwards compatible.

I wonder who uses this and do tools really deal with hex strings containing an odd number of values correctly? (By correct I mean prepending a leading 0 so the nibbles end up in the right place, incase it was generated by a naive printf with %x.)

I assume the original rationale behind H was to encode binary data or strings with tabs in. It was there prior to the addition of B, which actually encodes using bytes rather than string form and so hopefully is the method everyone now uses.

Edit: samtools is happy to read such data, but I now see Picard fails with a complaint about not having an even number of digits. It does this for both SAM and BAM even with silent validation stringency. Therefore I think we can conclude such data doesn't exist commonly in the wild or it would break so many tools. I'll add this change to my PR therefore.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jun 17, 2016
It also now forces H type (hex) to be an even number of bytes.

Implements samtools/hts-specs#135
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jun 17, 2016
supported in BAM implementations but not permitted in SAM.

Also changed H to require an even number of bytes as this was
(probably) an oversight and is enforced by some implementations
(htsjdk, maybe more).

Fixes samtools#135
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jun 17, 2016
supported in BAM implementations but not permitted in SAM.

Also changed H to require an even number of bytes as this was
(probably) an oversight and is enforced by some implementations
(htsjdk, maybe more).

Fixes samtools#135
@jmarshall
Copy link
Member

I believe the only extant H was FZ, and it was converted to a B-array in a01f3f2. So (1) the problem is rather moot; and (2) I think it's reasonable to assume these things were aways in practice even in length, and nobody ever parsed one by getting to the end, realising they only have half a byte left, and going back and shifting everything by a nybble.

Also we can pretend that the footnote (For example, a byte array [0x1a,0xe3,0x1] corresponds to a Hex string ‘1AE301’) implies this. 😄

@jkbonfield
Copy link
Contributor Author

rue we have no standard H tags, but I think H existed before FZ anyway. It may be in use for non-standard tags. As you say though, it'd require substantial effort to cope with an odd number of nibbles.

I also looked at the footnote and realised it would have been so much more useful had it have chosen 011AE3 instead of 1AE301, but never-mind. As such it doesn't really imply anything.

Incidentally, in my investigations I see this (H) is yet another SAM->BAM->SAM round-trip that fails to validate. Picard ViewSam always converts H in BAM to a B array in SAM instead. So I expect there are fewer and fewer uses of H out there in the wild given the automatic type conversion.

yfarjoun pushed a commit to samtools/htsjdk that referenced this issue Sep 19, 2016
…ts-spec

* Added a test to show that htsJDK can handle empty strings as tag values.

* - fixed the text codec to enable empty strings
- testing that we can code and decode into various formats

- Enables the reading and writing of sam/bam/cram records with empty string tags as required by samtools/hts-specs#135
@jmarshall jmarshall changed the title SAM and zero length Z fields SAM and zero length Z, H, and B fields Oct 20, 2016
@jmarshall
Copy link
Member

The SAM description of B arrays is

B [cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+ Integer or numeric array

That final + (rather than *) means that a zero-length B array is technically invalid in SAM too. (In BAM, there is an (unbiased) “little-endian 32-bit integer which gives the number of elements in the array”, for which 0 is a valid value.)

Especially as HTSJDK likes to rewrite H tags to B, we may need to relax B to allow zero-length arrays too. (I have not yet reviewed the mailing list archives to see whether this was considered back when B was introduced, or looked at what current tools allow.)

@jmarshall jmarshall reopened this Oct 20, 2016
@jmarshall
Copy link
Member

The introduction of B arrays was discussed in this epic samtools-devel thread, and the question of 0‑length arrays was not considered.

@betteridiot
Copy link

This may be a little late to the party, but for someone parsing a BAM themselves (e.g. not with samtools), how does one determine the length of type H or type Z "arrays". The B-type array explicitly says this...the others do not.

@jkbonfield
Copy link
Contributor Author

They're both C-strings, so nul terminated.

You're correct that this doesn't appear to be explicitly stated anywhere. Thanks for reporting this.

Regarding this original issue, @jmarshall you closed it and then reopened it. Is this because while Z and H now permit zero length B does not? If so, is the only change needed to replace "+" with "*" in the regexp in the section 1.5 table? (Along with some cautionary checking to see if it actually works with tools and whether it is desireable)

@jkbonfield
Copy link
Contributor Author

A SAM file with an aux tag of Bb:B:i, causes samtools to complain with "incomplete B-typed aux field". Picard appears to be happy with it though.

Is this still something we wish to resolve? If yes the fix is a trivial 1 character modification (+ to *). If not we can close this issue.

IMO it should be permitted and we should bug fix htslib. @yfarjoun, @jmarshall?

@jmarshall
Copy link
Member

Yes, I reopened this because I think B needs to allow zero length too, as explained in my comment of the time.

There is a small question of syntax in SAM. Studying the code suggests that if asked to print an alignment record containing a zero-length array (in their in-memory representation), both HTSlib and HTSDJK would print it as Bb:B:i — which would be consistent with the trivial 1 character regexp modification.

A trailing comma like Bb:B:i, would not match the regexp. So either we could make some complicated change to the regexp to allow it; or make the simple change (so Bb:B:i is the canonical SAM representation of zero-length array) and implementations can also accept Bb:B:i, as a QoI nicety if they like (especially if they already accept it!).

I reckon the latter.

@jkbonfield
Copy link
Contributor Author

Good point on not matching the regex, so a blank array should indeed be Bb:B:i.

I tried it, and sadly neither htslib nor htsjdk handle it:

#samtools
[E::sam_parse1] incomplete B-typed aux field
[W::sam_read1] Parse error at line 30
[main_samview] truncated file.
#picard
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Tag of type B should have an element type followed by comma; File _.sam; Line 30

(Neither accept with trailing comma either, which seems to dispute what I found before. Maybe that was ValidateSamFile (accepts it) vs ViewSam differences.

I'm happy to make changes to htslib to support blank B fields if the htsjdk devs want to do the same to their code?

jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 8, 2018
Fixes samtools#135 (for B fields; already fixed for Z and H).
@yfarjoun
Copy link
Contributor

Is there a use-case of an empty array as opposed to a missing tag?

Otherwise, I'm not sure what we are trying to solve here.

@yfarjoun
Copy link
Contributor

@lbergelson

@jmarshall
Copy link
Member

It's a corner case, the same as an empty string would be. This issue exists because empty strings were encountered in BAM files in the wild, despite the spec claiming they were forbidden especially in SAM. Empty strings/arrays are otherwise valid values for these items, so it seems a shame for files that happen to include them to fail because the spec over-constrains these items.

In particular, I seem to recall HTSJDK converts H to B — and empty H has been valid since PR #155, so empty B needs to be valid too.

Suppose you have an array tag that records some special event that happens to occur to a few of the bases in most of your reads, so the tag lists the indices 1…readlength of the bases in the read on which the event occurs. (Methylation, say. 😄) On some reads it happens that no bases experienced the event, and on those reads it's reasonable for the array to be present but zero length.

(As listed above, I have a commit to fix this on my misc-sam-fixes branch, which is probably overdue to become a PR.)

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jul 23, 2018
Permit no numeric entries in SAM B array regexp.
Fixes samtools#135 (for B fields; already fixed for Z and H).

Describe the BAM representation of SEQ '*'.
Fixes samtools#49. Also avoid odd "odd numbered length" phrasing.

Further clarify that only @HD/SQ/RG/PG/CO are valid headers.
Explicitly list the header types in the regexp too. Apparently some were
still focussing on the [A-Z][A-Z] regexp and ignoring the surrounding
text -- see samtools-help July 2017, 'additional header lines "@ga"',
<https://sourceforge.net/p/samtools/mailman/message/35933210/>.

Remove CIGAR N intron/skip operator history note.
The N operator was in the original SAM document; its description was
merely improved in Dec 2010 (51a762b).
@jmarshall
Copy link
Member

I see HTSJDK converting H to B has recently been reported as a bug: broadinstitute/picard#1199.

@jmarshall
Copy link
Member

Empty B arrays are now addressed by PR #326.

jmarshall added a commit to jmarshall/hts-specs that referenced this issue Jul 26, 2018
Permit no numeric entries in SAM B array regexp.
Fixes samtools#135 (for B fields; already fixed for Z and H).

Describe the BAM representation of SEQ '*'.
Fixes samtools#49. Also avoid odd "odd numbered length" phrasing.

Further clarify that only @HD/SQ/RG/PG/CO are valid headers.
Explicitly list the header types in the regexp too. Apparently some were
still focussing on the [A-Z][A-Z] regexp and ignoring the surrounding
text -- see samtools-help July 2017, 'additional header lines "@ga"',
<https://sourceforge.net/p/samtools/mailman/message/35933210/>.

Remove CIGAR N intron/skip operator history note.
The N operator was in the original SAM document; its description was
merely improved in Dec 2010 (51a762b).
@RWilton
Copy link

RWilton commented Aug 6, 2018

"HTSJDK converting H to B has recently been reported as a bug" because it is a bug:

  • It is gratuitously converting one data type to another. This is unexpected to a human reader of the data and permitted nowhere in the SAM spec -- imagine what might happen if it were!

  • Worse, it makes the assumption that a hexadecimal value is a string. It is not. A hexadecimal value can only be interconverted with a string when one knows the "endianness" of the numeric representation of the value; this information is not implicit in the SAM representation of an H value. For example, if an application writes the 32-bit hexadecimal value 0x80000000 to a SAM file as xx:H:80000000, the corresponding B representation is (in effect) the string "\x80\x00\x00\x00". In a little-endian computer, this string would represent the 32-bit value 0x00000080, which is incorrect.

@jmarshall
Copy link
Member

The SAM spec does not prohibit it either, and it does explicitly condone conversions between c/C/s/S/i/I so one can see how an implementation might extrapolate. Anyway, it's irrelevant on this issue.

If you feel that you want a ruling from the spec maintainers, please create a new issue. But I would suggest waiting for some resolution either way from the maintainers of the implementation in question first.

The spec describes H as a byte array, so as long as it is converted to B,c or B,C (which is what HTSJDK does), your endianness concerns are misplaced.

The reality is that B arrays have superseded H hexadecimal arrays. The H tag type is ripe for deprecation and it's unusual to still be using it in 2018.

@RWilton
Copy link

RWilton commented Aug 6, 2018

Yah, I went back through that "epic" thread where byte array semantics were discussed, and I realize now that the history of the H type is interwoven with that of the B type, i.e., it was brought into the discussion as a way of representing an arbitrary array of bytes rather than a hexadecimal value.

Oh, well. The H type would have been useful if it represented a binary value. But I agree that as an alternative method of representing a linear list of byte values, it's "ripe for deprecation." I certainly won't use it in the future.

Thanks for taking the time to respond!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants