SAM and zero length Z, H, and B fields #135

jkbonfield · 2016-03-10T17:19:11Z

We occasionally find a tag like MC:Z: with no string after it, ie the zero length string.
This tag in BAM can be read by both samtools and picard, but in SAM both tools emit an error.

The SAM specification states Z has to take the form [ !-~]+, ie 1 or more characters, so the above MC tag is illegal.

However I can't really see any justification for it except accidental over specification. (It's always been this way though.)

Is this something we care about improving, or just accept that the tag was bugged and indeed it shouldn't be possible to specify empty strings.

The text was updated successfully, but these errors were encountered:

jmarshall · 2016-06-14T14:43:05Z

And similarly for the H tag type, perhaps?

jkbonfield · 2016-06-14T15:08:23Z

I'd think so. Thanks for reminding me I need to work on this action... after threading hell! :)

jkbonfield · 2016-06-17T08:54:39Z

Thinking more on this, H is an interesting one. It's just implemented as a string in the format, so it can go from SAM->BAM->SAM just fine.

However practically speaking, how is someone going to decode this? It's likely to be 2 chars at a time until the end of the aux tag. In which case 0123 should be XX:H:0123 and not XX:H:123. I'd like to change the regex to ([0-9A-F][0-9A-F])*, but this is a tightening of the spec and so not backwards compatible.

I wonder who uses this and do tools really deal with hex strings containing an odd number of values correctly? (By correct I mean prepending a leading 0 so the nibbles end up in the right place, incase it was generated by a naive printf with %x.)

I assume the original rationale behind H was to encode binary data or strings with tabs in. It was there prior to the addition of B, which actually encodes using bytes rather than string form and so hopefully is the method everyone now uses.

Edit: samtools is happy to read such data, but I now see Picard fails with a complaint about not having an even number of digits. It does this for both SAM and BAM even with silent validation stringency. Therefore I think we can conclude such data doesn't exist commonly in the wild or it would break so many tools. I'll add this change to my PR therefore.

It also now forces H type (hex) to be an even number of bytes. Implements samtools/hts-specs#135

supported in BAM implementations but not permitted in SAM. Also changed H to require an even number of bytes as this was (probably) an oversight and is enforced by some implementations (htsjdk, maybe more). Fixes samtools#135

jmarshall · 2016-06-17T13:47:41Z

I believe the only extant H was FZ, and it was converted to a B-array in a01f3f2. So (1) the problem is rather moot; and (2) I think it's reasonable to assume these things were aways in practice even in length, and nobody ever parsed one by getting to the end, realising they only have half a byte left, and going back and shifting everything by a nybble.

Also we can pretend that the footnote (For example, a byte array [0x1a,0xe3,0x1] corresponds to a Hex string ‘1AE301’) implies this. 😄

jkbonfield · 2016-06-17T14:00:51Z

rue we have no standard H tags, but I think H existed before FZ anyway. It may be in use for non-standard tags. As you say though, it'd require substantial effort to cope with an odd number of nibbles.

I also looked at the footnote and realised it would have been so much more useful had it have chosen 011AE3 instead of 1AE301, but never-mind. As such it doesn't really imply anything.

Incidentally, in my investigations I see this (H) is yet another SAM->BAM->SAM round-trip that fails to validate. Picard ViewSam always converts H in BAM to a B array in SAM instead. So I expect there are fewer and fewer uses of H out there in the wild given the automatic type conversion.

…ts-spec * Added a test to show that htsJDK can handle empty strings as tag values. * - fixed the text codec to enable empty strings - testing that we can code and decode into various formats - Enables the reading and writing of sam/bam/cram records with empty string tags as required by samtools/hts-specs#135

jmarshall · 2016-10-20T15:39:58Z

The SAM description of B arrays is

B	`[cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)+`	Integer or numeric array

That final + (rather than *) means that a zero-length B array is technically invalid in SAM too. (In BAM, there is an (unbiased) “little-endian 32-bit integer which gives the number of elements in the array”, for which 0 is a valid value.)

Especially as HTSJDK likes to rewrite H tags to B, we may need to relax B to allow zero-length arrays too. (I have not yet reviewed the mailing list archives to see whether this was considered back when B was introduced, or looked at what current tools allow.)

jmarshall · 2016-10-21T14:59:29Z

The introduction of B arrays was discussed in this epic samtools-devel thread, and the question of 0‑length arrays was not considered.

betteridiot · 2017-12-07T22:10:48Z

This may be a little late to the party, but for someone parsing a BAM themselves (e.g. not with samtools), how does one determine the length of type H or type Z "arrays". The B-type array explicitly says this...the others do not.

jkbonfield · 2017-12-07T22:33:49Z

They're both C-strings, so nul terminated.

You're correct that this doesn't appear to be explicitly stated anywhere. Thanks for reporting this.

Regarding this original issue, @jmarshall you closed it and then reopened it. Is this because while Z and H now permit zero length B does not? If so, is the only change needed to replace "+" with "*" in the regexp in the section 1.5 table? (Along with some cautionary checking to see if it actually works with tools and whether it is desireable)

jkbonfield · 2018-04-04T10:39:58Z

A SAM file with an aux tag of Bb:B:i, causes samtools to complain with "incomplete B-typed aux field". Picard appears to be happy with it though.

Is this still something we wish to resolve? If yes the fix is a trivial 1 character modification (+ to *). If not we can close this issue.

IMO it should be permitted and we should bug fix htslib. @yfarjoun, @jmarshall?

jmarshall · 2018-05-04T06:33:51Z

Yes, I reopened this because I think B needs to allow zero length too, as explained in my comment of the time.

There is a small question of syntax in SAM. Studying the code suggests that if asked to print an alignment record containing a zero-length array (in their in-memory representation), both HTSlib and HTSDJK would print it as Bb:B:i — which would be consistent with the trivial 1 character regexp modification.

A trailing comma like Bb:B:i, would not match the regexp. So either we could make some complicated change to the regexp to allow it; or make the simple change (so Bb:B:i is the canonical SAM representation of zero-length array) and implementations can also accept Bb:B:i, as a QoI nicety if they like (especially if they already accept it!).

I reckon the latter.

jkbonfield · 2018-05-04T08:54:01Z

Good point on not matching the regex, so a blank array should indeed be Bb:B:i.

I tried it, and sadly neither htslib nor htsjdk handle it:

#samtools
[E::sam_parse1] incomplete B-typed aux field
[W::sam_read1] Parse error at line 30
[main_samview] truncated file.

#picard
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Tag of type B should have an element type followed by comma; File _.sam; Line 30

(Neither accept with trailing comma either, which seems to dispute what I found before. Maybe that was ValidateSamFile (accepts it) vs ViewSam differences.

I'm happy to make changes to htslib to support blank B fields if the htsjdk devs want to do the same to their code?

Fixes samtools#135 (for B fields; already fixed for Z and H).

yfarjoun · 2018-06-11T17:24:43Z

Is there a use-case of an empty array as opposed to a missing tag?

Otherwise, I'm not sure what we are trying to solve here.

yfarjoun · 2018-06-11T17:24:49Z

@lbergelson

jmarshall · 2018-06-11T18:58:55Z

It's a corner case, the same as an empty string would be. This issue exists because empty strings were encountered in BAM files in the wild, despite the spec claiming they were forbidden especially in SAM. Empty strings/arrays are otherwise valid values for these items, so it seems a shame for files that happen to include them to fail because the spec over-constrains these items.

In particular, I seem to recall HTSJDK converts H to B — and empty H has been valid since PR #155, so empty B needs to be valid too.

Suppose you have an array tag that records some special event that happens to occur to a few of the bases in most of your reads, so the tag lists the indices 1…readlength of the bases in the read on which the event occurs. (Methylation, say. 😄) On some reads it happens that no bases experienced the event, and on those reads it's reasonable for the array to be present but zero length.

(As listed above, I have a commit to fix this on my misc-sam-fixes branch, which is probably overdue to become a PR.)

Permit no numeric entries in SAM B array regexp. Fixes samtools#135 (for B fields; already fixed for Z and H). Describe the BAM representation of SEQ '*'. Fixes samtools#49. Also avoid odd "odd numbered length" phrasing. Further clarify that only @HD/SQ/RG/PG/CO are valid headers. Explicitly list the header types in the regexp too. Apparently some were still focussing on the [A-Z][A-Z] regexp and ignoring the surrounding text -- see samtools-help July 2017, 'additional header lines "@ga"', <https://sourceforge.net/p/samtools/mailman/message/35933210/>. Remove CIGAR N intron/skip operator history note. The N operator was in the original SAM document; its description was merely improved in Dec 2010 (51a762b).

jmarshall · 2018-07-23T22:54:46Z

I see HTSJDK converting H to B has recently been reported as a bug: broadinstitute/picard#1199.

jmarshall · 2018-07-23T23:58:52Z

Empty B arrays are now addressed by PR #326.

Permit no numeric entries in SAM B array regexp. Fixes samtools#135 (for B fields; already fixed for Z and H). Describe the BAM representation of SEQ '*'. Fixes samtools#49. Also avoid odd "odd numbered length" phrasing. Further clarify that only @HD/SQ/RG/PG/CO are valid headers. Explicitly list the header types in the regexp too. Apparently some were still focussing on the [A-Z][A-Z] regexp and ignoring the surrounding text -- see samtools-help July 2017, 'additional header lines "@ga"', <https://sourceforge.net/p/samtools/mailman/message/35933210/>. Remove CIGAR N intron/skip operator history note. The N operator was in the original SAM document; its description was merely improved in Dec 2010 (51a762b).

RWilton · 2018-08-06T20:21:57Z

"HTSJDK converting H to B has recently been reported as a bug" because it is a bug:

It is gratuitously converting one data type to another. This is unexpected to a human reader of the data and permitted nowhere in the SAM spec -- imagine what might happen if it were!
Worse, it makes the assumption that a hexadecimal value is a string. It is not. A hexadecimal value can only be interconverted with a string when one knows the "endianness" of the numeric representation of the value; this information is not implicit in the SAM representation of an H value. For example, if an application writes the 32-bit hexadecimal value 0x80000000 to a SAM file as xx:H:80000000, the corresponding B representation is (in effect) the string "\x80\x00\x00\x00". In a little-endian computer, this string would represent the 32-bit value 0x00000080, which is incorrect.

jmarshall · 2018-08-06T21:44:04Z

The SAM spec does not prohibit it either, and it does explicitly condone conversions between c/C/s/S/i/I so one can see how an implementation might extrapolate. Anyway, it's irrelevant on this issue.

If you feel that you want a ruling from the spec maintainers, please create a new issue. But I would suggest waiting for some resolution either way from the maintainers of the implementation in question first.

The spec describes H as a byte array, so as long as it is converted to B,c or B,C (which is what HTSJDK does), your endianness concerns are misplaced.

The reality is that B arrays have superseded H hexadecimal arrays. The H tag type is ripe for deprecation and it's unusual to still be using it in 2018.

RWilton · 2018-08-06T23:34:07Z

Yah, I went back through that "epic" thread where byte array semantics were discussed, and I realize now that the history of the H type is interwoven with that of the B type, i.e., it was brought into the discussion as a way of representing an arbitrary array of bytes rather than a hexadecimal value.

Oh, well. The H type would have been useful if it represented a binary value. But I agree that as an alternative method of representing a linear list of byte values, it's "ripe for deprecation." I certainly won't use it in the future.

Thanks for taking the time to respond!

jmarshall added the sam label Mar 11, 2016

jmarshall added this to the June 2016 meeting milestone Jun 6, 2016

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jun 17, 2016

Htslib now copes with zero length Z and H aux tags.

3a8d3d3

It also now forces H type (hex) to be an even number of bytes. Implements samtools/hts-specs#135

This was referenced Jun 17, 2016

Changed Z and H aux types to handle zero-length strings. #155

Merged

Htslib now copes with zero length Z and H aux tags. samtools/htslib#392

Merged

jmarshall closed this as completed in #155 Jun 20, 2016

yfarjoun mentioned this issue Sep 16, 2016

Yf add test empty string sam tag samtools/htsjdk#708

Merged

5 tasks

jmarshall changed the title ~~SAM and zero length Z fields~~ SAM and zero length Z, H, and B fields Oct 20, 2016

jmarshall reopened this Oct 20, 2016

jmarshall mentioned this issue Oct 25, 2016

[GFA1] Regexp for Path overlaps GFA-spec/GFA-spec#45

Closed

jmarshall added a commit to jmarshall/hts-specs that referenced this issue May 8, 2018

Permit no numeric entries in SAM B array regexp

a0d3195

Fixes samtools#135 (for B fields; already fixed for Z and H).

jmarshall mentioned this issue Jul 23, 2018

Miscellaneous SAM fixes #326

Merged

jmarshall closed this as completed in #326 Jul 26, 2018

jmarshall mentioned this issue Jun 10, 2023

Can ML Tags be zero length arrays? #728

Closed

jmarshall mentioned this issue Nov 13, 2023

Define Local Alleles in VCF to allow for sparser format #434

Closed

jmarshall mentioned this issue Jun 17, 2024

Allow empty string value in optional field like SAM does GFA-spec/GFA-spec#124

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SAM and zero length Z, H, and B fields #135

SAM and zero length Z, H, and B fields #135

jkbonfield commented Mar 10, 2016

jmarshall commented Jun 14, 2016

jkbonfield commented Jun 14, 2016

jkbonfield commented Jun 17, 2016 •

edited

Loading

jmarshall commented Jun 17, 2016

jkbonfield commented Jun 17, 2016

jmarshall commented Oct 20, 2016

jmarshall commented Oct 21, 2016

betteridiot commented Dec 7, 2017

jkbonfield commented Dec 7, 2017

jkbonfield commented Apr 4, 2018

jmarshall commented May 4, 2018

jkbonfield commented May 4, 2018

yfarjoun commented Jun 11, 2018

yfarjoun commented Jun 11, 2018

jmarshall commented Jun 11, 2018

jmarshall commented Jul 23, 2018

jmarshall commented Jul 23, 2018

RWilton commented Aug 6, 2018 •

edited

Loading

jmarshall commented Aug 6, 2018

RWilton commented Aug 6, 2018

SAM and zero length Z, H, and B fields #135

SAM and zero length Z, H, and B fields #135

Comments

jkbonfield commented Mar 10, 2016

jmarshall commented Jun 14, 2016

jkbonfield commented Jun 14, 2016

jkbonfield commented Jun 17, 2016 • edited Loading

jmarshall commented Jun 17, 2016

jkbonfield commented Jun 17, 2016

jmarshall commented Oct 20, 2016

jmarshall commented Oct 21, 2016

betteridiot commented Dec 7, 2017

jkbonfield commented Dec 7, 2017

jkbonfield commented Apr 4, 2018

jmarshall commented May 4, 2018

jkbonfield commented May 4, 2018

yfarjoun commented Jun 11, 2018

yfarjoun commented Jun 11, 2018

jmarshall commented Jun 11, 2018

jmarshall commented Jul 23, 2018

jmarshall commented Jul 23, 2018

RWilton commented Aug 6, 2018 • edited Loading

jmarshall commented Aug 6, 2018

RWilton commented Aug 6, 2018

jkbonfield commented Jun 17, 2016 •

edited

Loading

RWilton commented Aug 6, 2018 •

edited

Loading