-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Header parsing too strict? #111
Comments
In this case, noodles' SAM header parser is not overly strict. It is, however, spec-compliant. From Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":
This shows that the values are uppercase. There is also a test case marked invalid that's exactly your example: https://github.com/samtools/hts-specs/blob/59a0d0c8dc01b46192f6da1cec61db59c9b3d1d9/test/sam/failed/hdr.RG6.sam. |
I can't believe I am arguing for this change but here we are. Of course, noodles is doing the right thing here and there is a strong case for having a toolchain that enforces standards and does not enable users to gradually drift away from the specs. That said, like @sc13-bioinf we are sitting on about a petabyte of BAMs on S3 that we'd need to get out of the archive, re-header, and then store again if we want to make them available through noodles/htsget-rs in the future. The temptation to just stick with htslib is there. A 'lenient' option that throws warnings instead of erroring out might help drive adoption - at least when applied to flags such as |
@ohofmann out of curiosity, is your issue with the header the same as @sc13-bioinf‘s (the lowercase platform name)? Internally, we have been brainstorming some computationally feasible ways to provide an easy on-ramp to get things back into spec (eg would not require a reheader in the traditional sense). |
I have also thought about this quite a bit from the other perspective—I think that, philosophically, perhaps the spec and users think of rigidity in the header differently. One could argue (though I’m not yet sure if they should) that the spec should change to allow this, as it wasn’t enforced in htslib for so long. What good is it in keeping the spec so rigid if (a) the tools don’t enforce it and (b) it’s not what the majority of users use it for/there is not a good compelling reason [unless there is one? I cannot think of one off hand]. That would allow @zaeleus to keep things in spec and still support this historical data. I realize this doesn’t solve the problem of “well, technically that ship has already sailed for existing versions of the spec”. Just thinking out loud here. |
Yes, weirdly enough the same issue - this is for a historical backlog of alignments generated with https://github.com/bcbio/bcbio-nextgen which seems to be populating the platform information with lower case labels though this very likely was due to a misconfiguration on our side. |
@zaeleus another question for you—do you plan to support all versions of the spec for all versions of noodles, or will releases of noodles only support the latest version of the spec at any given time (so if you want to use old versions of the spec, you’d have to use old versions of noodles). |
It's possible to reheader BAMs without decompressing and recompressing all the BAM records. Once the header is done it's basically a read/write loop. Lots of I/O, but computationally quick. In theory on some filesystems (zfs maybe?) it's possible to cut and paste files together via filesystem meta-data, so theoretically that read/write loop isn't needed due to "magic", but that's very OS/FS specific and I don't know of tools that do it. It's a pity the BAM implementations didn't include BGZF blocks which consume disk space but produce no output. Such blocks could be consumed or added to in order to permit in-place reheader that doesn't change the downstream file and also avoids read/write loops. (CRAM can do this.) Hmm, in theory it may also be possible to construct a Deflate block the same size as the existing one by deliberately choosing appropriate encodings. Hellishly hard to do for the general case (if not impossible), but maybe for a lowercase to uppercase translation it may be that the compressed block could end up the same size (with a variety of parameters tested). Definitely a challenge for someone! |
If the platform field value is the only blocker when reading, I would suggest preprocessing the raw SAM header before parsing, e.g., 1) just
If there are large enough divergences among the versions, version-specific parsers can be added. (This is how the VCF reader chooses to validate types and cardinalities.) The current SAM 1.6 parser seems backwards-compatible enough though, sans anything that was previously undefined. |
Hi Michael @zaeleus, I checked Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":
I think the last sentence is important here. Although lowercase "illumina" is invalid, it should be omitted as a unknown value, rather than raising an error. As @ohofmann suggested, providing a 'lenient' option that throws warnings won't violate the spec, if I didn't misinterpreted the document. I also tried |
I believe this actually is talking about the file itself, not the way that libraries should treat the value. So, IMO, it's still totally valid for noodles to make this choice to strictly follow the spec. For the I tried the preprocessing workflow that @zaeleus suggested, and it worked for me. It's sad that it will come to these hacks being built into command line tools, but I actually think it's a relatively good solution all said and done (particularly when compared to not introducing the hacks into the upstream libraries like noodles).
My thought here was, if you plan on only supporting the latest version of the spec for a given noodles library, then we could simply petition the spec to be changed to allow this. This would mean no one has to change their data to be compliant.
|
There are obviously pros and cons in enforcing the restricted vocabulary. I note that as things stand, you can't use the noodles library to write a program to correct well formed but non-conformant BAM files. |
Here is one workaround: https://gist.github.com/kdauria/420ca40aaa89cfc8156f0e1793c0ebd3. |
noodles 0.61.0 / noodles-sam 0.50.0 no longer models optional header fields. Optional fields are now placed as strings in the other fields map ( Here's an example of a header that was previously invalid: fn main() -> io::Result<()> {
let src = b"@HD\tVN:1.6\n@RG\tID:rg0\tPL:Illumina\n";
let mut reader = sam::io::Reader::new(&src[..]);
let header = reader.read_header()?;
let platform = header
.read_groups()
.first()
.and_then(|(_, read_group)| read_group.other_fields().get(&tag::PLATFORM));
dbg!(platform); // Some("Illumina")
Ok(())
} The specification issue with the platform field value remains unresolved: samtools/hts-specs#679. |
I have headers with:
PL:illumina
'Could not parse header: InvalidReadGroup(InvalidPlatform(Invalid))'
A non strict mode for header parsing would be great. I really don't care about the header, just give me my data. Having to re-header a huge bam to satisfy the library is a bridge too far.
The text was updated successfully, but these errors were encountered: