Header parsing too strict? #111

sc13-bioinf · 2022-08-30T12:38:49Z

I have headers with:
PL:illumina

'Could not parse header: InvalidReadGroup(InvalidPlatform(Invalid))'

A non strict mode for header parsing would be great. I really don't care about the header, just give me my data. Having to re-header a huge bam to satisfy the library is a bridge too far.

zaeleus · 2022-08-30T15:43:53Z

In this case, noodles' SAM header parser is not overly strict. It is, however, spec-compliant.

From Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA.

This shows that the values are uppercase. There is also a test case marked invalid that's exactly your example: https://github.com/samtools/hts-specs/blob/59a0d0c8dc01b46192f6da1cec61db59c9b3d1d9/test/sam/failed/hdr.RG6.sam.

ohofmann · 2022-08-31T00:01:27Z

I can't believe I am arguing for this change but here we are. Of course, noodles is doing the right thing here and there is a strong case for having a toolchain that enforces standards and does not enable users to gradually drift away from the specs.

That said, like @sc13-bioinf we are sitting on about a petabyte of BAMs on S3 that we'd need to get out of the archive, re-header, and then store again if we want to make them available through noodles/htsget-rs in the future. The temptation to just stick with htslib is there. A 'lenient' option that throws warnings instead of erroring out might help drive adoption - at least when applied to flags such as PL which very few downstream methods actively leverage.

claymcleod · 2022-08-31T00:34:52Z

@ohofmann out of curiosity, is your issue with the header the same as @sc13-bioinf‘s (the lowercase platform name)?

Internally, we have been brainstorming some computationally feasible ways to provide an easy on-ramp to get things back into spec (eg would not require a reheader in the traditional sense).

claymcleod · 2022-08-31T00:45:27Z

I have also thought about this quite a bit from the other perspective—I think that, philosophically, perhaps the spec and users think of rigidity in the header differently. One could argue (though I’m not yet sure if they should) that the spec should change to allow this, as it wasn’t enforced in htslib for so long. What good is it in keeping the spec so rigid if (a) the tools don’t enforce it and (b) it’s not what the majority of users use it for/there is not a good compelling reason [unless there is one? I cannot think of one off hand].

That would allow @zaeleus to keep things in spec and still support this historical data. I realize this doesn’t solve the problem of “well, technically that ship has already sailed for existing versions of the spec”.

Just thinking out loud here.

ohofmann · 2022-08-31T00:48:30Z

@ohofmann out of curiosity, is your issue with the header the same as @sc13-bioinf‘s (the lowercase platform name)?

Yes, weirdly enough the same issue - this is for a historical backlog of alignments generated with https://github.com/bcbio/bcbio-nextgen which seems to be populating the platform information with lower case labels though this very likely was due to a misconfiguration on our side.

claymcleod · 2022-08-31T00:49:23Z

@zaeleus another question for you—do you plan to support all versions of the spec for all versions of noodles, or will releases of noodles only support the latest version of the spec at any given time (so if you want to use old versions of the spec, you’d have to use old versions of noodles).

jkbonfield · 2022-08-31T08:56:04Z

It's possible to reheader BAMs without decompressing and recompressing all the BAM records. Once the header is done it's basically a read/write loop. Lots of I/O, but computationally quick. In theory on some filesystems (zfs maybe?) it's possible to cut and paste files together via filesystem meta-data, so theoretically that read/write loop isn't needed due to "magic", but that's very OS/FS specific and I don't know of tools that do it.

It's a pity the BAM implementations didn't include BGZF blocks which consume disk space but produce no output. Such blocks could be consumed or added to in order to permit in-place reheader that doesn't change the downstream file and also avoids read/write loops. (CRAM can do this.)

Hmm, in theory it may also be possible to construct a Deflate block the same size as the existing one by deliberately choosing appropriate encodings. Hellishly hard to do for the general case (if not impossible), but maybe for a lowercase to uppercase translation it may be that the compressed block could end up the same size (with a variety of parameters tested). Definitely a challenge for someone!

zaeleus · 2022-08-31T20:22:56Z

If the platform field value is the only blocker when reading, I would suggest preprocessing the raw SAM header before parsing, e.g., 1) just illumina or 2) generalized.

do you plan to support all versions of the spec for all versions of noodles, or will releases of noodles only support the latest version of the spec at any given time

If there are large enough divergences among the versions, version-specific parsers can be added. (This is how the VCF reader chooses to validate types and cardinalities.) The current SAM 1.6 parser seems backwards-compatible enough though, sans anything that was previously undefined.

ghost · 2022-09-01T05:50:23Z

Hi Michael @zaeleus, I checked Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

PL: Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.

I think the last sentence is important here. Although lowercase "illumina" is invalid, it should be omitted as a unknown value, rather than raising an error. As @ohofmann suggested, providing a 'lenient' option that throws warnings won't violate the spec, if I didn't misinterpreted the document.

I also tried samtools view <(echo -e "@RG\tID:1\tPL:illumina"), no error returned.

claymcleod · 2022-09-11T01:51:25Z

Hi Michael @zaeleus, I checked Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

PL: Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.

I think the last sentence is important here. Although lowercase "illumina" is invalid, it should be omitted as a unknown value, rather than raising an error. As @ohofmann suggested, providing a 'lenient' option that throws warnings won't violate the spec, if I didn't misinterpreted the document.

I also tried samtools view <(echo -e "@RG\tID:1\tPL:illumina"), no error returned.

I believe this actually is talking about the file itself, not the way that libraries should treat the value. So, IMO, it's still totally valid for noodles to make this choice to strictly follow the spec. For the samtools view example, I think this is the point: htslib does not correctly enforce the spec, and that's how we got into this situation (Michael gave a talk not too long ago with many other examples of surprising things that samtools lets past). Noodles tries to be different by being a spec-compliant parser.

I tried the preprocessing workflow that @zaeleus suggested, and it worked for me. It's sad that it will come to these hacks being built into command line tools, but I actually think it's a relatively good solution all said and done (particularly when compared to not introducing the hacks into the upstream libraries like noodles).

If there are large enough divergences among the versions, version-specific parsers can be added. (This is how the VCF reader chooses to validate types and cardinalities.) The current SAM 1.6 parser seems backwards-compatible enough though, sans anything that was previously undefined.

My thought here was, if you plan on only supporting the latest version of the spec for a given noodles library, then we could simply petition the spec to be changed to allow this. This would mean no one has to change their data to be compliant.

I don't know if I think this is the best course of action, as version-specific parsers seem more prudent to not require people to update their data to use the latest version of noodles.
While on the topic of the spec, I'd just be curious why this detail of "only these platforms and only all uppercase" is enforced. Perhaps there is a good reason, I just cannot think of any off the top of my head.

drtconway · 2023-05-30T02:32:19Z

There are obviously pros and cons in enforcing the restricted vocabulary.

I note that as things stand, you can't use the noodles library to write a program to correct well formed but non-conformant BAM files.

kdauria · 2023-07-22T22:15:16Z

Here is one workaround: https://gist.github.com/kdauria/420ca40aaa89cfc8156f0e1793c0ebd3.

zaeleus · 2024-01-25T17:06:49Z

noodles 0.61.0 / noodles-sam 0.50.0 no longer models optional header fields. Optional fields are now placed as strings in the other fields map (Map::other_fields).

Here's an example of a header that was previously invalid:

fn main() -> io::Result<()> {
    let src = b"@HD\tVN:1.6\n@RG\tID:rg0\tPL:Illumina\n";

    let mut reader = sam::io::Reader::new(&src[..]);
    let header = reader.read_header()?;

    let platform = header
        .read_groups()
        .first()
        .and_then(|(_, read_group)| read_group.other_fields().get(&tag::PLATFORM));

    dbg!(platform); // Some("Illumina")

    Ok(())
}

The specification issue with the platform field value remains unresolved: samtools/hts-specs#679.

zaeleus added the sam label Aug 30, 2022

claymcleod mentioned this issue Oct 23, 2022

sam: Clarify case-sensitivity of read group platform values samtools/hts-specs#679

Closed

drtconway mentioned this issue May 27, 2023

Read group platform unnecessarily strict #171

Closed

J-Wall mentioned this issue Jun 9, 2023

Unable to parse BED files which have non-integer score values #175

Closed

zaeleus self-assigned this Jan 22, 2024

zaeleus closed this as completed Jan 25, 2024

mmalenic mentioned this issue Feb 28, 2024

build(deps): update noodles to 0.65 umccr/htsget-rs#229

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Header parsing too strict? #111

Header parsing too strict? #111

sc13-bioinf commented Aug 30, 2022

zaeleus commented Aug 30, 2022

ohofmann commented Aug 31, 2022

claymcleod commented Aug 31, 2022

claymcleod commented Aug 31, 2022 •

edited

Loading

ohofmann commented Aug 31, 2022

claymcleod commented Aug 31, 2022

jkbonfield commented Aug 31, 2022

zaeleus commented Aug 31, 2022

ghost commented Sep 1, 2022 •

edited by ghost

Loading

claymcleod commented Sep 11, 2022 •

edited

Loading

drtconway commented May 30, 2023

kdauria commented Jul 22, 2023

zaeleus commented Jan 25, 2024

Header parsing too strict? #111

Header parsing too strict? #111

Comments

sc13-bioinf commented Aug 30, 2022

zaeleus commented Aug 30, 2022

ohofmann commented Aug 31, 2022

claymcleod commented Aug 31, 2022

claymcleod commented Aug 31, 2022 • edited Loading

ohofmann commented Aug 31, 2022

claymcleod commented Aug 31, 2022

jkbonfield commented Aug 31, 2022

zaeleus commented Aug 31, 2022

ghost commented Sep 1, 2022 • edited by ghost Loading

claymcleod commented Sep 11, 2022 • edited Loading

drtconway commented May 30, 2023

kdauria commented Jul 22, 2023

zaeleus commented Jan 25, 2024

claymcleod commented Aug 31, 2022 •

edited

Loading

ghost commented Sep 1, 2022 •

edited by ghost

Loading

claymcleod commented Sep 11, 2022 •

edited

Loading