Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Header parsing too strict? #111

Closed
sc13-bioinf opened this issue Aug 30, 2022 · 13 comments
Closed

Header parsing too strict? #111

sc13-bioinf opened this issue Aug 30, 2022 · 13 comments
Assignees
Labels

Comments

@sc13-bioinf
Copy link

I have headers with:
PL:illumina

'Could not parse header: InvalidReadGroup(InvalidPlatform(Invalid))'

A non strict mode for header parsing would be great. I really don't care about the header, just give me my data. Having to re-header a huge bam to satisfy the library is a bridge too far.

@zaeleus zaeleus added the sam label Aug 30, 2022
@zaeleus
Copy link
Owner

zaeleus commented Aug 30, 2022

In this case, noodles' SAM header parser is not overly strict. It is, however, spec-compliant.

From Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA.

This shows that the values are uppercase. There is also a test case marked invalid that's exactly your example: https://github.com/samtools/hts-specs/blob/59a0d0c8dc01b46192f6da1cec61db59c9b3d1d9/test/sam/failed/hdr.RG6.sam.

@ohofmann
Copy link

I can't believe I am arguing for this change but here we are. Of course, noodles is doing the right thing here and there is a strong case for having a toolchain that enforces standards and does not enable users to gradually drift away from the specs.

That said, like @sc13-bioinf we are sitting on about a petabyte of BAMs on S3 that we'd need to get out of the archive, re-header, and then store again if we want to make them available through noodles/htsget-rs in the future. The temptation to just stick with htslib is there. A 'lenient' option that throws warnings instead of erroring out might help drive adoption - at least when applied to flags such as PL which very few downstream methods actively leverage.

@claymcleod
Copy link
Contributor

@ohofmann out of curiosity, is your issue with the header the same as @sc13-bioinf‘s (the lowercase platform name)?

Internally, we have been brainstorming some computationally feasible ways to provide an easy on-ramp to get things back into spec (eg would not require a reheader in the traditional sense).

@claymcleod
Copy link
Contributor

claymcleod commented Aug 31, 2022

I have also thought about this quite a bit from the other perspective—I think that, philosophically, perhaps the spec and users think of rigidity in the header differently. One could argue (though I’m not yet sure if they should) that the spec should change to allow this, as it wasn’t enforced in htslib for so long. What good is it in keeping the spec so rigid if (a) the tools don’t enforce it and (b) it’s not what the majority of users use it for/there is not a good compelling reason [unless there is one? I cannot think of one off hand].

That would allow @zaeleus to keep things in spec and still support this historical data. I realize this doesn’t solve the problem of “well, technically that ship has already sailed for existing versions of the spec”.

Just thinking out loud here.

@ohofmann
Copy link

@ohofmann out of curiosity, is your issue with the header the same as @sc13-bioinf‘s (the lowercase platform name)?

Yes, weirdly enough the same issue - this is for a historical backlog of alignments generated with https://github.com/bcbio/bcbio-nextgen which seems to be populating the platform information with lower case labels though this very likely was due to a misconfiguration on our side.

@claymcleod
Copy link
Contributor

@zaeleus another question for you—do you plan to support all versions of the spec for all versions of noodles, or will releases of noodles only support the latest version of the spec at any given time (so if you want to use old versions of the spec, you’d have to use old versions of noodles).

@jkbonfield
Copy link

It's possible to reheader BAMs without decompressing and recompressing all the BAM records. Once the header is done it's basically a read/write loop. Lots of I/O, but computationally quick. In theory on some filesystems (zfs maybe?) it's possible to cut and paste files together via filesystem meta-data, so theoretically that read/write loop isn't needed due to "magic", but that's very OS/FS specific and I don't know of tools that do it.

It's a pity the BAM implementations didn't include BGZF blocks which consume disk space but produce no output. Such blocks could be consumed or added to in order to permit in-place reheader that doesn't change the downstream file and also avoids read/write loops. (CRAM can do this.)

Hmm, in theory it may also be possible to construct a Deflate block the same size as the existing one by deliberately choosing appropriate encodings. Hellishly hard to do for the general case (if not impossible), but maybe for a lowercase to uppercase translation it may be that the compressed block could end up the same size (with a variety of parameters tested). Definitely a challenge for someone!

@zaeleus
Copy link
Owner

zaeleus commented Aug 31, 2022

If the platform field value is the only blocker when reading, I would suggest preprocessing the raw SAM header before parsing, e.g., 1) just illumina or 2) generalized.

do you plan to support all versions of the spec for all versions of noodles, or will releases of noodles only support the latest version of the spec at any given time

If there are large enough divergences among the versions, version-specific parsers can be added. (This is how the VCF reader chooses to validate types and cardinalities.) The current SAM 1.6 parser seems backwards-compatible enough though, sans anything that was previously undefined.

@ghost
Copy link

ghost commented Sep 1, 2022

Hi Michael @zaeleus, I checked Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

PL: Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.

I think the last sentence is important here. Although lowercase "illumina" is invalid, it should be omitted as a unknown value, rather than raising an error. As @ohofmann suggested, providing a 'lenient' option that throws warnings won't violate the spec, if I didn't misinterpreted the document.

I also tried samtools view <(echo -e "@RG\tID:1\tPL:illumina"), no error returned.

@claymcleod
Copy link
Contributor

claymcleod commented Sep 11, 2022

Hi Michael @zaeleus, I checked Sequence Alignment/Map Format Specification (2022-08-22) § 1.3 "The header section":

PL: Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.

I think the last sentence is important here. Although lowercase "illumina" is invalid, it should be omitted as a unknown value, rather than raising an error. As @ohofmann suggested, providing a 'lenient' option that throws warnings won't violate the spec, if I didn't misinterpreted the document.

I also tried samtools view <(echo -e "@RG\tID:1\tPL:illumina"), no error returned.

I believe this actually is talking about the file itself, not the way that libraries should treat the value. So, IMO, it's still totally valid for noodles to make this choice to strictly follow the spec. For the samtools view example, I think this is the point: htslib does not correctly enforce the spec, and that's how we got into this situation (Michael gave a talk not too long ago with many other examples of surprising things that samtools lets past). Noodles tries to be different by being a spec-compliant parser.

I tried the preprocessing workflow that @zaeleus suggested, and it worked for me. It's sad that it will come to these hacks being built into command line tools, but I actually think it's a relatively good solution all said and done (particularly when compared to not introducing the hacks into the upstream libraries like noodles).

If there are large enough divergences among the versions, version-specific parsers can be added. (This is how the VCF reader chooses to validate types and cardinalities.) The current SAM 1.6 parser seems backwards-compatible enough though, sans anything that was previously undefined.

My thought here was, if you plan on only supporting the latest version of the spec for a given noodles library, then we could simply petition the spec to be changed to allow this. This would mean no one has to change their data to be compliant.

  • I don't know if I think this is the best course of action, as version-specific parsers seem more prudent to not require people to update their data to use the latest version of noodles.
  • While on the topic of the spec, I'd just be curious why this detail of "only these platforms and only all uppercase" is enforced. Perhaps there is a good reason, I just cannot think of any off the top of my head.

@drtconway
Copy link

There are obviously pros and cons in enforcing the restricted vocabulary.

I note that as things stand, you can't use the noodles library to write a program to correct well formed but non-conformant BAM files.

@kdauria
Copy link

kdauria commented Jul 22, 2023

@zaeleus zaeleus self-assigned this Jan 22, 2024
@zaeleus
Copy link
Owner

zaeleus commented Jan 25, 2024

noodles 0.61.0 / noodles-sam 0.50.0 no longer models optional header fields. Optional fields are now placed as strings in the other fields map (Map::other_fields).

Here's an example of a header that was previously invalid:

fn main() -> io::Result<()> {
    let src = b"@HD\tVN:1.6\n@RG\tID:rg0\tPL:Illumina\n";

    let mut reader = sam::io::Reader::new(&src[..]);
    let header = reader.read_header()?;

    let platform = header
        .read_groups()
        .first()
        .and_then(|(_, read_group)| read_group.other_fields().get(&tag::PLATFORM));

    dbg!(platform); // Some("Illumina")

    Ok(())
}

The specification issue with the platform field value remains unresolved: samtools/hts-specs#679.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants