Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam: Add PL value "MGIG400" as seen in the wild #718

Closed
brainstorm opened this issue Apr 28, 2023 · 5 comments
Closed

sam: Add PL value "MGIG400" as seen in the wild #718

brainstorm opened this issue Apr 28, 2023 · 5 comments

Comments

@brainstorm
Copy link
Contributor

brainstorm commented Apr 28, 2023

The BAM header below was refusing to be parsed by noodles due to the PL field (via htsget-rs), as reported by NESI. This BAM field came from this machine. Non-related fields have been manually masked/anonymized with ****:

% samtools head t001.bam | grep PL
@RG	ID:****	SM:****	PL:MGIG400
@PG	ID:bwa	PN:bwa	VN:0.7.17-r1188	CL:bwa mem -M -Y -R @RG\tID:***
\tSM:****\tPL:MGIG400 -t 24 /some/nobackup/path/DeepVariantAnalysis/resources/genome.fna data/****.fq.gz data/****.fq.gz

Interestingly, this PL field should have been DNBSEQ but the software machine outputs its model (MGIG400) instead, nowadays? Perhaps the vendor should be notified and prompted to update their tooling instead of introducing this change to the standard.

Tangentially related issue worth exploring along this one: #679

/cc @ohofmann @mmalenic @zaeleus

@brainstorm
Copy link
Contributor Author

brainstorm commented Apr 28, 2023

Nevermind, MGI doesn't seem to ship aligners. PL field most probably introduced on third party pipeline downstream erroneously.

@brainstorm brainstorm closed this as not planned Won't fix, can't repro, duplicate, stale Apr 28, 2023
@jkbonfield
Copy link
Contributor

It's also worth noting even if this was being generated by MGI and we hadn't previously decided on DNBSEQ, it would be rejected. The problem with MGIG400 is it conflates platform with model, which are two different tags.

@jkbonfield
Copy link
Contributor

We did also discuss this sort of issue in the most recent conference call.

The issue was one of validation. Having an invalid field here does not invalidate the rest of the file. Syntacically it would all make sense. The point was raised what if we check these fields and reject files that don't match, but the specification then gets updated? We haven't updated the SAM version number when we've added extra fields here as the syntax is identical, so programs cannot check that either. That's a valid point. We felt the correct process would be, if validation is performed, to make it a warning only. This fits with the point above that unknown data here does not invalidate any remainder of the file.

You could argue then what's the point of having a controlled vocabulary, as it doesn't stop people from just adding anything there anyway (as demonstrated). We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc. That's even ignoring the issue of case sensitivity. With it, well we may still get invalid fields, but hopefully it is significantly reduced. Note that this also ties in with sequence submissions as the archives have a controlled vocabularly in their schemas.

@zaeleus
Copy link

zaeleus commented Apr 28, 2023

Having an invalid field here does not invalidate the rest of the file.

A file can either be well-formed or not. I'm not sure why you would want or trust a somewhat valid file w.r.t. a specification, especially in the scientific domain.

We haven't updated the SAM version number when we've added extra fields here as the syntax is identical

Appending to a list of known values changes the syntax. PL:(CAPILLARY|DNBSEQ|etc) is not identical to PL:[ -~]+.

We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc.

In this case, the spec shouldn't define them as valid values but as suggested values.

@jmarshall
Copy link
Member

The syntax of header lines is described in the first paragraph of §1.3: the fields are TAB-separated, and the line matches the regexp shown (notwithstanding the minor UTF-8-related issue you noted elsewhere). So regardless of whatever characters are in the PL field value, there is no difficulty parsing it: there is a TAB before the PL: and the value extends to (but does not include) either end-of-line or the next TAB, whichever comes first.

The list of keywords in the PL description is a list of semantically valid values. The syntax of the header line is unchanged when this list is appended to, as doing that does not affect parsing of that line or of the rest of that file or even (generally speaking) the interpretation of the rest of the file.

Note for example that ULTIMA was added to the spec when the SAM VN version number was already 1.6. Nonetheless a SAM file that says @HD VN:1.3 // @RG … PL:ULTIMA … is a perfectly valid SAM file. This is very intentional: parsing and understanding (the remainder of) the file is unaffected, so it would be silly for it to be invalid.

We've discussed this any number of times, e.g. on #454. I don't see any particular need to relitigate it, but if we do let's do it on a new open issue rather than a closed WONTFIX based on someone mistakenly specifying MGIG400 on their bwa command line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants