Added BGISEQ as legal value for the PL tag #454

yfarjoun · 2019-10-17T21:50:09Z

in the @rg SAM header line.

fixes #178

hts-specs-bot · 2019-10-17T21:51:55Z

Changed PDFs as of d0170ac: SAMv1 (diff).

nh13 · 2019-10-17T22:35:39Z

Can we not have “other” and just not specify the tag?

yfarjoun · 2019-10-18T02:21:17Z

In the call today there was a discussion discussion about this. The point here is to distinguish between not saying anything (like "unknown") and saying "other" which is stating that it is not one of the preceding values.

You may disagree on the merits of being able to distinguish between these two states....

colinhercus · 2019-10-18T03:20:35Z

How about adding MGI as well. See https://en.mgitech.cn/

nh13 · 2019-10-18T03:59:56Z

@yfarjoun I like it, how about adding that if the platform is unknown, the tag should not be other?

jmarshall · 2019-10-18T08:46:40Z

Re BGI vs BGISEQ: the existing list of values mostly reflects underlying technologies rather than manufacturers (though it's a bit of a mishmash), so I'd prefer going with what was requested — but that request was three years ago of unknown provenance, so hopefully a knowledgeable source may add further information to #178.

The list of values is currently more or less ordered by vintage/dominance rather than alphabetical. So if we're adding this new entry, either add it at the end or fully alphabetise the list.

jmarshall · 2019-10-18T09:18:25Z

My problem with OTHER is that it's not something that anyone would ever choose to write in this field. “I know I'm using a Frobnitz-500, so why can't I put Frobnitz there?”

We list a well-known vocabulary in the spec so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc. That doesn't mean that this needs to be a fully-controlled vocabulary or that it's useful to validate that the PL value is one of those listed.

What if instead we open this up via the usual lowercase value mechanism:

Value may be one of CAPILLARY, […], PACBIO, or a locally-defined lowercase keyword matching /[a-z0-9]+/ [footnote: Contact the spec maintainers to add uppercase versions of new values that may be of general interest]. If the platform is unknown, the PL tag should be omitted.

Possibly the last sentence can be left out (as without the temptation of OTHER it should be obvious), as that's taken for granted for all these fields.

(This would also solve the BGI/MGI problem (FYI @colinhercus reports on #178 that they're actually the same thing): we could leave it off the list until further demand produces another request.)

colinhercus · 2019-10-18T09:46:28Z

@jmarshall, From my experience HiSeq and NextSeq have very different sequencing error models so I'm not sure how helpful PL:ILLUMINA is!

jmarshall · 2019-10-18T09:53:40Z

@colinhercus: Indeed. That's the historical reason for having a well-known PL but free-form PM, but it already broke down when we didn't differentiate 2-colour Illumina, alas — so you get to fuzzy match NextSeq/NovaSeq/etc in PM after all…

jkbonfield · 2019-10-18T10:15:58Z

I could go with upper/lowercase namespacing as an alternative to OTHER.

yfarjoun · 2019-10-18T13:40:56Z

@jmarshall as long as we only allow certain values, there's merit in allowing "OTHER" for exactly the case where there's no valid value, but there's another reason: there are nascent technologies that are still stealthy enough that they would not want to declare what the technology is, but for the purposes of collaboration would like to say something in the PL field...

jmarshall · 2019-10-18T14:25:31Z

By “for the purposes of collaboration” do you mean that Picard in fact does not allow you to omit PL to indicate that the platform is unknown or undisclosed?

Well…

yfarjoun · 2019-11-03T14:00:09Z

@jmarshall what's the point of specifying in the spec what the VALID VALUES are, and then allowing sam files that have other values there?

yfarjoun · 2019-11-03T14:01:18Z

@jkbonfield for some historical reason, htsjdk ignores case when comparing to the allowed values...so lower-casing isn't going to help...

hts-specs-bot · 2019-11-14T11:59:48Z

Changed PDFs as of f651eec: SAMv1 (diff).

jkbonfield · 2019-11-14T14:20:30Z

@jmarshall what's the point of specifying in the spec what the VALID VALUES are, and then allowing sam files that have other values there?

It at least means people don't invent other synonymous valid values when they see a term already in use.

I still think there is merit in uppercase vs lowercase. I'm sure htsjdk could be modified to not complain about lowercase ones when it's doing the validation? I don't think we need to worry about case insensitivity (infact it'd be a bonus to have it) as we're not proposing having both upper and lowercase tags with the same text meaning different things. It's simply a way of pointing out to people that it's not yet "blessed" in the official dictionary.

jkbonfield · 2019-11-14T15:12:21Z

How about adding MGI as well. See https://en.mgitech.cn/

MGI is a subsidiary of BGI. The "PL"atform field is an unfortunate mishmash of all sorts and it's a bit vague where a platform finishes and a model starts. For example NovaSeq is a model and ILLUMINA is the platform despite it also being the company name. Conversely SOLID is a platform, not ABI which was the company involved, perhaps because ABI were also ubiquitous with earlier sequencing platforms too (eg CAPILLARY based Sanger sequencing).

For MGI vs BGI, I can see benefits either way, but we only want one and not both as they're basically the same instrument rebranded. If they're already starting to be populated in the wild then we should bless that one in this spec.

jmarshall · 2019-11-14T15:35:12Z

In fact, extant HTSJDK validates that PL is present and is one of the existing known values (see samtools/htsjdk@db13cd3) and since samtools/htsjdk@1e6d054 does so case-insensitively (as an unexplained extension over the spec). So in fact HTSJDK is already accepting values outwith the spec.

As @jkbonfield replied and as I said in #454 (comment), the point of specifying the WELL-KNOWN VALUES but nevertheless allowing other values is that tools can depend on SOLID implying that the reads have Solid properties and similarly for other well-known values, but allowing other things to be specified. However I take your point that the world is full of bears of little brain who may miss this point and start writing solid or solidish or illuminaseq.

It is unfortunate that HTSJDK has allowed lowercase PL values equivalently to the well-known uppercase ones, but this does not itself prevent extending the value space to allow locally-defined lowercase values.

I remain opposed to adding OTHER as it is a meaningless additional defined value. To my mind, we have two sensible approaches to choose between:

(The status quo option.) Keep the list of valid values as is (adding BGI/BGISEQ/MGI/DNBSEQ separately as appropriate, as per Add 'BGISEQ' as a valid value of PL in @RG header record #178), and explicitly allow the omission of PL and spell out what that indicates:

Platform/technology used to produce the reads. Valid values: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO. Omitted for platforms not in this list (in which case PM can be used to indicate the platform [to human readers]) or when the platform is unknown.

This would require HTSJDK changes to allow PL to be omitted, but there was never any justification in the spec for validating its presence anyway. (Can any Broadites look up PO-326's justification for validating PL presence in all cases? Presumably it's another case of “it should be present in our usual sequencing protocols, hence enforce”.)

(The complicated extensible option.) Valid values are the existing uppercase list or locally-defined lowercase keywords, as suggested in Added BGISEQ as legal value for the PL tag #454 (comment). This has the disadvantage of complexity compared to the very simple unambiguous status quo of “if the reads are known to be one of these well-known technologies, tools can easily detect this by the exact keyword in PL”.

Me, I now favour (1). It's basically equivalent to OTHER in its effect on implementations: either HTSJDK needs modification to allow OTHER or HTSJDK needs modification to allow omitted PL.

(1) has the advantage of not adding a meaningless keyword value to the spec, while still distinguishing between the unknown and other states (cf #454 (comment)) via PM.

jmarshall · 2019-11-14T15:43:59Z

Re #454 (comment), we should also discuss belatedly adding ILLUMINA:2COLOR (and perhaps ILLUMINA:4COLOR) or similar alongside the post-nextseq ambiguous ILLUMINA, so that centres can move to something that specifies which of the colour chemistry error models applies.

yfarjoun · 2019-11-14T16:38:45Z

from call: split into two PRs, OTHER and BGISEQ

daviesrob · 2019-11-14T16:58:02Z

A list of platforms and instrument models can be found in ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd (part of the EBI's specification for XML metadata files that accompany data uploaded to the ENA). Look for <xs:complexType name="PlatformType"> to find the list of platforms. Model names start from <xs:simpleType name="type454Model">.

jkbonfield · 2019-11-14T17:20:05Z

A bit of a diversion. IMO htsjdk/picard is perfectly entitled to emit warnings about fields it would prefer to have, and perhaps even have a mode where those warnings become hard failure (like -Werror on compiler), but the default really should be to accept any legal file without failure. We certainly shouldn't be considering the behaviour of such tools when writing the spec.

Regarding this PR though, as @yfarjoun said above we'll remove OTHER for now and have that discussion separate, but the remaining issue is BGI vs MGI. If anyone in the know has any key insight to this now is the time to raise it. (We don't want both.)

jkbonfield · 2019-11-14T17:23:31Z

A list of platforms and instrument models can be found in ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd (part of the EBI's specification for XML metadata files that accompany data uploaded to the ENA). Look for <xs:complexType name="PlatformType"> to find the list of platforms. Model names start from <xs:simpleType name="type454Model">.

Thanks Rob. That's useful and it's good to see the PL fields at least match up so we haven't forgotten something.

The list of choices for PM look quite thorough. If we wish to add this to our spec it should probably be an appendix (unless it doesn't look too long as a footnote) and cite the source so any revisions can be discovered in case we're laggy on updating it.

lbergelson · 2019-11-15T14:52:23Z

It looks like that list isn't entirely exhaustive, it looks like it's missing most BGI / MGI models.

kbergin · 2019-11-20T22:29:34Z

Hi! I wanted to mention another option that could be considered for the PM field controlled vocabulary list. There exists sequencing platform standard terminology in EBI's OBI ontology here https://www.ebi.ac.uk/ols/ontologies/obi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FOBI_0400103&viewMode=All&siblings=false

This is a public and controlled set of ontologies, and offers both specific names and IDs. New ontology fields can be requested as needed.

The ENCODE project uses the same Ontology for their metadata, for example here: https://www.encodeproject.org/files/ENCFF432YLJ/?format=json.

One downside is the naming seems to duplicate the PL field a bit. i.e. "Illumina HiSeq 2500"

jkbonfield · 2019-11-21T14:54:27Z

That looks very out of date. Maybe it should be populated with the information from the other EBI ftp link @daviesrob linked above.

Eg there's no ONT, no capillary sequencing instruments, only has 1 of the 4 PacBios, no BGI/MGI, etc.

I think we'd be better linking with the ENA ontology as they have a much more vested interest in keeping theirs up to date - they cannot accept data submissions without doing so.

raskoleinonen · 2019-11-21T15:12:18Z

ENA shares accepted instrument models with the other two INSDC members: NCBI and DDBJ. As has been pointed our previously in the thread, they can be extracted from the SRA.common.xsd that is hosted, among other possible locations, here:

ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd

PlatformType has all supported sequencing platforms. Instrument model enumerations are in the INSTRUMENT_MODEL element under each platform.

During submissions processing ENA also supports a number of synonyms for some of the instrument models, e.g. we could support both 'NextSeq 500' and 'Illumina NextSeq 500'. We would then map synonyms to the XML Schema enumerations.

I agree that it would be great to have these as an ontology somewhere to link to. If someone connects me with the right people I could maintain the list of instruments in an ontology as they get added to the SRA schema.

It looks like that list isn't entirely exhaustive, it looks like it's missing most BGI / MGI models.

This is also true and the reason is that the INSDC has not been asked by submitters yet to support depositions from these models. Normally, INSDC adds instruments as needed by submitters, but we could look at changing this process slightly and add the instruments first in a ontology and then later, as needed, add them to the SRA.common.xsd.

kbergin · 2019-11-21T16:31:06Z

Ah that's a great point - I'm definitely not an expert in this, I thought it'd be great to leverage an existing ontology, but I am reading that the main difference between ontology and controlled vocabulary is the relationships between terms, the management of the list, and how the information is expressed i.e. ontologies being in an ontology representation language. It's possible a representation language is not useful here?

I missed that the ones in the SRA field were pulled from an ontology, the person I asked at EBI about which ontology is used for SRA said that SRA uses a controlled vocabulary and that "INSDC works in a slightly different way for these things". She did say she's happy to have a call to explain more, but I didn't follow up.

In the end the main win is if there is another group who's job it already is to keep the list of terms up to date so this community doesn't have to, right?

hts-specs-bot · 2020-03-05T15:27:57Z

Changed PDFs as of 786f5e0: SAMv1 (diff).

yfarjoun · 2020-03-05T15:28:26Z

hmmm. I actually prefer technology names over company names...companies change (think solexa, and pacbio almost) but the technology stays the same...It's unfortunate that we have so many company names as valid values in PL, but we can start putting in the "correct" values as new technologies emerge, rather than simply asserting that everything from illumina is effectively the same....

jkbonfield · 2020-03-05T15:28:23Z

SAMv1.tex

-  & {\tt PM} & Platform model. Free-form text providing further details of the
-  platform/technology used.\\\cline{2-3}
+  & {\tt PL} & Platform/technology used to produce the reads. \emph{Valid values are}:
+ {\tt CAPILLARY},  {\tt DNB} (MGI/BGI's DNA NanoBalls), {\tt HELICOS}, {\tt ILLUMINA} (can be used for either 4- or 2-color chemistry, patterned or unpatterned flowcells), {\tt IONTORRENT}, {\tt LS454}, {\tt ONT} (Oxford Nanopore), {\tt PACBIO} (Pacific Biotechnology), {\tt SOLID}.


"DNB" is very nonspecific. MGI I think acquired this from Complete Genomics, but the instruments are very different. It's more like a data prep thing, which we tend to not describe here. Eg rolling circle amplification, mate pair vs paired-end sequencing, etc. I think it's not a good term for the PLatform field.

I also dislike all the extra verbiage which IMO isn't necessary here.

yfarjoun · 2020-03-05T15:30:38Z

I went with DNB, since it seems silly to add "SEQ" to all the SEQuencing platforms...

jkbonfield · 2020-03-05T15:30:48Z

hmmm. I actually prefer technology names over company names...companies change (think solexa, and pacbio almost) but the technology stays the same...It's unfortunate that we have so many company names as valid values in PL, but we can start putting in the "correct" values as new technologies emerge, rather than simply asserting that everything from illumina is effectively the same....

Maybe, but technology is shared across multiple vendors and often coupled with very different instruments. DNB being an example of this. Also what would you do if another Nanopore company came into the market? ONT defines the instrument, but if we'd labelled them as a generic "Nanopore" it would then be confusing and misleading.

Changed since approval.

yfarjoun · 2020-03-05T15:32:16Z

It never ceases to amaze me how difficult it is to keep everyone happy :-)

let's duke it out in the call.....

hts-specs-bot · 2020-03-05T15:32:57Z

Changed PDFs as of 00af2ba: SAMv1 (diff).

hts-specs-bot · 2020-03-05T16:30:50Z

Changed PDFs as of bff98cc: SAMv1 (diff).

jkbonfield · 2020-03-05T17:12:10Z

A quick summary from conference call, incase anyone is watching.

We agreed DNBSEQ would work. It's the name of their whole suite of instruments so is close enough to "platform", and it's likely to be specific enough to that company that it's not going to be reused by anyone else (we never had a platform field for Complete Genomics anyway). It also means if MGI comes up with a radically different technology then we haven't painted ourselves into a corner. (Using the company name only was probably an error in the past and doesn't justify continued repetition.)

To that ilk there was some discussion on Illumina 2-colour vs 4-colour and whether they should be separated out, and also where is appropriate to note chemistry specific mods (eg PCR-free vs with PCR on the same instrument/model), but that'll be a topic for a future PR and shouldn't hold this up.

jmarshall · 2020-04-02T16:44:32Z

I've taken the liberty of pushing a suggested minor typographic commit with fixes for the two remaining comments on bff98cc. It also adds an entry to the version history appendix.

👍 from me for squash'n'merge with these minor fixes or an equivalent. Thanks for persevering and finally dragging this over the finish line… 😄

Applied previous suggestions: Just "Valid values:" like the other field descriptions; s/tag/field and omit misplaced comma. Also add a history item.

hts-specs-bot · 2020-04-02T16:48:08Z

Changed PDFs as of fca16bb: SAMv1 (diff).

yfarjoun · 2020-04-02T18:03:10Z

@jkbonfield your thumb?

jkbonfield · 2020-04-27T13:00:14Z

@jkbonfield your thumb?

Apologies for missing that notification. Yes it gets the thumbs up for me. Thanks for the changes.

yfarjoun · 2020-04-30T11:50:37Z

🎉

splaisan · 2020-04-30T12:40:08Z

Thanks a lot @yfarjoun, great fix, will it be in the next release and when is this planned? (I prefer not to install the dev versions)

jmarshall added the sam label Oct 18, 2019

jmarshall mentioned this pull request Dec 12, 2019

Clarify that RG-PL is omitted for unknown/other platforms (alternative to PL:OTHER) #461

Closed

added explanitory text to some platforms

786f5e0

yfarjoun force-pushed the yf_allow_new_platforms branch from 5ed1e1b to 786f5e0 Compare March 5, 2020 15:25

jkbonfield reviewed Mar 5, 2020

View reviewed changes

added 'and,'

00af2ba

added 'tighten

bff98cc

[SQUASH ME] Suggested updates for your consideration

fca16bb

Applied previous suggestions: Just "Valid values:" like the other field descriptions; s/tag/field and omit misplaced comma. Also add a history item.

jmarshall force-pushed the yf_allow_new_platforms branch from 4361524 to fca16bb Compare April 2, 2020 16:45

samtools deleted a comment from hts-specs-bot Apr 3, 2020

yfarjoun changed the title ~~Added BGI and OTHER as legal values for the PL tag~~ Added BGISEQ as legal value for the PL tag Apr 30, 2020

yfarjoun merged commit 82e60fc into samtools:master Apr 30, 2020

yfarjoun deleted the yf_allow_new_platforms branch April 30, 2020 11:50

jmarshall mentioned this pull request May 31, 2022

Adding Ultima Genomics to the list of allowed SAM platforms #648

Closed

jmarshall mentioned this pull request Oct 31, 2022

sam: Clarify case-sensitivity of read group platform values #679

Open

jmarshall mentioned this pull request Apr 29, 2023

sam: Add PL value "MGIG400" as seen in the wild #718

Closed

jmarshall mentioned this pull request May 10, 2023

sam: Consider more conventional format versioning #725

Open

Added BGISEQ as legal value for the PL tag #454

Added BGISEQ as legal value for the PL tag #454

Conversation

yfarjoun commented Oct 17, 2019

hts-specs-bot commented Oct 17, 2019

nh13 commented Oct 17, 2019

yfarjoun commented Oct 18, 2019

colinhercus commented Oct 18, 2019

nh13 commented Oct 18, 2019

jmarshall commented Oct 18, 2019

jmarshall commented Oct 18, 2019

colinhercus commented Oct 18, 2019

jmarshall commented Oct 18, 2019 • edited

jkbonfield commented Oct 18, 2019

yfarjoun commented Oct 18, 2019

jmarshall commented Oct 18, 2019

yfarjoun commented Nov 3, 2019

yfarjoun commented Nov 3, 2019

hts-specs-bot commented Nov 14, 2019

jkbonfield commented Nov 14, 2019

jkbonfield commented Nov 14, 2019 • edited

jmarshall commented Nov 14, 2019 • edited

jmarshall commented Nov 14, 2019 • edited

yfarjoun commented Nov 14, 2019

daviesrob commented Nov 14, 2019

jkbonfield commented Nov 14, 2019 • edited

jkbonfield commented Nov 14, 2019

lbergelson commented Nov 15, 2019

kbergin commented Nov 20, 2019

jkbonfield commented Nov 21, 2019

raskoleinonen commented Nov 21, 2019

kbergin commented Nov 21, 2019

hts-specs-bot commented Mar 5, 2020

yfarjoun commented Mar 5, 2020

jkbonfield Mar 5, 2020

Choose a reason for hiding this comment

yfarjoun commented Mar 5, 2020

jkbonfield commented Mar 5, 2020

yfarjoun commented Mar 5, 2020

hts-specs-bot commented Mar 5, 2020

hts-specs-bot commented Mar 5, 2020

jkbonfield commented Mar 5, 2020

jmarshall commented Apr 2, 2020

hts-specs-bot commented Apr 2, 2020

yfarjoun commented Apr 2, 2020

jkbonfield commented Apr 27, 2020

yfarjoun commented Apr 30, 2020

splaisan commented Apr 30, 2020

jmarshall commented Oct 18, 2019 •

edited

jkbonfield commented Nov 14, 2019 •

edited

jmarshall commented Nov 14, 2019 •

edited

jmarshall commented Nov 14, 2019 •

edited

jkbonfield commented Nov 14, 2019 •

edited