Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added BGISEQ as legal value for the PL tag #454

Merged
merged 8 commits into from
Apr 30, 2020

Conversation

yfarjoun
Copy link
Contributor

in the @rg SAM header line.

fixes #178

@hts-specs-bot
Copy link

Changed PDFs as of d0170ac: SAMv1 (diff).

@nh13
Copy link
Member

nh13 commented Oct 17, 2019

Can we not have “other” and just not specify the tag?

@yfarjoun
Copy link
Contributor Author

In the call today there was a discussion discussion about this. The point here is to distinguish between not saying anything (like "unknown") and saying "other" which is stating that it is not one of the preceding values.

You may disagree on the merits of being able to distinguish between these two states....

@colinhercus
Copy link

How about adding MGI as well. See https://en.mgitech.cn/

@nh13
Copy link
Member

nh13 commented Oct 18, 2019

@yfarjoun I like it, how about adding that if the platform is unknown, the tag should not be other?

@jmarshall
Copy link
Member

Re BGI vs BGISEQ: the existing list of values mostly reflects underlying technologies rather than manufacturers (though it's a bit of a mishmash), so I'd prefer going with what was requested — but that request was three years ago of unknown provenance, so hopefully a knowledgeable source may add further information to #178.

The list of values is currently more or less ordered by vintage/dominance rather than alphabetical. So if we're adding this new entry, either add it at the end or fully alphabetise the list.

@jmarshall
Copy link
Member

My problem with OTHER is that it's not something that anyone would ever choose to write in this field. “I know I'm using a Frobnitz-500, so why can't I put Frobnitz there?”

We list a well-known vocabulary in the spec so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc. That doesn't mean that this needs to be a fully-controlled vocabulary or that it's useful to validate that the PL value is one of those listed.

What if instead we open this up via the usual lowercase value mechanism:

Value may be one of CAPILLARY, […], PACBIO, or a locally-defined lowercase keyword matching /[a-z0-9]+/ [footnote: Contact the spec maintainers to add uppercase versions of new values that may be of general interest]. If the platform is unknown, the PL tag should be omitted.

Possibly the last sentence can be left out (as without the temptation of OTHER it should be obvious), as that's taken for granted for all these fields.

(This would also solve the BGI/MGI problem (FYI @colinhercus reports on #178 that they're actually the same thing): we could leave it off the list until further demand produces another request.)

@jmarshall jmarshall added the sam label Oct 18, 2019
@colinhercus
Copy link

@jmarshall, From my experience HiSeq and NextSeq have very different sequencing error models so I'm not sure how helpful PL:ILLUMINA is!

@jmarshall
Copy link
Member

jmarshall commented Oct 18, 2019

@colinhercus: Indeed. That's the historical reason for having a well-known PL but free-form PM, but it already broke down when we didn't differentiate 2-colour Illumina, alas — so you get to fuzzy match NextSeq/NovaSeq/etc in PM after all…

@jkbonfield
Copy link
Contributor

I could go with upper/lowercase namespacing as an alternative to OTHER.

@yfarjoun
Copy link
Contributor Author

@jmarshall as long as we only allow certain values, there's merit in allowing "OTHER" for exactly the case where there's no valid value, but there's another reason: there are nascent technologies that are still stealthy enough that they would not want to declare what the technology is, but for the purposes of collaboration would like to say something in the PL field...

@jmarshall
Copy link
Member

By “for the purposes of collaboration” do you mean that Picard in fact does not allow you to omit PL to indicate that the platform is unknown or undisclosed?

Well…

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Nov 3, 2019

@jmarshall what's the point of specifying in the spec what the VALID VALUES are, and then allowing sam files that have other values there?

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Nov 3, 2019

@jkbonfield for some historical reason, htsjdk ignores case when comparing to the allowed values...so lower-casing isn't going to help...

@hts-specs-bot
Copy link

Changed PDFs as of f651eec: SAMv1 (diff).

@jkbonfield
Copy link
Contributor

@jmarshall what's the point of specifying in the spec what the VALID VALUES are, and then allowing sam files that have other values there?

It at least means people don't invent other synonymous valid values when they see a term already in use.

I still think there is merit in uppercase vs lowercase. I'm sure htsjdk could be modified to not complain about lowercase ones when it's doing the validation? I don't think we need to worry about case insensitivity (infact it'd be a bonus to have it) as we're not proposing having both upper and lowercase tags with the same text meaning different things. It's simply a way of pointing out to people that it's not yet "blessed" in the official dictionary.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 14, 2019

How about adding MGI as well. See https://en.mgitech.cn/

MGI is a subsidiary of BGI. The "PL"atform field is an unfortunate mishmash of all sorts and it's a bit vague where a platform finishes and a model starts. For example NovaSeq is a model and ILLUMINA is the platform despite it also being the company name. Conversely SOLID is a platform, not ABI which was the company involved, perhaps because ABI were also ubiquitous with earlier sequencing platforms too (eg CAPILLARY based Sanger sequencing).

For MGI vs BGI, I can see benefits either way, but we only want one and not both as they're basically the same instrument rebranded. If they're already starting to be populated in the wild then we should bless that one in this spec.

@jmarshall
Copy link
Member

jmarshall commented Nov 14, 2019

In fact, extant HTSJDK validates that PL is present and is one of the existing known values (see samtools/htsjdk@db13cd3) and since samtools/htsjdk@1e6d054 does so case-insensitively (as an unexplained extension over the spec). So in fact HTSJDK is already accepting values outwith the spec.

As @jkbonfield replied and as I said in #454 (comment), the point of specifying the WELL-KNOWN VALUES but nevertheless allowing other values is that tools can depend on SOLID implying that the reads have Solid properties and similarly for other well-known values, but allowing other things to be specified. However I take your point that the world is full of bears of little brain who may miss this point and start writing solid or solidish or illuminaseq.

It is unfortunate that HTSJDK has allowed lowercase PL values equivalently to the well-known uppercase ones, but this does not itself prevent extending the value space to allow locally-defined lowercase values.

I remain opposed to adding OTHER as it is a meaningless additional defined value. To my mind, we have two sensible approaches to choose between:

  1. (The status quo option.) Keep the list of valid values as is (adding BGI/BGISEQ/MGI/DNBSEQ separately as appropriate, as per Add 'BGISEQ' as a valid value of PL in @RG header record #178), and explicitly allow the omission of PL and spell out what that indicates:

Platform/technology used to produce the reads. Valid values: CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, and PACBIO. Omitted for platforms not in this list (in which case PM can be used to indicate the platform [to human readers]) or when the platform is unknown.

This would require HTSJDK changes to allow PL to be omitted, but there was never any justification in the spec for validating its presence anyway. (Can any Broadites look up PO-326's justification for validating PL presence in all cases? Presumably it's another case of “it should be present in our usual sequencing protocols, hence enforce”.)

  1. (The complicated extensible option.) Valid values are the existing uppercase list or locally-defined lowercase keywords, as suggested in Added BGISEQ as legal value for the PL tag #454 (comment). This has the disadvantage of complexity compared to the very simple unambiguous status quo of “if the reads are known to be one of these well-known technologies, tools can easily detect this by the exact keyword in PL”.

Me, I now favour (1). It's basically equivalent to OTHER in its effect on implementations: either HTSJDK needs modification to allow OTHER or HTSJDK needs modification to allow omitted PL.

(1) has the advantage of not adding a meaningless keyword value to the spec, while still distinguishing between the unknown and other states (cf #454 (comment)) via PM.

@jmarshall
Copy link
Member

jmarshall commented Nov 14, 2019

Re #454 (comment), we should also discuss belatedly adding ILLUMINA:2COLOR (and perhaps ILLUMINA:4COLOR) or similar alongside the post-nextseq ambiguous ILLUMINA, so that centres can move to something that specifies which of the colour chemistry error models applies.

@yfarjoun
Copy link
Contributor Author

from call: split into two PRs, OTHER and BGISEQ

@daviesrob
Copy link
Member

A list of platforms and instrument models can be found in ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd (part of the EBI's specification for XML metadata files that accompany data uploaded to the ENA). Look for <xs:complexType name="PlatformType"> to find the list of platforms. Model names start from <xs:simpleType name="type454Model">.

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 14, 2019

A bit of a diversion. IMO htsjdk/picard is perfectly entitled to emit warnings about fields it would prefer to have, and perhaps even have a mode where those warnings become hard failure (like -Werror on compiler), but the default really should be to accept any legal file without failure. We certainly shouldn't be considering the behaviour of such tools when writing the spec.

Regarding this PR though, as @yfarjoun said above we'll remove OTHER for now and have that discussion separate, but the remaining issue is BGI vs MGI. If anyone in the know has any key insight to this now is the time to raise it. (We don't want both.)

@jkbonfield
Copy link
Contributor

A list of platforms and instrument models can be found in ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd (part of the EBI's specification for XML metadata files that accompany data uploaded to the ENA). Look for <xs:complexType name="PlatformType"> to find the list of platforms. Model names start from <xs:simpleType name="type454Model">.

Thanks Rob. That's useful and it's good to see the PL fields at least match up so we haven't forgotten something.

The list of choices for PM look quite thorough. If we wish to add this to our spec it should probably be an appendix (unless it doesn't look too long as a footnote) and cite the source so any revisions can be discovered in case we're laggy on updating it.

@lbergelson
Copy link
Member

It looks like that list isn't entirely exhaustive, it looks like it's missing most BGI / MGI models.

@kbergin
Copy link

kbergin commented Nov 20, 2019

Hi! I wanted to mention another option that could be considered for the PM field controlled vocabulary list. There exists sequencing platform standard terminology in EBI's OBI ontology here https://www.ebi.ac.uk/ols/ontologies/obi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FOBI_0400103&viewMode=All&siblings=false

This is a public and controlled set of ontologies, and offers both specific names and IDs. New ontology fields can be requested as needed.

The ENCODE project uses the same Ontology for their metadata, for example here: https://www.encodeproject.org/files/ENCFF432YLJ/?format=json.

One downside is the naming seems to duplicate the PL field a bit. i.e. "Illumina HiSeq 2500"

@jkbonfield
Copy link
Contributor

That looks very out of date. Maybe it should be populated with the information from the other EBI ftp link @daviesrob linked above.

Eg there's no ONT, no capillary sequencing instruments, only has 1 of the 4 PacBios, no BGI/MGI, etc.

I think we'd be better linking with the ENA ontology as they have a much more vested interest in keeping theirs up to date - they cannot accept data submissions without doing so.

@raskoleinonen
Copy link
Contributor

ENA shares accepted instrument models with the other two INSDC members: NCBI and DDBJ. As has been pointed our previously in the thread, they can be extracted from the SRA.common.xsd that is hosted, among other possible locations, here:

ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd

PlatformType has all supported sequencing platforms. Instrument model enumerations are in the INSTRUMENT_MODEL element under each platform.

During submissions processing ENA also supports a number of synonyms for some of the instrument models, e.g. we could support both 'NextSeq 500' and 'Illumina NextSeq 500'. We would then map synonyms to the XML Schema enumerations.

I agree that it would be great to have these as an ontology somewhere to link to. If someone connects me with the right people I could maintain the list of instruments in an ontology as they get added to the SRA schema.

It looks like that list isn't entirely exhaustive, it looks like it's missing most BGI / MGI models.

This is also true and the reason is that the INSDC has not been asked by submitters yet to support depositions from these models. Normally, INSDC adds instruments as needed by submitters, but we could look at changing this process slightly and add the instruments first in a ontology and then later, as needed, add them to the SRA.common.xsd.

@kbergin
Copy link

kbergin commented Nov 21, 2019

Ah that's a great point - I'm definitely not an expert in this, I thought it'd be great to leverage an existing ontology, but I am reading that the main difference between ontology and controlled vocabulary is the relationships between terms, the management of the list, and how the information is expressed i.e. ontologies being in an ontology representation language. It's possible a representation language is not useful here?

I missed that the ones in the SRA field were pulled from an ontology, the person I asked at EBI about which ontology is used for SRA said that SRA uses a controlled vocabulary and that "INSDC works in a slightly different way for these things". She did say she's happy to have a call to explain more, but I didn't follow up.

In the end the main win is if there is another group who's job it already is to keep the list of terms up to date so this community doesn't have to, right?

@hts-specs-bot
Copy link

Changed PDFs as of 786f5e0: SAMv1 (diff).

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Mar 5, 2020

hmmm. I actually prefer technology names over company names...companies change (think solexa, and pacbio almost) but the technology stays the same...It's unfortunate that we have so many company names as valid values in PL, but we can start putting in the "correct" values as new technologies emerge, rather than simply asserting that everything from illumina is effectively the same....

SAMv1.tex Outdated
& {\tt PM} & Platform model. Free-form text providing further details of the
platform/technology used.\\\cline{2-3}
& {\tt PL} & Platform/technology used to produce the reads. \emph{Valid values are}:
{\tt CAPILLARY}, {\tt DNB} (MGI/BGI's DNA NanoBalls), {\tt HELICOS}, {\tt ILLUMINA} (can be used for either 4- or 2-color chemistry, patterned or unpatterned flowcells), {\tt IONTORRENT}, {\tt LS454}, {\tt ONT} (Oxford Nanopore), {\tt PACBIO} (Pacific Biotechnology), {\tt SOLID}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"DNB" is very nonspecific. MGI I think acquired this from Complete Genomics, but the instruments are very different. It's more like a data prep thing, which we tend to not describe here. Eg rolling circle amplification, mate pair vs paired-end sequencing, etc. I think it's not a good term for the PLatform field.

I also dislike all the extra verbiage which IMO isn't necessary here.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Mar 5, 2020

I went with DNB, since it seems silly to add "SEQ" to all the SEQuencing platforms...

@jkbonfield
Copy link
Contributor

hmmm. I actually prefer technology names over company names...companies change (think solexa, and pacbio almost) but the technology stays the same...It's unfortunate that we have so many company names as valid values in PL, but we can start putting in the "correct" values as new technologies emerge, rather than simply asserting that everything from illumina is effectively the same....

Maybe, but technology is shared across multiple vendors and often coupled with very different instruments. DNB being an example of this. Also what would you do if another Nanopore company came into the market? ONT defines the instrument, but if we'd labelled them as a generic "Nanopore" it would then be confusing and misleading.

@jkbonfield jkbonfield dismissed their stale review March 5, 2020 15:32

Changed since approval.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Mar 5, 2020

It never ceases to amaze me how difficult it is to keep everyone happy :-)

let's duke it out in the call.....

@hts-specs-bot
Copy link

Changed PDFs as of 00af2ba: SAMv1 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of bff98cc: SAMv1 (diff).

@jkbonfield
Copy link
Contributor

A quick summary from conference call, incase anyone is watching.

We agreed DNBSEQ would work. It's the name of their whole suite of instruments so is close enough to "platform", and it's likely to be specific enough to that company that it's not going to be reused by anyone else (we never had a platform field for Complete Genomics anyway). It also means if MGI comes up with a radically different technology then we haven't painted ourselves into a corner. (Using the company name only was probably an error in the past and doesn't justify continued repetition.)

To that ilk there was some discussion on Illumina 2-colour vs 4-colour and whether they should be separated out, and also where is appropriate to note chemistry specific mods (eg PCR-free vs with PCR on the same instrument/model), but that'll be a topic for a future PR and shouldn't hold this up.

@jmarshall
Copy link
Member

I've taken the liberty of pushing a suggested minor typographic commit with fixes for the two remaining comments on bff98cc. It also adds an entry to the version history appendix.

👍 from me for squash'n'merge with these minor fixes or an equivalent. Thanks for persevering and finally dragging this over the finish line… 😄

Applied previous suggestions: Just "Valid values:" like the other
field descriptions; s/tag/field and omit misplaced comma.

Also add a history item.
@hts-specs-bot
Copy link

Changed PDFs as of fca16bb: SAMv1 (diff).

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Apr 2, 2020

@jkbonfield your thumb?

@samtools samtools deleted a comment from hts-specs-bot Apr 3, 2020
@jkbonfield
Copy link
Contributor

@jkbonfield your thumb?

Apologies for missing that notification. Yes it gets the thumbs up for me. Thanks for the changes.

@yfarjoun yfarjoun changed the title Added BGI and OTHER as legal values for the PL tag Added BGISEQ as legal value for the PL tag Apr 30, 2020
@yfarjoun yfarjoun merged commit 82e60fc into samtools:master Apr 30, 2020
@yfarjoun yfarjoun deleted the yf_allow_new_platforms branch April 30, 2020 11:50
@yfarjoun
Copy link
Contributor Author

🎉

@splaisan
Copy link

Thanks a lot @yfarjoun, great fix, will it be in the next release and when is this planned? (I prefer not to install the dev versions)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add 'BGISEQ' as a valid value of PL in @RG header record