-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added BGISEQ as legal value for the PL tag #454
Conversation
Can we not have “other” and just not specify the tag? |
In the call today there was a discussion discussion about this. The point here is to distinguish between not saying anything (like "unknown") and saying "other" which is stating that it is not one of the preceding values. You may disagree on the merits of being able to distinguish between these two states.... |
How about adding MGI as well. See https://en.mgitech.cn/ |
@yfarjoun I like it, how about adding that if the platform is unknown, the tag should not be other? |
Re BGI vs BGISEQ: the existing list of values mostly reflects underlying technologies rather than manufacturers (though it's a bit of a mishmash), so I'd prefer going with what was requested — but that request was three years ago of unknown provenance, so hopefully a knowledgeable source may add further information to #178. The list of values is currently more or less ordered by vintage/dominance rather than alphabetical. So if we're adding this new entry, either add it at the end or fully alphabetise the list. |
My problem with We list a well-known vocabulary in the spec so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of What if instead we open this up via the usual lowercase value mechanism:
Possibly the last sentence can be left out (as without the temptation of (This would also solve the BGI/MGI problem (FYI @colinhercus reports on #178 that they're actually the same thing): we could leave it off the list until further demand produces another request.) |
@jmarshall, From my experience HiSeq and NextSeq have very different sequencing error models so I'm not sure how helpful PL:ILLUMINA is! |
@colinhercus: Indeed. That's the historical reason for having a well-known PL but free-form PM, but it already broke down when we didn't differentiate 2-colour Illumina, alas — so you get to fuzzy match NextSeq/NovaSeq/etc in PM after all… |
I could go with upper/lowercase namespacing as an alternative to OTHER. |
@jmarshall as long as we only allow certain values, there's merit in allowing "OTHER" for exactly the case where there's no valid value, but there's another reason: there are nascent technologies that are still stealthy enough that they would not want to declare what the technology is, but for the purposes of collaboration would like to say something in the PL field... |
By “for the purposes of collaboration” do you mean that Picard in fact does not allow you to omit PL to indicate that the platform is unknown or undisclosed? Well… |
@jmarshall what's the point of specifying in the spec what the VALID VALUES are, and then allowing sam files that have other values there? |
@jkbonfield for some historical reason, htsjdk ignores case when comparing to the allowed values...so lower-casing isn't going to help... |
It at least means people don't invent other synonymous valid values when they see a term already in use. I still think there is merit in uppercase vs lowercase. I'm sure htsjdk could be modified to not complain about lowercase ones when it's doing the validation? I don't think we need to worry about case insensitivity (infact it'd be a bonus to have it) as we're not proposing having both upper and lowercase tags with the same text meaning different things. It's simply a way of pointing out to people that it's not yet "blessed" in the official dictionary. |
MGI is a subsidiary of BGI. The "PL"atform field is an unfortunate mishmash of all sorts and it's a bit vague where a platform finishes and a model starts. For example NovaSeq is a model and ILLUMINA is the platform despite it also being the company name. Conversely SOLID is a platform, not ABI which was the company involved, perhaps because ABI were also ubiquitous with earlier sequencing platforms too (eg CAPILLARY based Sanger sequencing). For MGI vs BGI, I can see benefits either way, but we only want one and not both as they're basically the same instrument rebranded. If they're already starting to be populated in the wild then we should bless that one in this spec. |
In fact, extant HTSJDK validates that PL is present and is one of the existing known values (see samtools/htsjdk@db13cd3) and since samtools/htsjdk@1e6d054 does so case-insensitively (as an unexplained extension over the spec). So in fact HTSJDK is already accepting values outwith the spec. As @jkbonfield replied and as I said in #454 (comment), the point of specifying the WELL-KNOWN VALUES but nevertheless allowing other values is that tools can depend on It is unfortunate that HTSJDK has allowed lowercase PL values equivalently to the well-known uppercase ones, but this does not itself prevent extending the value space to allow locally-defined lowercase values. I remain opposed to adding
This would require HTSJDK changes to allow PL to be omitted, but there was never any justification in the spec for validating its presence anyway. (Can any Broadites look up PO-326's justification for validating PL presence in all cases? Presumably it's another case of “it should be present in our usual sequencing protocols, hence enforce”.)
Me, I now favour (1). It's basically equivalent to (1) has the advantage of not adding a meaningless keyword value to the spec, while still distinguishing between the unknown and other states (cf #454 (comment)) via PM. |
Re #454 (comment), we should also discuss belatedly adding |
from call: split into two PRs, OTHER and BGISEQ |
A list of platforms and instrument models can be found in ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd (part of the EBI's specification for XML metadata files that accompany data uploaded to the ENA). Look for |
A bit of a diversion. IMO htsjdk/picard is perfectly entitled to emit warnings about fields it would prefer to have, and perhaps even have a mode where those warnings become hard failure (like -Werror on compiler), but the default really should be to accept any legal file without failure. We certainly shouldn't be considering the behaviour of such tools when writing the spec. Regarding this PR though, as @yfarjoun said above we'll remove OTHER for now and have that discussion separate, but the remaining issue is BGI vs MGI. If anyone in the know has any key insight to this now is the time to raise it. (We don't want both.) |
Thanks Rob. That's useful and it's good to see the PL fields at least match up so we haven't forgotten something. The list of choices for PM look quite thorough. If we wish to add this to our spec it should probably be an appendix (unless it doesn't look too long as a footnote) and cite the source so any revisions can be discovered in case we're laggy on updating it. |
It looks like that list isn't entirely exhaustive, it looks like it's missing most BGI / MGI models. |
Hi! I wanted to mention another option that could be considered for the PM field controlled vocabulary list. There exists sequencing platform standard terminology in EBI's OBI ontology here https://www.ebi.ac.uk/ols/ontologies/obi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FOBI_0400103&viewMode=All&siblings=false This is a public and controlled set of ontologies, and offers both specific names and IDs. New ontology fields can be requested as needed. The ENCODE project uses the same Ontology for their metadata, for example here: https://www.encodeproject.org/files/ENCFF432YLJ/?format=json. One downside is the naming seems to duplicate the PL field a bit. i.e. "Illumina HiSeq 2500" |
That looks very out of date. Maybe it should be populated with the information from the other EBI ftp link @daviesrob linked above. Eg there's no ONT, no capillary sequencing instruments, only has 1 of the 4 PacBios, no BGI/MGI, etc. I think we'd be better linking with the ENA ontology as they have a much more vested interest in keeping theirs up to date - they cannot accept data submissions without doing so. |
ENA shares accepted instrument models with the other two INSDC members: NCBI and DDBJ. As has been pointed our previously in the thread, they can be extracted from the SRA.common.xsd that is hosted, among other possible locations, here: ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.common.xsd PlatformType has all supported sequencing platforms. Instrument model enumerations are in the INSTRUMENT_MODEL element under each platform. During submissions processing ENA also supports a number of synonyms for some of the instrument models, e.g. we could support both 'NextSeq 500' and 'Illumina NextSeq 500'. We would then map synonyms to the XML Schema enumerations. I agree that it would be great to have these as an ontology somewhere to link to. If someone connects me with the right people I could maintain the list of instruments in an ontology as they get added to the SRA schema.
This is also true and the reason is that the INSDC has not been asked by submitters yet to support depositions from these models. Normally, INSDC adds instruments as needed by submitters, but we could look at changing this process slightly and add the instruments first in a ontology and then later, as needed, add them to the SRA.common.xsd. |
Ah that's a great point - I'm definitely not an expert in this, I thought it'd be great to leverage an existing ontology, but I am reading that the main difference between ontology and controlled vocabulary is the relationships between terms, the management of the list, and how the information is expressed i.e. ontologies being in an ontology representation language. It's possible a representation language is not useful here? I missed that the ones in the SRA field were pulled from an ontology, the person I asked at EBI about which ontology is used for SRA said that SRA uses a controlled vocabulary and that "INSDC works in a slightly different way for these things". She did say she's happy to have a call to explain more, but I didn't follow up. In the end the main win is if there is another group who's job it already is to keep the list of terms up to date so this community doesn't have to, right? |
5ed1e1b
to
786f5e0
Compare
hmmm. I actually prefer technology names over company names...companies change (think solexa, and pacbio almost) but the technology stays the same...It's unfortunate that we have so many company names as valid values in PL, but we can start putting in the "correct" values as new technologies emerge, rather than simply asserting that everything from illumina is effectively the same.... |
SAMv1.tex
Outdated
& {\tt PM} & Platform model. Free-form text providing further details of the | ||
platform/technology used.\\\cline{2-3} | ||
& {\tt PL} & Platform/technology used to produce the reads. \emph{Valid values are}: | ||
{\tt CAPILLARY}, {\tt DNB} (MGI/BGI's DNA NanoBalls), {\tt HELICOS}, {\tt ILLUMINA} (can be used for either 4- or 2-color chemistry, patterned or unpatterned flowcells), {\tt IONTORRENT}, {\tt LS454}, {\tt ONT} (Oxford Nanopore), {\tt PACBIO} (Pacific Biotechnology), {\tt SOLID}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"DNB" is very nonspecific. MGI I think acquired this from Complete Genomics, but the instruments are very different. It's more like a data prep thing, which we tend to not describe here. Eg rolling circle amplification, mate pair vs paired-end sequencing, etc. I think it's not a good term for the PLatform field.
I also dislike all the extra verbiage which IMO isn't necessary here.
I went with DNB, since it seems silly to add "SEQ" to all the SEQuencing platforms... |
Maybe, but technology is shared across multiple vendors and often coupled with very different instruments. DNB being an example of this. Also what would you do if another Nanopore company came into the market? ONT defines the instrument, but if we'd labelled them as a generic "Nanopore" it would then be confusing and misleading. |
It never ceases to amaze me how difficult it is to keep everyone happy :-) let's duke it out in the call..... |
A quick summary from conference call, incase anyone is watching. We agreed DNBSEQ would work. It's the name of their whole suite of instruments so is close enough to "platform", and it's likely to be specific enough to that company that it's not going to be reused by anyone else (we never had a platform field for Complete Genomics anyway). It also means if MGI comes up with a radically different technology then we haven't painted ourselves into a corner. (Using the company name only was probably an error in the past and doesn't justify continued repetition.) To that ilk there was some discussion on Illumina 2-colour vs 4-colour and whether they should be separated out, and also where is appropriate to note chemistry specific mods (eg PCR-free vs with PCR on the same instrument/model), but that'll be a topic for a future PR and shouldn't hold this up. |
I've taken the liberty of pushing a suggested minor typographic commit with fixes for the two remaining comments on bff98cc. It also adds an entry to the version history appendix. 👍 from me for squash'n'merge with these minor fixes or an equivalent. Thanks for persevering and finally dragging this over the finish line… 😄 |
Applied previous suggestions: Just "Valid values:" like the other field descriptions; s/tag/field and omit misplaced comma. Also add a history item.
4361524
to
fca16bb
Compare
@jkbonfield your thumb? |
Apologies for missing that notification. Yes it gets the thumbs up for me. Thanks for the changes. |
🎉 |
Thanks a lot @yfarjoun, great fix, will it be in the next release and when is this planned? (I prefer not to install the dev versions) |
in the @rg SAM header line.
fixes #178