Taxon, Taxon Concept and Taxon Name Usage: definitions and relationships #1

Closed
ghwhitbread opened this Issue Aug 27, 2018 · 95 comments

@ghwhitbread
Contributor

ghwhitbread commented Aug 27, 2018

You write:
A taxonomic concept is a taxonomic name instance establishing or circumscribing a taxonomic entity - often linking synonymic inclusions and adding annotations, description…

I think it's cleaner to say that the taxonomic concept is a theory of a certain taxonomic identity. And then "taxonomic concept label" (name sec. source) is the "name" for that theory.

More or less like here: https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-017-0174-5
...

Best, Nico

@jgerbracht

jgerbracht commented Aug 29, 2018

I come at the definition of Taxonomic Concept from a different direction, which is from population perspectives, i.e. one or more population(s) of taxonomically related individuals. As opposed to approaching this from the taxonomic names direction. I see taxonomic names as labels to taxonomic concepts and while the names will change from author to author and taxonomy to taxonomy, the underlying concept must be unchanging through time (though, of course, individuals within population(s) are born, reproduce and die). 'A rose by any other name ...'

Denis Lepage et al. in https://dx.doi.org/10.3897/zookeys.420.7089 describe concepts and the issues, though I don't immediately see an actual definition in their paper.

I do believe that our first task is to coin a definition that captures the immutability of a concept and its dissociation from the specific scientific names, author and publication of each concept. I can mostly agree with this definition:
"A taxonomic concept is the underlying meaning, or referential extension, of a scientific name as stated by a particular author in a particular publication. It represents the author’s full-blown view of how the name is supposed to reach out to objects in nature."
I'm not suggesting this as the definition we adopt, but it comes close to what I mean when I say 'Taxonomic Concept'. One key point is that I see this describing only the 'original definition' of the concept, and when future authors apply different names to the exact same view of the 'objects in nature', the same taxonomic concept should be utilized.
Cheers,
Jeff

@jgerbracht

jgerbracht commented Aug 29, 2018

One thought that will certainly cause discussion and controversy is that from my perspective, a taxonomic concept doesn't actually require a name. At the most basic, it requires some form of reference (even unpublished) and description of the associated population(s). There are cases that eBird must manage every year where an unpublished species still is a valid taxonomic concept: it doesn't have a name, author or citation yet, but it is being observed and recorded in the natural world and therefore must have a taxonomic concept ID for it to be useful in the eBird world.
Thoughts?

@deepreef

deepreef commented Aug 30, 2018

I actually don't find that controversial at all. When a tree falls in the woods and nobody is there to hear it, it still creates sound waves. Likewise, when groups of organisms are known to exist in nature but have not been formally assigned a Linnean-style scientific name, they still exist as taxon concepts.

As for the actual definition of "taxon concept", I would go with something simpler: "A circumscribed set of organisms asserted to represent a taxon". It's not circular, because the word "concept" is what we're trying to define here. A slightly more elaborated version might qualify "organisms" as "inclusive of individuals living, recently dead, and yet to be born".

Technically, Code-governed Linnean-style names are labels attached to name-bearing type specimens. To use a Linnean-style name as a label for a concept, it's necessary to include a "sensu" or "SEC" qualifier, which by convention is some form of Reference citation (typically author + year). I agree that names shouldn't be part of the definition of a taxon/concept, and I wouldn't include this convention for labeling concepts as part of the definition of a concept either. Instead, I would keep the definition of "taxon"/"concept" as the simple version above, then strongly support a standard human-friendly concept labeling format along the lines of:

[LinneanStyleName] [NomenclaturalAuthority] SEC [ReferenceAssertingOrDefiningConcept]
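As a rough illustration of that labeling format, here is a minimal Python sketch. The function name `concept_label` and the example values ("Aus bus", "Smith, 1950", "Jones 1980") are placeholders invented for illustration; they are not part of any agreed standard.

```python
def concept_label(name: str, authority: str, sec_reference: str) -> str:
    """Build a label of the form '[Name] [Authority] SEC [Reference]'.

    All three arguments are free-text strings; empty parts are skipped.
    """
    parts = [name.strip(), authority.strip(), "SEC", sec_reference.strip()]
    return " ".join(part for part in parts if part)


# Hypothetical placeholder name, authority and reference.
label = concept_label("Aus bus", "Smith, 1950", "Jones 1980")
print(label)  # Aus bus Smith, 1950 SEC Jones 1980
```

The label is meant for human readers only; it does not replace an identifier for the underlying name usage.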

As for identifiers, applied to names and concepts, my own views are well-documented: we need a system of persistent, shared identifiers for taxon-name usage instances, and then apply those identifiers in context as proxy anchor-points for both taxon names and taxon concepts. But I imagine that would be a different thread...

Aloha,
Rich

@mdoering

mdoering commented Aug 30, 2018

@deepreef

deepreef commented Aug 30, 2018

Thanks, @mdoering . Yeah... I was a little queasy about that as well. My hope is that "Taxon" is reasonably well understood, given that it is the basis of an entire field of study (i.e., taxonomy). But you're right -- while my proposed definition may not be circular per se, it does somewhat dodge and obfuscate a clean definition by leaning too heavily on an equally abstract and vague term.

I suppose the definition could simply be "A circumscribed set of organisms", but there are other reasons for circumscribing sets of organisms that are non-taxonomic (e.g., "marine organisms", "organisms in Hawaii", etc.). That's why I felt the definition needed the additional refinement of "asserted to represent a taxon". I think as Jeff and others have said, the "asserted" part is key, because any taxon concept really inherits its meaning from an assertion put forth by taxonomists (or non-taxonomists). My sense of "taxon" is that the word implies a set of organisms that more or less share an evolutionary history. I wanted to avoid such specifics, however, to sidestep the whole monophyletic/holophyletic/paraphyletic issue which, while interesting in its own right, is outside the scope of what we're trying to achieve here.

So... my feeling is that the definition I proposed stands as it is even without a clear/agreed definition of what a "taxon" is. Different people may agree or disagree on what is implied by "taxon", but what matters to the definition is that someone asserted a set of organisms to represent a taxon -- by whatever notion of "taxon" that someone had in mind. Linnaeus predated Darwin by a century and was himself a creationist; but I think it's fair to say that he asserted circumscribed sets of organisms to represent taxa. In his mind, taxa were created by God, which is not consistent with the view of modern evolutionary biologists; yet despite this fundamental gap in the essence of a "taxon", both Linnaeus and modern evolutionary biologists still assert circumscribed sets of organisms to represent taxa in ways that are fundamentally comparable, and fall within the scope of what I think we're circling around for defining what we mean by "taxon concept" in this context.

Sorry for the ramblings....

Aloha,
Rich

P.S -- Sorry - I accidentally clicked the wrong button....

@deepreef deepreef closed this Aug 30, 2018

@deepreef deepreef reopened this Aug 30, 2018

@nielsklazenga

Member

nielsklazenga commented Aug 30, 2018

I like Rich's definition. We need to work out how Taxon, Taxon Concept, Name Usage and Instance relate to each other (I'll create a new issue for that tomorrow; it's in the discussion document that Greg and I wrote), but I would say that the Taxon is the actual group of organisms that is out there (or we think is out there), while the Taxon Concept is the abstraction, or what is in our heads.

@deepreef

deepreef commented Aug 30, 2018

Thanks @nielsklazenga -- I agree with your distinction between "taxon" as being the actual set of organisms, and "concept" as being our abstract human interpretation of it. In that context, I would probably apply my proposed definition to "taxon", and parse out the other terms as follows:

Taxon: "A circumscribed set of organisms, inclusive of individuals living, recently dead, and yet to be born, asserted to represent a natural cohesive biological unit" [This may need some elaboration on "natural cohesive biological unit", but again the key is that in order to exist, it must be asserted to be such.]

Taxon Concept: "A set of physical, genealogical, phylogenetic or other biological properties or characters of organisms used to define the abstract boundaries of a taxon circumscription that collectively distinguish it from other taxa." [What I'm trying to suggest here is that the "concept" is derived from the actual properties used to describe the abstract boundaries of taxon circumscriptions, which is the way that taxonomists determine whether any particular organism/individual is or is not an instance of an asserted Taxon.]

For my own understanding of "Taxon Name Usage" and associated terms (e.g., "Reference", "Name-String", "Appearance", etc.), see: Taxonomic name usage files.

I'm not a big fan of defining the term "Instance" by itself within this context, because that word is so broad and vague that we shouldn't try to co-opt it to have a more specific meaning.

@nielsklazenga

Member

nielsklazenga commented Aug 30, 2018

Awesome.

@deepreef, in terms of the relationship between Taxon Concept and Taxon Name Usage, would you agree that Taxon Name Usage can be an operationalisation of Taxon Concept?

@nielsklazenga nielsklazenga changed the title from Taxon Concept Label to Taxon, Taxon Concept and Taxon Name Usage: definitions and relationships Aug 31, 2018

@deepreef

deepreef commented Aug 31, 2018

I guess my answer to that depends on what you mean by "operationalisation".

The way I have characterized it in the past, is that a "Taxon Name Usage" (TNU) encompasses all of the text, numbers, figures, data, etc. associated with the implied taxon concept asserted within a Reference. An identifier assigned to that TNU includes all of that associated information collectively as the "thing" that is identified. Thus, I guess I would say that the TNU identifier implies the full set of information used in asserting a Taxon Concept. In this sense, I think it's fine and appropriate to regard the TNU as the "operationalisation" of the Taxon Concept, in the sense that it encompasses all of the documented information used in the Reference to define the boundaries of that Taxon Concept.

One of the caveats, however, is that I think that a TNU can be used to operationalise more than just the Taxon Concept. For example, a subset of TNUs are Protonyms (i.e., those that create new scientific names, or "nomenclatural novelties"). In some contexts, the TNU (=Protonym) can also simultaneously be the operationalisation of the "taxon name" entity (important for nomenclators, but devoid of any connection to taxon concepts other than the name-bearing type specimen), as well as the operationalisation of the implied taxon concept associated with that name within that Reference (no different from any non-Protonym TNU).

I personally don't see a problem with that, because the distinction of whether or not a particular TNU identifier implies (or serves as proxy for) the nomenclatural bits of the TNU or the taxon concept bits of the TNU depends on the context in which the identifier is cited. The identifier identifies the TNU (i.e., the collective set of text, numbers, figures, data, etc. associated with the implied taxon concept asserted within a Reference); but the TNU serves as a very useful proxy for both nomenclatural actions, and taxon concept definitions.

Man, this stuff is hard enough to think about, let alone write about! And for those who argue that these sorts of discussions are too deep into the weeds to be useful in this context; I would counter that the reason we've been unable to solve these issues after decades of discussing and debating them is because we have thus far failed, as a community, to dive this deep into the weeds previously.

@nielsklazenga

Member

nielsklazenga commented Aug 31, 2018

You are very good at writing about it though. I agree with all that. At a later stage we can probably come up with a list of types of Taxon Name Usages and how they relate to Taxon Names and Taxon Concepts.

I agree that it is important to have these discussions, as I think that, once we've nailed down the core concepts, the rest will become more straightforward.

@jgerbracht

jgerbracht commented Aug 31, 2018

If we use Taxon as being "A circumscribed set of organisms, inclusive of individuals living, recently dead, and yet to be born, asserted to represent a natural cohesive biological unit" then a taxon_identifier would be an identifier that is persistent and always means the same 'circumscribed set of organisms' regardless of what taxonomic name is applied, what taxon authority is applied and what taxonomic level is applied. Isn't taxonomic id already utilized with and generally closely tied to a name? as opposed to a 'set of organisms'? Maybe I have a basic misunderstanding that can be corrected.

@deepreef

deepreef commented Aug 31, 2018

Yes, that is my understanding conceptually. However, for practical purposes, I'm not sure how one would ever know that two circumscribed sets of organisms asserted by two different authorities (accordingTo), with the same or different names, and the same or different taxonomic levels, represent the same taxon concept (at least with enough confidence to utilize the same taxon_identifier). An example we wrestled with in the early days of discussing this is suppose you have Smith 1950 asserting a taxon concept, with various information delimiting the boundaries of that concept (e.g., characters, junior synonyms, geographic distributions, etc.). Then Jones 1980 uses the same name, same synonymy, but adds some additional characters (not mentioned by Smith), and perhaps adds a geographic range extension. Can we confidently assume that both are the same taxon concept, and therefore both can utilize or reference the same taxon_identifier? That would require expert knowledge of the group to assert, and even then what would be required for Smith herself and Jones himself to mutually agree that they are referring to the same implied circumscribed set of organisms?

This is why I never felt there was much practical value in creating taxon_identifiers that are independent of the underlying TNU(s) that assert the taxon concept(s). It's also why TCS went with the notion of "TaxonRelationshipAssertions". That is to say, while we may be able to confidently document that Brown 2000 asserted that taxonConcept sensu Smith 1950 is congruent with taxonConcept sensu Jones 1980, we cannot "know" they actually are congruent with enough confidence that we can share the same identifiers for both concepts.

This is why I think anchoring everything to TNUs (rather than taxon_identifiers of some sort) is more practical, and instead of asserting concept congruence via shared taxon_identifiers, we assert some sort of set-theory relationship between the concepts represented by two TNUs (e.g., as congruent, or includes, or overlaps or whatever). Sure there may be some cases where we can universally accept congruence in taxon concept from separate TNUs with enough confidence that we could anchor both to the same taxon_identifier; but I wager such cases would represent the vast (VAST) minority, and in that context does it really make sense to define and maintain and utilize yet ANOTHER class of identifiers (in a domain that is already overflowing with subtly different classes of identifiers)?
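To make the idea of asserting relationships between TNUs (rather than sharing one taxon_identifier) more concrete, here is a minimal Python sketch. The class name `RelationshipAssertion`, its field names, the `tnu:...` identifiers and the relationship vocabulary are all assumptions made for illustration; none of them is taken from the TCS schema itself.

```python
from dataclasses import dataclass

# Illustrative vocabulary only; a real standard would define its own terms.
RELATIONSHIPS = {"congruent", "includes", "included in", "overlaps"}


@dataclass(frozen=True)
class RelationshipAssertion:
    subject_tnu: str   # hypothetical identifier of the first TNU
    relationship: str  # one of RELATIONSHIPS
    object_tnu: str    # hypothetical identifier of the second TNU
    according_to: str  # who made the assertion

    def __post_init__(self):
        if self.relationship not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {self.relationship!r}")


def assertions_about(tnu_id, assertions):
    """Return every assertion that mentions the given TNU on either side."""
    return [a for a in assertions if tnu_id in (a.subject_tnu, a.object_tnu)]


# "Brown 2000 asserted that the concept sensu Smith 1950 is congruent
# with the concept sensu Jones 1980."
assertions = [
    RelationshipAssertion("tnu:smith-1950", "congruent", "tnu:jones-1980", "Brown 2000"),
]
found = assertions_about("tnu:jones-1980", assertions)
```

Each TNU keeps its own identifier here. The claim of congruence is a separate, attributable record that can be disputed or superseded without touching either TNU.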

On the other hand, if we lower the "bar" for what we accept as "congruent" concepts (e.g., sets of distinct name-bearing type specimens -- aka heterotypic/subjective synonymies), then we're in a much better place to aggregate sets of TNUs into congruent taxon concepts more objectively, in which case a dedicated class of taxon_identifier might well be useful.

Sorry for the extended ramblings...

@mdoering

mdoering commented Aug 31, 2018

@jgerbracht

jgerbracht commented Aug 31, 2018

@deepreef

deepreef commented Aug 31, 2018

In reply to @mdoering: "Is it worthwhile to differ between the attempt to list defined (and unique?) concepts and the simple referring to a name used in some publication? If it is only about the latter I much prefer the term NameUsage which does not pretend to be more than just that."

Short reply: I agree with your second sentence!

Longer reply: The way I look at it, NameUsage instances come in many flavors -- ranging from a mere mention of a name within a Reference to full-blown treatments with full synonymies, robust material examined and character descriptions, phylogenetic analyses, geographic distributions, etc., etc. The degree to which one can divine the boundaries of a taxon concept circumscription will likewise vary tremendously. There may be some value in drawing a line between NameUsage instances that include a full heterotypic synonymy, and those that do not. The former can be used to algorithmically compare NameUsage instances as to the sets of type specimens they include and determine them to be congruent, includes, included in, etc. (per our discussions in Dave Remsen's house a couple years ago). While circumscription boundaries drawn using collective sets of type specimens (i.e., complete asserted heterotypic synonymies) are not as granular as those marked by character states and/or enumerated specimens & populations; they are FAR more practical in terms of determining (approximate) concept congruity. As such, we can do all the reasoning we need using only NameUsage instances, without the need to separately mint identifiers for Taxon Concepts as entities that exist independently of the individual usage instances.
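The type-specimen comparison described above can be sketched with plain set operations. This is a minimal Python illustration, assuming each NameUsage instance has already been reduced to the set of name-bearing types implied by its heterotypic synonymy; the function name, the `type:...` identifiers and the extra "disjoint" outcome are assumptions added for the example.

```python
def compare_circumscriptions(a, b):
    """Relate two circumscriptions given as sets of type-specimen identifiers.

    The result reads left to right: "a <result> b".
    """
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "congruent"
    if a > b:   # a holds every type in b, plus at least one more
        return "includes"
    if a < b:   # every type in a is also in b, and b has more
        return "included in"
    if a & b:   # some shared types, but neither contains the other
        return "overlaps"
    return "disjoint"


# Hypothetical usages: Jones adds one junior synonym to Smith's synonymy.
smith_1950 = {"type:A", "type:B"}
jones_1980 = {"type:A", "type:B", "type:C"}

relation = compare_circumscriptions(jones_1980, smith_1950)
print(relation)  # includes
```

This only yields approximate congruence, as noted above: two usages with identical sets of types may still differ in the populations or characters their authors had in mind.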

I agree with @jgerbracht that the concept exists (at least in the abstract) independently of the extent to which it is described or fleshed out within the documented Name-Usage instance; but the problem as I mentioned before is that, beyond comparing heterotypic synonymies, expert knowledge is necessary to assert the congruency (or not) in concept circumscription between any two given name-usage instances. In cases where that expert knowledge is available, I think it's better to capture something along the lines of a TaxonRelationshipAssertion (sensu TCS) to map the relationship between two Name-Usage instances, rather than mint some sort of identifier for the abstract concept itself, then link both name usages to it.

Also, the precision and granularity of what the concept boundaries are will vary, and as such the decision to regard them as the "same" concept or "different" concepts will change. In some cases range extensions do not represent a change in concept, but in other cases they do. Take for example that Species A SEC Ref1 is described from specimens in Hawaii. Then someone finds a population in the Marshall Islands (range extension; Species A SEC Ref2). Because it's a range extension, there is no change in concept. However, later genetic data and other evidence convince someone else that they're actually different species, so we have Species A SEC Ref3 from Hawaii, and Species B SEC Ref3 from the Marshall Islands. Now... what is the relationship between Species A SEC Ref1 and Species A SEC Ref3? If the author of Ref1 (who was unaware of the Marshalls population) was a splitter, her concept might be sensu stricto and hence the same as Ref3. Or she might have been a lumper, in which case her concept would be sensu lato and congruent with Ref2.

I think it's much better to anchor our "concepts" as 1:1 with individual name-usage instances, then add a separate layer for assertions about how those concepts relate to each other in terms of congruency/etc.

@jgerbracht : one possible solution for what you describe is to establish a system analogous to type specimens but for Name-Usage instances that define taxon concepts. Instead of minting a new taxon_identifier to represent the concept (independent of the individual name usages that collectively define it) and linking all relevant TNUs to that separate identifier, you could (eBird could) have a system where they pick one TNU among several that relate to the same Concept, then brand that the "type TNU" for the concept, and link the other TNUs to it. This "Type TNU" effectively serves the same role as a taxon_identifier would, but without needing to deal with a new class of identifiers.

Think about it this way: even if we do mint taxon_identifiers to represent abstract concepts independent of the name usages, then you still need some reference point for that concept instance. Suppose there are four TNUs linked to the same concept instance, but then later someone realizes that a mistake was made, and that two of the TNUs refer to a slightly different concept than the other two. What happens to the concept instance? Does it disappear and two new ones are minted? Or does the original concept instance stay with two of the TNUs, and another concept instance is minted to represent the other two? What if three go with one concept, and one with the other? What if 49 go with one concept and 1 goes with the other? If we mint two new ones and "retire" the original concept, what happens to all the external data linked to that "retired" concept? If we maintain the original concept with one subset of TNUs and mint only one new one for the other, then there will need to be some mechanism for deciding which set of TNUs the original concept remains with (e.g., a "type usage" instance, analogous to a type specimen).
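One possible answer to those questions, following the "type TNU" idea, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Concept` class, the `split_concept` function and all identifiers are hypothetical, and the rule "the original identifier stays with whichever group holds the type TNU" is just one policy among several that could be chosen.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    type_tnu: str                          # the TNU that anchors this concept
    tnus: set = field(default_factory=set)


def split_concept(concept, moved_tnus, new_id, new_type_tnu):
    """Move some TNUs into a new concept; the original keeps its identifier."""
    moved = set(moved_tnus)
    if not moved <= concept.tnus:
        raise ValueError("can only move TNUs that belong to the concept")
    if concept.type_tnu in moved:
        raise ValueError("the type TNU anchors the original concept and cannot move")
    if new_type_tnu not in moved:
        raise ValueError("the new type TNU must be one of the moved TNUs")
    concept.tnus -= moved
    return Concept(new_id, new_type_tnu, moved)


# Four TNUs were linked to one concept; two turn out to belong elsewhere.
original = Concept("concept:1", "tnu:a", {"tnu:a", "tnu:b", "tnu:c", "tnu:d"})
new = split_concept(original, {"tnu:c", "tnu:d"}, "concept:2", "tnu:c")
```

Under this policy, external data linked to "concept:1" keeps pointing at the group that contains the type TNU, whether the split is 2/2, 3/1 or 49/1.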

Again, I apologize for the long post here; but there's a reason we've never quite sorted all this stuff out before. The good news is that this conversation seems genuinely fresh to me, and I honestly think we're making good progress!

@baskaufs

Contributor

baskaufs commented Sep 4, 2018

I'm a bit behind on this thread due to traveling at the end of the TDWG meeting. But I had several items that I wanted to add for the record.

  1. Several years ago, there was a complaint that extensive, substantive conversations happen on email lists and that what comes out of those conversations does not get captured - causing the conversations to happen over and over. So I actually took the time to record a summary of the exhaustive TCS-related thread that started on 2012-11-01. Since we seem to be starting in on this subject all over again, with some of the same participants, perhaps we could start by reviewing the previous conversation and refer to the URLs of relevant posts there rather than writing them all over again. The page I've linked also refers to an earlier thread in 2009 that also repeats some of the same conversation about taxon concepts.

  2. Niels posted info from an email I sent as Issue #3, so I won't repeat that here. However, I'd like to include it in this conversation by reference. What I wanted to note was that the graph diagram it includes came about during the creation of the Darwin Core RDF Guide. In writing the guide, we considered it out of scope to thrash out the issue of "taxonomic entities", assuming that such thrashing would be handled by a future TCS 2.0 task group (which I guess is pretty much this group). Nevertheless, Section 2.7.4 of the guide was written with the recognition that the dwc:Taxon class "convenience terms" effectively describe some kind of entity (an instance of the dwc:Taxon class that might be a taxon, taxon concept, or TNU). The RDF guide mints the object property dwciri:toTaxon to enable linking from a determination (dwc:Identification instance) to that entity at such future time when the nature of that "taxonomic entity" got fleshed out. I recommend reading section 2.7.4 if you want to understand how the RDF Guide sees the relationship between the DwC Taxon class terms and the dwc:Identification and dwc:Taxon classes themselves.

  3. I understand the desire to clearly define what a taxon/taxon concept/TNU is. However, this discussion is reminding me of the very long discussion that took place when we tried to come up with a definition for the dwc:Organism class. Although it seems like it should be easy to define an organism, we ended up with a definition that may seem strange at first, since it included not only individual biological organisms, but also things like clones, colonies, and packs of animals. The reason that we ended up with such an odd definition is because we ended up defining the class in a way so that it "did" what we wanted it to do, rather than defining it to "be" what we thought it should be. Let me explain what I mean by that. In that long and painful discussion, the need for even having an organism class was questioned because there were very few properties that we actually wanted to assign to instances of the class. The epiphany came when it was suggested that the real purpose of the organism class was not to be a thing onto which we attached properties, but rather to be a thing to connect one-to-many determinations to one-to-many occurrences. In database terms, it was like a join table. In graph language, it served as a node to link multiple other nodes. Once it was clear that this was the function of the class, then defining dwc:Organism was easier: it was defined to include all things that can have one-to-many occurrences and to which we would like to assign one-to-many determinations. That's how weird stuff like wolf packs got included in the definition. I think the situation of taxon/taxon concepts/TNUs is similar. What we need is a "thing" that connects identification instances to names and references. In graph terms, this thing is a node that connects a determination to zero-to-one names and zero-to-one references. Anything that we can imagine to fulfill that role (taxon concepts, TNUs or whatever) can be included in the definition of that thing.
Once we have established the "thing", we can assert additional properties to flesh out the meaning of the thing - taxon concepts might have properties that TNUs don't and vice versa, just as a wolf pack might have different properties than an aspen clone or an individual elephant. But the basic linking function will be there regardless. Given our previous experience, I highly recommend starting with a functional definition (we want this "thing" to connect references to names), rather than starting off by getting hung up on a conceptual definition.

It's possible that this node could also connect names to things like sets of specimens or organism occurrences rather than to a reference if that is an acceptable alternative way to define the taxon.
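The functional definition above (a node that connects a determination to zero-to-one names and zero-to-one references) can be sketched as a small data structure. This is a minimal Python illustration; the class names, field names and identifiers are assumptions, and the `to_taxon` field only loosely mirrors the role the thread describes for dwciri:toTaxon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonEntity:
    """The linking 'thing': a taxon, taxon concept, TNU or whatever fills the role."""
    entity_id: str
    name: Optional[str] = None       # zero-to-one names
    reference: Optional[str] = None  # zero-to-one references


@dataclass(frozen=True)
class Identification:
    organism_id: str
    to_taxon: str                    # entity_id of the linked TaxonEntity


def entities_for_organism(organism_id, identifications, entities):
    """Follow every identification of one organism to its linked entity."""
    return [entities[i.to_taxon] for i in identifications
            if i.organism_id == organism_id]


entities = {
    "entity:1": TaxonEntity("entity:1", "Aus bus", "Smith 1950"),
    "entity:2": TaxonEntity("entity:2"),  # no name or reference yet
}
identifications = [
    Identification("organism:42", "entity:1"),
    Identification("organism:42", "entity:2"),
    Identification("organism:7", "entity:1"),
]
linked = entities_for_organism("organism:42", identifications, entities)
```

Because both the name and the reference are optional, the same node can also stand in for an unnamed, unpublished group while still receiving identifications.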

@deepreef

deepreef commented Sep 4, 2018

Many thanks, @baskaufs ! Your post reminded me of our very animated discussions of "dwc:Organism", which in the end was, in my opinion, an extremely useful exercise. Evidently it was also successful, in that unlike this never-ending discussion about taxon (which parallels the never-ending debates about "What is a species?"), the "Organism" discussion seemed to come to a stable close (or perhaps no one cares enough about it to debate it anymore?)

In any case, I really like (and agree with) your point that "we ended up defining the class in a way so that it "did" what we wanted it to do, rather than defining it to "be" what we thought it should be." To be honest, I think that applies to the definitions of all of our terms (not just dwc:Organism). We like to think we're modelling nature as it is; but that's not what we're doing. We're modelling how to track information about nature in a way that makes it easier for us to answer the diverse set of questions we want to ask about it.

In that context, and having participated in the "taxon definition" discussions since the 1990's (the discussions began earlier than that), I actually feel that this discussion is making some novel progress, which I think is a good sign that we may be able to achieve some consensus in moving forward. Your post above made me realize why I think we're getting somewhere: in the past, the debate always got bogged down in "what IS a taxon?" (~= "What IS a species?") However, I think you captured a key point that I hadn't been able to put my finger on before, which is that we shouldn't spin our wheels endlessly trying to define what IS a taxon, and instead focus on how we want to define a taxon entity such that it fulfills our desires to answer the diverse set of questions we want to ask about nature.

We seem to have mostly stabilized on what a TNU is (and how it's used). The outstanding question is whether there ought to be a separate entity (with a separate pool of identifiers) to represent a "Taxon Concept". The role such an entity/identifier would play is as an aggregator of TNUs that all represent the same circumscribed set of organisms. Similarly to "Organism", the "Concept" entity would not have many (any?) properties of its own, but rather would serve the function of linking clusters of TNUs together for the convenience of using one identifier to represent a collection of many TNUs.

In principle, I understand the value & simplicity of having such a defined entity (and corresponding identifier). In practice, though, I fear that it will end up as a hodgepodge of fuzzily-defined (to varying degrees) instances whereby different people will aggregate different sets of TNUs differently into concepts. The only way I can see it working effectively is via an additional "join" entity similar in many ways to Identifications for assertions about which TNUs map to which concepts (and that will start to get messy). The problem is that I'm not sure how effective that will be in helping us to answer the diverse set of questions we want to ask about nature.

Instead, I'd like to see us pin down the definition of TNU (and its various flavors, including Protonyms, Treatments, etc.), then flesh out a few million instances of them with their core properties (especially heterotypic synonym mapping), then allow the need for a "Concept" entity to emerge (or not) from that.

Again, sorry for the long diatribe...

@baskaufs

Contributor

baskaufs commented Sep 5, 2018

Cool! After spending time looking at how other standards organizations work, I'm increasingly convinced that the effective way to work is to define the use cases first, then develop the standards while testing the proposed features of the standard against those use cases. That's basically what you've proposed - define what we would like for TNUs and taxon concepts to do, then try to build the system to make them work. Keep the features that work, discard the ones that don't. THEN write the standard describing how the features were successfully implemented.

@deepreef

deepreef commented Sep 5, 2018

OK, then maybe one way to establish use cases is to enumerate some questions we would like to ask about organisms in nature, specifically related to taxa and their names (starting with the pedantic ones and moving on to more general ones):

Nomenclature
In what publication was a scientific name first established?
Is a scientific name available/validly published in the sense of the Code?
Is a scientific name a homonym (either within a Code or across Codes)?
What spelling variants have been used for a scientific name?
What objective (Code-governed) synonyms exist for a scientific name?
Where is the type specimen for a scientific name?

Taxonomy
What other names has a scientific name been regarded as a subjective synonym of?
What other names have been regarded as a subjective synonym of a given scientific name?
Is a scientific name considered valid according to a specified Meta-Authority?
What other names are considered as subjective synonyms of a scientific name/what other name is a scientific name considered a subjective synonym of, according to a specified Meta-Authority?
How stable has the subjective synonymy or validity of a scientific name been over time?
How do the circumscriptions of the same scientific name by two different authorities compare to each other?
How many type specimens (and of what names) are included within a particular circumscription?
What other circumscriptions are congruent with/include/are included in/overlap with a particular circumscription?

Classification
What different genera has a species epithet been combined with?
What parent taxon has a child taxon been included within?
What child taxa have ever been included within a parent taxon?
What child taxa are included within a parent taxon according to a specified Meta-Authority?
How stable has the classification for a given taxon been over time?

Biodiversity
What taxon name is an Organism (specimen/occurrence) currently identified as?
What taxon names has an Organism ever been identified as?
What is the currently accepted scientific name of a particular Organism, according to a specified Meta-Authority?
What Organisms (with their respective occurrence metadata, such as locality, etc.) are currently identified to a scientific name or regarded as falling within a taxon circumscription, according to a specified Meta-Authority?
[....]

OK, I got tired of writing these questions, but there are obviously many of these kinds of questions we would like to be able to answer.

In my mind, use cases involve sets of these questions to allow us to traverse from a given set of inputs to a given set of outputs.

For example, a use case might be:
"Give me a list of all species and associated occurrences recorded for a given geographic region, including both the accepted name according to the most recent Catalog of Life, as well as the names that the occurrences are currently identified as."

Another might be:
"For a given scientific name, let me know what homonyms exist, and for each homonym give me the current status and classification of the name according to different Meta-Authorities, a complete list of all names that have ever been regarded as a synonym (either junior or senior) as well as all known spelling variations and combinations."

To fulfill these use cases, we'd need to be able to answer several of the questions above.

I don't know if this is the right strategy to identify how best to proceed on this discussion and its desired outcome, but it seems to me that enumerating questions of this sort both builds the foundations for addressing Use Cases (or, perhaps, enumerating the Use Cases allows us to figure out what questions we need to answer to fulfill them), and allows us to be more specific about what entities we need to define, and what properties for each entity we need to capture.

Hoping that was at least somewhat helpful....

@nfranz

nfranz commented Sep 7, 2018

Hi all. I'd like to be part of this, at some level. I'd also like to suggest that doing taxonomic concepts well is in an important sense a shift in value system, or value assignment. Technical definitions may be somewhat secondary, and agreeing on them is not necessarily critical to my mind. The value shift is this though: a commitment to taxonomic concepts is a commitment to support the process of systematic research/products, with particular emphasis on making the provisional, evolving, and frequently locally and temporally conflicting aspects of systematic inference and product use explicit, and indeed prioritizing software design and functions to showcase the provisional, evolving, and conflicting aspects of systematic inference making and usage. To the extent that this group can make such a commitment, I'd be excited to contribute.

@ianengelbrecht

ianengelbrecht commented Sep 8, 2018

@baskaufs, thank you for your summary of the TCS discussion thread - really fantastic. Very helpful to be able to see that history. I strongly agree that much valuable insight is often lost in the transience of internet forum and email discussions.

@nfranz

nfranz commented Sep 8, 2018

I'd like to again point to this publication https://doi.org/10.1186/s13326-017-0174-5 which is on top of the thread. Please consider reading it in full. This is an ontology (proposal, if you will) that is also pilot-implemented here: http://openbiodiv.net/. It was part of a Ph.D. thesis, sponsored also by a biodiversity data publishing house, whose aims are well aligned with those of the TNC. It has a lengthy section "Domain Description" in which the issue of representing taxonomic concepts is tackled. I am not saying that there are no other important efforts, but if I had to point to a single most indicated descendant of the 2005 TCS, this just is it. I believe that if we take this paper and approach as a pragmatic foundation and begin to understand what services it can provide and which it cannot, we have a strategy to advance effectively.

@deepreef

deepreef commented Sep 8, 2018

Many thanks for re-linking this publication, @nfranz! I thought I had clicked on your original link, but evidently not as this is the first I'm seeing the full publication. Although I do have some minor philosophical quibbles (e.g., I still fail to understand how a taxon concept can justifiably be called a "hypothesis", rather than an asserted opinion -- I don't agree with the arguments put forth about falsifiability), once I got past those I found the article to be very useful in framing the problem we're up against with this discussion. It's definitely worth carefully reading by anyone interested in this sort of stuff.

I do have a couple of technical questions that are most likely due to my ignorance of OpenData (SPAR Ontologies, etc.), but I'm going to take a risk and ask them anyway. Perhaps you can help clarify these.

The article states that "Taxonomic Article is a subclass of FaBiO’s Journal Article". However, several other subclasses of FaBiO's Expression class (e.g., books, chapters, etc.) also contain taxonomic treatments. Is this a problem for implementation, or are we only interested in treatments that appear in articles, or...?

The article states "In OpenBiodiv-O, a taxonomic name usage is the mentioning of a taxonomic name in the text, optionally followed by a taxonomic status." If a name is mentioned several times within a single treatment, does that represent more than one TNU sensu OpenBiodiv-O? Or are they collectively contained within a single TNU (e.g., represented by the NomenclatureHeading)? The reason I ask is that there is a subtle but important distinction between a TNU (which encompasses the entire treatment in cases where the TNU is the NomenclatureHeading), and what James Ytow referred to as "Appearances" (individual mentions of name-strings, often with abbreviated genus), which may appear many times within the context of a single TNU. I ask because, in the paragraph that follows ('For example, “Heser stoevi Deltschev 2016, sp. n.” is a taxonomic name usage.'), it seems that the TNU is the raw text string, not the Treatment as a whole, in which case the definition of TNU as asserted in the context of OpenBiodiv-O is a significant departure from how it has been defined elsewhere.

An important aspect of TNUs is that there is generally a 1:1 correspondence between a Treatment and the TNU representing the NomenclatureHeading for the Treatment. However, as implied by Figure 1 of the article, a treatment often contains other TNUs (e.g. within the NomenclatureCitationList). Thus, while every Treatment has exactly one corresponding TNU, not all TNUs are treatments.

I very much like the way that "TaxonomicConceptLabel" (TCL) is defined. However, I'm not entirely sure I understand the need for establishing OperationalTaxonomicUnit as a superclass of TaxonomicConcept. In my mind, Taxonomic Concepts represent a circumscription of organisms, regardless of whether that circumscription happens to include a specimen (or more than one specimen, when heterotypic synonymy is involved) designated as a name-bearing type for a Linnean-style taxonomic name (i.e., regardless of whether the concept has a formal scientific name to label it with). Can you provide examples of instances of OperationalTaxonomicUnit that would not be regarded as instances of TaxonomicConcept? I.e., what other subclasses of OperationalTaxonomicUnit are there, and what function do they serve?

Regarding the two patterns, replacement name and related name, is the former a subset of the latter? Or are these mutually exclusive? It seems that replacement name implies congruence of concept/circumscription, whereas related name could apply to all five RCC-5 relations (or only the other four, excluding congruence), or...?

Sorry for the long post -- just trying to make sure I understand the contents of and assertions in the paper correctly.

@rdmpage

rdmpage commented Sep 14, 2018

I may live to regret this, but can I suggest another way of tackling this topic? I'm going to try and be disciplined and avoid a WTF rant, and instead sketch out a way I think we can create something simple, and which might lead to some tools that people might find useful. I'm a fan of keeping things simple, reusing things, and trying to take into account what is going on elsewhere. For example, the http://schema.org vocabulary is gaining momentum, and covers a lot of things we care about (publications, people, places, etc.). I make extensive use of it in my latest toy https://ozymandias-demo.herokuapp.com.

Interestingly, there is a community project, BioSchemas, to extend http://schema.org to include more life-science specific entities (a number of people on this list will be aware of this already). So it seems to me there's a case to be made for avoiding domain-specific vocabularies as much as possible, and trying to make our stuff as interoperable with the wider world as we can.

Taxa

I regard taxa as nodes in a tree. What a taxon "is" is defined by its place in that tree (although identifiers don't change if the composition changes, that way lies madness). A taxon in NCBI is ultimately all the organisms that yielded the sequences in the subtree rooted at that node. A taxon in GBIF is ultimately all the occurrences in the subtree rooted at that node.

There's a proposal by @frmichel for taxa in BioSchemas (https://github.com/BioSchemas/specifications/tree/master/Taxon). This seems pretty straightforward and uses terms that will be familiar. If we use this for taxa (i.e., nodes in a classification) then we have a simple vocabulary that anyone can use, from people working in genomics with the NCBI taxonomy, to people building little taxon-specific web sites and who want to increase their visibility to Google by including structured markup (the primary driver behind schema.org).

Lots of people care about taxa, let's give them a simple way to talk about them.

Names, usages, etc.

It seems to me that the core idea here is the pair ('a name string', 'a bibliographic locator'). The bibliographic locator can be at the level of a "work" (e.g., an article or book), in which case an identifier like a DOI is the obvious candidate. If we want metadata, schema.org has terms to cover pretty much any aspect of an article or other publication.

If we want more granularity, then the W3C Web Annotation Data Model covers pretty much everything, see https://www.w3.org/TR/2017/REC-annotation-model-20170223/#selectors. So we can refer to whole work, individual pages, XPath fragments in an XML document from, say, Pensoft, regions on a scanned page, etc. A further advantage of this is that tools such as hypothes.is use these selectors to locate annotations, and many academic publishers are adopting hypothes.is as their annotation tool.

So, nomenclators are essentially lists of annotations (think of IPNI where each record is basically a name and a page location). Treating "usages" as annotations makes it easy to integrate projects such as BHL: index all the pages for names, record their locations as annotations, and flag those annotations that have some special significance (e.g., the first publication of a name). Imagine developing a tool that overlays BHL (or any literature database) and says "here are the names on this page, and by the way this is where this species name was published".

Some people care about names; many more people care about searching for information anchored to a name. Use one to drive the other, and use a model that can handle both automatic text indexing and manual annotation. Name usages are basically annotations. The LSIDs in databases such as IPNI, Index Fungorum, ZooBank, and ION are identifiers for annotations (not "names" as such). It seems to me that name usages in the National Species Lists (NSL) are essentially annotations (with rather a lot of administrative cruft attached).
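
To make the "usage as annotation" idea concrete, here is a minimal sketch, assuming the JSON-LD shape of the W3C Web Annotation Data Model (an Annotation with a tagging body and a target located by a TextQuoteSelector). The helper name `name_usage_annotation` and the source URI are hypothetical placeholders for illustration, not anything a nomenclator actually exposes.

```python
import json

def name_usage_annotation(name_string, source_uri, exact_text):
    """Build a Web Annotation-style dict for one (name string, bibliographic locator) pair."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        # The body carries the name string being asserted at this location.
        "body": {"type": "TextualBody", "value": name_string, "purpose": "tagging"},
        # The target is the bibliographic locator: a work plus a selector inside it.
        "target": {
            "source": source_uri,
            "selector": {"type": "TextQuoteSelector", "exact": exact_text},
        },
    }

# Hypothetical source URI; the quoted text is the example heading cited earlier in the thread.
annotation = name_usage_annotation(
    name_string="Heser stoevi",
    source_uri="https://example.org/some-article",
    exact_text="Heser stoevi Deltschev 2016, sp. n.",
)
print(json.dumps(annotation, indent=2))
```

Flagging "special" usages such as the first publication of a name would need some extra property or a second annotation; how best to do that is left open here.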

Taxonomic concepts

This seems to be the third-rail of this discussion. I'd argue that few people care about this topic, despite the acres of space devoted to it. The reason for that is that most people use whatever taxonomic classification is available to navigate the data they care about (e.g., the NCBI taxonomy if you work with sequences), and a taxonomic classification is essentially also a taxonomic concept (arguably they are the only concepts that are actually defined in any operational way). So, as a user, most people don't care. The proof of this is that science gets done without taxonomic concepts (we can argue about whether that's a good thing or not).

The one version of taxonomic concept that seems tractable is the "accordingTo" idea, in other words if I'm writing a paper I can say "when I use this name I mean this". This could be something as simple as saying "subgenus Stegomyia NCBI:53541" for NCBI's view of mosquito taxonomy. If I want to refer to a different concept of what Stegomyia is (and this is a very touchy subject in mosquito taxonomy) I could cite another work, in other words (Stegomyia, DOI:xxxxx). So, a taxonomic concept is a set of one or more (name, bibliographic locator) pairs. Hence, we just need a way to represent a set (or ordered list if we think of it as a list of synonyms), and schema.org has ways to represent those.

So, in its simplest form, the NCBI taxonomic concept of Stegomyia is (Stegomyia, NCBI:53541) (i.e., itself). I think this is the model also used by the Australian Faunal Directory where the authority for each taxon in the AFD classification is, of course, the AFD. We could expand the concept by listing all the synonyms, to make it more useful. If I understand the NSL model correctly, they link each node in their classification to a (name, reference) pair that corresponds to the concept in the tree.

People who care about taxonomic concepts (e.g., doing taxonomy, building classifications and trying to make sense of the literature) can describe these concepts as sets of (name,reference) pairs, which seems to me to be pretty much what taxonomists actually do.
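
As a rough sketch of that idea, a concept could be modelled as nothing more than a set of (name, bibliographic locator) pairs. The type and function names below (`NameUsage`, `concept`) are hypothetical; the Stegomyia locators are the ones used above, with the DOI left as the same placeholder.

```python
from typing import FrozenSet, NamedTuple

class NameUsage(NamedTuple):
    """One (name string, bibliographic locator) pair."""
    name: str
    locator: str  # e.g. a DOI or a database identifier

def concept(*usages: NameUsage) -> FrozenSet[NameUsage]:
    """A taxonomic concept as an unordered set of name usages."""
    return frozenset(usages)

# NCBI's view of Stegomyia, defined by reference to itself.
ncbi_stegomyia = concept(NameUsage("Stegomyia", "NCBI:53541"))

# A different view of Stegomyia, citing another work (placeholder DOI as in the text).
other_stegomyia = concept(NameUsage("Stegomyia", "DOI:xxxxx"))

# Same name string, different concepts, because the locators differ.
same_name = {u.name for u in ncbi_stegomyia} == {u.name for u in other_stegomyia}
same_concept = ncbi_stegomyia == other_stegomyia
print(same_name, same_concept)
```

Using a list instead of a frozenset would give the ordered synonym-list variant.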

Summary

I don't claim much originality here, and may well have completely misunderstood the discussion. But it seems to me there's a chance to adopt a simple, workable approach that builds on existing projects that have traction (e.g., schema.org, the W3C annotation model, bioschemas?) and hence get to the point where we, you know, build stuff that people want and need.

@deepreef

deepreef commented Sep 14, 2018

@jgerbracht

jgerbracht commented Sep 14, 2018

Conceptually, I agree with most of what Richard and Rod describe: taxa are nodes on a tree. But what happens when the tree branches are completely rearranged, and/or there are multiple trees made up of the same branches but in a different arrangement (as is currently the case with birds)? These are the scenarios that I think the Taxonomic Concept or Name accordingTo really helps to organize accurately, especially for any data aggregator, be it GBIF, EOL, Wikipedia or a researcher bringing together data on the same Taxonomic Concept from different domains.

A clarification on "What a taxon "is" is defined by its place in that tree (although identifiers don't change if the composition changes, that way lies madness)."
A taxon, or Taxonomic Concept, isn't defined so much by its place on the tree, but by what branches and leaves are under that node. And I want to make sure I understand the "although identifiers don't change if the composition changes, that way lies madness" statement. If you are saying that a node has ID 123 and if branches under that node get added or removed, the node ID should still remain as ID 123, then I would agree, Madness!!

The reason I think an ID is needed to identify each Taxonomic Concept as opposed to a Name accordingTo, is that with the ID, users of these data don't need to go through the mapping exercise of their Name with Names from other providers. All instances of Name accordingTo would have the same Taxonomic Concept ID, so that the ID can be used to aggregate data. If there is one thing I've learned, the harder it is to aggregate data, the less likely it is to be aggregated by the users. I'm really thinking of this from the end user perspective, if we don't make that part simple, it won't be used.
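
To illustrate the aggregation point, a minimal sketch, assuming a lookup table from each (name, accordingTo) pair to a shared Taxonomic Concept ID already exists. All names, checklist labels, and IDs below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping maintained by the taxonomy provider.
usage_to_concept = {
    ("Name A", "Checklist 2017"): "TC:1",
    ("Name B", "Checklist 2018"): "TC:1",  # different name, same concept
    ("Name C", "Checklist 2018"): "TC:2",
}

# Hypothetical occurrence records coming from different providers.
records = [
    {"id": "occ1", "name": "Name A", "accordingTo": "Checklist 2017"},
    {"id": "occ2", "name": "Name B", "accordingTo": "Checklist 2018"},
    {"id": "occ3", "name": "Name C", "accordingTo": "Checklist 2018"},
]

def aggregate_by_concept(records, usage_to_concept):
    """Group record IDs by concept ID; unmapped usages end up under None."""
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["name"], rec["accordingTo"])
        grouped[usage_to_concept.get(key)].append(rec["id"])
    return dict(grouped)

by_concept = aggregate_by_concept(records, usage_to_concept)
print(by_concept)
```

The end user only groups on the ID; the hard part, building and maintaining `usage_to_concept`, stays with whoever knows the taxonomies.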
Jeff

@nielsklazenga

Member

nielsklazenga commented Sep 20, 2018

@jgerbracht: I note that you are not watching the repository, which is why you missed the meeting invitation (#4).

@nfranz

nfranz commented Sep 20, 2018

Thanks, @deepreef . Mostly yes. I think we as information managers have a special role to focus on and design for long-term, deep-time biological data integration. And that should be reflected in our notion of Taxonomic Concept. Other sections of biology are more focused on utilizing instant snapshots of the taxonomic data landscape and behave as if these can be taken to represent knowledge of nature, i.e. Taxa (example of that kind of behavior: https://www.jstor.org/stable/3496386). So I'm saying, let's not appropriate Taxon from that kind of important use; let's give those sections of biology the benefit to make confident snapshot knowledge claims, and be the complement to our more integration-focused representations. Even though the lines are not clear-cut, we actually need both manners of speaking. And for the TCS, we need to signal clearly where we information managers stand.
Notice that this can be said without getting into more ontological issues (in this sense https://plato.stanford.edu/entries/logic-ontology/#Ont). Allowing Taxon to be what knowledge-claiming biologists need it to be at a given time, should not mean that the TCS has to make ontological claims regarding Taxa.

@deepreef

deepreef commented Sep 20, 2018

Great! Thanks, @nfranz! I COMPLETELY agree with what you say above about the special role and focus. My query was really a semantic issue -- trying to confirm your meanings of the word "Taxon" vs. "Taxonomic Concept". In that regard, I think we should all be consistent in referring to the word "Taxonomic Concept" as representing an asserted circumscription, and leave the unqualified term "Taxon" out of our conversations and documentation as much as possible (except when mapping to dwc:Taxon).

My only caution is that we can't assume that everyone in our audience will inherently understand the difference between "Taxa/Taxon" and "Taxonomic Concepts" the way you have distinguished them. I suspect that outside of this particular (TCS) context, many/most people (including me) regard them as loosely synonymous with each other. But I fully agree that within this TCS discussion context, we should adopt more precise (though sufficiently flexible) meanings to these terms and use them consistently. The same will apply to all sorts of "Name" related terms (e.g., the difference between a Taxonomic Name, Protonym, text-string/literal, etc.)

@jgerbracht

jgerbracht commented Sep 20, 2018

@deepreef I agree completely, let's agree on a definition of Taxonomic Concept and try to leave Taxon out of our documentation. We have a hard enough time, I think, clearly communicating the difference between a Taxonomic Concept and a Taxon Name Usage. @nfranz I would, in principle, agree that keeping a definition of Taxonomic Concept as 'general' as possible is likely a good thing, unless it becomes so general that it remains a term used both for concepts and for name usages. I think that's one of the core reasons we're revisiting this, because the Taxonomic Concept in TCS wasn't clear enough and was open to broad interpretation.
I'll go back to an earlier post, can we pen and agree to some working definitions of these two? That will certainly help me think about these things.

Re the example I gave of 100+ TNUs mapped to a single TC id, i.e. using/modeling for TC ids vs TNUs and relationships. @deepreef brings up some real life issues that we need to tackle and those real life issues are EXTREMELY difficult to resolve and require someone intimately familiar with the taxonomies at hand. I THINK the underlying issue of changes in either the TNU to TC mapping as I proposed or the TNU to TNU mappings is the same, i.e. both approaches are "a subjective assertion that is prone to subsequent alteration/revision" and how the mapping is fixed or how the relationships are fixed is still the same problem. From this perspective, I don't see an advantage of one over the other, but I'm happy to be wrong.

@deepreef

deepreef commented Sep 20, 2018

Thanks @jgerbracht -- I agree completely. The fundamental problem (and the reason we've never really solved this issue before) is because there are some extremely complex and subtle/nuanced relationships between organisms, names, and taxonomic relationships/classifications, and these complex issues have been further confounded by confused and inconsistent terms to describe some fundamental things.

As for Taxonomic Concepts and TNUs, I think the best way to characterize this goes back to Walter Berendsohn's notion of a "Potential Taxon" -- which in our terminology would be a "Potential Taxonomic Concept". A TNU represents the cloud of information and properties for how a particular Reference treated a particular Protonym (=Name-as-object). A reasonably well-defined subset of TNUs represent "Potential Taxonomic Concepts".

One of the key questions we need to figure out, with respect to the second paragraph of your post above, is whether it makes sense to collapse a set of TNUs representing confidently congruent Taxonomic Concept circumscriptions into a single "Taxonomic Concept Instance" with its own identifier and properties. I definitely think it's worth exploring, but it might make sense to first clearly define TNUs and the relationships among them; then figure out what a secondary layer that aggregates congruent TNUs into a single defined object instance would look like. In this sense, it's important that TNUs are defined in such a way that they can be easily aggregated in this fashion, if it ends up making sense to do so.

@rdmpage

rdmpage commented Sep 22, 2018

Reading through these threads I keep trying to figure out what problems we are trying to solve? I confess that I struggle with abstractions that don’t readily translate into something that I could imagine using and/or building. I also find it helpful to have actual examples to focus on.

Looking at eBird as an example (and @jgerbracht can correct me if I’ve misunderstood) there seem to be several problems to tackle:

  1. How do we represent a given classification (e.g., bird classification for August 2017).
  2. How do we enable users of a taxonomy to refer to a particular taxonomy (e.g. August 2017)?
  3. If users refer to a taxon without reference to a particular taxonomy, how do we resolve that reference?
  4. How do we compute and represent the changes between the August 2017 and August 2018 taxonomies?

It seems to me that 1 is straightforward, we simply define a way to represent a tree. Many biodiversity informatics projects use trees (classifications) to help users navigate through data. Note that the tree could be explicitly defined (e.g., as a tree structure in a file) or implicitly (say, as a checklist in a paper).

2 is also straightforward if we have identifiers for classifications, and optionally some way of locating a node in a tree, again, either explicitly in a tree structure, or on a page in a published checklist. (I could see an obvious role for GBIF here in that you could publish a checklist on GBIF and use the resulting DOI to identify that taxonomy.) So I think what would be useful here is a convention for explicitly citing a given taxonomy (formalising “sec”). There is scope for exploring the best way to identify nodes in a tree (e.g., do we simply cite a node name and tree version, or do we have identifiers like eBird's that remain unchanged between trees if the node is the “same”?).

3 is either trivial or difficult, depending on how you approach it. Given that the vast majority of references to taxa will be by name, we either accept the ambiguity and treat this as effectively a search (find me every taxonomy with that name) or endeavour to work out what particular classifications a publication at a certain date may apply to (e.g., what versions of bird taxonomy were in use at that time?).

4 is perhaps the most interesting topic, and we have seen at least two ways to think about this: either do pairwise mappings between nodes in the two trees, or compute edit operations between the two trees.

Given that we are having the discussion on GitHub it may come as no surprise that I view 4 as essentially versioning. If the 2017 tree was in GitHub, we could imagine editing it as each new paper on avian taxonomy comes out, then freezing the tree and releasing a new version in 2018. The “diff” between tree 2017 and tree 2018 defines the differences between the two trees.
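
A minimal sketch of such a "diff", assuming each classification is stored as a child-to-parent mapping and that node labels are stable between versions (a big assumption). The function name `diff_classifications` and the toy taxa are hypothetical.

```python
def diff_classifications(old, new):
    """Compare two child -> parent mappings and report added, removed, and moved nodes."""
    added = {node: new[node] for node in new.keys() - old.keys()}
    removed = {node: old[node] for node in old.keys() - new.keys()}
    moved = {
        node: (old[node], new[node])
        for node in old.keys() & new.keys()
        if old[node] != new[node]
    }
    return {"added": added, "removed": removed, "moved": moved}

# Toy trees: in the later version a new genus appears and one species moves into it.
tree_2017 = {
    "Genus A": "Family X",
    "species a1": "Genus A",
    "species a2": "Genus A",
}
tree_2018 = {
    "Genus A": "Family X",
    "Genus B": "Family X",
    "species a1": "Genus A",
    "species a2": "Genus B",
}

changes = diff_classifications(tree_2017, tree_2018)
print(changes)
```

This only detects changes in placement. A split or lump that reuses an existing label would look like "no change" here, which is where explicit concept mappings would have to come in.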

So, I see three “products” that would be useful:

  1. a standard for describing a classification
  2. a standard for citing a classification and/or location in a classification (I’m using “classification” so we can include both trees, networks, and publications)
  3. standards for describing relationships between trees (e.g., mappings and edit operations)

For me a really interesting test case would be to take, say, the August 2017 eBird classification, take all the taxonomic work between 2017 and 2018 (listed on the eBird site), represent those works in terms of 2 and 4 above, that is, they reference the 2017 classification, and they describe the changes made (e.g., subspecies x is now a full species in a different genus if you think in terms of edit operations, or the equivalent set relationships if you think in terms of mapping), and see if we can then compute the August 2018 tree using just that information. This would mean we could have a way to describe taxonomic information that was computable and could be used to generate new classifications.
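
A rough sketch of what that test could look like in the edit-operation view, assuming a classification is a child-to-parent mapping and each published change is reduced to an `add`, `move`, or `remove` operation that cites its source. The operation vocabulary, the `apply_edits` name, the toy taxa, and the source labels are all hypothetical.

```python
def apply_edits(tree, edits):
    """Return a new child -> parent mapping with the edit operations applied in order."""
    result = dict(tree)  # leave the cited classification untouched
    for edit in edits:
        op, node = edit["op"], edit["node"]
        if op == "add":
            result[node] = edit["parent"]
        elif op == "move":
            if node not in result:
                raise KeyError(f"cannot move unknown node: {node}")
            result[node] = edit["parent"]
        elif op == "remove":
            del result[node]
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

tree_2017 = {"Genus A": "Family X", "species a1": "Genus A", "species a2": "Genus A"}

# Each edit cites the (placeholder) work that proposed it.
edits = [
    {"op": "add", "node": "Genus B", "parent": "Family X", "source": "paper-1"},
    {"op": "move", "node": "species a2", "parent": "Genus B", "source": "paper-1"},
]

tree_2018 = apply_edits(tree_2017, edits)
print(tree_2018)
```

If the computed tree matches the published 2018 classification, the edit list is a complete, citable description of the changes; any mismatch points at changes that were not captured.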

If taxonomic information was described in that way then it would seem that the goals of aggregators and taxonomists could be aligned: the aggregator’s task is easier because the data is well described in nice, computable, citable chunks, which means the taxonomist’s work gets quickly incorporated into the aggregation in a way that gives them credit and visibility.

@nfranz

nfranz commented Sep 23, 2018

Going to point to this as an example of doing 4: https://doi.org/10.1093/sysbio/syw023.

@rdmpage

rdmpage commented Sep 23, 2018

@nfranz Thanks! Maybe we should assemble a set of relevant examples, such as the primate study you linked to, the eBird classifications, etc., and use those as test cases? For example, given the two MSW primate classifications an obvious question is how we can represent MSW2, MSW3, and the relationships between them using a simple vocabulary. Related to that goal, can we then link names and literature to those, so we could imagine giving someone a set of files and saying "here is the history of primate classification linked to all the relevant publications, enjoy!".

@baskaufs

Contributor

baskaufs commented Sep 23, 2018

+1 for assembling use cases

@nielsklazenga

Member

nielsklazenga commented Sep 23, 2018

+2

@deepreef It's probably best not to do this in the issues at all. I have created a folder 'use-cases'. Put them in there in any form you like. We can make them consistent (filetype- and design-wise) later.

@deepreef

deepreef commented Sep 23, 2018

+3 :-)

@nfranz

nfranz commented Sep 23, 2018

I can provide links to these, if or as needed.
tcs_use_cases

@baskaufs

Contributor

baskaufs commented Sep 23, 2018

This is a response to @frmichel's comments on the pull request. @frmichel noted problems with the Darwin Core dwciri: terms and with Darwin-SW. Just to clarify about those two things: the DwC RDF Guide (which minted the dwciri: terms) recognized that there were problems with the taxon/taxon concept/TNU in Darwin Core, but did not consider "fixing" them to be within its scope. It simply provided guidance on how to use the existing DwC terms (or their dwciri: analogs) but did not generally suggest how to clarify their meaning or add any new terms that were missing. It assumed that some future group (like this one) would fix that problem.

Darwin-SW was not a TDWG effort, so it has no official standing in TDWG. It suggested a fix for the missing object properties needed to connect the Darwin Core classes, but also basically dodged the issue of clarifying taxon/taxon concept/TNU.

So really, neither of those two efforts should be looked at as a solution. As far as updating the TDWG Ontologies (TaxonConcept and TaxonName) is concerned, I think it would probably be better to just focus our efforts on incorporating the good parts of them into what we build here. Although those two ontologies don't have any official standing within TDWG either, they do reflect one attempt to translate an actual TDWG Standard (TCS 1.0) into the Linked Data/Semantic Web world, and should therefore have some weight in the discussion - particularly since some members of this group have experience trying to implement them. That's really useful information.

@jgerbracht

jgerbracht commented Sep 28, 2018

@rdmpage Re
"1. How do we represent a given classification (e.g., bird classification for August 2017).
2. How do we enable users of a taxonomy to refer to a particular taxonomy (e.g. August 2017)?
3. If users refer to a taxon without reference to a particular taxonomy, how do we resolve that reference?
4. How do we compute and represent the changes between the August 2017 and August 2018 taxonomies?"

I would add a 5th one. How do we track a particular taxonomic concept through time/taxonomies.
This cannot be done solely by computing changes between the two taxonomies. That approach would accurately cover a number of taxonomic changes between versions, and it is critical to have in our tool set. However, there are also a variety of taxonomic changes that cannot be computed and must be done by the taxonomy experts.

@rdmpage

rdmpage commented Sep 28, 2018

@jgerbracht

In a sense it seems to be solved for eBird by the use of stable identifiers between classifications (e.g., radshe1), although it's not clear what rules are used to carry those identifiers across trees. But yes, the success of comparing trees to compute changes does depend on how well labelled the trees are.

However, there are also a variety of taxonomic changes that cannot be computed and must be done by the taxonomy experts.

Can you give an example? I'm not sure that there are things which can't be computed, I suspect it's more a question of whether the changes made (and/or the reasons) are represented with enough precision to be easily converted into something a computer can handle. Taxonomy is a pretty simple affair in many ways, we have sets, we have notions of relationships among those sets, and we have collections of labels to be assigned to those sets. I think it's eminently computable.
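As a rough illustration of the "sets and labels" view, here is a minimal Python sketch of a diff between two versions of a classification keyed by stable identifiers. Only the identifier radshe1 comes from the thread; every other identifier, label, and parent below is invented, and the data layout is an assumption, not eBird's actual format.

```python
# Hypothetical toy data: stable identifier -> (label, parent identifier).
# Only "radshe1" appears in the thread; all other values are invented.
tax_2017 = {
    "radshe1": ("species X", "genus1"),
    "taxonA": ("species A", "genus1"),
    "taxonB": ("species B", "genus1"),
}
tax_2018 = {
    "radshe1": ("species X", "genus2"),  # same identifier, new parent
    "taxonA": ("species A", "genus1"),   # unchanged
    "taxonC": ("species C", "genus1"),   # identifier not present in 2017
}


def diff_taxonomies(old, new):
    """Classify identifiers as added, removed, or changed between versions."""
    old_ids, new_ids = set(old), set(new)
    changed = {
        tid: (old[tid], new[tid])
        for tid in old_ids & new_ids
        if old[tid] != new[tid]
    }
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": changed,
    }


changes = diff_taxonomies(tax_2017, tax_2018)
print(changes["added"])            # ['taxonC']
print(changes["removed"])          # ['taxonB']
print(sorted(changes["changed"]))  # ['radshe1']
```

Note what the diff does not say: it cannot tell whether taxonB was renamed, lumped, or split into taxonC. That link has to be recorded explicitly somewhere, which is the gap discussed in the surrounding comments.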

@jgerbracht

jgerbracht commented Sep 30, 2018

The tracking of taxonomic changes I'm referring to is the tracking of concepts. In cases where concepts are added or removed, the taxonomist is the one who knows the path from taxonomy 2017 to 2018, and retroactively calculating those paths using only the starting and ending taxonomies is currently problematic at best. I agree completely with your statement that it's "more a question of whether the changes made (and/or the reasons) are represented with enough precision to be easily converted into something a computer can handle", and this is something where we can and should strive to help the taxonomic communities (though that's certainly a very different but interesting topic for another day). I was referring to the status of taxonomies today, which do not provide those necessary details (Clements comes close).

@nfranz

nfranz commented Sep 30, 2018

Hi @jgerbracht. Yes, this is why - as I suggested here #1 (comment) - it will be hard to come to an agreement about the scope of TCS2 without resolving at least these two issues upfront:

  1. To what extent does the TCS2 effort not only aspire to model mainstream systematic practice - "what most systematists tended to do or tend to do now" - but also channel or even syntactically enforce an evolution in systematic practice. In other words, does TCS2 have mostly just representative, or maybe also normative (rule setting) aspirations towards systematics? Even more bluntly, is TDWG prepared to get into systematists' grill? I believe we can live with a "yes" or "no" better than with a "maybe, we'll see".

  2. Even if the answer is "yes", what exactly is the role of a standard in this context? My own sense is that expectations towards a standard are too high among some of us. I believe the right level of expectation is that a standard should be designed to facilitate a fairly wide range of practices, ranging from ideal to very real. The challenge here for DwC is that it cannot actually represent something as Taxonomic Concept/Relationship heavy as this: https://doi.org/10.3233/SW-160220. DwC fails to provide the minimally needed syntactic structure for this kind of multi-taxonomy alignment work, and offering such a structure - where the data reflect it - is one function of TCS2. But it would be too much to ask of a standard to be much more than allowing the ideal, and also asking of the standard to enforce the ideal at all times. I've suggested previously that I believe that is mostly the role of specific implementations and communities.

In summary, I think the way to resolve discussions about scope is to first agree on any normative aspirations of TCS2, i.e., whether we are putting this out partly also to help make future systematics practice better, somewhat regardless of the field's legacy. We have sufficient use cases to indicate that "better" is feasible. But we must acknowledge that it remains rare today. [Having done many hundreds of RCC-5 alignments myself, I believe that this is more limited by current incentive structures than by the nature of the data. But that is not so relevant for us now.] Then we need to decide how much of that "making it better" must be allowed by TCS2, versus how much of that must be enforced by it (as opposed to being enforced by TCS2-utilizing implementations).

@rdmpage

rdmpage commented Oct 1, 2018

@nfranz It's not clear to me who TCS2 is for, or at least, there seem to be multiple possible audiences, and I'm not sure taxonomists are likely to be either the biggest or the most important.

Indeed, playing devil's advocate, I'm not entirely convinced there is even a need for TCS2, given that taxonomists, biodiversity informatics projects, and genomics databases (e.g., NCBI) seem pretty happy to pump out taxonomies and lists of names without any vocabularies at all! In other words, it's not clear that people are banging on TDWG's door saying "we can't do our science without TCS2". One can certainly make a case that things could be done better if we had a better way of representing taxonomic information, but what we have at the moment seems to work OK for most purposes.

So I wonder if it would be helpful to have some notion of who the users are, both of TCS2, and of products that use TCS2. At the moment much of the focus seems to be on database builders who:

  • are interested in publishing data (e.g., nomenclators generating data feeds)
  • aggregate data from multiple sources to synthesise taxonomies (e.g., GBIF)
  • serve larger communities whose users want a taxonomic framework for navigation but don't care much about the details of how that classification was arrived at (e.g., GBIF, eBird, NCBI).

Now, there is certainly a case that working taxonomists could make their work more accessible to machines by marking up their work, and providing easy means to do that would be a great TCS2 use case, although the vast majority of taxonomic work is not published in journals that support any kind of mark up. Likewise, being able to provide TCS2-enabled things that taxonomists would find useful would be great (e.g., for any taxon give a summary of the current and past taxonomies, a complete bibliography - linked to digital versions where possible, a list of relevant specimens, especially types, essentially a "project in a box").

So I think in part any expectation of what a standard can achieve depends on who you think it is for. I don't think taxonomists care at all about 99.9% of what TDWG does; they will care about anything which makes their life easier, and which helps increase the visibility of their work. I think the people who care about TCS2 will be mostly limited to those dealing with large chunks of data, either publishing it, aggregating it, or both.

@mdoering

mdoering commented Oct 1, 2018

thanks @rdmpage, fully agree. And I can give you at least a very concrete request from the CoL+ project which seeks a new standard to share nomenclatural and taxonomic data in CSV files. DwC-A has various issues, TCS XML is actually quite alright but hard to work with, the TDWG ontology is even harder yet.

I would love to see something compatible with datapackages which could replace your custom dwc archives and free us from the "star" restriction

@nfranz

nfranz commented Oct 1, 2018

Thanks, @rdmpage. When Jessie Kennedy led the TCS1 effort, the scope of users was inclusive; see:
http://seek.ecoinformatics.org/attach%3Fpage=ScienceTaxon_12_May_2004%252FWhy_do_we_need_a_taxonomic_concept_transfer.ppt (particularly slides 6-8).

And the primary underlying motivation for TCS1 was the systemic inability of name-based systems to be taxonomically precise enough: https://www.napier.ac.uk/~/media/worktribe/output-255552/scientific-names-are-ambiguous-as-identifiers-for-biological-taxa-their-context-and.pdf

Also echoed here: https://www.researchgate.net/publication/6886479_A_Standard_Data_Model_Representation_for_Taxonomic_Information

I vote for preserving that still very much valuable problem diagnosis legacy of the 2005 TDWG-ratified TCS1. The primary purpose was and still is to do name/relationship management as well as possible, and do better where possible with TCS2-facilitated syntax.

In that context, I think the right long-term strategy is to be more engaging towards the systematic expert community. Jessie Kennedy's history with TDWG and TCS1 possibly began with this, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.2436&rep=rep1&type=pdf, which is in line with the trajectory of supporting expert systematic workflows.

TCS2 can be viewed as an opportunity to bring TDWG and the systematic research community closer together.

@deepreef

deepreef commented Oct 1, 2018

@nfranz

Even more bluntly, is TDWG prepared to get into systematists' grill? I believe we can live with a "yes" or "no" better than with a "maybe, we'll see".

I vote a resounding "no", in the context of TCS2.

whether we are putting this out partly also to help make future systematics practice better, somewhat regardless of the field's legacy.

It's not too hard to make the standard accommodate the future/ideal (when the information is available), without drowning out an effective basic mechanism for capturing what we can capture from less-than-ideal legacy sources. The enforced components should be kept to a minimum.

@rdmpage

but what we have at the moment seems to work OK for most purposes.

Hmm... what we have at the moment doesn't allow anyone to filter GBIF data on all taxa identified as X or identified as something regarded by authority Y as being a synonym of taxon X (to take an extremely over-simplified example use case that I think "most" users would like to be able to do). I think the role of TCS2 should be to allow us to capture taxonomic metadata associated with biological datasets in a way that enables automated data-enrichment through various online services (e.g., CoL+). In short, what we should be aiming for is allowing a non-taxonomist user-base to get the answers they want/need without having taxonomic expertise themselves. The status quo definitely does NOT allow this (it only makes people think they have it because they are only using text strings to represent scientific names, and are blissfully unaware of how anemic the results sets are because of it).
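The over-simplified use case above (records identified as X, or as something authority Y treats as a synonym of X) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names, authorities, and record layout are placeholders, and it is not how GBIF or CoL+ actually model or serve synonymy.

```python
# Hypothetical synonymy: (authority, accepted name) -> names that this
# authority regards as synonyms. All names and authorities are placeholders.
synonymy = {
    ("Authority Y", "Aus bus"): {"Aus cus", "Xus bus"},
    ("Authority Z", "Aus bus"): {"Xus bus"},
}

# Hypothetical occurrence records identified only with name strings.
records = [
    {"id": 1, "identified_as": "Aus bus"},
    {"id": 2, "identified_as": "Aus cus"},
    {"id": 3, "identified_as": "Xus bus"},
    {"id": 4, "identified_as": "Aus dus"},
]


def records_for_taxon(records, accepted_name, authority, synonymy):
    """Return ids of records identified as the name or one of its synonyms
    according to the given authority."""
    names = {accepted_name} | synonymy.get((authority, accepted_name), set())
    return [r["id"] for r in records if r["identified_as"] in names]


print(records_for_taxon(records, "Aus bus", "Authority Y", synonymy))  # [1, 2, 3]
print(records_for_taxon(records, "Aus bus", "Authority Z", synonymy))  # [1, 3]
```

The same query returns different record sets depending on the authority, which is exactly the information a plain name string cannot carry.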

@jgerbracht

jgerbracht commented Oct 2, 2018

@deepreef Well said
"I think the role of TCS2 should be to allow us to capture taxonomic metadata associated with biological datasets in a way that enables automated data-enrichment through various online services (e.g., CoL+). In short, what we should be aiming for is allowing a non-taxonomist user-base to get the answers they want/need without having taxonomic expertise themselves. The status quo definitely does NOT allow this (it only makes people think they have it because they are only using text strings to represent scientific names, and are blissfully unaware of how anemic the results sets are because of it)."

@nfranz "channel or even syntactically enforce an evolution in systematic practice" I agree that TCS2 should force systematic practice, though if TCS2 is done well, the standards and eventually, the tools will be available to enable the evolution of systematic practice in regards to how one thinks about and manages taxon concepts.

@nfranz

nfranz commented Oct 2, 2018

@deepreef writes: "In short, what we should be aiming for is allowing a non-taxonomist user-base to get the answers they want/need without having taxonomic expertise themselves."

But, is that not very often the happy secondary effect or by-product of this more primary cause? Expert systematists have been enabled (TCS2 design), empowered (decentralization => implementation design), and incentivized (accreditation => implementation design) to transfer our knowledge via TCS2 syntax into aggregating environments. In a world where most scientists operate within a merit-based framework, how can a non-expert user base benefit lastingly if the expert contributor base does not benefit first or foremost?

@deepreef

deepreef commented Oct 2, 2018

Did he die because his brain went hypoxic? Or because his lungs were full of water (causing his brain to go hypoxic)? Or because he went unconscious underwater (causing his lungs to fill with water)? Or because he had a seizure (causing him to go unconscious)? Or because he was breathing too much oxygen under pressure (causing him to have a seizure)? Or because his rebreather provided too much oxygen (causing him to breathe too much oxygen)? Or because he set up the rebreather incorrectly (causing it to provide too much oxygen)? Or because it was a bad rebreather design (making it too easy for him to set it up incorrectly)? Why did he die?

Sorry for that weird/morbid analogy, but it sounds like we're making the same point at slightly different levels. My statement about what we should be aiming for isn't the "secondary effect" (happy or otherwise), it's what I see as the terminal goal (within the scope of TCS2). There are many things that need to happen in order to achieve that terminal goal. Certainly among them are steps that enable, empower, and incentivize scientists to play their role in extracting and synthesizing information from raw data (occurrence records, literature information, etc.) and transforming it in a way (TCS2) that serves a function to non-scientists (or scientists lacking specific expertise). The point has been made many times over many years that if all we achieve with TCS is the goal of allowing taxonomists easier access to data to help them achieve their taxonomic goals, then we have failed. We certainly do need to do that, but in a way that facilitates something useful to a much broader audience.

@vsenderov

vsenderov commented Oct 20, 2018

I realize the issue has been closed but I would like to nevertheless answer the questions @deepreef raised on Sep. 8. I apologize for the late reply but other commitments prevented me from writing a detailed response. I am copying Lyubo's new PhD student Maria (mdimitrova095 at gmail.com) as well, as she is slowly transitioning to maintaining the pioneering biodiversity knowledge graph OpenBiodiv.

Many thanks for re-linking this publication, @nfranz! I thought I had clicked on your original link, but evidently not as this is the first I'm seeing the full publication. Although I do have some minor philosophical quibbles (e.g., I still fail to understand how a taxon concept can justifiably be called a "hypothesis", rather than an asserted opinion -- I don't agree with the arguments put forth about falsifiability), once I got past those I found the article to be very useful in framing the problem we're up against with this discussion. It's definitely worth carefully reading by anyone interested in this sort of stuff.

If a taxonomic concept is an unfalsifiable opinion, it must logically follow that taxonomic circumscription does not follow the scientific process. If you want the taxonomic process to claim to describe the real world in a Popperian fashion, then it is necessary that the opinion can be checked against some form of experiment. In the case of taxonomic concepts, a single taxonomic concept can be checked as to whether or not it follows some species concept.

I do have a couple of technical questions that are most likely due to my ignorance of OpenData (SPAR Ontologies, etc.), but I'm going to take a risk and ask them anyway. Perhaps you can help clarify these.

Please, feel free to get back to me per email or Skype whenever you wish---I am more than willing to discuss this should this explanation fall short.

The article states that "Taxonomic Article is a subclass of FaBiO’s Journal Article". However, several other subclasses of FaBiO's Expression class (e.g., books, chapters, etc.) also contain taxonomic treatments. Is this a problem for implementation, or are we only interested in treatments that appear in articles, or...?

Neither. While Taxonomic Article is a subclass of Journal Article, a Treatment is a subclass of Discourse Element. From the guide:

:Treatment a owl:Class ;
  rdfs:subClassOf deo:DiscourseElement ,
                  [ rdf:type owl:Restriction ;
                    owl:onProperty :isContainedBy ;
                    owl:someValuesFrom :TaxonomicArticle ] ;
  rdfs:label "Taxonomic Treatment"@en ;
  rdfs:comment "A rhetorical element of a taxonomic publication, where taxon
    			circumscription takes place."@en ;
  rdfs:comment "Таксономично пояснение или само Пояснение е риторчна част
                от таксономичната статия, където се случва описанието
                на дадена таксономична концепция."@bg .

The above code is OWL, written in Turtle syntax. Without going into too much detail, it is the standard way Peroni and Shotton deal with discourse elements such as special sections in the article (e.g. Introduction, Methods, Discussion, etc.).

The article states "In OpenBiodiv-O, a taxonomic name usage is the mentioning of a taxonomic name in the text, optionally followed by a taxonomic status." If a name is mentioned several times within a single treatment, does that represent more than one TNU sensu OpenBiodiv-O?

Yes. Each text area is a single TNU with a unique identifier. This is modelled after the Mention class of the base ontology PROTON Extensions module.

Or are they collectively contained within a single TNU (e.g., represented by the NomenclatureHeading)?

No.

it seems that the TNU is the raw text string, not the Treatment as a whole, in which case the definition of TNU as asserted in the context of OpenBiodiv-O is a significant departure from how it has been defined elsewhere.

Possibly. However, in the broader Natural Language Processing (NLP) community, this is how "mentions" of particular entities are modeled. E.g. if I have text about Germany, I will have in it
a) the concept of Germany (with a URI, say http://dbpedia.org/page/Germany);
b) text areas that mention Germany.
Note that the strings of these text areas might be slightly different due to grammatical and semantic considerations. The NLP task is to map these mentions to dbpedia:Germany. In our case we link particular text areas to URIs of taxonomic names. Note that as names are different from concepts, there is yet another mapping from a name to a URI. Thus, should I adopt yet another layer of indirection for TNUs, I risk making the model too complicated. Therefore, I have strived for the most parsimonious model and defined Mention as it is defined in the NLP world. Here is the definition of the superclass from PROTON: "An area of a document that can be considered a mention of something."
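To make the "each text area is its own TNU" reading concrete, here is a minimal Python sketch. It is not OpenBiodiv's implementation: the `Mention` class, the `find_mentions` helper, the identifier scheme, and the example.org URI are all hypothetical, and real systems would use a proper name-recognition step rather than exact string search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A text area (character offsets) that mentions a taxonomic name."""
    mention_id: str
    start: int
    end: int
    name_uri: str  # hypothetical URI identifying the taxonomic name


def find_mentions(text, surface_forms, name_uri):
    """Return one Mention per occurrence of any surface form in the text."""
    spans = []
    for form in surface_forms:
        pos = text.find(form)
        while pos != -1:
            spans.append((pos, pos + len(form)))
            pos = text.find(form, pos + 1)
    spans.sort()
    return [
        Mention(f"tnu-{i}", start, end, name_uri)
        for i, (start, end) in enumerate(spans, 1)
    ]


text = "Aus bus was described in 1900. Later authors placed A. bus elsewhere."
mentions = find_mentions(
    text, ["Aus bus", "A. bus"], "http://example.org/name/aus-bus"
)
for m in mentions:
    print(m.mention_id, repr(text[m.start:m.end]), m.name_uri)
```

Two differently spelled text areas yield two mentions with distinct identifiers, both pointing at the same name URI, which mirrors the Germany example above.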

An important aspect of TNUs is that there is generally a 1:1 correspondence between a Treatment and the TNU representing the NomenclatureHeading for the Treatment.

In a system where there is a bijective mapping between Treatment and TNU, one of these two classes is extraneous. This is not the case in OpenBiodiv-O, as it tries to provide only one way to express any given statement.

However, as implied by Figure 1 of the article, a treatment often contains other TNUs (e.g. within the NomenclatureCitationList). Thus, while every Treatment has exactly one corresponding TNU, not all TNUs are treatments.

True. Treatments are specialized discourse elements. Treatments are expressions of the more abstract class concept. Think of it like this: a treatment is the "writing down" of the idea that the concept represents. In order to fully appreciate this, please refer to page 6 of the FRBR model.

I very-much like the way that "TaxonomicConceptLabel" (TCL) is defined.

Thanks. This is @taxonbytes's idea.

However, I'm not entirely sure I understand why the need for establishing OperationalTaxonomicUnit as a super class of TaxonomicConcept. In my mind, Taxonomic Concepts represent a circumscription of organisms, regardless of whether that circumscription happens to include a specimen (or more than one specimen, when heterotypic synonymy is involved) designated as a name-bearing type for a Linnean-style taxonomic name (i.e., regardless of whether the concept has a formal scientific name to label it with). Can you provide examples of instances of OperationalTaxonomicUnit that would not be regarded as instances of TaxonomicConcept? I.e., what other subclasses of OperationalTaxonomicUnit are there, and what function do they serve?

This is a point of modeling and different ways to do this are possible without sacrificing expressivity. My idea was, however, to make taxonomic concepts the biodiversity-grouping concepts that are formed by taxonomists and that can be identified with taxonomic concept labels (Aus bus sec. X). Clearly, one may form a biodiversity-grouping concept in a non-traditional way: e.g. a BOLD BIN would be an example of that. Such a "taxonomic concept" will not have, at least initially, a taxonomic concept label. However, the BOLD BIN is clearly a falsifiable hypothesis about a unit of biodiversity. As a different example, may I bring up my current work on an entirely new system of grouping organisms on the basis of integrative information and Deep Neural Networks. The biodiversity operational units that BOLD or my system form will be biodiversity-grouping concepts as well. In order to distinguish such circumscriptions from the more traditional Linnean one, I have restricted taxonomic concept to denote the set of biodiversity-grouping concepts that can be formed with traditional means, and relaxed operational taxonomic unit to denote the set of all concepts about units of biodiversity. Note, I could have used the clunky "biodiversity-grouping concept" that I am using in this paragraph, but I decided to defer to Sokal and use the established term OTU, which has already been used for numerical circumscriptions and will not suffer by this extension.

Regarding the two patterns, replacement name and related name, is the former a subset of the latter?

Replacement name and related name are properties, i.e. binary relations:

:relatedName rdf:type owl:ObjectProperty, owl:TransitiveProperty, owl:ReflexiveProperty ;
  rdfs:label "has related name"@en ;
  rdfs:domain :TaxonomicName ;
  rdfs:range :TaxonomicName ;
  rdfs:comment "'has related name' is an object property that we
    use in order to indicate that two taxonomic names are related somehow. This
    relationship is purposely vague as to encompass all situations where two
    taxonomic names co-occur in a text. It is transitive and reflexive."@en.
:replacementName rdf:type owl:ObjectProperty ,
                          owl:TransitiveProperty ;
  rdfs:label "has replacement name"@en ;
  rdfs:domain :LatinName ;
  rdfs:range :LatinName ;
  rdfs:comment "This is a uni-directional property. Its meaning
    is that one Linnaean name links to a different Linnaean name via the
    usage of this property, then the object name is more accurate and should be
    preferred given the information that system currently holds. This property is only
    defined for Linnaean names."@en.

It is a little hard for me to parse "replacement name is a subset of related name." Neither of these two objects are sets: they are binary relations. What is true, though, is

a) related name is declared reflexive and is used symmetrically, i.e. if related_name(A,B) holds, so does related_name(B,A). (Strictly speaking, this implication is the definition of a symmetric rather than a reflexive relation.)
b) replacement name is not. The idea of replacement name is to follow the chain of replacement names to the currently valid name. In my dissertation, Section 3.4.2---Competency question answering, I show how one can do these types of "validation queries" in the pioneering biodiversity knowledge graph, OpenBiodiv.
c) I have written additional rules (not part of the ontology but part of the dissertation, Section 3.5.4---Post-processing) that say that if a name A replaces name B, then related_name(A,B) holds, and necessarily, by the symmetry described in a), so does related_name(B,A). Thus, for any set of names $A_1$, $A_2$, $A_3$, ..., such that replacement_name($A_1$, $A_2$), replacement_name($A_2$, $A_3$), and so on, there exists a related_name relation between any two names of the set. The inverse is not necessarily true.
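The chain-following idea in (b) and the derived relatedness in (c) can be sketched in plain Python. This is an illustration only, not the ontology's rules or the SPARQL used in the dissertation. It assumes each name has at most one replacement, treats related_name as symmetric and transitive as described above, and uses invented name labels.

```python
# Hypothetical replacement_name assertions: name -> the name that replaces it.
replacement_name = {"A1": "A2", "A2": "A3", "B1": "B2"}


def current_name(name, replacement_name):
    """Follow the replacement chain until a name with no replacement is reached."""
    seen = {name}
    while name in replacement_name:
        name = replacement_name[name]
        if name in seen:  # guard against a cyclic chain in bad data
            raise ValueError(f"replacement cycle at {name}")
        seen.add(name)
    return name


def related_name_groups(replacement_name):
    """Group names by the end of their replacement chain.

    Under a symmetric, transitive related_name derived from replacement_name,
    any two names in the same group are related to each other.
    """
    groups = {}
    all_names = set(replacement_name) | set(replacement_name.values())
    for name in all_names:
        groups.setdefault(current_name(name, replacement_name), set()).add(name)
    return groups


print(current_name("A1", replacement_name))  # A3
groups = related_name_groups(replacement_name)
print(sorted(groups["A3"]))  # ['A1', 'A2', 'A3']
print(sorted(groups["B2"]))  # ['B1', 'B2']
```

The groups illustrate why the inverse does not hold: A1 and A3 are related, yet there is no direct replacement_name assertion between them in the input.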

Or are these mutually exclusive?

No. One implies the other (not in the ontology but in the extension), but not the inverse.

It seems that replacement name implies congruence of concept/circumscription, whereas related name could apply to all five RCC-5 relations (or only the other four, excluding congruence), or...?

Both of these relations are weak and underdetermined as they describe relationships between names that are unsuitable proxies for taxonomic concepts. They may imply something about the taxonomic concept alignments, but mostly they only imply nomenclatural statements. @taxonbytes has done some logic (Franz, Nico M., Chao Zhang, and Joohyung Lee. "A logic approach to modelling nomenclatural change." Cladistics 34.3 (2018): 336-357.) to model how one can be deduced from the other.
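For readers unfamiliar with RCC-5, the five relations can be illustrated by modelling two circumscriptions as plain sets of members. This is a deliberately simplified Python sketch: real alignments are asserted by experts rather than computed from complete member lists, and the specimen identifiers and "sec." labels below are invented.

```python
def rcc5(a, b):
    """Return the RCC-5 relation of circumscription a to circumscription b,
    with both modelled as non-empty sets of members."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("circumscriptions must be non-empty")
    if a == b:
        return "is congruent with"
    if a < b:
        return "is properly included in"
    if a > b:
        return "properly includes"
    if a & b:
        return "overlaps"
    return "is disjoint from"


# Hypothetical circumscriptions carrying the same name, "Aus bus",
# according to two different sources.
aus_bus_sec_2017 = {"specimen1", "specimen2", "specimen3"}
aus_bus_sec_2018 = {"specimen2", "specimen3", "specimen4"}

print(rcc5(aus_bus_sec_2017, aus_bus_sec_2018))  # overlaps
```

Both circumscriptions share one name, yet they merely overlap. A name-to-name relation on its own cannot say which of the five relations holds between the concepts.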

Sorry for the long post -- just trying to make sure I understand the contents of and assertions in the paper correctly.

Sorry as well for the long reply. This stuff is very hard to describe formally, but there is no way around it if you want to make a computer system that reasons about it.

@deepreef

deepreef commented Oct 20, 2018

Thanks, @vsenderov! I will reply via email to the CC list. If anyone following the GitHub thread is interested in this discussion, please let me know and I'll forward my reply to you.
