New term - verbatimIdentification #181

Closed
Daphnisd opened this issue Feb 19, 2018 · 30 comments

Comments

@Daphnisd

Daphnisd commented Feb 19, 2018

New term

  • Submitter: John Wieczorek, following request initiated by Daphnis de Pooter (@Daphnisd)
  • Justification (why is this change necessary?): There is currently no simple way to capture the verbatim scientificName given in an identification/determination - it has to be separated out in parts and corrected.
  • Proponents (who needs this change): OBIS, Global Names, SANBI

Proposed new attributes of the term:

  • Term name (in lowerCamelCase): verbatimIdentification
  • Organized in Class (e.g. Location, Taxon): Identification
  • Definition of the term: A string representing the taxonomic identification as it appeared in the original record.
  • Usage comments (recommendations regarding content, etc.): This term is meant to allow the capture of an unaltered original identification/determination, including identification qualifiers, hybrid formulas, uncertainties, etc. This term is meant to be used in addition to scientificName (and identificationQualifier etc.), not instead of it.
  • Examples: Peromyscus sp., Ministrymon sp. nov. 1, Anser anser X Branta canadensis, Pachyporidae?
  • Refines (identifier of the broader term this term refines, if applicable): None
  • Replaces (identifier of the existing term that would be deprecated and replaced by this term, if applicable): None
  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG, if applicable): not in ABCD

From Daphnis de Pooter (@Daphnisd)
Term needed to record how a taxon was originally recorded in an unprocessed dataset.
Motivation: tdwg/dwc-qa#109
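To make the usage comment concrete, here is a minimal sketch of two records that carry verbatimIdentification alongside the interpreted terms. The verbatim strings come from the examples in the proposal; the occurrenceID values and the way each string is split into scientificName and identificationQualifier are illustrative assumptions, not part of the proposal.

```python
import csv
import io

# Illustrative records only. The verbatim strings are from the proposal's
# examples; the interpreted values and identifiers are assumptions about how
# a publisher might parse them.
records = [
    {
        "occurrenceID": "example-1",  # hypothetical identifier
        "verbatimIdentification": "Peromyscus sp.",
        "scientificName": "Peromyscus",
        "identificationQualifier": "",
    },
    {
        "occurrenceID": "example-2",  # hypothetical identifier
        "verbatimIdentification": "Pachyporidae?",
        "scientificName": "Pachyporidae",
        "identificationQualifier": "?",
    },
]


def to_csv(rows):
    """Serialize records to CSV text with one column per term."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(records))
```

The point of the sketch is that the verbatim column is never edited, while the interpreted columns can be corrected later without losing the original string.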

@mdoering
Contributor

I would suggest reviewing all "verbatim" terms and coming up with a general strategy.
In theory all terms can be interpreted, and there is a need to deal with both verbatim and interpreted/cleaned values. GBIF and ALA, for example, have a long list of interpreted terms. We decided against new terms, though, and instead use the same term in a different context.

I can't find the issue now, but it was also suggested to have a new rowType that could indicate verbatim values.

@cgendreau

@mdoering it overlaps with discussion in gbif/occurrence#24

@mdoering
Contributor

Thanks @cgendreau, that's exactly what I was looking for!

@ansell
Contributor

ansell commented Feb 20, 2018

A new convention is my preference so far, per the discussion in gbif/occurrence#24

@claudenozeres

I agree with Daphnis that a new term and/or strategy is needed to make this more explicit. In the past, the practice for marine datasets was to publish using a valid name interpreted from the original. This was done despite the recommendation to submit using the original, because that gets too messy: WoRMS can't always suggest matches for obvious names. The original (verbatim) information is then lost. Currently this is a challenge for me with specimen label names (although I imagine it is similar for observation records). So the matter of original names sometimes gets mixed in with issues of identification. What I need is to record the verbatim name. Displaying a valid scientificName comes after, because, as it appears in the ALA/GBIF discussion, this can be open to interpretation. Having verbatim as part of a history extension (rather than core) does seem fragile to me.

@mdoering
Contributor

I would argue DwC should not have any specific verbatim terms but rather recommend other ways of dealing with data provenance. Often we also have longer lineages with multiple steps that alter the content, so a single verbatim term is difficult to apply. For example, W3C offers a rather complete PROV Ontology, although we should probably look for something far simpler.

@qgroom
Member

qgroom commented Feb 20, 2018

I tend to agree with Markus. Essentially every field could have a verbatim term, and it would be better if we could chain versions of an observation together. My only doubt is that Darwin Core has to be kept reasonably simple, otherwise people will not use it. Therefore, I'm OK with maintaining some verbatim fields as long as there is a gold standard way to handle these data.

@mdoering
Contributor

Well, we could also create a verbatim term for each dwc term. There is not really a restriction on the number of terms, just increased complexity. But if verbatim terms always have the prefix "verbatim", it's not adding much to the confusion. It might even help, because we could get the existing ones out of the way when presenting terms.

@qgroom
Member

qgroom commented Feb 20, 2018

As Markus points out, the problem is provenance. As data associated with an observation/specimen get amended, the chain of provenance is lost if you only have one verbatim field. If I understand Markus correctly, you could have no verbatim fields, because every field would be verbatim and you would link versions together to determine provenance.
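The idea of linking versions together can be sketched as a simple chain in which each version points at the one it was derived from. This is only an illustration of the concept under discussion; the class name `RecordVersion`, its fields, and the helper functions are hypothetical and do not correspond to any existing Darwin Core or GBIF mechanism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordVersion:
    """One version of a record; `previous` is the version it was derived from."""
    terms: dict
    note: str
    previous: Optional["RecordVersion"] = None


def original_value(version, term):
    """Walk the chain back to the first version and return its value for `term`."""
    while version.previous is not None:
        version = version.previous
    return version.terms.get(term)


def history(version, term):
    """Return the values of `term` ordered from the oldest version to the newest."""
    values = []
    while version is not None:
        values.append(version.terms.get(term))
        version = version.previous
    return list(reversed(values))


# Hypothetical two-step lineage for a single record.
v1 = RecordVersion({"scientificName": "Peromyscus sp."}, "as transcribed from the label")
v2 = RecordVersion({"scientificName": "Peromyscus"}, "qualifier removed", previous=v1)

print(original_value(v2, "scientificName"))
print(history(v2, "scientificName"))
```

In this model no field needs a verbatim counterpart: the verbatim value is simply the value held by the first version in the chain.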

@debpaul

debpaul commented May 7, 2018

From a data-mining standpoint, we need verbatim data to do things like automatically find matching references between a dwc record and an old publication in BHL. If, for example, the verbatim locality and verbatim taxon name are not shared, then it will be much more difficult for computer algorithms to make the connections between the two datasets. I'm not sure it matters what we call it as long as it's clear that it's the "original text", in this case as found in or on the label / field notebook / ledger. So it seems you are all saying we could / ought to use Identification History (and other such extensions) to share this type of information? What about verbatimCountry? This comes up all the time.

@Daphnisd
Author

Daphnisd commented May 7, 2018

I don't think using a separate extension for this is an option for us, as it would not be compatible with event core in IPT.

@ansell
Contributor

ansell commented May 8, 2018

Adding a single "verbatim" extension to a Darwin Core Archive isn't going to satisfy every use case if provenance over time is required, but those use cases also won't be satisfied by a single verbatimScientificName field.

In a possibly more serious provenance case, the ALA created an issue for itself, GBIF, and the community a number of years ago with its choice to overwrite the original occurrenceID obtained from scientists with an internal opaque GUID when sending this data to GBIF, while still showing the original occurrenceID on ALA websites/downloads and storing it in the ALA datasets. I have been told by the person who made that decision that it should/must not be fixed (for various reasons). However, without a standard way to express the verbatimOccurrenceID, I also can't provide any workarounds to enable the original data to pass through unhindered.

Having a standardised way of providing one or more verbatim or historical Darwin Core Archive extension files would allow users to optionally read what the original author provided, or read what other evolutions of the record contained. The current GBIF-only convention only allows for a single verbatim extension based on a static file name, which won't work for historical contexts where you want to track evolution of a dataset over time. Having an accepted convention that uses metadata rather than file names, whether it is based on the (overly complex) W3C PROV vocabulary, or another system, is essential to me for providing a workaround for the ALA occurrenceID mistake in future, which will (likely already has) hit some users just as badly as rewrites of scientificName to use the taxonomy or merged taxonomies which are currently accepted by a particular organisation.

I don't agree that we should add more verbatim terms to Darwin Core Terms solely to satisfy existing systems that aren't designed for a "verbatim extensions" model that we haven't developed yet. However, given the verbatim prefix already exists in Darwin Core Terms, it wouldn't be creating a new convention, just continuing an old convention, to create verbatimOccurrenceID and/or verbatimScientificName.

@mdoering
Contributor

mdoering commented May 9, 2018

If the old convention is continued, how bad would it be to create a verbatim term for every term in Darwin Core? At least we would have a consistent model then.

@peterdesmet
Member

Couldn't this be done with a dwcverbatim: namespace?
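The two naming conventions discussed so far, a "verbatim" prefix on every term and a separate dwcverbatim: namespace, can be compared with a minimal sketch. The function names are hypothetical, and the dwcverbatim prefix is only the suggestion made in this thread; no such namespace is defined in Darwin Core.

```python
def verbatim_prefixed(term: str) -> str:
    """Prefix convention: scientificName -> verbatimScientificName."""
    if not term:
        raise ValueError("term name must not be empty")
    return "verbatim" + term[0].upper() + term[1:]


def verbatim_namespaced(term: str, prefix: str = "dwcverbatim") -> str:
    """Namespace convention: scientificName -> dwcverbatim:scientificName.

    The default prefix is the one suggested in the thread, not an existing namespace.
    """
    if not term:
        raise ValueError("term name must not be empty")
    return f"{prefix}:{term}"


for term in ["scientificName", "occurrenceID", "country"]:
    print(term, verbatim_prefixed(term), verbatim_namespaced(term))
```

The prefix convention creates a new local name per term, whereas the namespace convention reuses the existing local names unchanged and moves the verbatim distinction into the prefix.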

@baskaufs

baskaufs commented Jan 3, 2019

I have suggested an approach for recording verbatim information involving the W3C SKOS-XL standard in the issue tdwg/tag#22. The actual process of getting from a provided verbatim string to full metadata associated with SKOS-XL instances is fleshed out more in my comment on TNC Issue 24.

@ianengelbrecht

Could I suggest that a strategy for verbatim terms be created as a separate GitHub issue? Returning to the request for dwc:verbatimScientificName in itself, this would be useful. The documentation for dwc:scientificName says 'This term should not contain identification qualifications, which should instead be supplied in the IdentificationQualifier term' (although the example does include a case that includes the identification qualifier). The BDQ TG2 tests and assertions include TG2-VALIDATION_POLYNOMIAL_NOTSTANDARD, which as currently defined will return NOT_COMPLIANT for any dwc:scientificName values that include a qualifier. We should also be able to represent identifications such as 'Harpactira sp.' in our datasets, and we also have the case of informal names for undescribed species, such as Harpactira sp. 'blue', manuscript names, etc.
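To illustrate why such names are awkward in dwc:scientificName, here is a rough sketch of a check that flags strings containing a qualifier. It is not the BDQ test itself: the function name `contains_qualifier` and the short list of qualifier tokens are assumptions made for illustration, and the list is far from exhaustive.

```python
import re

# Hypothetical, non-exhaustive set of qualifier markers: "cf.", "aff.",
# "sp.", "spp." as separate tokens, or a question mark anywhere.
QUALIFIER_PATTERN = re.compile(r"(?:^|\s)(?:cf|aff|spp?)\.(?=\s|$)|\?")


def contains_qualifier(name: str) -> bool:
    """Return True if the name string contains one of the assumed qualifier markers."""
    return QUALIFIER_PATTERN.search(name) is not None


for name in ["Harpactira sp.", "Harpactira sp. 'blue'", "Pachyporidae?", "Harpactira"]:
    print(name, contains_qualifier(name))
```

A string flagged by a check like this is a candidate for verbatimIdentification, with the cleaned name and the qualifier supplied separately.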

@ianengelbrecht

> In a possibly more serious provenance case, the ALA created an issue for itself, GBIF, and the community, a number of years ago with its choice to overwrite the original occurrenceID obtained from scientists with an internal opaque GUID when sending this data to GBIF, but still shows the original occurrenceID on ALA websites/downloads and stores that in the ALA datasets. I have been told by the person who made that decision that it should/must not be fixed (for various reasons). However, without a standard way to express the verbatimOccurrenceID, I also can't provide any workarounds to enable the original data to passthrough unhindered.

@ansell it seems that the practice of creating or overwriting GUIDs is a pervasive problem, probably resulting from a misunderstanding of the purpose of GUIDs in the first place. IMO overwriting dwc:occurrenceID is a misapplication of the standard. Should we modify the standard to cope with its misapplication? Not a route I would advocate for.

@ianengelbrecht

I see there is a verbatimScientificName field, and an accompanying verbatimScientificNameAuthorship field, in a dataset I just downloaded from GBIF.

@tucotuco
Member

tucotuco commented Sep 5, 2020

> I see there is a verbatimScientificName field, and an accompanying verbatimScientificNameAuthorship field, in a dataset I just downloaded from GBIF.

Those must be the dwc:scientificName and dwc:scientificNameAuthorship data from the originally published source.

I am reviewing all existing Darwin Core issues to try to move them forward or abandon them as the Vocabulary Maintenance Specification demands. This particular issue had a lot of activity, and in the meantime the community has apparently arrived at practical solutions.

I would like to establish if there is still demand for a new term dwc:verbatimScientificName. If there is, someone please follow the process and provide evidence of demand from at least two independent parties and a term definition following the template provided in Guidelines for contributing.

Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

@qgroom
Member

qgroom commented Sep 6, 2020

> Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

I agree.
This issue was part of the inspiration for the discussion on verbatim data we wrote in the publication below. We concluded that versioning was a much better approach.

Quentin Groom, Mathias Dillen, Helen Hardy, Sarah Phillips, Luc Willemse, Zhengzhe Wu, Improved standardization of transcribed digital specimen data, Database, Volume 2019, 2019, baz129, https://doi.org/10.1093/database/baz129

@nielsklazenga
Member

> Observation: I think this term would be best organized in the Identification class and have a name that explicitly makes the role of the name apparent, such as "verbatimIdentification".

tdwg/tnc#24 (comment)

@tucotuco changed the title from "New term verbatimScientificName" to "New term - verbatimIdentification" on Apr 19, 2021
@tucotuco
Member

I have changed the title of the issue and prepended a templated term change request to the original comment so as not to have to make a separate issue and relate it to the discussion in this one. Help is needed to know what the equivalent XPATH is in ABCD, if any.

@nielsklazenga
Member

@tucotuco , there is no equivalent for this term in ABCD 2.06.

@tucotuco
Member

Thank you @nielsklazenga. Term definition updated and ready to be prepared for public comment.

@afuchs1

afuchs1 commented May 26, 2021

The Australasian Herbarium Information Systems Committee (HISCOM) endorses the addition of this term to Darwin Core, but proposes to add to the usage notes that verbatimIdentification is best used in addition to scientificName (and identificationQualifier etc.), not instead of it.

@tucotuco
Member

@afuchs1 That seems a perfectly reasonable amendment to me. If there is no conflicting view, I will add it to the final usage comment. In the meantime, I have put a link to your suggestion in the usage section of the first comment.

@hollyel

hollyel commented May 28, 2021

This term will be useful to the paleo collections community for expressing original IDs and the full extent of our knowledge despite nomenclatural uncertainty (e.g., "Genus sp. nov. 1" as illustrated by one of the existing examples). At least with our current systems, this kind of uncertainty and complexity can lead to unexpected results when our data go to aggregators and get matched to taxonomic backbones. - Holly Little, Erica Krimmel (@ekrimmel), and Talia Karim (@tkarim) (on behalf of the Paleo Data Working Group)

@EstebanMH-SiB

We endorse this proposal on behalf of @SiBColombia

@tucotuco
Member

Done.
