New term - verbatimLabel #32
Comments
This proposal still needs evidence of demand. My question is, "Is it not sufficient/preferable to capture the label images? That is one level less of interpretation already." |
We use this field in TaxonWorks. We split it into three fields: "Buffered Determination Label", "Buffered Collecting Event Label" and "Buffered Other Labels". Just having an image is not enough, and sometimes we do not have an image. Basically, I, and many other collections using TaxonWorks, want this DWC field. |
Does this encompass both "gold standard" verbatim transcriptions of specimen labels and outputs of automated OCR processes (e.g. Tesseract)? How to encode the different approaches and their metadata (methodology)? How to differentiate between labels and their relative location? I don't think $ and are reliable enough, in particular if OCR outputs are in scope. |
We use a field for verbatim transcription of a label in the DataShot object to image to data workflow software. This captures the verbatim transcription of text from a region of interest representing a single label identified in an image of a set of labels. Subsequent workflow steps add interpretation of this verbatim text into structured data. In a less formal manner, there is a Twitter feed https://twitter.com/EntoTranslator and a Facebook group https://www.facebook.com/groups/232785306782255/ where images of difficult-to-interpret labels are posted for members of the community to provide either transcriptions of difficult-to-read handwriting or interpretations of words, phrases, abbreviations, and such on the labels. There are clear upstream needs in digitization workflows for representing verbatim label text in structured form. |
Closing for lack of demand. |
"Lack of demand?" Four different people have requested this be a DWC field and expected something to happen. I don't see lack of demand here. What do we need to provide to evidence "demand?" |
TDWG members discussing a good idea does not constitute demand. The demand
requirement needs independent organizations with a mission-driven need to
share these data.
…On Mon, Apr 19, 2021 at 10:58 AM Tommy McElrath ***@***.***> wrote:
"Lack of demand?" Four different people have requested this be a DWC field
and expected something to happen. I don't see lack of demand here. What do
we need to provide to evidence "demand?"
|
@tucotuco What specifically do you want us to provide, then? Would a survey of members of different natural history collections, with documented support for their need of this field, suffice? |
@tmcelrath TaxonWorks suffices to represent that class of proponent. That is the equivalent of one proponent. What other organization or project needs it? If you can come up with that, the next step is to submit a templated New term request. I can do that, adding it to the beginning of the first comment to keep all the discussion in one place, but I need that evidence of demand. |
As noted above, we've got a field for this in the DataShot system at the MCZ, associated with a region of interest in an image that contains multiple labels, but haven't been able to go very far with this in the absence of a means of sharing with the community. |
This initially seems like a straightforward enough proposal, but how does it interplay with the existing (and numerous) verbatim fields within DarwinCore? It seems to risk becoming a dumping ground for data that could/should go into existing fields, and perhaps discouraging their use because it's easier to just put it all, unstructured, into verbatimLabel. I think my main reservation is the following: are there many examples where the existing verbatim fields are inadequate, and could these be better covered by additional verbatim field(s) rather than such a loosely defined single field? |
@edwbaker The issue is actually slightly different. "Parsing" text into many verbatim fields automatically introduces interpretation by its very nature. For example: What is a "verbatimLocality"? Should all locality info go in it? Or just the most specific locality? We've had differences of opinion just within our own group on just this one field. To answer your question, DWC absolutely does not have enough verbatim fields. There are no verbatim identification fields, or verbatim curation labels fields (e.g. accession numbers, comments about preparation, etc ...). We use the ones that DWC has in addition to the verbatim one we are providing. Users do not have to use these fields, and yes, it introduces duplication of text, but that actually adds more power in terms of text-breakdown. We will never stop misreading labels and having poor quality control, but having this field allows for comparisons to the original verbatim label and will allow for corrections to be made. The idea of this field is in part, quality control. I have found having this field INVALUABLE more times than I can count when looking back at the original text, comparing incorrect GPS coords, poorly interpreted localities, or people misreading labels. |
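The quality-control use described above can be illustrated with a minimal sketch. It assumes a record is a plain dictionary keyed by term name, and that a parsed verbatim value should still be findable as a substring of the full label text once whitespace is normalized. The function names and the sample label are hypothetical and invented for illustration; they do not come from TaxonWorks or any other system mentioned in this thread.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return " ".join(text.split())


def fields_not_in_label(record, field_names):
    """Return the names of parsed fields whose text cannot be found in verbatimLabel.

    A hit means the parsed value and the label transcription disagree,
    so one of them should be rechecked against the physical label or its image.
    """
    label = normalize_whitespace(record.get("verbatimLabel", ""))
    suspicious = []
    for name in field_names:
        value = normalize_whitespace(record.get(name, ""))
        if value and value not in label:
            suspicious.append(name)
    return suspicious


# Hypothetical record: the date was mistyped while parsing (1989 instead of 1998).
record = {
    "verbatimLabel": "ILL: Champaign Co.\nUrbana, 12-V-1998\ncoll. J. Doe",
    "verbatimLocality": "Urbana",
    "verbatimEventDate": "12-V-1989",
}

flagged = fields_not_in_label(record, ["verbatimLocality", "verbatimEventDate"])
print(flagged)
```

A substring check like this only flags candidates for review. It cannot tell whether the parsed field or the label transcription is the one in error.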
To anyone following this thread, I have a poll out right now: https://forms.gle/fgxbQUmQLQC4a1NY6 collecting people's thoughts about this proposed DWC field. Please help me gather responses there. I am looking to get as many diverse stakeholders as possible. |
Reopened to accommodate renewed vigor in the proposal. |
What I'm wondering about this proposal is if we are conflating data management with implementing a standard. In my work for OBIS-USA I rarely receive data already in Darwin Core and I have to do a crosswalk. When I do that work there is always a chance that I performed that work incorrectly in some way and so I do my best to preserve the original data in a data repository and a link to that in the IPT so that future users of the data can get back to the original data to check the translation if they need to. For me it would not make sense to have all of that information stored in verbatim fields. When and how is the best place to separate out the standardization of the data from management of the data? Apologies if my comment doesn't make sense in this context since this is primarily considering museum collection data and I'm thinking of sampling event data. |
@albenson-usgs I think the only way of going back to the original data here is to include a label image. Having a label field is one potential source of error, then any further processing from that is another potential source of error. There are a number of potential solutions to "the verbatim problem" in this thread (using either SKOS or a separate dwc namespace). |
There are various use cases for verbatim data. We described quite a few of them in a paper we wrote a while ago, more specifically in this table. Darwin Core terms currently barely support these use cases, with many verbatim concepts unaccounted for and no unambiguous term for the uninterpreted text dump that Tommy described. While the content of this term will be messy and not very practical for machine training purposes, which seems like it could be a nice use case, it would support improved findability, validation efforts and linguistic aspects. |
The issue I see with adding verbatimLabel or an equivalent (in name it doesn't cover other data sources, such as occurrences from a notebook) is that if we have that, why do we need all the verbatim fields in dwc? The current process seems to be we put the label data in verbatimX and cleaner data in X. If we follow this precedent, then we should look at what verbatim label data is missed at present, and how we address that (two possible solutions in my above comment). If we don't follow this precedent then (in my mind) we have a much larger discussion. I think the point raised above by @albenson-usgs between data management (which I take in this instance to broadly be within an institution) and data standards (broadly between institutions) is highly relevant. From what I can see (glancing over dwc) this would be the first break from relatively atomic data to a definition that might include multiple data types. This alone I think is worthy of some serious discussion. I wonder if a better solution to this might actually be within AudubonCore as a term like 'transcription of data' which would cover not only the textual transcription of a photograph of a label, but also the equivalent spoken data in audio recordings of species, etc. In this way we could potentially cover occurrence as well as specimen data using the same methodology - each time having a resource (label image, sound recording, etc) to verify against. |
Having had a more thorough search it looks like GBIF have already minted a verbatimLabel term, and that it is used in the DwC-A format already by Plazi - http://plazi.org/api-tools/api/. |
Given that GBIF has minted a term, are there any stability issues with
Darwin Core making one? Does the term have a definition? If so, is it
semantically the same as proposed here?
…On Tue, Apr 20, 2021 at 10:31 PM Ed Baker ***@***.***> wrote:
Having had a more thorough search it looks like GBIF have already minted a
verbatimLabel term, and that it is used in the DwC-A format already by
Plazi - http://plazi.org/api-tools/api/.
|
And then @dimus asked me if I could get it to output in JSON or XML just for the asking and I got this!!! |
Could this also be applied to a MachineObservation when it is supported by a physical photograph? |
@Jegelewicz I'm not sure I understand your use case. What would the text be in the verbatimLabel in your use case? Also note. (See above comment, somewhere in the long thread). In the distant future, one can imagine something to do with the AC Extension where any media with associated verbatim text can both be shared. For now, having dwc:verbatimLabel moves us closer to getting a lot more out of our skeletal records when it comes to searching in the aggregate. |
Photos in an archive generally have information either written on the back or on an associated label. |
My reading of the definition would make this an ideal field to capture that @Jegelewicz |
@timrobertson100 so, for any media shared (a la using AC extension), associated text from that media could go in dwc:verbatimLabel? |
@timrobertson100 & @Jegelewicz as this is written "textual content of a label affixed on, near, or explicitly associated with a material entity" that can easily be incorporated. E.g. a machine observation (e.g. OCR) from a photo explicitly associated with a specimen. |
Yes - I think the change to MaterialEntity from the other set of terms fixes my concern that MachineObservation was left out. |
Is there anyone who would be willing to prepare a markdown document that describes, at a minimum, the recommendations for the usage comments for this term? The idea would be to point to this new document with a link in the usage comments for verbatimLabel, which can't support the complexity required for the existing commentary. We are still in the process of figuring out the best way to do this (see issue #444), but we will need the content regardless before we can move on to ratification. |
@tucotuco I've never done such before (by myself anyway). I could ask @timrobertson100 and @tmcelrath and see if we can manage it. What's our deadline? Many of us are headed to SPNHC. |
Also regarding usage comments, what then needs to be added / edited / removed from this original version then?
Wouldn't this be |
@tucotuco, @debpaul how about this as a start, which is a verbatim extract from the opening comment? The raw markdown which you can see in this preview. (being non-normative, we can evolve these examples at any time without further ratification) |
@timrobertson100 This is the simple approach I was hoping for. I would include the document in dwc repository (sensible location to be determined) and flag the entire document as non-normative paralleling what it would be if its content were captured directly in comments or examples. |
The examples now exist on https://dwc.tdwg.org/examples/verbatimLabel |
Huzzah! |
This proposal has had extensive commentary and has been updated by @timrobertson100 to accommodate all comments up to Dec 8th 2022. Previous versions of this proposal may be viewed by clicking the "edited" link above, and were the subject of the earlier comments below
New term
Proposed attributes of the new term:
Term name (in lowerCamelCase for properties, UpperCamelCase for classes): verbatimLabel
Organized in Class (e.g., Occurrence, Event, Location, Taxon): MaterialSample
Definition of the term (normative): A serialized encoding intended to represent the literal, i.e., character by character, textual content of a label affixed on, near, or explicitly associated with a material entity, free from interpretation, translation, or transliteration.
Usage comments (recommendations regarding content, etc., not normative): The content of this term should include no embellishments, prefixes, headers or other additions made to the text. Abbreviations must not be expanded and supposed misspellings must not be corrected. Lines or breakpoints between blocks of text that could be verified by seeing the original labels or images of them may be used. Examples of material entities include preserved specimens, fossil specimens, and material samples. Best practice is to use UTF-8 for all characters. Best practice is to add comment “verbatimLabel derived from human transcription” in occurrenceRemarks.
Examples (not normative):
For a label affixed to a pinned insect specimen, the verbatimLabel would contain:
With comment "verbatimLabel derived from human transcription" added in occurrenceRemarks.
When using Optical Character Recognition (OCR) techniques against an herbarium sheet, the verbatimLabel would contain:
With comment “verbatimLabel derived from unadulterated OCR output” added in occurrenceRemarks.
Refines (identifier of the broader term this term refines; normative): None
Replaces (identifier of the existing term that would be deprecated and replaced by this term; normative): None. Does not replace any current DWC “verbatim” terms. Other “verbatim” terms have already been “parsed” to a certain data class and have their own uses
ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG; not normative): /Marks/Mark/MarkText
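The usage comments above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the proposal: the label text is invented, the helper names (`make_record`, `to_tsv`, `from_tsv`) are hypothetical, and a tab-delimited file with quoted multi-line fields is only one possible serialization. The sketch stores the label text exactly as transcribed (line breaks kept, abbreviations unexpanded), records how it was derived in occurrenceRemarks, and checks that the text survives a write/read round trip unchanged.

```python
import csv
import io

FIELDS = ["occurrenceID", "verbatimLabel", "occurrenceRemarks"]

# Hypothetical label text invented for this sketch. Line breaks are kept and
# the abbreviations "Co." and "coll." are deliberately left unexpanded.
LABEL_TEXT = "ILL: Champaign Co.\nUrbana, 12-V-1998\ncoll. J. Doe"


def make_record(occurrence_id, label_text, source_note):
    """Build a flat record; the label text is stored exactly as given."""
    return {
        "occurrenceID": occurrence_id,
        "verbatimLabel": label_text,
        "occurrenceRemarks": source_note,
    }


def to_tsv(records):
    """Serialize records as tab-delimited text; fields with line breaks get quoted."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-delimited text back into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    return list(reader)


record = make_record(
    "urn:example:occurrence:1",  # placeholder identifier
    LABEL_TEXT,
    "verbatimLabel derived from human transcription",
)
round_tripped = from_tsv(to_tsv([record]))[0]

# The label must come back character for character, and encode cleanly as UTF-8.
assert round_tripped["verbatimLabel"] == LABEL_TEXT
utf8_bytes = round_tripped["verbatimLabel"].encode("utf-8")
```

How embedded line breaks should be handled in a real data exchange depends on the settings of the target file format, which this sketch does not cover.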