New Term

deepreef · 2021-04-28T06:43:47Z

Submitter: Richard Pyle
Proponents (at least two independent parties who need this term): Bishop Museum, TaxonWorks, data publishers from Denmark, Australia, Norway and Sweden, and multiple other institutions and data managers as expressed in commentary on several related issues.
Justification (why is this term necessary?): Multiple implementations of biodiversity data management systems manage instances of MaterialSample in a hierarchy. The definition of the MaterialSample class implies instances that result from sampling and subsampling, and its examples include whole organisms, parts of organisms, and aggregates of multiple organisms. It is common practice among managers of physical material that meets the definition of MaterialSample to track derivatives (e.g., different preparations of the same organism, specimens within lots, tissue samples and other subsamples derived from whole organisms, etc.). The proposed new term allows for representing a parent/child relationship between instances of MaterialSample derived from other "parent" MaterialSample instances.

Proposed attributes of the new term:

Proposed definition of the new term: "An identifier for the broader MaterialSample from which this and potentially other MaterialSamples were derived, or which they collectively comprise."
Term name (in lowerCamelCase): parentMaterialSampleID
Class (e.g. Location, Taxon): MaterialSample
Comment (recommendations regarding content, etc.): Recommended best practice is to use a persistent, globally unique identifier for a dwc:MaterialSample or an identifier for a dwc:MaterialSample that is specific to the data set.
Examples: 6e43b33d-88ce-4a37-ad94-74d6c99b9e25, urn:uuid:11142195-4865-4b52-baed-1b76a39613a3
Refines (identifier of the broader term this term refines, if applicable): None
Replaces (identifier of the existing term that would be deprecated and replaced by this term, if applicable): None
ABCD 2.06 (XPATH of the equivalent term in ABCD, if applicable): not in ABCD

Term originally proposed a year ago by @thomasstjerne on the GBIF GitHub. See the related discussion around changes to MaterialSample on DwC (#314) and GBIF issue #37. This new term has direct relevance to dwc:preparations, in cases where multiple different preparations are derived from the same whole specimen.

dagendresen · 2021-04-28T07:22:47Z

Excellent!
Maybe the example format could be: urn:uuid:6e43b33d-88ce-4a37-ad94-74d6c99b9e25

deepreef · 2021-04-28T07:36:51Z

Thanks! RE: the example, I was following the template of other similar terms in DwC (e.g., materialSampleID). Also, I generally try to minimize the inclusion of dereferencing metadata in identifiers; but that's more of a personal preference.

dagendresen · 2021-04-28T08:02:51Z

Do you consider urn:uuid: as dereferencing metadata?

dagendresen · 2021-04-28T09:02:47Z

Example of a preserved specimen of bluethroat with a blood sample extracted for DNA in support of the need for the proposed parentMaterialSampleID term.

Preserved specimen (mounted)

basisOfRecord = PreservedSpecimen
catalogNumber = NHMO-BI-104452/2-P
occurrenceID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3
materialSampleID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3
organismID = urn:uuid:246afd01-f734-5da9-874b-4a09f26030f8

Blood sample

basisOfRecord = MaterialSample
catalogNumber = NHMO-BI-104452/1-B
occurrenceID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c
materialSampleID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c
organismID = urn:uuid:246afd01-f734-5da9-874b-4a09f26030f8
parentMaterialSampleID = urn:uuid:11142195-4865-4b52-baed-1b76a39613a3

DNA sample (not yet here, but available for other)

parentMaterialSampleID = urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c
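To make the parent/child chain in this example concrete, here is a minimal Python sketch of how a consumer might walk parentMaterialSampleID links from a derived sample back to the whole specimen. The two UUIDs are taken from the records above; `example:dna-1` is a hypothetical placeholder (the DNA sample's own materialSampleID is not given), and the `ancestors` and `children` helpers are illustrative names, not part of Darwin Core or any GBIF tooling.

```python
from typing import Dict, List, Optional

# Identifiers from the example above, plus one hypothetical placeholder
# for the DNA sample, whose materialSampleID is not given in the thread.
SPECIMEN = "urn:uuid:11142195-4865-4b52-baed-1b76a39613a3"
BLOOD = "urn:uuid:1ca12cf5-9c1a-4a25-82fb-739f2f1a322c"
DNA = "example:dna-1"  # hypothetical

# materialSampleID -> parentMaterialSampleID (None for the top-level sample)
parent_of: Dict[str, Optional[str]] = {
    SPECIMEN: None,
    BLOOD: SPECIMEN,
    DNA: BLOOD,
}


def ancestors(sample_id: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of parent samples, nearest first.

    Stops at a sample with no parent, at an unknown identifier,
    or when a cycle is detected.
    """
    chain: List[str] = []
    seen = {sample_id}
    current = parents.get(sample_id)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


def children(sample_id: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the samples whose parentMaterialSampleID is sample_id."""
    return [child for child, parent in parents.items() if parent == sample_id]


print(ancestors(DNA, parent_of))      # blood sample first, then the specimen
print(children(SPECIMEN, parent_of))  # the blood sample
```

Because each record carries only a pointer to its immediate parent, the full hierarchy (specimen, then blood sample, then DNA sample) can be rebuilt from flat records without any extra structure.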

...apropos, this raises the question of whether reusing the occurrenceID UUID as the materialSampleID is correct at all. However, a value for occurrenceID is mandatory for the records to be published in GBIF.

dagendresen · 2021-04-28T09:04:37Z

Here is an example of a bluethroat (Luscinia svecica subsp. svecica) from which 7 MaterialSamples were extracted, in support of the need for the proposed parentMaterialSampleID term. For many of these bluethroats we lack parentMaterialSampleID to describe the hierarchy between material samples, sub samples for DNA. (To describe if the DNA sample is sub-sampled from the blood sample, from the tissue sample, from the sperm sample, etc..., each preserved as separate biobank MaterialSamples).

organismID = urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f

deepreef · 2021-04-28T19:59:33Z

@dagendresen : MANY thanks for the great example!

Do you consider urn:uuid: as dereferencing metadata?

Well... I guess it's technically not "dereferencing" metadata (like http://dx.doi.org/ or https://doi.org/); but it is still metadata, which basically translates to "What follows should be interpreted as a Uniform Resource Name, of the type Universally unique identifier". The actual "identifier" itself is the stuff that comes after the second : (in the same way that the stuff that comes after the third / in https://doi.org/10.3897/zookeys.641.11500 is the actual identifier).

I don't want to hijack this thread, but just to make a point... this is the closest representation of the actual identifier for your organismID in the post above, that can be rendered in textual form:
11100101100100111000001110001010111101111010100101011110111100101010000001001010001010111111110001111100100100000111011100011111
(i.e., 128 consecutive bits, represented here as 1s and 0s)

A less cumbersome way to display this value to human eyeballs would be in hexadecimal form:
e593838af7a95ef2a04a2bfc7c90771f
(that reduces it to 32 characters, instead of 128)

It could also be represented as a decimal number:
305159146678742414161168577211252373279
(but that increases the number of characters to 39)

The most text-economical way to represent it is in base64:
5ZODivepXvKgSiv8fJB3Hw
(22 characters; but with a bonus: "Dive" is in there! Cool! It must be a sign...)

Of course, the most common way to represent it (and the way most people provide them to GBIF) is in the so-called canonical textual representation:
e593838a-f7a9-5ef2-a04a-2bfc7c90771f
(36 characters)
This form is already embellished with an additional four characters (hyphens) that are not actually part of the 128 bits of the identifier itself. They're added for the benefit of human eyeballs, presumably because breaking it up into an 8-4-4-4-12 template is less scary to humans (there are other technical reasons, but not important ones, in my opinion).

Microsoft unhelpfully represents them sometimes using upper-case letters:
E593838A-F7A9-5EF2-A04A-2BFC7C90771F
(also the form I regretfully chose for rendering as ZooBank LSIDs)
Or, even worse, with curly brackets:
{e593838a-f7a9-5ef2-a04a-2bfc7c90771f}

I get why it's useful in the context of RFC 4122 to pre-pend them with the aforementioned metadata (urn:uuid:), as you advocate:
urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f
And honestly, other than the canonical text form, I could be most easily persuaded to embrace this form (it's certainly better than pre-pending LSID metadata, as in something like urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC)
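For anyone who wants to reproduce the renderings listed above, here is a small sketch using only Python's standard `uuid` and `base64` modules. The dictionary keys are just labels chosen for illustration, and the base64 form assumes the 16 bytes are encoded in their usual big-endian order with the trailing `=` padding dropped.

```python
import base64
import uuid

u = uuid.UUID("e593838a-f7a9-5ef2-a04a-2bfc7c90771f")

# Different textual renderings of the same 128-bit value.
representations = {
    "bits": format(u.int, "0128b"),
    "hex": u.hex,
    "decimal": str(u.int),
    # Standard base64 over the 16 raw bytes, padding removed.
    "base64": base64.b64encode(u.bytes).decode("ascii").rstrip("="),
    "canonical": str(u),
    "upper": str(u).upper(),
    "braces": "{" + str(u) + "}",
    "urn": u.urn,
}

for name, text in representations.items():
    print(f"{name:>9} ({len(text):3d} chars): {text}")

# uuid.UUID accepts several of these forms directly and yields the same value.
for name in ("hex", "canonical", "upper", "braces", "urn"):
    assert uuid.UUID(representations[name]) == u
```

All of these carry the same 128 bits; only the amount of decoration around them differs.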

But here's my point: the actual identifier is 128 consecutive 1s and 0s -- which is how most database systems actually store them on disk, in the form of 16-byte numbers. However, they're almost always presented (and consumed) as text strings, which at 16 bits per character (e.g., UTF-16) make them a whopping 576 bits in canonical form. So basically, we're consuming 4.5x as many bytes as the actual identifier, just to make them a little bit more human-friendly.
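The size comparison can be checked with a few lines of Python. This sketch assumes a 16-bit-per-character encoding (UTF-16, without a byte-order mark) for the 576-bit figure; a UTF-8 encoding of the same ASCII-only string would take half that.

```python
import uuid

u = uuid.UUID("e593838a-f7a9-5ef2-a04a-2bfc7c90771f")
canonical = str(u)

# Storage cost in bits for the raw value and for common text encodings.
sizes_bits = {
    "raw 16 bytes": len(u.bytes) * 8,
    "canonical, UTF-8": len(canonical.encode("utf-8")) * 8,
    "canonical, UTF-16": len(canonical.encode("utf-16-le")) * 8,
    "urn:uuid: form, UTF-16": len(u.urn.encode("utf-16-le")) * 8,
}

raw = sizes_bits["raw 16 bytes"]
for label, bits in sizes_bits.items():
    print(f"{label:>24}: {bits:4d} bits ({bits / raw:.2f}x raw)")
```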

You could argue that the form urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f makes them even more computer-friendly (at the cost of only an additional 144 bits... more than the actual identifier itself, BTW), but I would argue "not really". While I know the intent of the RFC 4122 system was to allow computers to automate things, I'm not sure how much it's caught fire broadly among people who process this information (and write code to process this information). I bet the first thing that a lot of developers (most?) would do with urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f is strip off the first 9 characters, then write the rest of the code based on matching the canonical text form. And even without the prefix, it's not too hard to incorporate a regular expression to identify a UUID within any text string (including those from PLAZI, which typically lack the hyphens).
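As a rough illustration of the regular-expression approach, here is a sketch that finds UUIDs in free text whether or not they carry a prefix, braces, upper-case letters, or hyphens, and normalizes them to the canonical lower-case form. The pattern and the `find_uuids` helper are assumptions made for this example, not code from GBIF or PLAZI, and the hyphenless sample string is made up to mimic that style.

```python
import re
import uuid
from typing import List

# 32 hex digits in 8-4-4-4-12 groups with optional hyphens, not embedded
# in a longer run of hex digits.
UUID_PATTERN = re.compile(
    r"(?<![0-9a-fA-F])"
    r"[0-9a-fA-F]{8}-?[0-9a-fA-F]{4}-?[0-9a-fA-F]{4}-?[0-9a-fA-F]{4}-?[0-9a-fA-F]{12}"
    r"(?![0-9a-fA-F])"
)


def find_uuids(text: str) -> List[str]:
    """Return every UUID found in text, in canonical lower-case form."""
    return [str(uuid.UUID(match.group(0))) for match in UUID_PATTERN.finditer(text)]


samples = [
    "urn:uuid:e593838a-f7a9-5ef2-a04a-2bfc7c90771f",
    "{E593838A-F7A9-5EF2-A04A-2BFC7C90771F}",
    "id E593838AF7A95EF2A04A2BFC7C90771F (hyphens omitted)",
    "urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC",
]

for sample in samples:
    print(find_uuids(sample))
```

With this kind of matching the prefix never needs to be stripped explicitly; it is simply ignored along with any other surrounding text.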

OK... like I said, I don't want to hijack this thread with a diatribe about identifiers, but it appears that ship has already left the barn (or something like that).

tucotuco · 2021-04-28T22:09:22Z

@deepreef This looks like a solid proposal. I took pause at first at the "and potentially other MaterialSamples were derived, or which they collectively comprise" in the definition. It seemed odd to refer to other entities than required to define the concept, but these additions really do help to nail down more broadly how to use the term in practice, and they do nothing to obscure the immediate concept, so I end up quite liking it.
Is that example a real identifier for a MaterialSample somewhere? I try to make sure the examples are real. If it isn't, can we use that provided by @dagendresen? Thanks Dag for the great illustration of usage.

deepreef · 2021-04-28T22:31:16Z

Thanks, @tucotuco

I took pause at first at the "and potentially other MaterialSamples were derived, or which they collectively comprise" in the definition. It seemed odd to refer to other entities than required to define the concept, but these additions really do help to nail down more broadly how to use the term in practice, and they do nothing to obscure the immediate concept, so I end up quite liking it.

Yeah, that's the part of the proposal I was most queasy about. I modelled the definition after the existing definition for parentEventID: "An identifier for the broader Event that groups this and potentially other Events."

I originally had it as:

"An identifier for the broader MaterialSample from which this and potentially other MaterialSamples were derived."

But that seemed incomplete, so I added the extra ", or which they collectively comprise" (to avoid people nit-picking the definition of "derived")

Is that example a real identifier for a MaterialSample somewhere?

Yup! And not chosen at random either (here's a hint: search for occurrenceID 4fed2b94-7fb1-4a49-9315-0810171fc507). I was kinda disappointed that there didn't seem to be any way to search GBIF on materialSampleID (doesn't even seem to show up in the full data record). I wanted to find other real-world examples of what values people are presenting under that term, so I could have more than just the UUID example. I even downloaded ~2M GBIF records (Hawaii records -- I need them for another project anyway) so I could get a sampling of other real-world values for materialSampleID; but I was having trouble importing the download into a database, so I gave up and just entered the UUID. I figured that's the only example given for materialSampleID anyway, so might as well be consistent. Except I chose a different UUID for the example, for entirely narcissistic reasons (in my defense, if I were a true narcissist, I would have gone with 65fea8a6-c595-4f5b-adda-d1d176f40e7c - I'll make you wait until GBIF adds support for searching on materialSampleID to see what that one is).

In any case... I've added the example from @dagendresen as a second one (even though I'm queasy on the urn:uuid: thing...)

debpaul · 2021-04-29T00:52:03Z

Haha @deepreef wrote:

OK... like I said, I don't want to hijack this thread with a diatribe about identifiers, but it appears that ship has already left the barn (or something like that).

Me either. But here goes. Many moons ago Greg and I asked, do we need the prefix? (answer no). Who or what really needs the "urn:uuid" declaration? A machine can figure out it's a UUID. A human can see it? The field itself comes with expectations of what to find in it. The prefix is redundant, no?

tucotuco · 2021-04-29T00:59:12Z

Related issues are Issue #1, Issue #3, Issue #24 (reopened because of renewed interest), Issue #314, Issue #332, Issue #345, Issue #346, and Issue #347.

deepreef added the Term - add label Apr 28, 2021

tucotuco added Class - MaterialSample Process - ready for public comment labels Apr 28, 2021

This was referenced Apr 29, 2021

Change term - preparations #346

Closed

Change term - disposition #347

Closed

New Term - materialSampleType #345

Closed

tucotuco added the Extensions label Apr 29, 2021

This was referenced Apr 29, 2021

Change term - associatedSequences #332

Closed

Change term - MaterialSample #314

Closed

New term - organismPart #3

Open

New term - preservationMethod #1

Open

tucotuco added normative Process - needs Task Group and removed Process - ready for public comment labels Apr 29, 2021

deepreef mentioned this issue May 3, 2021

LSIDs for taxonomic names live again tdwg/tnc#117

Open

tucotuco added the Task Group - Material Sample https://www.tdwg.org/community/osr/material-sample/ label Jun 2, 2021

tucotuco removed the Process - needs Task Group label Aug 25, 2021

Jegelewicz mentioned this issue Oct 7, 2021

Primary Deliverable - MaterialSample definition tdwg/material-sample#2

Closed

Jegelewicz mentioned this issue Oct 14, 2021

Suggested new term - parentMaterialSampleID tdwg/material-sample#15

Closed

New Term - parentMaterialSampleID #344
