umbrella issue related to dwc:basisOfRecord and an Evidence class #302

Open
baskaufs opened this issue Oct 28, 2020 · 47 comments

@baskaufs
Contributor

baskaufs commented Oct 28, 2020

At TDWG 2020, the subject of the need for an Evidence class came up several times.

Problems with the term dwc:basisOfRecord were also discussed. It is essentially a bespoke term to indicate the type of a record, but there is ambiguity about what particular resource the "record" is referring to since a single line in a table often contains information about several related resources.

Because of this ambiguity about table columns, this issue is therefore also related to the need for some better mechanism for describing the meaning of columns in CSV files that are part of a Darwin Core Archive. Data Packages and the W3C csv2rdf standard were mentioned in this context.

Given that this is a complex issue involving several interrelated issues, it probably needs a task group to come up with a solution. For now, I'm creating this umbrella issue as a way to document the discussion about these topics.

Ping @timrobertson100 @qgroom @dshorthouse @deepreef

@baskaufs baskaufs added the task label Oct 28, 2020
@baskaufs
Contributor Author

Related issue "basisOfRecord for Plazi datasets" https://discourse.gbif.org/t/basisofrecord-for-plazi-datasets/2238/3 @dagendresen @ekrimmel @agosti

@qgroom
Member

qgroom commented Oct 29, 2020

From Richard Pyle,

Thanks for looping me in, Steve. I agree with all your points below. These kinds of problems are popping up in a number of different places in the TDWG domain (e.g., literature discussions, Agent discussions, etc.).

I think part of the problem is the way in which people think of the meaning of an “occurrence”. I’ve discussed this with several of you over the years, but from my perspective, there is a general and fundamental misapplication of dwc:Occurrence among many in our community, which I think is largely a quirk of history. DwC started out as a standard for exchanging data about specimens. The greatest perceived value of being able to aggregate specimen data was to get better/more robust answers to questions about “what” (Taxon), “where” (Location) and when (the latter of secondary priority to the former two). Of course, observational data also provides valuable insights into what/where/when, so “Specimen” became “Occurrence” to allow integration, and a need arose to indicate whether a particular instance was based on a vouchered specimen or some sort of in-situ observation. Somewhere in that mix is where I think a lot of people (especially those who started out with DwC in the era when it was about specimens) started equating the “Occurrence” as the “Specimen”.

But the “specimen” is not the “occurrence”! The moment when the specimen (physical) was extracted from nature is the occurrence (abstract). We now have dwc:Organism and dwc:MaterialSample, the former representing the basis of a dwc:Occurrence instance (which I define as an intersection of dwc:Organism and dwc:Event instances); and the latter introduced to accommodate multi-taxon samples, but simultaneously creating a class that actually does represent a “specimen”.

I think this history and confusion has led to the issues we’re having with dwc:basisOfRecord. What started out as a need to distinguish between records based on vouchered specimens, vs. records based on non-vouchered occurrence records, has become an overloaded term that tries to solve several problems (but doesn’t really succeed at any of them). Here is what the DwC quick reference guide says are Examples of basisOfRecord, along with some editorializing from me:

PreservedSpecimen
Easy enough – this is how DwC started – to allow exchanging and aggregating data about Museum specimens.

FossilSpecimen
OK, I guess this is different from PreservedSpecimen both because it’s not really “Preserved” (or was already preserved prior to extraction from nature?), and because the implications of its underlying Occurrence are somewhat different, in that there are other properties (like those associated with dwc:GeologicalContext). And perhaps also to imply certain assumptions about the organism on which the occurrence is based (i.e., it died long before its occurrence was documented, and may have moved).

LivingSpecimen
Needed for things that aren’t dead yet (and, hence, not yet preserved), such as organisms in zoos and aquaria and botanical gardens and cultures and such. Is this because the associated dwc:Occurrence is anchored to existing locations (i.e., the aforementioned zoos and aquaria and gardens), rather than the intersection of the Organism with the Event at which it was extracted from nature (as is the case for dead specimens)? What are the implications of tagging records with this value, compared to the other two values above? What assumptions or restrictions on fit-for-purpose evaluations come with this designation?

MaterialSample
A genuine DwC class, but by the DwC definition, this appears to be a superclass of the first two above. I’m not sure how many people regard an instance of LivingSpecimen to be a subclass of MaterialSample (I probably would).

HumanObservation
This is the original “Observation” value, to distinguish an instance of Occurrence from a vouchered specimen. Easy enough.

MachineObservation
Same as HumanObservation, except the Human interpreting the observation didn’t participate directly in the Event itself.

Event
A genuine DwC Class, not conflated with the others.

Taxon
Another genuine DwC Class, not conflated with the others.

Occurrence
The superclass for all the others except Event & Taxon? Or maybe just the superclass for the two Observations?

I know this has already been discussed to death, including the overloaded nature of dwc:basisOfRecord and the suggestion of a need to recognize classes and subclasses of these things, and so on. I also know that these are only represented as examples, not a controlled vocabulary or explicit enumeration of allowable values. But I think we can break down these various values into two separate domains.

The first domain is bona-fide DwC Classes (Occurrence, Taxon, Event, MaterialSample). I see these as logical values for a dwc:basisOfRecord term/property for use in things like star schema, so it’s clear what class of object the associated values apply to. Missing are the other DwC Classes (Organism, Location, GeologicalContext, Identification, MeasurementOrFact, ResourceRelationship, [UseWithIRI?]). We probably don’t even need this as a record-level term (BTW, why is this nested with dwc:Occurrence? Shouldn’t it be

The second domain consists of qualifiers (subclasses) of either MaterialSample or Occurrence (or possibly Organism?):

MaterialSample

          PreservedSpecimen

          FossilSpecimen

          LivingSpecimen(?)

Occurrence

          HumanObservation

          MachineObservation

          LivingSpecimen(?)

Organism?

          LivingSpecimen(?)
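The two domains above can be sketched as a small lookup, purely as an illustration. The names `DWC_CLASSES`, `QUALIFIER_PARENTS`, and `classify` are hypothetical, and the parent assignments only restate the tentative grouping in the list above (LivingSpecimen is left with three candidate parents because the list marks it with question marks).

```python
from __future__ import annotations

# First domain: values that are names of bona-fide DwC classes.
DWC_CLASSES = {"Occurrence", "Taxon", "Event", "MaterialSample"}

# Second domain: qualifiers and their candidate parent classes.
# LivingSpecimen is deliberately ambiguous, mirroring the "(?)" entries.
QUALIFIER_PARENTS = {
    "PreservedSpecimen": {"MaterialSample"},
    "FossilSpecimen": {"MaterialSample"},
    "LivingSpecimen": {"MaterialSample", "Occurrence", "Organism"},
    "HumanObservation": {"Occurrence"},
    "MachineObservation": {"Occurrence"},
}


def classify(value: str) -> tuple[str, set[str]]:
    """Return which domain a basisOfRecord value falls in, plus related classes."""
    if value in DWC_CLASSES:
        return ("class", {value})
    if value in QUALIFIER_PARENTS:
        return ("qualifier", set(QUALIFIER_PARENTS[value]))
    return ("unknown", set())
```

Under this sketch, `classify("Event")` lands in the first domain, while `classify("LivingSpecimen")` returns all three candidate parents, which is exactly the ambiguity raised above.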

I think the pathway to salvation is to refine the definition of basisOfRecord to be restricted to the names of DwC Classes, so that there is an explicit indicator for each record as to what class of object that record represents. The other things are more aligned with what we’ve been talking about as an Evidence Class. Mostly Evidence underpins Occurrence instances, but it can also underpin Identification instances. The Evidence itself can take various forms, including instances of MaterialSample, Images, “MaterialCitation”, and unvouchered in-situ observations of various kinds (some involving humans, some involving machines, some involving images created by either humans or machines).

I’ve rambled on enough for now, but I completely agree with the following points:

dwc:basisOfRecord needs some clarification in function & purpose
It’s probably time to seriously consider dwc:Evidence
It’s probably time to revisit the star schema to either modify or replace it
We probably need a new Task Group to sort this out (within which Interest Group, though?)

Vaguely related to this, we also need to start seriously considering two more dwc classes (Reference and Agent). We’ve avoided these in the past mostly (I think) because we felt they were outside of our Scope. But DwC already has a Location class, and the TDWG community has embraced AudubonCore, and we have an Annotation group. All of these areas are covered in much more general contexts outside of TDWG-land, but we created them because we have domain-specific needs associated with them. I think the same is true for Agents and References, and they are somewhat related because they both have a lot of relevancy to a possible Evidence class.

@deepreef

Thanks, @qgroom !

@qgroom
Member

qgroom commented Oct 29, 2020

There is also an open issue in GBIF-marine on basisOfRecord iobis/gbif-marine#10

@qgroom
Member

qgroom commented Oct 29, 2020

Note that the vocabulary GBIF uses for basis of record is not the same as that suggested for Darwin Core
https://gbif.github.io/parsers/apidocs/org/gbif/api/vocabulary/BasisOfRecord.html

@qgroom
Member

qgroom commented Oct 29, 2020

@baskaufs baskaufs changed the title from "umbrella issue related to dwc:establishmentMeans and an Evidence class" to "umbrella issue related to dwc:basisOfRecord and an Evidence class" Oct 29, 2020
@baskaufs
Contributor Author

Meant to say "basisOfRecord" but "establishmentMeans" came out of my fingers!

@tucotuco
Member

tucotuco commented Oct 29, 2020 via email

@baskaufs
Contributor Author

I don't have time at the moment to fully engage in this discussion, but I wanted to make one note about the idea of an Evidence class. In our discussions about minting the "Token" class (which should have been called Evidence because that's exactly how it's used) in Darwin-SW, @camwebb and I considered whether such a class was actually necessary or not. The critical thing is actually to have the term "hasEvidence" to link to the evidence. Whether or not we declare that linked thing as an instance of an Evidence class is secondary to the linking. We can infer that it is evidence by use of the linking property. In the case of Darwin-SW, dsw:hasEvidence has a range of dsw:Token, so the act of using the property entails that the connected resource is a dsw:Token automatically.

The TDWG Vocabulary Maintenance Spec disallows such range declarations as part of the core metadata for terms, so we probably wouldn't do that in a dwc:hasEvidence term. But using the term would still imply that the object of the statement is "evidence" regardless of whether we mint a dwc:Evidence class or not. Most if not all objects would already be instances of some other class like dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc. and there wouldn't be any harm in also declaring them to be an instance of a second class (dwc:Evidence). But it isn't clear to me that anything would necessarily be gained by that.
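To make the range entailment concrete, here is a minimal sketch in plain Python that treats triples as tuples. It is not a real RDF reasoner; `apply_range` and the `ex:` identifiers are hypothetical, and only the dsw:hasEvidence / dsw:Token range comes from the discussion above.

```python
RDF_TYPE = "rdf:type"


def apply_range(triples, ranges):
    """Return the input triples plus rdf:type statements entailed by range declarations.

    `ranges` maps a property to the class declared as its rdfs:range.
    """
    inferred = set(triples)
    for _subject, predicate, obj in triples:
        if predicate in ranges:
            # Using the property entails that its object is in the range class.
            inferred.add((obj, RDF_TYPE, ranges[predicate]))
    return inferred


# Hypothetical data: one occurrence linked to one specimen.
triples = {
    ("ex:occurrence1", "dsw:hasEvidence", "ex:specimen1"),
    ("ex:specimen1", RDF_TYPE, "dwc:PreservedSpecimen"),
}

# With the range declared, the specimen is also typed as dsw:Token.
with_range = apply_range(triples, {"dsw:hasEvidence": "dsw:Token"})

# Without a range declaration, nothing new is entailed.
without_range = apply_range(triples, {})
```

The second call corresponds to a term with no range declaration: the link still says "this is evidence", but no extra type statement appears.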

@deepreef

I have a subtly different view of this. I don't see the instances of a class Evidence being the actual items you list (dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc.). Rather I see instances of Evidence as actually being the join between a dwc:Occurrence (or dwc:Identification) and the instance of one of those other classes you mention (among others). In this way, I see a proposed dwc:Evidence class as being analogous to dwc:Identification, the latter of which effectively serves as a join between an instance of dwc:Organism and an instance of dwc:Taxon. I didn't explain that well in the stream-of-consciousness email that @qgroom posted on my behalf above.

@tucotuco
Member

tucotuco commented Oct 30, 2020 via email

@deepreef

@tucotuco -- I'm not sure I follow. The context in which I suggested "or dwc:Identification" was based on a realization we had that exactly the same things can serve as evidence for both occurrences and taxonomic identifications. I think the primary emphasis should be focused on "Evidence of Occurrence" (I believe that's the context that Darwin-SW framed Token/Evidence). For example: "This specimen represents Evidence that this Organism occurred at this Event" (i.e., this Evidence supports this Occurrence). You can replace "Specimen" with "Image" or "Publication" ["materialCitation"?] or "Human Observation" or "Machine Observation" (etc.). What occurred to us when implementing this model is that some of these (especially "Image" and "Specimen", but also things like "DNA sequence" and potentially others) not only serve as "Evidence of Occurrence", but also (potentially) "Evidence of Identification". In other words, "This specimen/Image/DNA Sequence represents Evidence that this Organism is identified as this Taxon." And each piece of "evidence" can simultaneously represent both (i.e., both Evidence of Occurrence and Evidence of taxonomic Identification).

It might make more sense to focus only on "Evidence of Occurrence", but I can see a possible role for "Evidence of Identification" as well.

Incidentally, another way to solve this is, instead of creating a dwc:Evidence class, we could just adopt a standard value of dwc:relationshipOfResource (e.g., "hasEvidence"/"isEvidenceOf") and capture all this within instances of dwc:ResourceRelationship. But I could make essentially the same argument for dwc:Identification (with some tweaks). Elsewhere there are murmurings of tweaking dwc:ResourceRelationship to accommodate a broader array of functions (e.g., adding a sequence term to the class to facilitate linking Agents to various other dwc classes).

Indeed... I look forward to the day when ALL relationships between instances from among and within DwC classes are represented this way (i.e., as instances of dwc:ResourceRelationship). But that's probably a topic for another Issue (or Task Group, or Interest Group...)
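As a rough illustration of the ResourceRelationship alternative, the sketch below writes one relationship row as CSV. The column names are existing Darwin Core ResourceRelationship terms; the value "isEvidenceOf" is only the candidate suggested above, not an adopted vocabulary value, and the identifiers are made up.

```python
import csv
import io

# Existing Darwin Core ResourceRelationship terms used as column headers.
FIELDS = [
    "resourceRelationshipID",
    "resourceID",
    "relationshipOfResource",
    "relatedResourceID",
]

# Hypothetical row: an image serving as evidence of an occurrence.
rows = [
    {
        "resourceRelationshipID": "rr-1",
        "resourceID": "ex:image42",
        "relationshipOfResource": "isEvidenceOf",
        "relatedResourceID": "ex:occurrence7",
    }
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Read left to right, the row says "ex:image42 isEvidenceOf ex:occurrence7"; the inverse direction would swap the two identifiers and use "hasEvidence".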

@baskaufs
Contributor Author

OK, just to try to clarify how I am understanding the situation, here are two diagrams that I think represent the difference between how Rich is describing evidence and how Darwin-SW describes evidence. (DSW does not use the name "evidence", rather it uses "token" but you can consider the two to be interchangeable in the discussion below.)

Rich's evidence model

If I'm understanding correctly, Rich is saying evidence is a "join" (which I would call a node) between occurrence and the thing that is serving as evidence. The type of the evidence node is dwc:Evidence and the type of the thing that is serving as the evidence is dwc:PreservedSpecimen, but could be any of a number of other things like dcmitype:StillImage, foaf:Document, etc. I suppose a dwc:hasEvidence property would link the occurrence to the evidence instance. I'm not sure what property would link the evidence instance to the resource that is serving as the evidence.

DSW evidence model

The way Darwin-SW models evidence is that any type of thing can be the evidence and the dwc:hasEvidence property would connect the occurrence to the resource that is serving as the evidence. The resource serving as the evidence is going to intrinsically have some type (dwc:PreservedSpecimen, dcmitype:StillImage, foaf:Document, etc.) but could also be an instance of the class dwc:Evidence. I am showing the typing as dwc:Evidence in red because it is not clear to me if that statement actually serves any purpose and perhaps is superfluous since the resources serving as evidence already have other more meaningful types. We know that the resource serving as evidence is evidence because we've used the dwc:hasEvidence property to connect it.

There is nothing "wrong" with either of these models. But in my experience, it is best to have the simplest model that allows you to do what you need to do, and no more. There are two reasons why we might need an extra node in the middle like Rich has suggested:

  1. If there are properties that we want to attach to that node that we can't reasonably attach to one of the adjacent nodes (the occurrence or the resource serving as the evidence).
  2. If the node is needed to facilitate one-to-many or many-to-many relationships. For example, if a single "evidence" instance of Rich's type needed to be connected to many resources serving as evidence (as opposed to attaching those many resources directly to the occurrence).

It is not clear to me what the use cases are that would fit either of those two reasons, but Rich may be able to provide them.
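For readers without the diagrams, the structural difference between the two models can be sketched with a few data classes. All class and field names here are hypothetical stand-ins; the `properties` dict on the join node is a placeholder for reason 1 above, not a proposal for specific terms.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    """The thing serving as evidence, with its own intrinsic type."""
    id: str
    rdf_type: str  # e.g. dwc:PreservedSpecimen, dcmitype:StillImage


@dataclass
class OccurrenceDirect:
    """Direct-link (Darwin-SW style) model: occurrence -> resource."""
    id: str
    has_evidence: list[Resource] = field(default_factory=list)


@dataclass
class EvidenceNode:
    """Join-node model: an extra node that can carry its own properties."""
    id: str
    resource: Resource
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class OccurrenceViaNode:
    """Join-node model: occurrence -> evidence node -> resource."""
    id: str
    has_evidence: list[EvidenceNode] = field(default_factory=list)


specimen = Resource("ex:specimen1", "dwc:PreservedSpecimen")

direct = OccurrenceDirect("ex:occ1", [specimen])
via_node = OccurrenceViaNode(
    "ex:occ1",
    [EvidenceNode("ex:evidence1", specimen, {"note": "hypothetical property"})],
)
```

The extra hop in `via_node` only pays for itself if something actually needs to live in `properties`, or if the node is needed to manage cardinality, which are the two reasons listed above.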

@baskaufs
Contributor Author

As far as whether evidence can serve as evidence for an occurrence, or an identification, or both, here is another diagram:

use of evidence diagram

The model of the main classes (identification, organism, occurrence, event, "taxon-ish thing") is according to Darwin-SW, which pretty much originally sprang out of Rich's brain, so I think he is probably thinking of those classes as having the same connections.

The question is whether a kind of evidence can be linked to an occurrence or an identification or both. In this diagram, the same property, dwc:hasEvidence is used to make both of the links. In contrast, the DSW model has two separate properties: dsw:idBasedOn (for connecting identifications) and dsw:hasEvidence (for connecting occurrences). Whether or not it is better to have the same property or two different ones again depends on the use cases. Having a single property would be less complicated since it would involve creating only one new property instead of two. However, if one's goal is to be able to query, then using the same property would require adding an additional screen to the query pattern (is the subject resource an occurrence or an identification). It is not clear to me which is better. But this brings us to a critical point in standards development: we should not be setting standards by what seems right in our brains, but rather defining use cases and then deciding whether the proposed solutions satisfy them or not.
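The "additional screen" in the query pattern can be shown with a small sketch over tuples. The data and function names are hypothetical; the property names are the ones discussed above.

```python
# Hypothetical type assertions for the subject resources.
TYPES = {"ex:occ1": "dwc:Occurrence", "ex:id1": "dwc:Identification"}

# One shared property for both kinds of link.
single_property = [
    ("ex:occ1", "dwc:hasEvidence", "ex:img1"),
    ("ex:id1", "dwc:hasEvidence", "ex:img1"),
]

# Two separate properties, as in Darwin-SW.
two_properties = [
    ("ex:occ1", "dsw:hasEvidence", "ex:img1"),
    ("ex:id1", "dsw:idBasedOn", "ex:img1"),
]


def occurrence_evidence_single(triples, types):
    """Shared property: must also check the type of the subject."""
    return [
        obj
        for subj, pred, obj in triples
        if pred == "dwc:hasEvidence" and types.get(subj) == "dwc:Occurrence"
    ]


def occurrence_evidence_split(triples):
    """Separate properties: the predicate alone identifies occurrence evidence."""
    return [obj for _subj, pred, obj in triples if pred == "dsw:hasEvidence"]
```

Both functions return the same evidence for the occurrence; the difference is that the shared-property version depends on the subject's type being asserted somewhere.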

There is a separate issue that hasn't come up and that is which direction the arrows should point. In Darwin-SW, we defined inverse properties pointing in both directions, but in retrospect I think that was a mistake since it is a burden to providers to have to figure out how to provide both properties and a burden on queriers to have to design complicated queries in the event that providers don't provide both. (This is why we established "preferred" properties within inverse pairs.) Based on my experience it is much better to have the properties point from the "many" direction to the "one" direction (if it's a one-to-many relationship). Since one occurrence can have many forms of evidence, that would argue towards having the properties point towards the occurrence. However, the same piece of evidence could potentially serve as evidence for several occurrences (e.g. a single image capturing multiple organisms). So that isn't necessarily clear, although if we actually figure out how to annotate parts of images, segments of videos and sound recordings, parts of documents, etc. the granularity of describing the evidence might be such that we could say that generally a single piece of evidence only describes one occurrence.

@deepreef

Thanks, @baskaufs -- this is VERY helpful! First, full disclosure: our actual implementation looks more like the DSW model than "my" model. And it works very well. However the thing that has bothered me about it is that the "Evidence" table in our RDBMS implementation is a thin stack of identifiers that map 1:1 with identifiers in various other tables (Media, Reference, CollectionObject [=generalized PreservedSpecimen], etc.). In effect, this makes our "Evidence" function as a superclass of all those other things. But that seems like a distorted view of the universe (i.e., Images and Specimens and such have more inherent/intrinsic value in and of themselves than simply serving as evidence of other things). Moreover, we still need a join table to represent the M:M relationship between Evidence and Occurrence (i.e., each instance of Evidence in this sense may underpin multiple Occurrence instances, and conversely each instance of an Occurrence may be supported by multiple instances of Evidence), which we call "OccurrenceEvidence". In exposing the values of OccurrenceEvidence via DwC, I had imagined using dwc:ResourceRelationship.

So... if our actual implementation is more or less the DSW approach, why have I offered the alternate approach in this discussion? Well in part for the reasons described above, as well as a few other reasons, it seems much more appropriate to represent Evidence as the "join" between the Specimen|Image|Publication|Etc. and the Occurrence[|Identification?] instance, because that is the specific context in which the Subject instance (Specimen|Image|Publication|Etc.) actually functions as Evidence for the Object instance (Occurrence[|Identification?]). This is why I opened the door to representing instances of what we want to characterize as Evidence within dwc:ResourceRelationship, typed via a specific value for dwc:relationshipOfResource.

It seems to me, the justification for establishing a Class of anything in the context of DwC (or any data standard) is to be able to represent it through an identifier, and to attach key properties to it. So this leads to: what are the key properties we would want to attach to an instance of dwc:Evidence? In our implementation, we don't really have any meaningful properties associated with our Evidence sensu DSW (i.e., "superclass" of those other things). The key properties are actually within our OccurrenceEvidence table -- things like evidenceQuality and isPrimarySubject.
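A minimal sketch of such a join table, using an in-memory SQLite database. This is not the actual implementation described above: the table layout is simplified (it joins directly to a Media table rather than to a generic Evidence table), and the column types and the "high"/"low" values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Occurrence (occurrenceID TEXT PRIMARY KEY);
    CREATE TABLE Media (mediaID TEXT PRIMARY KEY);
    -- Join table: the properties live on the link, not on either end.
    CREATE TABLE OccurrenceEvidence (
        occurrenceID TEXT REFERENCES Occurrence(occurrenceID),
        mediaID TEXT REFERENCES Media(mediaID),
        evidenceQuality TEXT,
        isPrimarySubject INTEGER,
        PRIMARY KEY (occurrenceID, mediaID)
    );
    """
)

conn.executemany("INSERT INTO Occurrence VALUES (?)", [("occ1",), ("occ2",)])
conn.execute("INSERT INTO Media VALUES ('video1')")

# One video supports two occurrences, with different link-level properties.
conn.executemany(
    "INSERT INTO OccurrenceEvidence VALUES (?, ?, ?, ?)",
    [("occ1", "video1", "high", 1), ("occ2", "video1", "low", 0)],
)

rows = conn.execute(
    "SELECT occurrenceID, evidenceQuality FROM OccurrenceEvidence "
    "WHERE mediaID = 'video1' ORDER BY occurrenceID"
).fetchall()
```

The same video carries a different `evidenceQuality` for each occurrence, which is why those properties cannot reasonably be attached to the Media row or the Occurrence row alone.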

As to specific comments & questions from @baskaufs :

I'm not sure what property would link the evidence instance to the resource that is serving as the evidence.

I would suggest something like isEvidenceOf?

  1. If there are properties that we want to attach to that node that we can't reasonably attach to one of the adjacent nodes (the occurrence or the resource serving as the evidence).

As I noted above, in our implementation we attach properties like evidenceQuality and isPrimarySubject (not by those names, but that's what they represent).

  2. If the node is needed to facilitate one-to-many or many-to-many relationships. For example, if a single "evidence" instance of Rich's type needed to be connected to many resources serving as evidence (as opposed to attaching those many resources directly to the occurrence).

Yeah, this is pretty much a given. Certainly each Occurrence instance (or Identification instance... if we go there) can be supported by multiple instances that function as evidence (e.g., five humans observed the bird at the pond, two photos were taken of it, then someone killed it and preserved it at a Museum). But likewise, an image/video or a publication could serve as evidence for multiple Occurrence instances.

The model of the main classes (identification, organism, occurrence, event, "taxon-ish thing") is according to Darwin-SW, which pretty much originally sprang out of Rich's brain, so I think he is probably thinking of those classes as having the same connections.

Yup! That diagram in your second post works for me!

we should not be setting standards by what seems right in our brains, but rather defining use cases and then deciding whether the proposed solutions satisfy them or not.

I think this is key, so here are some use cases off the top of my head (I can come up with more):

  1. I have a library of underwater video recordings. Most of the individual recordings focus on a particular Organism (e.g., fish), but the video image also captures many other organisms coming in and out of frame. I would like to document as many Occurrence instances as I can based on this video clip, and ensure that each Occurrence can be traced back to the video clip to serve as the foundation for the Occurrence.

  2. I would like to generate a regional checklist of species that occur within a defined area, and I would like to cite all the evidence to support my assertions that each taxon occurs (or has occurred) at the indicated location. The source of the Occurrence instances include specimens, reported human observations, published distributions, in-situ images, and eDNA samples.

  3. I would like to assess the confidence of a taxonomic identification of an organism based on whether the identification was made with a specimen in hand, or from an image of the organism, or from a DNA sequence, or some combination of these.

  4. I would like to filter a list of Occurrence records based on whether they are supported by preserved specimens, in-situ images, published records, or some combination of these things.
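Use case 4 could look something like the following sketch, assuming each occurrence is linked to zero or more evidence items and each item has a known kind. The link data, the kind labels, and the function name are all hypothetical.

```python
# Hypothetical (occurrence, kind of evidence) links.
evidence_links = [
    ("occ1", "PreservedSpecimen"),
    ("occ1", "StillImage"),
    ("occ2", "StillImage"),
    ("occ3", "PublishedRecord"),
]


def occurrences_supported_by(links, required):
    """Return occurrences whose evidence includes every kind in `required`."""
    kinds_by_occurrence = {}
    for occurrence, kind in links:
        kinds_by_occurrence.setdefault(occurrence, set()).add(kind)
    return sorted(
        occurrence
        for occurrence, kinds in kinds_by_occurrence.items()
        if required <= kinds
    )


with_image = occurrences_supported_by(evidence_links, {"StillImage"})
with_both = occurrences_supported_by(
    evidence_links, {"PreservedSpecimen", "StillImage"}
)
```

A single basisOfRecord value per record cannot express the "some combination of these" case, since an occurrence like `occ1` here is backed by two kinds of evidence at once.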

There's a LOT to unpack here, and I'm not sure if my rantings are in any way helpful to what this Issue was created for, but I do know that dwc:basisOfRecord does not, by itself, allow me to track the kind of information I would like to track (and share it and/or harvest it from an aggregator in the way I would like to be able to access it).

As with so many of these discussions, it's important to separate implementation-specific things from things that are genuinely helpful/important in a data exchange standard like DwC. Also, it's important to focus on actual user needs and actual available data. When I first saw the DSW model, I was satisfied that the first bar had been reached (i.e., it was clear that this idea wasn't restricted to our own implementation). The reason I have stayed quiet on the Evidence class is mostly with respect to the second bar: how much data exists that can actually be represented with this degree of granularity, and who really has a use for it? In my mind that bar was reached via several conversations at both recent TDWG conferences (especially, but not only, discussions related to dwc:basisOfRecord). In other words, I think we may be close to critical mass on a minimum threshold of available data and expressed need that it may be time for this community to "go there" with respect to Evidence.

I don't know about others, but this is exactly the kind of discussion I was hoping would emerge from this. My main fear is that the only ones who find this discussion important and worthwhile are @baskaufs and I.

@tucotuco
Member

tucotuco commented Oct 30, 2020 via email

@deepreef

deepreef commented Oct 30, 2020

I think @tucotuco is probably right -- we'd need to collectively understand (and agree on) the ontology of this stuff before it makes sense to flesh out a DwC class. A year ago I would have said the ontological representation is pretty solid (i.e., DSW model), but as evidenced by the discussion here, it still needs some clarification (in my own mind, at least).

Edit: "evidenced" not intended as pun, but kinda works out that way.

@dagendresen
Contributor

dagendresen commented Oct 31, 2020

I enjoyed this thread a lot! Not sure if helpful, but here is a fringe case question that got stuck in my mind.

What happens if two people co-collect/co-observe a dwc:Occurrence together and both are listed together in dwc:recordedBy? We normally record this as one single species dwc:Occurrence - assigning one single dwc:occurrenceID. But if we for some reason want to talk about the observation of the species-occurrence (dwc:Occurrence) by each person separately, would this create two new dwc:Occurrences? (In total three dwc:occurrenceIDs - I'd guess maybe not?) Or might there maybe be a use case for two (or three?) instances of an Evidence here??? [Related - Is the person/Agent needed to make a species-occurrence into a dwc:Occurrence, would it be a dwc:Occurrence if nobody recorded it? (provided Evidence)]

Next, what happens if a third person, maybe not a naturalist at all, maybe e.g. a journalist, or an amateur photographer, or another researcher, observes the two collectors (dwc:recordedBy/Agents) making the species observation (dwc:Occurrence?) together? (Maybe publishing the photograph in a newspaper, or a scientific paper). Might even this case generate another dwc:Occurrence, with another dwc:occurrenceID (I'd guess not ??), or might this maybe generate another instance of Evidence??? (Maybe simply an Image made to have the role as Evidence by a hasEvidence statement?) [Might this be similar in any way to the use case of the literature-occurrences that Plazi and BHL work with ??]

+1 @tucotuco for thinking of BCO, however, also +1! @deepreef for basisOfRecord overloaded! and the need for doing something

@dshorthouse

Forgive my ignorance in the nuance that is being expressed here, but this feels like turtles all the way down. It reminds me of how Crossref in its early days wrestled with WHAT gets a DOI. At that time publishers were experimenting with multiple digital outputs, views, and file formats of scientific articles beyond landing pages and PDFs. They finally settled on a rule: "A Crossref DOI should point to one intellectually discrete scholarly document". And, Crossref then developed a suite of tools and services (coincidentally reused to detect plagiarism) to police the assignment of DOIs to duplicative content. Bad behaviour had real financial consequences for members and could mean getting tossed from the playground. The point here is that Crossref and its members settled on a precise definition for the entities in the collective sandbox & then got on with their business. It's not entirely fair to compare our entities to scientific papers that come ready-made with tidy little citation graphs, but there's also a message here to be ever mindful of how we expect our equivalent "intellectually discrete scholarly documents" to be used and linked. If there is confusion over WHAT are our grains of sand when drawn into the aggregate then we have a very real problem; it does not instil confidence and the citation graph will be a dull mat of cold mud.

@deepreef

deepreef commented Oct 31, 2020

@dagendresen : We have a similar conundrum in our world: A team of three divers go on a dive together. They stay pretty close to each other during the dive. Each has his/her own video camera, and records a whole bunch of video clips during the dive. Because the divers are close together, they often capture video images of the same individual organisms (e.g., rare fish, big shark swims by, etc.). A not-uncommon circumstance is that two of the divers are filming the rare fish/shark, and the third diver is filming the other two divers filming the rare fish/shark. Scenarios such as this are more the norm than they are the exception in our real-life world.

The most vexing issue for us isn't even about the Occurrences (that's relatively straightforward -- see below). The thing we argue about is: How many Events are in play here? Our Event model is hierarchical, so we have one top-level event for the "Expedition". We then typically generate a "Team Dive" event as a child of the "Expedition" Event. From there, things get squirrelly. In most cases, we don't define any more granular events than this. The main problem is that properties like minimumDepthInMeters/maximumDepthInMeters need to be attached to the Occurrence instance (rather than where they really belong, which is the Location instance -- or at least the Event instance), so we can capture accurate/precise depth values for each Occurrence established during the dive. If we want to avoid that problem, then we need to define lots of sub-events for each "Team Dive", so that the depth values can be correctly assigned. The problem, though, is a massive inflation of Event instances (at least), or Location instances -- because we know the depth with precision, and therefore effectively every video clip becomes its own Event (and Location). Another way to parse out subevents is to create three "Person Dives" as child events to the parent "Team Dive", such that each diver's experience and set of documented occurrence instances represent a distinct Event, separate from the other divers' Event-experiences. And then things get really complicated when we try to parse events along both pathways, in which case there are potentially hundreds of events on a single dive.

That was a digression from your questions, but I wanted to explain that the definition of an Event comes into play here as well (and also the question of whether properties like minimumDepthInMeters/maximumDepthInMeters apply to the Occurrence, Event, or Location instance).

So getting back to your question, let's ignore how we parse the Event, and focus on the three divers, each with their own video cameras. Two of them find a rare fish and film it, while the third diver films the first two divers filming the rare fish. The way we capture this in the current implementation is as follows (simple form):

1 Event (however parsed it is)
1 Organism (rare fish)
1 Occurrence (Intersection of Organism and Event)
3 Media files (video clips from each of the divers -- you can see the fish in the third diver's video)

In the DSW model, the 3 Media files = 3 Evidence instances -- which is how our current implementation works. But in my current thinking, there are three Media Files which are media files, and three separate instances of "Evidence", one from each Media file linked to the 1 Occurrence (i.e., three records in a join table between Media instance and Occurrence instance).

Now, let's review how we actually capture this in the current implementation:
1 Event (however parsed it is)
4 Organisms (one rare fish, three humans)
4 Occurrences (one for each of the four Organisms at this Event)
3 Media files (video clips from each of the divers)

In the existing (DSW-like) implementation, we have 3 Evidence instances -- representing each of the 3 video files (as above). However, in the model representing my current thinking, we have at least 5, and perhaps 6 instances of Evidence:

  1. Intersection of first diver's video of rare fish
  2. Intersection of second diver's video of rare fish
  3. Intersection of third diver's video of rare fish
  4. Intersection of third diver's video with first diver
  5. Intersection of third diver's video with second diver
  6. Presence of third diver at Event, evidenced either via recordedBy (per our earlier discussion on Agent As Organism), or via HumanObservation.
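
The counts above can be reproduced with a small sketch that treats each Evidence instance as one row joining a Media file to an Occurrence. All identifiers are hypothetical, and the `kind` labels are free text rather than terms from any controlled vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    occurrence_id: str  # the Occurrence being supported
    source_id: str      # a media file, or a non-media assertion
    kind: str           # free-text label, not a controlled vocabulary


# Hypothetical identifiers: which Occurrences are visible in each video clip.
VISIBLE_IN = {
    "video-diver1": ["occ-fish"],
    "video-diver2": ["occ-fish"],
    "video-diver3": ["occ-fish", "occ-diver1", "occ-diver2"],
}

# One Evidence instance per (Media file, Occurrence) pair, i.e. a join-table row.
media_evidence = [
    Evidence(occ, media, "video")
    for media, occs in VISIBLE_IN.items()
    for occ in occs
]

# Optional sixth instance: the third diver's presence, which no video shows.
all_evidence = media_evidence + [Evidence("occ-diver3", "recordedBy", "observation")]

print(len(media_evidence), len(all_evidence))
```

Three media files still exist as three Media records; the five or six rows are the separate Evidence links between those files and the four Occurrences.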

Now, my answers to your specific questions:

But if we for some reason want to talk about the observation of the species-occurrence (dwc:Occurrence) by each person separately, would this create two new dwc:Occurrences?

No. The Occurrence is the intersection of the Organism and the Event; so only one Occurrence no matter how many observers or pieces of evidence.

Or might there maybe be a use case for two (or three?) instances of a Evidence here???

Potentially. I don't think we would bother parsing out each person's observation as a separate piece of evidence; but I guess you certainly could if you wanted to.

[Related - Is the person/Agent needed to make a species-occurrence into a dwc:Occurrence, would it be a dwc:Occurrence if nobody recorded it? (provided Evidence)]

In the absence of any evidence, how would we ever know to generate the data record? (If a tree falls in the woods it does make a sound, but someone needs to document that sound in order to record the circumstance of its falling.)

Please note: the scenarios I described above are NOT edge cases -- in fact, they are representative of the majority of our data related to video-based occurrence records. As I was typing all that, I was thinking about how to show an example, and one occurred to me: https://www.youtube.com/watch?v=3fI2QxUAv1g. That one is actually almost perfect for the hypothetical presented by @dagendresen . We have three naturalists gathering biological data, one in the form of video (John Earle) and two collecting specimens (Richard Pyle and Brian Greene). We also have a journalist (Bob Cranston; BBC cameraman) and his assistant holding the lights (Peter Kraugh). John Earle's video is focusing on the non-human organisms, whereas Bob Cranston's video is capturing both the human organisms and the non-human organisms.

I don't remember whether we ever processed this particular set of video clips, but there are potentially hundreds of Organisms (if you count all the corals and fishes), five humans, and several lines of evidence (videos, observations, collected specimens). Interestingly, I think that at least three of the collected specimens in that series of videos ended up as Holotypes (for Chromis abyssus, Prognathodes geminus, and Tosanoides annepatrice). It might be fun to use this set of video clips to explore how we would capture all the relevant information, so that if I asked the question "What species of fishes live on deep coral reefs in Palau?", I would be able to include all these lines of evidence as the foundation for Occurrence instances to build my checklist.

It would also be fun to explore other questions, like: How do the video clips function as Evidence of Identification as well as Evidence of Occurrence? Does the audio portion of the video (e.g., my helium voice proclaiming "Prognathodes" and "abei") count as separate pieces of evidence-of-identification from the image captured in the video? If we extract a frame from the video and publish it as a still image (as we have), does that count as a separate piece of evidence? It is a separate media file, after all - even though it's contained within the "parent" video media file. And here's a good one: Surely this represents a legitimate example of evidence-of-identification. But should it also serve as an instance of evidence-of-occurrence? What if instead of a fin clip, the sequence was obtained from an eDNA analysis of a water sample taken at the same event?

My head is about to explode, so I'd better stop here.

@dagendresen
Contributor

@deepreef When asking if the Agent is needed to make an Occurrence, I had in my mind (the misconception?) that the Occurrence was the intersection of the organism at a place and time (Event) and when recorded by an observer (what-organism, where-location, when-eventTime, whom-recordedBy). You teach me here that the whom-recordedBy is not part of this scenario?

At the GBIF nodes meeting in Portugal I was playing with this together with a colleague. We both took a photo (for iNaturalist) of the very same butterfly larva eating the very same brassica plant at the very same time (we counted down before pressing the camera button). We wanted to play with the question of whether this created one or two Occurrences.

My Occurrence: https://www.inaturalist.org/observations/3031052
My colleague's Occurrence: https://www.inaturalist.org/observations/3029206

We tied them together by attempting to machine-tagging them with the same eventID. In light of your model of the Occurrence as the intersect of the Organism and the Event, I have learnt now that these two (instances of Evidence?) are for the same Occurrence?

The origin of our thought game was at the time also in part that different observers might not even be aware of the other declaring a dwc:Occurrence for the shared species-observation. In the real world, we imagined a large group of bird-watchers flocking to a site where a rare bird had been reported. (In Norway I learnt there is an SMS message service that bird-watchers subscribe to and that they might travel far to watch a bird.) Maybe each bird-watcher will declare their own dwc:Occurrence for the same bird in a given citizen science platform to photo-voucher the evidence for including the bird species on their individual list of bird species they have seen. In this competitive bird-watching space, would we instruct them that ONLY the first reported sighting of a bird counts (as the Occurrence)? Are a few seconds/minutes/hours between recordings anyway sufficient to count as distinct Occurrences? In my mind, I thought that the different observer in recordedBy by itself was sufficient :-)

The other scenario with your BBC cameraman and the potential hundreds of organisms recorded in the video, that COULD be parsed out -- makes me think of the 2014 GBIF Ebbe Nielsen winner Vijay Barve investigating if images shared on social media such as Flickr, Facebook, etc could be untapped sources of occurrence-data.

[Apropos your digression example -- Is it the depth of the video-camera or the inferred depth of the organism in the occurrence that is the most appropriate attribute value here? Maybe even both? However, do you always need to explicitly declare all Location nodes just because you have precise depth measurements? Even if these distinct Locations are evident from having the depth reading. It is indeed cumbersome but possible to talk about these Locations eg. as "the location associated with the MeasurementOrFact with measurementOrFactID = urn:uuid:nnn".]

I hope this is still useful for the topic of the thread on basisOfRecord and Evidence. My main interest in the thread is rather (than the above) the distinction between the classes in the basisOfRecord vocabulary -- and in particular how to describe specimens as MaterialSample and as Evidence for an Occurrence.

@dagendresen
Contributor

dagendresen commented Oct 31, 2020

... so if the Occurrence is the intersection between the Organism and Event, then we do need something else such as a new Evidence class for all the real-world things that have occurrenceID today??

... would maybe (not saying I think it is) an alternative possibly be a new class OrganismEvent (OrganismOccurrence, SpeciesOccurrence or similar) and renaming Occurrence to Evidence or OccurrenceEvidence (...). [Because Occurrence maybe is misused (?) for very many of the things that currently have occurrenceID assigned?]

Apropos multiple occurrences in the same photo/video (Evidence) -- the same museum specimen - the thing with a catalogNumber - can also be the evidence of multiple Occurrences, when two plants are mounted on the same paper (to save paper) ... or when we start to extract DNA evidence of microorganisms on/inside the specimens, or collect Salmon louse from the ichthyology fish collection [2] that next are accessioned with their own catalog numbers --- and thus PreservedSpecimen as Occurrence does not work here??!

@deepreef

deepreef commented Oct 31, 2020

@dagendresen :

When asking if the Agent is needed to make an Occurrence, I had in my mind (the misconception?) that the Occurrence was the intersection of the organism at a place and time (Event) and when recorded by an observer (what-organism, where-location, when-eventTime, whom-recordedBy). You teach me here that the whom-recordedBy is not part of this scenario?

I'm not sure I follow. In my mind, the Occurrence is the intersection of the Event+Organism. So maybe a better way to answer your question is: Gazillions of Occurrences exist every moment, but only a tiny subset of them get into our databases -- and in most cases, that tiny subset corresponds to the ones where: 1) an Agent was present; and 2) the Agent documented the Occurrence in a form that finds its way into our databases (and thereby gets issued an occurrenceID). (This assumes Machines can count as Agents, and does not take into account the side discussion we had about Agents as Organisms.)

recordedBy is certainly an important property of an Occurrence, but I wouldn't call it a definitive one. Definitive = Event+Organism. If we take Event as "Where+When", and Organism as "What", then we have Occurrence=Where+When+What. The "ByWhom" part is important, but not definitive (and perhaps plays more into Evidence).

In your butterfly example, I would consider it to be one (non-human) Occurrence (Where+When+What, with the "What" being the butterfly). You now have two instances of Evidence (two photos). Or four if you want to add the two HumanObservations (kind of redundant, but still Evidence). And if you killed the butterfly and preserved it in a Museum, you could add a fifth instance of Evidence (PreservedSpecimen).
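
A minimal sketch of that reasoning, assuming hypothetical identifiers: if the key for an Occurrence is built only from the Organism and the Event, the two photo records fall under one Occurrence, and each photo becomes a separate piece of Evidence for it.

```python
from collections import defaultdict

# Hypothetical records; the identifiers are placeholders, not real occurrenceIDs.
RECORDS = [
    {"organism": "larva-1", "event": "event-1", "recordedBy": "observer A", "evidence": "photo A"},
    {"organism": "larva-1", "event": "event-1", "recordedBy": "observer B", "evidence": "photo B"},
]


def occurrence_key(record):
    """Where+When+What: recordedBy is deliberately left out of the key."""
    return (record["organism"], record["event"])


evidence_by_occurrence = defaultdict(list)
for record in RECORDS:
    evidence_by_occurrence[occurrence_key(record)].append(record["evidence"])

print(len(evidence_by_occurrence))                     # number of Occurrences
print(evidence_by_occurrence[("larva-1", "event-1")])  # Evidence for that Occurrence
```

Adding the two HumanObservations or a PreservedSpecimen would simply append more entries to the same list without creating a new Occurrence.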

In light of your model of the Occurrence as the intersect of the Organism and the Event, I have learnt now that these two (instances of Evidence?) are for the same Occurrence?

That's how I would model it, yes. Or, I guess I should say, that's how our current implementation models it. As per our side conversation, I might be more inclined to model it as three Occurrences -- two human and one non-human Organisms intersecting at the same Event. It would have been great if you were on opposite sides of the butterfly such that your image-based evidence captured each other in the frame as well as the Butterfly!

The origin of our thought game was at the time also in part that different observers might not even be aware of the other declaring a dwc:Occurrence for the shared species-observation.

Indeed, this is something we deal with not infrequently in the real world as well. Two divers at the same place and time each record the same fish with their respective cameras, but not at the same moment (e.g., one on the way down, and the other on the way back up). By default, the fish is assumed to be a different organism for each of the two video clips. However, we sometimes discover that both divers captured video of the same individual fish, in which case we collapse the two organisms as the same, and usually our events are defined broadly enough to be the same as well, which means that the two occurrences also collapse as the same. If we decide to separate the two divers' dives into separate events (as mentioned previously), then we still collapse the Organism instance into one, but the Occurrences end up as separate (Same "What", but possibly different "Where" and/or "When").
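
The collapsing logic described above can be sketched as follows. The clip records, the merge map, and the two ways of assigning Events are all assumptions made for illustration; the point is only that the number of Occurrences depends on both the Organism merge and the Event granularity.

```python
def occurrences(clips, merged_organisms, event_of):
    """Return the set of (organism, event) pairs implied by the clips."""
    return {
        (merged_organisms.get(clip["organism"], clip["organism"]), event_of(clip))
        for clip in clips
    }


# Hypothetical clips: by default each clip is assumed to show a different fish.
CLIPS = [
    {"organism": "fish-a", "diver": "diver-1"},
    {"organism": "fish-b", "diver": "diver-2"},
]


def team_dive(clip):
    """Broad Event definition: the whole team dive is one Event."""
    return "team-dive"


def person_dive(clip):
    """Granular Event definition: one Event per diver."""
    return "person-dive:" + clip["diver"]


# Later discovery: both clips show the same individual fish.
same_fish = {"fish-b": "fish-a"}

print(len(occurrences(CLIPS, {}, team_dive)))           # before the merge
print(len(occurrences(CLIPS, same_fish, team_dive)))    # merged, one broad Event
print(len(occurrences(CLIPS, same_fish, person_dive)))  # merged, per-diver Events
```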

In your bird scenario, the real-world problem is that these DO get reported as multiple distinct Occurrences, which can give the false impression to the data consumer that 20 different individuals of the same rare bird occurred at the same place and (roughly) the same time. It would be nice in such cases to have a global mechanism to collapse the dwc:individualID value (= identifier for the Organism instance) so that it isn't misleading in the aggregated data. Whether or not the Events are also collapsed into one (resulting in a single Occurrence instance) depends on how granular one wants to be in defining Event boundaries (dwc:eventTimeUncertaintyInSeconds, anyone?)

In any case, one of the main reasons why we (and, I assume @baskaufs and others in DSW context) recognized the need for a "Token"/"Evidence" class was to deal with exactly this issue -- i.e., that there can often be multiple lines of Evidence to document the same Occurrence instance.

makes me think of the 2014 GBIF Ebbe Nielsen winner Vijay Barve investigating if images shared on social media such as Flickr, Facebook, etc could be untapped sources of occurrence-data.

Not only could they be, they absolutely are! About 10 years ago, Rob Whitton and I conceived a plan to build a crowdsourcing platform on Explorer's Log to do exactly this sort of thing, but we never followed through. We may yet, though...

Is it the depth of the video-camera or the inferred depth of the organism in the occurrence that is the most appropriate attribute value here?

Technically (and in the ideal scenario), it's the depth of the dive computer on the diver's rebreather, time-synched with the timestamp on the video camera. But we assume +/- a couple meters, and the diver is usually horizontal to, and within a couple meters of, the subject. In rare cases where there is a meaningful difference, we estimate and record the depth of the Organism, not the depth of the diver (unless the diver is the Organism...)

However, do you always need to explicitly declare all Location nodes just because you have precise depth measurements?

No. In fact, we usually don't. That's why we "cheat" and record the depth at the Occurrence. Rob and I argue about this a lot -- I want to at least push it to Event (if not Location), but Rob doesn't want to populate a gazillion nearly identical Event (& Location) records to the point where they approach 1:1 with Occurrences. My counterpoint to him is that if we do ever extract those hundreds of "other" Occurrences from all those video clips, we will no longer suffer a near-1:1 ratio of Occurrence & Event (or even Location). As an aside, we've decided internally that -- for now at least -- "Location" describes a two-dimensional footprint on the surface of the earth, and any depth/elevation values (z-axis, 3rd dimension) are properties of the Event, not the Location. Yet another topic for another thread.

... so if the Occurrence is the intersection between the Organism and Event, then we do need something else such as a new Evidence class for all the real-world things that have occurrenceID today??

This gets at the heart of why I've been thinking about this for more than a decade, but am only making noise about it now. I don't know if the TDWG community is "There" yet. We do progress over time (we're a LONG way from where we were in the early days of DiGIR). But if you try to push things too hard/too fast, they sometimes break. We'll see if the discussion on Evidence as a Class takes root this time, or needs to go back into hiding for another few years or a decade or so.

... would maybe (not saying I think it is) an alternative possibly be a new class OrganismEvent (OrganismOccurrence, SpeciesOccurrence or similar) and renaming Occurrence to Evidence or OccurrenceEvidence (...). [Because Occurrence maybe is misused (?) for very many of the things that currently have occurrenceID assigned?]

I would regard that as the greater of two evils.

Apropos multiple occurrences in the same photo/video (Evidence) -- the same museum specimen - the thing with a catalogNumber - can also be the evidence of multiple Occurrences, when two plants are mounted on the same paper (to save paper) ... or when we start to extract DNA evidence of microorganisms on/inside the specimens, or collect Salmon louse from the ichthyology fish collection [2] that next are accessioned with their own catalog numbers --- and thus PreservedSpecimen as Occurrence does not work here??!

Exactly. MANY examples exist where one MaterialSample instance includes multiple Organisms. What we call a "Specimen" is vague, but even in the traditional sense, parasites are an obvious example (until they are removed from the host and cataloged separately).

I'm going to assume at this stage that @dagendresen and I are the only ones actually following this discussion, and I therefore apologize to everyone else. But all this stuff needs to be discussed somewhere, some time, and at some point, and it is directly tied to the "Problem" of dwc:basisOfRecord. Maybe this is not the right time or place (=Event) to have this in-depth discussion. But I wouldn't be spending a non-trivial part of my Saturday morning banging away at it, if I didn't think it was (ultimately) important for our community.

@dagendresen
Contributor

Many thanks for engaging!! This thread is very educational for me!

In my example when we, at the time, were thinking of what=Taxon (=scientificName) + where=Location (=decimalLatitude+decimalLongitude) + when=eventTime (or eventDate) + agent=recordedBy as the immutable "data" that decided what was the same Occurrence (to be identified by the same occurrenceID) our thinking was MUCH less complex than your thinking!! And also rather influenced by trying to make sense of how we observed the concept of Occurrence was applied and used for real-world datasets [more bottom-up from data and less top-down from ontological thinking].

@dagendresen
Contributor

I wish to return to my very first experience of problems using basisOfRecord. This was when trying to link seedbank collections data to GBIF (for me starting back in 2004). At this time there was no basisOfRecord = LivingSpecimen or basisOfRecord = MaterialSample yet. However, even later I have always found LivingSpecimen to be much more suitable for botanical garden specimens than for seedbank "specimens", or Accessions as the seedbanks normally call them. MaterialSample offers many more possibilities.

Brief summary of this use case:

For originally in situ wild or on-farm source material (1) seeds are collected in situ in the wild or "in situ"/on-farm from a regionally localized traditional farming context. This material aligns well with the Darwin Core concepts; and in situ/on-farm collected seeds as dwc:Occurrence works fine (except from some missing terminology addressed in the Darwin Core Germplasm extension, https://doi.org/10.17161/bi.v8i1.4095 & https://doi.org/10.13140/2.1.1207.3923).

The Bioversity collecting mission database (https://doi.org/10.15468/ulk1iz) holds examples of such in situ and on-farm material.

Next (2) the collected seed material is multiplied through seed multiplication ex situ (grown on lands at agricultural field stations and new more numerous seeds harvested) and included in a seedbank -- not too different from museum collections in function. Treating these seedbank seed samples as PreservedSpecimen or LivingSpecimen (or rather MaterialSample) is more or less acceptable.

The European Genetic Resource Search Catalog (EURISCO) (https://doi.org/10.15468/a3lnmd) holds examples of seed bank accessions.

BUT next (3) seed samples are distributed to other parties and very often also to other seedbanks. These seed samples are assigned other "catalog numbers". Other public seedbanks assign new accession numbers and unfortunately too often lose the provenance link to the parent seed sample material (more often because of unreliable material identifiers than the lack of trying). Private crop breeding companies assign breeding-line numbers and start a genetic selection for a reduction of genetic diversity to fit agricultural needs - and at the same time also an increase in genetic diversity by crossing with other breeding lines. Differences both between different seedbank accessions and also against breeding-lines are of vital importance here. The identity of these derived material seed samples is ultimately MUCH more important here than the link to the original source material that is more appropriately representing the Occurrence concept!

The UN FAO ITPGRFA Global Information System (https://ssl.fao.org/glis/) holds examples of seed distribution for public seedbank accessions. [All public seedbank material distributed is identified by a machine-readable DOI, each time it is distributed]

Furthermore, when (4) seedbank Accessions and the breeding-lines result in a new (commercial -- or public pre-breeding) cultivar, seed material from the cultivars/varieties enters public seedbanks when licensing periods end and/or cultivars are no longer in the market for sale. It is thus used again as new raw material in breeding programs towards yet another new (commercial) cultivar. [Something new is clearly created here that is no longer the same as the source thing identified by the original occurrenceID]

If the seed MaterialSamples were to simply be Occurrences of type basisOfRecord = PreservedSpecimen or LivingSpecimen then the full line of parent-child descendants from the in situ/on-farm source material down to the seed sample Accessions and breeding-lines would share the same occurrenceID identity???? [At this time there was NO MaterialSample and no materialSampleID in Darwin Core, which might help a lot!!]

This is my rationale and why I came up with a huge problem of accepting "Specimens" as Occurrences - or rather seed material samples as Occurrences.

@deepreef, sorry to just throw out another complex use case. But thought it might be useful to declare my actual primary interest in basisOfRecord issues and thus my primary interest in this thread. [PS: I would not have the conscience to go so deep in this thread during charged working hours at the museum - the weekends are my window for this type of fun]

@deepreef

deepreef commented Nov 1, 2020

@dagendresen : Thanks for the detailed use case! I've never considered a use case like this before, so I found it helpful to see how well my own thinking of these various entities/classes work when modelling a novel (to me) situation as you describe.

I think I understand your description to involve several generations of the plants -- correct? In other words, Material (1) seeds from wild/on-farm are collected at Event (Location+Time) 1. Collectively, the seeds represent an instance of "Organism" (which accommodates more than one individual, when appropriate), and their presence at the collection Event constitutes an Occurrence. I'll call these Organism1, Event1 and Occurrence1.

If I understand correctly, these same seeds are moved to a different location at a different time at (2), which means the same Organism1 intersects with a new Event (Location+Time; Event2), and hence yields Occurrence2. Correct?

This next step is where I'm a bit unclear. Do I correctly interpret this part:

collected seed material is multiplied through seed multiplication ex situ (grown on lands at agricultural field stations and new more numerous seeds harvested) and included in a seedbank

...to mean that the originally collected seeds (Organism1) are germinated and grown and bear new seeds of their own, and those new seeds are then harvested for distribution to a seedbank? If so then this second generation of seeds represent a new instance of Organism (Organism2), and the same place but different time (Event3), and hence represent a new Occurrence3? We could also document the original (now grown) Organism1 at Event3 as representing Occurrence4.

Your step 3 seems to be a situation where Organism2 is now relocated again (other parties/seedbanks), at a new Location+Time (Events), and new Occurrences accordingly. Each new generation represents a new Organism, and each new documented instance of the Organism at Location+Time (Event) constitutes a new Occurrence.

It seems to me that the key piece of information you need to track is the pedigree of the Organisms. I'm sure Zoos with breeding programs and in-situ evolutionary/ecological studies need to track this same kind of information, and there are at least two ways to do that in DwC: either in a simple way via dwc:associatedOrganisms, or in a more structured way using dwc:ResourceRelationship, with a value of dwc:relationshipOfResource something like "mother of" (as given in the example here).
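
As a rough illustration of the more structured option, the sketch below stores pedigree links as rows using the Darwin Core ResourceRelationship terms resourceID, relationshipOfResource and relatedResourceID, then walks them upward. The organism identifiers are hypothetical, and the helper function is an assumption for illustration, not part of any Darwin Core tooling.

```python
# Hypothetical ResourceRelationship rows; identifiers are placeholders.
RELATIONSHIPS = [
    {"resourceID": "organism-1", "relationshipOfResource": "mother of", "relatedResourceID": "organism-2"},
    {"resourceID": "organism-2", "relationshipOfResource": "mother of", "relatedResourceID": "organism-3"},
]


def ancestors(organism_id, relationships=RELATIONSHIPS):
    """Follow "mother of" links upward, nearest ancestor first."""
    # Map each offspring to its parent: the subject (resourceID) is the mother
    # of the object (relatedResourceID).
    parent_of = {
        row["relatedResourceID"]: row["resourceID"]
        for row in relationships
        if row["relationshipOfResource"] == "mother of"
    }
    lineage = []
    current = organism_id
    # Stop at the root, or if a parent repeats (guards against cyclic data).
    while current in parent_of and parent_of[current] not in lineage:
        current = parent_of[current]
        lineage.append(current)
    return lineage


print(ancestors("organism-3"))
```

A seed lot would need one row per parent, so material with several parents (for example a cross between breeding lines) would call for a list of parents per offspring rather than the single-parent map used here.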

The other issue raised by your use case is the fuzzy boundary between dwc:Organism and dwc:MaterialSample. I still wrestle with that, and I honestly haven't figured out yet how to draw that line, other than "alive" vs. "dead". I'm not sure if that question is part of this Issue, or needs to be explored in a new Issue.

@dagendresen
Contributor

dagendresen commented Nov 1, 2020

@deepreef Yes, several generations of plants (plant populations of similar-ish genetic variation). The plant material (seeds) in seedbanks are conserved and treated as the same Accession (same accession number, aka dwc:catalogNumber) through several generations. The Accession is grown ex situ for multiplication of seed stocks at new locations, with the goal to be maintained as genetically similar as possible.

I was until now thinking of the same Accession (aka genotype) through several generations, grown again at new locations, as still remaining the same Organism - as is the tradition at seedbanks. Considering each new intersect of the "Accession" as Organism with a new Event (ex situ) in a field station as a new Occurrence is what I did not quite dare to do, but I DO AGREE, it makes very good sense. [Did the definition of Occurrence change to no longer require that the organism occurrence is in its natural habitat in situ? Or did I just fool my mind with the assumption of such a limitation?]

I believe that seedbanks already normally keep track of the harvest year for all material that is distributed. So catalogNumber + harvest year makes identifying this new "Occurrence2" quite possible -- and if modeled so also the "Organism2" can thus be identified in practice. The thought of a new Organism for each harvest year is new to me (I might still remain inclined to model the material in step 2, new generations of plants at the same seedbank, as the same Organism - but I do think this model is possible to make based on the current metadata already documented in the seedbank databases). In my head it is step 3 which makes a new Organism.

The pedigree is also maintained in seedbank databases but only per accession number (or often only per cultivar name) and most often not per harvest year (generation). With the rather recent new UN FAO ITPGRFA GLIS system assigning new DOIs for each seed transfer it is from now onwards absolutely possible to reconstruct the pedigree to the accession + harvest year.

This was definitely a new way to look at the model! Thanks a zillion!!!

@rondlg

rondlg commented May 10, 2021

This is a really interesting conversation but sorry, dumb question - what was the final decision?

@qgroom
Member

qgroom commented May 10, 2021

@rondlg The conclusion is that it needs a task group.
I was thinking this would be a good topic for the Interest Group meetings after the TDWG Conference.

@deepreef

I was thinking this would be a good topic for the Interest Group meetings after the TDWG Conference.

I agree, and would like to be actively involved. There is also discussion along these lines within the MIDS group.

@baskaufs
Contributor Author

I would also like to be involved.

@rondlg

rondlg commented May 10, 2021

For what my thoughts and the CD group's are worth, I ditto Steve and Rich.

@albenson-usgs

I haven't read through this whole thread but does it relate to this one? #314 I'm also interested to be involved if so.

@deepreef

@albenson-usgs :

I haven't read through this whole thread but does it relate to this one? #314 I'm also interested to be involved if so.

Yes.

Also, I suppose that one of us who has expressed interest in being actively involved will have to step up and lead the effort. But, I suspect that most/all of us are in the same position of having way too much on our plates already!

@albenson-usgs

If we are targeting this task group for after TDWG (i.e. starting in November) I may be able to lead it.

@gdadade

gdadade commented May 25, 2021

It was mentioned numerous times that an occurrence is not equal to a specimen or a sample. It rather is the aggregated information based on multiple physical and digital “things” (e.g. specimen, sample AND observation AND media all collected/recorded at the same time and place), hence often ONE occurrence has MULTIPLE basisOfRecords. From my point of view this confusion and struggle to redefine a basisOfRecord is caused by the occurrence-centered way we are using DwC (and ABCD).

Excuse my collection based point of view, but collections are not organized as occurrences. And I think if a model works for the complexity of collections it can also deal with observations and events plus occurrences could be easily aggregated.

Since GGBN is a network of collections dedicated to specimen-sample-subsample relationships the occurrence model wasn’t working for us. That’s why we (GGBN) use a (collection)object based model, which also works for observations and environmental samples/DNA in my humble opinion. Relationships between physical or digital “things” (e.g. eVoucher) are done through the relatedResourceClass (UnitAssociation in ABCD) in ONE direction. By doing so related data can easily be connected and aggregated to whatever is needed (e.g. an occurrence or in case of GGBN a DNA sample with all its related parent tissue sample, specimen etc.). The relationshipType must be defined by controlled/recommended vocabulary of course.
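
A loose sketch of the object-based idea, under stated assumptions: each object carries a single one-directional link to the object it relates to (here a hypothetical `parent` field, standing in for a relationship with a controlled type), and an aggregator follows those links to group everything that belongs together. The object identifiers and type labels are invented for illustration and do not come from GGBN or ABCD.

```python
from collections import defaultdict

# Hypothetical collection objects; "parent" stands in for a single
# one-directional relationship with a controlled relationship type.
OBJECTS = {
    "specimen-1": {"type": "specimen", "parent": None},
    "tissue-1": {"type": "tissue sample", "parent": "specimen-1"},
    "dna-1": {"type": "DNA sample", "parent": "tissue-1"},
    "image-1": {"type": "eVoucher", "parent": "specimen-1"},
}


def chain(object_id):
    """Return the object followed by its parents, up to the root object."""
    path = [object_id]
    while OBJECTS[path[-1]]["parent"] is not None:
        path.append(OBJECTS[path[-1]]["parent"])
    return path


def aggregate_by_root():
    """Group every object under the root of its chain."""
    groups = defaultdict(list)
    for object_id in OBJECTS:
        groups[chain(object_id)[-1]].append(object_id)
    return dict(groups)


print(chain("dna-1"))
print(aggregate_by_root())
```

Each record keeps exactly one type of its own, and the grouping (an occurrence, or a DNA sample with its parents) is derived by whoever aggregates the data.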

By doing so the basisOfRecord really is a single entity of ONE record and can be defined and used in a straightforward way. Of course you’ll still need additional parameters such as objectType (tdwg/cd#304) preparationType and preservationMethods to describe/classify the “thing” in more detail. GGBN has been using the term materialSampleType for many years, but I think a generic objectType similar to KindOfUnit in ABCD makes much more sense.

So shifting to a generic model instead of focusing on occurrences would improve the use of DwC a lot. Aggregators such as GBIF, OBIS, GGBN etc. should/could easily join the dots and turn them into occurrences or whatever is needed.

I am happy to help in this discussion, which of course is closely related to #314.

@deepreef

I completely agree with the fundamental problem with basisOfRecord in that it confounds what really should be thought of as "Evidence of Occurrence" with instances of Occurrence itself; and, as you note, of course there may be multiple lines of evidence to support the truth of any given purported Occurrence instance. At the moment, the only way to present that notion is through multiple Occurrence instances. On another issue thread I pointed out the example of Chromis abyssus Occurrence records in GBIF, where 7 actual Occurrence instances are represented by 21 records. One set of records corresponds to the seven PreservedSpecimen records, and the others correspond to published citations of those specimens. Some of the latter are currently indicated as basisOfRecord=HumanObservation, and others as basisOfRecord=PreservedSpecimen, but presumably those would be tagged as MaterialCitation if/when that class is adopted in DwC. But even then, there are two sets of the MaterialCitation instances drawn from the exact same publication (two separate content providers).
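
To illustrate the "multiple lines of evidence per Occurrence" idea, here is a rough sketch that groups flat records under the Occurrence they support. The rows are made-up placeholders (not the actual GBIF records mentioned above), and the `occurrence_key`, `evidence`, and `source` fields and the `group_evidence` helper are assumptions for illustration only. In particular, the sketch assumes a shared key already links each record to its Occurrence, which is exactly what is missing in practice.

```python
from collections import defaultdict

# Hypothetical flat records: each row is one line of evidence, not an Occurrence.
rows = [
    {"occurrence_key": "occ-1", "evidence": "PreservedSpecimen", "source": "museum"},
    {"occurrence_key": "occ-1", "evidence": "MaterialCitation", "source": "provider-a"},
    {"occurrence_key": "occ-1", "evidence": "MaterialCitation", "source": "provider-b"},
    {"occurrence_key": "occ-2", "evidence": "PreservedSpecimen", "source": "museum"},
]


def group_evidence(rows):
    """Collect (evidence, source) pairs under the Occurrence they support."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["occurrence_key"]].append((row["evidence"], row["source"]))
    return dict(grouped)


occurrences = group_evidence(rows)
print(len(rows), "records ->", len(occurrences), "occurrences")
```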

As for collections folks needing to focus on managing collection objects, the obvious way to do this within DwC is to thoroughly embrace MaterialSample as the core object to focus on. We have been minting MaterialSample identifiers, which we (most unfortunately) represent in 1:1 cardinality with Occurrence records shared via DwC (in reality it's M:M). My hope is that in the next phase of DwC development and implementation, the collections community organizes their data exchange protocol around MaterialSample in the same way that, up until now, the community has centered around Occurrence. The discussions on this issue since (and including) last year's TDWG meeting certainly seem to be heading in that direction.

I would still favor materialSampleType over objectType in this context, because I think the property should be explicitly defined in the context of instances of MaterialSample. For better or worse, the standard has adopted this term (instead of other terms meaning essentially the same thing, including PreservedSpecimen, "collectionObject", or just plain old "object"). We're already moving towards using MaterialSample as a term that extends to objects beyond the scope of biological specimens (e.g., geological specimens, archaeological artifacts, cultural objects, and perhaps even physical copies of books/etc. in our Library & Archives collections).

I am also happy to be involved with this process (to whatever extent my massively and unnecessarily long commentaries on this topic actually represent "help").

@rondlg

rondlg commented May 26, 2021

Just pulling out a tiny piece of this for now. There is a lot of discussion around ways of referring to cultural items, so we should be very diligent in our research on how to do that appropriately. The prevailing term at the moment, I believe, is "cultural resource".

@deepreef

Thanks, @rondlg :

Yes, we are working with our Cultural Collections experts to ensure compliance with relevant cultural resource information standards. Our ambition at our Museum is to harmonize back-end data structures across all collections (natural sciences, archaeology, cultural collections, library & archives), so we are identifying congruencies among the key data objects in all of these domains. The example I referenced here is that what DwC defines as MaterialSample is largely consistent with what the cultural collections community refers to as "cultural resource", what the library community refers to as "item", and what in other contexts is referred to as "object" or some variant.

The important thing from the data implementation/standardization perspective is that whatever term each constituent community uses, the core informatic properties are functionally the same.

@tucotuco
Member

tucotuco commented May 26, 2021 via email


10 participants