Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Other Deliverable - BasisOfRecord review #11

Closed
Jegelewicz opened this issue Aug 19, 2021 · 82 comments
Closed

Other Deliverable - BasisOfRecord review #11

Jegelewicz opened this issue Aug 19, 2021 · 82 comments

Comments

@Jegelewicz
Copy link
Collaborator

Jegelewicz commented Aug 19, 2021

Task Group will make a recommendation [...] as to which class in the Darwin Core standard these properties belong which may also include recommendations for terms being revised, added, disambiguated, or deprecated. Depends upon definitions provided [in primary deliverable].
[...]
Recommendations will be provided for a revised formal definition as it pertains to materialSample but will not consider other data types.

Current Darwin Core Placement/Definition

http://rs.tdwg.org/dwc/terms/basisOfRecord

this term is a property of Record-level

Defintion

The specific nature of the data record.

Examples

PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation

Comments

Recommended best practice is to use the standard label of one of the Darwin Core classes.

See also

umbrella issue related to dwc:basisOfRecord and an Evidence class: tdwg/dwc#302

@Jegelewicz
Copy link
Collaborator Author

From #3 (comment)

challenge ourselves to include the edges, or relationships, in how we come to define PreservedSpecimen and other ilk we've traditionally dumped in basisOfRecord

Why do we define terms that are meant to be controlled vocabulary for this particular term? Is that done anywhere else in Darwin Core? How did this one term get the privilege of using other terms as controlled vocabulary?

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I decide to use "Object", "Specimen" or "Photo", what happens? Deviation from the Darwin Core class terms will probably mean nobody knows what to do with my records. I guess I feel that we are essentially saying that basisOfRecord requires a Darwin Core Class and I am not certain that is something we want, or is it?

@tucotuco
Copy link
Member

Why do we define terms that are meant to be controlled vocabulary for this particular term? Is that done anywhere else in Darwin Core? How did this one term get the privilege of using other terms as controlled vocabulary?

There are actually four Darwin Core terms that recommend a controlled vocabulary of terms also generated and managed by Darwin Core. The basisOfRecord term happened to be the first of them, already in place in the first version of the standard in 2009. The other three terms with controlled vocabularies minted and managed by Darwin Core are establishmentMeans, degreeOfEstablishment, and pathway. There are many other terms in Darwin Core that recommend using a controlled vocabulary, and some of them recommend specific controlled vocabularies. There is also one term adopted by Darwin Core from Dublin Core, dc:type, which requires adherence to a specific vocabulary, the DCMI Type Vocabulary, among which PhysicalObject, Event, StillImage, MovingImage, Sound, Dataset, Collection, and Text are of interest to us in biodiversity.

Now for some history. I had hoped never to have to write "The Sordid History of Darwin Core", but I feel compelled to at least provide a draft of the "Sordid History of basisOfRecord" chapter. Warning, I am going to be thorough.

Darwin Core evolved from an abbreviated schema of terms for the Species Analyst network (2001) meant to be used for sharing information about museum and herbarium specimens. By 2003, there was demand to be able to share observation records as well. It was deemed to be of extreme importance to be able to distinguish specimens from observations. Thus BasisOfRecord was born. It was defined as "An abbreviation indicating whether the record represents an observation (O), living organism (L), specimen (S), germplasm/seed (G), etc." Just four days later it was redefined (this was before being a standard, things were easier to change then), repenting the recommendation for abbreviations that would not be universally interpretable, and the definition became, "A description indicating whether the record represents an observation, tissue sample, living organism, voucher specimen, germplasm/seed, genetic information, etc." Over the next four years a trend toward a vocabulary began to form and in 2007 a new version of the term was minted with the definition, "A descriptive term indicating whether the record represents an object or observation. Examples: PreservedSpecimen- A physical object representing one or more organisms, part of organism, or artifact of an organism. synonyms: voucher, collection, lot. FossilSpecimen- A physical object representing one or more fossil organisms, part of fossil organism, or artifact of a fossil organism. LivingSpecimen- An organism removed from its natural occurrence and now living in captivity or cultivation. HumanObservation- A report by a known observer that an organism was present at the place and time. MachineObservation- A report by a monitoring device that an organism was present at the place and time. StillImage- An [sic] photograph, drawing, painting. MovingImage- A sequence of still images taken at regular intervals and intended to be played back as a moving image; may include sound. SoundRecording- An audio recording. OtherSpecimen- Any type of specimen not covered by any of the categories above." Careful sleuthing would indicate that someone had already been looking at Dublin Core, though none of the Dublin Core terms had been adopted at that time.

Darwin Core as a standard under TDWG was modeled on Dublin Core and the version of basisOfRecord that came out of the ratification process for the first version had a biodiversity-based type vocabulary and was a subtype of the more generic Dublin Core dcterms:type term. The first-ever official version of basisOfRecord recommended the use of the Darwin Core Type controlled vocabulary, "The specific nature of the data record - a subtype of the dcterms:type. Recommended best practice is to use a controlled vocabulary such as the Darwin Core Type Vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm)." The type vocabulary consisted of terms (classes) in a namespace dwctype:, separate from dwc:, again following the Dublin Core example, with the following labels: FossilSpecimen, HumanObservation, LivingSpecimen, Location, MachineObservation, NomenclaturalChecklist, Occurrence, PreservedSpecimen, and Taxon. Being a subtype of dcterms:type, the Dublin Core Type Vocabulary was also supposed to be valid to use for basisOfRecord.

Later in 2009 there was a decision to rescind the recommendation to populate dcterms:type with Darwin Core terms, "The recommended controlled vocabulary for dcterms:type was changed from a vocabulary of Darwin Core Classes (Occurrence, Taxon, Location, Event) to the DCMI type vocabulary (PhysicalObject, Event, StillImage, MovingImage, Sound, Text, Dataset) to be consistent with the standard use of that term." There was an accompanying decision to take the DCMI terms out of the recommended Type Vocabulary for Darwin Core, "The recommended controlled vocabulary for basisOfRecord remains the Darwin Core Type Vocabulary, but the Dublin Core classes StillImage, MovingImage, and Sound were removed from that list as these are to be used as vocabulary for the dcterms:type term."

In 2011 there was further introspection as we began to think about using Darwin Core with RDF for Linked Open Data. There had been an attempt to add the Dublin Core Type Vocabulary terms StillImage, MovingImage, and Sound back into the recommended list for the Darwin Core Type Vocabulary, but this was rejected with this decision, "The Dublin Core type vocabulary values StillImage, MovingImage, and Sound have not been added to the list of valid values for Darwin Core type vocabulary pending further insights from the RDF Interest Group about how best to manage basisOfRecord when a record can be 'about' more than one subject." At the same time, and for the same reasons, a decision was made to remove the subclass designations from terms in the Darwin Core Type vocabulary, "The subclasses for Darwin Core type vocabulary have been removed. These were seen as too constraining when considering biodiversity information in the context of linked data." This affected PreservedSpecimen, FossilSpecimen, LivingSpecimen, HumanObservation, MachineObservation, and NomenclaturalChecklist, which were all subclasses of dwctype:Occurrence.

In 2013 something curious happened. The MaterialSample term was added to the Darwin Core Type Vocabulary (along with a MaterialSample class in the dwc: namespace). The curious part was that, despite the 2011 decisions to the contrary, dwctype:MaterialSample was defined as a subclassOf OBI:specimen! A month later Joel Sachs questioned this detail in a presentation at TDWG 2013, concluding, "Assertions that tie Core terms to upper ontologies should be asserted in a separate document. That way, those doing integration that depends on OBI axioms can ingest the appropriate descriptions. Those that don’t need the OBI axioms don’t have to worry about incorrect inference."

The Darwin Core Type Vocabulary survived until late 2014, when the utility of having separate (some redundant) terms in a separate namespace in Darwin Core was questioned. Community discussion resulted in the decision to remove the Darwin Core Type Vocabulary in favor of the Darwin Core classes, and to add the classes that were missing in the dwc: namespace (namely PreservedSpecimen, LivingSpecimen, FossilSpecimen, MachineObservation, and HumanObservation). With this change, MaterialSample lost its subclassOf property (though it retains OBI:specimen as a superclass in the Biological Collections Ontology).

Things have been stable (but not without controversy) with respect to terms related to basisOfRecord until July 2021 when MaterialCitation was added as a new class and recommended term in the vocabulary of basisOfRecord.

If you are a glutton for punishment there is a lot of historical discussion going back to 2009 on basisOfRecord that can be followed from this post in the tdwg-content list.

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I decide to use "Object", "Specimen" or "Photo", what happens? Deviation from the Darwin Core class terms will probably mean nobody knows what to do with my records.

Humans will unlikely be able to find your records and machines certainly won't without a lot more help.

I guess I feel that we are essentially saying that basisOfRecord requires a Darwin Core Class and I am not certain that is something we want, or is it?

Controlled vocabularies that are used are more useful for finding things than not having them, or not having them followed. That doesn't seem like the crux of the problem. Some people want more options, and that is an option - just look at MaterialCitation - it was identified as necessary, justified, and put into practice. More of an issue is that in Simple Darwin Core all you get is a row for an Occurrence, an Event, or a Taxon, but that "record" can be "about" lots of things at the same time, and the one I'm interested in publishing might not be the one that interests someone else searching biodiversity databases.

@dr-shorthair
Copy link

There is also one term adopted by Darwin Core from Dublin Core, dc:type, which requires adherence to a specific vocabulary, the DCMI Type Vocabulary, among which PhysicalObject, Event, StillImage, MovingImage, Sound, Dataset, Collection, and Text are of interest to us in biodiversity.

Not quite correct. "Recommended practice is to use a controlled vocabulary such as the DCMI Type Vocabulary". You can choose to use a different type vocabulary. In DCAT2 we identified four other general purpose type vocabularies: ISO-19115-1 scope codes; Datacite resource types; PARSE.Insight content-types used by re3data.org; and MARC intellectual resource types. You could also define a domain-specific type vocabulary ... which I think is what you guys have in DWC.

@tucotuco
Copy link
Member

tucotuco commented Aug 29, 2021 via email

@albenson-usgs
Copy link

albenson-usgs commented Aug 29, 2021

@Jegelewicz wrote

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I

I would argue this is not true. At least if you are sharing via the IPT then basisOfRecord is definitely a required term and the only one with a required controlled vocabulary.

@tdwg tdwg deleted a comment from albenson-usgs Aug 29, 2021
@Jegelewicz
Copy link
Collaborator Author

image

Is the IPT setting this requirement? Or does "recommended" really mean "thou shalt"?

@tucotuco
Copy link
Member

The IPT can be configured to use various Core and Extension definitions. The Cores are for Occurrence, Event, and Taxon. The Event and Taxon Cores do not have a basisOfRecord term in them. The Occurrence Core does have basisOfRecord and it is required to be present in the published dataset. The basisOfRecord term in the Occurrence Core currently has a "thesaurus" element that points to https://github.com/gbif/rs.gbif.org/blob/master/vocabulary/dwc/basis_of_record.xml, which defines the recommended terms and gives translations for their labels in various languages. Because of this thesaurus element, the IPT gives data publishers the option to choose a value from that list in order to map an entire dataset to a constant (one of those values) in lieu of mapping to a field in the dataset. The IPT does not constrain the values in any field mapped to basisOfRecord, which is why there are more that 620 distinct values for that field from the data shared through GBIF. GBIF does their best to interpret those verbatim values into the controlled vocabulary values so that people can still use them effectively for searching in GBIF.

@albenson-usgs
Copy link

albenson-usgs commented Aug 30, 2021

I wanted to confirm I was correct before responding. You cannot publish a dataset via the IPT if it does not follow the controlled vocabulary and it must follow it explicitly (e.g. "Human Observation" will be rejected). The 620 distinct values in GBIF must be from legacy data or other methods for data sharing that are not the IPT.
CantPublish

@tucotuco
Copy link
Member

Good catch, @albenson-usgs .

@Jegelewicz
Copy link
Collaborator Author

Just putting this here because I was thinking about it last night.

It seems to me that "BasisOfRecord" should be more about the evidence at hand and I think there are three types of evidence that we are dealing with.

  1. Primary (or direct) evidence - material that is or was all or part of the object identified as taxon
  2. Secondary evidence - material that indicates the presence of the object identified as taxon (footprint, nest, scat, photograph, sound recording, digital image)
  3. Tertiary evidence - a human records information about an object identified as taxon, but no material support or trace exists (field notes, publications and preparation catalogs when no material is cataloged)

Primary evidence would always be a MaterialSample (which might be fossil, preserved, living or whatever else we end up needing to make it easy for people to put stuff in bins)

Tertiary evidence would always be an Observation (which might be human, but as AI grows, maybe we will have machines reporting what they "saw" without providing evidence of some sort?)

Secondary evidence - this one is the hard grey area. Any MaterialSample that is Secondary evidence would really be a trace of some sort and any Observation would be an image or sound recording (also a trace?).

So maybe instead of the above, we have choices for BasisOfRecord as follows:

MaterialSample - material that is or was all or part of the object identified as taxon (skin, skeleton, tree, seed)

Trace - material that indicates the presence of the object identified as taxon (footprint, nest, scat, photograph, sound recording, digital image)

Observation - recorded information about an object identified as taxon when no MaterialSample or Trace exists (field notes, publications and preparation catalogs)

Probably replace "object" with "organism" for the purposes of TDWG - I am always thinking about cultural collections too.

@baskaufs
Copy link

baskaufs commented Sep 5, 2021

@Jegelewicz I think this is a very useful way to think about categories of evidence.

I think that some of the difficulties that you are encountering at least partly stem from historical precedents on how we've categorized evidence in Darwin Core. At its core, basisOfRecord is a type designation. We have somewhat artificially separated type designations between dwc:basisOfRecord and dcterms:type, I think mostly because there is an expectation that values of dcterms:type will come from the Dublin Core type vocabulary, which doesn't include the specific kinds of types that we care about like preserved specimens.

So what we've ended up with is a somewhat odd situation where certain types of evidence like live organism images taken by people or camera trap photos, which can be typed as dcmitype:StillImage as a value for dcterms:type then have to have an additional designation as dwc:HumanObervation or dwc:MachineObservation as a value for dwc:basisOfRecord. But then for a type of evidence like dwc:PreservedSpecimen, we can use that as a value for dwc:basisOfRecord, but we have no value at all for dcterms:type because we haven't traditionally populated that field for specimens.

To me, it doesn't make sense that we have two terms for type. It would make a lot more sense to me to just have one term for type and populate it with the most specific class that makes sense. If we want to establish a type hierarchy (i.e. ontology with subclassing), then on could infer broader categories from more narrow ones. That is, if we decided that still images are a kind of observation, then state that the evidence is a still image and anyone could reason that it's also an observation. That seems much cleaner than having some kind of artificial distinction where some kind of broad typing happens with basisOfRecord and narrower typing may or may not be done with a separate term. This is more or less what is specified in the DwC RDF guide, where it is suggested that rdf:type be used in preference to dwc:basisOfRecord or dcterms:type.

Unfortunately, basisOfRecord is so deeply embedded in Darwin Core systems (i.e. a required term for everything) that it seems like it would be difficult to "reform" it. But I would just love to ditch basisOfRecord and dcterms:type and just use the single term rdf:type for everything. In that case, the kind of categorization that @Jegelewicz is doing would be the way that we could ontologically relate the various kinds of classes and how they are related to each other.

@tucotuco
Copy link
Member

tucotuco commented Sep 6, 2021

Hopefully all of this is good fodder for a basisOfRecord Task Group, if we don't end up solving it here after all.

Since I am all thumbs I give ten thumbs up for deprecating dc:type and dwc:basisOfRecord in favor of rdf:type.

But that is only part of the problem. The other, important part is that a "record" can be about a lot of things all at once, which we can't solve in Simple Darwin Core without either a term that takes a list (a hack, and in this case an "inconvenience" term) or relegating the term to an extension that can have a one-to-many relationship. ResourceRelationship could accomplish that, but I am not sure of the wisdom of that. I mean, if we are going to do something drastic, why not do something drastic that works better than what we have? This could be by allowing more complex relationships between "tables" of data to be shared in structures rather than the limited star schema we are currently working under. Each of those "tables" could have their own types and we could start to get as specific (or not) as we like with controlled vocabularies for those types.

@dr-shorthair
Copy link

I'm a little nervous about rolling all classification into rdf:type. It is a very powerful predicate with some special behaviours and entailments, particularly around sub-classing and participation in global domain/range constraints, which might be reaching beyond the expectations of the people who designed the classification vocabularies. I generally find that rdf:type and dcterms:type can work nicely alongside each other - the former for formal typing, and the latter for more informal classification treated more like annotation.

@baskaufs
Copy link

baskaufs commented Sep 7, 2021

@dr-shorthair I'm unclear about the dangers of using rdf:type here. If you adopt the RDFS view of things, then there are certainly triples that you can assert that create entailments involving rdf:type. But I can't think of any entailments that would be generated directly by making an rdf:type assertion. It's a basic RDF property for asserting the class of thing we are talking about and is nearly universally understood in the RDF/Linked Data world. Isn't that exactly what we are trying to accomplish here (i.e. communicate the type of a resource)? There are no prohibitions against a particular resource having more than one type, so asserting a type doesn't preclude someone else from typing it as something else.

The other piece of this is that TDWG now has a formal policy, expressed in section 4.4.2.2 of the Standards Documentation Specification that we keep entailment-generating statements out of the basic "bag of terms" layer of vocabularies and assert them as separate "layers" on top of the basic layer when their use can be justified. So the burden of determining the implications of entailment-generating statements like range, domain, subclass, etc. falls on the people who want to add those layers on top of the basic layer. That prohibition would not apply to rdf:type declarations, because such declarations do not in themselves result in any entailments that I am aware of.

@Jegelewicz
Copy link
Collaborator Author

Can someone please define RDF? Also, where do I find the mythical "RDF Interest Group"?

@baskaufs
Copy link

baskaufs commented Sep 7, 2021

@Jegelewicz Sorry about that. RDF is Resource Description Framework and it's the "language" that can be used to express metadata in a machine-readable way. It's probably the most common way that linked data is communicated automatically. For a somewhat quirky intro to RDF, you can check out this video from a TDWG meeting years ago: https://www.youtube.com/watch?v=XAGifYBiXMY

As far as I know, there isn't an RDF Interest Group. There once was an RDF task group, which created the Darwin Core RDF Guide, but it hasn't been active for several years, and hopefully has therefore been disbanded since it doesn't have a task any more.

@dr-shorthair
Copy link

The special thing about rdf:type is that it entails that a resource is also a member of all the parents of the object class. So my concern is that a value might be assigned to a rdf:type which has not been conceived as a class in a logical subsumption or broader/narrower hierarchy, or is taken from a vocabulary structured inconsistently internally or with respect to another set of classifiers that is already in use.

My experience is that dcterms:type provides a safe-haven for less formal classifiers, which makes them more like annotations (there are no formal entailments associated with DCMI), but where the predicate is still taken from a 'standard' RDF vocabulary.

@baskaufs
Copy link

baskaufs commented Sep 8, 2021

@dr-shorthair I see your point. However, as I understand it, the task of this group is essentially to lay out a class hierarchy for Darwin Core classes (perhaps also including the Dublin Core classes) and to assert those classes as the recommended values for whatever term we end up using for typing (currently dwc:basisOfRecord, potentially rdf:type). So presumably if we do our job well, the situation you are worried about shouldn't usually happen for Darwin Core records.

I find it instructive to think about what has happened in Wikidata. Although the Wikibase model has an RDF serialization, it very intentionally avoids going into the RDFS realm. rdf:type doesn't really exist there as a property for describing kinds of items. But almost immediately the property P31 ("instance of") was created and it's now recommended to be used as a property of all items. And now people are building class hierarchies using P279 ("subclass of") to lay out the relationships between the informal class items that have been created. So despite not using the RDFS model directly, people have essentially re-created an RDFS analog because people have a basic need to understand what something is and how it's type is related to other classes. The key difference is that the Wikidata system of typing and subclassing does not support automatically computing entailed triples. Rather, people discover the hierarchical relationships using SPARQL queries and "police" the correct use of terms using ShEx rather than trying to establish ranges and domains for properties.

My point here is that we are dealing with a special case of controlled vocabulary. We are not just trying to lead people to choose the right term to categorize the subject resource (as one might do with a SKOS-based controlled vocabulary), but we are also trying to meet this fundamental need that people have to know what kind of thing we are describing. The thing about using rdf:type is that it is the most straightforward and widely recognized linked data term for indicating what kind of thing a resource is. What you say is true about it can be part of generating entailments when combined with subclass, range, or domain declarations. But as far as I can tell, carrying out that kind of reasoning is extremely rare in our community, so it seems to me that the risk of the problems you raise is low. In contrast, the value of using an extremely well-known property is high -- it's so well known that it's the only property that has its own special abbreviation in RDF Turtle ("a"). DCMI itself recognized this fact when it recommended

It is recommended that RDF applications implementing this specification primarily use and understand rdf:type in place of dcterms:type when expressing Dublin Core™ metadata in RDF, as most RDF processors come with built-in knowledge of rdf:type.

That recommendation was carried through in Section 2.3.1.4 of the Darwin Core RDF Guide, which specifies that rdf:type should be used for IRI-valued typing, and explains why there isn't any dwciri: analog of dwc:basisOfRecord.

One problem with my suggestion of replacing dwc:basisOfRecord with rdf:type is that the value of rdf:type should be an IRI and not a string. That would differ from the current practice of using a controlled value string for basisOfRecord. However, I don't think it would necessarily be a problem to require an IRI as long as we tell people what it should be. That's a social problem, not a technological one and I've taken the position that there shouldn't be any problem with people using properties that require an IRI value in spreadsheets.

Another advantage of going with rdf:type over dwc:basisOfRecord is that it would make it easier to formally define a subclass hierarchy using rdfs:subClassOf. That can't readily be done using values of dwc:basisOfRecord, given that they are literals and therefore can't be used as subjects in triples. That is, if this group is successful at more clearly defining the classes to be used for typing and their hierarchical relationships, it would be relatively straightforward to express that as a machine-readable ontology that could be informally or formally laid on top of the "bag of terms" layer of DwC.

@Jegelewicz
Copy link
Collaborator Author

I am not 100% sure I get all of this, but there is one thing that appears useful to me and that is simplification. If we can rid ourselves of BasisOfRecord in favor of something that is more globally recognized and useful, then I think it makes a lot of sense. We don't need to invent a wheel that is special for biological collections and their data. If we can use the wheel that is already out there, then we can add special hubcaps if we really need them - right?

Thanks for the video @baskaufs but also, this is a good place to learn about RDF - https://en.wikipedia.org/wiki/RDF_Schema

Could someone give me a practical example of how this would work as opposed to what is happening now?

@dr-shorthair
Copy link

it's so well known that it's the only property that has its own special abbreviation in RDF Turtle ("a"). DCMI itself recognized this fact when it recommended

Yes, indeed. It is so well known also because it has the strongest effects. Your story about Wikidata is illuminating - I am frustrated with WD (and OBO) because of the ways that they invented new predicates for things that already exist in RDFS, etc.

I hadn't seen that note from DCMI. It certainly makes sense for 'strong' uses of typing. But I don't think it invalidates my proposed use of dcterms:type, precisely because there are no entailments associated with DC properties.

However, if you think that DWC classes will be well-enough controlled to be able to move all usage to rdf:type then certainly that makes sense in that context.

@baskaufs
Copy link

baskaufs commented Sep 9, 2021

@Jegelewicz How about this:

image

It's an example spreadsheet from the IPT User's Manual. The top part is the existing example, the bottom part is how I would change it. Some important points:

  1. When the class IRI is put in the table, the full IRI must be used, vs. abbreviating it like dwc:PreservedSpecimen. There are a couple reason for that. First of all, people don't generally understand what dwc: means (i.e. that it's an abbreviation for http://rs.tdwg.org/dwc/terms/). Second, the namespace abbreviations are not always consistent, particularly in Dublin Core where you see dc:, dcterms:, and even dcterm: used inconsistently for the same namespace. The benefit of not having to disambiguate values would be lost if the IRIs were abbreviated.
  2. In the example, the headers only have the local names and don't give any indication of the namespace. This basically assumes that all of the column header properties are in the dwc: namespace, which is not a safe assumption if people want to use dwciri: terms. I think @tucotuco could speak to this problem. I'm not sure it's a problem for a Darwin Core Archive if it has a meta.xml file that maps the columns to the actual full property IRIs, but it may be a problem for simple spreadsheets. In this case, it would be better to have a header like rdf_type than just type given that rdf:type, dc:type, and dcterms:type all have the same local name.
  3. This does not deal with the problem that @tucotuco brought up about a single row being "about" more than one type of thing. So if you have a specimen and an image of the specimen being described in the same row, you'd have two types to specify. But that isn't the problem we are trying to fix here.

@Jegelewicz
Copy link
Collaborator Author

@baskaufs thanks! That makes sense and I think it should be easy to get people on board with that as long as we provide clear instructions for use.

@ghwhitbread
Copy link

ghwhitbread commented Sep 9, 2021

Basis of record: An alternative point of view ?

Darwin Core (and the aggregators that need it) exists because there are information systems out here that are built for the purposes of managing natural history collections (specimens, images, observations, samples) and able to provide meta-data extracts that look like occurrence records. In making these data generically available, Darwin Cores has played a critical role in the rise of biodiversity informatics and also for the status of Collections.

For the most part, these Darwin Core records are answers to a question. They are not persistent objects. The persistent objects (things like taxon_concept, name, agent, locality, event, observation, annotation (determination), treatment, work, collection item (specimen, sample, image), site, enclosure) in our information systems, and their relationships and behaviours, have been carefully modelled to meet the business requirements intrinsic to the management of Collections. They are already implemented and evolving, come with their own entailments, and critically, do not exist for the primary purpose of delivering Darwin Core.

In normative Darwin Core the collection objects that are the subject of this discussion - the evidence for occurrence - provide the real world relationship that makes a Darwin Core record useful at all. It is not the role of Darwin Core to model them. These Collection objects live in the realm of collection management systems, and for an occurrence meta-data interchange standard we simply need a way to convey the nature of the underlying objects - the basis of each record - and a means for attribution to the sources.

To expand on John’s concise term history. The “basis of record group” was previously used in HISPID (1989 - TDWG standard (1996) and in ABCD as “record source”, taken from ABIS (1976)*, as a simple, and unambiguous, statement of origin as an aid to determine the falsifiability of reified assertions and assumptions derived from meta-data extracts.

I would advise a simple, though extensible, vocabulary. And as @afuchs1 suggests - a basis of record category for bundling object identifiers. If modelling collection objects is your thing, please keep them separate from Darwin Core.

Very early in this discussion, @tucotuco stated that Darwin Core does not have sub-classes. I believe that is partly because there are no real classes either. Darwin Core “classes” are in fact categories that can be used to group (rearrange) terms into records, without entailment, for the purpose of meta-data transfer, most often from Collections to Aggregators, beyond the constraints of types at the records origin.

This is why Darwin Core is successful.

*ABIS: Australian Biotaxonomic/Biogeographic Information System (Australian Biological Resources Study - ABRS)

@Jegelewicz
Copy link
Collaborator Author

a basis of record category for bundling object identifiers. If modelling collection objects is your thing, please keep them separate from Darwin Core.

I am not sure that I understand this statement. I feel that "modelling collection objects" and Darwin Core are inextricably connected, but maybe I am misunderstanding something?

Darwin Core “classes” are in fact categories that can be used to group (rearrange) terms into records, without entailment, for the purpose of meta-data transfer, most often from Collections to Aggregators, beyond the constraints of types at the records origin.

I think that it is very difficult to make anyone believe that Darwin Core is "classless" when the "classes" are identified with the definitions. I say this because I have to explain this to others and I am not 100% sure that I can confidently do so. But, this statement also makes me think that a lot of collections are doing Darwin Core wrong AND that the Darwin Core archive facilitates this? If ALL the aggregators are interested in are occurrences, maybe we shouldn't include anything about the collection object in the Darwin Core archive sent to them other than the fact that BasisOfRecord = material evidence? Is that what this statement and the one above are getting at?

a simple, and unambiguous, statement of origin as an aid to determine the falsifiability of reiified assertions and assumptions derived from meta-data extracts.

To me, there are a lot of jargon terms in this statement. If we are going to make this useful to everyone, we need to explain in the most plain terms possible what we are trying to get at. We are asking collections staff to be experts in biology, collections management and now computing language - and for some it will be OK, but for a lot, it will be one more thing they don't have time to figure out and that will lead to poor data quality.

@baskaufs
Copy link

baskaufs commented Sep 10, 2021

A response to @ghwhitbread's comment:

It may be true that Darwin Core originated to serve information systems for managing natural history collections. But at this point, use of Darwin Core goes way beyond that. It's used broadly with remote sensing data, camera trap data, and data from citizen science projects like iNaturalist and eBird, and these kinds of projects may have no connection to natural history collections (unless Greg intends these kind of "collections" to also be considered "natural history collections"). Assuming that there is always going to be a "basis of a record" that is a physical object sitting in a museum simply does not hold for many of these new kinds of records described by Darwin Core. And those records may have evidence whose type doesn't fall into a current Darwin Core class that's historically been used as a value for dwc:basisOfRecord. More often in the examples I've given, it will be a media item like a still image or sound recording.

I also disagree with the statement that Darwin Core has no real classes. There are a number of classes formally defined as such among the Darwin Core terms: dwc:Event, dwc:MaterialSample, dwc:Occurrence, dwc:PreservedSpecimen, etc., all of which formally have the type rdfs:Class. What Darwin Core does NOT have are formal domain declarations associating properties with those classes. So in that sense, the classes are used as Greg said: to organize the terms. But that does not prevent them from being used as we might use other terms that are classes, i.e. as values for rdf:type or to construct formal class hierarchies with subclass declarations. Those hierarchies may never formally become part of Darwin Core, but that doesn't mean they wouldn't be useful for clarifying relationships or even machine reasoning.

@albenson-usgs
Copy link

+1 to Steve and just want to add that there are data where there is no evidence in the way it's being discussed here such as coral reef monitoring using the point line intercept method where the only evidence is that someone wrote it down on a piece of paper- not even a still image or a sound recording - same for small mammal trapping, vegetation surveys, fish trawls,...

@Jegelewicz
Copy link
Collaborator Author

only evidence is that someone wrote it down on a piece of paper

See #11 (comment)

@smrgeoinfo
Copy link

@baskaufs do the rows in the organism table represent individuals or classes of organisms?

@baskaufs
Copy link

@smrgeoinfo individuals. But that would be individual organisms as defined by DwC, which can also include taxonomically uniform groups of organisms like packs, clones, etc.

@Jegelewicz
Copy link
Collaborator Author

It isn't clear to me that there needs to be a "core". If we create records of "rdf:type" and relate them as appropriate then why does anything need to be the core?

@Jegelewicz
Copy link
Collaborator Author

Also, isn't the "star schema" a GBIF/Darwin Core Archive thing? Is that something we should be concerning ourselves with? I am worried that we are working on GBIF issues (that we cannot resolve) instead of definitions of Darwin Core terms.

@baskaufs
Copy link

baskaufs commented Oct 1, 2021

@Jegelewicz Darwin Core Archive is a specific implementation, but the "star schema" system with core and extension files is actually laid out in the Darwin Core text guide, which is officially a part of Darwin Core. It's not the only way to use Darwin Core, but it's probably the most common way.

The "core" and "extension" designation is terminology built into normative parts of the text guide specification. See Section 2.1.2 and beyond.

I get your point about

If we create records of "rdf:type" and relate them as appropriate then why does anything need to be the core?

That's really a kind of Linked Data argument and I agree with it totally. The reality is that there are tons of people using a system based on the Text Guide. So how do we move in the direction where we allow as many types of things to be represented in their own table (be distinct types) rather than "flattening" them into fewer tables and therefore losing the ability to create many one-to-many relationships?

My suggestion (put Organism in the middle of the star) was intended to maximize the number of distinct tables that could be handled by the existing star schema design laid out in the text guide. It does not allow for at least two other relatively common designs where events or taxa are put in the middle of the star. What I believe it does fix is the various problems people have with putting occurrence in the middle.

@Jegelewicz
Copy link
Collaborator Author

@baskaufs thanks for that explanation. I'm betting that next to zero collection managers know about or understand this. In fact, I can read it and sorta get what's going on, but I know I could not confidently explain it to anyone.

Given all of that - if we must choose a "core", I think that for museums, the "core" that makes sense is MaterialSample, although now that everyone has been trained to think of occurrence as core, that will be very difficult to change. I say MS for core because we are primarily managing physical objects that may or may not represent an organism (or many organisms) - often the organism is only implied. Having implied data be the "core" seems wrong somehow. I know this argument will not go over well, even among museums, who somehow still think that their mouse skull IS an "organism" and maybe they are right - who am I to say!

Maybe it really doesn't matter what the "core" is? There could be data sets with anything at the core as long as we have well-defined terms and row:types and I should (with a bit of work) be able to mash any of them together (I think?).

My head hurts....

@deepreef
Copy link

deepreef commented Oct 1, 2021

although now that everyone has been trained to think of occurrence as core, that will be very difficult to change

This is an issue/problem/concern that I've been aware of/trying to address for a long time (many years). The first step was to create the necessary classes in DwC (MaterialSample, Organism; still working towards something like Evidence) to parse out the "meat" (so to speak) of the ubiquitous Occurrence instances, a very large portion of which actually represent "the circumstances when a specimen was extracted from nature", but are often thought of as the specimen itself.

I'm actually very encouraged by the slow but steady progress to solve this, and the existence of this Working Group is a VERY important step in the right direction. I guess my main point of reassurance here is that there is a lot of "inertia" in our very broad community, so fundamental shifts do require time. While this shift has been years in the making, the ship is definitely turning, so I'm actually kind of excited and optimistic that we're on the right track.

There are several general ways we can move forward on this:

  1. Baby Step: We define MaterialSample as another Core for the existing star schema architecture, and represent occurrences associated with those MS instances as an Extension.
  2. Moderate Step: We define Organism as another Core for the existing star schema architecture, as per @baskaufs diagram/description.
  3. Big Step: We define ResourceRelationship as "the" Core for the existing star schema architecture, as I have suggested (and will eventually describe and illustrate).
  4. Giant Leap: We retire the star schema architecture, and develop a new architecture to move structured packages of our data around (to GBIF, iDigBio, and elsewhere).

There are other options in there as well (e.g., introducing an Evidence class and representing that as the Core).

The key question, I think, is how big of a step is our broader community willing to take at this stage? I'm very confident that we're ready to at least take the Baby Step. I'm likewise very skeptical that we're ready to make the Giant Leap. But it's less clear whether the Moderate or Big Steps are practical/realistic.

Obviously, some of this is beyond the scope of this Task Group. But this Task Group is sort of at the epicenter of the larger issue -- which ultimately boils down to finding the right balance between flat/simple data structure vs. highly complex/normalized data structure at the exchange level. The star schema is a compromise between one end of the spectrum (simple, flat table of Occurrences) and the other end (RDF triple store). There are other options representing compromises at different stages along the same spectrum. The trick is to find the sweet spot, then take the steps necessary to help shift the community in the right direction.

Yes, my head hurts too. But my heart also beats (with excitement about the prospects of real progress in a long journey)!

@baskaufs
Copy link

baskaufs commented Oct 3, 2021

@Jegelewicz In response to your recent comment, I just want to emphasize that I think we are using the term "core" in two distinct ways.

In your comment, I believe you are using the term "core" to mean "the table that contains information about the kind of thing we think is most important in a particular community". When I suggest that Organism should be the core table, I intend for "core" to have the technical meaning it is given in Section 2 of the DwC Text Guide: the table that sits at the center of the star in the star schema.

In most current cases, I think that the various available "cores" (occurrence, event, taxon) position the table for what a community considers the "most important thing" ("core" in @Jegelewicz sense) in the "core" position in the star ("core" in @baskaufs sense). What I am advocating is that in the interest of making it possible to document the more complex kinds of relationships people want, we get away from this "center of the universe" thinking (i.e. the "most important table" has to be in the center of the star). If we use Organism as the "core" file in the star, it is actually likely to be the LEAST important table in the star -- it's just the table best positioned to link many of the other ones that people do think are important (occurrences, material samples, media items, and identifications) and still keep the existing star schema system.

@baskaufs
Copy link

baskaufs commented Oct 3, 2021

@deepreef in response to your comment, I think your listing of possible steps based on how big they are is a good statement of the situation. The one thing that I would say in response is that I don't think that it is clear that there is a benefit to having a distinct "evidence" class. Despite what we did in defining a Token (a.k.a. Evidence) class in Darwin-SW, I don't really think that there is anything to be gained from it.

There is a benefit to be gained when a resource of any type (e.g. MaterialSample, image, sound) is identified as evidence through linking it to an occurrence or an identification. In other words, any kind of thing becomes evidence when we assert that it serves as evidence (using yet-to-be-defined property terms). We could create some domain assertion for those properties that automatically entail that the thing used as evidence is an instance of an "Evidence" class, but what would be the benefit? I think that people look at types more to understand what kind of thing something is (material object, image, etc.) rather than what role it plays. If you want to know whether it has an evidentiary role, look to see if it has an "isEvidenceFor" property.

If you are a visual type person (like me) and these words confuse you, look at this picture:

Darwin-SW graph diagram

Imagine that we have a MaterialSample (maybe a museum specimen) in the position of the diagram labeled dsw:Token. If it has a value for the property dsw:evidenceFor, then we know that it's evidence for an Occurrence. If it has a value for the property dsw:isBasisForId, then we know that it's evidence for an Identification. Why do we need to check if it has rdf:type of dsw:Token? That's superfluous and it's much more useful to know that it's type is dwc:MaterialSample or even more specifically dwc:PreservedSpecimen.

This is a sort of "Occam's razor" argument -- why create a class if it doesn't add anything to our understanding of the resource.

@deepreef
Copy link

deepreef commented Oct 3, 2021

The one thing that I would say in response is that I don't think that it is clear that there is a benefit to having a distinct "evidence" class.

I actually agree -- which is why I've not been beating the "Evidence Class" drums very much lately. When we did our implementation of Evidence-stuff, it was clear that each "thing" ("token"?) that functions as evidence (e.g., for an Occurrence, or for an Identification, or possibly for an Event, GeologicContext, MeasurementOrFact, etc.) is not, by it's nature "Evidence". Rather, these things/tokens are entities of their own (e.g., MaterialSample, multimedia item, literature documentation, documented report, etc.), each with their own specific class properties. What makes them function as "Evidence" is the relationships they have with these other things (Occurrence, Identification, etc.). Basically, tangible things serve as "evidence" for abstract instances.

So... the "Evidence-ness" really ought to be represented as instances of the ResourceRelationship (which I am increasingly coming to view as the universal "many-to-many join" for links between -- and even recursively within -- instances of other DwC classes).

I'm not sure if we're basically saying the same thing here, but when I look at the dsw diagram, I think dsw:Token is really sort of a placeholder to represent not "any" class, but rather the subset of entities that have tangible (physical or digital) manifestation. Currently, the only DwC class we have in that space is MaterialSample, but multimedia and (potentially) literature are two others that could fit that scope.

Organisms/Agents (which I see as fundamentally the same thing, with the latter restricted to a single particular taxon) are partly tangible (they have physical manifestation), but partly abstract (an Organism represents the dynamic set of matter and kinetic/chemical action that spans more or less from birth to death/disintegration). The same thing could probably be said for instances of Location (which seem to be at the intersection of tangible and abstract). But I don't see these as things that themselves function as evidence, nor are they really in need of being supported by evidence, so I'm not sure they operate in the same way.

OK, that's enough Sunday-morning armchair philosophizing....

@baskaufs
Copy link

baskaufs commented Oct 4, 2021

@deepreef Yes, it sounds like we are pretty much in agreement.

I'm pretty much agnostic about the mechanism for documenting many-to-many relationships. In the "organism core" scheme I suggested, I treated relationships between "evidence" like MaterialSample instances and Occurrence (or Identification) as many-to-one. The reason for that is because it was easy to fit into the star schema system, not because there weren't many-to-many relationships, which would have to be documented in some other way.

@Jegelewicz
Copy link
Collaborator Author

Organisms/Agents (which I see as fundamentally the same thing, with the latter restricted to a single particular taxon)

YES! I have been arguing this within Arctos for some time now!

If you have loads of free time - this epic issue is interesting and @deepreef contributed there as well, but here are some of the "arguments for organism as agent" highlights:

ArctosDB/arctos#1966 (comment)
ArctosDB/arctos#1966 (comment)
ArctosDB/arctos#1966 (comment)
ArctosDB/arctos#1966 (comment)

Here is an example of an "organism" agent:

Kianga in Arctos

And the discussion continues....
ArctosDB/arctos#3765

All living things are agents. We can treat Homo sapiens as special kinds of agents, but almost everything that applies to people also applies to other species. I don't know why I have to argue so hard for using the exact same model for other living things that we use for people and I don't feel like anyone has made a really good argument against it. I think that nobody wants to admit that Homo sapiens are just another species on planet Earth? Also, no one wants to add another table to their data or attempt to manage "mouse12345" just because part of it is somewhere else or it had 5 embryos. This is the social issue that needs to be overcome if we are going to do this well - IMO.

@deepreef
Copy link

deepreef commented Oct 5, 2021

In the "organism core" scheme I suggested, I treated relationships between "evidence" like MaterialSample instances and Occurrence (or Identification) as many-to-one. The reason for that is because it was easy to fit into the star schema system, not because there weren't many-to-many relationships, which would have to be documented in some other way.

Yes -- that's why I'm captivated by the idea of having ResourceRelationship at the Core of a star schema, because that can handle all one-to-one, one-to-many, and many-to-many relationships between, within, and among instances of all the other classes. I suppose one could even express relationships between instances of ResourceRelationship, but that's a bit meta for this late in the day, and is causing my head to hurt just thinking about it.

@deepreef
Copy link

deepreef commented Oct 5, 2021

All living things are agents. We can treat Homo sapiens as special kinds of agents, but almost everything that applies to people also applies to other species. I don't know why I have to argue so hard for using the exact same model for other living things that we use for people and I don't feel like anyone has made a really good argument against it. I think that nobody wants to admit that Homo sapiens are just another species on planet Earth? Also, no one wants to add another table to their data or attempt to manage "mouse12345" just because part of it is somewhere else or it had 5 embryos. This is the social issue that needs to be overcome if we are going to do this well - IMO.

This captures my own point of view perfectly! Indeed, we have many instances where individual non-human organisms have names, so it works both ways (i.e., all properties of dwc:organism apply to individuals identified as H. sapiens, and many/most properties of Agents as expressed in various data models could potentially be applied to non-human organisms. I know it's weird to think about, but from a data modelling perspective, it's practically a no-brainer.

But... we digress from the topic at hand...

@dagendresen
Copy link
Contributor

More of an issue is that in Simple Darwin Core all you get is a row for an Occurrence, an Event, or a Taxon, but that "record" can be "about" lots of things at the same time, and the one I'm interested in publishing might not be the one that interests someone else searching biodiversity databases.

+1 If the basisofRecord value "I" am thinking of (when publishing a Simple Darwin Core dwc:Occurrence record) is different from the basisOfRecord value "you" will be thinking of when using the record in your research - why then do we need basisOfRecord at all? Rather simply use proper (resolvable) identifiers for the actual things we think of when we publish the records? And simply acknowledge that when we do not use such identifiers the basisOfRecord will not add much value anyway.

ten thumbs up for deprecating dc:type and dwc:basisOfRecord in favor of rdf:type

+1

MaterialSample as a type of evidence for a species occurrence
What happens if the specimen is not originating from nature at all? Imagine the offspring from an animal kept in captivity (or a plant kept in cultivation) sampled and used as a type specimen as evidence for a new scientific name? If the specimen in a scientific collection lacks spatial or temporal information about the location or date from which it was sampled - is it then still (only) evidence for a species occurrence? A specimen or MaterialSample modeled (or published) as an instance of Occurrence always reduces the information value of the MaterialSample to be evidence for the occurrence of a taxon. While a specimen can be soooo much more. A specimen can be the evidence for a scientific name, it can be the source for DNA, a donor of genetic diversity for plant breeding, etc, etc. My argument is that specimens may have many other and different roles than only as evidence for a species occurrence! I will thus argue that we might rather need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

If modelling collection objects is your thing, please keep them separate from Darwin Core

It looks to me like many of us have a bit of agenda in how we describe the history, origin, and original purpose of Darwin Core? (Was the primary purpose of Darwin Core to describe evidence for species occurrences or to describe specimens in a collection?). Maybe rather focus on what Darwin Core has become (to many people) and where we want to go from here? Alternatively maybe it is time to start a new TDWG core for natural history specimens? E.g. a Aristotle Core ?? I hope not.

because it perpetuates the problem we currently have, which is forcing non-flat data about multiple things (e.g., Occurrence+MaterialSample+Identification+Taxon+Event+Location+etc.) into a flat structure, as though all these things exist simply for the purpose of documenting instances where organisms occur in nature

+1

@deepreef
Copy link

deepreef commented Oct 9, 2021

MaterialSample as a type of evidence for a species occurrence
What happens if the specimen is not originating from nature at all? Imagine the offspring from an animal kept in captivity (or a plant kept in cultivation) sampled and used as a type specimen as evidence for a new scientific name?

An occurrence is an occurrence, regardless of whether an organism participated in the associated Event of its own volition, or was somehow assisted in being present at it's place and time with the aid of some other organism. You give the example of a captive organism, but the same applies to parasites or stomach contents of birds/fish/mammals/etc., pollen on bees, and countless other examples where the participation of one organism at an event was facilitated by another organism -- regardless of whether that other organism happened to be identified as Homo sapiens.

The point is, if DwC lacks the necessary terms to capture the properties we want to do our analyses, then we need to fill those gaps in DwC. But in any case, an instance of MaterialSample (with its associated metadata) still can serve as evidence of an occurrence.

If the specimen in a scientific collection lacks spatial or temporal information about the location or date from which it was sampled - is it then still (only) evidence for a species occurrence?

The same applies to images that lack corresponding metadata, or any other kind of potential evidence. Maybe instead of describing MaterialSample instances as a form of evidence, we should refer to them as "potential evidence". In any case, without spatial or temporal metadata (aka association with an Event), it's pretty useless in terms of serving as evidence of an occurrence -- regardless of whether it's an organism-based occurrence or a taxon-based occurrence. However, it can still serve as evidence of Identification. And the reverse may also be true -- that a specimen in a Museum serves as evidence of an organism occurrence, but is damaged such that the taxonomically diagnostic character(s) are absent, and therefore functions as weak/useless evidence of Identification. It can still have an Identification -- just not so much evidence supporting it.

A specimen or MaterialSample modeled (or published) as an instance of Occurrence always reduces the information value of the MaterialSample to be evidence for the occurrence of a taxon. While a specimen can be soooo much more.

+1 Exactly! This is why we're having this discussion -- to elevate MS to something much more than simple (potential) evidence of occurrence. That doesn't mean that MS instances cannot function as evidence of occurrences (and identifications) -- it just means that MS instances can have other functions in other contexts.

A specimen can be the evidence for a scientific name,

Well... sort of... only if it's a name-bearing type, and even then, it inherits it's evidentiary role through an instance of Identification. So, the pedant in me want to clarify that a specimen can be evidence for a dwc:Identification instance, rather than for a scientific name.

it can be the source for DNA, a donor of genetic diversity for plant breeding, etc, etc.

...or the subject an image, or an item sent on loan, or an object monitored over time to study preservation techniques (just to flesh out the "etc." a bit more...)

My argument is that specimens may have many other and different roles than only as evidence for a species occurrence!

+10!!

I will thus argue that we might rather need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

Yes, but isn't that exactly why this Task Group exists?

Maybe rather focus on what Darwin Core has become (to many people) and where we want to go from here?

Agreed! The "purpose" of DwC has evolved over time, and it continues to evolve, and that's a good thing in my opinion. From my perspective, it began as a way to exchange specimen data using a mostly-flat structure, then expanded to share information on occurrence records using a mostly-flat structure, then was implemented in a slightly less-flat star schema via GBIF. Now I think we're gradually easing towards more sophisticated structuring because we want to share information more precisely and robustly than we have before. Many/most content providers are not ready to go too far down this path, because of the limitations of software used to manage their data. But there seems to be critical mass among folks who have the capabilities to share information more precisely and robustly than we have before, so we're laying the foundation for the next-generation mechanism of biodiversity data exchange.

These are exciting times for biodiversity data nerds!

@smrgeoinfo
Copy link

Here's a UML conceptual diagram of what I'm understanding from this conversation, reduced to the concepts that seem to be central to the use cases.

image

@Jegelewicz
Copy link
Collaborator Author

need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

We definitely need this, but I think we have multiple independent task groups working on it now?

https://www.tdwg.org/community/gbwg/enviro/

I really feels like these will overlap significantly if we do take a "giant leap".

@dr-shorthair
Copy link

dr-shorthair commented Oct 13, 2021

I'm only an occasional DWC observer. I find it interesting to read the perspectives, though I may not fully understand the drivers for the different emphases. However, I do see some tension between the folk who use DWC primarily as a set of tags for data transfer, and the people who would like to conceive it as a model of an information system.

I generally work from the latter perspective. So the sketch by @baskaufs is useful. However, that is still cast in terms of tables, with noise from 'keys' and 'rows', etc. The view from DWC-SW puts it more conceptually. Nevertheless, there is clearly still some confusion.

Just above, @smrgeoinfo proposes a conceptual model that re-frames the discussion in a more general context. I believe this is based on (or at least it is compatible with) the OGC/ISO O&M and W3C SSN/SOSA (which I was involved in developing). This is based on a process-flow-model.

W3C Prov-O formalizes the components of a basic process-flow model:

PROV Starting Point classes

The key idea is to recognise that every Entity - both physical and informational - is the result of an Activity (a process or event) undertaken at a particular time and place, involving particular people and instrument(s), following a particular protocol, and using certain entities and information as inputs. Then the trick is to sort the activities, entities and information into sub-classes that are useful for your application. But in the first place I always find this distinction between Activity and Entity is a critical and very useful basic clarification when getting tangled up in models.

Important activity-types in empirical science are

  1. act-of-observation - the result of which is a piece of information, often a quantity or classification which is a characteristic of the entity-of-interest.
  2. act-of-sampling - the result of which is a special kind of entity, called a sample, which is somehow representative of some larger entity. A sample is expected to be the subject (entity-of-interest) of subsequent acts-of-observation.

In W3C SSN/SOSA these are modeled in a way that can be interpreted as types of prov:Activity, specifically

Observation-Prov-alignmentSampling-Prov-alignment

(SOSA calls it a 'feature-of-interest' instead of 'entity-of-interest'.)

There are lots of potential relationships between all of these things, and even more potential pathways that involve two or more steps. A practical information model will realize or implement a convenient subset. Which ones are convenient depends on your needs, in particular which class is central to your application perspective. Collection managers will focus on material samples; biodiversity people will focus on occurrences; ecologists on sites; etc. Each will have a different class at the middle of their star.
The diagrams immediately above put the act-of-observation and act-of-sampling at the centre.

Then the insights from SOSA are to focus on

  • 'what is the entity-of-interest of this act-of-observation?' This entity must be the thing that carries the 'observed-property' of the observation - i.e. the trait, or identification (classification).
  • 'what is the entity-of-interest of this act-of-sampling?' i.e. what is the entity that this (physical) sample is a sample of.

When samples are involved, then the science-question might also require you to distinguish the proximate and ultimate entity-of-interest. The proximate one is often a sample. In particular, I think the proximate entity-of-interest would be the basis-of-record of the result of the act-of-observation.

The ultimate entity-of-interest may be something else entirely (a population, or an ecosystem, for example). Some scenarios showing both proximate and ultimate versions in the context of both sampling and observations are sketched out here https://www.w3.org/TR/vocab-ssn-ext/#fig-observation-ultimate-foi


A. The Prov-O Activity-Entity disjuncture matches the fundamental Occurrent-Continuant distinction from BFO, which is the basis of the OBO ontologies, and is called Perdurant-Endurant in some other systems.

B. A small extension allows us to fully align SOSA to OBOE which some of you may be familiar with.

@jbstatgen
Copy link

This is an exciting discussion and I am very happy where @Jegelewicz , @baskaufs, @deepreef and everybody else are taking us.

Working on and trying to fit together the concept for the "Digital and extended Specimen (DES)" concept, an ecological use case, and trying to catch up on the ideas here, I am only joining the exchange now. It is encouraging to see that we independently are attracted to the same/similar solutions.

Below is a diagram summarizing my current understanding. It is based on the discussions within the DES-community, Baskauf & Wells 2016 (@baskaufs Thanks for pointing out the reference!), my subjective highlights of the discussion here and the diagrams added to the wiki by @stanblum .

DarwinSWdiagram_20211013 drawio

A couple of remarks:
"Token", "Event" and "Agent/Role" are taken from the recommendations of the RDA/TDWG Attribution Metadata Working Group (https://www.rd-alliance.org/group/rda-tdwg-metadata-standards-attribution-physical-and-digital-collections-stewardship/outcomes), which are based on the W3C-PROV ontology (https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/).

All Planetary and moreover Galactic Beings welcome: I am certainly a named Organism with the taxonomic classification "Homo sapiens" acting in one of the Role of an Agent here. Thus, the light grey connection looping around on the right side. Thank you all for this step away from a human-centric perspective (-> multispecies ethics). The Agent-box might be better placed between Organism and Token, however, we are here mostly focusing on the relationships surrounding Organisms and Tokens, thus, the Agent-box on the outside.

Working on definitions for "Digital Specimen", "Derived Data" and "Associated Data" in the DES context, I originally had equated "Digital Specimen" with "Token". However, a question that Anna Monfils asked for the European Frog Bit use case (cp. https://biss.pensoft.net/article/73814/) showed that it isn't as simple. The question was how to find all the data directly derived from a specific EFB individual, eg. the physical specimen, multiple plant-clippings for tissue samples, DNA-seqs., images, (audio recordings), etc. This is a one-step search within a network, however, it excludes one-step links to data closely associated. Such one-step links of a different "kind/type" are for example, links to other EFB individuals recorded within the same population, or to other species recorded in vegetation relevées describing the community in which the focal EFB individual was found, etc. Basically, all the data derived directly from the focal individual need to be linked to a shared entity/ID. Often this might be a physical specimen (ie. the focal individual), yet such a physical specimen might not have been collected (cp. also photo and audio/video recording of the same bird that wasn't caught). My solution was/is that a Digital Specimen fundamentally is a "bag of links", very much like the Organism entity discussed here. The EFB use case is the reason, why I see Token and Organism as separate entities.

In the context of "MaterialSample" discussed here, the Organism class/entity seems to be the core entity to which star-like all the other resources (Tokens, ie. different MaterialSamples, types) are linked. However, considering what @Jegelewicz asked above, my understanding is that all/most entities can be cores, depending on the question asked and user context. From a Plazi point of view a Publication which is a derived Token for us here, might be the core of their data model and thus basis of their star-diagram. Also in their use case, a core as bag of links might make sense: a publication has a DOI or ISBN etc., however, there are online versions and a host of hardcopy instances in libraries all over the world. There doesn't seem to be The parent hardcopy entity that can be defined as the origin of all other copies, derived data and links to associated information.

A Token in my concept is an abstract entity standing in for all the "types" that can provide evidence about an organism (observations, LivingSpecimens, seedbank lot, ...), an event, an identification, an agent, etc.

Tokens understood as evidence are closely related to the question of how reliable (cp. RA Fisher's statistical concept) this evidence is. At least at first sight, a physical specimen curated in a collection seems to provide more support/have more weight as evidence than a gal chatting you up in a bar (you might find out later that she is the world expert on a group of Peridinea and the anecdote she told you is quite solid). Traditional (ecological) Knowledge about species occurrences, habitats or distributions might never be recorded by physical specimens, publications, etc., however, being the accumulated (oral) knowledge and history over generations it might be solid scientific information and evidence. Also, in the point-line-intercept example provided by @albenson-usgs above, that data might not be easily reproducible (another boot trip to the GPS-coordinates soon after?), however the data are far from an anecdotal tourist snapshot. These digital-only, maybe also digitally-born datasets were consciously designed and follow standardized procedures of recording, identification, assessment, etc. On the other hand, a physical specimen with incorrect location information or identification, is unreliable and introduces error into an analysis.

The "types" certainly provide a general idea about reliability. An additional step further is to provide users per default options,eg. "fields" with which they can explicitly record their assessments or knowledge of reliability for all Tokens (several assessments pro token by eg. different users possible). I understand support or reliability as Bayesian priors (independent of how they are recorded by users, eg. for them it might be a choice of red, yellow or green). These Bayesian priors can be the result of calculations in hierarchical Bayesian models. An example is to calculate the support provided by a Token based on the number of links the Token has in the "digital extended network". This is similar to or leads to Bayesian network approaches (eg. https://en.wikipedia.org/wiki/Bayesian_network) for calculating posterior probabilities/likelihood support.

W3C PROV just got also introduced by @dr-shorthair while I wrote this contribution. I am looking forward to our discussion tonight.

@jbstatgen
Copy link

The most powerful solution would be to have ResourceRelationship as the primary core, serving as a "fat" triple-store of relationships among records (i.e., the ultimate join table), and then have each of the other DwC classes represented as extensions (effectively flattened cached values for the properties in each class where parity of property value to object is 1:1). This would get us a step closer to a future world where everything is serialized as RDF (I'm just term-dropping here -- I don't really understand what I'm saying). However, this is probably a bridge too far at this stage in the history of DwC and IPT; so it might be asking too much to make that leap right now (although if anyone is interested in exploring it, I'll gladly generate some sample datasets in this form).

@deepreef Would you generate an example of such a dataset in RDF? I think I have a (vague) idea how this will look like, though I might be completely off.

@cboelling
Copy link
Member

cboelling commented Oct 13, 2021

Trying to put the many facets of this discussion into perspective. Here is my current understanding of the conceptual concerns - which I find helpful to separate from implementation and pragmatic issues.

I understand that the main intention and use case for dwc:basisOfRecord is to act as one element in a structure for representing the state of affairs expressed by the following two sentences:

(1) One or more individuals of taxon X were present (somewhere) within geographical location L (at some time) during time interval T.
(2) Assertion (1) is based* on an artefact that, at least in part, is derived from or depends on said individuals of taxon X.

*possibly among other things

Sentence (1) describes (and asserts) an occurrence. The belief that what sentence (1) asserts for a given X, L, and T is actually the case is, in the context of biodiversity science in general and natural history collections in particular, based on inference chains that can be extremely varied and that can, as one important ingredient, have different kinds of artefacts as their starting points.

Examples of such kinds of artefacts are preserved biological specimen, photographs of biological specimen (taken at a particular place and time), or textual records of an observation, among many other categories that are useful to distinguish.

The term "derived from" in (2) is used to denote the relationship, however involved, between a biological organism that participated in the occurrence and an artefact that retains some of its physical substance, possibly altered by physical or chemical processes. Here are some examples. "at (L,T)" is a placeholder to denote a particular location and time interval in each case.

organism participating in occurrence artefact
a beetle crawling through in the Namib desert at (L,T) its sun-dried carcass collected from the sand some time later and put into a collection
a fish swimming in lake Michigan at (L,T) a flash-frozen tissue sample taken from it before releasing the fish again, stored in a tissue collection
a fish swimming in a certain location in the Baltic Sea at (L,T) its preparation in ethanol in a jar in a collection
a bird inhabiting the area now known as "Grube Messel" at (L,T) the part of its fossil remains derived from its skull, 48 million years later
a worm in a certain portion of soil at (L,T) the nucleic acid extract from a soil sample taken at location L containing some of its DNA

The term "depends on" in (2) is used to denote the relationship, however involved, between a biological organism that participated in the occurrence and an artefact that existentially depends on it - without necessarily retaining any of the physical substance of that organism. Here are some examples:

organism participating in occurrence artefact
a rhinoceros at (L,T) a photograph of that rhinoceros taken at (L,T) by a human observer
a nightingale at (L,T) the sound recording produced by an unmanned monitoring station in the vicinity of L for the entire night containing a recording of its singing from 11:00 pm to 11:10 pm
a coral of type X at (L,T) a written note stating that a coral of type X was present at (L,T)
a gall wasp at (L,T) the dried-up gall that was inflicted on the host plant by the wasp, which is added to a collection

A "derived from" relation in fact implies a "depends on" relation, but I think it is useful to distinguish the two because physical vouchers of biological organisms are the hallmark of natural history collections.

Categorizing a (naturally) dried gall as an artefact may seem odd, but for the purpose at hand I find it helpful (and justified) to think of artefacts as anything that has been subject to human intervention - even minimal interventions like picking up the dried gall and putting it in a collection.

Clearly, knowing the particular artefact (and what kind of thing it is, e.g., preserved specimen, photograph, field book entry) and the characteristics of the inference chain that results in asserting an occurrence is desirable as it is informative for determining how reliable that assertion is, and to draw all sorts of interesting conclusions about the occurrence itself, how its representation in an information system (such as GBIF) came to be, and to provide information on the artefacts involved, e.g., where to find them or digital representations of them.

Oftentimes, only a shorthand categorization of the artefact (e.g., that it is a preserved specimen, a photograph, a text document) is needed, rather than every detail of its relationship with the organism that forms part of the occurrence in question. This is so, because the chains of inference for artefact/occurrence pairs for a given category of artefact, while they will differ in their details, are similar in those characteristics that matter to users. That is, knowing that the artefact is a photograph rather than a preserved specimen gives a big chunk of the information that is relevant for a number of use cases.

This then, I understand, is the intended function of dwc:basisOfRecord: to provide a shorthand categorization of the artefact that is used to infer the happening of a particular occurrence by virtue of the actual relationship of that artefact with an organism that participated in the occurrence. Another way to put this is that dwc:basisOfRecord captures, in a shorthand way, the type of evidence for an occurrence.

Specifically, in such a representation dwc:basisOfRecord would be used to link a pointer to an occurrence (such as an occurrence ID or a description like (1)) to a pointer denoting a category of artefacts. The meaning of this linkage would be: My (the representation author's) belief that the stated occurrence happened is based on an artefact of the stated type.

If this is the intended role of dwc:basisOfRecord then this could be captured by defining:

dwc:basisOfRecord =def: A property the values of which indicate the type of artefact that the assertion of an occurrence is based on.

Some of the ensuing questions leading into the implementation and pragmatic concerns, and that have been raised in this thread, would perhaps be:

  1. Is the current practice of data sharing via Darwin Core Archives and the IPT tool doing justice to this intended meaning of dwc:basisOfRecord? If not, what needs to change?
  2. Should dwc:basisOfRecord be deprecated in favour of another structure that can capture these semantics in a better way (whatever constitutes this being better)?
  3. Is the shorthand categorization that can be provided by dwc:basisOfRecord as a single attribute with a single value that denotes a category of artefacts adequate or is a more elaborate structure required that enables capturing properties of the artefact and characteristics of its linkage with the occurrence in more detail?
  4. Which vocabulary for denoting categories of artefacts should be used?

@wouteraddink
Copy link

@jbstatgen regarding the digital specimen in the figure: this could be any tangible curated object, not only biological objects. tangible because you can look at it, feel it etc. in contrast with an observation which could include a tangible object like a photo but may be just text. In the case of a biological material sample, the object could include a number of individuals of different or the same species, one individual or part of an individual. biological material samples could also be tangible objects that are not species material but can be linked to a species, like bird nests, spore prints, foot prints..

@deepreef
Copy link

@deepreef Would you generate an example of such a dataset in RDF? I think I have a (vague) idea how this will look like, though I might be completely off.

Sure, you could generate the whole thing in RDF, which I imagine would be the ultimate solution. However, I was thinking of an approach that's one step shy of that, which still keeps us in the realm of "relational" thinking with instances of classes that have properties (i.e., maintains the "tables and fields" approach to modelling the data). But I think of ResourceRelationship as a "fat" triple-store (I guess it would be sort of an octuple-store). But the idea would be to take all of the inter-class and intra-class relationships and represent them as instances of ResourceRelationship, as the "core" in a star schema representation, with all the other classes as extensions.

I've been flat-out overcommitted in recent months/weeks, but as soon as I catch a breather, I'll put together an example dataset.

@Jegelewicz
Copy link
Collaborator Author

Task Group has decided that BasisOfRecord is leading us away from our goal and that this issue should be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests