
Indicators for R1.2: (meta)data are associated with detailed provenance #28

Closed
makxdekkers opened this issue Jun 24, 2019 · 21 comments

@makxdekkers

[image: draft indicators for R1.2, based on contributions in the collaborative document]

@makxdekkers

Points raised in online meeting 3 on 18 June 2019

  • Context information is needed (domain-specific).
  • Provenance is critical; the (re)user needs to know who the author is and how to reach him/her. Provenance includes information on how the dataset was generated (calibration, methodology etc.), source and lineage, versioning, project and/or activity in the framework of which the data was produced.
  • The end-user (or a software agent acting on their behalf) needs to know if there is machine-readable or machine-understandable provenance information since this is essential for contextualisation (relevance, quality) of the asset.
  • R1.2 is very important for the long term.

@keithjeffery

To add a little more: there are many ways of recording provenance such that it can be managed autonomically. The W3C PROV recommendations are not the only way. In fact, provenance information in data models other than PROV has been used for a long time in many research domains, since it is commonly critical for evaluation of the re-usability (relevance, quality) of the asset.
For many instances PROV is insufficient; commonly researchers need to know - in the provenance information - not only that the asset was accessed/modified but the wider context, including e.g. observational or experimental equipment used, its parameters (accuracy, precision, calibration), associated methodology (lab notebook, observing diary), links to relevant publications (grey as well as white)...

@makxdekkers

@keithjeffery I am not sure the aspects you mention cannot be satisfied using PROV. As I understand it, PROV is very flexible with its Expanded and Qualified terms and might be able to express all of that.
On the other hand, I think no-one is proposing (yet) that an indicator should reference PROV-O specifically.

How could an indicator be formulated? Could it enumerate some critical provenance items (like the ones you list), or should we link to existing standards/guidelines that could form the basis for the indicator? If so, which standard/guidelines would be candidates for such a reference?
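For illustration, here is a minimal sketch, assuming rdflib and entirely hypothetical ex: URIs, of how PROV-O's qualified terms could capture the kind of equipment and calibration context mentioned above; it is one possible encoding, not a proposed indicator.

```python
from rdflib import Graph

# A minimal sketch (all ex: URIs hypothetical) of PROV-O qualified terms
# expressing instrument and calibration context for a dataset.
provenance = """
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

ex:dataset-42 a prov:Entity ;
    prov:wasGeneratedBy ex:observation-run-7 .

ex:observation-run-7 a prov:Activity ;
    prov:qualifiedUsage [
        a prov:Usage ;
        prov:entity ex:spectrometer-3    # the instrument that was used
    ] .

ex:spectrometer-3 a prov:Entity ;
    ex:calibrationDate "2019-05-01" .    # domain-specific extension term
"""

g = Graph()
g.parse(data=provenance, format="turtle")
print(len(g), "provenance triples parsed")
```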

@keithjeffery

keithjeffery commented Jul 1, 2019 via email

@makxdekkers

@keithjeffery Let's see if there are suggestions for those criteria from others in the WG.

@micheldumontier

What we expect is that communities identify what provenance information is crucial to the understanding of the digital resource. Of course, we can expect general properties (e.g. who created the resource, when it was created, etc.), but there will also be provenance specific to the kind of object (e.g. which instrument was used, what chip array was used, what detector was used). We expect that in many cases communities have already specified elements of provenance in their own data formats... FAIR then simply asks that this be mapped to more general-purpose provenance languages such as PROV.
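As a rough sketch of that last step, assuming rdflib and an invented mylab: community vocabulary, community-recorded provenance could be mapped onto PROV-O with a SPARQL CONSTRUCT query; the term correspondences chosen here are assumptions, not an endorsed mapping.

```python
from rdflib import Graph

# Hypothetical community record using its own (invented) provenance terms.
community_record = """
@prefix mylab: <http://example.org/mylab#> .
@prefix ex:    <http://example.org/> .

ex:dataset-42 mylab:createdBy  ex:alice ;
              mylab:instrument ex:chip-array-9 .
"""

# Sketch of a mapping to general-purpose PROV-O terms.
mapping_query = """
PREFIX mylab: <http://example.org/mylab#>
PREFIX prov:  <http://www.w3.org/ns/prov#>

CONSTRUCT {
    ?dataset prov:wasAttributedTo ?agent .
    ?dataset prov:wasGeneratedBy [ prov:used ?instrument ] .
}
WHERE {
    ?dataset mylab:createdBy  ?agent ;
             mylab:instrument ?instrument .
}
"""

g = Graph()
g.parse(data=community_record, format="turtle")
for s, p, o in g.query(mapping_query):
    print(s, p, o)   # the same provenance, now in PROV-O terms
```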

@makxdekkers

@micheldumontier Are you suggesting that an indicator be added for the mapping of object-specific or domain-specific provenance items to a more general-purpose provenance language? E.g.

Mapping of object-specific or domain-specific provenance information

  • NO mapping to general purpose provenance language
  • Mapping to general purpose provenance language (e.g. PROV-O)

@markwilkinson

markwilkinson commented Jul 9, 2019

I think we need to be very very careful about being prescriptive (either negatively, or positively) about any metadata element, when acting as a high-level working group. As I said during the call, a piece of ancient pottery doesn't have an author. Nor does a mammoth fossil. Nor does an animal in a zoo. Relevant metadata elements cannot be predicted, and therefore IMO, should not be within the scope of a high-level working group.

If I were to "invent a metric" for R1.2 (which I have been avoiding!! ...BECAUSE I think it is absolutely none of my business to do so! It's a community-level task!).... I would design something like this:

  1. Collect all of the traditional provenance-style metadata elements (DC, DCT, DCAT, PROV, etc.), and then do a count of how many of these are being used by the Resource - a larger number is "better"

  2. Of the remaining metadata elements used by the resource, do a profile of how many ontologies are being used (both in predicates and in object-position) in this metadata - where a larger number is "better".

  3. A given community, using their own internal use-cases, can come to some decision about what those numbers should be, to represent "pass" vs "fail" in their context.
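A rough sketch of that counting idea, assuming rdflib and leaving the pass/fail thresholds to each community (step 3); the namespace-splitting heuristic is an assumption, and object-position terms could be profiled the same way.

```python
from rdflib import Graph
from rdflib.namespace import DC, DCTERMS, PROV

# Namespaces treated as "traditional provenance-style" vocabularies;
# DCAT and others could be added the same way.
PROVENANCE_NAMESPACES = (str(DC), str(DCTERMS), str(PROV))

def namespace_of(uri: str) -> str:
    # Crude heuristic: split the URI on its last '#' or '/'.
    for sep in ("#", "/"):
        if sep in uri:
            return uri.rsplit(sep, 1)[0] + sep
    return uri

def provenance_profile(metadata: Graph) -> tuple[int, int]:
    predicates = {str(p) for p in metadata.predicates()}
    # 1. Count predicates drawn from traditional provenance vocabularies.
    prov_terms = {p for p in predicates
                  if p.startswith(PROVENANCE_NAMESPACES)}
    # 2. Count distinct ontologies among the remaining predicates.
    other_ontologies = {namespace_of(p) for p in predicates - prov_terms}
    return len(prov_terms), len(other_ontologies)
```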

v.v. mapping: I like the idea of mapping, though I'm loath to encourage communities to continue to create new vocabularies that represent existing concepts. There's also the problem of providing a common way for agents to discover mappings - so then we end up (potentially) inventing new standards for how to publish mapping resources... which the communities then have to build (and may not have the expertise to do so, depending on how they are implemented). Mapping isn't really a trivial problem - just ask those who have spent their careers doing schema-mapping in databases and XML ;-) Nevertheless, if we had mapping-made-easy (something similar to what identifiers.org does for mapping between GUIDs of the same thing in different databases) then I am OK with this idea. Anything harder than that, I suspect, would not be sustainable. (It isn't even clear if identifiers.org is sustainable.)

@keithjeffery

@markwilkinson -
I agree that a piece of pottery does not have an author, but it has relationships with persons: the creator (maybe unknown), the finder, the curator, the owner (maybe), with organisations (e.g. the museum), with documents (e.g. a scholarly paper or grey literature) and so on, all of which can be expressed in rich metadata.
I suggest we must stop thinking in terms of attributes or properties of an asset described as metadata (DC-think) and think more in terms of relationships around an asset described as metadata. This is what RDF tries to do (encoded as Turtle or something similar).
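To make that concrete: a minimal sketch in Turtle (parsed via rdflib, with invented ex: relationship terms) of a pottery asset that has no author but is surrounded by relationships to people, organisations and documents.

```python
from rdflib import Graph

# All ex: URIs and relationship terms below are invented for illustration.
pottery = """
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:pottery-shard-17
    ex:foundBy             ex:excavator-jones ;        # the finder
    ex:curatedBy           ex:museum-of-antiquities ;  # the curator
    dcterms:isReferencedBy ex:field-report-1998 .      # grey literature
"""

g = Graph()
g.parse(data=pottery, format="turtle")
for s, p, o in g:
    print(s, p, o)   # relationships around the asset, no "author" needed
```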

@makxdekkers

makxdekkers commented Jul 9, 2019

@markwilkinson I understand your reluctance to prescribe a particular set of provenance descriptors, because it very much depends on the type of resource and the community in which the resource is used. In that sense, it could be left to community-specific guidelines. This creates maximum potential for reuse within that community.
In addition, asking for mapping -- as much as possible and relevant -- to a general-purpose provenance ontology could be useful for potential cross-domain reuse. It's true that mapping is not trivial, but even if the mapping is incomplete and lossy, it could still be helpful.

It seems to me that the indicators given in the first comment above, which were based on the contributions in the collaborative document, are probably too specific. Maybe we could propose two new ones:

R1.2-01 Provenance information based on community-specific guidelines relevant for the resource

  • NOT based on community-specific guidelines
  • Based on community-specific guidelines

and

R1.2-02 Mapping of object-specific or domain-specific provenance information to a cross-domain language

  • NO mapping to general purpose provenance language
  • Mapping to general purpose provenance language (e.g. PROV-O)

@keithjeffery

@makxdekkers -
I would be content with your proposal as long as the last bullet is not prescriptive (fashion in standards changes with time - right now PROV is popular but there are other general mechanisms).

@makxdekkers

@keithjeffery The last bullet says 'e.g.' so it's not prescriptive. Would you have another example that could be included alongside PROV-O?

@keithjeffery

@makxdekkers -
Agreed. In EPOS we use CERIF, of course, for all aspects of metadata (discovery, contextualisation, curation, provenance), but I am not pushing for its inclusion. I just wanted to ensure that we (as we have elsewhere) avoid being (or being seen to be) prescriptive.

@markwilkinson

@keithjeffery absolutely. I was suggesting exactly the same thing. We need "a thick cloud" of metadata, but we cannot pre-determine what that cloud is composed of (and shouldn't try!)

@SusannaSansone

SusannaSansone commented Jul 19, 2019

This discussion also links nicely with the content of the RDA FAIRsharing WG registry, which is now one of the formally approved RDA outputs.

As detailed in #29, many domain/discipline-specific community standards (for representing/reporting digital objects) already contain some provenance, both general information and information specific to the kind of object (who created it, when and how, etc.; what technology was used, what analytical method, etc.); these communities are not using PROV. Adding R1.2-02 would be too specific.

@makxdekkers

@SusannaSansone R1.2-02 tries not to be too specific -- it contains a reference to PROV-O only as an example. The objective was to try to encourage mapping from domain-specific approaches to more general approaches so that people in other domains can also understand the provenance information.
It might indeed be that such a requirement is difficult to satisfy. However, cross-domain reusability will increase if such a mapping is provided.

@bahimc

bahimc commented Aug 2, 2019

Please find the current version of the indicator(s) and their respective maturity levels for this FAIR principle. Indicators and maturity levels will be presented, as they stand, to the next working group meeting for approval. In the meantime, any comments are still welcome.

The editorial team will now concentrate on weighing and prioritizing these indicators. More information soon.

[image: current version of the R1.2 indicators and their maturity levels]

@SusannaSansone

> @SusannaSansone R1.2-02 tries not to be too specific -- it contains a reference to PROV-O only as an example. The objective was to try to encourage mapping from domain-specific approaches to more general approaches so that people in other domains can also understand the provenance information.
> It might indeed be that such a requirement is difficult to satisfy. However, cross-domain reusability will increase if such a mapping is provided.

@makxdekkers I understand this "from domain-specific approaches to more general approaches", but then it has to be clear that this only refers to general approaches, because there are many community-specific (which can also imply domain/discipline-specific) models/formats (expressed in one or more of metamodels, XML, TAB etc.) that include provenance information (without using PROV). Just to pick one example: https://doi.org/10.25504/FAIRsharing.s51qk5

@makxdekkers

makxdekkers commented Aug 4, 2019

@SusannaSansone Indicator R1.2-01M is indeed about provenance information according to community-specific guidelines or standards. Is that not sufficiently clear? If not, how could it be formulated better?

@SusannaSansone

> @SusannaSansone Indicator R1.2-01M is indeed about provenance information according to community-specific guidelines or standards. Is that not sufficiently clear? If not, how could it be formulated better?

@makxdekkers if you just say "provenance information according to community-specific guidelines or standards", that is OK. My comment was on the example of PROV, which some domain-specific and community-specific standards do not use, yet these capture provenance information.

@bahimc

bahimc commented Oct 7, 2019

Dear contributors,

Below you can find the indicators and their maturity levels in their current state as a result of the above discussions and workshops.

[image: R1.2 indicators and maturity levels in their current state]

Please note that this thread is going to be closed within a short period of time. The current state of the indicators, as of early October 2019, is now frozen, with the exception of the indicators for the principles that are concerned with ‘richness’ of metadata (F2 and R1).

The current indicators will be used for the further steps of this WG, which are prioritisation and scoring. Later on, they will be used in a testing phase in which owners of evaluation approaches will be invited to compare their approaches (questionnaires, tools) against the indicators. The editorial team, in consultation with the Working Group, will define the best approach to test the indicators and evaluate their soundness. As such, the current set of indicators can be seen as an ‘alpha version’. In the first half of 2020, the indicators may be revised and improved based on the results of the testing.

If you have any further comments or suggestions regarding this specific discussion, please share them with us. We also invite you to have a look at the following two sets of issues.

Prioritisation

• Indicators prioritisation for Findability
• Indicators prioritisation for Accessibility
• Indicators prioritisation for Interoperability
• Indicators prioritisation for Reusability

Scoring

• Indicators for FAIRness | Scoring

We thank you for your valuable input!

@bahimc bahimc closed this as completed Oct 21, 2019