A project to facilitate the automatic assessment of the quality of machine-actionable data management plans (maDMPs). Purpose-built SPARQL queries, executed over a DCSO serialization (in this instance, JSON-LD), allow for making well-founded assumptions about the informational value of a maDMP.

raffaelfoidl/maDMP-evaluation

maDMP Evaluation

Have a look at our project documentation at https://raffaelfoidl.github.io/maDMP-evaluation/.

This page provides a less linear and more structured version of the information presented in this README.

Project Overview

Machine-actionable data management plans (maDMPs) have, by their very nature, advantages over data management plans that are written exclusively in text form. These advantages should benefit not only researchers, but also the research funders who receive and assess the DMPs.

In its Practical Guide to the International Alignment of Research Data Management, Science Europe has published an evaluation rubric (section Guidance for Reviewers) that provides a solid basis to support research (funding) organizations in evaluating DMPs. By stating a set of criteria, it helps to ensure submitted DMPs cover required aspects and support FAIR data management.

This project aims to facilitate leveraging the machine-actionability of DMPs by providing SPARQL queries that are meant to automatically give an initial assessment of the respective data management plan's quality.

Repository Structure

.
├── app-bin
├── dcso-json
│   ├── src
│   ├── target
│   ├── README.md
│   ├── dcso-json.iml
│   └── pom.xml
├── docs
├── maDMPs
│   ├── 1-community
│   ├── 2-preprocessed
│   ├── 3-normalized
│   ├── 4-json-ld-converted
│   ├── 5-json-ld-postprocessed
│   └── convert.sh
├── queries
└── generate_docs.sh
  • app-bin: distribution directory of the dcso-json tool (built and distributed upon mvn install, refer to its documentation for more information)
  • dcso-json: the source code of the dcso-json tool, bundled as a maven project
  • docs: the content of the documentation webpage; served via GitHub pages and the static site generator Jekyll.
  • maDMPs: different versions of the machine-actionable DMPs assessed during this project, from raw input data to normalized JSON-LD DCSO serializations (see also Methodology).
  • queries: SPARQL queries conceived during this project to assess the quality of maDMPs
  • generate_docs.sh: regenerates the content of the docs folder based on the README files in this repository

The documentation webpage is regenerated on the docs branch by a dedicated GitHub Action at every push to the main branch. In other words, the docs branch is the source of truth for https://raffaelfoidl.github.io/maDMP-evaluation/.

The docs folder on the main branch is only updated sporadically, e.g. at releases (commits with a release tag).

Methodology

Input Pipeline

The maDMPs we use as raw input data for our project are taken directly from the Zenodo Community Data Stewardship 2021 - DMPs.

  1. Start with raw maDMPs from the Zenodo Community.
  2. Ensure schema conformity, uniform formatting and indenting.
  3. Normalization: Establish uniform, alphabetical sorting of JSON properties.
  4. Convert JSON/Turtle maDMPs to a DCSO instance in a JSON-LD serialization using the dcso-json tool.
  5. Apply postprocessing to the JSON-LD maDMPs (again, establish uniform, alphabetical sorting of JSON-LD properties).
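
The normalization in steps 3 and 5 can be sketched as follows. This is a minimal illustration and not the script used in the project; the two-space indentation and the sample document are assumptions.

```python
import json


def normalize_json(text: str) -> str:
    """Return the JSON document with alphabetically sorted keys and uniform indentation."""
    data = json.loads(text)
    # sort_keys sorts object keys at every nesting level; array order is preserved.
    # indent=2 is an assumed convention, not one stated by the project.
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Made-up sample input, for illustration only.
raw = '{"title": "Example", "dmp_id": {"type": "doi", "identifier": "10.0000/example"}}'
print(normalize_json(raw))
```

Because the output depends only on the parsed content, running the function twice yields the same text, which makes diffs between pipeline stages easier to read.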

Regarding step 2, the following changes had to be made in order to achieve schema conformity for all input maDMPs:

  • 4.json
    • removed line breaks within JSON string literals
  • 6.json
    • some closing brackets were missing
    • object nesting hierarchy in the distribution field was incorrect (we could only make a guess as to the author's original intentions)
  • 10.json
    • objects instead of arrays with one element were used
    • sometimes, strings instead of numerical values (e.g. int) were used
    • two of the four data sets did not exhibit the required dataset_id field
    • on one occasion, a datetime format was used instead of a date
  • 11.json
    • correction regarding datetime format: 2021-04-12T25:10:16.8 -> 2021-04-12T25:10:16.8Z
    • correction regarding incorrect time value: 2021-04-12T25:10:16.8Z -> 2021-04-12T23:10:16.8Z (a day does not have more than 24 hours)
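
Some of the problems listed above can be detected or repaired mechanically. The following sketch shows one possible way to do so; the helper names are hypothetical, and the timestamp format string is an assumption derived from the corrected value shown for 11.json.

```python
from datetime import datetime


def is_valid_utc_timestamp(value: str) -> bool:
    """Check for the form YYYY-MM-DDTHH:MM:SS.fZ with in-range date and time parts."""
    try:
        # strptime rejects out-of-range parts such as hour 25 and a missing "Z".
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return False
    return True


def as_list(value):
    """Wrap a single object in a list; leave lists untouched."""
    return value if isinstance(value, list) else [value]


def as_int(value):
    """Convert numeric strings such as "42" to int; pass other values through."""
    return int(value) if isinstance(value, str) else value


for candidate in ("2021-04-12T25:10:16.8", "2021-04-12T25:10:16.8Z", "2021-04-12T23:10:16.8Z"):
    print(candidate, is_valid_utc_timestamp(candidate))
```

Only the last candidate passes the check: the first lacks the trailing Z, and the second has an hour value of 25.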

Step 4 has been performed automatically via the convert.sh shell script. For more information on the dcso-json tool invoked by this script, please refer to the dcso-json overview.

Processing of the Semantic maDMP Representation

After having brought the maDMPs into a semantically enriched JSON-LD format, we were ready to express requirements from the evaluation rubric mentioned in the Project Overview. We developed queries that project certain subsets of the data into a customized view (SELECT queries) as well as ones that simply indicate whether some criteria are satisfied (ASK queries).
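
The two query forms produce differently shaped results. The sketch below assumes that the endpoint returns the W3C SPARQL 1.1 Query Results JSON Format; the variable names title and language and the sample payloads are made up for illustration.

```python
import json


def read_sparql_json(payload: str):
    """Return a bool for an ASK result and a list of row dicts for a SELECT result."""
    doc = json.loads(payload)
    if "boolean" in doc:  # ASK: a single yes/no answer
        return doc["boolean"]
    variables = doc["head"]["vars"]  # SELECT: projected variables
    return [
        # Variables that are unbound in a solution are simply absent from its binding.
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


ask_payload = '{"head": {}, "boolean": true}'
select_payload = json.dumps({
    "head": {"vars": ["title", "language"]},
    "results": {"bindings": [
        {"title": {"type": "literal", "value": "Example DMP"},
         "language": {"type": "literal", "value": "eng"}},
        {"title": {"type": "literal", "value": "Another DMP"}},
    ]},
})

print(read_sparql_json(ask_payload))
print(read_sparql_json(select_payload))
```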

A more in-depth, but still summarized, overview of the queries we created can be found in Covered Criteria. The queries themselves are available in the queries directory.

During our experiment, we used a local GraphDB instance as triple store and SPARQL endpoint. Other triple stores such as Jena Fuseki are of course eligible as well. However, as a result of previous experiences with it, we opted for GraphDB.
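
As a rough sketch of how a query could be sent to such an endpoint using only the Python standard library: the endpoint URL below (local GraphDB on port 7200 with a repository called madmp) is an assumption and must be adapted to the actual setup. The request follows the SPARQL 1.1 Protocol (HTTP GET with a query parameter), so it should work against other triple stores as well.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: local GraphDB instance with a (hypothetical) repository named "madmp".
ENDPOINT = "http://localhost:7200/repositories/madmp"


def build_query_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint: str, query: str) -> bytes:
    """Send the query and return the raw response body (requires a running endpoint)."""
    with urlopen(build_query_request(endpoint, query)) as response:
        return response.read()


request = build_query_request(ENDPOINT, "ASK { ?s ?p ?o }")
print(request.full_url)
```

Building the request separately from sending it keeps the URL construction inspectable without a running triple store.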

Report on Quality of Input maDMPs

Finally, after having created the queries, we applied them to the maDMPs that made up our input and which had previously been imported into a GraphDB repository. The results of this assessment can be found in the Assessment Report.

Covered Criteria

The tables in the following subsections depict which criteria from the evaluation rubric we were able to express via SPARQL queries.

We want to stress that some queries could be formulated less strictly, i.e. OPTIONAL blocks could be inserted for triple patterns that join elements which the maDMP schema defines as optional, so that no results or information about the corresponding maDMP are "lost". However, as this project is more of a proof of concept, this is left as an easy extension for anyone building upon the work at hand.
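
The effect of a strict pattern versus an OPTIONAL block can be imitated with plain Python data. This is a toy model rather than an actual SPARQL evaluation; the record keys and values are illustrative and do not correspond to the DCSO vocabulary.

```python
# Toy stand-ins for distributions in a maDMP; the second one has no host.
distributions = [
    {"title": "Raw measurements", "license": "CC-BY-4.0", "host": "Zenodo"},
    {"title": "Source code", "license": "MIT"},
]


def strict_rows(records):
    """Like a plain graph pattern: a record matches only if every key is present."""
    required = ("title", "license", "host")
    return [tuple(r[k] for k in required) for r in records if all(k in r for k in required)]


def lenient_rows(records):
    """Like moving the host pattern into OPTIONAL { ... }: host may stay unbound (None)."""
    required = ("title", "license")
    return [
        (r["title"], r["license"], r.get("host"))
        for r in records
        if all(k in r for k in required)
    ]


print(strict_rows(distributions))   # the record without a host is dropped entirely
print(lenient_rows(distributions))  # the record is kept, with host left as None
```

In the strict variant, a missing host removes the whole row, including the license information that is present.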

The queries can be found in the corresponding directory in the GitHub repository. In the next subsections, the queries are referred to by their filename without extension, e.g. a reference to 5-a-1 points to the query file 5-a-1.sparql.

General Information

Requirement Covered In Remarks
Administrative information
Provide information such as name of applicant, project number, funding programme, version of DMP. 0-1, 0-2, 0-3 Query 0-1 returns the basic information, i.e. the author, title, created date and language of the maDMP as well as the ID of the corresponding DMP. Query 0-2 gathers all important information available for the corresponding project, whereas query 0-3 collects information about the funding of the project.

Data Description and Collection or Re-Use of Existing Data

Requirement Covered In Remarks
1a How will new data be produced and/or how will existing data be re-used?
Explain which methodologies or software will be used if new data are collected or produced. / Information provided by the methodology field in the dataset structure - however, this field is only specified in the funder extension and is not included in the RDA-DMP Common Standard; therefore, it can not be translated when converting the JSON files to a JSON-LD format and in consequence, not be queried.
State any constraints on re-use of existing data if there are any. / Not really covered by maDMP.
Explain how data provenance will be documented. / Not really covered by maDMP.
Briefly state the reasons if the re-use of any existing data sources has been considered but discarded. / Not really covered by maDMP.
1b What data (for example the kind, formats, and volumes) will be collected or produced?
Give details on the kind of data: for example, numeric (databases, spreadsheets), textual (documents), image, audio, video, and/or mixed media. 1-b-1 Queries all declared datasets and displays their title, type and identifier.
Give details on the data format: the way in which the data is encoded for storage, often reflected by the filename extension (for example pdf, xls, doc, txt, or rdf). 1-b-2 Returns the data formats of each specified distribution (including the respective access URL and description of the distribution).
Justify the use of certain formats. For example, decisions may be based on staff expertise within the host organisation, a preference for open formats, standards accepted by data repositories, widespread usage within the research community, or on the software or equipment that will be used. / Not really covered by maDMP.
Give preference to open and standard formats as they facilitate sharing and long-term re-use of data (several repositories provide lists of such ‘preferred formats’). / Not directly covered by maDMP; difficult to cover with a simple SPARQL query.
Give details on the volumes (they can be expressed in storage space required (bytes), and/or in numbers of objects, files, rows, and columns). 1-b-3 Displays for each defined distribution its size in bytes.

Documentation and Data Quality

Requirement Covered In Remarks
2a What metadata and documentation (for example the methodology of data collection and way of organising data) will accompany the data?
Indicate which metadata will be provided to help others identify and discover the data. 2-a-1, 2-a-2 2-a-1 collects all information provided by the metadata field, i.e. a description (optional), the used standard and the language. 2-a-2 displays the specified keywords for each defined dataset.
Indicate which metadata standards (for example DDI, TEI, EML, MARC, CMDI) will be used. 2-a-1 Information about the used metadata standards is covered in this query.
Use community metadata standards where these are in place. 2-a-3 Example query for testing whether certain community standards (Dublin Core, DDI, EML, TEI or MARC) are used. This can be arbitrarily modified based on which standards are preferred.
Indicate how the data will be organised during the project mentioning, for example, conventions, version control, and folder structures. Consistent, well-ordered research data will be easier to find, understand, and re-use. 2-a-4 Displays whether the given distribution hosts support versioning. The other information is not really covered by maDMP; if it is included in the maDMP, then probably in the data_quality_assurance field which is covered by query 2-b-1.
Consider what other documentation is needed to enable re-use. This may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, and so on. / This information would (if anything) probably be included in the methodology field in the dataset structure - however, this field is only specified in the funder extension and is not included in the RDA-DMP Common Standard; therefore, it can not be translated when converting the JSON files to a JSON-LD format and in consequence, not be queried.
Consider how this information will be captured and where it will be recorded (for example in a database with links to each item, a 'readme' text file, file headers, code books, or lab notebooks). / Not really covered by maDMP.
2b What data quality control measures will be used?
Explain how the consistency and quality of data collection will be controlled and documented. This may include processes such as calibration, repeated samples or measurements, standardised data capture, data entry validation, peer review of data, or representation with controlled vocabularies. 2-b-1 The best one can do is with the data_quality_assurance element.

Storage and Backup During the Research Process

Requirement Covered In Remarks
3a How will data and metadata be stored and backed up during the research?
Describe where the data will be stored and backed up during research activities and how often the backup will be performed. It is recommended to store data in at least two separate locations. 3-a-1 Retrieving information about backups is only possible by querying the host element. If provided, the query returns the backup type and frequency for each specified host, as well as some information about the host.
Give preference to the use of robust, managed storage with automatic backup, such as provided by IT support services of the home institution. Storing data on laptops, stand-alone hard drives, or external storage devices such as USB sticks is not recommended. / Not really covered by maDMP.
3b How will data security and protection of sensitive data be taken care of during the research?
Explain how the data will be recovered in the event of an incident. / Not really covered by maDMP.
Explain who will have access to the data during the research and how access to data is controlled, especially in collaborative partnerships. 3-b-1 The best one can do is with the security_and_privacy field. Information about the availability of data hosts is included in query 3-a-1.
Consider data protection, particularly if your data is sensitive (for example containing personal data, politically sensitive information, or trade secrets). Describe the main risks and how these will be managed. 3-b-2 Description of risks and countermeasures are not really covered by maDMP. Information about whether data are sensitive is covered.
Explain which institutional data protection policies are in place. / Information provided by the related_policy field in the dmp structure - however, this field is only specified in the funder extension and is not included in the RDA-DMP Common Standard; therefore, it can not be translated when converting the JSON files to a JSON-LD format and in consequence, not be queried.

Legal and Ethical Requirements, Code of Conduct

Requirement Covered In Remarks
4a If personal data are processed, how will compliance with legislation on personal data and security be ensured?
Ensure that when dealing with personal data, data protection laws (for example GDPR) are complied with. (including sub-points) 5-a-2 If anything, information about consent for preservation or sharing and anonymization would be included in the preservation_statement which is already covered by query 5-a-2. The other aspects are not really covered by maDMP.
4b How will other legal issues, such as intellectual property rights and ownership, be managed? What legislation is applicable?
Explain who will be the owner of the data, meaning who will have the rights to control access. (including sub-points) 3-b-1, 5-a-3 If anything, access restrictions would be included in the security_and_privacy field which is already covered by query 3-b-1. Descriptions of the licenses in place are queried with query 5-a-3.
Indicate whether intellectual property rights (for example Database Directive, sui generis rights) are affected. If so, explain which and how will they be dealt with. / Not really covered by maDMP.
Indicate whether there are any restrictions on the re-use of third-party data. / Not really covered by maDMP.
4c What ethical issues and codes of conduct are there, and how will they be taken into account?
Consider whether ethical issues can affect how data are stored and transferred, who can see or use them, and how long they are kept. Demonstrate awareness of these aspects and respective planning. 4-c-1, 4-c-2 Query 4-c-1 checks whether ethical issues exist. Query 4-c-2 returns a description of the specified ethical issues, if there are any, as well as the ethical issues report, if there is one.
Follow the national and international codes of conducts and institutional ethical guidelines, and check if ethical review (for example by an ethics committee) is required for data collection in the research project. / Not really covered by maDMP.

Data Sharing and Long-Term Preservation

Requirement Covered In Remarks
5a How and when will data be shared? Are there possible restrictions to data sharing or embargo reasons?
Explain how the data will be discoverable and shared (for example by deposit in a trustworthy data repository, indexed in a catalogue, use of a secure data service, direct handling of data requests, or use of another mechanism). 5-a-1 Displays information about distribution such as Host, access URL and distributed file formats.
Outline the plan for data preservation and give information on how long the data will be retained. 5-a-2 The best one can do is with the preservation statement. However, note that this query does not return anything for our input files in JSON-LD format, probably because the preservation_statement field in the JSON files is ignored by the DCSO-JSON tool (see "Known issues" in the README of the tool) and hence, not converted. In consequence, this field can obviously not be queried.
Explain when the data will be made available. Indicate the expected timely release. Explain whether exclusive use of the data will be claimed and if so, why and for how long. Indicate whether data sharing will be postponed or restricted for example to publish, protect intellectual property, or seek patents. 5-a-3 Gathers information about the data usage constraints (license, embargo period, data access, release date).
Indicate who will be able to use the data. If it is necessary to restrict access to certain communities or to apply a data sharing agreement, explain how and why. Explain what action will be taken to overcome or to minimise restrictions. / Not really covered by maDMP. Possible information contained in maDMP is already queried by 5-a-3.
5b How will data for preservation be selected, and where data will be preserved long-term (for example a data repository or archive)?
Indicate what data must be retained or destroyed for contractual, legal, or regulatory purposes. / Not explicitly covered by maDMP; maybe with 5-a-2
Indicate how it will be decided what data to keep. Describe the data to be preserved long-term. / Not explicitly covered by maDMP; maybe with 5-a-2
Explain the foreseeable research uses (and/or users) for the data. / Not covered by maDMP.
Indicate where the data will be deposited. If no established repository is proposed, demonstrate in the DMP that the data can be curated effectively beyond the lifetime of the grant. It is recommended to demonstrate that the repositories policies and procedures (including any metadata standards, and costs involved) have been checked. 5-b-1 Enumerates all information available about the hosts mentioned in the maDMP.
5c What methods or software tools are needed to access and use data?
Indicate whether potential users need specific tools to access and (re-)use the data. Consider the sustainability of software needed for accessing the data. 5-c-1 A stripped down version of 5-a-1, with focus on the distributed file formats. This is the best we can get with the maDMP since file formats indicate software/tools to be used to read the files.
Indicate whether data will be shared via a repository, requests handled directly, or whether another mechanism will be used? 5-a-1 There is no dedicated field in the maDMP for this. However, 5-a-1 obtains data from which the requested information can be inferred.
5d How will the application of a unique and persistent identifier (such as a Digital Object Identifier (DOI)) to each data set be ensured?
Explain how the data might be re-used in other contexts. Persistent identifiers (PIDs) should be applied so that data can be reliably and efficiently located and referred to. PIDs also help to track citations and re-use. 5-d-1 This is a slightly modified version of 5-a-1, with an emphasis on the employed PID system.
Indicate whether a PID for the data will be pursued. Typically, a trustworthy, long-term repository will provide a persistent identifier. 5-d-2 Tests whether there exists a distribution with a host that specifies the use of a PID system.

Data Management Responsibilities and Resources

Requirement Covered In Remarks
6a Who (for example role, position, and institution) will be responsible for data management (i.e. the data steward)?
Outline the roles and responsibilities for data management/ stewardship activities for example data capture, metadata production, data quality, storage and backup, data archiving, and data sharing. Name responsible individual(s) where possible. 6-a-1, 6-a-2 The queries show all available information about the contact person and contributors.
For collaborative projects, explain the co-ordination of data management responsibilities across partners 6-a-2 Depicts information about contributors defined by the maDMP.
Indicate who is responsible for implementing the DMP, and for ensuring it is reviewed and, if necessary, revised. / Not explicitly covered by maDMP, but 6-a-1 and 6-a-2 give a good indicator of who might be responsible.
Consider regular updates of the DMP. / Not covered by maDMP.
6b What resources (for example financial and time) will be dedicated to data management and ensuring that data will be FAIR (Findable, Accessible, Interoperable, Re-usable)?
Explain how the necessary resources (for example time) to prepare the data for sharing/preservation (data curation) have been costed in. 6-b-1 Not explicitly covered by maDMP; related information may be found by 6-b-1.
Carefully consider and justify any resources needed to deliver the data. These may include storage costs, hardware, staff time, costs of preparing data for deposit, and repository charges. 6-b-1 Lists everything related to costs that is captured by the maDMP.
Indicate whether additional resources will be needed to prepare data for deposit or to meet any charges from data repositories. If yes, explain how much is needed and how such costs will be covered. 6-b-2 Specifies equipment needed or used to create or process the data.

Summary

Overall, the Science Europe Evaluation Rubric defines 6 broad categories in its assessment guideline. The following table gives an overview of the spectrum we were able to cover with our queries (category 0, General Information, is listed in addition to the six rubric categories).

| Category | Number Of Subitems | Largely Covered Subitems | Percentage |
| 0 General Information | 1 | 1 | 100 % |
| 1 Data Description and Collection or Re-Use of Existing Data | 9 | 3 | 33 % |
| 2 Documentation and Data Quality | 7 | 5 | 71 % |
| 3 Storage and Backup During the Research Process | 6 | 3 | 50 % |
| 4 Legal and Ethical Requirements, Code of Conduct | 6 | 3 | 50 % |
| 5 Data Sharing and Long-Term Preservation | 12 | 8 | 67 % |
| 6 Data Management Responsibilities and Resources | 7 | 5 | 71 % |
| Sum | 48 | 28 | 58 % |
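
The percentages in the table follow directly from the two count columns. The short calculation below recomputes them from the figures above, rounded to whole percent.

```python
# (number of subitems, largely covered subitems) per category, copied from the table.
coverage = {
    "0 General Information": (1, 1),
    "1 Data Description and Collection or Re-Use of Existing Data": (9, 3),
    "2 Documentation and Data Quality": (7, 5),
    "3 Storage and Backup During the Research Process": (6, 3),
    "4 Legal and Ethical Requirements, Code of Conduct": (6, 3),
    "5 Data Sharing and Long-Term Preservation": (12, 8),
    "6 Data Management Responsibilities and Resources": (7, 5),
}


def percentage(total: int, covered: int) -> int:
    """Share of covered subitems, rounded to whole percent."""
    return round(100 * covered / total)


for category, (total, covered) in coverage.items():
    print(f"{category}: {percentage(total, covered)} %")

total_items = sum(total for total, _ in coverage.values())
total_covered = sum(covered for _, covered in coverage.values())
print(f"Sum: {total_covered}/{total_items} = {percentage(total_items, total_covered)} %")
```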

Assessment Report

In this section, we want to present an aggregated assessment of the maDMPs submitted to the aforementioned Zenodo community. As it is possible to determine the respective authors from the content of a maDMP, we want to clarify that we, of course, do not intend this assessment to disparage in any way either the efforts that went into creating the documents or the authors themselves. We merely utilized the files as realistic test data, since they stem from experiments with diverse topics, in order to gauge the utility of the SPARQL queries developed during our project. The tables below form a summary of our attempts at evaluating the maDMPs.

The "Satisfaction Value" column is numeric on a scale from zero to five. A value of five is equivalent to a holistic fulfilment of the respective criterion; a value of zero denotes either that the criterion is "not satisfied" or that the SPARQL queries are not able to extract the required pieces of information.
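
The "Sum" row of each table adds up the seven satisfaction values against a maximum of 7 × 5 = 35. A small sketch of that arithmetic, using the values from the 1.jsonld table below (the function name is hypothetical):

```python
MAX_PER_CATEGORY = 5  # upper end of the satisfaction scale


def total_score(values):
    """Return (achieved, maximum) for a list of per-category satisfaction values."""
    for value in values:
        if not 0 <= value <= MAX_PER_CATEGORY:
            raise ValueError(f"satisfaction value out of range: {value}")
    return sum(values), MAX_PER_CATEGORY * len(values)


# Satisfaction values for categories 0 to 6 of 1.jsonld.
scores_1_jsonld = [2, 2, 2, 2, 5, 3, 1]
achieved, maximum = total_score(scores_1_jsonld)
print(f"{achieved}/{maximum}")  # prints 17/35
```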

1.jsonld

Category Satisfaction Value Justification
0 General Information 2 Sufficient information about DMP. Information about project not included.
1 Data Description and Collection or Re-Use of Existing Data 2 The size of the produced/used data is provided. However, for two out of four distributions, the description is missing. Furthermore, the file formats of the produced data are not specified (in contrast to the reused data).
2 Documentation and Data Quality 2 No information about metadata or versioning provided. Keywords are included for half of the defined datasets. Minimal information about naming conventions included, as well as some statements about quality assurance measures.
3 Storage and Backup During the Research Process 2 maDMP does not have host elements defined, therefore some information is missing (backup type and frequency, availability). Good description of access restrictions. For most datasets, clear indication whether personal/sensitive data is stored provided.
4 Legal and Ethical Requirements, Code of Conduct 5 There is no information about potential preservation considerations. Regarding licenses, the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. Good description of access restrictions and sufficient declaration of ethical considerations.
5 Data Sharing and Long-Term Preservation 3 maDMP does not have host elements defined, therefore a lot of important information is missing (PID system, backup strategies, URLs etc.). There are preservation statements in the original JSON file, but they cannot be queried from the JSON-LD due to the reason explained above. Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition.
6 Data Management Responsibilities and Resources 1 Contact person is defined, but no contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP.
Sum 17/35

Due to the missing host definition, a lot of information could not be extracted with the queries. There is virtually no documentation of metadata. Information about the data management responsibilities is missing as well. Apart from those aspects, the maDMP provides a decent informational value.

2.jsonld

Category Satisfaction Value Justification
0 General Information 2 Sufficient information about DMP. Information about project not included.
1 Data Description and Collection or Re-Use of Existing Data 4 There is a clear description for each distribution. The file formats are specified (except for the source code). The size of the data is given as well (except for the source code).
2 Documentation and Data Quality 0 No keywords specified. Information about metadata, data quality assurance and versioning is missing.
3 Storage and Backup During the Research Process 0 maDMP does not have host elements defined, therefore some information is missing (backup type and frequency, availability). No description of security measures. For most datasets, no clear indication whether personal/sensitive data is stored provided.
4 Legal and Ethical Requirements, Code of Conduct 2 There is no information about potential preservation considerations. Regarding licensing, the maDMP contains helpful data, but the SPARQL query is a little bit too strict and fails due to the missing host definition. No description of access restrictions. Sufficient declaration of ethical considerations.
5 Data Sharing and Long-Term Preservation 2 maDMP does not have host elements defined, therefore a lot of important information is missing (PID system, backup strategies, URLs etc.). There are no preservation statements, therefore no information about research uses and preservation details (which data is kept, how to select etc.). Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition.
6 Data Management Responsibilities and Resources 1 Contact person is defined, but no contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP.
Sum 11/35

A lot of (important) information is missing in this maDMP. Hence, one can conclude based on the assessment with our queries that the maDMP provides insufficient documentation and exhibits many aspects in which it can be improved.

3.jsonld

Category Satisfaction Value Justification
0 General Information 5 Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project which was done in the course of this lecture is obviously not funded by anyone.
1 Data Description and Collection or Re-Use of Existing Data 4 There is a clear description for each distribution. The file formats are specified (except for the source code). The size of the data is given as well (except for the source code).
2 Documentation and Data Quality 0 No keywords specified. Information about metadata and versioning is missing. Minimal information regarding data quality assurance provided.
3 Storage and Backup During the Research Process 2 Extensive description of where the data is stored; however, information about backups and security measures is missing. Clear indication whether personal/sensitive data is stored.
4 Legal and Ethical Requirements, Code of Conduct 3 There is no information about potential preservation considerations. Useful information about licensing provided. No description of access restrictions. Sufficient declaration of ethical considerations.
5 Data Sharing and Long-Term Preservation 4 Substantial information about the data hosts (Zenodo); however, the corresponding queries are a little bit too strict and do not return anything. Good description of licensing/usage (sensitive data, embargo, openness). No explicit data preservation statement (missing data: retention period, data destruction, what data is kept).
6 Data Management Responsibilities and Resources 4 Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP.
Sum 22/35

The documentation of the metadata is not satisfactory. Furthermore, there is no information regarding security considerations, data preservation and backups. All in all, this maDMP provides sufficient documentation without being of exceptional quality.

4.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 5 | There is a clear description for each distribution. The file formats are specified as well as the size of the data. |
| 2 Documentation and Data Quality | 5 | Significant keywords are provided as well as the metadata accompanying the data. Community metadata standards are used. Minimal information about versioning is available. Extensive description of data quality assurance measures. |
| 3 Storage and Backup During the Research Process | 5 | Extensive description of where the data is stored, the respective backup modalities and access regulations. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 5 | Extensive documentation of access restrictions, ethical considerations and licensing. Information about data preservation with regard to sensitive or personal data included. |
| 5 Data Sharing and Long-Term Preservation | 5 | Substantial information about the data hosts (GitHub, Zenodo). Good description of licensing/usage (sensitive data, embargo, openness). There is a preservation statement in the original JSON file, but it cannot be queried from the JSON-LD due to the reason explained above. |
| 6 Data Management Responsibilities and Resources | 4 | Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP. |
| Sum | 34/35 | |

This maDMP could be assessed quite well; based on the results one can argue that this maDMP is excellent. The only missing aspect concerns the costs of the project.

5.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | The SPARQL queries are not able to collect the required data. Nevertheless, the original TTL file contains sufficient documentation of basic information (author etc.) and the project description. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 3 | There is a clear description for each distribution; the file formats are defined as well. However, the type of the dataset is not specified and the size of the produced/used data is missing. |
| 2 Documentation and Data Quality | 0 | Only a few keywords specified. Information about metadata and versioning is missing. No information regarding data quality assurance provided. |
| 3 Storage and Backup During the Research Process | 1 | Minimal description of where the data is stored and the respective backup modalities (which the SPARQL queries fail to detect). No description of security measures. No indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 2 | No mention of data preservation considerations and no description of access control mechanisms. Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the incomplete host definitions. Sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 4 | Almost complete information about the data hosts (GitHub, Zenodo); information about the PID system is missing. There is no preservation statement, therefore no information about research uses and preservation details (which data is kept, how to select etc.). Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the incomplete host definitions. |
| 6 Data Management Responsibilities and Resources | 5 | Extensive information about creator, contributors and their roles. Needed resources are defined. Financial costs are not specified in the maDMP. |
| Sum | 20/35 | |

The main issue here is the lack of metadata documentation. Aside from that, the maDMP fails to provide information about data storage and preservation, backups and security measures. The other aspects were elaborated sufficiently. In conclusion, this maDMP is of decent quality.

6.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 2 | Sufficient information about DMP. Information about project not included. |
| 1 Data Description and Collection or Re-Use of Existing Data | 4 | There is a clear description for each distribution. The file formats are specified (except for the source code). The size of the data is given as well (except for the source code). |
| 2 Documentation and Data Quality | 0 | No keywords specified. Information about metadata and versioning is missing. Minimal information regarding data quality assurance provided. |
| 3 Storage and Backup During the Research Process | 1 | maDMP does not have host elements defined, therefore some information is missing (backup type and frequency, availability). No information about security measures provided. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 2 | No mention of data preservation considerations and no description of access control mechanisms. Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the incomplete host definitions. Sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 2 | maDMP does not have host elements defined, therefore a lot of important information is missing (PID system, backup strategies, URLs etc.). There is no preservation statement, therefore no information about research uses and preservation details (which data is kept, how to select etc.). Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. |
| 6 Data Management Responsibilities and Resources | 1 | Contact person is defined, but no contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP. |
| Sum | 12/35 | |

A lot of (important) information is missing here. Hence, one can conclude based on the assessment with our queries that the maDMP provides insufficient documentation and exhibits many aspects in which it can be improved.

7.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 5 | There is a clear description for each distribution. The file formats are specified (except for the source code). The size of the data is provided as well. |
| 2 Documentation and Data Quality | 0 | No keywords specified. The specified metadata standard is not an actual standard. Minimal information regarding data quality assurance provided. |
| 3 Storage and Backup During the Research Process | 2 | Extensive description of where the data is stored; however, information about backups and security measures is missing. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 3 | There is no information about potential preservation considerations. Useful information about licensing provided. No description of access restrictions. Sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 4 | Substantial information about the data host (Zenodo) and licensing/usage (sensitive data, embargo, openness). No explicit data preservation statement (missing data: retention period, data destruction, what data is kept). |
| 6 Data Management Responsibilities and Resources | 4 | Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP. |
| Sum | 23/35 | |

The documentation of the metadata is not quite satisfactory. Furthermore, there is no information regarding security considerations, data preservation and backups. All in all, this maDMP provides sufficient documentation without being of exceptional quality.

8.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 3 | The file formats are defined. However, some distribution descriptions are missing as well as the size of some data. |
| 2 Documentation and Data Quality | 2 | Significant keywords are specified. The specified metadata standards are not actual standards. No information regarding data quality assurance provided, except for minimal information about versioning. |
| 3 Storage and Backup During the Research Process | 3 | Extensive description of where the data is stored; however, information about access restrictions is missing, as well as a description of the backup modalities for some hosts. Clear indication whether personal/sensitive data is stored for most specified datasets. |
| 4 Legal and Ethical Requirements, Code of Conduct | 3 | There is no information about potential preservation considerations and access restrictions. Extensive documentation regarding licensing. Sufficient declaration of ethical issues. |
| 5 Data Sharing and Long-Term Preservation | 5 | Extensive information about the data hosts (GitHub, Zenodo) and licensing/usage (sensitive data, embargo, openness). There are preservation statements in the original JSON file, but they cannot be queried from the JSON-LD due to the reason explained above. |
| 6 Data Management Responsibilities and Resources | 5 | Extensive information about creator, contributors and their roles. Extensive description of needed resources and costs. |
| Sum | 26/35 | |

There was apparently some confusion regarding the metadata standards. Apart from that, there are only a few small issues. Overall, this maDMP is of good quality.

9.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 4 | Extensive information about DMP. Description of the project is not included. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 5 | There is a clear description for each distribution. The file formats are specified as well as the size of the data. |
| 2 Documentation and Data Quality | 4 | Keywords are provided as well as the metadata accompanying the data. Community metadata standards are used. Minimal information about versioning is available. Description of data quality assurance measures missing. |
| 3 Storage and Backup During the Research Process | 4 | Extensive description of where the data is stored and access restrictions; however, information about backups is missing. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 4 | Extensive documentation of access restrictions, ethical considerations and licensing. No information about data preservation with regard to sensitive or personal data included. |
| 5 Data Sharing and Long-Term Preservation | 4 | There is extensive information about the data hosts (GitHub, Zenodo, The World Bank) and licensing/usage (sensitive data, embargo, openness). No explicit data preservation statement (missing data: retention period, data destruction, what data is kept). No target audiences (foreseeable research uses). |
| 6 Data Management Responsibilities and Resources | 5 | Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are also specified in the maDMP. |
| Sum | 30/35 | |

According to the results of the queries, this maDMP contains much of the information demanded by the evaluation rubric, although a few requirements were not completely fulfilled. This hints at good quality.

10.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 4 | The file formats are specified, but not in the IANA media type format. The size of the data is provided. However, the distribution descriptions are missing. |
| 2 Documentation and Data Quality | 2 | Significant keywords are specified. No information about metadata or versioning provided. Extensive documentation of naming conventions included. |
| 3 Storage and Backup During the Research Process | 1 | maDMP does not have host elements defined, therefore some information is missing (backup type and frequency, availability). No information about security measures provided. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 2 | There is no information about potential preservation considerations and access control. Regarding licensing, the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. Sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 3 | maDMP does not have host elements defined, therefore a lot of important information is missing (PID system, backup strategies, URLs etc.). There is a preservation statement in the original JSON file, but it cannot be queried from the JSON-LD due to the reason explained above. Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. |
| 6 Data Management Responsibilities and Resources | 4 | Creator/contact person is defined, but no contributors and their roles. Costs of storing and backing up the data are also specified in the maDMP. |
| Sum | 21/35 | |

Due to the missing host definition, a lot of information could not be extracted with the queries. There is virtually no documentation of metadata. Apart from those aspects, the maDMP provides decent informational value.

11.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 2 | Sufficient information about DMP. Information about project not included. |
| 1 Data Description and Collection or Re-Use of Existing Data | 4 | There is a clear description for each distribution. The file formats are specified (except for the source code). The size of the data is given as well (except for the source code). |
| 2 Documentation and Data Quality | 0 | No keywords specified. Original JSON file contains `documentation_and_metadata` element where some information about metadata is provided; this field is, however, not part of the RDA-DMP Common Standard and can therefore not be considered. No information about versioning. Minimal statement regarding data quality assurance. |
| 3 Storage and Backup During the Research Process | 1 | maDMP does not have host elements defined, therefore some information is missing (backup type and frequency, availability). No information about security measures provided. Clear indication whether personal/sensitive data is stored. |
| 4 Legal and Ethical Requirements, Code of Conduct | 2 | There is no information about potential preservation considerations and access control. Regarding licensing, the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. Sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 3 | maDMP does not have host elements defined, therefore a lot of important information is missing (PID system, backup strategies, URLs etc.). There is no preservation statement, therefore no information about research uses and preservation details (which data is kept, how to select etc.). Regarding licenses (license, embargo, openness, sensitivity), the maDMP does contain helpful data. However, the SPARQL query is a little bit too strict and fails due to the missing host definition. |
| 6 Data Management Responsibilities and Resources | 4 | Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP. |
| Sum | 16/35 | |

Because the JSON-LD maDMP was surprisingly short, we took a manual look at the source maDMP. This revealed that many of its fields are not actually part of the RDA-DMP Common Standard and are thus not queryable with our approach. From this assessment, one can conclude that there is still room for improvement.

12.jsonld

| Category | Satisfaction Value | Justification |
| --- | --- | --- |
| 0 General Information | 5 | Extensive information about DMP and project. Funding information is missing, but this is to be expected since the project, which was done in the course of this lecture, is obviously not funded by anyone. |
| 1 Data Description and Collection or Re-Use of Existing Data | 5 | There is a clear description for each distribution. The file formats are specified as well as the size of the data. |
| 2 Documentation and Data Quality | 5 | Significant keywords are provided as well as the metadata accompanying the data. Community metadata standards are used. Minimal information about versioning is available. Extensive description of data quality assurance measures and folder structures. |
| 3 Storage and Backup During the Research Process | 5 | Extensive description of where the data is stored, the respective backup modalities and access regulations. Clear indication whether personal/sensitive data is stored. Data is stored at four locations. |
| 4 Legal and Ethical Requirements, Code of Conduct | 5 | Extensive documentation regarding licensing. Good description of access restrictions and sufficient declaration of ethical considerations. |
| 5 Data Sharing and Long-Term Preservation | 4 | Substantial information about the data hosts (GitHub, Zenodo) and licensing/usage (sensitive data, embargo, openness). No explicit data preservation statement (missing data: retention period, data destruction, what data is kept). No target audiences (foreseeable research uses). |
| 6 Data Management Responsibilities and Resources | 4 | Clear information about creator, contributors and their roles. Costs (resources, equipment, staff expenses etc.) are not specified in the maDMP. |
| Sum | 33/35 | |

Overall, this maDMP could be assessed decently well and turned out to be of excellent quality based on the evaluation with our queries. Missing aspects were mostly due to the maDMP schema; a preservation statement would have provided some more information.

Conclusion

The table below displays the average satisfaction value for each category defined in the rubric as well as the average sum.

| Category | Average Satisfaction Value |
| --- | --- |
| 0 General Information | 3.9 |
| 1 Data Description and Collection or Re-Use of Existing Data | 4.0 |
| 2 Documentation and Data Quality | 1.6 |
| 3 Storage and Backup During the Research Process | 2.3 |
| 4 Legal and Ethical Requirements, Code of Conduct | 3.2 |
| 5 Data Sharing and Long-Term Preservation | 3.6 |
| 6 Data Management Responsibilities and Resources | 3.5 |
| Sum | 22/35 |

As one can see in the table and the individual evaluations above, the main issues in the input maDMPs are insufficient documentation of the metadata accompanying the used and produced data as well as missing information about the storage and backup of data (categories 2 and 3). Regarding the other categories, most of the maDMPs provided a decent amount of information, with a few shortcomings here and there. One aspect worth mentioning is the missing definition of the host element, an issue that appeared in quite a few maDMPs. Furthermore, costs were neglected in most maDMPs, and the required resources were specified in only a few.
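The sums and averages reported in this section are simple arithmetic over the satisfaction values. As a minimal sketch, the following Python snippet recomputes them for the nine maDMPs from 4.jsonld to 12.jsonld. The scores are transcribed from the individual tables above; since the remaining maDMPs are not included in the snippet, its averages cover this subset only and are not expected to match the overall averages in the table. The variable names are illustrative and not part of the project.

```python
from statistics import mean

# Satisfaction values for rubric categories 0 to 6 (each scored 0 to 5),
# transcribed from the tables for 4.jsonld to 12.jsonld.
scores = {
    "4.jsonld": [5, 5, 5, 5, 5, 5, 4],
    "5.jsonld": [5, 3, 0, 1, 2, 4, 5],
    "6.jsonld": [2, 4, 0, 1, 2, 2, 1],
    "7.jsonld": [5, 5, 0, 2, 3, 4, 4],
    "8.jsonld": [5, 3, 2, 3, 3, 5, 5],
    "9.jsonld": [4, 5, 4, 4, 4, 4, 5],
    "10.jsonld": [5, 4, 2, 1, 2, 3, 4],
    "11.jsonld": [2, 4, 0, 1, 2, 3, 4],
    "12.jsonld": [5, 5, 5, 5, 5, 4, 4],
}

# Sum per maDMP (maximum 7 * 5 = 35).
sums = {name: sum(values) for name, values in scores.items()}

# Average per category: zip(*...) turns the rows into one column per category.
category_averages = [
    round(mean(column), 1) for column in zip(*scores.values())
]

# Average of the per-maDMP sums for this subset.
average_sum = round(mean(list(sums.values())), 1)

for name, total in sums.items():
    print(f"{name}: {total}/35")
print("Category averages (subset):", category_averages)
print("Average sum (subset):", average_sum)
```

The per-maDMP sums computed this way agree with the "Sum" rows of the individual tables, which makes the snippet a quick consistency check when scores are updated.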

Nevertheless, based on the evaluation with our queries, one can argue that most maDMPs are of good (or at least sufficient) quality.

With respect to the quality and usefulness of our queries, as already mentioned in the introduction to this section, some queries could be made more tolerant of optional schema elements that are not defined. This would improve the results when gauging maDMPs. Other than that, the queries proved to be quite useful in assessing the set of input maDMPs.
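To illustrate what tolerance means here, the following plain-Python analogy contrasts a strict lookup, which drops a distribution as soon as its host is missing, with a tolerant one that keeps the row and leaves the host empty, similar in spirit to wrapping a pattern in SPARQL's `OPTIONAL`. This is not the project's query code; the records and the field names (`title`, `license`, `host`) are hypothetical and chosen only for illustration.

```python
# Hypothetical, simplified distribution records (not actual maDMP data).
distributions = [
    {"title": "Source code", "license": "MIT", "host": "GitHub"},
    {"title": "Results", "license": "CC-BY-4.0"},  # no host defined
]


def strict_license_report(dists):
    """Treat the host as required: rows without a host are dropped entirely."""
    return [
        (d["title"], d["license"], d["host"])
        for d in dists
        if "license" in d and "host" in d
    ]


def tolerant_license_report(dists):
    """Treat the host as optional: keep the row, host is None if missing."""
    return [
        (d["title"], d["license"], d.get("host"))
        for d in dists
        if "license" in d
    ]


strict_rows = strict_license_report(distributions)
tolerant_rows = tolerant_license_report(distributions)

print("strict:  ", strict_rows)
print("tolerant:", tolerant_rows)
```

With the strict variant, the licensing information of the second record is lost only because its host is undefined, which mirrors the cases above where a query fails due to a missing host definition even though the maDMP contains helpful licensing data.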

All in all, the SPARQL queries can certainly serve as a starting point for reviewers. However, it is worth noting that queries are mostly kept rather general in order to be applicable to the quite diverse input files. Hence, one might need to adjust them to fit one's specific domain and requirements.

License

MIT (see LICENSE file)
