datalad metadata

This was somewhat cool in the past -- ignore in the present. Kept for future reference.

This document describes datalad's approach to metadata, in particular what the metadata representation currently looks like.

DataLad uses RDF to represent metadata. However, datalad requires this representation only within collections. A dataset may or may not contain metadata prepared that way. A collection's metadata about a dataset can be imported from any location (inside or outside the dataset itself) and from various metadata formats (RDF as well as non-RDF). This allows different collections to contain the very same dataset with different metadata. It also means that any git-annex repository can be a dataset contained in a collection without having been touched by datalad beforehand.

dataset metadata

The metadata of a dataset has two levels. The first contains metadata about the actual content of a dataset and is provided by whoever creates or maintains the dataset. For this purpose, datalad can import metadata from different formats and represent it as RDF statements. Datalad expects a number of things to be expressed using certain terms. For a non-RDF format, datalad generates statements that use these terms. For metadata already provided in an RDF format, datalad adds statements using these terms while keeping the original ones, so that both can be used in queries.

The second level is metadata about the dataset itself. This is generated by datalad.

Together, these form a set of statements that datalad either expects to be present in the metadata or has to generate itself. I'll call this set of statements the "datalad dataset descriptor".

Note: To be clear, "expects" means: if the information is available, it is provided using these terms. It does not mean that certain information is necessarily available, nor that this information isn't also provided using other terms.

datalad dataset descriptor

This is the set of statements currently considered to be the datalad dataset descriptor. Note: There may be some minor changes or extensions soon.

Used prefixes:

rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
rdfs: <http://www.w3.org/2000/01/rdf-schema#>
xsd: <http://www.w3.org/2001/XMLSchema#>
prov: <http://www.w3.org/ns/prov#>
dcat: <http://www.w3.org/ns/dcat#>
dctypes: <http://purl.org/dc/dcmitype/>
dct: <http://purl.org/dc/terms/>
pav: <http://purl.org/pav/>
foaf: <http://xmlns.com/foaf/0.1/>
dlns: <http://datalad.org/terms/>
: <>
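To make the prefixed names used in the rest of this document concrete, here is a minimal Python sketch that expands them into full IRIs. The function name `expand` is hypothetical and not part of datalad; the prefix table is copied from the list above, and the empty prefix is left out because it resolves relative to wherever the descriptor is stored.

```python
# Prefix table copied from the list above. The empty prefix (": <>") is
# omitted because it resolves relative to the descriptor's own location.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "prov": "http://www.w3.org/ns/prov#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dctypes": "http://purl.org/dc/dcmitype/",
    "dct": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dlns": "http://datalad.org/terms/",
}


def expand(name):
    """Expand a prefixed name such as 'dlns:Handle' into a full IRI.

    Hypothetical helper for illustration only.
    """
    prefix, _, local = name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


print(expand("dlns:Handle"))
print(expand("pav:createdBy"))
```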

RDF is about describing resources. The dataset is a repository, so the resource is the path to (or the URL of) the repository. Therefore, the first statement declares that this resource is a datalad dataset:

<path/to/dataset> a dlns:Handle .

However, for a dataset descriptor that is stored within the dataset itself, this cannot be done this way for various reasons. In that case we use the 'special resource' dlns:this instead. When the descriptor is imported into a collection, dlns:this is replaced by the URI that the collection uses to point to the dataset.
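The replacement of dlns:this on import can be pictured with a small Python sketch. It is not datalad's actual import code: triples are modeled as plain (subject, predicate, object) tuples of strings, the function name `rebind_this` is hypothetical, and the file URI stands in for whatever URI a collection would really use.

```python
# Full IRI of the special resource, assuming the dlns prefix listed earlier.
DLNS_THIS = "http://datalad.org/terms/this"


def rebind_this(triples, dataset_uri):
    """Return a new list in which every dlns:this node becomes dataset_uri.

    Hypothetical helper; the input list is left unchanged.
    """
    def swap(term):
        return dataset_uri if term == DLNS_THIS else term

    # Predicates are left alone; dlns:this only appears as a node.
    return [(swap(s), p, swap(o)) for s, p, o in triples]


# Descriptor as it might be stored inside the dataset itself.
stored = [
    (DLNS_THIS,
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://datalad.org/terms/Handle"),
]

# Placeholder URI for the dataset as seen from a collection.
imported = rebind_this(stored, "file:///path/to/dataset")
print(imported[0][0])
```

Returning a new list keeps the stored descriptor untouched, which matches the idea that several collections can import the same dataset under different URIs.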

To identify resources within the context of that dataset, like persons that play a role described in the metadata, we use an 'empty prefix':

@prefix : <> .

Note: This will most likely slightly change, since it doesn't behave as expected with rdflib.

So, we can now state who created the dataset. Note: We may call this the "author" of a dataset, but this is not necessarily an author of the actual content of the dataset:

:someone a prov:Person, foaf:Person ;
    foaf:name "someone"^^xsd:string ;
    .

:datalad a prov:SoftwareAgent ;
    rdfs:label "datalad"^^xsd:string ;
    pav:version "1.0a"^^xsd:string ;
    .

<path/to/dataset> a dlns:Handle ;
    pav:createdBy :someone ;
    pav:createdWith :datalad ;
    .

Additionally, a dataset has a description, a title and a license:

<path/to/dataset> a dlns:Handle ;
    pav:createdBy :someone ;
    pav:createdWith :datalad ;
    dct:title "the dataset's name"^^xsd:string ;
    dct:description """This is a dataset and therefore it contains some kind of data. Probably the data is about a certain topic and was generated somehow.""" ;
    dct:license <uri/of/the/license> ;
    .
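Since the descriptor terms are "expected" only in the sense that they are used when the information is available, a consumer might want to report which of them are absent rather than fail. The following Python sketch does that for the dataset-level predicates shown so far. It is an illustration only: `missing_terms` is a hypothetical name, triples are plain tuples of strings, and the dataset URI is a placeholder.

```python
PAV = "http://purl.org/pav/"
DCT = "http://purl.org/dc/terms/"

# Predicates the descriptor attaches to the dataset resource so far.
EXPECTED = {
    "pav:createdBy": PAV + "createdBy",
    "pav:createdWith": PAV + "createdWith",
    "dct:title": DCT + "title",
    "dct:description": DCT + "description",
    "dct:license": DCT + "license",
}


def missing_terms(triples, dataset_uri):
    """List expected predicates that have no statement about dataset_uri.

    Hypothetical helper; it reports gaps instead of raising, because
    none of these statements is mandatory.
    """
    present = {p for s, p, _ in triples if s == dataset_uri}
    return [name for name, iri in EXPECTED.items() if iri not in present]


dataset = "file:///path/to/dataset"  # placeholder dataset URI
triples = [
    (dataset, PAV + "createdBy", dataset + "#someone"),
    (dataset, DCT + "title", "the dataset's name"),
]
print(missing_terms(triples, dataset))
```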

Now, the dataset has some content, which can be described by different types of data entities. There are a lot of terms that may be used to classify these entities. That's not our concern. We just state that the dataset contains these entities and that these entities were authored by some people, so we can query the metadata for that information. That very information may already be stated using other terms (see 'content2' below). In that case we keep that statement, but add our own:

:content1 a dctypes:Dataset ;
    pav:authoredBy :someauthor ;
    pav:authoredBy :someotherauthor ;
    .

:content2 a dcat:Distribution ;
    anotherNamespace:creator :someauthor ;
    pav:authoredBy :someauthor ;
    .

<path/to/dataset> a dlns:Handle ;
    pav:createdBy :someone ;
    ... see above ...
    dct:hasPart :content1 ;
    dct:hasPart :content2 ;
    .
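The rule illustrated by 'content2' (keep the original authorship statement and add a pav:authoredBy statement next to it) could be sketched in Python as follows. This is not datalad's implementation. Triples are plain tuples of strings, `add_authored_by` is a hypothetical name, and dct:creator is only an assumed stand-in for the "anotherNamespace:creator" placeholder, since this document does not say which foreign terms are recognized.

```python
PAV_AUTHORED_BY = "http://purl.org/pav/authoredBy"

# Assumption: dct:creator stands in for "anotherNamespace:creator".
# The real set of recognized terms is not given in this document.
AUTHOR_LIKE = {"http://purl.org/dc/terms/creator"}


def add_authored_by(triples):
    """Keep every original statement and add pav:authoredBy where missing.

    Hypothetical helper; returns a new list and never drops a statement.
    """
    result = list(triples)
    known = set(triples)
    for s, p, o in triples:
        extra = (s, PAV_AUTHORED_BY, o)
        if p in AUTHOR_LIKE and extra not in known:
            result.append(extra)
            known.add(extra)
    return result


original = [
    ("#content2", "http://purl.org/dc/terms/creator", "#someauthor"),
]
augmented = add_authored_by(original)
for triple in augmented:
    print(triple)
```

Because both statements remain in the result, a query can use either the original term or pav:authoredBy.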

In case the content's metadata doesn't provide data entities using certain terms already, we create one data entity of type 'dctypes:Dataset' to describe the content of the dataset.
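A minimal Python sketch of this fallback is given below. It simplifies the condition to "the dataset has no dct:hasPart statement yet", which is an assumption, as is the `#content` naming scheme for the generated entity; `ensure_content_entity` is a hypothetical name and triples are plain tuples of strings.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_HAS_PART = "http://purl.org/dc/terms/hasPart"
DCTYPES_DATASET = "http://purl.org/dc/dcmitype/Dataset"


def ensure_content_entity(triples, dataset_uri):
    """Add one dctypes:Dataset part if the dataset has no dct:hasPart yet.

    Hypothetical helper. Assumption: a missing dct:hasPart statement is
    taken to mean that no data entity was provided.
    """
    has_part = any(
        s == dataset_uri and p == DCT_HAS_PART for s, p, _ in triples
    )
    if has_part:
        return list(triples)
    entity = dataset_uri + "#content"  # hypothetical naming scheme
    return list(triples) + [
        (entity, RDF_TYPE, DCTYPES_DATASET),
        (dataset_uri, DCT_HAS_PART, entity),
    ]


dataset = "file:///path/to/dataset"  # placeholder dataset URI
described = ensure_content_entity([], dataset)
for triple in described:
    print(triple)
```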

# TODO reminders:

collection metadata

(TODO) (very similar) dct:hasPart => dataset

datalad config data

dlns:usesSource