Metadata info

Ondřej Košarko edited this page Sep 16, 2016 · 1 revision

Required

During the submission we require that the user provides at least the following information:

  • The type of the resource; currently only 4 types are allowed (corpora, tools, lexical conceptual resources, language descriptions) (dc.type)
  • The title (dc.title)
  • List of authors (dc.contributor.author)
  • Issue date (dc.date.issued)
  • Description (dc.description)
  • Publisher (dc.publisher)
  • (if applicable) the code(s) of language(s) the resource is about (dc.language.iso)
  • Contact person (the person responsible for providing information about the resource) - at least surname and email
  • Distribution information - access rights, license information, license restrictions, distribution media
  • Content information - type of media (e.g. text/audio/…), (if applicable) further classification of the resource (e.g. ontology/thesaurus for lexical conceptual resources)
  • Size information - size in bytes/words/n-grams/… (if applicable)

From these we can currently produce a description of the resource as required by the (minimal) metashare schema (http://www.meta-net.eu/meta-share/metadata-schema) and/or our CMDI profile (schema) - http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1403526079380/xsd (or http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1349361150622/xsd)
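A minimal sketch of how the Dublin Core part of the required list could be checked before a submission is accepted. The field names come from the list above; the representation of an item as a dict of value lists and the function name `missing_required` are assumptions made for illustration, not the actual DSpace submission code.

```python
# Required Dublin Core fields, taken from the list above.
# Contact, distribution, content and size information are left out
# because the text does not name their fields.
REQUIRED_FIELDS = [
    "dc.type",
    "dc.title",
    "dc.contributor.author",
    "dc.date.issued",
    "dc.description",
    "dc.publisher",
]


def missing_required(item):
    """Return the required field names that are absent or empty in `item`.

    `item` is assumed to map a field name to a list of string values.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        values = item.get(field, [])
        # A field counts as present only if it has a non-blank value.
        if not any(value.strip() for value in values):
            missing.append(field)
    return missing
```

An item with a blank description and no publisher would be reported as missing `dc.description` and `dc.publisher`.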

OAI-PMH

The goal of the Open Archives Initiative is to ease access to material on the Web. They decided to do so by devising a protocol that enables Data providers (owners of web servers/repositories) to expose their “data about data” (metadata). The exposure happens in a way that allows Service providers to programmatically retrieve (harvest) these metadata and build value-added services on top of them (i.e. adding them to a larger, searchable collection). The protocol is called the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

Service providers issue HTTP requests to Data providers and receive XML responses. The protocol does not define a format (schema?) for the metadata. It is up to the particular Data and Service providers to come up with a format that suits their needs. A Data provider can even offer several formats. However, for interoperability purposes, one of these formats must be unqualified [Dublin Core](Metadata_info#Dublin Core).

From the point of view of this project, we want to (have to?) take both roles - Data provider and Service provider. We should support the Meta-share schema and also offer metadata in CMDI format.

To see currently supported formats check https://ufal-point.mff.cuni.cz/oai/request?verb=ListMetadataFormats
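The request/response cycle described above can be sketched as follows. This is only an illustration: the helper names are made up, the sample response is hand-written and abbreviated (a real ListMetadataFormats response also carries schema and namespace information), and the `cmdi` prefix in it is an assumption. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


def metadata_prefixes(response_xml):
    """Extract the metadataPrefix values from a ListMetadataFormats response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:metadataFormat/oai:metadataPrefix"
    return [element.text for element in root.findall(path, OAI_NS)]


# Hand-written, abbreviated sample; not a captured response.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>cmdi</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""
```

A harvester would fetch the URL returned by `build_request(...)` and pass the response body to `metadata_prefixes` to decide which format to ask for in later ListRecords requests.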

Crosswalks

A metadata crosswalk contains a mapping from one metadata element set to another, i.e. which elements are semantically close to which. The mapping usually discards some information, as it usually goes from fine-grained elements to coarser ones, e.g. creationDate -> date. Ingestion and dissemination crosswalks describe the direction in which the mapping “goes”. Suppose you have all metadata in the metashare “format” and have to offer them in DC; the mapping you do is a dissemination crosswalk. An ingestion (or submission) crosswalk does the mapping the other way around - you receive metadata in DC (e.g. through harvesting) and want to save them in the metashare “format”.
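A small sketch of why a crosswalk is lossy. Only the creationDate -> date pair comes from the text; the other element names (`lastModifiedDate`, `resourceName`) and both function names are hypothetical, chosen just to show two fine-grained elements collapsing into one coarse element.

```python
# Fine-grained -> coarse mapping (dissemination direction).
# Only "creationDate" -> "date" is from the text; the rest is illustrative.
DISSEMINATION = {
    "creationDate": "date",
    "lastModifiedDate": "date",
    "resourceName": "title",
}


def disseminate(record):
    """Map (element, value) pairs from the fine-grained set to the coarse one.

    Elements without a mapping are dropped.
    """
    result = []
    for element, value in record:
        target = DISSEMINATION.get(element)
        if target is not None:
            result.append((target, value))
    return result


def ingest(record, targets):
    """Map coarse (element, value) pairs back to fine-grained elements.

    `targets` has to pick exactly one fine-grained element per coarse
    element, so distinctions lost during dissemination cannot be restored.
    """
    return [(targets[element], value) for element, value in record if element in targets]
```

Disseminating a record with both a creation date and a modification date yields two plain `date` values; ingesting that output again has to store both under a single chosen element, so the round trip does not reproduce the original record.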

Dublin Core

In DSpace you meet element sets by the Dublin Core Metadata Initiative (DCMI) in two places. You’ll encounter DCMES and DCTERMS.

DCMES (or unqualified dc, or just dc) contains (and defines/describes) 15 elements for resource description; none of them is mandatory, some can be repeated (e.g. more creators). This is the “minimal” format we must support in OAI-PMH. The categories (elements) are very coarse, so by mapping to DC we usually “throw away” a lot of information we have about our data. On the other hand, everyone, even people outside our “partner projects”, can somehow interpret/use our data.

DCTERMS is another set (another namespace), but it contains the same 15 elements as DCMES plus their refinements - together approx. 50 elements. There are e.g. these elements: date, dateAccepted, dateCopyrighted… DSpace uses this set (or some modification of it - it refers to dcmi-terms but the sets are not exactly the same) when you submit an item or display the metadata of an item through “show full item record”.

While browsing you see e.g. dc.title.alternative (alternative is a refinement of title in dcterms), but in the OAI-PMH output it is just dc:title.
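The reduction from a qualified field to an unqualified element can be pictured with the sketch below. It assumes the qualifier is simply dropped, which matches the dc.title.alternative example; a real crosswalk is driven by an explicit mapping and may send a field to a different element (the table later on this page maps dc.contributor.author to creator, not contributor). The function name is hypothetical.

```python
def to_unqualified(field):
    """Reduce 'schema.element.qualifier' to 'schema:element'.

    Simplification: the qualifier is dropped and nothing is remapped.
    """
    parts = field.split(".")
    if len(parts) < 2:
        raise ValueError("expected 'schema.element[.qualifier]', got %r" % field)
    schema, element = parts[0], parts[1]
    return schema + ":" + element
```

With this, both `dc.title` and `dc.title.alternative` end up as `dc:title`, which is exactly the loss of detail described above.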

Our metadata in dc: https://ufal-point.mff.cuni.cz/oai/request?verb=ListRecords&metadataPrefix=oai_dc

Cmdi

The idea behind CMDI (Component MetaData Infrastructure) is that one metadata schema can’t “fit” a large community. The elements might be too detailed for some and still too coarse for others, subcommunities might prefer to use different names for the same thing, etc.

So the solution is to let the users/organizations create their own schemas, but with ensured “semantic interoperability”. A schema/profile is created from components - sets of metadata elements and other components. The components and profiles are stored in the component registry and thus can be shared and reused. Users can create new components, but the elements used must be linked to a database of atomic concepts (data categories). This makes it possible to “see” that, for example, ‘a noun’ and ‘a substantive’ are the same concept, so a search for tools/articles/… about ‘nouns’ can also return those about ‘substantives’. One such data category registry is ISOcat; the “purl” identifiers maintained by DCMI are also accepted.

As I understand it, we’ll have to use a profile registered in the component registry if we want to offer CMDI over OAI-PMH. To get full access to the component registry, I think, we need a login. Is there some institutional one? Or how does this work?

The component registry was populated with components/profiles based on some of the widely used element sets/schemata (e.g. IMDI, OLAC…) to allow a quick transfer for organizations (data providers) using these. So there is a dcmi-terms profile already prepared; it seems natural to use it as long as we have only a few items with metashare md, or while there is no profile available for metashare (they plan to do it in future (???)).

The OAI-PMH cmdi crosswalk maps the “DSpace dc” to dcterms; for the metashare elements some mapping is done when inserting an item. The email and validation are not mapped; the rest is mapped as follows:

dc.contributor.advisor = contributor
dc.contributor.author = creator
dc.contributor.editor = contributor
dc.contributor.illustrator = contributor
dc.contributor.other = contributor
dc.contributor = contributor
dc.coverage.spatial = spatial
dc.coverage.temporal = temporal
dc.creator = creator
dc.date.accessioned = date
dc.date.available = available
dc.date.copyright = dateCopyrighted
dc.date.created = created
dc.date.issued = issued
dc.date.submitted = dateSubmitted
dc.date.updated = date
dc.date = date
dc.description.abstract = abstract
dc.description.provenance = provenance
dc.description.sponsorship = description
dc.description.statementofresponsibility = description 
dc.description.tableofcontents = tableOfContents
dc.description.uri = description
dc.description.version = description
dc.description = description
dc.format.extent = extent
dc.format.medium = medium
dc.format.mimetype = format
dc.format = format
dc.identifier.citation = bibliographicCitation
dc.identifier.govdoc = identifier
dc.identifier.isbn = identifier
dc.identifier.ismn = identifier
dc.identifier.issn = identifier
dc.identifier.other = identifier
dc.identifier.sici = identifier
dc.identifier.slug = identifier
dc.identifier.uri = identifier
dc.identifier = identifier
dc.language.iso = language
dc.language.rfc3066 = language
dc.language = language
dc.publisher = publisher
dc.relation.haspart = hasPart
dc.relation.hasversion = hasVersion
dc.relation.isbasedon = relation
dc.relation.isformatof = isFormatOf
dc.relation.ispartof = isPartOf
dc.relation.ispartofseries = relation
dc.relation.isreferencedby = isReferencedBy
dc.relation.isreplacedby = isReplacedBy
dc.relation.isversionof = isVersionOf
dc.relation.replaces = replaces
dc.relation.requires = requires
dc.relation.uri = relation
dc.relation = relation
dc.rights.holder = rightsHolder
dc.rights.uri = rights
dc.rights = rights
dc.source.uri = source
dc.source = source
dc.subject.classification = subject
dc.subject.ddc = subject
dc.subject.lcc = subject
dc.subject.lcsh = subject
dc.subject.mesh = subject
dc.subject.other = subject
dc.subject = subject
dc.title.alternative = alternative
dc.title = title
dc.type = type
metashare.ResourceInfo#ContentInfo.mediaType = type
metashare.ResourceInfo#DistributionInfo.availability = rights
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium = medium
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse = rights
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType = description
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName = description
metashare.ResourceInfo#TextInfo#SizeInfo.* = extent
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName = contributor
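A lookup over such a table could look like the sketch below. It holds only a handful of the rows above, and it treats the trailing `.*` of the SizeInfo row as a shell-style wildcard, which is an assumption about how that row is meant to be read. The function name and the `size` qualifier used in the example are hypothetical.

```python
from fnmatch import fnmatchcase

# A few rows copied from the table above; the full table is much longer.
MAPPING = {
    "dc.contributor.author": "creator",
    "dc.contributor": "contributor",
    "dc.date.issued": "issued",
    "dc.title.alternative": "alternative",
    "dc.title": "title",
    "metashare.ResourceInfo#TextInfo#SizeInfo.*": "extent",
}


def dcterms_element(field):
    """Return the dcterms element for a DSpace field, or None if unmapped."""
    # Exact matches win over wildcard rows.
    if field in MAPPING:
        return MAPPING[field]
    for pattern, target in MAPPING.items():
        if "*" in pattern and fnmatchcase(field, pattern):
            return target
    # Fields without a row are simply not emitted.
    return None
```

For example, `dcterms_element("metashare.ResourceInfo#TextInfo#SizeInfo.size")` resolves through the wildcard row to `extent`, while a field without a row returns `None` and is skipped.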

To make a proper CMD file we also have to add the administrative metadata (“headers”):

  • MdCreator is omitted as it has minOccurs=0; but it can be set to something like “UFALCmdiCrosswalk” (name of script can be used)
  • MdCreationDate is set to item.getLastModified()
  • MdSelfLink - should be set to “the URL or PID of this file”; but we don’t actually have any real CMD file, so it is set to a URL that will return the CMDI metadata for this particular item, i.e. a link to the OAI app with a prefilled query (verb, identifier, metadataPrefix).
  • MdProfile - URL of profile’s XSD in the component registry.
  • MdCollectionDisplayName - item.getOwningCollection(), e.g. “UFAL - Published Data”. According to the FAQ: “an (optional but recommended) plain text indication to which collection this file belongs. Used for the Collection facet in the VLO”.
  • Resources - section, containing links to:
    • external files (e.g. an annotation file or a sound recording) and/or other CMDI metadata files (to build hierarchies)
    • The only attached resource, referenced through “hdl.handle.net + item.getHandle()”, is the item’s “webpage” in DSpace.
  • JournalFileProxyList (links to file(s?) tracking the changes of a resource) and ResourceRelationList are left empty.
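The administrative part described in the list above could be assembled roughly as follows. This is a sketch, not the crosswalk itself: the CMD namespace and the Components section are left out, the `cmdi` metadata prefix, the `LandingPage` resource type and the proxy id `lp_1` are assumptions, and all argument values in a call would come from the DSpace item.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_cmd_envelope(oai_base, oai_identifier, handle,
                       last_modified, profile_xsd_url, collection_name):
    """Build the Header and Resources parts of a CMD record (no namespace)."""
    cmd = ET.Element("CMD")

    header = ET.SubElement(cmd, "Header")
    # MdCreator is omitted, as described above (minOccurs=0).
    ET.SubElement(header, "MdCreationDate").text = last_modified
    # Self link: an OAI request that returns the CMDI record of this item.
    query = urlencode({
        "verb": "GetRecord",
        "identifier": oai_identifier,
        "metadataPrefix": "cmdi",  # assumed prefix
    })
    ET.SubElement(header, "MdSelfLink").text = oai_base + "?" + query
    ET.SubElement(header, "MdProfile").text = profile_xsd_url
    ET.SubElement(header, "MdCollectionDisplayName").text = collection_name

    resources = ET.SubElement(cmd, "Resources")
    proxies = ET.SubElement(resources, "ResourceProxyList")
    # The only attached resource: the item's page, referenced via its handle.
    proxy = ET.SubElement(proxies, "ResourceProxy", {"id": "lp_1"})
    ET.SubElement(proxy, "ResourceType").text = "LandingPage"  # assumed type
    ET.SubElement(proxy, "ResourceRef").text = "http://hdl.handle.net/" + handle
    # Both lists are left empty.
    ET.SubElement(resources, "JournalFileProxyList")
    ET.SubElement(resources, "ResourceRelationList")
    return cmd
```

`ET.tostring(build_cmd_envelope(...), encoding="unicode")` then gives the XML text to which the Components section with the actual metadata would still have to be added.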

Tools out there

Meta-share metadata schema

  • description of language resources (data sets and tools)
  • component based mechanism (CMDI see user manual p. 15)
    • grouping together semantically coherent elements
    • elements linked (how exactly? or what does it mean?) to ISOcat data category registry
  • adds the notion of relations (for instance to link raw and annotated data); these are represented as elements in the schema
  • two levels of granularity - minimal and maximal schema
    • minimal contains mandatory and condition-dependent mandatory elements
    • minimal is the schema we want to have
  • resourceInfo component is the core of the model
    • contains components and elements that provide description (e.g. identificationInfo (resourceName, description, metaShareId), contactPerson (…))
  • resourceType element sorts LRs into categories: corpus, lexical/conceptual resource, language description, tool/service (getting a bit lost here - is this distinction mandatory?)
  • ? Is it possible to use these components for the needed cmdi output? I assume that means a profile has to exist in the component registry…

Updating the schema in new installs

The dspace ant build file contains a job to populate the metadata tables. It does so by running MetadataImporter on an xml file that describes the schema. If the schema changes (e.g. new items), this file needs to be changed. That can be done either manually, or you can update the schema in any running instance (through xmlui) and use MetadataExporter to create this file.

Run the following from [installation]/config.

This worked in 1.6:

java -Ddspace.configuration=./dspace.cfg -cp ../lib/dspace-api-1.6.2.jar:../lib/* org.dspace.administer.MetadataExporter \
 -f yourCheckout/sources/dspace/config/registries/metashareSchema.xml -s metashare

In 1.8 it is similar (-f contains the path to the created file; -s is the name of the schema in the dspace instance you are exporting from):

java -Ddspace.configuration=./dspace.cfg -cp ../lib/dspace-api-1.8.2.jar:../lib/* org.dspace.administer.MetadataExporter -f [git-checkout-dir]/sources/dspace/config/registries/metashareSchema.xml -s metashare

Meta-share crosswalk

First, there is some more info about the schema on the Metashare_import page.

The crosswalk currently works as a transformation (XSLT) from the previous version (or from what was output by the previous oai_metashare crosswalk). Basically it reorders the previous elements, fills in mandatory but missing values, and tries to map our values to values of the metashare controlled dictionary where necessary.

  • metadataCreationTime is set to fixed value

  • availability: a mapping (our->metashare): notAvailable->notAvailableThroughMetaShare, and for all other values “available-” is prepended; this should result in values from the metashare controlled dictionary

  • distributionAccessMedium: download->downloadable, internetBrowsing->webExecutable, or our value is kept

  • restrictionsOfUse: academicUse/nonCommercialUse->academic-nonCommercialUse, commercial use->commercialUse, or our value is kept

  • licence: the mapping still has some drawbacks

      allow commercial sharing and changing->CC_BY
      allow commercial sharing and changing with same license->CC_BY-SA_3.0
      allow commercial sharing->CC_BY-ND
      allow non commercial sharing and changing->CC_BY-NC
      allow non commercial sharing and changing with same license->CC_BY-NC-SA_3.0
      allow non commercial sharing->CC_BY-NC-ND
      Creative Commons Attribution 3.0 Unported (CC BY 3.0)->CC_BY
      anything else->other
    
  • fundingType: EU->euFunds, Own->ownFunds, National->nationalFunds. Here we should probably enforce these values on our side and actually go through the resources and fix them (there is e.g. EU, eu, eufunds…)

  • resourceType is currently fixed to “corpus”, either text or audio, always monolingual…
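The value mappings listed above can be illustrated in a few lines. The real crosswalk is an XSLT; the Python below is only a sketch of the same rules, with made-up names. Keeping unknown fundingType values unchanged is an assumption, since the list does not say what happens to them.

```python
def map_availability(value):
    """notAvailable gets a fixed value; everything else is prefixed."""
    if value == "notAvailable":
        return "notAvailableThroughMetaShare"
    return "available-" + value


# Lookup tables copied from the list above (our value -> metashare value).
ACCESS_MEDIUM = {
    "download": "downloadable",
    "internetBrowsing": "webExecutable",
}
RESTRICTIONS_OF_USE = {
    "academicUse": "academic-nonCommercialUse",
    "nonCommercialUse": "academic-nonCommercialUse",
    "commercial use": "commercialUse",
}
FUNDING_TYPE = {
    "EU": "euFunds",
    "Own": "ownFunds",
    "National": "nationalFunds",
}


def map_or_keep(table, value):
    """Return the mapped value, or our original value when there is no rule."""
    return table.get(value, value)
```

Because the lookups are exact, a value such as `eu` passes through `map_or_keep(FUNDING_TYPE, ...)` unchanged, which is the inconsistency the fundingType item warns about.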

Quality

We should try to provide metadata that are as consistent as possible. That means checking the values from time to time, and if we conclude that there are multiple values for one entity (author, affiliated organization, etc.), replacing them with just one… Since we don’t display all the metadata in item browse, our main feedback currently comes from the organizations that harvest our repository (e.g. VLO). One “drawback” of this approach is that we actually see our metadata after a certain mapping (e.g. our publisher and affiliated institution are both organizations of some kind, so there is nothing “preventing” them from being grouped under one facet called organization).

So identifying which field a value came from might be somewhat tedious. In VLO you can display the source metadata; the dc metadata should be clear, and most of the (X)Paths in the resourceInfo component should resemble our database entries. But some values are generated automatically, the database uses some “fake” fields (e.g. detailedType) which are named differently depending on the type of resource, and some entries resemble the structure of the first version of the schema, which might have changed (something moved to another component). The easiest way to identify the field is thus probably to follow the link to the item display, go to the full display and search for the value with the browser’s find. Or connect to psql and do the search there, which is currently probably the best way… since you’ll want to find other items, or do some mass replace.

Another thing is that from time to time we might not like the facet under which a certain value appears; there’s really not much we can do about that. We should check the semantics of the field to see if it really matches our usage (if we are using the right fields for the right values). Some fields might have broader semantics than we choose to use (e.g. author = person || institution, we use it just for people; it can make sense when people and institutions are under one facet).

The grouping of different fields under one is usually application specific, so if someone chooses to put stuff like authors and language codes under one facet, we might just point out it seems odd…
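The periodic consistency check mentioned at the start of this section could begin with something as simple as the sketch below, which groups the values of one field by a normalized form and reports the groups with more than one spelling. The normalization (case folding and whitespace collapsing) and the function names are assumptions; it catches variants like EU/eu but not eufunds, and it does not touch the database.

```python
from collections import defaultdict


def normalize(value):
    """Collapse whitespace and ignore case so near-duplicates compare equal."""
    return " ".join(value.split()).casefold()


def find_variants(values):
    """Return {normalized form: sorted distinct spellings} for inconsistent values."""
    groups = defaultdict(set)
    for value in values:
        groups[normalize(value)].add(value)
    return {key: sorted(spellings)
            for key, spellings in groups.items()
            if len(spellings) > 1}
```

The values would be exported from the metadata table for one field at a time; each reported group is then a candidate for a mass replace with a single agreed spelling.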

Links

Regarding VLO