Dedicated support for HTTP compliant datasets #1086

Aklakan · 2019-09-19T13:14:01Z

I understand that DCAT 2 content is frozen, so this is a feature request to be considered for a future version.

While working with DCAT data catalogs I came across this challenge: in practice, the link between datasets and distributions seems to be used fairly arbitrarily. For example, picking an arbitrary entry from data.gov, I can see a zip file, web resources, and a REST endpoint. In the typical CKAN-DCAT mapping, all of these resources become distributions, and my impression is that the DCAT 2 standard (intentionally?) does not impose many restrictions here.
Of course, a little semantics goes a long way, but after nearly two decades of the Semantic Web, I think many people in the RDF community want to go a bit further.

With this lax modeling, it is impossible to hand an application a reference to a (DCAT) dataset and have it do something smart with it.

So what is a dataset in the first place?
Section 5.1, "DCAT scope", states:

A dataset in DCAT is defined as a "collection of data, published or curated by a single agent, and available for access or download in one or more serializations or formats".

I would like to make the following proposal:

Definition: A dataset is an instance of a data model. Note that "data model" and "abstract syntax" are used as synonyms here.
A distribution denotes a means of access to that specific instance of the data model.
All distributions of a dataset should provide access to the same dataset. Hence, once a copy of the dataset has been obtained from one distribution, there is no need to fetch the others. On the other hand, if one distribution of an RDF dataset (a dataset that is an instance of the RDF model) is a SPARQL endpoint, an application may prefer this distribution over the file download.
A download URL points to a resource that can supply representations whose content types are among the concrete syntaxes of the abstract syntax. For tabular data, the concrete syntaxes are denoted by media types such as text/csv or text/tab-separated-values; for RDF data, they may be text/turtle, application/n-triples or application/rdf+xml.
If resolving the download URL does not yield specific HTTP headers (e.g. only the generic content type application/octet-stream, as for DBpedia downloads), then the content type, encoding, charset and language of the response (all standard HTTP headers) may be assumed from the distribution's DCAT description.
A zip archive by itself is typically NOT a dataset: it is simply an archive, and thus a collection of files. Without further references to standards or metadata, no application can reason about what or where the dataset in a zip archive is. A zip archive could contain a DCAT description of its own content, e.g. in a dcat.ttl file in the root folder. This file could then describe all the CSV, RDF, XML or other files in the archive.
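
To make the header fallback rule above concrete, here is a minimal Python sketch. The function name is made up for illustration, and treating only application/octet-stream as "not specific" is my assumption; neither DCAT nor HTTP prescribes this.

```python
from typing import Optional

# Assumption: this is the only content type treated as "says nothing specific".
GENERIC_TYPES = {"application/octet-stream"}


def effective_media_type(http_content_type: Optional[str],
                         dcat_media_type: Optional[str]) -> Optional[str]:
    """Return the media type a client may assume for a download.

    Prefer a specific Content-Type sent by the server; otherwise fall back
    to the media type stated in the distribution's DCAT description.
    """
    if http_content_type:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        base = http_content_type.split(";", 1)[0].strip().lower()
        if base and base not in GENERIC_TYPES:
            return base
    # The server was silent or generic: trust the catalog, if it says anything.
    return dcat_media_type.strip().lower() if dcat_media_type else None
```

For example, a response labelled application/octet-stream for a distribution described as application/n-triples would be interpreted as N-Triples, while a response that already says text/csv is left alone.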

Dataset descriptions that adhere to these rules can be unambiguously served according to HTTP principles, notably content negotiation, by a DCAT-based HTTP proxy.

The HTTP proxy internally resolves the URL requested by a client to an entry among a set of DCAT catalogs.
Based on the catalog, the server can automatically provide the appropriate HTTP headers. A smart server can even choose the appropriate download, perform HTTP caching, and convert the available syntaxes and encodings to those requested (Turtle to RDF/XML, CSV to TSV or Excel, etc.).
Note that HTTP already describes a mechanism for handling encodings (gzip, bzip2, brotli, etc.).
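
A rough Python sketch of how such a proxy might map an Accept header to one of the catalog's distributions. Everything here is hypothetical: the function names, the example.org URLs, and the simplified matching, which honours q-values and wildcards but ignores the more detailed precedence rules of HTTP content negotiation.

```python
from typing import Dict, List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        ranges.append((media_range, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def matches(media_range: str, media_type: str) -> bool:
    """True if a media range such as text/* or */* covers the media type."""
    if media_range in ("*/*", media_type):
        return True
    main, _, sub = media_range.partition("/")
    return sub == "*" and media_type.startswith(main + "/")


def choose_distribution(accept: str,
                        distributions: Dict[str, str]) -> Optional[str]:
    """Pick a download URL; `distributions` maps media type -> download URL."""
    for media_range, q in parse_accept(accept):
        if q <= 0:
            continue
        for media_type, url in distributions.items():
            if matches(media_range, media_type):
                return url
    return None  # a server would answer 406 Not Acceptable or convert


# Hypothetical catalog entry with two distributions of the same dataset.
dists = {
    "text/turtle": "https://example.org/demo.ttl",
    "application/n-triples": "https://example.org/demo.nt",
}
```

With this catalog entry, `choose_distribution("application/n-triples", dists)` returns the .nt URL, and a request for a syntax nobody offers returns None, which is where a smart server could step in with a conversion.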

As I see it, there is a strong link between DCAT and HTTP: datasets, under the strict definition above, correspond to HTTP resources and can therefore be served in a standard way based on catalog metadata. In my impression, this aspect is not yet adequately considered in the DCAT spec.

andrea-perego · 2019-09-26T09:14:22Z

Thanks for your proposal, @Aklakan .

Indeed, DCAT 2 is frozen, so we are assigning this to future work.

Aklakan · 2019-10-11T20:01:59Z

A quick example to clarify what I mean by the HTTP content negotiation aspect:

Let's say there is a DCAT catalog on the Web with an N-Triples and a Turtle distribution:

my:dataset
        a cat:Dataset ;
        cat:distribution my:dist-as-ttl, my:dist-as-nt .

my:dist-as-ttl
        a cat:Distribution ;
        dc:format "text/turtle" ;
        cat:downloadURL <https://gitlab.com/.../demo.ttl> .

my:dist-as-nt
        a cat:Distribution ;
        dc:format "application/n-triples" ;
        cat:downloadURL <https://gitlab.com/.../demo.nt> .

Then I would assume that if someone wrote a DCAT-based HTTP server that serves datasets according to their DCAT descriptions (I call that a data node), a client could do:

curl -X POST \
  -H 'Accept: application/n-triples' \
  'http://localhost/my-datanode?id=my:dataset'
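
For clients that are not shell scripts, the same request could be built in Python roughly like this. It mirrors the curl call above, including its POST method; the helper name is made up, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example; the data node itself is hypothetical.
DATA_NODE = "http://localhost/my-datanode"


def build_dataset_request(dataset_id: str, accept: str) -> Request:
    """Build (but do not send) a request for a dataset in a given syntax."""
    # urlencode percent-encodes the colon, so my:dataset becomes my%3Adataset.
    query = urlencode({"id": dataset_id})
    return Request(
        f"{DATA_NODE}?{query}",
        headers={"Accept": accept},  # the syntax the client wants back
        method="POST",               # same method as the curl example
    )


req = build_dataset_request("my:dataset", "application/n-triples")
```

Passing `req` to `urllib.request.urlopen` would perform the call against a running data node.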

And the data node would choose the appropriate distribution for it:

HTTP/1.1 200 OK
Date: Fri, 11 Oct 2019 19:49:09 GMT
Content-Type: application/n-triples; charset=utf-8
Content-Location: https://gitlab.com/.../demo.nt <--- ntriples served

So this establishes quite a strong link between DCAT and HTTP conneg.
I think this is very reasonable behaviour that should be specified in the DCAT spec (or a related one, like DCAT-HTTP). But maybe I am overlooking something, so I'd gladly get opinions on that :)

Of course, there are foreseeable subtleties that a data node has to handle, such as avoiding sending out content locations that cause an HTTP 506 Variant Also Negotiates.

Aklakan changed the title ~~Support for HTTP compliant datasets~~ Dedicated support for HTTP compliant datasets Sep 19, 2019

andrea-perego added dcat feedback Issues stemming from external feedback to the WG labels Sep 19, 2019

andrea-perego added the future-work issue deferred to the next standardization round label Sep 19, 2019

andrea-perego added this to the DCAT Future Priority Work milestone Sep 26, 2019

riccardoAlbertoni modified the milestones: DCAT Future Priority Work, DCAT3 FPWD Feb 6, 2020

andrea-perego modified the milestones: DCAT3 FPWD, DCAT Future Priority Work Oct 15, 2020

andrea-perego added the DesignPrinciples Somehow related to the design principles, e.g.levels of machine readability, ontological commitment label Oct 30, 2020

andrea-perego modified the milestones: DCAT Future Priority Work, DCAT3 2PWD Nov 11, 2020

andrea-perego added the requires discussion Issue to be discussed in a telecon (group or plenary) label Mar 13, 2021

andrea-perego modified the milestones: DCAT3 2PWD, DCAT3 3PWD May 4, 2021

andrea-perego modified the milestones: DCAT3 3PWD, DCAT3 4PWD Jan 26, 2022

davebrowning modified the milestones: DCAT3 4PWD, DCAT Future Priority Work Feb 13, 2023
