Improve discovery of datacatalogs by registering well-known suffix 'datacatalog' #1290

coret · 2021-02-03T17:18:54Z

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server. By registering the 'datacatalog' suffix and promoting its use, the discovery of datacatalogs can be improved.

Although this proposal is not DCAT-specific (e.g. schema.org/DataCatalog would also benefit), we seek the support of the DCAT community for it, as well as that of the schema.org community; a similar issue has therefore been posted at schemaorg/schemaorg#2827.

We have drafted text that could be included in a specification document (it is closely modelled on https://www.w3.org/TR/void/#well-known):

Discovery with well-known URI

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server.

The URI /.well-known/datacatalog on any Web server is registered by this specification for a datacatalog containing descriptions of the datasets hosted on that server. For example, on the host www.example.com, this URI would be http://www.example.com/.well-known/datacatalog.

This URI may be an HTTP redirect to the location of the actual datacatalog file. The most appropriate HTTP redirect code is 302. Clients accessing this well-known URI MUST handle HTTP redirects.

The datacatalog file accessible via the well-known URI should contain descriptions of all datasets hosted on the server. This includes any datasets that have resolvable URIs, a SPARQL endpoint, a data dump, or any other access mechanism whose URI is on the server's hostname. Datacatalogs can be described using http://www.w3.org/ns/dcat#Catalog or https://schema.org/DataCatalog.
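The client-side behaviour described in the draft text above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `well_known_catalog_uri` and `resolve_catalog`, the `fetch` callable and the redirect limit are hypothetical and not part of the proposal, and the 'datacatalog' suffix itself is proposed here, not yet registered.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin, urlunsplit

# Path proposed in this issue; the suffix is not (yet) registered with IANA.
WELL_KNOWN_PATH = "/.well-known/datacatalog"

# Hypothetical fetch signature: takes a URI, returns (status code, response headers).
Fetch = Callable[[str], Tuple[int, Dict[str, str]]]


def well_known_catalog_uri(host: str, scheme: str = "http") -> str:
    """Build the proposed well-known URI for a given host."""
    return urlunsplit((scheme, host, WELL_KNOWN_PATH, "", ""))


def resolve_catalog(host: str, fetch: Fetch, max_redirects: int = 5) -> Tuple[str, int]:
    """Follow HTTP redirects from the well-known URI to the actual catalog.

    Returns the final URI and its status code. The redirect limit is an
    assumption of this sketch, not something the draft text prescribes.
    """
    uri = well_known_catalog_uri(host)
    for _ in range(max_redirects + 1):
        status, headers = fetch(uri)
        if status in (301, 302, 303, 307, 308):
            # The Location header may be relative, so resolve it against the current URI.
            uri = urljoin(uri, headers["Location"])
            continue
        return uri, status
    raise RuntimeError(f"more than {max_redirects} redirects for {host}")
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; a real client would supply a function that performs the request without following redirects automatically.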

Broad support for this proposal will help in getting the 'datacatalog' suffix registered. The registration procedure and template from Section 5.1 of RFC 5785 requires a change controller and specification document. Can this community assist in this process?

andrea-perego · 2021-02-03T22:25:15Z

Thanks for contributing this proposal, @coret .

We have discussed it during the WG call (https://www.w3.org/2021/02/03-dxwgdcat-minutes#t03), and we would like to ask you to elaborate on your use case, so that we can better understand whether this requirement falls within the scope of DCAT.

We checked the issue you point to (netwerk-digitaal-erfgoed/dataset-register#36) and your spec (https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/), but we were not able to find enough information.

coret · 2021-02-04T10:34:40Z

The Dutch Digital Heritage Network (Netwerk Digitaal Erfgoed) is a partnership in the Netherlands that focuses on developing a system of national facilities and services for improving the visibility, usability, and sustainability of digital heritage. The network is open to all institutions and organisations in the digital heritage field. Together we can make the most of our digital heritage and preserve it for future generations.

One of the goals is to get a better view of the available datasets in the digital heritage field. With a better understanding, datasets can be re-used and links between data(sets) can be made; Linked Open Data is an important part of the strategy. The "Register" project stimulates institutions and organisations in the digital heritage field to publish their dataset descriptions (and datacatalogs) online. We formulate requirements (this is where schema.org/Dataset and DCAT Application Profiles play an important role) and educate the organisations and their IT suppliers.

To collect the dataset descriptions (and, in the long term, build a knowledge graph) we have an API which organisations can use to register their dataset descriptions. The system contains a validator (SHACL) and a crawler to fetch (and frequently update) the dataset descriptions, which are stored in a public triple store. This is the reactive side of our crawler. To make our crawler more proactive in finding dataset descriptions, we can have it check the sites of Dutch heritage organisations. But instead of spidering a whole website (like Google does), it would be more efficient if the location of the datacatalog on a website had a fixed URI. This is where the .well-known/datacatalog scheme can help.
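As a rough illustration of the proactive side, a crawler could probe one fixed URI per host instead of spidering each site. The sketch below is not the Register's actual crawler; the names `probe` and `discover`, the use of HTTPS and the timeout value are all assumptions made for the example.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def probe(host: str) -> Optional[str]:
    """Return the final catalog URL if the well-known URI resolves, else None.

    urllib follows HTTP redirects by default, so geturl() gives the location
    the well-known URI ultimately points to.
    """
    url = f"https://{host}/.well-known/datacatalog"  # HTTPS is an assumption here
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.geturl()
    except OSError:
        # Covers URLError/HTTPError (e.g. 404) as well as timeouts and DNS failures.
        return None


def discover(
    hosts: Iterable[str],
    probe_fn: Callable[[str], Optional[str]] = probe,
) -> Dict[str, str]:
    """Map each host that advertises a datacatalog to the catalog's location."""
    found: Dict[str, str] = {}
    for host in hosts:
        location = probe_fn(host)
        if location is not None:
            found[host] = location
    return found
```

Each host costs one request, and hosts that do not publish the well-known URI are simply skipped; the discovered locations could then be handed to the existing validation and harvesting steps.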

I can imagine that in the DCAT specification, a paragraph stimulates the use of .well-known/datacatalog as a means to make datacatalogs more discoverable. This would benefit the publishers of datacatalogs and the automated usage of datacatalogs.

andrea-perego · 2021-02-04T20:43:27Z

Many thanks, @coret .

If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s). Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

/cc @nicholascar , @rob-metalinkage , @aisaac : Could you please give your perspective on this use case in relation to PROF & CONNEG?

makxdekkers · 2021-02-04T20:55:41Z

Does this presuppose that a domain can host a maximum of one data catalog?

rob-metalinkage · 2021-02-04T21:40:59Z

@andrea-perego I think this is largely orthogonal to connegp, which allows resources to self-describe alternative views rather than list different collections. A data catalogue view of the website itself would be an option to avoid having to specify a 'well-known' sub-resource.

coret · 2021-02-04T22:09:33Z

@makxdekkers yes, a well-known URI points (redirects) to one resource (the same as the other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.

coret · 2021-02-04T22:15:24Z

@rob-metalinkage where on a website could one find a data catalogue view? Is this the root of a website or can this be any URI? In the latter case, well-known is a mechanism to specify a URI which redirects to the resource. .well-known/datacatalog helps machines discover datacatalogs.

coret · 2021-02-04T22:39:23Z

@andrea-perego

> If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s).

That's correct.

> Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

Our crawler will be "confined" to heritage institutions and will be able to process dataset descriptions in DCAT 2 and schema.org/Dataset. The latter will be converted to DCAT so we can more easily query a uniform set of dataset descriptions to get insights. For the .well-known/datacatalog registration I think it's wise not to be too limiting with respect to datacatalog vocabularies.

I would imagine that products like Google Dataset Search would also benefit from the easy discovery of datacatalogs. Google Dataset Search is of course not limited to a domain and handles schema.org/Dataset (preferred) and DCAT (limited).

rob-metalinkage · 2021-02-05T00:22:36Z

@coret - yes you could have any resource support connegp - you are correct the "well knownedness" is the issue - connegp would certainly be relevant to allow any well known location (either the site root or a known location - or both) to offer multiple different forms of data catalogue - as opposed to having many alternative well known locations for different forms and needing to poll a range of them to find one a client can use.
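To make the connegp idea concrete: a client could ask a single well-known location for the catalogue form it understands by sending a ranked list of profiles. The sketch below assumes the `Accept-Profile` header syntax from the W3C Content Negotiation by Profile draft; the profile URIs and the helper name `accept_profile_header` are hypothetical.

```python
from typing import List


def accept_profile_header(profile_uris: List[str]) -> str:
    """Build an Accept-Profile header value, most preferred profile first.

    Each URI is wrapped in angle brackets and given a descending q-value;
    the 0.1 step and the 0.1 floor are arbitrary choices for this sketch.
    """
    parts = []
    for rank, uri in enumerate(profile_uris):
        q = max(0.1, 1.0 - 0.1 * rank)
        parts.append(f"<{uri}>;q={q:.1f}")
    return ", ".join(parts)


# Hypothetical profile URIs standing in for a DCAT and a schema.org profile.
headers = {
    "Accept": "text/turtle",
    "Accept-Profile": accept_profile_header(
        ["https://example.org/profile/dcat", "https://example.org/profile/schemaorg"]
    ),
}
```

With this approach the client polls one location and lets the server pick the best available form, rather than trying several well-known locations in turn.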

agreiner · 2021-02-05T00:39:45Z

> @makxdekkers yes, a well-known URI points (redirects) to one resource (the same as the other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.

How would a system know that it is encountering a data catalog that includes other data catalogs and then find those catalogs efficiently?

davebrowning · 2023-02-13T16:42:14Z

Project/Milestone modified.

Explanation: As DCAT v3 moves through review and hopefully ratification, we want to make sure that open issues and feedback that have yet to be completely addressed are properly recorded and tagged/assigned in GitHub, both to clarify their status and to help review and prioritise them as a source of improvements and new requirements in future DCAT versions.

This was referenced Feb 3, 2021

Improve discovery of datacatalogs by registering well-known suffix 'datacatalog' schemaorg/schemaorg#2827

Closed

Discover datasets netwerk-digitaal-erfgoed/dataset-register#36

Open

riccardoAlbertoni added dcat feedback Issues stemming from external feedback to the WG labels Feb 3, 2021

andrea-perego added this to the DCAT3 2PWD milestone Mar 11, 2021

andrea-perego added the requires discussion Issue to be discussed in a telecon (group or plenary) label Mar 13, 2021

andrea-perego modified the milestones: DCAT3 2PWD, DCAT3 3PWD May 4, 2021

andrea-perego modified the milestones: DCAT3 3PWD, DCAT3 4PWD Jan 26, 2022

davebrowning added the future-work issue deferred to the next standardization round label Feb 13, 2023

davebrowning modified the milestones: DCAT3 4PWD, DCAT Future Priority Work Feb 13, 2023
