Improve discovery of datacatalogs by registering well-known suffix 'datacatalog' #1290

Open
coret opened this issue Feb 3, 2021 · 11 comments
Labels
dcat feedback Issues stemming from external feedback to the WG future-work issue deferred to the next standardization round requires discussion Issue to be discussed in a telecon (group or plenary)

Comments

@coret

coret commented Feb 3, 2021

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server. By registering the 'datacatalog' suffix and promoting its use, the discovery of datacatalogs can be improved.

Although this proposal is not DCAT specific (e.g. schema.org/DataCatalog would also benefit), we do seek the support of the DCAT community for this proposal (as well as that of the schema.org community; a similar issue has therefore been posted at schemaorg/schemaorg#2827).

We have drafted a text which could be included in a specification document (this is highly inspired by https://www.w3.org/TR/void/#well-known):

Discovery with well-known URI

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server.

The URI /.well-known/datacatalog on any Web server is registered by this specification for a datacatalog with dataset descriptions of datasets hosted on that server. For example, on the host www.example.com, this URI would be http://www.example.com/.well-known/datacatalog.

This URI may be an HTTP redirect to the location of the actual datacatalog file. The most appropriate HTTP redirect code is 302. Clients accessing this well-known URI MUST handle HTTP redirects.

The datacatalog file accessible via the well-known URI should contain descriptions of all datasets hosted on the server. This includes any datasets that have resolvable URIs, a SPARQL endpoint, a data dump, or any other access mechanism whose URI is on the server's hostname. Datacatalogs can be described using http://www.w3.org/ns/dcat#Catalog or https://schema.org/DataCatalog.
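
A minimal sketch of how a client might apply the draft text above: build the well-known URI for a host, then follow redirects until the actual datacatalog file is reached. The function names, the `fetch` callable, and the redirect limit are assumptions made for illustration; `/.well-known/datacatalog` is the suffix proposed here and is not registered.

```python
from urllib.parse import urljoin

# Suffix proposed in this issue; not (yet) a registered well-known URI.
WELL_KNOWN_PATH = "/.well-known/datacatalog"
REDIRECT_CODES = {301, 302, 303, 307, 308}


def well_known_catalog_url(host, scheme="http"):
    """Return the proposed well-known datacatalog URI for a host."""
    return f"{scheme}://{host}{WELL_KNOWN_PATH}"


def resolve_catalog_url(url, fetch, max_redirects=5):
    """Follow HTTP redirects from the well-known URI to the catalog file.

    `fetch` is a hypothetical callable taking a URL and returning a
    (status_code, location_header_or_None) tuple, so that the HTTP
    library stays out of this sketch.
    Returns the final URL on a 200 response, otherwise None.
    """
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url if status == 200 else None
    return None  # too many redirects
```

In a real crawler `fetch` would wrap an HTTP client with automatic redirect handling switched off; most clients can also follow redirects themselves, in which case only `well_known_catalog_url` is needed.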

Broad support for this proposal will help in getting the 'datacatalog' suffix registered. The registration procedure and template from Section 5.1 of RFC 5785 require a change controller and a specification document. Can this community assist in this process?

@andrea-perego
Contributor

Thanks for contributing this proposal, @coret .

We discussed it during the WG call (https://www.w3.org/2021/02/03-dxwgdcat-minutes#t03), and we would like to ask you to elaborate on your use case, so that we can better understand whether this requirement falls within the scope of DCAT.

We checked the issue you point to (netwerk-digitaal-erfgoed/dataset-register#36) and your spec (https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/), but we were not able to find enough information.

@coret
Author

coret commented Feb 4, 2021

The Dutch Digital Heritage Network (Netwerk Digitaal Erfgoed) is a partnership in the Netherlands that focuses on developing a system of national facilities and services for improving the visibility, usability, and sustainability of digital heritage. The network is open to all institutions and organisations in the digital heritage field. Together we can make the most of our digital heritage and preserve it for future generations.

One of the goals is to get a better view of the available datasets in the digital heritage field. With a better understanding, datasets can be re-used and links between data(sets) can be made; Linked Open Data is an important part of the strategy. The "Register" project encourages institutions and organisations in the digital heritage field to publish their dataset descriptions (and datacatalogs) online. We formulate requirements (this is where schema.org/Dataset and DCAT Application Profiles play an important role) and educate the organisations and their IT suppliers.

To collect the dataset descriptions (and, in the long term, build a knowledge graph) we have an API which organisations can use to register their dataset descriptions. The system contains a validator (SHACL) and a crawler to fetch (and frequently update) the dataset descriptions, which are stored in a public triple store. This is the reactive side of our crawler. To make our crawler more proactive in finding dataset descriptions, we can have it check the sites of Dutch heritage organisations. But instead of spidering a whole website (like Google does), it would be more efficient if the location of the datacatalog on a website had a fixed URI. This is where the .well-known/datacatalog scheme can help.

I can imagine a paragraph in the DCAT specification that encourages the use of .well-known/datacatalog as a means to make datacatalogs more discoverable. This would benefit the publishers of datacatalogs and the automated usage of datacatalogs.

@andrea-perego
Contributor

Many thanks, @coret .

If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s). Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

/cc @nicholascar , @rob-metalinkage , @aisaac : Could you please give your perspective on this use case in relation to PROF & CONNEG?

@makxdekkers
Contributor

Does this presuppose that a domain can host a maximum of one data catalog?

@rob-metalinkage
Contributor

@andrea-perego I think this is largely orthogonal to connegp, which allows resources to self-describe alternative views rather than list different collections. A data catalogue view of the website itself would be an option to avoid having to specify a 'well-known' sub-resource.

@coret
Author

coret commented Feb 4, 2021

@makxdekkers yes, a well-known URI points (redirects) to one resource (the same as the other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.
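
A rough sketch of how a single entry point could still expose several catalogs: DCAT 2 has a `dcat:catalog` property that links a catalog to the catalogs it contains, so a client can walk that property from the top-level catalog. The sketch assumes the description has already been parsed into compacted JSON-LD-style Python dicts with nested catalogs embedded inline; the function name and input shape are illustrative only.

```python
def iter_catalogs(catalog):
    """Yield a catalog and every catalog reachable through dcat:catalog.

    Assumes `catalog` is a dict in compacted JSON-LD style in which nested
    catalogs are embedded under the "dcat:catalog" key (a dict or a list
    of dicts). Catalogs with an "@id" are visited at most once.
    """
    stack = [catalog]
    seen = set()
    while stack:
        current = stack.pop()
        catalog_id = current.get("@id")
        if catalog_id is not None:
            if catalog_id in seen:
                continue  # guard against cycles between catalogs
            seen.add(catalog_id)
        yield current
        nested = current.get("dcat:catalog", [])
        if isinstance(nested, dict):
            nested = [nested]
        # Reverse so that nested catalogs are visited in document order.
        stack.extend(reversed(nested))
```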

@coret
Author

coret commented Feb 4, 2021

@rob-metalinkage where on a website could one find a data catalogue view? Is this the root of a website or can this be any URI? In the latter case, well-known is a mechanism to specify a URI which redirects to the resource. .well-known/datacatalog helps machines discover datacatalogs.

@coret
Author

coret commented Feb 4, 2021

@andrea-perego

If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s).

That's correct.

Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

Our crawler will be "confined" to heritage institutions and will be able to process dataset descriptions in DCAT 2 and schema.org/Dataset. The latter will be converted to DCAT so we can more easily query a uniform set of dataset descriptions to get insights. For the .well-known/datacatalog registration I think it's wise not to be too limiting with respect to datacatalog vocabularies.
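
As an illustration only (not our actual crawler code), a consumer that accepts both vocabularies could first check which catalog type a fetched description declares. The sketch assumes the description is available as a JSON-LD node whose `@type` values are full IRIs; the function name and return labels are made up for this example.

```python
DCAT_CATALOG = "http://www.w3.org/ns/dcat#Catalog"
# schema.org IRIs appear with both http and https schemes in practice.
SCHEMA_DATACATALOG = {
    "https://schema.org/DataCatalog",
    "http://schema.org/DataCatalog",
}


def catalog_vocabulary(node):
    """Return "dcat", "schema.org" or None for a JSON-LD node.

    Assumes `node` is a dict whose "@type" is a full IRI string or a
    list of full IRI strings (expanded rather than prefixed names).
    """
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if DCAT_CATALOG in types:
        return "dcat"
    if SCHEMA_DATACATALOG.intersection(types):
        return "schema.org"
    return None
```

A description typed as schema.org could then be routed to the conversion step before querying, while an unrecognised type is simply skipped.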

I would imagine that products like Google Dataset Search would also benefit from the easy discovery of datacatalogs. Google Dataset Search is of course not limited to a domain and handles schema.org/Dataset (preferred) and DCAT (limited).

@rob-metalinkage
Contributor

rob-metalinkage commented Feb 5, 2021

@coret - yes, any resource could support connegp. You are correct that the "well-knownness" is the issue. connegp would certainly be relevant in allowing any well-known location (the site root, a known location, or both) to offer multiple different forms of data catalogue, as opposed to having many alternative well-known locations for different forms and needing to poll a range of them to find one a client can use.
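
To make that concrete, a sketch of the request side: the Content Negotiation by Profile draft describes an `Accept-Profile` HTTP header listing profile URIs in angle brackets with optional q-values, so one well-known location could be asked for whichever form the client prefers. The profile URIs in the usage note are placeholders, not registered identifiers, and the helper name is invented for this example.

```python
def accept_profile_header(profiles):
    """Build an Accept-Profile header from (profile_uri, q_value) pairs.

    Each profile URI is wrapped in angle brackets and given a q-value,
    following the header syntax described in the Content Negotiation by
    Profile draft. Returns a dict that can be merged into request headers.
    """
    parts = [f"<{uri}>;q={q:g}" for uri, q in profiles]
    return {"Accept-Profile": ", ".join(parts)}
```

For instance, `accept_profile_header([("https://example.org/profile/dcat", 1.0), ("https://example.org/profile/schema-org", 0.5)])` expresses a preference for a (hypothetical) DCAT profile over a schema.org one in a single request to the same URI.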

@agreiner
Contributor

agreiner commented Feb 5, 2021

@makxdekkers yes, a well-known URI points (redirects) to one resource (the same as the other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.

How would a system know that it is encountering a data catalog that includes other data catalogs and then find those catalogs efficiently?

@andrea-perego andrea-perego added this to the DCAT3 2PWD milestone Mar 11, 2021
@andrea-perego andrea-perego added the requires discussion Issue to be discussed in a telecon (group or plenary) label Mar 13, 2021
@andrea-perego andrea-perego modified the milestones: DCAT3 2PWD, DCAT3 3PWD May 4, 2021
@andrea-perego andrea-perego modified the milestones: DCAT3 3PWD, DCAT3 4PWD Jan 26, 2022
@davebrowning davebrowning added the future-work issue deferred to the next standardization round label Feb 13, 2023
@davebrowning
Contributor

Project/Milestone modified.

Explanation: As DCAT v3 moves through review and, hopefully, ratification, we want to make sure that open issues and feedback that have yet to be completely addressed are properly recorded and tagged/assigned in GitHub, both to clarify their status and to help review and prioritise them as a source of improvements and new requirements in future DCAT versions.
