Suggestion to add dct:format and dcat:mediaType to dcat:DataService #1381

Open
jimjyang opened this issue Jun 11, 2021 · 9 comments
Labels
dcat:DataService · dcat · feedback (Issues stemming from external feedback to the WG) · future-work (issue deferred to the next standardization round)

Comments

@jimjyang

jimjyang commented Jun 11, 2021

formats and mediaTypes of a DataService

Status: In real use in the Norwegian national Data Catalog. Added as national extensions in DCAT-AP-NO.

Creator: Jim J. Yang

Deliverable(s): DCAT3

Stakeholders

Providers of DataServices/APIs
Consumers of DataServices/APIs
Providers of DataService Catalogs
Consumers of DataService Catalogs

Problem statement

Which formats/mediaTypes a DataService may support is one of the considerations you need to take when deciding whether a DataService is (re)usable or not.

In the 2nd Working Draft of DCAT3, dcat:DataService (still) doesn’t have properties dct:format and dcat:mediaType as in dcat:Distribution. Possible ways to describe formats/mediaTypes of a DataService are thus:

  1. Using dcat:DataService/dcat:endpointDescription. The main disadvantage is that it doesn’t necessarily ensure a standardized way of expressing formats and mediaTypes, which makes it impossible to e.g. sort/query DataServices according to formats/mediaTypes.
  2. Using dcat:DataService/dcat:servesDataset -> dcat:Dataset/dcat:distribution -> dcat:Distribution/dct:format (or dcat:Distribution/dcat:mediaType). The main disadvantage is the unnecessary overhead in creating and maintaining all those extra Distributions (and Datasets).
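The overhead of option 2 can be made concrete with a minimal Python sketch that treats a catalog as a list of (subject, predicate, object) tuples. The identifiers `ex:api`, `ex:ds` and `ex:dist` and the helper names are hypothetical, and the direct lookup assumes the proposed dct:format on dcat:DataService, which DCAT does not define today.

```python
# Toy catalog as (subject, predicate, object) triples; all ex: names are made up.
TRIPLES = [
    ("ex:api", "rdf:type", "dcat:DataService"),
    ("ex:api", "dcat:servesDataset", "ex:ds"),
    ("ex:ds", "dcat:distribution", "ex:dist"),
    ("ex:dist", "dct:format", "JSON"),
    # Proposed shortcut (not in DCAT today): format stated on the service itself.
    ("ex:api", "dct:format", "JSON"),
]


def objects(subject, predicate):
    """Return all objects of triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def formats_via_distributions(service):
    """Option 2: service -> dataset -> distribution -> format (three hops)."""
    found = set()
    for dataset in objects(service, "dcat:servesDataset"):
        for dist in objects(dataset, "dcat:distribution"):
            found.update(objects(dist, "dct:format"))
    return found


def formats_direct(service):
    """Proposed: read dct:format straight from the service (one hop)."""
    return set(objects(service, "dct:format"))


print(formats_via_distributions("ex:api"))
print(formats_direct("ex:api"))
```

Both calls give the same answer here, but the first only works if someone creates and maintains the intermediate Dataset and Distribution records.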

Requirements

Add the following properties to dcat:DataService:

  • format: property dct:format, range dct:MediaTypeOrExtent
  • media type: property dcat:mediaType, range dct:MediaType

Related use cases

"How to express available formats for a dcat:DataService (Issue #1055)", which at that time (Sep 2019) was already “tagged as a future priority”.

@smrgeoinfo
Contributor

A convention of using an OpenAPI service description as the dcat:endpointDescription would address the issue.

@jimjyang
Author

Thanks for the suggestion, @smrgeoinfo.

Yes, we do recommend using OpenAPI. However, we also need to support DataServices that are described using other description languages, among them WSDL 2.0 and WADL from W3C.

As stated above, you need to know the formats/mediaTypes of a DataService already when deciding whether it is (re)usable. My main point, which may not have been clear enough in the proposal above, is that our users may need to do the following inside the catalog:

  • Know formats/mediaTypes of a DataService (instead of being forced to read an external endpointDescription, as @oystein-asnes also pointed out in Issue #1005).
  • Sort/query DataServices according to formats/mediaTypes.

We need these values stored in the catalog instead of extracting them from external endpointDescriptions (which are based on different description languages) on the fly, each and every time a user needs to display or sort/query on formats/mediaTypes.
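As a rough illustration of why stored values help, the following Python sketch builds a lookup from media type to services once, instead of parsing each external endpointDescription per query. The service identifiers and the shape of the `SERVICES` record are assumptions made for the example, not taken from any real catalog.

```python
from collections import defaultdict

# Hypothetical catalog records: service id -> media types stored in the catalog.
SERVICES = {
    "ex:addresses-api": ["application/json", "application/xml"],
    "ex:news-feed": ["application/atom+xml"],
    "ex:company-api": ["application/json"],
}


def build_media_type_index(services):
    """Map each media type to the sorted list of services that declare it."""
    index = defaultdict(list)
    for service_id, media_types in services.items():
        for media_type in media_types:
            index[media_type].append(service_id)
    return {mt: sorted(ids) for mt, ids in index.items()}


INDEX = build_media_type_index(SERVICES)
print(INDEX["application/json"])
```

With such an index, filtering or sorting by media type is a dictionary lookup that does not depend on which description language each service uses.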

@andrea-perego added the label "feedback" (Issues stemming from external feedback to the WG) on Jun 20, 2021
@andrea-perego added this to "To do" in "DCAT Sprint: Data services" via automation on Jun 20, 2021
@andrea-perego added this to the "DCAT3 3PWD" milestone on Jun 20, 2021
@riccardoAlbertoni added this to "To do" in "DCAT Sprint: Feedback" via automation on Jun 21, 2021

@bertvannuffelen

@jimjyang In Flanders we have created a DCAT profile, https://data.vlaanderen.be/doc/applicatieprofiel/metadata-dcat, that covers open data, geospatial data, closed data and standalone services.

When discussing data services, we kept this aspect as part of the endpoint description, that is, outside the metadata collected in the DCAT catalogue. We observed that those technical details differ greatly depending on the ecosystem the data service was designed in. Moreover, many data services offer content negotiation, so the actual syntax (XML, JSON, etc.) is less important than knowing the business setting or the technical standards the service follows. For example, it is more important to know that a service is an OGC:WFS service than that it returns XML.
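For readers unfamiliar with content negotiation, here is a simplified Python sketch of how a service might pick a representation from an HTTP Accept header. The function names are hypothetical, and the matching deliberately ignores the specificity rules of the HTTP specification; it only orders media ranges by their q-value.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_range, q))
    # Stable sort keeps header order among equal q-values.
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def negotiate(accept_header, supported):
    """Return the first supported media type matching the client's preferences."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue
        for candidate in supported:
            main_type = candidate.split("/")[0]
            if media_range in ("*/*", candidate, main_type + "/*"):
                return candidate
    return None


SUPPORTED = ["application/xml", "application/json"]
print(negotiate("application/xml;q=0.5, application/json", SUPPORTED))
```

A single endpoint that negotiates like this serves several media types at once, which is why a repeated property (or none at all) is needed to describe it.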

Personally, when exploring a catalog of data services I am not interested in the actual syntax of the data that is returned. As a technical person I am more interested in what kind of data is provided, whether it can easily be linked with other services, whether the data contain shared identifiers, etc. The technical format only adds to the amount of work and the number of libraries that have to be supported. Note that even if two services share the same syntax, there is no guarantee that one can connect them. So I consider the format of a data service a low-value property.

Looking to the future, we thought of placing constraints on the endpoint description (e.g. SOAP WSDL descriptions, OpenAPI descriptions, ...) so that data service builders can use their native ecosystem documentation tools in such a way that generic metadata useful in the DCAT catalog context can be derived from it.
I personally do not think DCAT should replace the documentation practices in each ecosystem, but we should enable smart bridges between them.

@jimjyang
Author

jimjyang commented Jul 1, 2021

Thanks for sharing your experiences, @bertvannuffelen!

I am not technical, but our catalog users (application developers) report that they need to know the formats/mediaTypes of a DataService in order to decide whether or not to use it.

Our suggestion is not at all to let DCAT replace/standardize (the practices of) endpointDescriptions, but to make it possible to provide catalog users with the metadata they need without forcing them to leave the catalog and read external documentation.

@bertvannuffelen

@jimjyang,

The challenge I see with the proposal is that I am not sure it will address your catalog users' request. Are they really searching for data services that provide data as "application/atom+xml" but not as "application/xml"?

It seems natural to include more technical details of data services in a metadata description, but this has its drawbacks.
I am more inclined to train teams to improve the natural documentation of the services than to have a generic catalogue provide the means for that. In the end the developer has to read the technical documentation in order to work with the service. For me, format information is part of the technical documentation of a service. Training the maintainers/developers of services to build them according to common technical guidelines, including e.g. content negotiation with support for XML and JSON with a fixed media type, is probably more valuable and sustainable.

To illustrate the topic, see for instance the formats in https://data.europa.eu/data/datasets?query=&locale=en&minScoring=0
There are 4 equivalent formats for a user listed: Excel XLS, Excel XLSX, xlsx, .xlsx
Without additional constraints and agreements among the teams to support the same media types, your catalogue will show this kind of thing, and it then becomes less valuable. The next step you will consider as a catalogue is to map every IANA media type to one of the types supported in your portal. The consequence is that catalogue users will find that, although two services claim to use XML, they use different media types.
This experience has taught me that being able to record something does not automatically lead to usable data.
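The mapping step described above could look roughly like the following Python sketch. The alias table and the function name are assumptions for illustration; a real portal would need a far larger table and agreed rules for maintaining it.

```python
# Assumed alias table; a real portal would maintain a much larger one.
ALIASES = {
    "xls": "application/vnd.ms-excel",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def normalise_format(label):
    """Map a free-text format label to a media type, or None if unknown."""
    token = label.strip().lower().lstrip(".")
    if token.startswith("excel "):
        token = token[len("excel "):]
    return ALIASES.get(token)


labels = ["Excel XLS", "Excel XLSX", "xlsx", ".xlsx"]
print({label: normalise_format(label) for label in labels})
```

Even in this small case the four labels collapse into two media types rather than one, so whether they count as "equivalent for a user" is a decision the catalogue still has to make.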

Although I can understand your request, I am reluctant to include it as specific properties for data services. I would rather first apply it in a profile and see how it turns out before adopting it in DCAT at this level.

For me, this request is connected to the conformsTo discussion, which I think is the more valuable one, because the format is actually a consequence of that value. If the data service conforms to the Norwegian REST API guidelines, I as a developer know a lot more than the format: I know how it will handle errors, which headers are supported, etc. An example of such guidelines is here: https://www.gcloud.belgium.be/rest/

Illustrating the above with the first example:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

_:s1 a dcat:DataService ;
      dct:format "application/atom+xml" ;
      dct:format "application/xml" .

_:s2 a dcat:DataService ;
      dct:conformsTo "https://datatracker.ietf.org/doc/html/rfc4287" .

For me the second data service description is more informative than the first; the first reflects just one detail of the second. This is the core of my reluctance: with the format property proposal, the effort in the catalog metadata goes towards derived information rather than towards recording the source of the decision.
@jimjyang did you explore the dct:conformsTo possibility?
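The idea that the format is derived information can be sketched in Python as a lookup from the standard a service conforms to. The table `IMPLIED_MEDIA_TYPES` and the function name are assumptions for illustration; the only entry relies on the fact that RFC 4287 registers application/atom+xml.

```python
# Assumed lookup: standard IRI -> media types that standard defines or implies.
IMPLIED_MEDIA_TYPES = {
    "https://datatracker.ietf.org/doc/html/rfc4287": ["application/atom+xml"],
}


def derive_media_types(conforms_to):
    """Collect media types implied by the standards a service conforms to."""
    derived = []
    for standard in conforms_to:
        for media_type in IMPLIED_MEDIA_TYPES.get(standard, []):
            if media_type not in derived:
                derived.append(media_type)
    return derived


print(derive_media_types(["https://datatracker.ietf.org/doc/html/rfc4287"]))
```

The derivation only works for standards that appear in such a table, so a catalog taking this route would have to maintain it for every ecosystem it covers.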

Another source of reluctance is that we should not try to transfer all properties of the DCAT 1.0 Distribution to DCAT 2.0 DataServices. If we do, the distinction between the two entities becomes blurrier. The fact that a DataService has no format in core DCAT while a Distribution does may be a good aid to distinguishing them.

I would rather avoid lifting all the technical aspects of data services to the core DCAT and leave that to profiles.

@jimjyang
Author

jimjyang commented Jul 5, 2021

Thanks again for sharing your experiences/solutions, @bertvannuffelen!

We do share your concerns, and we are also restrictive about introducing (national) extensions. We also share your point about improving the natural documentation of DataServices, but that is another discussion (about the metadata quality, in general, of the resources that are listed in, and referred to from, a catalog).

> Are they really searching for data services that provide data as "application/atom+xml" but not as "application/xml"?

Not necessarily while searching, but at least when evaluating if a DataService that is to be found in the catalog, is reusable. A catalog (at least ours) supports its users in 1) searching for potentially reusable Datasets/Distributions/DataServices, 2) evaluating whether or not a particular Dataset/Distribution/DataService is reusable and 3) using a Dataset/Distribution/DataService that is evaluated as reusable. As we wrote under the Problem statement, "Which formats/mediaTypes a DataService may support is one of the considerations you need to take when deciding whether a DataService is (re)usable or not."

> did you explore the dct:conformsTo possibility?

As we understand it, dct:conformsTo does not have the same meaning as dct:format and dcat:mediaType.

> There are 4 equivalent formats for a user listed: Excel XLS, Excel XLSX, xlsx, .xlsx

Your example and arguments against having dct:format/dcat:mediaType in dcat:DataService may also apply to dcat:Distribution, which does have dct:format and dcat:mediaType in addition to dct:conformsTo. In the context of this discussion, the user need that we suggest covering in DCAT3 is to provide the user with the same set of metadata (dct:format, dcat:mediaType, dct:conformsTo) for a dcat:DataService as is already provided for a dcat:Distribution.

@bertvannuffelen

@jimjyang

> > Are they really searching for data services that provide data as "application/atom+xml" but not as "application/xml"?
>
> Not necessarily while searching, but at least when evaluating if a DataService that is to be found in the catalog, is reusable. A catalog (at least ours) supports its users in 1) searching for potentially reusable Datasets/Distributions/DataServices, 2) evaluating whether or not a particular Dataset/Distribution/DataService is reusable and 3) using a Dataset/Distribution/DataService that is evaluated as reusable. As we wrote under the Problem statement, "Which formats/mediaTypes a DataService may support is one of the considerations you need to take when deciding whether a DataService is (re)usable or not."

Although I personally would not collect this information for that purpose, I understand your request.
I would persuade the data service providers to improve the technical documentation that they refer to in the endpoint description.

> > did you explore the dct:conformsTo possibility?
>
> As we understand it, dct:conformsTo does not have the same meaning as dct:format and dcat:mediaType.

Indeed, but maybe it refers to more interesting information for deciding whether the service is reusable.

> > There are 4 equivalent formats for a user listed: Excel XLS, Excel XLSX, xlsx, .xlsx
>
> Your example and arguments against having dct:format/dcat:mediaType in dcat:DataService may also apply to dcat:Distribution, which does have dct:format and dcat:mediaType in addition to dct:conformsTo. In the context of this discussion, the user need that we suggest covering in DCAT3 is to provide the user with the same set of metadata (dct:format, dcat:mediaType, dct:conformsTo) for a dcat:DataService as is already provided for a dcat:Distribution.

Yes, I know. I gave this example to illustrate that the solution does not per se lie in adding new properties to DCAT, but often in the ability of the data providers to provide meaningful/useful data. Since that is already non-trivial for distributions (files), I expect it to be more difficult for APIs. Many services will offer/accept data in many different formats.

If we follow your suggestion, then we had best look at all related properties of a Distribution and discuss their semantics for a Data Service:

  • format
  • mediatype
  • compression format
  • packaging format

What would be good definitions for them? Are all of them meaningful? Would they make the distinction between Distribution and Data Service more vague or clearer?

@jimjyang
Author

jimjyang commented Aug 30, 2021

@bertvannuffelen Sorry for my late reply (because of the summer vacation).

> If we follow your suggestion, then we had best look at all related properties of a Distribution and discuss their semantics for a Data Service:
>
>   • format
>   • mediatype
>   • compression format
>   • packaging format
>
> What would be good definitions for them? Are all of them meaningful? Would they make the distinction between Distribution and Data Service more vague or clearer?

  • format: We use the following as the usage note (based on the similar one for Distribution): The data format of the data service. May be repeated for data services which provide data in several formats.
  • mediatype: We use the following as the usage note (based on the similar one for Distribution): The media type of the data service. May be repeated for data services which provide data in several media types.
  • compression format: We have not yet registered any need for this one.
  • packaging format: We have not yet registered any need for this one.

As far as we can see, including those properties does not in itself make the distinction between Distribution and Data Service either vaguer or clearer. The distinction should be made at the class level (what the classes should be used to represent).
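A minimal Python sketch of a catalog record that follows the two usage notes above (both properties repeatable) might look like this. The class name, the example IRIs and the Turtle output are illustrative assumptions and are not taken from DCAT-AP-NO; the `dcat:` and `dct:` prefixes are assumed to be declared elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class DataServiceRecord:
    """Hypothetical catalog record using the proposed repeatable properties."""
    iri: str
    formats: list = field(default_factory=list)      # values for dct:format
    media_types: list = field(default_factory=list)  # values for dcat:mediaType

    def to_turtle(self):
        """Serialise as one Turtle subject block; prefixes are assumed declared."""
        lines = [f"<{self.iri}> a dcat:DataService"]
        lines += [f"    dct:format <{f}>" for f in self.formats]
        lines += [f"    dcat:mediaType <{m}>" for m in self.media_types]
        return " ;\n".join(lines) + " ."


record = DataServiceRecord(
    iri="https://example.org/api/addresses",  # made-up service IRI
    formats=[
        "http://publications.europa.eu/resource/authority/file-type/JSON",
        "http://publications.europa.eu/resource/authority/file-type/XML",
    ],
    media_types=["https://www.iana.org/assignments/media-types/application/json"],
)
print(record.to_turtle())
```

Keeping the values as lists mirrors the "may be repeated" wording and lets a service that supports several formats be described in a single record.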

@davebrowning
Contributor

Project/Milestone modified.

Explanation: As DCAT v3 moves through review and hopefully ratification, we want to make sure that open issues and feedback that have yet to be completely addressed are properly recorded and tagged/assigned in GitHub, both to clarify their status and to help review and prioritise them as a source of improvements and new requirements in future DCAT versions.

@davebrowning added the label "future-work" (issue deferred to the next standardization round) on Feb 13, 2023
@davebrowning added this to "To analyse" in "DCAT: Potential new requirements" via automation on Feb 13, 2023

6 participants