Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: how to catalog relational database data in DCAT? #1240

Closed
dominik-s0 opened this issue Jun 14, 2020 · 24 comments
Closed

Question: how to catalog relational database data in DCAT? #1240

dominik-s0 opened this issue Jun 14, 2020 · 24 comments
Labels
dcat:DataService dcat due for closing Issue that is going to be closed if there are no objection within 6 days feedback Issues stemming from external feedback to the WG
Milestone

Comments

@dominik-s0
Copy link

dominik-s0 commented Jun 14, 2020

Dear DCAT team,
I have a question regarding the correct use of DCAT to catalog data sitting in relational databases such as Oracle/MySQL/Postgres and data lake engines such as Apache Hive.
In my thoughts, I'd either model a database table or an entire database as a distribution of a dataset. This could apply to open accessible and private databases.

For referring to an entire database, I probably could just refer the JDBC string as "accessURL", e.g. jdbc:oracle:thin:@​hostname:1521:my-database or jdbc:hive2://hostname:8443/my-database (or would that even be allowed as an "URL"?)

However, if I want to refer to a single table, things get more complicated. With JDBC strings I can not refer to individual tables, so can option could be to use combination of "accessURL" for the database and reference the table name in the "title". However, from my understanding of the dataset distribution attributes, "title" refers to a speaking name of the distribution, which might be different to the technical table name. Of course it could be an option to artificially attach the table name in the "accessURL" or put the table name in the description, but neither of it would be machine-readable

In order to apply DCAT correctly, what would be your proposal?

Thanks and kind regards,
Dominik

@andrea-perego
Copy link
Contributor

andrea-perego commented Jun 14, 2020

Thanks for bringing to our attention this use case, @dominik-s0 . I give below a preliminary answer, which may be complemented by other WG members.

I think there are two different aspects here: one is how to use DCAT to document a database, the other is how to publish it. The latter is not strictly in scope with DCAT, but rather with existing guidelines and best practices - as the Data on the Web BPs (https://www.w3.org/TR/dwbp/), which I think may provide some useful hints for your use case.

About DCAT, the issue is the appropriate use of dcat:accessURL. Probably this is not made enough explicit in DCAT, but dcat:accessURL is supposed to point to a URL which can be followed to land in a "place" whence users can access the data. Typically, this means HTTP URLs, but other protocols are not excluded a priori, provided that they can be used as actionable links landing somewhere - e.g., the URL of an FTP folder.

In terms of how to address this, a number of options are available, following under the scope of data publication best practices. I outline below some of them, just as an example, describing also how to use DCAT:

  1. Irrespective of whether the database is publicly accessible or not, the access URL can point to a Web page providing instructions on how to access the database. This may include instructions to access the whole database or also single tables (e.g., by supplying the relevant queries).
  2. Additionally, a dump of the database (and/or specific tables) could be made available for download. The link to the database dump (or its subsets) can be included in the Web page above, and also specified with property dcat:downloadURL (https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_download_url). On this option, see DWBP17 (https://www.w3.org/TR/dwbp/#BulkAccess)
  3. Finally, the database (and/or specific tables) can be made accessible via a (Web) API, which can be described as a dcat:DataService (https://www.w3.org/TR/vocab-dcat-2/#Class:Data_Service), and linked from the distribution with property dcat:accessService (https://www.w3.org/TR/vocab-dcat-2/#Property:distribution_access_service). On this option, see DWBP23 (https://www.w3.org/TR/dwbp/#useanAPI)

BTW, about your requirements of describing how to give access to subsets of your database, possibly in a machine-actionable way, this issue is being discussed in DCAT in relation to data available from a service / API - i.e., point (3) above.

Does this answer your question, @dominik-s0 ?

@dominik-s0
Copy link
Author

dominik-s0 commented Jun 22, 2020

Dear @andrea-perego,
thanks a lot for the reply on this!

I think one of the main points I have is that the data I want to catalog is not "on the web", as in "available via a REST API" (which I think is the focus of discussion also in #1230 )
Rather, data could be in an Oracle/Postgres/... database or a Hive big-data service reachable via a JDBC interface, which I think is the case in most enterprises for a substantial amount of datasets.
For this data, it's not feasible to create a REST API/GraphQL/... wrapper around each JDBC interface, as the JDBC/ODBC interface itself already provides a ready-to-use interface for interoperability between a variety of systems.
At the same time, I'd like to have the reference to the data machine-readable, e.g. only putting the connection information on a web page that is referenced in dcat:accessURL would not be sufficient from my perspective.

So, it would be great to catalog these datasets sitting in a database using DCAT and it would be great to have a dedicated way of cataloging such JDBC/ODBC data sources (and maybe having an example of this at https://www.w3.org/TR/vocab-dcat-2/#collection-of-examples), as I think this is a very common need in corporate enterprises.

What would be the best approach here? Do you think it's feasible with the current DCAT vocabulary or would some kind of extension be needed for this?

I'm happy to jump on a call to discuss this topic further and share the background of the request.

@dominik-s0
Copy link
Author

@andrea-perego @riccardoAlbertoni what are your thoughts about my comments? What would be the best approach here? Is this something that can be taken up for the next DCAT release?

@andrea-perego
Copy link
Contributor

@dominik-s0 , we have discussed your use case during our last meeting. Our understanding is that your requirements fit better into a specific DCAT extension (see §14. DCAT Profiles) rather than the "core" DCAT vocabulary, whose scope is meant to address more general use cases, and with a specific focus on Web-based data access. This may of course change in the future if additional use cases will be contributed which demonstrate such requirements being relevant across domains. However, for the moment, we do not plan to support these features in a new version of DCAT.

Coming to the possible DCAT extension, we think that your use case is very much related to "distributions accessible via a service/API" (although your scenario does not concern Web services/APIs) - for some examples, see §5.9 A dataset available through a service. For the specification of the connection string / DSN, the relevant pattern could be:

a:Dataset a dcat:Dataset ; 
  dcat:distribution [ a dcat:Distribution ;
    dcat:accessService [ a dcat:DataService ;
      dcat:endpointDescription :url-pointing-to-a-machine-readable-description-of-the-endpoint 
  ]
] .

So, you may consider defining specific properties and classes to point to a database, instead of a dcat:DataService. E.g., it could be something along these lines:

a:Dataset a dcat:Dataset ; 
  dcat:distribution [ a dcat:Distribution ;
    ex:accessDatabase [ a ex:Database ;
      ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database"
  ]
] .

About how to specify access to specific table in a database, as I said a similar issue is currently being discussed, so you may be interested in following and contributing to the discussion in #1230

@dominik-s0
Copy link
Author

Thanks a lot for the reply on this and for the provided example @andrea-perego . I'll take a look at the other referenced examples and at the discussion in #1230. I'll close this issue for now and will follow-up if required.

@zeginis
Copy link

zeginis commented Jul 3, 2020

Hi @dominik-s0 , @andrea-perego
I work for the EU project CYBELE (https://www.cybele-project.eu/). At the prohect we are using DCAT to create metadata for the data that will be inserted to the platform we build. The data will be stored at a relational database while the metadata are stored at Virtuoso triple store. We are using these metadata in order to search for data (e.g. based on the geographical coverage) but we need also some info to link the metadata with the database and the actual table where the data are stored. The aim is to be able to query the actual data by using the information gathered from the metadata.

We plan to use the dcat:DataService and some of its properties to represent such info. Use dcat:endpointURL for the jdbc connection and dct:identifier for the table name.

a:Dataset a dcat:Dataset ; 
  dcat:distribution [ a dcat:Distribution ;
    dcat:accessService [ a dcat:Service ;
      dcat:endpointURL "jdbc:oracle:thin:@​hostname:1521:my-database";
      dct:identifier  THE_TABLE_NAME
  ]
] .

@dominik-s0
Copy link
Author

Thanks for the reply on this @zeginis. I was under the impression that dct:identifier in this case would rather serve as a unique identifier (e.g. UUID/GUID) to this specific dcat:DataService and not to the referenced dataset, e.g. the table in your case. From my understanding how to reference the table itself is kind-of the discussion in #1230. But I may be wrong.
@andrea-perego can you indicate if the approach from @zeginis is also feasible?

@dominik-s0 dominik-s0 reopened this Jul 3, 2020
@dr-shorthair
Copy link
Contributor

Note that there is no dcat:Service class - it is dcat:DataService.

The value of the dct:identifier would usually be a system identifier for the resource.
If the service is just for a single table, then indeed it could make sense to record that as the value of the dct:identifier.
But if the service is for the whole database then it doesn't really make sense to have the identifier for just one table there.

@dominik-s0
Copy link
Author

Ok this makes sense to me. In that case I think it's a feasible approach to make use of dcat:DataService with dcat:endpointURL pointing to the JDBC/ODBC string and dct:identifier to the table name and then having one dcat:DataService per table.

Maybe one could then even relate different dcat:DataService together, e.g. having one dcat:DataService as the service for the database with dct:identifier set as UUID of the database and another dcat:DataService as the service of the individual table with dct:identifier set as table name, linked together with the dcat:Relationship reference?

@zeginis
Copy link

zeginis commented Jul 3, 2020

@dominik-s0 that's the approach we plan to follow --> having one dcat:DataService per table.
Your proposal to relate different dcat:DataService together e.g. using the dcat:Relationship or the dct:isPartOf is also interesting!

@dominik-s0
Copy link
Author

Yes, I think that could allow to still have a unique identifier on the database itself. Otherwise e.g. if you have a table named XYZ in different databases, there would be multiple dcat:DataService with the same XYZ identifier and there would be no way to distinguish them except to do some regex voodoo on the dcat:endpointURL :)

@andrea-perego
Copy link
Contributor

@zeginis , @dominik-s0 , I would recommend against the use of dct:identifier for this purpose. As @dr-shorthair noted, it can work if you are going to have a dcat:DataService per table. However, one could argue that the table is the data available via the service, not the service itself (which is what dct:identifier is meant to be about), and this may lead to misunderstanding. To avoid any ambiguity, you'd better define a specific property for this - e.g., ex:tableName - whose value could be used, e.g., in an SQL query "template".

About whether to use dcat:endpointURL or dcat:endpointDescription for the connection string / DSN, the latter looks more fit, based on its purpose. But, again, you should consider defining a specific extension, as discussed in #1240 (comment). I understand that this may hinder interoperability, especially if you expect to have your dataset records harvested by other catalogues. However, in that case, your dcat:endpointURL / dcat:endpointDescription is not going to be actionable, unless is used in a specific environment with a given configuration (e.g., an environment where a client triggers a JDBC connection when a JDBC connection string is detected).

@zeginis , just a caveat about your example in #1240 (comment): dcat:endpointURL and dcat:endpointDescription are both object properties, so they are not taking a literal as value.

@zeginis
Copy link

zeginis commented Jul 9, 2020

We are thinkig of using this approach

a:Dataset a dcat:Dataset ; 
  dcat:distribution [ a dcat:Distribution ;
    ex:accessDatabase [ a ex:Database ;
      ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database";
      ex:tableName "THE_TABLE_NAME"
  ]
] .

Or Alternatively can we define the Database to be the Distribution?
In this case define

  • ex:Database rdfs:subClassOf dcat:Distribution and
  • ex:hasDatabase rdfs:subPropertyOf dcat:distribution

Then we can use:

a:Dataset a dcat:Dataset ; 
 ex:hasDatabase [ a ex:Database ;
      ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database";
      ex:tableName "THE_TABLE_NAME"
  ]
] .

@smrgeoinfo
Copy link
Contributor

Another serialization option, putting the extensions in the dcat:endpointDescription, which takes a value rdfs:resource.

a:dataset-004 a dcat:Dataset ; 
  dct:title "Example Relational Database"@en ;
  dct:description "individual tables and views are accessible as separate distributions via jdbc connections" @en ;
  dcat:distribution [ a dcat:Distribution ;
    dct:title "access table 005: THE_TABLE_NAME";
    dcat:accessService a:table-service-005 
  ] .

a:table-service-005
  rdf:type dcat:DataService ;
  dct:title "access table 005: THE_TABLE_NAME";
  dct:conformsTo <https://www.cybele-project.eu/relationaltableprofile> ;
  dct:type <(https://www.cybele-project.eu/table/view> ;
  dcat:endpointDescription [
	  ex:accessDatabase [ a ex:Database ;
		  ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database";
		  ex:tableName "THE_TABLE_NAME"
		]
  ]	;
  dcat:servesDataset a:dataset-004 .

Should work as long as ex:accessDatabase subclasses from rdfs:Resource. Nothing is inconsistent with DCAT

@zeginis
Copy link

zeginis commented Jul 10, 2020

@smrgeoinfo thank you for this proposal. However, based on the previous comment by @andrea-perego I think there is a problem since in this case the Dataservice is bound to a specific table. We could add multiple ex:tableName "THE_TABLE_NAME_XXX" but in this case we cannot the map dcat:Dataset to the table name that is what we need to build some queries.

@rob-metalinkage
Copy link
Contributor

Hi @zeginis -
(I am also working for the OGC in the dissemination workpackage for CYBELE, so availalable to help with any promotion (and testing ;-) ) of candidate best practices.

in order to describe data well enough to be able to query it via a service you need at least five different things:

  1. location of a service endpoint
  2. description of the service API - whether it be SPARQL, OGC WFS, OGCAPI, NGSI-LD or any other thing
  3. description of the data schema (what elements will be returned for a query - aka FeatureType in an OGC services
  4. description of queryable aspects of data - either data dimensions or parameters of query functions cover most if not all cases here
  5. knowledge of the allowable range of queryable aspects (spatio-temporal extent, codelists etc)

DCAT provides for 1, and using dcterms:conformsTo to identify serviceType can at least identify any self-descriptive capabilities of the service - such as OAS or OGC GetCapabilities. Any more detail and you need to define your own profile of DCAT with additional metadata properties needed to describe the service.

Some services can describe the data schema - e.g. WFS describeFeatureType - but AFAIK nothing in widespread use does a reasonable job for #4 and 5.

The suggested best practice from the statistics community id the use of RDF-Datacube vocabulary to handle 4 and 5. There is a W3C/OGC Note describing a possible spatio-temporal profile of RDF-Datacube called QB4ST [https://www.w3.org/TR/qb4st/] which directly addresses this gap, but to date little effort has been put into semantic description of query interfaces or even data services. This requires testing in live context and I'd be very happy to assist you with the general challenge of creating expressive enough metadata using available standards.

The one item I know needs to be addressed to achieve a complete solution is the bridge between RDF-Datacube which allows for description of rdf:Property elements, and description of relational database (or JSON, XML or any other meta-model). This needs either:

  • specialised properties for different forms of property reference such as something like qbex:xpath (for the case of XPath elements identifiers)
    or
  • a qualified association to an property description object, where the property description object can declare what type of description it is, and this can be extended with any form of element identifier we might need in future.

Lets talk about how to achieve this and explore any support such as publishig formalised profiles of DCAT that can assist.

@dr-shorthair
Copy link
Contributor

DCATv2 has slots for 1-5

  1. the dcat:landingPage (human readable) and dcat:endpointURL (API) are separate if needed
  2. dcterms:conformsTo shoudl indicate the API standard
  3. the data schema and format/mediaType are part of the description of the payload (i.e. the dcat:Distribution, using dcterms:conformsTo and dcterms:format or dcat:mediaType)
  4. the query model and details are expected to be provided as part of the dcat:endpointDescription
  5. ... which might also indicate ranges, etc

No model for the value of endpointDescription was provided, since it was expected that this would be specified by the API standard, so 5. might not be covered but this is out of scope for DCAT itself.

I agree that the directions of some of the relationships between Dataset, Distribution and DataService might not be quite what you were expecting, but I think there are slots available.

@rob-metalinkage
Copy link
Contributor

@dr-shorthair - the models for the value of conformsTo are not specified either - only an abstract type (dcterms:Standard) so for cases 2,3,4,5 the situtation is exactly the same, as out of scope for DCAT itself.

DCAT makes the relationships (available slots) canonical for cases 1- 3 - which is still a good starting point, but work will need to be done in another place to realize the use case, and if this is to re interoperable then a DCAT profile is recommended.

@riccardoAlbertoni
Copy link
Contributor

I would expect that connecting via JDBC, a query is specified somewhere.
I am not sure where we should include this query. It certainly depends on the kind of endpoint, and other attributes might be required to make it actionable (e.g., is the query an update or a select? I guess SQL 'select' are the most relevant here)

The need to specify the query to apply to the endpoint specifically relates to issue #1230. Though, here we are not dealing with WebAPI/WEB endpoint. I wonder if defining a way to express the query extracting the dataset form the endpoint could provide a further piece on which the DCAT and extensions can interoperate...

@rob-metalinkage
Copy link
Contributor

@riccardoAlbertoni - I thought long and hard about this in the context of WFS - and there seem to be a few possible patterns:

  1. "traditional GIS" - download all the data (empty filter :-) ) then inspect it to work out what queries might make sense
  2. "poke it with a stick" - keep trying possible queries until sensible data is returned (seems to work only the the most trivial cases where users already know the data
  3. "query templates" - in a catalog or somewhere have an artefact which is a query template with possible parameters
  4. "dimensional characterisation" - indicate which properties of a feature type are mapped to data dimensions, and the range of those dimensions - so it is possible to make a query as a "slice" against well documented dimensions.
  5. "parameterised APIs" - a set of parameters against what is essentially some form of "stored query" - in which case parameters need to be mapped to the data model to explain the query
  6. something else

1 and 2 dont really work at scale
3 requires a query templating language with ability to describe parameters
4 can be implemented by a profile of DCAT supporting RDF-QB
5 requires better semantically self-describing APIs and a canonical language to map parameter name and range against data model and data ranges.
6 ???

The underlying reality seems to be that ad-hoc APIs for slicing data proliferate because data providers dont really want to cope with exposing any possible query and comprehensive documentation is too hard to write, find and read.

From what I have seen so far I think dimensional characterisation using RDF-QB is the option for a canonical metadata model that carries the most semantic information and can be used to restrict queries, build queries and document data itself. It also provides an option for mapping API parameters to data structures. At this stage no other candidates have been suggested for the use cases of semantic description of data.

@zeginis
Copy link

zeginis commented Jul 14, 2020

We are also using the 4 "dimensional characterisation" using the RDF-QB vocabulary to describe the structure of the dataset (dimensions, measures, ranges)

Regarding the relation of the dataset with a database/table we (+ @rapw3k) will finally use the following approach:

a:dataset-001 a dcat:Dataset ;
  dct:title "Dataset about YYY"@en ;
  dcat:distribution a:database-001.

a:dataset-002 a dcat:Dataset ;
  dct:title "Dataset about ZZZ"@en ;
  dcat:distribution a:database-001.

a:database-001 a ex:Database ;
                  ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database";
                  ex:accessTable [ a ex:Table ;
                                  ex:tableName "Table1";
                                  dct:subject a:dataset-001] ; 
                 ex:accessTable [ a ex:Table ;
                                  ex:tableName "Table2"
                                  dct:subject a:dataset-002] .

Where ex:Database subClassOf dcat:Distribution.
So we will model the Database as the distribution . The database has multiple tables so in order to map the table to a dataset we use the dct:subject. Note: each dataset is stored at a separate DB table.

@andrea-perego
Copy link
Contributor

@zeginis , making a database a subclass of dcat:Distribution have some issues.

Formally speaking, a database is a service from which a distribution is available. The approach you describe conflicts with the pattern defined in DCAT where, in such cases, the service should be pointed to from a distribution, and not described itself as a distribution. This also leads to interoperability issues.

You may consider revising your approach as follows:

a:dataset-001 a dcat:Dataset ;
  dct:title "Dataset about YYY"@en ;
  dcat:distribution [ a dcat:Distribution ;
    dct:title "..."@en ;
    ex:accessDatabase a:database-001 ] .

a:dataset-002 a dcat:Dataset ;
  dct:title "Dataset about ZZZ"@en ;
  dcat:distribution [ a dcat:Distribution ;
    dct:title "..."@en ;
    ex:accessDatabase a:database-001 ] .

a:database-001 a ex:Database ;
                  ex:connectionString "jdbc:oracle:thin:@​hostname:1521:my-database";
                  ex:accessTable [ a ex:Table ;
                                  ex:tableName "Table1";
                                  dct:subject a:dataset-001] ; 
                 ex:accessTable [ a ex:Table ;
                                  ex:tableName "Table2"
                                  dct:subject a:dataset-002] .

@andrea-perego
Copy link
Contributor

@dominik-s0 , @zeginis , do you have any further point you would like to discuss? Otherwise, we are going to close this issue.

@andrea-perego andrea-perego added the due for closing Issue that is going to be closed if there are no objection within 6 days label Mar 11, 2021
@dominik-s0
Copy link
Author

@andrea-perego Thx for following up on that, fine for me to close.

DCAT revision automation moved this from To do to Done Mar 11, 2021
DCAT Sprint: Data services automation moved this from In progress to Done Mar 11, 2021
DCAT Sprint: Feedback automation moved this from In progress to Done Mar 11, 2021
@andrea-perego andrea-perego moved this from In progress to Done in DCAT Sprint: Subsetting / composition Mar 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dcat:DataService dcat due for closing Issue that is going to be closed if there are no objection within 6 days feedback Issues stemming from external feedback to the WG
Development

No branches or pull requests

7 participants