Question: how to catalog relational database data in DCAT? #1240
Comments
Thanks for bringing this use case to our attention, @dominik-s0 . I give below a preliminary answer, which may be complemented by other WG members. I think there are two different aspects here: one is how to use DCAT to document a database, the other is how to publish it. The latter is not strictly in scope for DCAT, but rather for existing guidelines and best practices, such as the Data on the Web BPs (https://www.w3.org/TR/dwbp/), which I think may provide some useful hints for your use case. About DCAT, the issue is the appropriate use of In terms of how to address this, a number of options are available, falling under the scope of data publication best practices. I outline some of them below, just as examples, also describing how to use DCAT:
BTW, regarding your requirement to describe how to give access to subsets of your database, possibly in a machine-actionable way: this issue is being discussed in DCAT in relation to data available from a service / API - i.e., point (3) above. Does this answer your question, @dominik-s0 ? |
Dear @andrea-perego, I think one of the main points I have is that the data I want to catalog is not "on the web", as in "available via a REST API" (which I think is also the focus of the discussion in #1230). So, it would be great to catalog these datasets sitting in a database using DCAT, and it would be great to have a dedicated way of cataloging such JDBC/ODBC data sources (and maybe having an example of this at https://www.w3.org/TR/vocab-dcat-2/#collection-of-examples), as I think this is a very common need in corporate enterprises. What would be the best approach here? Do you think it's feasible with the current DCAT vocabulary or would some kind of extension be needed for this? I'm happy to jump on a call to discuss this topic further and share the background of the request. |
@andrea-perego @riccardoAlbertoni what are your thoughts about my comments? What would be the best approach here? Is this something that can be taken up for the next DCAT release? |
@dominik-s0 , we discussed your use case during our last meeting. Our understanding is that your requirements fit better into a specific DCAT extension (see §14. DCAT Profiles) rather than the "core" DCAT vocabulary, whose scope is meant to address more general use cases, with a specific focus on Web-based data access. This may of course change in the future if additional use cases are contributed which demonstrate that such requirements are relevant across domains. However, for the moment, we do not plan to support these features in a new version of DCAT. Coming to the possible DCAT extension, we think that your use case is very much related to "distributions accessible via a service/API" (although your scenario does not concern Web services/APIs) - for some examples, see §5.9 A dataset available through a service. For the specification of the connection string / DSN, the relevant pattern could be:
a:Dataset a dcat:Dataset ;
dcat:distribution [ a dcat:Distribution ;
dcat:accessService [ a dcat:DataService ;
dcat:endpointDescription :url-pointing-to-a-machine-readable-description-of-the-endpoint
]
] .
So, you may consider defining specific properties and classes to point to a database, instead of a
a:Dataset a dcat:Dataset ;
dcat:distribution [ a dcat:Distribution ;
ex:accessDatabase [ a ex:Database ;
ex:connectionString "jdbc:oracle:thin:@hostname:1521:my-database"
]
] .
About how to specify access to a specific table in a database, as I said, a similar issue is currently being discussed, so you may be interested in following and contributing to the discussion in #1230 |
Thanks a lot for the reply on this and for the provided example @andrea-perego . I'll take a look at the other referenced examples and at the discussion in #1230. I'll close this issue for now and will follow-up if required. |
Hi @dominik-s0 , @andrea-perego We plan to use the dcat:DataService and some of its properties to represent such info. Use
|
Thanks for the reply on this @zeginis. I was under the impression that |
Note that there is no The value of the |
Ok this makes sense to me. In that case I think it's a feasible approach to make use of Maybe one could then even relate different |
@dominik-s0 that's the approach we plan to follow --> having one dcat:DataService per table. |
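The one-service-per-table idea could look roughly like the sketch below. It is an illustration only, not the markup posted in this thread: the `ex:` and `a:` namespaces, the `ex:Database` class and the `ex:connectionString`, `ex:tableName` and `ex:accessDatabase` properties are hypothetical extension terms (borrowed from the examples earlier in the discussion), and the host, database and table names are placeholders. Only `dcat:DataService`, `dcat:Dataset` and `dcat:servesDataset` come from DCAT 2 itself.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/ns#> .     # hypothetical extension vocabulary
@prefix a:    <http://example.org/data/> .   # hypothetical data namespace

a:dataset-001 a dcat:Dataset ;
    dct:title "Dataset about YYY"@en .

# One service description per table; the database keeps its own identifier,
# so tables with the same name in different databases stay distinguishable.
a:table1-service a dcat:DataService ;
    dct:title "Table1 in my-database"@en ;
    dcat:servesDataset a:dataset-001 ;
    ex:accessDatabase a:database-001 ;   # hypothetical property
    ex:tableName "Table1" .              # hypothetical property

a:database-001 a ex:Database ;           # hypothetical class
    ex:connectionString "jdbc:oracle:thin:@hostname:1521:my-database" .
```

Keeping the connection string on a separate database resource means it is stated once, however many tables are catalogued.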
Yes, I think that would still allow for a unique identifier on the database itself. Otherwise e.g. if you have a table named XYZ in different databases, there would be multiple |
@zeginis , @dominik-s0 , I would recommend against the use of About whether to use @zeginis , just a caveat about your example in #1240 (comment): |
We are thinking of using this approach
Or alternatively, can we define the Database to be the Distribution?
Then we can use:
|
Another serialization option, putting the extensions in the dcat:endpointDescription, which takes a value rdfs:Resource.
Should work as long as ex:accessDatabase subclasses from rdfs:Resource. Nothing is inconsistent with DCAT |
@smrgeoinfo thank you for this proposal. However, based on the previous comment by @andrea-perego I think there is a problem since in this case the Dataservice is bound to a specific table. We could add multiple |
Hi @zeginis - in order to describe data well enough to be able to query it via a service you need at least five different things:
DCAT provides for 1, and using dcterms:conformsTo to identify serviceType can at least identify any self-descriptive capabilities of the service - such as OAS or OGC GetCapabilities. Any more detail and you need to define your own profile of DCAT with additional metadata properties needed to describe the service. Some services can describe the data schema - e.g. WFS describeFeatureType - but AFAIK nothing in widespread use does a reasonable job for #4 and 5. The suggested best practice from the statistics community is the use of RDF-Datacube vocabulary to handle 4 and 5. There is a W3C/OGC Note describing a possible spatio-temporal profile of RDF-Datacube called QB4ST [https://www.w3.org/TR/qb4st/] which directly addresses this gap, but to date little effort has been put into semantic description of query interfaces or even data services. This requires testing in live context and I'd be very happy to assist you with the general challenge of creating expressive enough metadata using available standards. The one item I know needs to be addressed to achieve a complete solution is the bridge between RDF-Datacube, which allows for description of rdf:Property elements, and description of relational database (or JSON, XML or any other meta-model). This needs either:
Let's talk about how to achieve this and explore any support such as publishing formalised profiles of DCAT that can assist. |
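To illustrate the point above about using dcterms:conformsTo to identify the service type, here is a minimal sketch of a DCAT 2 service description. All IRIs under `example.org` are placeholders, including the one standing in for the standard the service implements. The sketch covers only what core DCAT offers; anything about data schema or query semantics would still need a profile, as discussed.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix a:    <http://example.org/data/> .   # hypothetical data namespace

a:service-001 a dcat:DataService ;
    dct:title "Service giving access to dataset 001"@en ;
    # Placeholder IRI for the standard (service type) the service conforms to
    dct:conformsTo <http://example.org/standards/service-type> ;
    dcat:endpointURL <http://example.org/service> ;
    # Placeholder for a self-description, e.g. an OAS document or a GetCapabilities response
    dcat:endpointDescription <http://example.org/service/description> ;
    dcat:servesDataset a:dataset-001 .

a:dataset-001 a dcat:Dataset ;
    dct:title "Dataset 001"@en .
```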
DCATv2 has slots for 1-5
No model for the value of I agree that the directions of some of the relationships between Dataset, Distribution and DataService might not be quite what you were expecting, but I think there are slots available. |
@dr-shorthair - the models for the value of conformsTo are not specified either - only an abstract type (dcterms:Standard) so for cases 2,3,4,5 the situation is exactly the same, as out of scope for DCAT itself. DCAT makes the relationships (available slots) canonical for cases 1-3 - which is still a good starting point, but work will need to be done in another place to realize the use case, and if this is to be interoperable then a DCAT profile is recommended. |
I would expect that, when connecting via JDBC, a query is specified somewhere. The need to specify the query to apply to the endpoint specifically relates to issue #1230. Though, here we are not dealing with a Web API / Web endpoint. I wonder if defining a way to express the query extracting the dataset from the endpoint could provide a further piece on which DCAT and its extensions can interoperate... |
@riccardoAlbertoni - I thought long and hard about this in the context of WFS - and there seem to be a few possible patterns:
1 and 2 don't really work at scale. The underlying reality seems to be that ad-hoc APIs for slicing data proliferate because data providers don't really want to cope with exposing any possible query, and comprehensive documentation is too hard to write, find and read. From what I have seen so far I think dimensional characterisation using RDF-QB is the option for a canonical metadata model that carries the most semantic information and can be used to restrict queries, build queries and document data itself. It also provides an option for mapping API parameters to data structures. At this stage no other candidates have been suggested for the use cases of semantic description of data. |
We are also using option 4, "dimensional characterisation", with the RDF-QB vocabulary to describe the structure of the dataset (dimensions, measures, ranges). Regarding the relation of the dataset with a database/table, we (+ @rapw3k) will finally use the following approach:
Where |
@zeginis , making a database a subclass of Formally speaking, a database is a service from which a distribution is available. The approach you describe conflicts with the pattern defined in DCAT where, in such cases, the service should be pointed to from a distribution, and not described itself as a distribution. This also leads to interoperability issues. You may consider revising your approach as follows:
a:dataset-001 a dcat:Dataset ;
dct:title "Dataset about YYY"@en ;
dcat:distribution [ a dcat:Distribution ;
dct:title "..."@en ;
ex:accessDatabase a:database-001 ] .
a:dataset-002 a dcat:Dataset ;
dct:title "Dataset about ZZZ"@en ;
dcat:distribution [ a dcat:Distribution ;
dct:title "..."@en ;
ex:accessDatabase a:database-001 ] .
a:database-001 a ex:Database ;
ex:connectionString "jdbc:oracle:thin:@hostname:1521:my-database";
ex:accessTable [ a ex:Table ;
ex:tableName "Table1";
dct:subject a:dataset-001] ;
ex:accessTable [ a ex:Table ;
ex:tableName "Table2";
dct:subject a:dataset-002] . |
@dominik-s0 , @zeginis , do you have any further point you would like to discuss? Otherwise, we are going to close this issue. |
@andrea-perego Thx for following up on that, fine for me to close. |
Dear DCAT team,
I have a question regarding the correct use of DCAT to catalog data sitting in relational databases such as Oracle/MySQL/Postgres and data lake engines such as Apache Hive.
As I see it, I'd model either a database table or an entire database as a distribution of a dataset. This could apply to both openly accessible and private databases.
For referring to an entire database, I could probably just give the JDBC string as the "accessURL", e.g. jdbc:oracle:thin:@hostname:1521:my-database or jdbc:hive2://hostname:8443/my-database (or would that even be allowed as a "URL"?)
However, if I want to refer to a single table, things get more complicated. With JDBC strings I cannot refer to individual tables, so one option could be to use a combination of "accessURL" for the database and a reference to the table name in the "title". However, from my understanding of the dataset distribution attributes, "title" refers to a human-readable name of the distribution, which might differ from the technical table name. Of course, it could be an option to artificially append the table name to the "accessURL" or put the table name in the description, but neither would be machine-readable.
In order to apply DCAT correctly, what would be your proposal?
Thanks and kind regards,
Dominik