
Monthly DBpedia releases #1085

Open
kurzum opened this issue Sep 18, 2019 · 10 comments

Comments

kurzum commented Sep 18, 2019

DBpedia Releases

Status:
Identifier: https://databus.dbpedia.org/dbpeda/
Creator: Sebastian Hellmann

Description

We are releasing several thousand files per month now, and I have specific questions about dcat:Distribution.

In our case, we group each version according to the generating Scala code: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2019.09.01
In this example, the code is run each month over 40 different Wikipedia dumps and generates 40 different files according to their language variant. All of these files together make up the dataset, and each file is a partial distribution. See the metadata here:
https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset

I could not find an appropriate model in the current draft to describe this properly. It is more structured than the bag-of-files approach, as the data uses the Maven model with group/artifact/version and then content/format/compression variants.

Note that we consider language/different source a variant. All files make up the version snapshot dataset, while you would only need a subset of the files for any given use case. A similar example would be the split of files into consecutive compressed parts (e.g. 20 × 50 MB parts of 1 GB of data), with the difference that there you would need all of the files to get the distribution. How would this be modelled in the current draft?
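For illustration, the per-language files could be described as variant distributions of one version-snapshot dataset. This is a sketch only: the dataid: prefix and the contentVariant property are assumed from the DataID style, not defined by the current DCAT draft, and the identifiers are illustrative.

```turtle
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .

# One version snapshot; each language file is a partial distribution.
<#Dataset>
    a dcat:Dataset ;
    dct:hasVersion "2019.09.01" ;
    dcat:distribution <#objects_lang=en> , <#objects_lang=de> .

# dataid:contentVariant is a DataID-style variant tag (assumed name).
<#objects_lang=en>
    a dcat:Distribution ;
    dataid:contentVariant "en" ;
    dcat:downloadURL <https://example.org/objects_lang=en.ttl.bz2> .
```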

makxdekkers (Contributor) commented Sep 18, 2019

@kurzum
From a cursory look at the metadata at https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset, it looks to me that your dataid:SingleFile is very similar to dcat:Distribution.
If it is the case that all files contain the same data in different languages, they could be modelled as separate dcat:Distributions under one dcat:Dataset.
If they contain different data, they could be modelled as different dcat:Datasets.
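A minimal sketch of the two modelling options, with illustrative identifiers and the usual dcat:/dct: prefixes:

```turtle
# Option 1: all files carry the same data (e.g. different serialisations)
# -> one dcat:Dataset with several dcat:Distributions.
<#dataset> a dcat:Dataset ;
    dcat:distribution <#dataset.ttl> , <#dataset.nt> .

# Option 2: the files carry genuinely different data
# -> sibling dcat:Datasets, one per slice.
<#dataset-en> a dcat:Dataset ; dct:language <http://id.loc.gov/vocabulary/iso639-1/en> .
<#dataset-de> a dcat:Dataset ; dct:language <http://id.loc.gov/vocabulary/iso639-1/de> .
```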

kurzum (Author) commented Sep 18, 2019

The problem is the fine line between same and different. Here, the main thing all distributions have in common is that they were created by the same code in the same activity. Content-wise they are true variants of each other. All of them make up the dataset, but they are useful individually and in combination. dataid:SingleFile is already a subclass of dcat:Distribution, but perhaps we should switch to dataid:FileCollection to better model the semantics. DCAT 2 seems to be evolving in the direction of DataID (https://wiki.dbpedia.org/projects/dbpedia-dataid), which is quite good.

I am specifically asking here because we host the metadata of all 5k monthly files in a SPARQL endpoint: https://databus.dbpedia.org/yasgui/

Having an extra dataset node for each file would be infeasible and impractical. For us it would be helpful to have a better definition of variants, but we can also create one ourselves as an extension.

makxdekkers (Contributor) commented Sep 18, 2019

Indeed, there is a fine line between same and different. It may be useful to think from the perspective of a 'general' user, a person who is not aware of the way the data is produced and how it is structured. While I guess that your regular users know what to expect, an uninitiated user might rightly expect that a dataset has distributions that all contain the same data. In fact, the current DCAT draft says in section 6.7: "all distributions of one dataset should broadly contain the same data".
Maybe your use case could be added for consideration for the next version of DCAT?

kurzum (Author) commented Sep 18, 2019

> Maybe your use case could be added for consideration for the next version of DCAT?

Hm, I was under the impression that I already added it for DCAT 2.0 by posting it here. What else do I need to do?

> all distributions of one dataset should broadly contain the same data

Still true. They are variants of the same data; you can also fuse them consistently into one. The definition really depends on the part of the real world the data is supposed to describe, right? So in a person dataset, distributions could be partitioned alphabetically.
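The alphabetical partitioning could look like this (illustrative identifiers; note that nothing in DCAT 2 formally marks these distributions as a partition):

```turtle
<#persons> a dcat:Dataset ;
    dcat:distribution <#persons-a-m> , <#persons-n-z> .

<#persons-a-m> a dcat:Distribution ;
    dct:description "Persons with surnames A to M"@en .
<#persons-n-z> a dcat:Distribution ;
    dct:description "Persons with surnames N to Z"@en .
```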

General users of DBpedia: we tried to figure out what it means to be a general DBpedia user. Our conclusion is that we don't have those; they all want a different partition of the data, hence the popularity of the SPARQL endpoint. We separated the technical file layer from the ability to create collections (which are dcat:Catalogs).

makxdekkers (Contributor) commented Sep 18, 2019

@kurzum The content of DCAT 2 is frozen. We're about to transition to Candidate Recommendation. In this phase we can't include new use cases, so this could be on the list for DCAT 3.

As far as I can tell, 'partitioning of distributions' is not currently something that DCAT supports. The current note in section 6.7 uses the example of budget data for different years where you could imagine partitioning the distributions per year. However, the draft suggests that those 'would usually be' modelled as different datasets. So if a future version of DCAT wants to formalise the partitioning of Distributions, there is some modelling work to do.
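For the budget example, the draft's suggested reading would be separate datasets per year, e.g. distinguished by dct:temporal; there is no standard property tying the partitions together, which is exactly the modelling gap. A sketch with illustrative identifiers (dcat:startDate and dcat:endDate are DCAT 2 additions):

```turtle
<#budget-2018> a dcat:Dataset ;
    dct:temporal [ a dct:PeriodOfTime ;
                   dcat:startDate "2018-01-01"^^xsd:date ;
                   dcat:endDate   "2018-12-31"^^xsd:date ] .

<#budget-2019> a dcat:Dataset ;
    dct:temporal [ a dct:PeriodOfTime ;
                   dcat:startDate "2019-01-01"^^xsd:date ;
                   dcat:endDate   "2019-12-31"^^xsd:date ] .
```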

kurzum (Author) commented Sep 18, 2019

OK, this works for us in a sense. We consider this a contentVariant by time, whereas versions would be tied to updates of the dataset. There are good reasons to use distributions here, so this is a SHOULD in terms of the standard.

I guess DCAT 2 still doesn't tackle abstract dataset identity.

I will check 6.7 more closely. Does DCAT 2 use SHACL for anything?

kurzum (Author) commented Sep 18, 2019

Thanks for the good explanation.

dr-shorthair (Contributor) commented Sep 19, 2019

If each of the dumps is intended to be a representation of the same conceptual dataset, then even if the content is different (because they have different time-stamps) they can all still legitimately be considered 'distributions' of that dataset. The dcat:distribution relationship is mostly about intention.

But I see that your application has some axes of complexity. There are some relevant tools in DCAT2:

I suspect that these might provide a basis for describing your data. But it likely would be more reproducible if there were a couple more classes, something like:

  • dcat:DatasetSeries (another sub-class of dcat:Resource) - a sequence of datasets sharing most of the description but with just the temporal or spatial footprint differing (see #868 on the backlog).
  • dcat:DistributionPackage - a set of resources, which used together provide a representation of a Dataset - a richer version of bag-of-files
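Neither class exists in DCAT 2; the following only sketches how the proposal above might look, including a hypothetical linking property:

```turtle
dcat:DatasetSeries        rdfs:subClassOf dcat:Resource .
dcat:DistributionPackage  rdfs:subClassOf dcat:Distribution .

<#monthly-releases> a dcat:DatasetSeries .

<#release-2019.09.01> a dcat:Dataset ;
    dcat:inSeries <#monthly-releases> ;   # hypothetical linking property
    dcat:distribution <#package-2019.09.01> .

<#package-2019.09.01> a dcat:DistributionPackage .
```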
makxdekkers (Contributor) commented Sep 19, 2019

@dr-shorthair Yes, that's what I meant by "there is some modelling work to do".

andrea-perego added this to To do in DCAT revision via automation Sep 26, 2019
kurzum (Author) commented Oct 7, 2019

Here is the model which we will adopt:

  • Datasets for us are version snapshots with a list of files.
  • Datasets are not glued together with dcat:DatasetSeries, but with these properties, as they follow the Maven POM:
```turtle
<https://boa.lmcloud.vse.cz/databus/linked-hypernyms/2016.04.01/dataid.ttl#Dataset>
        a                       dataid:Dataset ;   # subclass of dcat:Dataset
        dataid:account          databus:propan ;
        dataid:group            <https://databus.dbpedia.org/propan/lhd> ;
        dataid:artifact         <https://databus.dbpedia.org/propan/lhd/linked-hypernyms> ;
        dataid:version          <https://databus.dbpedia.org/propan/lhd/linked-hypernyms/2016.04.01> ;
        dct:hasVersion          "2016.04.01" .
```

Ordering will be lexicographic over the version string, so it works with SPARQL ORDER BY.

Then we will create:

dataid:DatabusDistribution rdfs:subClassOf dcat:Distribution .

where:

  • each dataset will have several dataid:DatabusDistributions, which together form the distribution;
  • it has the same definition as the dcat:DistributionPackage proposed above, but limited to files, which have contentVariant tags and different format and compression variants.

This way datasets will be flat and can be aggregated and queried more easily.
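Put together, a single file under this model might be described roughly as follows. Property names such as dataid:contentVariant, dataid:formatExtension and dataid:compression are assumed from the DataID style for illustration, not normative:

```turtle
dataid:DatabusDistribution rdfs:subClassOf dcat:Distribution .

<#Dataset> dcat:distribution <#linked-hypernyms_lang=en> .

<#linked-hypernyms_lang=en>
    a dataid:DatabusDistribution ;
    dataid:contentVariant   "lang=en" ;   # assumed variant tag
    dataid:formatExtension  "ttl" ;       # assumed property names
    dataid:compression      "bz2" .
```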
