
Generalize dcat:byteSize to dcat:size #313

Closed · agbeltran opened this issue Aug 9, 2018 · 22 comments

Labels: dcat · due for closing (issue that is going to be closed if there are no objections within 6 days) · statistics

Comments

@agbeltran
Member

At the moment, DCAT provides a property to indicate the size of a distribution in bytes (dcat:byteSize). We discussed that this should be generalized to dcat:size, with an additional indication of the unit of measurement. For the latter, we would consider an existing ontology (such as UO, QUDT, OM, etc.).

Related to #125.

As per the discussion in the meeting (https://www.w3.org/2018/07/19-dxwgdcat-minutes.html#x07) and the action (https://www.w3.org/2017/dxwg/track/actions/158).

@agbeltran added the dcat label Aug 9, 2018
@dr-shorthair
Contributor

dr-shorthair commented Aug 13, 2018

I agree that a discussion on this topic is merited, but I am not sure an additional property is warranted.

Having too many ways to do the same thing has a cost. While a more flexible property makes things easier for the data provider, it creates more work for the consumer. I believe dcat:byteSize is enough, though I suggest that its range should be xsd:positiveInteger (which is a valid OWL 2 datatype - see https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes); see also #125.
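
For illustration, a minimal sketch of the revised axioms (assuming the existing dcat:Distribution domain is kept and only the range changes; this is not the normative DCAT text):

    dcat:byteSize a owl:DatatypeProperty ;
        rdfs:domain dcat:Distribution ;
        rdfs:range xsd:positiveInteger .   # was xsd:decimal in DCAT 2014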

@agbeltran
Member Author

agbeltran commented Aug 14, 2018

In fact, DCAT 2014 originally had a property dcat:size that was deprecated:

dcat:size a rdf:Property;
    rdfs:isDefinedBy dcat:;
    rdfs:label "size (Deprecated)";
    rdfs:comment "the size of a distribution. This term has been deprecated";
    rdfs:domain dcat:Distribution;
    owl:deprecated true ;
    rdfs:subPropertyOf dct:extent .

I found some of the old discussions here:

https://lists.w3.org/Archives/Public/public-gld-wg/2012Oct/0117.html

@dr-shorthair
Contributor

dr-shorthair commented Aug 15, 2018

Right. The additional point in that post

stating that the value can be approximate

addresses the lurking issue ('what if I only want to indicate the size in round numbers?').
The usage note https://w3c.github.io/dxwg/dcat/#Property:distribution_size states

The size in bytes can be approximated when the precise size is not known.

Perhaps this could be clarified with an example or two?
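
For instance, an illustrative (non-normative) snippet, assuming the xsd:decimal range currently in the draft:

    <http://example.org/distribution/1> a dcat:Distribution ;
        # roughly 5 GB; the precise size is not known
        dcat:byteSize "5000000000"^^xsd:decimal .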

@agbeltran self-assigned this Aug 23, 2018
@makxdekkers
Contributor

@agbeltran It seems to me that dcat:size is fundamentally different from dcat:byteSize, because it would necessarily have a resource as its range -- as you point out, and as the example in the old discussion shows, it needs a link to a controlled vocabulary for its unit of measurement. As I understand it, dcat:byteSize was defined as a simpler way to express something that people saw as a main requirement at the time. So you can't generalise dcat:byteSize, but would need to define a parallel property for more general cases.
One process question: would we be able to 'un-deprecate' a property in the namespace, or would this need to get a new URI?
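
For illustration, the kind of structure a general dcat:size would seem to imply (a sketch only: dcat:size is shown as if it were un-deprecated, and ex:unit is a hypothetical placeholder for a unit-of-measure property from a vocabulary such as QUDT or OM):

    <http://example.org/distribution/1> a dcat:Distribution ;
        dcat:size [
            rdf:value "25"^^xsd:decimal ;
            ex:unit <http://example.org/unit/terabyte>   # hypothetical unit property and unit IRI
        ] .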

@agbeltran
Member Author

agbeltran commented Aug 30, 2018

Thanks @makxdekkers - according to the discussion, I think we would not un-deprecate the property, but keep dcat:byteSize while revising its axioms (see also #125 and #110), for example considering the change of range to xsd:positiveInteger as suggested by @dr-shorthair above. In terms of the point raised by this issue, we also need to confirm that keeping bytes as the unit provides enough flexibility to describe large distributions (and to decide whether to relax the domain).

@agbeltran
Member Author

Some examples of the use of dcat:byteSize can be found in this SPARQL endpoint with the query:

    SELECT * WHERE {
      ?s a dcat:Dataset .
      ?d a dcat:Distribution .
      ?s dcat:distribution ?d .
      ?d dcat:byteSize ?size .
      FILTER ( STRLEN(STR(?size)) > 10 )
    } LIMIT 100

@makxdekkers
Contributor

@agbeltran Are you now proposing to drop the idea of adding a more general 'size' property, and just revise the axiom (datatype) of dcat:byteSize?
As to changing the datatype from xsd:decimal to xsd:positiveInteger, I wonder if that would break implementations that currently specify a number with ^^xsd:decimal. For example, in https://www.govdata.de/ckan/catalog/catalog.rdf I see

    <dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">246629.0</dcat:byteSize>

I guess this would become non-conformant if the axiom were changed.
If it's just for elegance, does it make sense to force people to convert their existing data?

@agbeltran
Member Author

Yes, we opened the issue to investigate whether a dcat:size property plus a unit of measurement would provide more flexibility than dcat:byteSize, but we traced back the reasons why dcat:size was deprecated (e.g. it would usually require the use of a blank node; see the link above for more info). So, unless you (or others) think it is necessary, I don't think we need to un-deprecate dcat:size.

I wonder, though, whether the current representation is too cumbersome (or has limitations) for representing dataset distributions that are actually terabytes of data (e.g. multi-dimensional microscopy images can be several TB each, and datasets can be hundreds of TB in total).

@agbeltran
Member Author

About changing the datatype, I agree that we should be careful about current implementations. Maybe we can continue that discussion in the specific issue, #125.

@makxdekkers
Contributor

I just saw the proposal from @riccardoAlbertoni in today's call https://www.w3.org/2018/08/30-dxwgdcat-minutes#x10 to create a new class for size with a number and a unit of measurement. @agbeltran then said that the object would be assigned an IRI.
I think this is not realistic. Who would assign IRIs to "1024 bytes" and to any other number of bytes? In my mind, assigning IRIs to these kinds of things with low reusability does not make sense. Don't forget that to do this right, you would need to make an IRI like http://foo.bar/size/bytes/1024 resolvable. As I wrote in #300, minting a URI creates a maintenance commitment.
It's far more likely that it would be done as a blank node. This was precisely why the original dcat:size did not make it into DCAT 2014.
Also note that VoID took the simple approach, defining a set of properties for various measures: https://www.w3.org/TR/void/#statistics.
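
For comparison, the VoID pattern is one plain-literal property per measure (the values below are purely illustrative):

    :myDataset a void:Dataset ;
        void:triples 2300000 ;     # number of triples
        void:entities 140000 .     # number of entities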

@agbeltran
Member Author

Thanks @makxdekkers - I totally agree with your view; my comment on the call was pointing out that I don't think creating a size object is useful, as it would require assigning an IRI to such an object, which is not really reusable and bears the maintenance costs you referred to.

@riccardoAlbertoni
Contributor

@makxdekkers I am not sure about what is realistic and what is not. My comment in today's call was more a reaction to an emerging proposal to have distinct size properties for every possible unit of measure, which sounds to me like bad modelling, and dangerous from a longer-term perspective.

If the rationale behind this discussion is to make users more comfortable in expressing and reading the size, we have to consider that the names for multiples of bytes will evolve and that the scale to use might be application dependent: if we add a property terabyteSize, sooner or later we might need to add exabyteSize, etc.

I am not against the use of a blank node for an n-ary relation in this specific case, if there is such a dire need to express the size in different units of measure.

However, I tend to agree with you: if we do not want blank nodes, and no solutions other than adding new properties with hard-coded scales/sizes are on the table, we should replicate the simple approach from VoID, which probably amounts to living with byteSize.

@makxdekkers
Contributor

@riccardoAlbertoni The issue of granularity/scale -- whether the size is expressed in bytes, kilobytes, megabytes, etc. -- is really a case of trying to be helpful to people at the expense of efficiency of the data. Creating a complex mechanism with an additional class to reduce the number of digits, e.g. from "1000000000000" (bytes) to "1" (terabyte), will actually increase the number of bytes on the wire:
dcat:byteSize "1000000000000" is actually shorter than (inventing some properties) dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"].
The other thing is a potential requirement to express different types of sizes, e.g. the number of observations, the number of rows in a spreadsheet, the number of articles in a legal text, etc. If there is a small number of such types, the VoID approach makes sense. If there is a large number of types, a structured approach would be better, which is what Data Cube does with sdmx-attribute:unitMeasure.
In my mind, in DCAT we're just talking about byte size, so I don't see the need for a more complex approach.
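
Spelled out in Turtle for comparison (dcat:scaledSize, dcat:scale and dcat:number are invented above purely for the sake of argument and do not exist in DCAT):

    # simple literal approach
    :dist dcat:byteSize "1000000000000"^^xsd:decimal .

    # hypothetical structured approach: more triples and more bytes on the wire
    :dist dcat:scaledSize [ dcat:scale "TB" ; dcat:number "1" ] .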

@rob-metalinkage
Contributor

There seem to be four distinct issues with dcat:byteSize as the only option:

  1. very large numbers, which are not practically human-readable
  2. exactness semantics: is the value expected to be exact or approximate? what if the resource varies over time and the exact value cannot be predicted?
  3. the cost of computing the exact byte size
  4. differences in values under different encoding choices that may be negotiated for a distribution

What feels "reasonable" to me is to keep byteSize, with a tighter definition of its expected semantics, and to introduce a new term that takes a simple string literal using a microformat,

e.g. dcat:approxSize "23 MB"

Such microformats are extremely common, but I haven't had much luck tracking down a standard for the full format; there are, however, standards for the actual postfix part:

  • There is some confusion over whether K = 1000 or 1024, and there are ISO rules with explicit postfixes (e.g. KB and KiB) for these cases. IMHO this would not matter if approximation is the intended semantics, though one would still need to be careful about byte vs bit (KB vs Kb), which is effectively an order of magnitude.

Here are two major development platforms that explicitly support such formats; they do not cite standards conformance, but they do reference this issue of interpretation.

https://developer.android.com/reference/android/text/format/Formatter#formatFileSize(android.content.Context,%20long)

https://docs.microsoft.com/en-us/windows/desktop/api/shlwapi/nf-shlwapi-strformatbytesizeex
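
Sketched in Turtle, the proposal above would look something like this (dcat:approxSize is the hypothetical new term, not an existing DCAT property):

    :dist a dcat:Distribution ;
        dcat:byteSize "23456789"^^xsd:decimal ;   # exact, machine-oriented value
        dcat:approxSize "23 MB" .                 # approximate, human-oriented microformat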

@jakubklimek
Contributor

@makxdekkers @agbeltran As in #300, I have to say that I do not see the problem in creating an IRI such as https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size as an instance of, probably, https://schema.org/QuantitativeValue.

  1. If I manage https://mycatalog.com/resource/dataset/XXX/distribution/YYY, the additional cost of managing https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size still seems minimal to me - I probably use a generic method of ensuring that IRIs dereference, so it does not matter how many IRIs there are.
  2. As to the (re)usability of such IRIs: you never know when someone decides to monitor the size of a given distribution, and for them the IRI would make sense. Nevertheless, I admit that the reusability of such an IRI will definitely be lower than that of a dataset.
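
For example (a sketch only: the general size property is written here as dcat:size for the sake of argument, and schema:value / schema:unitText are one plausible way to fill in a schema:QuantitativeValue):

    <https://mycatalog.com/resource/dataset/XXX/distribution/YYY>
        dcat:size <https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size> .

    <https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size>
        a schema:QuantitativeValue ;
        schema:value 2.5 ;
        schema:unitText "TB" .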

@agreiner
Contributor

agreiner commented Sep 4, 2018

I have a strong preference for using actual values rather than URIs for things like numbers or timestamps. For programming and for human readability, looking up a URI for such a thing strikes me as far more complex than necessary, to the point of being somewhat comical.

@agreiner
Contributor

agreiner commented Sep 4, 2018

The examples of programmatic formatting of byte counts are actually the reverse of what I would call programmatic support for the suggested microformat: they take a long and turn it into a string with a convenient number and unit, whereas supporting the microformat would require a function that reads the microformat and returns the long. Still, I don't think it's too much to ask a programmer to write such a thing, if we can specify the microformat. I would not worry about KiB etc., as they can be converted to KB etc., and they are rarely used.

@nicholascar
Contributor

nicholascar commented Sep 4, 2018

Any reason not to relate dcat:byteSize to qudt:bytes (from http://qudt.org/2.0/schema/SCHEMA_QUDT-DATATYPES-v2.0.ttl - no domain, range is xsd:integer)? Then people can use a single, named property (simple case) but those wanting more detail can apply QUDT qualifiers like qudt:Mega or qudt:Mebi (http://qudt.org/2.0/vocab/VOCAB_QUDT-UNITS-BASE-v2.0.ttl) if desired.

So even in the simple, single-property case, where the property itself implies the unit, you relate to a comprehensive ontology that also covers the complex cases.

There is also qudt:bits.

I can't see anything in QUDT about approximate values, but perhaps there is.
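
For the simple case this might look like the following (a sketch based on the description above, with an arbitrary value; whether qudt:bytes would be used directly on a dcat:Distribution is an open question):

    :dist a dcat:Distribution ;
        qudt:bytes 246629 .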

@makxdekkers
Contributor

@nicholascar What would be the advantage of including a relationship between dcat:byteSize and qudt:bytes?
One reason not to rely on QUDT may be that it is developed by an organisation that does not seem to be a formal standards organisation. Their website does not say anything about their processes other than stating that the Board of Directors has the power of approval, and there is no visible community beyond that board.
Just as a minor comment, I browsed through the QUDT specification and could not find a definition of the semantic meaning of qudt:bytes: clicking on the link in http://www.qudt.org/doc/2017/DOC_SCHEMA-QUDT-DATATYPES-v2.0.html just tells you it is an owl:DatatypeProperty. Now it might be obvious -- "the number of bytes in the described resource" -- but I think it would be good practice to actually say that somewhere.

@davebrowning
Contributor

This is clearly an area that has the potential for revision as part of future work beyond DCAT 2. As well as dcat:byteSize, there is the adjacent area of statistics for datasets as a whole (#84), which could pick up other "dimensions" (such as the number of entities in some logical view of the dataset) beyond the size of the physical representation.

Tagging for future work and moving to the future milestone (alongside #84).

@davebrowning added the "future-work" label (issue deferred to the next standardization round) Sep 25, 2019
@andrea-perego added this to "To do" in DCAT revision via automation Sep 26, 2019
@andrea-perego
Contributor

There has been no further discussion on this issue since 2018, and DCAT 2 did not in the end include a dcat:size property.

@agbeltran , do you think we can close it?

@andrea-perego added the "due for closing" label (issue that is going to be closed if there are no objections within 6 days) Mar 13, 2021
@andrea-perego
Contributor

Noting no objections, I'm closing this issue.

DCAT revision automation moved this from To do to Done Mar 20, 2021
@andrea-perego removed the "future-work" label (issue deferred to the next standardization round) Mar 26, 2022