Generalize dcat:byteSize to dcat:size #313

agbeltran · 2018-08-09T06:56:51Z

At the moment, DCAT provides a property to indicate the size of a distribution in bytes (dcat:byteSize). We discussed that this should be generalized to dcat:size with an additional indication of the unit of measurement. For the latter, we would consider an existing ontology (such as UO, QUDT, OM etc).

Related to #125

As per discussions in meeting (https://www.w3.org/2018/07/19-dxwgdcat-minutes.html#x07) and action (https://www.w3.org/2017/dxwg/track/actions/158).

dr-shorthair · 2018-08-13T00:53:45Z

I agree that a discussion on this topic is merited. But I am not sure an additional property is warranted.

Having too many ways to do the same thing has a cost. While a more flexible property makes things easier for the data provider, it creates more work for the consumer. I believe dcat:byteSize is enough, though I suggest that its range should be xsd:positiveInteger (which is a valid OWL 2 datatype; see https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes ). See also #125.

agbeltran · 2018-08-14T18:59:50Z

In fact, DCAT 2014 originally had a property dcat:size, which was deprecated:

dcat:size a rdf:Property;
    rdfs:isDefinedBy dcat:;
    rdfs:label "size (Deprecated)";
    rdfs:comment "the size of a distribution. This term has been deprecated";
    rdfs:domain dcat:Distribution;
    owl:deprecated true ;
    rdfs:subPropertyOf dct:extent .

I found some of the old discussions here:

https://lists.w3.org/Archives/Public/public-gld-wg/2012Oct/0117.html

dr-shorthair · 2018-08-15T00:39:39Z

Right. The additional point in that post, "stating that the value can be approximate", addresses the lurking issue ('what if I only want to indicate the size in round numbers?').
The usage note https://w3c.github.io/dxwg/dcat/#Property:distribution_size states:

"The size in bytes can be approximated when the precise size is not known."

Perhaps this could be clarified with an example or two?

makxdekkers · 2018-08-24T07:57:25Z

@agbeltran It seems to me that dcat:size is fundamentally different from dcat:byteSize, because it would necessarily have a resource as its range -- as you point out, and as the example in the old discussion shows, it needs a link to a controlled vocabulary for its unit of measurement. As I understand it, dcat:byteSize was defined as a simpler way to express something that people saw as a main requirement at the time. So you can't generalise dcat:byteSize, but would need to define a parallel property for more general cases.
One process question: would we be able to 'un-deprecate' a property in the namespace, or would this need to get a new URI?

agbeltran · 2018-08-30T05:28:25Z

Thanks @makxdekkers - according to the discussion, I think we would not un-deprecate the property, but keep dcat:byteSize while revising its axioms (see also #125 #110), for example by changing the range to xsd:positiveInteger as suggested by @dr-shorthair above. As for the point raised by this issue, we also need to confirm that keeping byte as the unit provides enough flexibility to describe large distributions (and we need to decide whether to relax the domain).

agbeltran · 2018-08-30T05:56:17Z

Some examples of the use of dcat:byteSize can be found in this SPARQL endpoint with the query:

select * where { ?s a dcat:Dataset. ?d a dcat:Distribution. ?s dcat:distribution ?d. ?d dcat:byteSize ?size. FILTER ( STRLEN(?size) > 10) } LIMIT 100

makxdekkers · 2018-08-30T06:48:49Z

@agbeltran Are you now proposing to drop the idea of adding a more general 'size' property, and just revise the axiom (datatype) of dcat:byteSize?
As to changing the datatype from xsd:decimal to xsd:positiveInteger, I wonder if that would break implementations that currently specify a number with ^^xsd:decimal? For example, I see
<dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">246629.0</dcat:byteSize> in https://www.govdata.de/ckan/catalog/catalog.rdf. I guess this would become non-conformant if the axiom were changed.
If it's just for elegance, does it make sense to force people to convert their existing data?
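
To make the compatibility concern concrete, here is a minimal Python sketch. It assumes a consumer that validates lexical forms strictly against xsd:positiveInteger (an optional plus sign followed by digits, with a value of at least 1). The function names are hypothetical and not part of any DCAT tooling.

```python
import re
from decimal import Decimal

# Lexical form of xsd:positiveInteger: optional '+', then digits only.
_POSITIVE_INTEGER = re.compile(r"\+?[0-9]+")


def is_positive_integer_lexical(value: str) -> bool:
    """Return True if value is a valid xsd:positiveInteger lexical form."""
    return bool(_POSITIVE_INTEGER.fullmatch(value)) and int(value) > 0


def to_positive_integer(value: str) -> str:
    """Convert an xsd:decimal lexical form to xsd:positiveInteger, if lossless."""
    number = Decimal(value)
    if number != number.to_integral_value() or number <= 0:
        raise ValueError(f"{value!r} is not a positive whole number")
    return str(int(number))
```

Under these assumptions, a value such as "246629.0" would be rejected by a strict check but could be converted without loss, so the cost to publishers would be a one-off rewrite of existing literals rather than a loss of information.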

agbeltran · 2018-08-30T07:57:03Z

Yes, we opened the issue to investigate whether the property dcat:size plus a unit of measurement would provide more flexibility than dcat:byteSize, but we traced back the reasons why dcat:size was deprecated (e.g. it would usually require the use of a blank node, see link above for more info). So, unless you (or others) think it would be necessary, I don't think we need to undeprecate dcat:size.

I wonder, though, whether the current representation is too cumbersome (or has limitations) for representing dataset distributions that are actually terabytes of data (e.g. multi-dimensional microscopy images can weigh up to several TB each and datasets can be hundreds of TB in total).
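
As a rough check on the scale concern, the sketch below shows how long such a byte count would be. It assumes decimal terabytes (1 TB = 10^12 bytes) and a hypothetical 500 TB dataset. Whether a given consumer handles numbers of that length gracefully is a separate question.

```python
TB = 10**12  # decimal terabyte; an assumption, not 2**40

dataset_bytes = 500 * TB  # hypothetical dataset of 500 TB
digits = len(str(dataset_bytes))  # number of digits in the literal
fits_in_int64 = dataset_bytes <= 2**63 - 1  # signed 64-bit upper bound

print(dataset_bytes, digits, fits_in_int64)
```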

agbeltran · 2018-08-30T07:58:32Z

About changing the datatype, I agree that we should be careful about current implementations. Maybe we can continue that discussion in the specific issue #125

makxdekkers · 2018-08-30T12:32:05Z

I just saw the proposal from @riccardoAlbertoni in today's call https://www.w3.org/2018/08/30-dxwgdcat-minutes#x10 to create a new class for size with a number and a unit of measurement. @agbeltran then said that the object would be assigned an IRI.
I think this is not realistic. Who would assign IRIs to "1024 bytes" and to any other number of bytes? In my mind, assigning IRIs to these kinds of things with low reusability does not make sense. Don't forget that to do this right, you would need to resolve an IRI like http://foo.bar/size/bytes/1024. As I wrote in #300, minting a URI creates a maintenance commitment.
It's far more likely that it would be done as a blank node. This was precisely why the original dcat:size did not make it into DCAT2014.
Also note that VOID took the simple approach, defining a set of properties for various measures: https://www.w3.org/TR/void/#statistics.

agbeltran · 2018-08-30T12:37:11Z

Thanks @makxdekkers - I totally agree with your view. My comment on the call was pointing out that I don't think creating a size object is useful, as it would require assigning an IRI to such an object, which is not really reusable and bears the maintenance costs that you referred to.

riccardoAlbertoni · 2018-08-30T17:54:15Z

@makxdekkers I am not sure about what is realistic and what is not. My comment in today's call was more a reaction to an emerging proposal to have distinct size properties for every possible unit of measure, which sounds to me like bad modelling, and dangerous in a longer-term perspective.

If the rationale behind this discussion is to make users more comfortable in expressing and reading the size, we have to consider that the name for multiples of bytes will evolve and which scale to use might be application dependent: if we add the property TerabyteSize, sooner or later we might need to add exabyteSize ... etc.

I am not against the use of a blank node in this specific case of an n-ary relation, if there is such a dire need to express the size in different units of measure.

However, I tend to agree with you: if we do not want to have blank nodes, and no solutions other than adding new properties with hard-coded scale/size are on the table, we should replicate the simple approach from VOID, which probably corresponds to living with byteSize.

makxdekkers · 2018-08-30T21:48:55Z

@riccardoAlbertoni The issue of granularity/scale -- whether the size is expressed in bytes, kilobytes, megabytes etc -- is really a case of trying to be helpful to people at the expense of efficiency of data. Creating a complex mechanism with an additional class to reduce the number of digits, e.g. from "1000000000000" (bytes) to "1" (terabyte), will actually increase the number of bytes on the wire.
dcat:byteSize "1000000000000" is actually shorter than (inventing some properties) dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"].
The other thing is a potential requirement to express different types of sizes, e.g. number of observations, number of rows in a spreadsheet, number of articles in a legal text etc. If there is a small number of such types, the VOID approach makes sense. If there are a large number of types, a structured approach should be better, which is what Data Cube does with sdmx-attribute:unitMeasure.
In my mind, in DCAT we're just talking about byte size so I don't see the need for a more complex approach.
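
The length comparison above can be checked with a few lines of Python. This is only a sketch: it counts the UTF-8 bytes of the two Turtle fragments exactly as written in the comment, and dcat:scaledSize, dcat:scale and dcat:number are the invented properties from that comment, not real DCAT terms.

```python
# Plain literal in bytes, as dcat:byteSize is used today.
plain = 'dcat:byteSize "1000000000000"'
# Structured alternative using the invented (non-DCAT) properties.
structured = 'dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"]'

plain_bytes = len(plain.encode("utf-8"))
structured_bytes = len(structured.encode("utf-8"))

print(plain_bytes, structured_bytes)
```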

rob-metalinkage · 2018-08-30T22:31:05Z

There seem to be four distinct issues with dcat:byteSize as the only option:

1. very large numbers - certainly not human readable practically
2. exact semantics - is this expected to be exact or approximate? what if the resource varies over time and the exact value cannot be predicted
3. cost of computation of exact bytesize
4. difference in values with different encoding choices that may be negotiated for a distribution

What feels "reasonable" to me is to keep byteSize, with a tighter definition of its expected semantics, and introduce a new term with a simple string literal in a microformat,

e.g. dcat:approxSize "23 MB"

Such microformats are extremely common. I haven't had much luck tracking down a standard for such a format, but there are ones for the actual postfix part.

There is also confusion over K = 1000 or 1024 and some ISO rules, and there are explicit postfixes (e.g. KB and KiB) for these cases. IMHO this would not matter if approximation is the semantics, though we would still need to be careful about byte vs. bit (KB vs Kb), which is effectively an order of magnitude.

Here are two major development platforms that explicitly support such formats, without citing standards conformance, but do reference this issue of interpretation.

https://developer.android.com/reference/android/text/format/Formatter#formatFileSize(android.content.Context,%20long)

https://docs.microsoft.com/en-us/windows/desktop/api/shlwapi/nf-shlwapi-strformatbytesizeex
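
For illustration, a formatter of the kind those platform functions provide could look roughly like the Python sketch below. It assumes decimal multiples (K = 1000), rounding to whole numbers, and the "number, space, unit" layout of the dcat:approxSize example. The function name approx_size is hypothetical, and this is not how either platform implements it.

```python
def approx_size(num_bytes: int) -> str:
    """Format a byte count as a short approximate string such as '23 MB'."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1000,
        # or at the largest unit available.
        if value < 1000 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1000
```

Because the output is rounded, it only makes sense if the property is defined as approximate, which is the semantics suggested above.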

jakubklimek · 2018-08-31T05:56:50Z

@makxdekkers @agbeltran As in #300, I have to say that I do not see the problem in creating an IRI such as https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size as an instance of, probably, https://schema.org/QuantitativeValue.

If I manage https://mycatalog.com/resource/dataset/XXX/distribution/YYY, the additional cost of managing https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size still seems minimal to me - I probably use a generic method of assuring dereference of IRIs, so it does not matter how many IRIs there are.
As to the (re)usability of such IRIs - you never know when someone decides to monitor the size of a given distribution. For them, the IRI would make sense. Nevertheless, I admit that the reusability of such an IRI will definitely be lower than that of a dataset.

agreiner · 2018-09-04T21:19:42Z

I have a strong preference for using actual values rather than URIs for things like numbers or timestamps. For programming and for human readability, looking up a URI for such a thing strikes me as far more complex than necessary, to the point of being somewhat comical.

agreiner · 2018-09-04T21:43:41Z

The examples of programmatic formatting of byte counts are, though, the reverse of what I would call programmatic support for the suggested microformats: they take a long and turn it into a string with a convenient number and unit, whereas supporting the suggested microformats would require a function that reads the particular microformat and returns the long. I don't think it's too much to ask of a programmer to write such a thing, if we can specify the microformat. I would not worry about KiB etc., as they can be converted to KB etc. and are rarely used.
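
A reader function of that kind might look like the following Python sketch. The microformat is not specified anywhere in this thread, so the grammar here is an assumption: a non-negative decimal number, optional whitespace, and a case-sensitive unit suffix, with decimal (K = 1000) and binary (Ki = 1024) multipliers. The name parse_size is hypothetical.

```python
import re

# Assumed suffixes and multipliers; case-sensitive so that 'Kb' (bits) is rejected.
_MULTIPLIERS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
}
_SIZE = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*")


def parse_size(text: str) -> int:
    """Parse a string such as '23 MB' into an approximate number of bytes."""
    match = _SIZE.fullmatch(text)
    if match is None or match.group(2) not in _MULTIPLIERS:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return round(float(number) * _MULTIPLIERS[unit])
```

Rejecting unknown suffixes rather than guessing keeps the byte-vs-bit ambiguity mentioned earlier from silently producing a wrong value.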

nicholascar · 2018-09-04T22:26:19Z

Any reason not to relate dcat:byteSize to qudt:bytes (from http://qudt.org/2.0/schema/SCHEMA_QUDT-DATATYPES-v2.0.ttl - no domain, range is xsd:integer)? Then people can use a single, named property (simple case) but those wanting more detail can apply QUDT qualifiers like qudt:Mega or qudt:Mebi (http://qudt.org/2.0/vocab/VOCAB_QUDT-UNITS-BASE-v2.0.ttl) if desired.

So even for the simple, single-property-including-units case, you relate to a comprehensive ontology for complex cases.

There is also qudt:bits.

I can't see anything in QUDT about approximate values, but perhaps there is something.

makxdekkers · 2018-09-05T09:34:50Z

@nicholascar What would be the advantage of including a relationship between dcat:byteSize and qudt:bytes?
One reason maybe not to rely on QUDT is that it is developed by an organisation that does not seem to be a formal standards organisation. Their website does not say anything about their processes other than stating that the Board of Directors have the power of approval, but there is no visible community beyond that board.
Just as a minor comment, I browsed through the QUDT specification and could not find a definition of the semantic meaning of qudt:bytes: clicking on the link in http://www.qudt.org/doc/2017/DOC_SCHEMA-QUDT-DATATYPES-v2.0.html just tells you it is an owl:DatatypeProperty. Now it might be obvious -- "the number of bytes in the described resource" -- but I think it would be good practice to actually say that somewhere.

davebrowning · 2019-09-25T06:42:17Z

There is clearly an area here that has the potential for revision as part of future work beyond DCAT 2. As well as dcat:byteSize, there is the adjacent area of statistics for datasets as a whole (#84), which could pick up other "dimensions" (such as the number of entities in some logical view of the dataset) beyond the size of the physical representation.

Tagging for future work, and moving to future milestone (alongside #84)

andrea-perego · 2021-03-13T11:55:05Z

There has been no further discussion on this issue since 2018, and DCAT 2 did not, in the end, include a dcat:size property.

@agbeltran , do you think we can close it?

andrea-perego · 2021-03-20T09:20:19Z

Noting no objections, I'm closing this issue.

agbeltran added the dcat label Aug 9, 2018

dr-shorthair added this to the DCAT Second Public Working Draft milestone Aug 23, 2018

agbeltran self-assigned this Aug 23, 2018

agbeltran removed this from the DCAT Second Public Working Draft milestone Sep 20, 2018

davebrowning mentioned this issue Sep 20, 2018

Review global domain axioms on dcat properties #110

Closed

dr-shorthair added the statistics label Feb 6, 2019

davebrowning added the future-work issue deferred to the next standardization round label Sep 25, 2019

davebrowning unassigned agbeltran Sep 25, 2019

davebrowning added this to the DCAT Future Priority Work milestone Sep 25, 2019

davebrowning mentioned this issue Sep 25, 2019

Use case: Dataset size characteristics #161

Closed

andrea-perego added this to To do in DCAT revision via automation Sep 26, 2019

andrea-perego modified the milestones: DCAT Future Priority Work, DCAT3 2PWD Mar 13, 2021

andrea-perego added the due for closing Issue that is going to be closed if there are no objection within 6 days label Mar 13, 2021

andrea-perego closed this as completed Mar 20, 2021

DCAT revision automation moved this from To do to Done Mar 20, 2021

andrea-perego removed the future-work issue deferred to the next standardization round label Mar 26, 2022
