Generalize dcat:byteSize to dcat:size #313

Open
agbeltran opened this issue Aug 9, 2018 · 20 comments

Comments

Member

@agbeltran agbeltran commented Aug 9, 2018

At the moment, DCAT provides a property to indicate the size of a distribution in bytes (dcat:byteSize). We discussed that this should be generalized to dcat:size with an additional indication of the unit of measurement. For the latter, we would consider an existing ontology (such as UO, QUDT, OM etc).

Related to #125

As per discussions in meeting (https://www.w3.org/2018/07/19-dxwgdcat-minutes.html#x07) and action (https://www.w3.org/2017/dxwg/track/actions/158).

@agbeltran agbeltran added the dcat label Aug 9, 2018
Contributor

@dr-shorthair dr-shorthair commented Aug 13, 2018

I agree that a discussion on this topic is merited. But I am not sure an additional property is warranted.

Having too many ways to do the same thing has a cost. While a more flexible property makes things easier for the data provider, it creates more work for the consumer. I believe dcat:byteSize is enough, though I suggest that its range should be xsd:positiveInteger (which is a valid OWL 2 datatype; see https://www.w3.org/TR/owl2-quick-reference/#Built-in_Datatypes). See #125.

Member Author

@agbeltran agbeltran commented Aug 14, 2018

In fact, DCAT 2014 originally had a property dcat:size, which was deprecated:

dcat:size a rdf:Property;
    rdfs:isDefinedBy dcat:;
    rdfs:label "size (Deprecated)";
    rdfs:comment "the size of a distribution. This term has been deprecated";
    rdfs:domain dcat:Distribution;
    owl:deprecated true ;
    rdfs:subPropertyOf dct:extent .

I found some of the old discussions here:

https://lists.w3.org/Archives/Public/public-gld-wg/2012Oct/0117.html

Contributor

@dr-shorthair dr-shorthair commented Aug 15, 2018

Right. The additional point in that post

stating that the value can be approximate

addresses the lurking issue ('what if I only want to indicate the size in round numbers?')
The usage note https://w3c.github.io/dxwg/dcat/#Property:distribution_size states

The size in bytes can be approximated when the precise size is not known.

Perhaps this could be clarified with an example or two?

@agbeltran agbeltran self-assigned this Aug 23, 2018
Contributor

@makxdekkers makxdekkers commented Aug 24, 2018

@agbeltran It seems to me that dcat:size is fundamentally different from dcat:byteSize, because it would necessarily have a resource as its range -- as you point out, and as the example in the old discussion shows, it needs a link to a controlled vocabulary for its unit of measurement. As I understand it, dcat:byteSize was defined as a simpler way to express something that people saw as a main requirement at the time. So you can't generalise dcat:byteSize, but would need to define a parallel property for more general cases.
One process question: would we be able to 'un-deprecate' a property in the namespace, or would this need to get a new URI?

Member Author

@agbeltran agbeltran commented Aug 30, 2018

Thanks @makxdekkers - according to the discussion, I think we would not un-deprecate the property, but keep dcat:byteSize while revising its axioms (see also #125 #110), for example by changing the range to xsd:positiveInteger, as suggested by @dr-shorthair above. In terms of the point raised by this issue, we also need to confirm that keeping byte as the unit provides enough flexibility to describe large distributions (and we need to decide whether to relax the domain).

Member Author

@agbeltran agbeltran commented Aug 30, 2018

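Some examples of the use of dcat:byteSize can be found in this SPARQL endpoint with the following query (the STR() call is needed because STRLEN expects a string argument rather than a numeric literal):

    select * where {
      ?s a dcat:Dataset .
      ?d a dcat:Distribution .
      ?s dcat:distribution ?d .
      ?d dcat:byteSize ?size .
      FILTER ( STRLEN(STR(?size)) > 10 )
    } LIMIT 100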

Contributor

@makxdekkers makxdekkers commented Aug 30, 2018

@agbeltran Are you now proposing to drop the idea of adding a more general 'size' property, and just revise the axiom (datatype) of dcat:byteSize?
As to changing the datatype from xsd:decimal to xsd:positiveInteger, I wonder if that would break implementations that currently specify a number with ^^xsd:decimal? For example, I see
<dcat:byteSize rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">246629.0</dcat:byteSize> in https://www.govdata.de/ckan/catalog/catalog.rdf. I guess this would become non-conformant if the axiom were changed.
If it's just for elegance, does it make sense to force people to convert their existing data?
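
To make the conformance concern concrete, here is a minimal sketch that checks lexical forms against the two datatypes. The regular expressions paraphrase the lexical rules of xsd:decimal and xsd:positiveInteger as I understand them, and the helper names are invented for illustration; a real validator should rely on an RDF or XSD library instead.

```python
import re

# Simplified lexical patterns for the two XSD datatypes (an approximation,
# not a full schema validator). Helper names are hypothetical.
DECIMAL_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
POSITIVE_INTEGER_RE = re.compile(r"\+?[0-9]+")


def is_xsd_decimal(lexical: str) -> bool:
    """Return True if the string looks like a valid xsd:decimal lexical form."""
    return DECIMAL_RE.fullmatch(lexical) is not None


def is_xsd_positive_integer(lexical: str) -> bool:
    """Return True if the string is an integer lexical form with value > 0."""
    return POSITIVE_INTEGER_RE.fullmatch(lexical) is not None and int(lexical) > 0


# The value quoted from the govdata.de catalog above.
existing_value = "246629.0"
print(is_xsd_decimal(existing_value))           # valid under the current range
print(is_xsd_positive_integer(existing_value))  # not valid under the proposed range
```

Under these assumptions, publishers would have to rewrite "246629.0" as "246629" (and change the datatype IRI) to stay conformant after such a change.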

Member Author

@agbeltran agbeltran commented Aug 30, 2018

Yes, we opened the issue to investigate if the property dcat:size plus a unit of measurement provided more flexibility than dcat:byteSize but we tracked back the reasons why dcat:size was deprecated (e.g. it would usually require the use of a blank node, see link above for more info). So, unless you (or others) think it would be necessary, I don't think we need to undeprecate dcat:size.

I wonder, though, whether the current representation is too cumbersome (or has limitations) for dataset distributions that are actually terabytes of data (e.g. multi-dimensional microscopy images can weigh up to several TB each and datasets can be hundreds of TB in total).
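
As a rough feasibility check, the sketch below expresses such sizes in bytes and compares them with the largest signed 64-bit integer. The 5 TB and 500 TB figures are hypothetical stand-ins for "several TB" and "hundreds of TB", and a decimal terabyte (10^12 bytes) is assumed.

```python
TB = 10**12  # decimal terabyte in bytes (assumption: SI prefix, not 2**40)

single_image_bytes = 5 * TB      # hypothetical "several TB" microscopy image
whole_dataset_bytes = 500 * TB   # hypothetical "hundreds of TB" dataset

INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer


def digit_count(value: int) -> int:
    """Number of decimal digits needed to write the value."""
    return len(str(value))


for label, size in [("image", single_image_bytes), ("dataset", whole_dataset_bytes)]:
    print(label, size, digit_count(size), "digits, fits in int64:", size <= INT64_MAX)
```

So the values remain representable; the open question is readability for humans rather than numeric range.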

Member Author

@agbeltran agbeltran commented Aug 30, 2018

About changing the datatype, I agree that we should be careful about current implementations. Maybe we can continue that discussion in the specific issue #125

Contributor

@makxdekkers makxdekkers commented Aug 30, 2018

I just saw the proposal from @riccardoAlbertoni in today's call https://www.w3.org/2018/08/30-dxwgdcat-minutes#x10 to create a new class for size with a number and a unit of measurement. @agbeltran then said that the object would be assigned an IRI.
I think this is not realistic. Who would assign IRIs to "1024 bytes" and to any other number of bytes? In my mind, assigning IRIs to these kinds of things with low reusability does not make sense. Don't forget that to do this right, you would need to resolve an IRI like http://foo.bar/size/bytes/1024. As I wrote in #300, minting a URI creates a maintenance commitment.
It's far more likely that it would be done as a blank node. This was precisely why the original dcat:size did not make it into DCAT2014.
Also note that VOID took the simple approach, defining a set of properties for various measures: https://www.w3.org/TR/void/#statistics.

Member Author

@agbeltran agbeltran commented Aug 30, 2018

Thanks @makxdekkers - I totally agree with your view. My comment on the call was pointing out that I don't think creating a size object is useful, as it would require assigning an IRI to such an object, which is not really reusable and bears the maintenance costs that you referred to.

Collaborator

@riccardoAlbertoni riccardoAlbertoni commented Aug 30, 2018

@makxdekkers I am not sure about what is realistic and what is not. My comment in today's call was more a reaction to an emerging proposal to have distinct size properties for every possible unit of measure, which sounds to me like bad modelling, and dangerous in a longer-term perspective.

If the rationale behind this discussion is to make users more comfortable in expressing and reading the size, we have to consider that the name for multiples of bytes will evolve and which scale to use might be application dependent: if we add the property TerabyteSize, sooner or later we might need to add exabyteSize ... etc.

I am not against the use of a blank node in this specific case of an n-ary relation, if there is such a dire need to express the size in different units of measure.

However, I tend to agree with you: if we do not want to have blank nodes, and no solutions other than adding new properties with a hard-coded scale/size are on the table, we should replicate the simple approach from VOID, which probably corresponds to living with byteSize.

Contributor

@makxdekkers makxdekkers commented Aug 30, 2018

@riccardoAlbertoni The issue of granularity/scale -- whether the size is expressed in bytes, kilobytes, megabytes etc -- is really a case of trying to be helpful to people at the expense of efficiency of data. Creating a complex mechanism with an additional class to reduce the number of digits, e.g. from "1000000000000" (bytes) to "1" (terabyte), will actually increase the number of bytes on the wire.
dcat:byteSize "1000000000000" is actually shorter than (inventing some properties) dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"].
The other thing is a potential requirement to express different types of sizes, e.g. number of observations, number of rows in a spreadsheet, number of articles in a legal text etc. If there is a small number of such types, the VOID approach makes sense. If there are a large number of types, a structured approach should be better, which is what Data Cube does with sdmx-attribute:unitMeasure.
In my mind, in DCAT we're just talking about byte size so I don't see the need for a more complex approach.
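
The "bytes on the wire" comparison can be checked with a short sketch. It measures only the two Turtle fragments quoted above; dcat:scaledSize, dcat:scale and dcat:number are the properties invented in that comment for the sake of argument, not real DCAT terms.

```python
# Turtle fragments as quoted in the comment above. The "scaledSize" properties
# are hypothetical and exist only for this comparison.
simple = 'dcat:byteSize "1000000000000"'
structured = 'dcat:scaledSize [dcat:scale "TB" ; dcat:number "1"]'

simple_len = len(simple.encode("utf-8"))
structured_len = len(structured.encode("utf-8"))

print("simple:", simple_len, "bytes")
print("structured:", structured_len, "bytes")
print("structured is longer:", structured_len > simple_len)
```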

Contributor

@rob-metalinkage rob-metalinkage commented Aug 30, 2018

There seem to be four distinct issues with dcat:byteSize as the only option:

  1. very large numbers - certainly not human readable practically
  2. exact semantics - is this expected to be exact or approximate? what if the resource varies over time and the exact value cannot be predicted
  3. cost of computation of exact bytesize
  4. difference in values with different encoding choices that may be negotiated for a distribution

What feels "reasonable" to me is to keep byteSize with a tighter definition of its expected semantics and introduce a new term with a simple string literal with a microformat,

e.g. dcat:approxSize "23 MB"

Such microformats are extremely common, but I haven't had much luck tracking down a standard for such a format. There are standards for the actual postfix part, along with some confusion over whether K = 1000 or 1024 and some ISO rules, and there are explicit postfixes (e.g. KB and KiB) for these cases. IMHO this would not matter if approximation is the semantics, though we would still need to be careful about byte vs bit (KB vs Kb), which is effectively an order of magnitude.

Here are two major development platforms that explicitly support such formats, without citing standards conformance, but do reference this issue of interpretation.

https://developer.android.com/reference/android/text/format/Formatter#formatFileSize(android.content.Context,%20long)

https://docs.microsoft.com/en-us/windows/desktop/api/shlwapi/nf-shlwapi-strformatbytesizeex
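
For illustration only, here is a minimal sketch of this kind of formatting, showing how the same byte count reads differently under the 1000-based and 1024-based interpretations. It is not the Android or Windows API linked above; the function name, unit lists and rounding rule are assumptions made for this example.

```python
def format_size(num_bytes: int, binary: bool = False) -> str:
    """Render a byte count as a short approximate string.

    binary=False divides by 1000 (KB, MB, ...); binary=True divides by 1024
    (KiB, MiB, ...). The value is rounded to a whole number of units.
    """
    base = 1024 if binary else 1000
    units = ["B", "KiB", "MiB", "GiB", "TiB"] if binary else ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    index = 0
    # Step up one unit at a time until the value is below the base.
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.0f} {units[index]}"


print(format_size(23_000_000))               # decimal interpretation
print(format_size(23_000_000, binary=True))  # binary interpretation
```

The two calls produce different labels for the same number of bytes, which is the interpretation issue referred to above.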

Contributor

@jakubklimek jakubklimek commented Aug 31, 2018

@makxdekkers @agbeltran As in #300, I have to say that I do not see the problem in creating an IRI such as https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size as an instance of, probably, https://schema.org/QuantitativeValue.

  1. If I manage https://mycatalog.com/resource/dataset/XXX/distribution/YYY, the additional cost of managing https://mycatalog.com/resource/dataset/XXX/distribution/YYY/size still seems minimal to me - I probably use a generic method of assuring dereference of IRIs, so it does not matter how many IRIs there are.
  2. As to the (re)usability of such IRIs - You never know when someone decides to monitor the size of a given distribution. For them, the IRI would make sense. Nevertheless, I admit that the reusability of such an IRI will definitely be lower than that of a dataset.
Contributor

@agreiner agreiner commented Sep 4, 2018

I have a strong preference for using actual values rather than URIs for things like numbers or timestamps. For programming and for human readability, looking up a URI for such a thing strikes me as far more complex than necessary, to the point of being somewhat comical.

Contributor

@agreiner agreiner commented Sep 4, 2018

The examples of programmatic formatting of byte counts are the reverse of what I would call programmatic support of the suggested microformats: they take a long and turn it into a string with a convenient number and unit, whereas support of the suggested microformats would require a function that reads the particular microformat and returns the long. Still, I don't think it's too much to ask of a programmer to write such a thing, if we can specify the microformat. I would not worry about KiB etc., as they can be converted to KB etc. and are rarely used.
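
A minimal sketch of such a reader is shown below, assuming a microformat of the shape "<number> <unit>" as in the dcat:approxSize "23 MB" suggestion. No such format has been specified in this thread, so the accepted units, the choice of KB = 1000 bytes, and the function name are all assumptions.

```python
import re
from decimal import Decimal

# Assumed unit table: decimal prefixes for KB/MB/..., binary for KiB/MiB/...
# Bit-based units such as "Mb" are deliberately not accepted.
_MULTIPLIERS = {
    "B": 1,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
}
_SIZE_RE = re.compile(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*")


def parse_approx_size(text: str) -> int:
    """Parse a string such as "23 MB" and return the size in bytes."""
    match = _SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised size string: {text!r}")
    number, unit = match.groups()
    if unit not in _MULTIPLIERS:  # case-sensitive, so "Mb" is rejected
        raise ValueError(f"unknown unit: {unit!r}")
    # Decimal avoids binary floating-point error; int() truncates any fraction.
    return int(Decimal(number) * _MULTIPLIERS[unit])


print(parse_approx_size("23 MB"))
print(parse_approx_size("1.5 GB"))
```

Rejecting unknown units, rather than guessing, keeps the byte-vs-bit ambiguity from silently producing a value that is off by a factor of eight.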

Contributor

@nicholascar nicholascar commented Sep 4, 2018

Any reason not to relate dcat:byteSize to qudt:bytes (from http://qudt.org/2.0/schema/SCHEMA_QUDT-DATATYPES-v2.0.ttl - no domain, range is xsd:integer)? Then people can use a single, named property (simple case) but those wanting more detail can apply QUDT qualifiers like qudt:Mega or qudt:Mebi (http://qudt.org/2.0/vocab/VOCAB_QUDT-UNITS-BASE-v2.0.ttl) if desired.

So even for the simple, single-property-including-units case, you relate to a comprehensive ontology for complex cases.

There is also qudt:bits.

I can't see anything in QUDT about approximate values, but perhaps there is something.

Contributor

@makxdekkers makxdekkers commented Sep 5, 2018

@nicholascar What would be the advantage of including a relationship between dcat:byteSize and qudt:bytes?
One reason maybe not to rely on QUDT is that it is developed by an organisation that does not seem to be a formal standards organisation. Their website does not say anything about their processes other than stating that the Board of Directors have the power of approval, but there is no visible community beyond that board.
Just as a minor comment, I browsed through the QUDT specification and could not find a definition of the semantic meaning of qudt:bytes: clicking on the link in http://www.qudt.org/doc/2017/DOC_SCHEMA-QUDT-DATATYPES-v2.0.html just tells you it is an owl:DatatypeProperty. Now it might be obvious -- "the number of bytes in the described resource" -- but I think it would be good practice to actually say that somewhere.

Contributor

@davebrowning davebrowning commented Sep 25, 2019

There is clearly an area here that has the potential for revision as part of future work beyond DCAT 2. As well as dcat:byteSize, there is the adjacent area of statistics for datasets as a whole (#84), which could pick up other "dimensions" (such as the number of entities in some logical view of the dataset) beyond the size of the physical representation.

Tagging for future work, and moving to future milestone (alongside #84)

@andrea-perego andrea-perego added this to To do in DCAT revision via automation Sep 26, 2019
Projects
DCAT revision
  
To do
9 participants