Allowing "unions" of datatypes? #223
Comments
+1, this seems like a reasonable change. |
I am not convinced. I would like to see more use cases requiring it. It is yet more feature creep, and we need very strong arguments before allowing any more. So... -1 for the time being. Ivan Herman (written on mobile, sorry for brevity and misspellings...)
|
With simplified merge semantics this reduces the merge burden. If sufficiently motivated this is reasonable, and could even reduce the need for multiple |
@jumbrich to do an analysis to see how common it is to have columns with union values in real data. Will make a decision based on that analysis. |
Over many years of dealing with CSV data in the social sciences, I can say that it is rare to find data which has got no funny values in it
The DfE also uses things like "N/D" (non-disclosed) and "N/A" (not available) in some datasets. Small comments: it is easy to go automatically from a "complicated" type to a simpler one; the reverse is not as easy. Also, I don't think that a NULL is the same as a "SUPP", "N/A" and so on. I would have no problem dealing with datatypes as described by JeniT, as they seem to me to be a "rewriting" of what is in the W3C schema datatypes. |
I understand. However... isn't it enough to model those 'a'/'c' etc cases by the simpler approach, described in the (accepted) issue #218? Ie, to be able to list a number of strings that act as "null"? The current issue is whether it is required to have, say, a column whose values can be integers or dates, for example. Ivan
Ivan Herman, W3C |
Hi On 19/02/15 08:48, Ivan Herman wrote:
|
Thanks. That is an important input for our discussion...
Right. From our processing and metadata point of view, is it o.k. to be able to declare: "null" : ["NA", "NE", "SUPP"] that is those strings are accepted as null values (ie, no output is generated in the JSON and/or RDF output for those cases, and validators may consider those data sets valid). Is that enough for your requirements? Ivan
Ivan Herman, W3C |
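A minimal sketch of how a processor might honour the "null" declaration proposed above, assuming a hypothetical two-column table. The function name `convert_rows`, the column names, and the sample data are illustrative assumptions, not anything defined in the spec or in this thread:

```python
import csv
import io

# Null markers taken from the declaration discussed above.
NULL_MARKERS = frozenset({"NA", "NE", "SUPP"})

def convert_rows(text, numeric_column, null_markers=NULL_MARKERS):
    """Convert CSV text to a list of dicts, omitting cells that hold a null marker."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        obj = {}
        for name, raw in record.items():
            if raw in null_markers:
                continue  # null cell: generate no output for it
            # Hypothetical rule: one named column is an integer, the rest are strings.
            obj[name] = int(raw) if name == numeric_column else raw
        rows.append(obj)
    return rows

# Hypothetical sample: the second row suppresses its pupil count.
sample = "school,pupils\nA,12\nB,SUPP\n"
print(convert_rows(sample, "pupils"))
```

Under this reading, a suppressed cell simply produces no property in the output, which is also why the distinction between "SUPP" and "N/A" is lost once both are declared as null.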
What I would like to say is that in a column I may have "integer" or On 19/02/15 09:03, Ivan Herman wrote:
|
Indeed, the definition of the I seem to remember we discussed, at some point, whether the original cell value (ie, not transformed via, say, datatype transformation) should appear as an Ivan
Ivan Herman, W3C |
thanks On 19/02/15 10:51, Ivan Herman wrote:
|
@robald7 said:
Agreed; I don't think that is a legitimate use case. What I have seen is date and dateTime information intermingled in the same column, so having a datatype of ['date', 'dateTime'] would be useful. Your own example of |
'date' and 'dateTime' maybe, but I think that the person writing the On 19/02/15 20:42, Gregg Kellogg wrote:
|
discussed |
Discussed on the call on 25-02-2015: @jumbrich looked at 9k CSV files, examining the different datatypes within single columns. About 1/3 of the documents have columns with a mix of strings and integer/numeric values (2,282 docs). |
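The kind of analysis described above could be approximated as follows. This is only a sketch, not the method @jumbrich actually used: the helper names `classify` and `mixed_columns` are hypothetical, and "numeric" here means anything Python's `float()` accepts (which includes strings such as "nan" and "inf"):

```python
import csv
import io
from collections import defaultdict

def classify(value):
    """Label a raw cell as 'empty', 'numeric' or 'string'."""
    value = value.strip()
    if value == "":
        return "empty"
    try:
        float(value)  # assumption: float() defines what counts as numeric
        return "numeric"
    except ValueError:
        return "string"

def mixed_columns(text):
    """Return the sorted names of columns holding both numeric and string cells."""
    kinds = defaultdict(set)
    for record in csv.DictReader(io.StringIO(text)):
        for name, raw in record.items():
            if name is None:
                continue  # skip surplus cells beyond the header width
            kind = classify(raw or "")
            if kind != "empty":
                kinds[name].add(kind)
    return sorted(name for name, seen in kinds.items() if len(seen) > 1)

# Hypothetical sample: 'score' mixes integers with an "N/A" marker.
sample = "id,score\n1,10\n2,N/A\n3,7\n"
print(mixed_columns(sample))
```

A column flagged this way may still be covered by a list of null markers rather than a true union of datatypes, so a count like this is an upper bound on the need for unions.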
http://www.w3.org/2015/03/04-csvw-irc#T15-51-49 - we'll leave this as an issue in the spec but not add support for now |
Most data processing I have been engaged around education over the last 20 years can be approached as follows
|
@JeniT to flag this issue to the group to say that we really want input about whether we do or do not support union datatypes. |
Good evening I think that the issue is not so much about the union of datatypes but On 29/04/15 15:58, Jeni Tennison wrote:
|
We did discuss this on yesterday's call, and there seemed to be a few issues at play. The first two examples seem to use a constrained vocabulary, for which foreign key constraints are a good solution (IMO, anyway). The third example uses mixed values, some from a constrained vocabulary, others integers. You suggest that you might reference an XSD schema element to provide a mechanism to validate such values, but in our estimation, this would add significant complexity to implementations expected to validate cell contents using this. The multiple datatype mechanism would seem to address this use case by allowing multiple classes of data within a cell. For example, this might be Another thing discussed would be to allow the datatype base to be set to an arbitrary URL (or prefixed name, where the prefix comes from the CSVW context). This would be just to allow a different reference to a datatype, without any validation beyond what is discussed in Formats for other types. The use of a regular expression |
Thank you for this reply On 30/04/15 16:18, Gregg Kellogg wrote:
best wishes |
A validating implementation is supposed to look at the data and the metadata and decide whether the data is valid. This implementation may run in Python, JavaScript, Ruby, whatever, on very different operating systems or in the web browser. Ie, perl, awk, or xmllint are not valid options; the only option is to run XML Schema validation on the data. This means having access to XML Schema validation (I do not know of any browser doing that, for example), ie, incorporating a (potentially large) library or writing one's own XML Schema validation procedure for a relatively minor use case. |
Thank you for this reply, I think I understand what you say but there This being said, the main point is to have a way to describe the data as Best wishes having written that I only noticed the existence of ODI's csvlint. I On 01/05/15 08:54, Ivan Herman wrote:
|
Asked for general opinion from the list: https://lists.w3.org/Archives/Public/public-csv-wg/2015May/0002.html |
-1 this is additional complexity for an edge case. People can try using "any" or similar and we can review again in future revisions of the spec if a major issue |
@rgrp I'm not sure yet whether I view it as additional complexity, or as pragmatism that acknowledges how many real world tabular data files look in practice. Could we e.g. tighten up your "People can try using 'any'" into something implementable, and mark it as an at-risk / experimental feature, rather than leaving it completely open? |
Good morning In my opinion, to have a type like "any" or "string" when not able for On 06/05/15 10:27, Dan Brickley wrote:
|
Good evening I have looked at the use-case 11 ( Palo Alto trees), and I wonder what |
I do not know what you mean by 'natively'. Can you explain? Ivan
Ivan Herman, W3C |
Good morning On 16/06/15 05:56, Ivan Herman wrote:
|
The current spec already goes beyond the pure XSD offering insofar as the various restrictions (min/max values, regexp patterns, etc) are typically those features that people use for refining the core types like integers or strings. Ie, CSV validators can already check values using those 'extended' datatypes. I do not think it is justified to go beyond that, ie, to reproduce the full XSD typing mechanism in our specifications. Ivan |
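As an illustration of the 'extended' datatypes mentioned above, here is a minimal sketch of a cell check that combines a base type with min/max bounds or a regular expression. The dictionary keys `base`, `minimum`, `maximum` and `format` are used here as illustrative names, and the function `validate_cell` is hypothetical; the sketch handles only integers and strings:

```python
import re

def validate_cell(raw, datatype):
    """Return a list of error messages for one cell; an empty list means valid."""
    base = datatype.get("base", "string")
    errors = []
    if base == "integer":
        try:
            value = int(raw)
        except ValueError:
            return [f"{raw!r} is not an integer"]
        if "minimum" in datatype and value < datatype["minimum"]:
            errors.append(f"{value} is below minimum {datatype['minimum']}")
        if "maximum" in datatype and value > datatype["maximum"]:
            errors.append(f"{value} is above maximum {datatype['maximum']}")
    elif base == "string":
        pattern = datatype.get("format")
        # Assumption: the pattern must match the whole cell, not a substring.
        if pattern is not None and re.fullmatch(pattern, raw) is None:
            errors.append(f"{raw!r} does not match pattern {pattern!r}")
    else:
        errors.append(f"unsupported base {base!r} in this sketch")
    return errors

# Hypothetical usage: a bounded integer and an upper-case-only string.
print(validate_cell("42", {"base": "integer", "minimum": 0, "maximum": 10}))
print(validate_cell("QUERCUS", {"base": "string", "format": "[A-Z]+"}))
```

Checks of this kind need only a regular expression engine and basic parsing, which is the contrast being drawn with a full XSD typing mechanism.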
I have no problem with your answer! In this case I would then use as datatype for SPECIES a xsd reference Best wishes On 16/06/15 08:05, Ivan Herman wrote:
|
Yes, this is exactly the kind of usage we had in mind (inspired by you:-).
I am not sure how foreign keys would be playing out for this. My instinctive reaction is to use patterns. |
Foreign Keys would allow you to describe the enumerations in a separate table (for which output could be suppressed). This would allow you to get warnings for any cells which were outside of this range. Note that you can also use a Datatype format, which would allow a certain amount of output datatype fidelity as well. A foreign key violation won't affect the converted datatype, but will generate a warning. A pattern violation will make the output datatype |
On 16/06/15 10:39, Ivan Herman wrote:
Best wishes
|
This feature stems from a use case reported in #212, opened by @robald7. Technically, the issue is whether datatype can have an array of values and not only a single one, where each value is an object with the datatype and the corresponding format.
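To make the proposal concrete, here is a minimal sketch of what processing such an array might look like, assuming the candidates are tried in order and the first one that parses wins. That ordering rule, the function names `parse_as` and `parse_union`, and the use of Python `strptime` directives for `format` are all assumptions made for illustration; none of them is defined by the spec or agreed in this issue:

```python
from datetime import datetime

def parse_as(raw, datatype):
    """Parse raw according to a single datatype object; raise ValueError on mismatch."""
    base = datatype["base"]
    fmt = datatype.get("format")  # assumption: a Python strptime format string
    if base == "integer":
        return int(raw)
    if base == "date":
        return datetime.strptime(raw, fmt or "%Y-%m-%d").date()
    if base == "dateTime":
        return datetime.strptime(raw, fmt or "%Y-%m-%dT%H:%M:%S")
    if base == "string":
        return raw
    raise ValueError(f"unsupported base {base!r} in this sketch")

def parse_union(raw, datatypes):
    """Try each datatype in order; return (base, value) for the first that fits."""
    for datatype in datatypes:
        try:
            return datatype["base"], parse_as(raw, datatype)
        except ValueError:
            continue  # this candidate does not fit; try the next one
    raise ValueError(f"{raw!r} matches none of the declared datatypes")

# Hypothetical column mixing dates and dateTimes.
union = [{"base": "date"}, {"base": "dateTime"}]
print(parse_union("2015-02-19", union))
print(parse_union("2015-02-19T08:48:00", union))
```

With a first-match rule the order of the array matters: placing a permissive type such as string first would swallow every value, so narrower types would need to come earlier.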