
Allowing "unions" of datatypes? #223

Open
iherman opened this issue Feb 16, 2015 · 37 comments

Comments
@iherman
Member

iherman commented Feb 16, 2015

This feature stems from the use case reported in #212, opened by @robald7. Technically, the issue is:

  • Allow a "union" of datatypes for a cell, meaning that the cell's raw value may conform to any one of the constituent datatypes.
  • Syntactically, this would probably mean that datatype can take an array of values rather than only a single one, where each value is an object with the datatype and its corresponding format (see the sketch below).
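
A hypothetical sketch of what the proposed array form might look like in the metadata (the array form itself is the proposal, not the current spec; the column name and formats are illustrative):

```json
{
  "name": "observed",
  "datatype": [
    { "base": "date", "format": "yyyy-MM-dd" },
    { "base": "dateTime", "format": "yyyy-MM-dd'T'HH:mm:ss" }
  ]
}
```
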
@gkellogg
Member

+1, this seems like a reasonable change.

@gkellogg
Member

Although issue #136 suggests making null take only a single value (see #218), and by corollary this would argue against making datatype take an array, in order to simplify merge. This comes down to functionality vs. singularity.

@iherman
Member Author

iherman commented Feb 16, 2015

I am not convinced. I would like to see more use cases requiring it. This is yet more feature creep, and we would need very strong arguments for allowing any more.

So... -1 for the time being.

Ivan

@gkellogg
Member

With simplified merge semantics this reduces the merge burden. If sufficiently motivated, this is reasonable, and it could even reduce the need for multiple null values, since those can be matched using datatype patterns.

@JeniT JeniT added the Blocked label Feb 18, 2015
@JeniT

JeniT commented Feb 18, 2015

@jumbrich to do an analysis to see how common it is to have columns with union values in real data. Will make a decision based on that analysis.

@robald7

robald7 commented Feb 18, 2015

Over many years of dealing with CSV data in the social sciences, I can say that it is rare to find data which has no "funny" values in it. Some examples:

From Education at a Glance Interim Report: Update of Employment and Educational Attainment Indicators © OECD 2015, these symbols and abbreviations are used in the tables and charts:

a Data are not applicable because the category does not apply.
c There are too few observations to provide reliable estimates.
m Data are not available.
n Magnitude is either negligible or zero.
• Values are below a certain reliability threshold and should be interpreted with caution (see the Annex for country-specific definitions).
w Data have been withdrawn at the request of the country concerned.
x Data included in another category or column of the table (e.g. x(2) means that data are included in column 2 of the table).
~ Average is not comparable with other levels of education.

The DfE also uses things like "N/D" (non-disclosed) and "N/A" (not available) in some datasets.

Small comments: it is easy to go automatically from a "complicated" type to a simpler one; the reverse is not as easy. Also, I don't think that a NULL is the same as "SUPP", "N/A", and so on.

I would have no problem dealing with datatypes as described by JeniT, as they seem to me to be a "rewriting" of the W3C XML Schema datatypes.
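
Under the null-list approach later referenced from #218, a legend like the OECD one above could presumably be declared directly on the column. A hedged sketch (the column name is illustrative, the symbols are taken from the legend above; the x(2)-style cross-references would need a pattern rather than fixed strings):

```json
{
  "name": "attainment_rate",
  "datatype": "number",
  "null": ["a", "c", "m", "n", "w", "~"]
}
```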

@iherman
Member Author

iherman commented Feb 19, 2015

I understand. However... isn't it enough to model those 'a'/'c' etc. cases by the simpler approach described in the (accepted) issue #218, i.e., to be able to list a number of strings that act as "null"?

The current issue is whether it is required to have, say, a column whose values can be integers or dates.

Ivan


@robald7

robald7 commented Feb 19, 2015

Hi,
In all my many years of dealing with data, I have never met a situation where, in a column, data could be either an integer or a date, and I think that most people would consider it a very bad idea! But most of the time (if not always), in a column with a basic datatype, say integer, we get integer values, NULL, and codes like NA, NE, SUPP, and so on.
I am not following all the conversations (and the conventions) you have, but what strikes me is that with the XML Schema datatypes it would be possible to deal with all the situations you describe fairly easily.
Best wishes
r,


@iherman
Member Author

iherman commented Feb 19, 2015

On 19 Feb 2015, at 10:00, robald7 notifications@github.com wrote:

> In all my many years of dealing with data, I have never met a situation where, in a column, data could be either an integer or a date, and I think that most people would consider it a very bad idea!

Thanks. That is an important input for our discussion...

> But most of the time (if not always), in a column with a basic datatype, say integer, we get integer values, NULL, and codes like NA, NE, SUPP, and so on.

Right. From our processing and metadata point of view, is it OK to be able to declare:

```json
"null": ["NA", "NE", "SUPP"]
```

that is, those strings are accepted as null values (i.e., no output is generated in the JSON and/or RDF output for those cases, and validators may consider those datasets valid)? Is that enough for your requirements?

Ivan


@robald7

robald7 commented Feb 19, 2015

What I would like to say is that in a column I may have "integer" values or ["NA", "NE", "SUPP"], and that NULL may also be present. That is maybe what you had in mind, but the ":" made me think that you were defining null. Not always, but there are occasions when I may be interested in records where a value has been suppressed, so I would not want them to be turned into a simple null. A NULL value to me means there is no data for whatever reason; the other values give the reasons.
r


@iherman
Member Author

iherman commented Feb 19, 2015

Indeed, defining the null property in the metadata means that the corresponding strings may appear as cell values; however, no triples or key/value pairs are generated (for the RDF and JSON conversions, respectively), and a validator will not raise an error if one of those string values is found in a cell instead of, say, an integer. (I carefully tried to avoid the SQL-like term "null" when describing the effect :-).

I seem to remember we discussed, at some point, whether the original cell value (i.e., not transformed via, say, datatype transformation) should appear as an rdfs:label value in the RDF output. The combination of that with the above might be what you describe...

Ivan


@robald7

robald7 commented Feb 19, 2015

Thanks. If the definition of "null" also covers the empty value, like the second field in the data "10,,2.3", and the validator does not complain, then I am happy, as it will still be up to me to do something (or not) with these values in my applications. And with this facility present, I would be able to tell whoever is providing us with the data that it is not as expected if other values are present, i.e. "retour à l'envoyeur" (return to sender).
bw
r


@gkellogg
Member

@robald7 said:

> In all my many years of dealing with data, I have never met a situation where, in a column, data could be either an integer or a date, and I think that most people would consider it a very bad idea!

Agreed; I don't think that is a legitimate use case. What I have seen is date and dateTime information intermingled in the same column, so having a datatype of ['date', 'dateTime'] would be useful. Your own example of "10,,2.3" could also match ['integer', 'float'], which is where prioritized order becomes useful, as 10 also matches float.
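
A sketch of how prioritized order might look under the proposed array form: a processor would presumably try each datatype in turn and take the first that matches, so listing integer before float is what keeps "10" an integer. This ordering semantics is the proposal under discussion, not current spec behaviour, and the column name is illustrative:

```json
{
  "name": "value",
  "datatype": [
    { "base": "integer" },
    { "base": "float" }
  ]
}
```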

@robald7

robald7 commented Feb 19, 2015

'date' and 'dateTime', maybe, but I think that the person writing the dataset should decide which it is and, since dateTime is more general, could simply add T00:00:00 to data missing the time element to make things simpler.
But in the case of integer and float, 10 should be accepted as a float, but 10.0 should not be accepted as an integer.
r


@danbri
Contributor

danbri commented Feb 25, 2015

@JeniT

JeniT commented Feb 25, 2015

Discussed on the call on 25-02-2015: @jumbrich looked at 9k CSV files, examining different datatypes within single columns. About 1/3 of the documents (2,282) have columns with a mix of string and integer/numeric values.

@JeniT

JeniT commented Mar 4, 2015

http://www.w3.org/2015/03/04-csvw-irc#T15-51-49 - we'll leave this as an issue in the spec but not add support for now

@robald7

robald7 commented Mar 5, 2015

Most data processing I have been engaged in around education over the last 20 years can be approached in three steps: encoding, structure and datatypes, and meaning and linking.

  1. Encoding: mostly solved by using UTF-8 (or converting to it), but every so often there are issues (getting rarer).
  2. Structure and datatypes: before XML we dealt with codebooks, hence a variety of interfaces to build. XML is a good solution (JSON is rarely used, but I see it as the same as XML minus the tools), but there is resistance to using it, and in some ways Excel (the most painful to deal with) and CSV are still the preferred way for civil servants to move data around, mostly because of the tools they use (Excel, SQL Server, no awk, no perl, ...).
  3. Meaning and linking: the future.

So in a way the CSV standard should have the same advantages as XML and make 3) possible. If datatypes don't allow things like "this variable is an integer, or a code among {NA, NE}, or empty", then it is easy to imagine what will happen: the variable will be declared as an integer, it will be flagged by the validator, people will complain, and the government, always obliging, will then say that the variable is of type string.

Small comment about CSV vs. TSV: the comma is not a very good separator; the tab is usually better these days, and any other separator should be allowed.

@JeniT

JeniT commented Apr 29, 2015

@JeniT JeniT self-assigned this Apr 29, 2015
@JeniT

JeniT commented Apr 29, 2015

@JeniT to flag this issue to the group to say that we really want input about whether we do or do not support union datatypes.

@robald7

robald7 commented Apr 29, 2015

Good evening,

I think that the issue is not so much about the union of datatypes but more about whether or not to use the W3C Schema datatypes. That mechanism works well and is fairly complete (it is probably difficult to describe a percentage like 99%, but it is anyway not a very good idea to exchange data with "%" appended). At the end of the day, data in CSV files has to be used, and the more information we have about it the better. Using current tools, it would be fairly easy to validate data specified with a schema, and it could also be possible to "degrade" the validation rules.
In short, I am not quite sure why there is a certain reluctance to accept the description of variables using the W3C Schema datatypes as they are used in other W3C standards.
Best wishes
r,


@gkellogg
Member

We did discuss this on yesterday's call, and there seemed to be a few issues at play. The first two examples seem to use a constrained vocabulary, for which foreign key constraints are a good solution (IMO, anyway). The third example uses mixed values, some from a constrained vocabulary, others integer.

You suggest that you might reference an XSD schema element to provide a mechanism to validate such values, but in our estimation this would add significant complexity to implementations expected to validate cell contents.

The multiple-datatype mechanism would seem to address this use case by allowing multiple classes of data within a cell. For example, this might be integer and string, where the string datatype definition includes a format which is a regular expression used to match values, allowing you to define your constrained vocabulary there.

Another thing discussed would be to allow the datatype base to be set to an arbitrary URL (or prefixed name, where the prefix comes from the CSVW context). This would simply allow a different reference to a datatype, without any validation beyond what is discussed in Formats for other types. The use of a regular expression format could also match these cell types without resorting to multiple datatypes, but you wouldn't have the ability to describe some values as being integer while others are something else.
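
A hedged sketch of that integer-plus-constrained-string combination under the proposed array form (the column name and codes are illustrative; the format regular expression defines the constrained vocabulary):

```json
{
  "name": "pupils",
  "datatype": [
    { "base": "integer" },
    { "base": "string", "format": "SUPP|NE|NA" }
  ]
}
```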

@robald7

robald7 commented Apr 30, 2015

Thank you for this reply.

On 30/04/15 16:18, Gregg Kellogg wrote:

> The first two examples (https://lists.w3.org/Archives/Public/public-csv-wg-comments/2015Apr/0002.html) seem to use a constrained vocabulary, for which foreign key constraints are a good solution (IMO, anyway). The third example uses mixed values, some from a constrained vocabulary, others integer.

As you say, a "foreign key" is a possible solution, but not ideal. For instance with the LA it is not so bad, as I can create a table with 2 columns: the code and the name of the LA; in the second example the table would have only one column, but why not? In a way, I prefer to think of the set of values as defining the type of the variable, rather than as keys.

> You suggest that you might reference an XSD schema element to provide a mechanism to validate such values, but in our estimation this would add significant complexity to implementations expected to validate cell contents.

On this one, I am not at all convinced! Without writing anything new, it could be done relatively easily: to start with, perl/awk and xmllint for instance, later maybe in a single program. Also, there already exist XSD schemas for transferring XML data; they could be reused when producing CSV files.

> The multiple-datatype mechanism would seem to address this use case by allowing multiple classes of data within a cell. For example, this might be integer and string, where the string datatype definition includes a format which is a regular expression used to match values, allowing you to define your constrained vocabulary there.

I agree that multiple datatypes and regular expressions could do the job; in fact I would then rewrite the 3 examples as follows:

  1. as string with format (201|202|....|938|LA|NAT)
  2. as string with format (BOYS|GIRLS|MIXED)
  3. as decimal + string with format (SUPP|NE)

Maybe a naive question, but why not have two (or more) ways to define datatypes?

Best wishes
r,

@iherman
Member Author

iherman commented May 1, 2015

On 30 Apr 2015, at 18:02, robald7 notifications@github.com wrote:

> On this one, I am not at all convinced! Without writing anything new, it could be done relatively easily: to start with, perl/awk and xmllint for instance, later maybe in a single program. Also, there already exist XSD schemas for transferring XML data; they could be reused when producing CSV files.

A validating implementation is supposed to look at the data and the metadata and decide whether the data is valid. Such an implementation may run in Python, JavaScript, Ruby, whatever, on very different operating systems or in the web browser. I.e., perl, awk, or xmllint are not viable options; the only option would be to perform XML Schema validation on the data. This means having access to XML Schema validation (I do not know of any browser doing that, for example), i.e., incorporating a (potentially large) library or writing one's own XML Schema validation procedure for a relatively minor use case.

@robald7

robald7 commented May 1, 2015

Thank you for this reply. I think I understand what you say, but there are some small points that I wish to make.
I have used CSV files for many years and created many sites displaying such data; it would not come to my mind to send a CSV file to the browser and then do some JS work (parsing/validation/...) to display it. More likely, I would validate the data on the server, transform it into JSON (I have once or twice sent XML data to the browser), and send the result to the browser. I don't think that many people intend to send CSV data to the browser, but that is simply an opinion, not something based on facts.
I chose to mention awk/perl and xmllint because they are the ones I use all the time, and they work well. I am sure that other languages have similar possibilities. And I really meant that these tools could be used in the first instance, before a proper csvlint exists.
It is maybe a minor use case, but a very frequent one! For instance, statisticians describing data in SPSS could have, for age, integer values plus -99 "DNK" and -98 "NA". Ignoring that information could give funny results.

This being said, the main point is to have a way to describe the data as well as possible, as that is what is really important! XSD datatypes are quite complete, but if the CSV specification comes close to them without using them, why not?

Best wishes
r.

Having written that, I have only just noticed the existence of ODI's csvlint. I shall look at it, as I suppose it is the model implementation you are aiming at?


@JeniT

JeniT commented May 3, 2015

Asked for general opinion from the list: https://lists.w3.org/Archives/Public/public-csv-wg/2015May/0002.html

@JeniT JeniT removed their assignment May 3, 2015
@rufuspollock

-1. This is additional complexity for an edge case. People can try using "any" or similar, and we can review again in future revisions of the spec if it proves a major issue.

@danbri
Contributor

danbri commented May 6, 2015

@rgrp I'm not sure yet whether I view it as additional complexity, or as pragmatism that acknowledges what many real-world tabular data files look like in practice.

Could we e.g. tighten up your "People can try using 'any'" into something implementable, and mark it as an at-risk / experimental feature, rather than leaving it completely open?

@robald7

robald7 commented May 6, 2015

Good morning,

In my opinion, having a type like "any" or "string" for when one is unable, for whatever reason, to indicate a better type will not encourage producers of data (and here I am specifically thinking of my experience working with government data in the UK) to make the effort to describe the data better. The CSV standard will then be mostly metadata describing the origin, etc. (which is needed and good), while the description of the data itself will be mostly "string" (with or without a pattern) and possibly "date". And for me, once again, it is much easier not to use an existing feature than to do without it entirely.
Best wishes
r,


@robald7

robald7 commented Jun 15, 2015

Good evening,

I have looked at use case 11 (Palo Alto trees), and I wonder what would be for you the best way to define the datatype for the variable "SPECIES"? Besides genuine species and empty cells, there are values like "UNKNOWN SPP", "Vacant site (large/medium) tree", "private tree", "OBSOLETE SITE", and "unknown".
Having taken another look at the published documents, I now realise that a note says XSD/OWL could be used. With XSD, it would be possible to manage the SPECIES variable as being either proper species or funny values. Would it not be better to have it "natively"?
Best wishes
r,

@iherman
Member Author

iherman commented Jun 16, 2015

On 15 Jun 2015, at 20:32, robald7 notifications@github.com wrote:

> [...] With XSD, it would be possible to manage the SPECIES variable as being either proper species or funny values. Would it not be better to have it "natively"?

I do not know what you mean by 'natively'. Can you explain?

Ivan


@robald7

robald7 commented Jun 16, 2015

Good morning,
"Native" here would mean "JSON-like".
Sorry not to be clearer: I meant that while it would be possible to describe the species (on their own) without having to resort to XSD, i.e. as a foreign key (though I don't like that too much :-) ) or as a string with pattern matching (Cryptomeria japonica|Myoporum insulare|...), handling the funny values would mean resorting in some way to XSD. So my question was: is there any hope of having it done without going outside?
While I have pushed to have XSD types, and I am glad that they are allowed, I have found that to validate datatypes which can be "native" or "XSD", the best way (for me) would be to convert the "JSON-like" form to XSD and then use a tool like xmllint.
To complicate things a bit further, it is also clear that among the funny values some are probably mistakes and some are "genuine" (unknown).
Best wishes
r,


@iherman
Member Author

iherman commented Jun 16, 2015

On 16 Jun 2015, at 08:35, robald7 notifications@github.com wrote:

> [...] So my question was: is there any hope of having it done without going outside?

The current spec already goes beyond the pure XSD offering, in that the various restrictions (min/max values, regexp patterns, etc.) are typically the features people use to refine the core types such as integers or strings. I.e., CSV validators can already check values using those 'extended' datatypes. I do not think it is justified to go beyond that, i.e., to reproduce the full XSD typing mechanism in our specifications.

Ivan
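
For reference, a minimal sketch of the kind of 'extended' datatype Ivan refers to, refining a core type with restrictions (the column name and bounds are illustrative; minimum and maximum are CSVW datatype description properties):

```json
{
  "name": "age",
  "datatype": {
    "base": "integer",
    "minimum": 0,
    "maximum": 120
  }
}
```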

@robald7

robald7 commented Jun 16, 2015

I have no problem with your answer!

In this case I would then use, as the datatype for SPECIES, an XSD reference which would be the union of an enumeration of species and a funny value "unknown", with the possibility for the element to be nil. All other values would be reported as errors. Does that seem OK to you?
Having looked at most of the use cases, it strikes me that a lot of them have variables whose values belong to "controlled lists". Do you think that treating them as foreign keys or as strings with patterns is the best way to deal with them?

Best wishes
r,


@iherman
Member Author

iherman commented Jun 16, 2015

On 16 Jun 2015, at 09:57, robald7 notifications@github.com wrote:

> In this case I would then use, as the datatype for SPECIES, an XSD reference which would be the union of an enumeration of species and a funny value "unknown", with the possibility for the element to be nil. All other values would be reported as errors. Does that seem OK to you?

Yes, this is exactly the kind of usage we had in mind (inspired by you :-).

> Having looked at most of the use cases, it strikes me that a lot of them have variables whose values belong to "controlled lists". Do you think that treating them as foreign keys or as strings with patterns is the best way to deal with them?

I am not sure how foreign keys would play out for this. My instinctive reaction is to use patterns.

@gkellogg
Member

Foreign keys would allow you to describe the enumerations in a separate table (for which output could be suppressed). This would let you get warnings for any cells that fall outside the allowed range. Note that you can also use a datatype format, which would give a certain amount of output datatype fidelity as well. A foreign key violation won't affect the converted datatype but will generate a warning; a pattern violation will make the output datatype string, and will also generate a warning.

@robald7

robald7 commented Jun 16, 2015

On 16/06/15 10:39, Ivan Herman wrote:

> I am not sure how foreign keys would play out for this. My instinctive reaction is to use patterns.

If I had a variable "gender" with values 1 and 2, I could have another CSV file with 2 columns and 3 rows (value, label; 1, Male; 2, Female), declare the datatype for gender to be string, and say that gender is a foreign key (the values being indexes into this new file).
While possible, this could lead to a multiplication of such files. And if I am using a pattern, how do I say that 1 stands for Male? So in fact I would probably decide to go all the way and use XSD schemas, as I would be able to use annotations on the values and have only one file with all my types.

Best wishes
r,
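
A sketch of how the gender example above might be expressed with the CSVW foreign-key and suppressOutput mechanisms from the tabular metadata vocabulary (the file and column names are illustrative):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "survey.csv",
      "tableSchema": {
        "columns": [
          { "name": "gender", "datatype": "string" }
        ],
        "foreignKeys": [
          {
            "columnReference": "gender",
            "reference": { "resource": "genders.csv", "columnReference": "value" }
          }
        ]
      }
    },
    {
      "url": "genders.csv",
      "suppressOutput": true,
      "tableSchema": {
        "columns": [
          { "name": "value", "datatype": "string" },
          { "name": "label", "datatype": "string" }
        ],
        "primaryKey": "value"
      }
    }
  ]
}
```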


