a possible use case and comments #212

robald7 · 2015-02-15T17:50:45Z

Every year, the Department for Education in England publishes the Attainment and Achievement Tables (AAT), also known as League Tables, see

http://www.education.gov.uk/schools/performance/

From this site, it is also possible to download various datasets.

http://www.education.gov.uk/schools/performance/download_data.html

It is on these datasets that I would like to comment, and to suggest what, at least to us, would be very useful for further processing.

For each local education authority, there are 8 data files. The first record in each of these files lists the variables. A quick examination shows that the variables have the following types: id, string, integer, float, date, percentage, and enumeration. Note that in most files the "%" sign is appended to the number, but in at least one file it is not. The scale seems to be important (i.e., one digit after the decimal point). Also, in many cases there is a code rather than a number, such as "SUPP" (data was suppressed for some reason) or "NE" (not entered).
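To make the cell shapes described above concrete, here is a minimal Python sketch of reading one such percentage cell. The function name and the set of codes are assumptions for illustration only (the codes are just the two mentioned above, not a complete list from the DfE files). `Decimal` is used so that the scale, such as one digit after the decimal point, is preserved.

```python
from decimal import Decimal
from typing import Union

# Assumed for illustration: only the two codes mentioned in the text above.
SPECIAL_CODES = {"SUPP", "NE"}


def parse_percentage_cell(raw: str) -> Union[Decimal, str, None]:
    """Parse one cell that holds a percentage, a special code, or nothing.

    Returns a Decimal for numeric cells (with or without a trailing "%"),
    the code itself for special codes, and None for empty cells.
    """
    text = raw.strip()
    if text == "":
        return None
    if text in SPECIAL_CODES:
        return text
    if text.endswith("%"):
        # Some files append "%", at least one does not; accept both.
        text = text[:-1].strip()
    # Raises decimal.InvalidOperation for anything that is not a number.
    return Decimal(text)
```

With this sketch, `parse_percentage_cell("85.0%")` and `parse_percentage_cell("85.0")` give the same `Decimal` value, and converting it back with `str()` keeps the trailing zero.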

There is also a section describing the data:

http://www.education.gov.uk/schools/performance/metadata.html

These descriptions, while coming from one place and being about the same topic (AAT), are not consistent.

This last remark suggests to me that the first aim in describing data coming from a single place should be consistency, and that since data will be coming from many different places, the need for standards is obvious. This explains our interest in the work that the W3C is doing on the CSV format (or TSV, which in many ways is much more practical, though that is not a big issue).

When dealing with this type of data, it is important to know what possible values the variables can take. This is why the type should ideally be more specific than "string", as all CSV data is a string anyway! I find that the idea of "@type" in JSON-LD 1.0 does that pretty well, and it allows for custom types if need be.

The link to http://infotap.sda-ltd.com/dfe-like.html simply puts together in one place various files (the originals and some created by me more or less automatically).

It is clear to me that this data, if described the W3C way plus the "JSON" description of type, would be very useful. The only thing I find impossible to deal with is the "%" sign in the data (it is also not very nice in the usual spreadsheets), so I simply delete it from all the data files.

JeniT · 2015-02-15T22:08:54Z

Thanks for these pointers. There seem to be a few things that you're pointing at in this use case:

1. the need to provide a datatype for each cell; we do this with the datatype property (http://w3c.github.io/csvw/metadata/#cell-datatype)
2. the need to support percentages; we do allow numeric values to include percentages, see formats for numeric types (http://w3c.github.io/csvw/metadata/#formats-for-numeric-types)
3. the need to support enumerated values; this is best done by listing the values in a separate CSV file and referencing it using a foreign key (http://w3c.github.io/csvw/metadata/#table-foreignKeys)
4. the need for cells to be able to take either the normal type of value or a special value such as SUPP or NA or NE

The last of these is the only one that I think we don't currently support; I've opened #218 to discuss that separately.
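As an illustration of the enumerated-values point above, the following Python sketch checks a data column against a separate CSV file of allowed values, in the spirit of a foreign key. It is not an implementation of the CSVW metadata itself; the function names, column names, and sample rows are all hypothetical.

```python
import csv
import io
from typing import List, Set, Tuple


def allowed_codes(lookup_csv: str, key_column: str) -> Set[str]:
    """Collect the permitted values from the lookup table's key column."""
    reader = csv.DictReader(io.StringIO(lookup_csv))
    return {row[key_column] for row in reader}


def invalid_rows(data_csv: str, column: str, allowed: Set[str]) -> List[Tuple[int, str]]:
    """Return (line number, value) for each cell not found in the lookup table.

    Line numbers count the header as line 1, so the first data row is line 2.
    """
    reader = csv.DictReader(io.StringIO(data_csv))
    return [
        (line_no, row[column])
        for line_no, row in enumerate(reader, start=2)
        if row[column] not in allowed
    ]


# Hypothetical sample data, not taken from the DfE files.
SCHOOL_TYPES = "code,label\nAC,Academy\nCY,Community school\n"
SCHOOLS = "urn,school_type\n100001,AC\n100002,XX\n100003,CY\n"

codes = allowed_codes(SCHOOL_TYPES, "code")
problems = invalid_rows(SCHOOLS, "school_type", codes)
```

Here `problems` lists the one row whose `school_type` does not appear in the lookup file.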

Let me know if I've missed anything.

robald7 · 2015-02-16T09:11:20Z

Thanks for your reply

I have no problem with what you write, but it seems to me that using the W3C XSD datatypes is a good way to deal with all these points: the datatypes are very complete, unions of types make it possible to deal with 4), and NULL is dealt with too, without being a special funny value. In addition, there are lots of tools around.

One thing to note is that these special values may differ from variable to variable. 4) is very common when dealing with CSV data; think of "MISSING VALUES" in SPSS, for example.
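The point that special values may differ from variable to variable can be sketched in Python as a per-column table of missing-value codes, loosely in the manner of SPSS "MISSING VALUES" declarations. The column names and code assignments below are hypothetical, chosen only to show that the same string can be a missing marker in one column and ordinary data in another.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical per-variable missing-value codes.
MISSING_CODES: Dict[str, Set[str]] = {
    "pct_achieving": {"SUPP", "NE"},
    "pupil_count": {"NA"},
}


def split_missing(column: str, raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (value, missing_reason) for one cell.

    If the cell holds one of the codes declared for this column, the value
    is None and the code is kept as the reason; otherwise the cell text is
    returned unchanged with no reason.
    """
    text = raw.strip()
    if text in MISSING_CODES.get(column, set()):
        return None, text
    return text, None
```

Keeping the code as a separate "reason" preserves the distinction between, say, suppressed and not-entered data, rather than collapsing both into a single NULL.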
Best wishes
r


iherman · 2015-02-16T10:37:09Z

I think there is a fifth issue for this case:

define a "union" of datatypes, i.e., some sort of logical disjunction: "the cell value can have one of these types"

If this is indeed the case, this should be migrated into a separate issue.
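One way to picture such a union is the following Python sketch, which tries a list of member-type parsers in order and accepts the first that succeeds. This is only an assumption about how a union could behave (first match wins, as in XSD union types), not something the group has specified; all function names are illustrative.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Any, Callable, Sequence


def parse_integer(text: str) -> int:
    return int(text)


def parse_decimal(text: str) -> Decimal:
    return Decimal(text)


def parse_iso_date(text: str) -> date:
    return date.fromisoformat(text)


def parse_union(text: str, parsers: Sequence[Callable[[str], Any]]) -> Any:
    """Return the result of the first parser that accepts the text.

    Order matters: "12" is both a valid integer and a valid decimal, so the
    member listed first decides which type the value gets.
    """
    for parser in parsers:
        try:
            return parser(text)
        except (ValueError, InvalidOperation):
            continue
    raise ValueError(f"{text!r} matches none of the member types")


MEMBERS = [parse_integer, parse_decimal, parse_iso_date]
```

A value such as "SUPP" matches none of these members and raises `ValueError`, which is where a special-value mechanism would have to step in.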

Ivan


JeniT · 2015-02-16T10:40:55Z

@iherman OK, are you going to do that?

iherman · 2015-02-16T11:00:55Z

Yes I will


iherman · 2015-03-08T13:09:00Z

Closing the issue; the essence has migrated to issue #223.

JeniT mentioned this issue Feb 15, 2015

Categories of null values #218

Closed

JeniT added the Use Case Document label Feb 15, 2015

iherman mentioned this issue Feb 16, 2015

Allowing "unions" of datatypes? #223

Open

iherman closed this as completed Mar 8, 2015
