a possible use case and comments #212

Closed
robald7 opened this issue Feb 15, 2015 · 6 comments

Comments

robald7 commented Feb 15, 2015

Every year, the Department for Education in England publishes the Attainment and Achievement Tables (AAT), also known as League Tables, see

http://www.education.gov.uk/schools/performance/

From this site, it is also possible to download various datasets.

http://www.education.gov.uk/schools/performance/download_data.html

It is on these datasets that I would like to comment, and to suggest what, at least to us, would be very useful for further processing.

For each local education authority, there are 8 data files. The first record in each of these files lists the variables. A quick examination shows that the variables have the following types: id, string, integer, float, date, percentage, and enumeration. Note that in most files the "%" sign is appended to the number, but in at least one of them it is not. The scale seems to be important (i.e., one digit after the decimal point), and in many cases there is a code rather than a number, such as "SUPP" (data was suppressed for some reason) or "NE" (not entered).

There is also a section

http://www.education.gov.uk/schools/performance/metadata.html
describing the data. These descriptions, while coming from one place and being about the same topic (AAT), are not consistent.

This last remark suggests to me that the first aim in describing data coming from a single place should be consistency, and that, since data will be coming from many different places, the need for standards is obvious. This explains our interest in the work that the W3C is doing on the CSV format (or TSV, which in many ways is much more practical, though that is not a big issue).

When dealing with this type of data, it is important to know what possible values the variables can take. This is why the type should ideally be more specific than "string", since all CSV data is a string anyway! I find that the idea of "@type" in JSON-LD 1.0 does that pretty well, and it allows for defining one's own types if need be.

The link to http://infotap.sda-ltd.com/dfe-like.html is simply to put together in one place various files (the originals and some created by me more or less automatically).

It is clear to me that this data, if described the W3C way plus the "JSON" description of type, would be very useful. The only thing which I find impossible to deal with is the "%" sign in the data (it is also not very nice with the usual spreadsheets), so I simply delete it from all the data files.
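
As an alternative to deleting the "%" sign from the files, a reader can strip it at parse time. The following is a minimal sketch, not part of the original datasets or of any W3C tooling; the function name `parse_percentage` is hypothetical. It accepts the value with or without the trailing "%" and uses `Decimal` so that the written scale (one digit after the decimal point) is preserved.

```python
from decimal import Decimal, InvalidOperation


def parse_percentage(cell: str) -> Decimal:
    """Parse '12.5%' or '12.5' into a Decimal, keeping the written scale."""
    text = cell.strip()
    # Some files append "%", at least one does not: accept both forms.
    if text.endswith("%"):
        text = text[:-1].strip()
    try:
        return Decimal(text)
    except InvalidOperation:
        # Codes such as "SUPP" or "NE" are not numbers and end up here.
        raise ValueError(f"not a percentage: {cell!r}") from None
```

With this approach the source files stay untouched, and codes such as "SUPP" surface as errors that the caller can handle separately.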

JeniT commented Feb 15, 2015

Thanks for these pointers. There seem to be a few things that you're pointing at in this use case:

  1. the need to provide a datatype for each cell; we do this with the datatype property (http://w3c.github.io/csvw/metadata/#cell-datatype)
  2. the need to support percentages; we do allow numeric values to include percentages, see formats for numeric types (http://w3c.github.io/csvw/metadata/#formats-for-numeric-types)
  3. the need to support enumerated values; this is best done by listing the values in a separate CSV file and referencing it using a foreign key (http://w3c.github.io/csvw/metadata/#table-foreignKeys)
  4. the need for cells to be able to take either the normal type of value or a special value such as SUPP or NA or NE

The last of these is the only one that I think we don't currently support; I've opened #218 to discuss that separately.

Let me know if I've missed anything.
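
The foreign-key approach to enumerated values (point 3) can be illustrated with a small check. This is only a sketch under stated assumptions: the column names, the codes "AC", "CY" and "XX", and the helper `check_foreign_key` are made up for illustration and do not come from the AAT files or from the CSVW specification.

```python
import csv
import io


def check_foreign_key(data_rows, column, reference_rows, reference_column):
    """Return (row number, value) for every value missing from the reference table."""
    allowed = {row[reference_column] for row in reference_rows}
    return [
        (number, row[column])
        for number, row in enumerate(data_rows, start=1)
        if row[column] not in allowed
    ]


# Hypothetical reference table listing the allowed enumeration values.
SCHOOL_TYPES_CSV = "code,label\nAC,Academy\nCY,Community school\n"
# Hypothetical data table whose school_type column refers to that list.
SCHOOLS_CSV = "urn,school_type\n100001,AC\n100002,CY\n100003,XX\n"

school_types = list(csv.DictReader(io.StringIO(SCHOOL_TYPES_CSV)))
schools = list(csv.DictReader(io.StringIO(SCHOOLS_CSV)))

# The third data row uses a code that the reference table does not list.
violations = check_foreign_key(schools, "school_type", school_types, "code")
```

Keeping the allowed values in their own CSV file means the list can be published and described like any other table.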

robald7 commented Feb 16, 2015

Thanks for your reply

I have no problem with what you write, but it seems to me that the use of
the W3C XSD datatypes is a good way to deal with all these points: very
complete datatypes, the possibility of a union of types to deal with 4), and
also dealing with NULL, which is not a special funny value. In addition,
there are lots of tools around.
One thing to note is that these special values may differ from variable
to variable. 4) is very common when dealing with CSV data; think of
"MISSING VALUES" in SPSS, for example.
Best wishes
r
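
The remark that special values may differ from variable to variable can be sketched as a per-column table of missing-value codes, loosely in the spirit of SPSS "MISSING VALUES". The column names and the assignment of codes to columns below are assumptions for illustration only; just the codes "SUPP", "NE" and "NA" appear in this thread.

```python
# Hypothetical mapping from column name to the codes that mean
# "no value" in that column; other columns have no special codes.
MISSING_CODES = {
    "pct_achieving": {"SUPP", "NE"},
    "pupil_count": {"NA"},
}


def split_missing(column, cell):
    """Return (value, reason): exactly one of the two is None."""
    text = cell.strip()
    if text in MISSING_CODES.get(column, set()):
        # The cell holds a missing-value code for this particular column.
        return None, text
    # Otherwise pass the text on to the column's normal datatype parser.
    return text, None
```

Because the lookup is keyed by column, a string that is a missing-value code in one variable is still treated as ordinary data in another.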

iherman commented Feb 16, 2015

I think there is a fifth issue for this case:

  • define a "union" of datatypes, i.e., some sort of logical disjunction: "the cell value can have one of these types"

If this is indeed the case, this should be migrated into a separate issue.

Ivan
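
One way to picture such a union is to try each member type in turn and accept the first that parses. This is a minimal sketch, not a proposal for the metadata syntax; `parse_union` and the member parsers are hypothetical names, and trying members in order is an assumption of this sketch.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def parse_integer(text):
    return int(text)


def parse_decimal(text):
    try:
        return Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a decimal: {text!r}") from None


def parse_iso_date(text):
    return date.fromisoformat(text)


def parse_union(cell, parsers):
    """Return the result of the first member parser that accepts the cell."""
    for parser in parsers:
        try:
            return parser(cell)
        except ValueError:
            continue  # this member type does not match; try the next one
    raise ValueError(f"{cell!r} matches none of the member types")
```

For example, `parse_union("12.5", [parse_integer, parse_decimal, parse_iso_date])` falls through the integer parser and is accepted as a decimal. The order of the members matters, since a value such as "12" would satisfy both the integer and the decimal parser.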

JeniT commented Feb 16, 2015

@iherman OK, are you going to do that?

iherman commented Feb 16, 2015

Yes I will

iherman commented Mar 8, 2015

Closing the issue; the essence has migrated to issue #223.

iherman closed this as completed Mar 8, 2015