From 9e76cfb66b3414d096560a85d00567ff5eee1ba5 Mon Sep 17 00:00:00 2001
From: Jeni Tennison
Date: Wed, 23 Sep 2015 17:57:33 +0100
Subject: [PATCH] initial primer work

---
 primer/index.html | 280 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 280 insertions(+)
 create mode 100644 primer/index.html

diff --git a/primer/index.html b/primer/index.html
new file mode 100644
index 00000000..eceec613
--- /dev/null
+++ b/primer/index.html
@@ -0,0 +1,280 @@
+ + + + + + Tabular Data on the Web: A Primer + + + + + + + + +
+

+ CSV is one of the most popular formats for publishing data on the web. It is concise, easy for both humans and computers to understand, and aligns nicely with the tabular nature of most data. +

+

+ But CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or mismatching formats. +

+

+ The CSV on the Web Working Group has developed standard ways to express useful metadata about CSV files and other kinds of tabular data. This primer takes you through the ways in which these standards work together, covering: +

+ +

+ Where possible, this primer links back to the normative definitions of terms and properties in the standards. Nothing in this primer overrides those normative definitions. +

+
+
+

+ The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g., RDF, JSON, or XML)". This non-normative document is a primer for new readers that describes how these standards work together. The normative standards are: +

+ +
+
+

What is tabular data and CSV?

+

+ Tabular data is any data that can be arranged in a table, like the one here: +

+ + + + + +
      | column 1 | column 2 | column 3
row 1 | cell in column 1 and row 1 | cell in column 2 and row 1 | cell in column 3 and row 1
row 2 | cell in column 1 and row 2 | cell in column 2 and row 2 | cell in column 3 and row 2
row 3 | cell in column 1 and row 3 | cell in column 2 and row 3 | cell in column 3 and row 3
+

+ There are lots of syntaxes for expressing tabular data on the web. You can put it in HTML tables, pass it around as Excel spreadsheets, or store it in a SQL database. +

+

+ One easy way to pass around tabular data is as CSV (comma-separated values). A CSV file writes each row on a separate line, and each cell is separated from the next with a comma. The values of cells can be written with double quotes around them; this is necessary when a cell value contains a line break or a comma. So the tabular data above can be expressed in CSV as: +

+
+cell in column 1 and row 1,cell in column 2 and row 1,cell in column 3 and row 1
+cell in column 1 and row 2,cell in column 2 and row 2,cell in column 3 and row 2
+cell in column 1 and row 3,cell in column 2 and row 3,cell in column 3 and row 3
+
+

+ or, with double quotes around cell values: +

+
+"cell in column 1 and row 1","cell in column 2 and row 1","cell in column 3 and row 1"
+"cell in column 1 and row 2","cell in column 2 and row 2","cell in column 3 and row 2"
+"cell in column 1 and row 3","cell in column 2 and row 3","cell in column 3 and row 3"
+
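+ As a rough illustration of the quoting rule described above, the following sketch parses two rows with Python's standard csv module. The sample values are made up for this example and are not part of the tabular data shown earlier; the point is only that a quoted cell may contain a comma or a line break without being split.

```python
import csv
import io

# Hypothetical sample: in the first row, the second cell contains a comma
# and the third cell contains a line break, so both are quoted.
sample = (
    'plain value,"value, with comma","value with\nline break"\n'
    '"a","b","c"\n'
)

# newline="" leaves line endings untouched so the csv module can handle
# line breaks inside quoted cells itself.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for row in rows:
    print(row)
```

+ Both rows come back with three cells each, and the quotes themselves are not part of the cell values.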
+

+ CSV files usually have an additional row at the top called a header row, which gives human-readable names or titles for each of the columns. Here is a sample CSV file that contains a header row: +

+
+"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"
+"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
+"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
+"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"
+
+

+ Column titles are a type of annotation on a column, not part of the data itself. For example, they don't count when you're counting the rows in a table: +

+ + + + + + +
       | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 | column 7
titles | country | country group | name (en) | name (fr) | name (de) | latitude | longitude
row 1  | at | eu | Austria | Autriche | Österreich | 47.6965545 | 13.34598005
row 2  | be | eu | Belgium | Belgique | Belgien | 50.501045 | 4.47667405
row 3  | bg | eu | Bulgaria | Bulgarie | Bulgarien | 42.72567375 | 25.4823218
+
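+ The distinction between column titles and data rows can be sketched in Python. This is a minimal example that assumes the first line of the file is always a header row; the helper name read_table is hypothetical and not defined by any of the standards.

```python
import csv
import io

countries_csv = (
    '"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"\n'
    '"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"\n'
    '"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"\n'
    '"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"\n'
)


def read_table(text):
    """Split CSV text into column titles and data rows (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    titles = next(reader)  # header row: annotations on the columns, not data
    rows = list(reader)    # everything after the header is data
    return titles, rows


titles, rows = read_table(countries_csv)
print(len(titles), "columns;", len(rows), "rows")  # prints "7 columns; 3 rows"
```

+ The file has four lines, but the table has only three rows because the header row is not counted.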
+
+

How can you provide metadata for CSV?

+

+ You can provide metadata about CSV files using a JSON metadata file. If you're just providing metadata about one file, the easiest thing to do is to name the metadata file by adding -metadata.json to the end of the name of the CSV file. For example, if your CSV file is called countries.csv then call the metadata file countries.csv-metadata.json. +

+

+ The simplest metadata file you can create contains a single table description and looks like: +

+
+{
+  "@context": "http://www.w3.org/ns/csvw",
+  "url": "countries.csv"
+}
+      
+

+ Metadata files must always include the @context property with that value: this enables implementations to tell that these are CSV metadata files. The url property points to the CSV file that the metadata file describes. +

+

+ By default, if implementations can't find a metadata file by appending -metadata.json to the filename of the CSV file, they'll just look for a file called csv-metadata.json in the same directory. +
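+ The two default locations described above can be sketched as follows. This is only an illustration: the function name candidate_metadata_urls and the example.org URL are hypothetical, the sketch assumes the CSV URL has no query string or fragment, and the normative standards define the full lookup process, which may involve more than these two steps.

```python
from urllib.parse import urljoin


def candidate_metadata_urls(csv_url):
    """Return the default metadata locations for a CSV file, in lookup order.

    Hypothetical helper covering only the two locations named in this primer.
    """
    return [
        # 1. The CSV file's own name with -metadata.json appended.
        csv_url + "-metadata.json",
        # 2. A file called csv-metadata.json in the same directory.
        urljoin(csv_url, "csv-metadata.json"),
    ]


for url in candidate_metadata_urls("http://example.org/data/countries.csv"):
    print(url)
```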

+

+ Metadata files can also describe several CSV files at once, using a slightly different syntax: +

+
+{
+  "@context": "http://www.w3.org/ns/csvw",
+  "tables": [{
+    "url": "countries.csv"
+  }, {
+    "url": "country-groups.csv"
+  }, {
+    "url": "unemployment.csv"
+  }]
+}
+      
+

+ Here, the tables property holds an array of table descriptions, each with the URL of the CSV file that it's describing. The metadata file as a whole describes a group of tables. +
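+ If you generate metadata files programmatically, the two shapes shown above (a single table description, or a group of tables) could be produced along these lines. The helper name build_metadata is hypothetical; this sketch only fills in the @context and url properties and does nothing else.

```python
import json

CSVW_CONTEXT = "http://www.w3.org/ns/csvw"


def build_metadata(csv_urls):
    """Build a minimal metadata document for one or more CSV files.

    Hypothetical helper: one URL gives a single table description,
    several URLs give a group of tables under the "tables" property.
    """
    if not csv_urls:
        raise ValueError("at least one CSV URL is required")
    if len(csv_urls) == 1:
        return {"@context": CSVW_CONTEXT, "url": csv_urls[0]}
    return {
        "@context": CSVW_CONTEXT,
        "tables": [{"url": url} for url in csv_urls],
    }


group = build_metadata(["countries.csv", "country-groups.csv", "unemployment.csv"])
print(json.dumps(group, indent=2))
```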

+
+
+

What kind of information can you provide about a CSV file?

+

+ The description of a table within a metadata file can include: +

+ +

+ We'll come on to how to define the structure of the CSV file, and how to transform CSV into other formats, later. For now, let's look at the other metadata that you can provide about a CSV file. Here's an example: +

+
+{
+  "@context": "http://www.w3.org/ns/csvw",
+  "dc:title": "Unemployment in Europe (monthly)",
+  "dc:description": "Harmonized unemployment data for European countries.",
+  "dc:creator": "Eurostat",
+  "tables": [{
+    "url": "countries.csv",
+    "dc:title": "Countries"
+  }, {
+    "url": "country-groups.csv",
+    "dc:title": "Country groups"
+  }, {
+    "url": "unemployment.csv",
+    "dc:title": "Unemployment (monthly)",
+    "dc:description": "The total number of people unemployed"
+  }]
+}
+
+

+ This example uses Dublin Core as a vocabulary for providing metadata. You can tell that's the vocabulary that's being used because the terms like dc:title and dc:description begin with the prefix dc, which stands for Dublin Core. +

+

+ There are several different metadata vocabularies in common use around the web. Some people use Dublin Core. Some people use schema.org. Some people use DCAT. All of these vocabularies can be used independently or together. A publisher could alternatively use: +

+
+{
+  "@context": "http://www.w3.org/ns/csvw",
+  "schema:name": "Unemployment in Europe (monthly)",
+  "schema:description": "Harmonized unemployment data for European countries.",
+  "schema:creator": { "schema:name": "Eurostat" },
+  "tables": [{
+    "url": "countries.csv",
+    "schema:name": "Countries"
+  }, {
+    "url": "country-groups.csv",
+    "schema:name": "Country groups"
+  }, {
+    "url": "unemployment.csv",
+    "schema:name": "Unemployment (monthly)",
+    "schema:description": "The total number of people unemployed"
+  }]
+}
+
+

+ It's not clear at the moment which metadata vocabulary will give publishers the most benefits. Search engines are likely to recognise schema.org. RDF-based systems are more likely to recognise Dublin Core. +

+

+ More generally, you can use prefixed properties like these on any of the objects in a metadata document. The prefixes that are recognised are those used in the RDFa 1.1 Initial Context. Other properties must be named with full URLs. +
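+ One way to check which prefixed properties a metadata document uses is to walk the parsed JSON, as in the sketch below. The function name prefixed_properties is hypothetical, KNOWN_PREFIXES is only a small subset of the RDFa 1.1 Initial Context chosen for illustration, and the ex:note property is invented to show what an unrecognised prefix looks like.

```python
import json

# Assumption: a small illustrative subset of the RDFa 1.1 Initial Context,
# not the complete list of recognised prefixes.
KNOWN_PREFIXES = {"dc", "schema", "dcat"}


def prefixed_properties(node, found=None):
    """Collect property names of the form prefix:name from parsed JSON."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            # Skip keywords such as @context and properties named with full URLs.
            if ":" in key and not key.startswith("@") and "://" not in key:
                found.add(key)
            prefixed_properties(value, found)
    elif isinstance(node, list):
        for item in node:
            prefixed_properties(item, found)
    return found


metadata = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Unemployment in Europe (monthly)",
  "schema:creator": { "schema:name": "Eurostat" },
  "tables": [{ "url": "countries.csv", "ex:note": "hypothetical property" }]
}
""")

props = prefixed_properties(metadata)
unknown = sorted(p for p in props if p.split(":", 1)[0] not in KNOWN_PREFIXES)
print(unknown)
```

+ A property reported as unknown here would need to be checked against the full list of recognised prefixes, or be renamed with a full URL.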

+
+
+

How do you support units of measure?

+
+
+

What about multi-lingual CSV files?

+
+
+

What about CSV that isn't standard CSV?

+
+ + +