The conversion of CSV content to JSON is intended for web developers who need not care about the complexities of RDF [[!rdf11-concepts]]. Where the formality of RDF is required, [[!csv2rdf]] provides the procedures for mapping from CSV content to RDF which may be serialized to [[json-ld]].
The conversion procedure described in this specification operates on the tabular data. This specification does not specify the processes needed to convert CSV-encoded data into tabular data form. Please refer to [[!tabular-data-model]] for details of parsing tabular data.
+
The conversion procedure described in this specification operates on the annotated tabular data model. This specification does not specify the processes needed to convert CSV-encoded data into tabular data form. Please refer to [[!tabular-data-model]] for details of parsing tabular data.
Conversion applications MUST provide at least two modes of operation: standard and minimal.
-
Standard mode conversion frames the information gleaned from the cells of the tabular data with details of the rows, tables, and a table group within which that information is provided.
+
Standard mode conversion frames the information gleaned from the cells of the tabular data with details of the rows, tables, and a group of tables within which that information is provided.
Minimal mode conversion includes only the information gleaned from the cells of the tabular data within the output.
@@ -126,13 +126,14 @@
Introduction
Conversion applications MAY offer additional implementation specific conversion modes.
-
Conversion specifications, as defined in [[!tabular-metadata]] MAY be used to specify how tabular data can be transformed into another format using a script or template. Such a conversion specification MAY use the JSON output described in this specification as input.
-
-
The conversion procedure described in this specification is considered to be entirely textual. There is no requirement on conversion applications to check the semantic consistency of the data during the conversion, nor to validate the output against JSON syntax rules. Downstream applications SHOULD be aware of the potential for syntax errors and take appropriate action.
+
Transformation definitions, as defined in [[!tabular-metadata]], MAY be used to specify how tabular data can be transformed into another format using a script or template. Such a transformation definition MAY use the JSON output described in this specification as input.
-
Tabular data MUST conform to the description from [[!tabular-data-model]]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Given this constraint, not all CSV-encoded data can be considered to be tabular data. As such, the conversion procedure described in this specification cannot be applied to all CSV files.
+
Tabular data MUST conform to the description from [[!tabular-data-model]]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty).
+
+ Not all CSV-encoded data can be parsed into a tabular data model. An algorithm for parsing CSV-based files is described in [[!tabular-data-model]].
+
The annotated table is defined in [[!tabular-data-model]] as describing a particular table and its metadata.
+
The annotated table is defined in [[!tabular-data-model]] as describing a particular table and its annotations.
array
An array is defined in JSON ([[!RFC7159]]) as an ordered sequence of zero or more values, where a value is a string, number, boolean, null, object, or array.
@@ -159,28 +160,31 @@
Algorithm terms
Cell errors are defined in [[!tabular-data-model]] as a (possibly empty) list of validation errors generated while parsing the literal content of a cell to generate the semantic value.
cell value
-
A cell value is defined in [[!tabular-data-model]] as the semantic value of the cell; this MAY be null or, in the case that the cell specifies a separator property, a sequence of values.
+
A cell value is defined in [[!tabular-data-model]] as the semantic value of the cell; this MAY be null or a sequence of values.
column
A column is defined in [[!tabular-data-model]] as a vertical arrangement of cells within a table.
The group of tables is defined in [[!tabular-data-model]] as comprising a set of annotated tables and a set of annotations that relate to that group.
-
identifier
-
The identifier is the evaluation of the @id property for the current resource. As defined in [[!tabular-data-model]], the identifier is null if the @id property is undefined. The identifier MAY be applied to either a table group or a table.
+
group of tables identifier
+
The group of tables identifier is the id annotation on a group of tables, as defined in [[!tabular-data-model]].
name
In the context of this specification, name is used as defined in JSON ([[!RFC7159]]); that is, that name is a string that provides a unique key within a set of name-value pairs within a JSON object.
+
non-core annotations
+
Core annotations are listed in [[!tabular-data-model]]; groups of tables and tables may also have other annotations that are not defined in that specification; these are known as non-core annotations.
+
notes
-
A list of notes, as defined in [[!tabular-data-model]], attached to an annotated table or table group using the notes property. This may be an empty list.
+
A list of notes, as defined in [[!tabular-data-model]], attached to an annotated table or group of tables using the notes property. This may be an empty list.
object
An object is defined in JSON ([[!RFC7159]]) as an unordered collection of zero or more name-value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array.
The row is defined in [[!tabular-data-model]] as a horizontal arrangement of cells within a table.
@@ -192,16 +196,13 @@
Algorithm terms
A row source number is defined in [[!tabular-data-model]] as the position of the row within the source tabular data file. Provision of the row source number is dependent on parsing applications and may be reported as null.
subject
-
Within this algorithm, a subject is the resource that the value of a given cell refers to. This may be specified using the aboutUrl property.
+
Within this algorithm, a subject is the resource that the value of a given cell refers to. This may be specified using the about URL annotation.
-
table group
-
The table group is defined in [[!tabular-data-model]] as comprising a set of annotated tables and a set of annotations that relate to those tables.
A conformant JSON conversion application MUST produce output conforming to this algorithm according to the chosen mode of conversion: standard or minimal.
Insert an empty array A into the JSON output. The objects containing the name-value pairs associated with the cell values will be subsequently inserted into this array.
Each row within the table is processed sequentially in order. For each row in the current table:
@@ -228,12 +229,12 @@
Minimal mode
Generate a sequence of objects, S1 to Sn, each of which corresponds to a subject described by the current row, as described in .
-
The subject(s) described by each row are determined according to the aboutUrl property for each cell in the current row. Where aboutUrl is undefined, a default subject for the row is used.
+
The subject(s) described by each row are determined according to the about URL annotation for each cell in the current row. Where about URL is undefined, a default subject for the row is used.
As described in , process the sequence of objects, S1 to Sn, to produce a new sequence of root objects, SR1 to SRm, that MAY include nested objects.
-
A row MAY describe multiple interrelated subjects; where the valueUrl property for one cell matches the aboutUrl property for another cell in the same row.
All other annotations for the table are ignored during the conversion; including information about table schemas and columns specified therein, foreign keys etc.
Insert the following name-value pair into objectT:
@@ -350,12 +351,12 @@
Standard mode
Generate a sequence of objects, S1 to Sn, each of which corresponds to a subject described by the current row, as described in .
-
The subject(s) described by each row are determined according to the aboutUrl property for each cell in the current row. Where aboutUrl is undefined, a default subject for the row is used.
+
The subject(s) described by each row are determined according to the about URL annotation for each cell in the current row. Where about URL is undefined, a default subject for the row is used.
As described in , process the sequence of objects, S1 to Sn, to produce a new sequence of root objects, SR1 to SRm, that MAY include nested objects.
-
A row MAY describe multiple interrelated subjects; where the valueUrl property for one cell matches the aboutUrl property for another cell in the same row.
Determine the unique subjects for the current row. The subject(s) described by each row are determined according to the aboutUrl property for each cell in the current row. A default subject for the row is used for any cells where aboutUrl is undefined.
+
Determine the unique subjects for the current row. The subject(s) described by each row are determined according to the about URL annotation for each cell in the current row. A default subject for the row is used for any cells where about URL is undefined.
-
For each subject that the current row describes where at least one of the cells that refers to that subject has a value or valueUrl that is not null, and is associated with a column where the value of property suppressOutput has value false:
(i is the index number with values from 1 to n, where n is the number of subjects for the row)
-
Subjecti is identified according to the aboutUrl property of its associated cells: IS. For a defaultsubject where aboutUrl is not specified by its cells, IS is null.
+
Subject i is identified according to the about URL annotation of its associated cells: IS. For a default subject where about URL is not specified by its cells, IS is null.
If the identifier for subjecti, IS, is not null, then insert the following name-value pair into objectSi:
@@ -397,32 +398,32 @@
Generating Objects
Each cell referring to subjecti is then processed sequentially according to the order of the columns.
-
For each cell referring to subjecti, where the value of property suppressOutput for the column associated with that cell is false, insert a name-value pair into objectSi as described below:
+
For each cell referring to subjecti, where the suppress output annotation for the column associated with that cell is false, insert a name-value pair into objectSi as described below:
-
If the value of propertyUrl for the cell is not null, then nameN takes the value of propertyUrl compacted according to the rules as defined in URL Compaction in [[!tabular-metadata]].
-
Else, nameN takes the value of the name property for the column associated with the cell.
+
If the value of property URL for the cell is not null, then nameN takes the value of property URL compacted according to the rules as defined in URL Compaction in [[!tabular-metadata]].
+
Else, nameN takes the value of the name annotation for the column associated with the cell.
-
If the valueUrl for the current cell is not null, then insert the following name-value pair into objectSi:
+
If the value URL for the current cell is not null, then insert the following name-value pair into objectSi:
name
N
value
Vurl
-
where Vurl is the value of valueUrl property for the current cell expressed as a string in the JSON output. If N is @type, compact Vurl according to the rules as defined in URL Compaction in [[!tabular-metadata]].
+
where Vurl is the value of value URL annotation for the current cell expressed as a string in the JSON output. If N is @type, compact Vurl according to the rules as defined in URL Compaction in [[!tabular-metadata]].
-
Else, if the cell specifies a separator property and the cell value is not an empty sequence, then the cell value provides a sequence of values for inclusion within the JSON output; insert an arrayAv containing each value V of the sequence into objectSi:
+
Else, if the cell value is a list that is not empty, then the cell value provides a sequence of values for inclusion within the JSON output; insert an arrayAv containing each value V of the sequence into objectSi:
name
N
value
Av
-
Each of the values V derived from the sequence MUST be expressed in the JSON output according to the datatype property of the cell as defined below: .
-
Since arrays are implicitly ordered in JSON, the ordered property, if specified, has no effect on the JSON output.
+
Each of the values V derived from the sequence MUST be expressed in the JSON output according to the datatype of V as defined below in .
+
Since arrays are implicitly ordered in JSON, the ordered annotation has no effect on the JSON output.
Else, if the cell value is not null, then the cell value provides a single value V for inclusion within the JSON output; insert the following name-value pair into objectSi:
@@ -433,7 +434,7 @@
Generating Objects
V
-
Value V derived from the cell values MUST be expressed in the JSON output according to the datatype property of the cell as defined below: .
+
Value V derived from the cell values MUST be expressed in the JSON output according to the datatype of the value as defined in .
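The cell-by-cell steps above can be sketched in Python as follows. This is a minimal illustration and not a normative implementation: the `Cell` class, its field names and `build_object` are hypothetical, property URLs are assumed to have been compacted already, and values are assumed to be Python objects that already reflect their datatype. Skipping empty lists, and letting a later cell overwrite an earlier cell with the same name, are simplifying assumptions of the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Cell:
    """Hypothetical stand-in for an annotated cell (field names are illustrative)."""
    column_name: str
    value: Any = None                   # semantic cell value: None, a scalar, or a list
    property_url: Optional[str] = None  # assumed to be compacted already
    value_url: Optional[str] = None
    suppress_output: bool = False       # suppress output annotation of the cell's column


def build_object(subject_id: Optional[str], cells: list[Cell]) -> dict:
    """Build the JSON object for one subject from its cells, in column order."""
    obj: dict = {}
    if subject_id is not None:
        obj["@id"] = subject_id
    for cell in cells:
        if cell.suppress_output:
            continue
        # Name N: the property URL if present, otherwise the column name.
        name = cell.property_url if cell.property_url is not None else cell.column_name
        if cell.value_url is not None:
            obj[name] = cell.value_url       # value URL expressed as a string
        elif isinstance(cell.value, list):
            if cell.value:                   # assumption: empty lists produce no output
                obj[name] = list(cell.value)
        elif cell.value is not None:
            obj[name] = cell.value           # single value V
    return obj
```

Cells whose value is null and whose value URL is null contribute nothing to the object, which matches the treatment of empty cells described above.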
@@ -449,11 +450,11 @@
Generating Objects
Generating Nested Objects
The steps in the algorithm defined herein apply to both standard and minimal modes.
-
Where the current row describes multiple subjects, it MAY be possible to organise the objects associated with those subjects such that some objects are nested within others; e.g. where the valueUrl property for one cell matches the aboutUrl property for another cell in the same row. This algorithm considers a sequence of objects generated according to , S1 to Sn, each of which corresponds to a subject described by the current row. It generates a new sequence of rootobjects, SR1 to SRm, that MAY include nestedobjects.
+
Where the current row describes multiple subjects, it MAY be possible to organise the objects associated with those subjects such that some objects are nested within others; e.g. where the value URL annotation for one cell matches the about URL annotation for another cell in the same row. This algorithm considers a sequence of objects generated according to , S1 to Sn, each of which corresponds to a subject described by the current row. It generates a new sequence of root objects, SR1 to SRm, that MAY include nested objects.
Where the current row describes only a single subject, this algorithm may be bypassed as no nesting is possible. In such a case, the root object SR1 is identical to the original object S1.
-
This nesting algorithm is based on the interrelationships between subjects described within a given row that are specified using the valueUrl property. Cell values expressing the identity of a subject in the current row (i.e., as a simple literal) will be ignored by this algorithm.
+
This nesting algorithm is based on the interrelationships between subjects described within a given row that are specified using the value URL annotation. Cell values expressing the identity of a subject in the current row (i.e., as a simple literal) will be ignored by this algorithm.
The algorithm uses the following terms:
@@ -489,7 +490,7 @@
Generating Nested Objects
-
For all cells in the current row, determine the valueUrls, Vurl, that occur only once. The list of these uniquely occurring valueUrls is referred to as the URL-list.
+
For all cells in the current row, determine the value URLs, Vurl, that occur only once. The list of these uniquely occurring value URLs is referred to as the URL-list.
For all cells associated with the current objectSi (e.g. whose aboutUrl property matches IS):
+
For all cells associated with the current object Si (e.g. whose about URL annotation matches IS):
-
If the valueUrl property of the current cell is defined and its value, Vurl, appears in the URL-list, then check each of the otherobjects in the sequence S1 to Sn to determine if Vurl identifies one of those objects.
+
If the value URL annotation of the current cell is defined and its value, Vurl, appears in the URL-list, then check each of the other objects in the sequence S1 to Sn to determine if Vurl identifies one of those objects.
For object Sj, if the name-value pair with name @id is present and its value matches Vurl, then:
If the root of the tree containing vertex N is a vertex that represents object Sj, then object Si is already a descendant of object Sj; no further action should be taken for this instance of Vurl.
This clause in the algorithm prevents circular loops being created.
-
Furthermore, because the URL-list contains valueUrls that occur only once for the current row, objectSi cannot be a descendant of an intermediate vertices in the tree.
+
Furthermore, because the URL-list contains value URLs that occur only once for the current row, object Si cannot be a descendant of an intermediate vertex in the tree.
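As an illustration of how the URL-list might be computed, the following Python sketch collects the value URLs that occur exactly once among the cells of a row. The function name `url_list` and its input shape (one optional value URL per cell) are assumptions made for this example only.

```python
from collections import Counter
from typing import Iterable, List, Optional


def url_list(value_urls: Iterable[Optional[str]]) -> List[str]:
    """Return the value URLs that occur exactly once in the current row.

    Cells without a value URL are represented by None and are ignored.
    """
    counts = Counter(url for url in value_urls if url is not None)
    # Counter keeps first-seen order, so the result follows column order.
    return [url for url, n in counts.items() if n == 1]
```

Only value URLs in this list are candidates for nesting; a value URL referenced by two or more cells in the same row is left as a plain string value.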
@@ -545,70 +546,66 @@
Generating Nested Objects
Interpreting datatypes
-
Cell values are expressed in the JSON output according to the cell's datatype property. The relationship between the value of the datatype property and the primitive types supported by JSON (as specified in [[!RFC7159]]) is provided in the table below.
+
Cell values are expressed in the JSON output according to their datatype. The relationship between the datatype of the value and the primitive types supported by JSON (as specified in [[!RFC7159]]) is provided in the table below.
Instances of JSON reserved characters within string values MUST be escaped as defined in [[!RFC7159]].
-
JSON has no native support for expressing language information; therefore the lang property has no effect on the JSON output.
+
JSON has no native support for expressing language information; therefore the language of a value has no effect on the JSON output.
-
A cell's format property is irrelevant to the conversion procedure defined in this specification; the cell value has already been parsed from the contents the cell according to the format property.
+
A datatype's format is irrelevant to the conversion procedure defined in this specification; the cell value has already been parsed from the contents of the cell according to the format annotation.
Where the contents of the cell cannot be parsed, or other validation errors occur, cell errors will be provided. It is an implementation decision to determine how conversion applications should proceed in the event that cell errors are encountered.
-
datatype
JSON primitive type
Remarks
+
datatype
JSON primitive type
-
anyAtomicType
string
-
any
string
any is considered to be equivalent to anyAtomicType
-
anyURI
string
-
base64Binary
string
-
binary
string
binary is considered to be equivalent to base64Binary
-
boolean
boolean
-
date
string
-
dateTime
string
-
datetime
string
datetime is considered to be equivalent to dateTime
-
dateTimeStamp
string
-
decimal
number
-
integer
number
-
long
number
-
int
number
-
short
number
-
byte
number
-
nonNegativeInteger
number
-
positiveInteger
number
-
unsignedLong
number
-
unsignedInt
number
-
unsignedShort
number
-
unsignedByte
number
-
nonPositiveInteger
number
-
negativeInteger
number
-
double
number
-
number
number
number is considered to be equivalent to double
-
duration
string
-
dayTimeDuration
string
-
yearMonthDuration
string
-
float
number
-
gDay
string
-
gMonth
string
-
gMonthDay
string
-
gYear
string
-
gYearMonth
string
-
hexBinary
string
-
QName
string
-
string
string
-
normalizedString
string
-
token
string
-
language
string
-
Name
string
-
NMTOKEN
string
-
xml
string
-
html
string
-
json
string
-
time
string
+
anyAtomicType
string
+
anyURI
string
+
base64Binary
string
+
boolean
boolean
+
date
string
+
dateTime
string
+
dateTimeStamp
string
+
decimal
number
+
integer
number
+
long
number
+
int
number
+
short
number
+
byte
number
+
nonNegativeInteger
number
+
positiveInteger
number
+
unsignedLong
number
+
unsignedInt
number
+
unsignedShort
number
+
unsignedByte
number
+
nonPositiveInteger
number
+
negativeInteger
number
+
double
number
+
duration
string
+
dayTimeDuration
string
+
yearMonthDuration
string
+
float
number
+
gDay
string
+
gMonth
string
+
gMonthDay
string
+
gYear
string
+
gYearMonth
string
+
hexBinary
string
+
QName
string
+
string
string
+
normalizedString
string
+
token
string
+
language
string
+
Name
string
+
NMTOKEN
string
+
xml
string
+
html
string
+
json
string
+
time
string
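The mapping in the table above can be sketched as a simple lookup. This is an illustrative Python fragment rather than a normative implementation: the names `NUMBER_DATATYPES` and `json_primitive_type` are hypothetical, and treating any datatype that is not listed as a string is an assumption of the sketch. The standard `json` module is used to show the escaping of reserved characters in string values.

```python
import json

# Datatypes that the table maps to the JSON primitive type "number".
NUMBER_DATATYPES = frozenset({
    "decimal", "integer", "long", "int", "short", "byte",
    "nonNegativeInteger", "positiveInteger",
    "unsignedLong", "unsignedInt", "unsignedShort", "unsignedByte",
    "nonPositiveInteger", "negativeInteger",
    "double", "float",
})


def json_primitive_type(datatype: str) -> str:
    """Return the JSON primitive type used for values of the given datatype."""
    if datatype == "boolean":
        return "boolean"
    if datatype in NUMBER_DATATYPES:
        return "number"
    # Every other datatype in the table is expressed as a string;
    # unlisted datatypes falling back to "string" is an assumption.
    return "string"


def encode_string(value: str) -> str:
    """Serialize a string value; json.dumps escapes JSON reserved characters."""
    return json.dumps(value)
```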
@@ -617,9 +614,9 @@
Interpreting datatypes
JSON-LD to JSON
-
This section defines a mechanism for transforming the [[json-ld]] dialect used for common properties and notes into JSON.
Name-value pairs from notes and common properties annotations are generally copied verbatim from the metadata description subject to the exceptions below:
+
Name-value pairs from notes and non-core annotations are generally copied verbatim from the metadata description subject to the exceptions below:
Name-value pairs whose value is an object using the [[json-ld]] keyword @value, for example:
@@ -656,7 +653,7 @@
JSON-LD to JSON
-
In addition to compacting values of propertyUrls, URLs which ware the value of @type used within the notes and common properties are compacted according to the rules as defined in URL Compaction in [[!tabular-metadata]].
+
In addition to compacting values of property URLs, URLs which are the value of @type used within the notes and non-core annotations are compacted according to the rules as defined in URL Compaction in [[!tabular-metadata]].
@@ -667,7 +664,7 @@
Examples
Simple example
- This example comprises a single annotated table containing information attributes about countries; country code, position (latitude, longitude) and name. Whilst the input tabular data file, published at http://example.org/countries.csv, includes a header line, no further metadata annotations are given. The tabular data file is provided below:
+ This example comprises a single table containing information attributes about countries; country code, position (latitude, longitude) and name. Whilst the input tabular data file, published at http://example.org/countries.csv, includes a header line, no further metadata annotations are given. The tabular data file is provided below:
Simple example
-
Annotations for the resulting tableT, with 4 columns and 3 rows, are shown below:
-
-
-
-
id
core annotations
annotations
-
url
columns
rows
-
-
-
T
http://example.org/countries.csv
C1, C2, C3, C4
R1, R2, R3
-
-
-
-
Annotations for the columns, rows and cells in tableT are shown in the tables below.
As the value of propertyUrl has not been set within the metadata description it defaults to the URI Template (see [[RFC6570]]) #{[column-name]}, where [column-name] is the value of the name property for the column associated with the cell. For example, the value of propertyUrl for all cells in columnC1 ("name": "countryCode") is http://example.org/countries.csv#countryCode.
-
+
+
Annotations for the resulting tableT, with 4 columns and 3 rows, are shown below:
+
+
+
+
id
core annotations
+
url
columns
rows
+
+
+
T
http://example.org/countries.csv
C1, C2, C3, C4
R1, R2, R3
+
+
+
+
Annotations for the columns, rows and cells in tableT are shown in the tables below.
Minimal mode output for this example is provided below:
Simple example
-
The aboutUrl property has not been set for cells in table T ({ "url": "http://example.org/countries.csv"}) - cells in a given row where aboutUrl has not been specified are assumed to refer to the same subject and so the name-value pairs associated with the cell values of that row occur within the same object.
-
Given that the propertyUrl has not been explicitly set for cells in table T ({ "url": "http://example.org/countries.csv"}), the simplified name is used in the name-value pairs; e.g. countryCode rather than http://example.org/countries.csv#countryCode
+
The about URL annotation has not been set for cells in table T ({ "url": "http://example.org/countries.csv"}) - cells in a given row where about URL has not been specified are assumed to refer to the same subject and so the name-value pairs associated with the cell values of that row occur within the same object.
+
Given that the property URL is null for cells in table T ({ "url": "http://example.org/countries.csv"}), the simplified name is used in the name-value pairs; e.g. countryCode rather than http://example.org/countries.csv#countryCode
Standard mode output for this example is provided below:
@@ -771,7 +768,7 @@
Simple example
-
Even though the table was defined in isolation, the table is wrapped in a table group.
The name-value pair with name url provides reference to the original tabular data file and to specific rows therein.
The row number is provided for each row using the name-value pair with name rownum.
The object containing the name-value pairs associated with the cell values of a row is related to the object for that row using the name-value pair with name describes.
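The framing described in these notes can be sketched in Python as follows. The names url, rownum and describes come from the notes above; the helper functions, the `row` name used for the list of row objects, and the `#row=` fragment form of the row URL are assumptions made for illustration. The `countryCode` value in the usage note is a placeholder, not data from the example file.

```python
from typing import Any, Dict, List, Optional


def frame_row(table_url: str, rownum: int, source_rownum: Optional[int],
              objects: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap the objects generated for one row with row-level details."""
    row: Dict[str, Any] = {}
    if source_rownum is not None:
        # Assumed fragment form pointing at the row in the source file.
        row["url"] = f"{table_url}#row={source_rownum}"
    row["rownum"] = rownum
    row["describes"] = objects
    return row


def frame_table(table_url: str, rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap framed rows with table-level details ("row" is an assumed name)."""
    return {"url": table_url, "row": rows}
```

For instance, `frame_row("http://example.org/countries.csv", 1, 2, [{"countryCode": "XX"}])` produces an object whose describes array holds the single object generated for that row. When the row source number is null, the sketch omits the row url.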
@@ -802,106 +799,114 @@
Example with single table and rich annotations
+
Core annotations for the resulting tableT, with 9 columns and 3 rows, are shown below:
-
Annotations for the resulting tableT, with 9 columns and 3 rows, are shown below:
"cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay; beware of BEES"
"cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES"
"cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay; beware of BEES"
"cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES"
The lists of values from cells in column C7 ("name": "comments") are assumed to be unordered as the boolean ordered annotation, which defaults to false, has not been set within the metadata description.
-
For brevity, the propertyUrl is not shown in the table of cell annotations. Where not explicitly set, the value of propertyUrl defaults to the URI Template (see [[RFC6570]]) #{[column-name]}, where [column-name] is the value of the name property for the column associated with the cell. For example, the value of propertyUrl for all cells in columnC2 ("name": "on_street") is http://example.org/tree-ops-ext.csv#on_street.
Minimal mode output for this example is provided below:
@@ -914,8 +919,8 @@
Example with single table and rich annotations
-
The subject described by each row is explicitly defined using the aboutUrl property; e.g. the subject of rowR1 is http://example.org/tree-ops-ext#gid-1.
-
Output for columnC1 ({ "name": "GID" }) is not included as column property suppressOutput has value true.
+
The subject described by each row is explicitly defined using the about URL annotation; e.g. the subject of rowR1 is http://example.org/tree-ops-ext#gid-1.
Cells C1.7 and C2.7 (rows R1 and R2; column, { "name": "comments" }) have null values - no output is included for these cells.
Cell C3.7 (row R3; column, { "name": "comments" }) contains a sequence of values; these values are included in an array.
@@ -931,7 +936,7 @@
Example with single table and rich annotations
TableT ({ "url": "http://example.org/tree-ops-ext.csv"}) has been explicitly identified: { "@id": "<http://exmple.org/tree-ops-ext>"}.
-
Common properties and notes specified for tableT ({ "url": "http://example.org/tree-ops-ext.csv"}) are included in the output.
+
Non-core annotations and notes specified for tableT ({ "url": "http://example.org/tree-ops-ext.csv"}) are included in the output.
@@ -959,7 +964,7 @@
Example with single table and using virtual columns to produce mult
- The CSV to JSON translation is limited to providing one statement, or triple, per column in the table. The target schema.org markup requires 10 statements to describe each event. As the base tabular data file contains 5 columns, an additional 5 virtual columns have been added in order to provide for the full complement of statements—including the relationships between the 3 resources (event, location, and offer) described by each row of the table. Note that the virtual property is set to true for these virtual columns.
+ The CSV to JSON translation is limited to providing one statement, or triple, per column in the table. The target schema.org markup requires 10 statements to describe each event. As the base tabular data file contains 5 columns, an additional 5 virtual columns have been added in order to provide for the full complement of statements—including the relationships between the 3 resources (event, location, and offer) described by each row of the table. Note that the virtual annotation is true for these virtual columns.
Furthermore, note that no attempt is made to reconcile between locations or offers that may be associated with more than one event; every row in the table will create both a new location resource and offer resource in addition to the event resource. If considered necessary, applications such as OpenRefine may be used to identify and reconcile duplicate location resources once the JSON output has been generated.
@@ -970,6 +975,7 @@
Example with single table and using virtual columns to produce mult
+
Annotations for the resulting tableT, with 10 columns and 2 rows, are shown below:
@@ -987,8 +993,8 @@
Example with single table and using virtual columns to produce mult
ColumnsC6, C7 and C8 ({ "name": "type_event"}, { "name": "type_place"} and { "name": "type_offer"}) define the semantic types of the resources described by each row: schema:MusicEvent, schema:Place and schema:Offer respectively—noting that the use of rdf:type is converted to the name@type (as used in [[json-ld]]) by this conversion application.
-
ColumnC9 ({ "name": "location"}) uses the aboutUrl and valueUrl to assert the relationship between the event and location resources.
-
ColumnC10 ({ "name": "offer"}) uses the aboutUrl and valueUrl to assert the relationship between the event and offer resources.
+
ColumnC9 ({ "name": "location"}) uses the about URL and value URL to assert the relationship between the event and location resources.
+
ColumnC10 ({ "name": "offer"}) uses the about URL and value URL to assert the relationship between the event and offer resources.
Standard mode output for this example is provided below:
@@ -1076,7 +1082,7 @@
Example with single table and using virtual columns to produce mult
-
The resources described by each row are explicitly defined using the aboutUrl property—in this case three resources per row (event, location, and offer); the objects containing the name-values pairs associated with the cell values of a row are related to the object for each subject in that row using the name-value pair with namedescribes.
+
The resources described by each row are explicitly defined using the about URL annotation, in this case three resources per row (event, location, and offer); the objects containing the name-value pairs associated with the cell values of a row are related to the object for each subject in that row using the name-value pair with name describes.
@@ -1084,7 +1090,7 @@
Example with single table and using virtual columns to produce mult
Example with table group comprising four interrelated tables
- This example is based on Use Case #4 - Publication of public sector roles and salaries and uses four annotated tables published within a table group. Information about senior roles and junior roles within a government department or organization are published in CSV format by each department. These are validated against a centrally published schema to ensure that all the data published by departments is consistent. Additionally, lists of organizations and professions are also published centrally, providing controlled vocabularies against which departmental submissions are validated.
+ This example is based on Use Case #4 - Publication of public sector roles and salaries and uses four annotated tables published within a group of tables. Information about senior roles and junior roles within a government department or organization are published in CSV format by each department. These are validated against a centrally published schema to ensure that all the data published by departments is consistent. Additionally, lists of organizations and professions are also published centrally, providing controlled vocabularies against which departmental submissions are validated.
@@ -1159,54 +1165,66 @@
Example with table group comprising four interrelated tables
Finally, note that because the centrally published metadata descriptions are intended to be reused across many government departments and organizations, extra consideration has been given to defining URIs for the person and post resources defined in each row of the senior roles tabular data and subsequently referenced from the junior roles tabular data. To ensure that naming clashes are avoided, the unique reference for the organization to which the person or post belongs has been included in a path segment of the identifier. For example, the URI template propertyaboutUrl used to identify the senior post is specified as http://example.org/organization/{organizationRef}/post/{ref}, thus yielding the URI http://example.org/organization/hefce.ac.uk/post/90115 for the post described in the first row of the senior roles tabular data.
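As an illustration of how the URI template above yields the post identifier, the following Python sketch performs simple variable substitution. The function name expand_simple is hypothetical, and the sketch handles only the plain {name} form used in this example, not the full URI Template syntax of [[RFC6570]].

```python
import re
from urllib.parse import quote


def expand_simple(template, variables):
    """Replace each {name} in template with its percent-encoded value.

    Hypothetical helper: only simple {name} expressions are handled,
    and variables without a value expand to the empty string.
    """
    def substitute(match):
        value = variables.get(match.group(1))
        return "" if value is None else quote(str(value), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)


template = "http://example.org/organization/{organizationRef}/post/{ref}"
first_row = {"organizationRef": "hefce.ac.uk", "ref": "90115"}
print(expand_simple(template, first_row))
# prints http://example.org/organization/hefce.ac.uk/post/90115
```

Including organizationRef in the path keeps identifiers from different departments distinct even when their ref values coincide.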
-
The table group generated from parsing the tabular data files and associated metadata is shown below and provides the basis for the conversion to JSON.
+
The group of tables generated from parsing the tabular data files and associated metadata is shown below and provides the basis for the conversion to JSON.
-
Annotations for the table groupG and the four tablesTa, Tb, Tc, and Td are shown below.
In this example, output for the centrally published lists of organizations and professions, tables Ta and Tb (http://example.org/gov.uk/data/organizations.csv and http://example.org/gov.uk/data/professions.csv respectively), are not required; only information from the departmental submissions is to be translated to JSON. Note the suppressOutput annotation on this table.
+
In this example, output for the centrally published lists of organizations and professions, tables Ta and Tb (http://example.org/gov.uk/data/organizations.csv and http://example.org/gov.uk/data/professions.csv respectively), is not required; only information from the departmental submissions is to be translated to JSON. Note the suppress output annotation on these tables.
+
+
The following foreign keys are defined:
+
+
+
id
columns in table
columns in referenced table
+
+
+
Fa1
Ca3
Ca1
+
Fc1
Cc5
Cc1
+
Fc2
Cc6
Cb1
+
Fc3
Cc7
Ca1
+
Fd1
Cd1
Cc1
+
Fd2
Cd7
Cb1
+
Fd3
Cd8
Ca1
+
+
Annotations for the columns, rows and cells in tableT are shown in the tables below.
Notice that valueUrl is not specified for cellsCa2.3 and Cc2.5 because in each case the cell value is null and the virtual property of columnCb5 is not specified.
+
Notice that value URL is not specified for cells Ca2.3 and Cc2.5 because in each case the cell value is null and the virtual annotation of column Cb5 is not defined.
Minimal mode output for this example is provided below:
@@ -1337,11 +1355,11 @@
Example with table group comprising four interrelated tables
Prefixes defined within the RDFa 1.1 Initial Context ([[rdfa-core]]) are not expanded; e.g. dc: for <http://purl.org/dc/terms/>.
-
Output for tablesTa and Tb ({ "url": "http://example.org/gov.uk/data/organizations.csv" } and { "url": "http://example.org/gov.uk/data/professions.csv" }) are not included as property suppressOutput is specified with value true for each of the tables.
ColumnsCc5 and Cd1 ({ "name": "reportsTo" } and { "name": "reportsToSenior" }) use the aboutUrl, propertyUrl and valueUrl properties to assert the relationship between the given post and the senior post it reports to for the cells therein. However, since senior posts and junior posts are described in different tables so it is not possible to create nested objects for this particular case.
-
Similarly, columnsCc7 and Cd8 (both with { "name": "organizationRef" }) use the aboutUrl, propertyUrl and valueUrl properties to assert the relationship between the given post and the organization to which it belongs for the cells those columns.
-
Finally, note that two resources are created for each row within tableTc ({ "url": "http://example.org/senior-roles.csv" }): the person and the post they occupy. The relationship between these resources is specified via virtualcolumnCc8 ({ "name": "post_holder" }) using the aboutUrl, propertyUrl and valueUrl properties. The personobject provides the value of the name-value pair with corresponding nameorg:heldBy, thus nesting the personobject within the postobject.
+
Output for tables Ta and Tb ({ "url": "http://example.org/gov.uk/data/organizations.csv" } and { "url": "http://example.org/gov.uk/data/professions.csv" }) is not included as the suppress output annotation has the value true for each of the tables.
Columns Cc5 and Cd1 ({ "name": "reportsTo" } and { "name": "reportsToSenior" }) use the about URL, property URL and value URL annotations to assert the relationship between the given post and the senior post it reports to for the cells therein. However, since senior posts and junior posts are described in different tables, it is not possible to create nested objects for this particular case.
+
Similarly, columns Cc7 and Cd8 (both with { "name": "organizationRef" }) use the about URL, property URL and value URL annotations to assert the relationship between the given post and the organization to which it belongs for the cells in those columns.
+
Finally, note that two resources are created for each row within table Tc ({ "url": "http://example.org/senior-roles.csv" }): the person and the post they occupy. The relationship between these resources is specified via virtual column Cc8 ({ "name": "post_holder" }) using the about URL, property URL and value URL annotations. The person object provides the value of the name-value pair with corresponding name org:heldBy, thus nesting the person object within the post object.
Standard mode output for this example is provided below:
This document describes the processing of tabular data to create an RDF subject-predicate-object triples [[!rdf11-concepts]]. Since RDF is an abstract syntax, these triples MAY be serialized in a concrete RDF syntax such as N-Triples [[n-triples]], Turtle [[turtle]], RDFa [[rdfa-primer]], JSON-LD [[json-ld]], or TriG [[trig]]. The RDF serializations offered by a conversion application is implementation dependent.
+
This document describes the processing of tabular data to create RDF subject-predicate-object triples [[!rdf11-concepts]]. Since RDF is an abstract syntax, these triples MAY be serialized in a concrete RDF syntax such as N-Triples [[n-triples]], Turtle [[turtle]], RDFa [[rdfa-primer]], JSON-LD [[json-ld]], or TriG [[trig]]. The RDF serializations offered by a conversion application are implementation defined.
The conversion procedure described in this specification operates on the tabular data. This specification does not specify the processes needed to convert CSV-encoded data into tabular data form. Please refer to [[!tabular-data-model]] for details of parsing tabular data.
+
The conversion procedure described in this specification operates on the annotated tabular data model. This specification does not specify the processes needed to convert CSV-encoded data into tabular data form. Please refer to [[!tabular-data-model]] for details of parsing tabular data.
Conversion applications MUST provide at least two modes of operation: standard and minimal.
-
Standard mode conversion frames the information gleaned from the cells of the tabular data with details of the rows, tables, and a table group within which that information is provided.
+
Standard mode conversion frames the information gleaned from the cells of the tabular data with details of the rows, tables, and a group of tables within which that information is provided.
Minimal mode conversion includes only the information gleaned from the cells of the tabular data.
@@ -141,13 +141,17 @@
Introduction
Conversion applications MAY offer additional implementation specific conversion modes.
-
Conversion specifications, as defined in [[!tabular-metadata]] MAY be used to specify how tabular data can be transformed into another format using a script or template. Such a conversion specification MAY use the RDF output described in this specification as input.
+
Transformation definitions, as defined in [[!tabular-metadata]], MAY be used to specify how tabular data can be transformed into another format using a script or template. Such transformation definitions MAY use the RDF output described in this specification as input.
-
The conversion procedure described in this specification is considered to be entirely textual. There is no requirement on conversion applications to check the semantic consistency of the data during the conversion, nor to validate the triples against RDF syntax rules. Downstream applications SHOULD be aware of the potential for syntax errors and take appropriate action.
+
There is no requirement on conversion applications to check the semantic consistency of the data during the conversion, nor to validate the triples against RDF schema. Downstream applications SHOULD be aware of the potential for inconsistencies and take appropriate action.
-
Tabular data MUST conform to the description from [[!tabular-data-model]]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty). Given this constraint, not all CSV-encoded data can be considered to be tabular data. As such, the conversion procedure described in this specification cannot be applied to all CSV files.
+
Tabular data MUST conform to the description from [[!tabular-data-model]]. In particular note that each row MUST contain the same number of cells (although some of these cells may be empty).
+
+
+ Not all CSV-encoded data can be parsed into a tabular data model. An algorithm for parsing CSV-based files is described in [[!tabular-data-model]].
+
This specification makes use of the compact IRI Syntax; please refer to the Compact IRIs from [[json-ld]].
Cell errors are defined in [[!tabular-data-model]] as a (possibly empty) list of validation errors generated while parsing the literal content of a cell to generate the semantic value.
cell value
-
A cell value is defined in [[!tabular-data-model]] as the semantic value of the cell; this MAY be null or, in the case that the cell specifies a separator property, a sequence of values.
+
A cell value is defined in [[!tabular-data-model]] as the semantic value of the cell; this MAY be null or a sequence of values.
column
A column is defined in [[!tabular-data-model]] as a vertical arrangement of cells within a table.
-
common properties
-
The common properties of a metadata resource are defined in Section 3.3 Common Properties of [[!tabular-metadata]]). The RDF triples corresponding to these properties are the result of running the algorithm specified in or equivalalent, over the common properties defined within the metadata description.
+
group of tables
+
The group of tables is defined in [[!tabular-data-model]] as comprising a set of annotated tables and a set of annotations that relate to that group.
-
identifier
-
The identifier is the evaluation of the @id property for the current resource. As defined in [[!tabular-data-model]], the identifier is null if the @id property is undefined. The identifier MAY be applied to either a table group or a table.
+
group of tables identifier
+
The group of tables identifier is the id annotation on a group of tables, as defined in [[!tabular-data-model]].
literal node
A literal node is defined in [[!rdf11-concepts]] as a node within an RDF graph that provides values such as strings, numbers, and dates.
@@ -205,8 +209,11 @@
Algorithm terms
node
A node is defined in [[!rdf11-concepts]] as a subject or an object of an RDF triple. When in subject position, it can be either a blank node or identified with a URL; when in object position, it can be a blank node, a literal, or identified with a URL.
+
non-core annotations
+
Core annotations are listed in [[!tabular-data-model]]; groups of tables and tables may also have other annotations that are not defined in that specification; these are known as non-core annotations.
+
notes
-
A list of notes, as defined in [[!tabular-data-model]], attached to an annotated table or table group using the notes property. This may be an empty list.
+
A list of notes, as defined in [[!tabular-data-model]], attached to an annotated table or group of tables using the notes property. This may be an empty list.
predicate
A predicate is defined in [[!rdf11-concepts]] as an IRI that denotes the property used to relate nodes within an RDF triple.
@@ -214,8 +221,8 @@
Algorithm terms
prefixed name
A prefixed name is an abbreviation for a URI, in the syntax prefix:name. See Names of Common Properties in [[!tabular-metadata]] for information on expansion.
The row is defined in [[!tabular-data-model]] as a horizontal arrangement of cells within a table.
@@ -227,16 +234,13 @@
Algorithm terms
A row source number is defined in [[!tabular-data-model]] as the position of the row within the source tabular data file. Provision of the row source number is dependent on parsing applications and may be reported as null.
subject
-
Within this algorithm, a subject is the resource that the value of a given cell refers to. This may be specified using the aboutUrl property.
+
Within this algorithm, a subject is the resource that the value of a given cell refers to. This may be specified using about URL.
-
table group
-
The table group is defined in [[!tabular-data-model]] as comprising a set of annotated tables and a set of annotations that relate to those tables.
Establish a new blank nodeSdef to be used as the default subject for cells where aboutUrl is undefined.
-
A row MAY describe multiple interrelated subjects; where the valueUrl property for one cell matches the aboutUrl property for another cell in the same row.
-
For each cell in the current row where the value of property suppressOutput for the column associated with that cell is false:
+
Establish a new blank node Sdef to be used as the default subject for cells where about URL is undefined.
Else, predicateP is constructed by appending the value of the name property for the column associated with the cell to the the tabular data file URL as a fragment identifier.
Else, predicate P is constructed by appending the value of the name annotation for the column associated with the cell to the tabular data file URL as a fragment identifier.
-
If the valueUrl for the current cell is not null, then valueUrl identifies a nodeVurl that is related the current subject using the predicateP; emit the following triple:
+
If the value URL for the current cell is not null, then value URL identifies a node Vurl that is related to the current subject using the predicate P; emit the following triple:
Else, if the cell specifies a separator property and the cell value is not an empty sequence and the cell specifies that boolean property ordered is true, then the cell value provides an ordered sequence of literal nodes for inclusion within the RDF output using an instance of rdf:ListVlist as defined in [[rdf-schema]]. This instance is related to the subject using the predicateP; emit the triples defining list Vlist plus the following triple:
+
Else, if the cell value is a list and the cell ordered annotation is true, then the cell value provides an ordered sequence of literal nodes for inclusion within the RDF output using an instance of rdf:List Vlist as defined in [[rdf-schema]]. This instance is related to the subject using the predicate P; emit the triples defining list Vlist plus the following triple:
Else, if the cell specifies a separator property and the cell value is not an empty sequence, then the cell value provides an unordered sequence of literal nodes for inclusion within the RDF output, each of which is related to the subject using the predicateP. For each value provided in the sequence, add a literal nodeVliteral; emit the following triple:
+
Else, if the cell value is a list, then the cell value provides an unordered sequence of literal nodes for inclusion within the RDF output, each of which is related to the subject using the predicate P. For each value provided in the sequence, add a literal node Vliteral; emit the following triple:
Else, if the cell value is not null and the cell does not specify a separator property, then the cell value provides a single literal nodeVliteral for inclusion within the RDF output that is related the current subject using the predicateP; emit the following triple:
+
Else, if the cell value is not null, then the cell value provides a single literal node Vliteral for inclusion within the RDF output that is related to the current subject using the predicate P; emit the following triple:
In the case when a cell value does not have a datatype, the conversion should default to string.
+
In the case where a sequence of values is provided, each value in the list has its own datatype; the datatype may be different for different items in the sequence.
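The branching described in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative implementation: triples are plain tuples, literals are plain Python values, blank nodes are generated strings, and the names cell_triples and new_blank_node are hypothetical. The use of rdf:nil as the head of an empty ordered list is also an assumption of this sketch.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_blank_ids = count(1)


def new_blank_node():
    """Return a fresh blank node label (hypothetical helper)."""
    return f"_:b{next(_blank_ids)}"


def cell_triples(subject, predicate, value, value_url=None, ordered=False):
    """Return the triples emitted for a single cell."""
    # A value URL takes precedence over the cell value.
    if value_url is not None:
        return [(subject, predicate, value_url)]
    if isinstance(value, list):
        if ordered:
            # Build the rdf:List back to front so each node can point
            # at the remainder of the list.
            triples = []
            head = RDF + "nil"
            for item in reversed(value):
                node = new_blank_node()
                triples.append((node, RDF + "first", item))
                triples.append((node, RDF + "rest", head))
                head = node
            triples.append((subject, predicate, head))
            return triples
        # Unordered: one triple per value in the sequence.
        return [(subject, predicate, item) for item in value]
    if value is not None:
        return [(subject, predicate, value)]
    # Null cell value: nothing is emitted.
    return []
```

For instance, an unordered list of two comments yields two triples sharing the same subject and predicate, whereas the same list with ordered set to true yields a single triple pointing at the head of an rdf:List.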
@@ -440,10 +444,10 @@
Generating RDF
Interpreting datatypes
-
Cell values are expressed in the RDF output according to the cell's datatype property. The relationship between the value of the datatype property and the datatype IRI used in the RDF is provided in the table below.
A cell's format property is irrelevant to the conversion procedure defined in this specification; the cell value has already been parsed from the contents the cell according to the format property.
+
A datatype's format annotation is irrelevant to the conversion procedure defined in this specification; the cell value has already been parsed from the contents of the cell according to the format annotation.
Where the contents of the cell cannot be parsed, or other validation errors occur, cell errors will be provided. It is an implementation decision to determine how conversion applications should proceed in the event that cell errors are encountered.
@@ -453,16 +457,12 @@
Interpreting datatypes
-
anyAtomicType
xsd:anyAtomicType
any is considered to be equivalent to anyAtomicType
-
any
xsd:anyAtomicType
-
anyAtomicType
xsd:anyAtomicType
+
anyAtomicType
xsd:anyAtomicType
anyURI
xsd:anyURI
base64Binary
xsd:base64Binary
-
binary
xsd:base64Binary
binary is considered to be equivalent to base64Binary
boolean
xsd:boolean
date
xsd:date
dateTime
xsd:dateTime
-
datetime
xsd:dateTime
datetime is considered to be equivalent to dateTime
dateTimeStamp
xsd:dateTimeStamp
decimal
xsd:decimal
integer
xsd:integer
@@ -479,7 +479,6 @@
Interpreting datatypes
nonPositiveInteger
xsd:nonPositiveInteger
negativeInteger
xsd:negativeInteger
double
xsd:double
-
number
xsd:double
number is considered to be equivalent to double
duration
xsd:duration
dayTimeDuration
xsd:dayTimeDuration
yearMonthDuration
xsd:yearMonthDuration
@@ -491,7 +490,7 @@
Interpreting datatypes
gYearMonth
xsd:gYearMonth
hexBinary
xsd:hexBinary
QName
xsd:QName
-
string
xsd:string or rdf:langString
depending on whether or not the lang property is defined for the cell.
+
string
xsd:string or rdf:langString depending on whether or not the value has an associated language.
normalizedString
xsd:normalizedString
token
xsd:token
language
xsd:language
@@ -504,7 +503,7 @@
Interpreting datatypes
-
In the case of rdf:langString, the appropriate language tag (as defined in [[!rdf11-concepts]]) MUST be provided for the string, based on the value of lang.
+
In the case of rdf:langString, the appropriate language tag (as defined in [[!rdf11-concepts]]) MUST be provided for the string, based on the cell value's language.
(See section on Graph Literals in [[!rdf11-concepts]] for further details on language tagged literals.)
According to [[rdf11-concepts]] language tags cannot be combined with any other xsd datatypes. If a cell has any other datatype than string, the value of lang MUST be ignored. Also, all literals have a datatype; however, specific serializations, like Turtle [[turtle]], MAY provide a special syntax for literals with datatype xsd:string or rdf:langString.
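The interaction between datatype and language can be sketched in Python as follows. This is a minimal illustration that assumes a small subset of the datatype table; the names XSD_DATATYPES and datatype_iri are hypothetical, and the result is a pair of prefixed datatype IRI and language tag.

```python
# Illustrative subset of the datatype table; not the complete mapping.
XSD_DATATYPES = {
    "anyURI": "xsd:anyURI",
    "boolean": "xsd:boolean",
    "date": "xsd:date",
    "dateTime": "xsd:dateTime",
    "decimal": "xsd:decimal",
    "integer": "xsd:integer",
    "double": "xsd:double",
    "duration": "xsd:duration",
}


def datatype_iri(datatype=None, language=None):
    """Return (datatype IRI, language tag) for a literal."""
    # A value without a datatype is treated as a string.
    if datatype is None or datatype == "string":
        if language:
            return ("rdf:langString", language)
        return ("xsd:string", None)
    # Language tags cannot be combined with other datatypes,
    # so any language is ignored here.
    return (XSD_DATATYPES[datatype], None)
```

With this sketch, a string value with language en is typed rdf:langString and keeps its tag, while an integer value in the same column drops the language entirely.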
@@ -548,8 +547,8 @@
Inclusion of provenance information
JSON-LD to RDF
-
This section defines a mechanism for transforming the [[json-ld]] dialect used for common properties and notes into RDF in a manner consistent with the Deserialize JSON-LD to RDF Algorithm defined in [[!json-ld-api]]. Converters MAY use any algorithm which results in equivalent triples.
-
Given a subject, property and value in normalized form:
+
This section defines a mechanism for transforming the [[json-ld]] dialect used for non-core annotations and notes into RDF in a manner consistent with the Deserialize JSON-LD to RDF Algorithm defined in [[!json-ld-api]]. Converters MAY use any algorithm which results in equivalent triples.
+
Given a subject, property and value in normalized form:
Property is a term defined in the [[csvw-context]], a prefixed name, or an absolute URL; expand to an absolute URL by replacing a term with the URI from the term definition in [[csvw-context]] or a prefixed name as described in .
If value is an array, generate RDF by running this algorithm using subject, property using each array member as value.
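The two steps shown above, property expansion and array handling, can be sketched in Python as follows. This illustration assumes a single hard-coded prefix (dc:, as named in the RDFa 1.1 Initial Context) and does not resolve terms from [[csvw-context]]; the names PREFIXES, expand_property and triples_for are hypothetical.

```python
# Hypothetical prefix table; a real converter would cover the full
# RDFa 1.1 Initial Context and the terms in the CSVW context.
PREFIXES = {"dc": "http://purl.org/dc/terms/"}


def expand_property(prop):
    """Expand a prefixed name to an absolute URL."""
    if "://" in prop:
        # Already an absolute URL.
        return prop
    prefix, sep, local = prop.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    raise ValueError(f"cannot expand property: {prop}")


def triples_for(subject, prop, value):
    """Generate triples, recursing over array members."""
    if isinstance(value, list):
        triples = []
        for member in value:
            triples.extend(triples_for(subject, prop, member))
        return triples
    return [(subject, expand_property(prop), value)]
```

An array value therefore produces one triple per member, all sharing the same subject and expanded property.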
@@ -633,7 +632,7 @@
Examples
Simple example
- This example comprises a single annotated table containing information attributes about countries; country code, position (latitude, longitude) and name. Whilst the input tabular data file, published at http://example.org/countries.csv, includes a header line, no further metadata annotations are given. The tabular data file is provided below:
+ This example comprises a single table containing information attributes about countries; country code, position (latitude, longitude) and name. Whilst the input tabular data file, published at http://example.org/countries.csv, includes a header line, no further metadata annotations are given. The tabular data file is provided below:
As the value of propertyUrl has not been set within the metadata description it defaults to the URI Template (see [[RFC6570]]) #{[column-name]}, where [column-name] is the value of the name property for the column associated with the cell. For example, the value of propertyUrl for all cells in columnC1 ("name": "countryCode") is http://example.org/countries.csv#countryCode.
-
@@ -726,7 +722,10 @@
Simple example
data-oninclude="updateExample">
-
The aboutUrl property has not been set for cells in table T ({ "url": "http://example.org/countries.csv"}) - cells in a given row where aboutUrl has not been specified are assumed to refer to the same subject. This unspecified subject is treated as a blank node.
+
+
The about URL annotation has not been set for cells in table T ({ "url": "http://example.org/countries.csv"}) - cells in a given row where about URL has not been specified are assumed to refer to the same subject. This unspecified subject is treated as a blank node.
+
Given that the property URL is null for cells in table T ({ "url": "http://example.org/countries.csv"}), the property URL defaults to the URI Template (see [[RFC6570]]) #{[column-name]}, where [column-name] is the value of the name annotation of the column associated with the cell. For example, the value of the property URL annotation for all cells in column C1 ("name": "countryCode") is http://example.org/countries.csv#countryCode.
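The default described above amounts to resolving a fragment-only reference against the table URL. A minimal Python sketch follows; the function name default_property_url is hypothetical, and percent-encoding the column name is an assumption of this sketch.

```python
from urllib.parse import quote, urljoin


def default_property_url(table_url, column_name):
    """Resolve #{column-name} against the URL of the tabular data file."""
    # Encode the name so that it is safe to use as a fragment identifier.
    fragment = "#" + quote(column_name, safe="")
    return urljoin(table_url, fragment)


print(default_property_url("http://example.org/countries.csv", "countryCode"))
# prints http://example.org/countries.csv#countryCode
```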
+
Standard mode output for this example is provided in Turtle [[turtle]] syntax below:
@@ -738,8 +737,8 @@
Simple example
-
Even though the table was defined in isolation, the table is wrapped in a table group.
-
The type of both table and table group resources is explicitly stated; csvw:TableGroup and csvw:Table respectively.
The type of both table and group of tables objects is explicitly stated; csvw:Table and csvw:TableGroup respectively.
The csvw:url property provides reference to the original tabular data file and to specific rows therein - noting the need to escape the Turtle-syntax reserved character = (U+003D) within the fragment identifier.
The row number is provided for each row using csvw:rownum property.
A subject and row are related using the csvw:describes property.
@@ -749,7 +748,7 @@
Simple example
Example with single table and rich annotations
- This example is based on Use Case #11 - City of Palo Alto Tree Data and comprises a single annotated table describing an inventory of tree maintenance operations. The input tabular data file, published at http://example.org/tree-ops-ext.csv, and the associated metadata description http://example.org/tree-ops-ext.csv-metadata.json are provided below:
+ This example is based on Use Case #11 - City of Palo Alto Tree Data and comprises a single table describing an inventory of tree maintenance operations. The input tabular data file, published at http://example.org/tree-ops-ext.csv, and the associated metadata description http://example.org/tree-ops-ext.csv-metadata.json are provided below:
Example with single table and rich annotations
-
Annotations for the resulting tableT, with 9 columns and 3 rows, are shown below:
+
Core annotations for the resulting table T, with 9 columns and 3 rows, are shown below:
"cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay; beware of BEES"
"cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES"
"cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay; beware of BEES"
"cavity or decay", "trunk decay", "codominant leaders", "included bark", "large leader or limb decay", "previous failure root damage", "root decay", "beware of BEES"
For brevity, the propertyUrl is not shown in the table of cell annotations. Where not explicitly set, the value of propertyUrl defaults to the URI Template (see [[RFC6570]]) #{[column-name]}, where [column-name] is the value of the name property for the column associated with the cell. For example, the value of propertyUrl for all cells in columnC2 ("name": "on_street") is http://example.org/tree-ops-ext.csv#on_street.
-
-
The lists of values from cells in columnC7 ("name": "comments") are assumed to be unordered as the boolean property ordered, with default value true, has not be set within the metadata description.
+
The lists of values from cells in column C7 ("name": "comments") are assumed to be unordered as the boolean ordered annotation, which defaults to false, has not been set within the metadata description.
-
Minimal mode output for this example is provided in Turtle [[turtle]] syntax below:
Example with single table and rich annotations
-
The subject described by each row is explcitly defined using the aboutUrl property; e.g. the subject of rowR1 is http://example.org/tree-ops-ext#gid-1.
-
Output for columnC1 ({ "name": "GID" }) is not included as column property suppressOutput has value true.
The datatype property is set on columnsC5, C6, C8 and C9 ({ "name": "dbh"}, { "name": "inventory_date" }, { "name": "protected" } and { "name": "kml" }); integer, date, boolean and xml respectively. The datatype property is inherited by all cells in each of those columns, therefore the RDF output for those cells includes the appropriate datatype IRI.
+
The subject described by each row is explicitly defined using the about URL annotation; e.g. the subject of row R1 is http://example.org/tree-ops-ext#gid-1.
The datatype property is set on columns C5, C6, C8 and C9 ({ "name": "dbh"}, { "name": "inventory_date" }, { "name": "protected" } and { "name": "kml" }); integer, date, boolean and xml respectively. The datatype property is inherited by all cells in each of those columns, therefore the RDF output for those cells includes the appropriate datatype IRI.
Cells C1.7 and C2.7 (rows R1 and R2; column, { "name": "comments" }) have null values - no output is included for these cells.
-
CellC3.7 (rowR3; column, { "name": "comments" }) contains an unordered sequence of values; the set of values are included as a simple set of triples as opposed to an instance of rdf:List as the ordered property has not been specified (default is unorderd).
+
Cell C3.7 (row R3; column, { "name": "comments" }) contains an unordered sequence of values; the values are included as a simple set of triples as opposed to an instance of rdf:List as the ordered annotation has defaulted to false.
Standard mode output for this example is provided in Turtle [[turtle]] syntax below:
@@ -904,8 +908,8 @@
Example with single table and rich annotations
Table T ({ "url": "http://example.org/tree-ops-ext.csv"}) has been explicitly identified: { "@id": "<http://example.org/tree-ops-ext>"}.
-
Common properties and notes specified for tableT ({ "url": "http://example.org/tree-ops-ext.csv"}) are included in the output.
-
As the metadata description file http://example.org/tree-ops-ext.csv-metadata.json defines a default language within the context ("@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}]), all common properties of type string (e.g. dc:title, dcat:keyword, dc:publisher, dc:license and dc:modified) are expressed in the RDF output using the the appropriate language tag.
+
Non-core annotations and notes specified for table T ({ "url": "http://example.org/tree-ops-ext.csv"}) are included in the output.
+
As the metadata description file http://example.org/tree-ops-ext.csv-metadata.json defines a default language within the context ("@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}]), all non-core annotations of type string (e.g. dc:title, dcat:keyword, dc:publisher, dc:license and dc:modified) are expressed in the RDF output using the appropriate language tag.
@@ -913,7 +917,7 @@
Example with single table and rich annotations
Example with single table and using virtual columns to produce multiple subjects per row
- This example uses a single annotated table describing a listing of music events. Each row from the tabular data file corresponds to three resources; the music event itself, the location where that event occurs and the offer to sell tickets for that event. The goal is to convert the CSV content into schema.org markup that a search engine such as Google can use to index music events. Details of how Google expects this information to be structured can be found here.
+ This example uses a single table describing a listing of music events. Each row from the tabular data file corresponds to three resources; the music event itself, the location where that event occurs and the offer to sell tickets for that event. The goal is to convert the CSV content into schema.org markup that a search engine such as Google can use to index music events. Details of how Google expects this information to be structured can be found here.
The input tabular data file, published at http://example.org/events-listing.csv, and the associated metadata description http://example.org/events-listing.csv-metadata.json are provided below:
@@ -933,7 +937,7 @@
Example with single table and using virtual columns to produce mult
- The CSV to RDF translation is limited to providing one statement, or triple, per column in the table. The target schema.org markup requires 10 statements to describe each event. As the base tabular data file contains 5 columns, an additional 5 virtual columns have been added in order to provide for the full complement of statements—including the relationships between the 3 resources (event, location, and offer) described by each row of the table. Note that the virtual property is set to true for these virtual columns.
+ The CSV to RDF translation is limited to providing one statement, or triple, per column in the table. The target schema.org markup requires 10 statements to describe each event. As the base tabular data file contains 5 columns, an additional 5 virtual columns have been added in order to provide for the full complement of statements—including the relationships between the 3 resources (event, location, and offer) described by each row of the table. Note that the virtual annotation is set to true for these virtual columns.
Furthermore, note that no attempt is made to reconcile between locations or offers that may be associated with more than one event; every row in the table will create both a new location resource and offer resource in addition to the event resource. If considered necessary, applications such as OpenRefine may be used to identify and reconcile duplicate location resources once the RDF output has been generated.
@@ -961,10 +965,6 @@
Example with single table and using virtual columns to produce mult
Columns C6, C7 and C8 ({ "name": "type_event"}, { "name": "type_place"} and { "name": "type_offer"}) define the semantic types of the resources described by each row: schema:MusicEvent, schema:Place and schema:Offer respectively.
Column C9 ({ "name": "location"}) uses the aboutUrl and valueUrl to assert the relationship between the event and location resources.
Column C10 ({ "name": "offer"}) uses the aboutUrl and valueUrl to assert the relationship between the event and offer resources.
@@ -1051,15 +1051,15 @@
Example with single table and using virtual columns to produce mult
-
The resources described by each row are explcitly defined using the aboutUrl property—in this case three resources per row (event, location, and offer); the relationship between the row and each subject resource is asserted using the csvw:describes property; e.g. for rowR1 we state [] csvw:describes t1:event-1, t1:place-1, t1:offer-1 .
+
The resources described by each row are explicitly defined using the about URL annotation, in this case three resources per row (event, location, and offer); the relationship between the row and each subject resource is asserted using the csvw:describes property; e.g. for row R1 we state [] csvw:describes t1:event-1, t1:place-1, t1:offer-1 .
-
Example with table group comprising four interrelated tables
+
Example with group of tables comprising four interrelated tables
- This example is based on Use Case #4 - Publication of public sector roles and salaries and uses four annotated tables published within a table group. Information about senior roles and junior roles within a government department or organization are published in CSV format by each department. These are validated against a centrally published schema to ensure that all the data published by departments is consistent. Additionally, lists of organizations and professions are also published centrally, providing controlled vocabularies against which departmental submissions are validated.
+ This example is based on Use Case #4 - Publication of public sector roles and salaries and uses four tables published within a group of tables. Information about senior roles and junior roles within a government department or organization are published in CSV format by each department. These are validated against a centrally published schema to ensure that all the data published by departments is consistent. Additionally, lists of organizations and professions are also published centrally, providing controlled vocabularies against which departmental submissions are validated.
@@ -1134,55 +1134,66 @@
Example with table group comprising four interrelated tables
Finally, note that because the centrally published metadata descriptions are intended to be reused across many government departments and organizations, extra consideration has been given to defining URIs for the person and post resources defined in each row of the senior roles tabular data and subsequently referenced from the junior roles tabular data. To ensure that naming clashes are avoided, the unique reference for the organization to which the person or post belongs has been included in a path segment of the identifier. For example, the URI template propertyaboutUrl used to identify the senior post is specified as http://example.org/organization/{organizationRef}/post/{ref}, thus yielding the URI http://example.org/organization/hefce.ac.uk/post/90115 for the post described in the first row of the senior roles tabular data.
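The template expansion described above can be sketched in a few lines. This is a minimal illustration only: the helper `expand_simple` is hypothetical, it handles nothing beyond plain `{name}` expressions, and it assumes the cell values for organizationRef and ref are already available as strings.

```python
import re
from urllib.parse import quote


def expand_simple(template: str, variables: dict) -> str:
    """Expand only plain {name} expressions (hypothetical helper, not a full URI Template processor)."""
    def substitute(match):
        value = variables.get(match.group(1))
        # An undefined variable contributes nothing to the result.
        if value is None:
            return ""
        # Percent-encode everything outside the unreserved character set.
        return quote(str(value), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)


post_uri = expand_simple(
    "http://example.org/organization/{organizationRef}/post/{ref}",
    {"organizationRef": "hefce.ac.uk", "ref": "90115"},
)
print(post_uri)  # http://example.org/organization/hefce.ac.uk/post/90115
```

Including organizationRef in the path keeps post 90115 of one organization distinct from a post with the same reference in another.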
-
The table group generated from parsing the tabular data files and associated metadata is shown below and provides the basis for the conversion to RDF.
+
The group of tables generated from parsing the tabular data files and associated metadata is shown below and provides the basis for the conversion to RDF.
-
Annotations for the table groupG and the four tablesTa, Tb, Tc, and Td are shown below.
+
Annotations for the group of tables G and the four tables Ta, Tb, Tc, and Td are shown below.
In this example, output for the centrally published lists of organizations and professions, tables Ta and Tb (http://example.org/gov.uk/data/organizations.csv and http://example.org/gov.uk/data/professions.csv respectively), are not required; only information from the departmental submissions is to be translated to RDF. Note the suppressOutput annotation on this table.
+
In this example, output for the centrally published lists of organizations and professions, tables Ta and Tb (http://example.org/gov.uk/data/organizations.csv and http://example.org/gov.uk/data/professions.csv respectively), are not required; only information from the departmental submissions is to be translated to RDF. Note the suppress output annotation on this table.
+
+
The following foreign keys are defined:
+
id | columns in table | columns in referenced table
Fa1 | Ca3 | Ca1
Fc1 | Cc5 | Cc1
Fc2 | Cc6 | Cb1
Fc3 | Cc7 | Ca1
Fd1 | Cd1 | Cc1
Fd2 | Cd7 | Cb1
Fd3 | Cd8 | Ca1
+
Annotations for the columns, rows and cells in tableT are shown in the tables below.
Notice that valueUrl is not specified for cellsCa2.3 and Cc2.5 because in each case the cell value is null and the virtual property of columnCb5 is not specified.
+
Notice that value URL is not specified for cells Ca2.3 and Cc2.5 because in each case the cell value is null and the virtual annotation of column Cb5 is not defined.
@@ -1313,11 +1324,11 @@
Example with table group comprising four interrelated tables
-
Output for tablesTa and Tb ({ "url": "http://example.org/gov.uk/data/organizations.csv" } and { "url": "http://example.org/gov.uk/data/professions.csv" }) are not included as property suppressOutput is specified with value true for each of the tables.
ColumnsCc5 and Cd1 ({ "name": "reportsTo" } and { "name": "reportsToSenior" }) use the aboutUrl, propertyUrl and valueUrl properties to assert the relationship between the given post and the senior post it reports to for the cells therein.
-
Similarly, columnsCc7 and Cd8 (both with { "name": "organizationRef" }) use the aboutUrl, propertyUrl and valueUrl properties to assert the relationship between the given post and the organization to which it belongs for the cells those columns.
-
Finally, note that two resources are created for each row within tableTc ({ "url": "http://example.org/senior-roles.csv" }): the person and the post they occupy. The relationship between these resources is specified via virtualcolumnCc8 ({ "name": "post_holder" }) using the aboutUrl, propertyUrl and valueUrl properties.
+
Output for tables Ta and Tb ({ "url": "http://example.org/gov.uk/data/organizations.csv" } and { "url": "http://example.org/gov.uk/data/professions.csv" }) is not included as the suppress output annotation is true for each of these tables.
Columns Cc5 and Cd1 ({ "name": "reportsTo" } and { "name": "reportsToSenior" }) use the about URL, property URL and value URL annotations to assert the relationship between the given post and the senior post it reports to for the cells therein.
+
Similarly, columns Cc7 and Cd8 (both with { "name": "organizationRef" }) use the about URL, property URL and value URL annotations to assert the relationship between the given post and the organization to which it belongs for the cells in those columns.
+
Finally, note that two resources are created for each row within table Tc ({ "url": "http://example.org/senior-roles.csv" }): the person and the post they occupy. The relationship between these resources is specified via virtual column Cc8 ({ "name": "post_holder" }) using the about URL, property URL and value URL annotations.
Standard mode output for this example is provided in Turtle [[turtle]] syntax below:
@@ -1330,7 +1341,7 @@
Example with table group comprising four interrelated tables
-
Table groupG was explicitly defined, but has not been explicitly identified; the table resource is treated as a blank node.
+
Table group G was explicitly defined, but has not been explicitly identified; the table group and table resources are treated as blank nodes.
The person and post resources described by each row of table Tc ({ "url": "http://example.org/senior-roles.csv"}) are explicitly defined using the aboutUrl property; therefore, say, for row Rc1 we state [] csvw:describes <http://example.org/organization/hefce.ac.uk/post/90115>, <http://example.org/organization/hefce.ac.uk/person/1> .; whilst the aboutUrl property has not been defined for resources described by each row of table Td ({ "url": "http://example.org/junior-roles.csv"}); therefore blank nodes are used, e.g. for row Rd1 we state [] csvw:describes _:d8b8e40c-8c74-458b-99f7-64d1cf5c65f2 ..
Validation, conversion, display, and search of tabular data on the web requires additional metadata that describes how the data should be interpreted. This document defines a vocabulary for metadata that annotates tabular data. This can be used to provide metadata at various levels, from collections of data from CSV documents, and how they relate to each other down to individual cells within a table.
+
+ The metadata defined in this specification is used to provide annotations on an annotated table or group of tables, as defined in [[!tabular-data-model]]. Annotated tables form the basis for all further processing, such as validating, converting, or displaying the tables.
+
@@ -236,7 +239,6 @@
Annotating Tables
the properties name, titles, and dc:description are used to create the name, titles, and dc:description annotations on the column in the data model. The datatype property is an inherited property that affects the value of each cell in that column (see for more on inherited properties).
-
The property value of an annotation is that defined in the metadata, unless otherwise noted.
Metadata Format
@@ -244,7 +246,7 @@
Metadata Format
This section defines a set of properties and permitted values for annotating tabular data, and how these properties should be interpreted by applications.
a variable is set for each column within the schema; the name of the variable is the name of the column and the value is derived from the value of the cell in that column in the row that is currently being processed, namely one of:
+
a variable is set for each column within the schema; the name of the variable is the column name of the column from the annotated table and the value is derived from the value of the cell in that column in the row that is currently being processed, namely one of:
null
the canonical representation of the value of the cell, as defined in [[!xmlschema11-2]], if it has a single value
@@ -323,11 +325,11 @@
URI Template Properties
_sourceRow
_sourceRow is set to the source number of the row that is currently being processed; this usually varies from _row by skip rows and header rows
_name
-
_name is set to the URI decoded property value of the name property on the cell column that is currently being processed. (Percent-decoding is necessary as name may have been encoded if taken from titles; this prevents double percent-encoding.)
+
_name is set to the URI decoded column name annotation, as defined in [[!tabular-data-model]], for the column that is currently being processed. (Percent-decoding is necessary as name may have been encoded if taken from titles; this prevents double percent-encoding.)
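The reason for percent-decoding can be illustrated with a short sketch. It assumes a column whose name was derived from the title "Tree Species" (an invented title) and uses Python's `urllib.parse` to stand in for the encoding a template processor would apply during expansion.

```python
from urllib.parse import quote, unquote

# Name annotation derived from a title, already percent-encoded.
column_name = quote("Tree Species", safe="")      # "Tree%20Species"

# _name holds the decoded form ...
name_variable = unquote(column_name)              # "Tree Species"

# ... so that encoding during template expansion is applied exactly once.
expanded_once = quote(name_variable, safe="")     # "Tree%20Species"

# Using the encoded name directly would encode the "%" a second time.
expanded_twice = quote(column_name, safe="")      # "Tree%2520Species"

print(expanded_once, expanded_twice)
```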
- The property value of a URI template property is only available when processing individual cells in a table, usually while converting tables as defined in [[!tabular-data-model]]. The value is the result of:
+ The annotation value is the result of:
applying the template against the cell in that column in the row that is currently being processed
arrays — lists of numbers, booleans, strings, or objects
- The property value of a boolean atomic property is false if unset; otherwise, the property value of an atomic property is that set in metadata or null, if unset. Processors MUST raise an error if a property is set to an invalid value type, such as a boolean atomic property being set to the number 1 or a numeric atomic property being set to the string "3.1415".
+ The annotation value of a boolean atomic property is false if unset; otherwise, the annotation value of an atomic property is the normalized value of that property, or null if unset. Processors MUST raise an error if a property is set to an invalid value type, such as a boolean atomic property being set to the number 1 or a numeric atomic property being set to the string "3.1415".
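The type check described above might look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, the metadata is assumed to be already parsed from JSON into a Python dict, normalization is omitted, and `exampleCount` is an invented numeric property used purely for illustration.

```python
def boolean_annotation(metadata: dict, prop: str) -> bool:
    """Annotation value of a boolean atomic property: False if unset."""
    if prop not in metadata:
        return False
    value = metadata[prop]
    if not isinstance(value, bool):
        raise ValueError(f"{prop} must be a boolean, got {value!r}")
    return value


def numeric_annotation(metadata: dict, prop: str):
    """Annotation value of a numeric atomic property: None (null) if unset."""
    if prop not in metadata:
        return None
    value = metadata[prop]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{prop} must be a number, got {value!r}")
    return value


print(boolean_annotation({}, "virtual"))                   # False
print(numeric_annotation({"exampleCount": 2}, "exampleCount"))  # 2
```

Calling `boolean_annotation({"virtual": 1}, "virtual")` or `numeric_annotation({"exampleCount": "3.1415"}, "exampleCount")` raises `ValueError`, mirroring the two invalid cases named in the text.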
@@ -664,7 +666,7 @@
Optional Properties
notes
- An array property that provides an array of objects representing arbitrary annotations on the annotated tabular data model. The value of this property becomes the value of the notes annotation for the annotated table. The properties on these objects are interpreted equivalently to common properties as described in . When an array of note objects B is merged into an original array of note objects A, each note object from B is appended into the array A.
+ An array property that provides an array of objects representing arbitrary annotations on the annotated tabular data model. The value of this property becomes the value of the notes annotation for the table. The properties on these objects are interpreted equivalently to common properties as described in . When an array of note objects B is merged into an original array of note objects A, each note object from B is appended into the array A.
The Web Annotation Working Group is developing a vocabulary for expressing annotations. In future versions of this specification, we anticipate referencing that vocabulary.
@@ -679,7 +681,7 @@
Optional Properties
tableSchema
An object property that provides a single schema description as described in . This may be provided as an embedded object within the JSON metadata or as a URL reference to a separate JSON schema document. If a table description is within a table group description, the tableSchema from that table group acts as the default for this property.
As defined for table groups. The value of this property becomes the value of the transformations annotation for this table.
@@ -703,10 +705,10 @@
Optional Properties
Schemas
- A tableSchema is a definition of a tabular format that may be common to multiple tables. For example, multiple tables from different sources may have the same columns and be designed such that they can be aggregated together.
+ A Schema is a definition of a tabular format that may be common to multiple tables. For example, multiple tables from different sources may have the same columns and be designed such that they can be aggregated together.
- A schema description is a JSON object that encodes the information about a schema, which describes the structure of a table. All the properties of a schema description are optional.
+ A schema description is a JSON object that encodes the information about a schema, which describes the structure of a table. All the properties of a schema description are optional.
columns
@@ -755,11 +757,11 @@
Schemas
resource
-
A link property holding a URL that is the identifier for a specific table that is being referenced. If this property is present then schemaReference MUST NOT be present. The table group MUST contain a table whose url annotation is identical to the property value of this property. That table is the referenced table.
+
A link property holding a URL that is the identifier for a specific table that is being referenced. If this property is present then schemaReference MUST NOT be present. The table group MUST contain a table whose url annotation is identical to the expanded value of this property. That table is the referenced table.
schemaReference
-
A link property holding a URL that is the identifier for a schema that is being referenced. If this property is present then resource MUST NOT be present. The table group MUST contain a table with a tableSchema having a @id that is identical to the property value of this property, and there MUST NOT be more than one such table. That table is the referenced table.
+
A link property holding a URL that is the identifier for a schema that is being referenced. If this property is present then resource MUST NOT be present. The table group MUST contain a table with a tableSchema having a @id that is identical to the expanded value of this property, and there MUST NOT be more than one such table. That table is the referenced table.
columnReference
@@ -767,7 +769,7 @@
Schemas
- The value of this property is used to create the value of the foreign keys annotation on the table using this schema by creating a list of foreign keys comprising a list of columns in the table and a list of columns in the referenced table. The value of this property is also used to create the value of the referenced rows annotation on each of the rows in the table that uses this schema, which is a pair of the relevant foreign key and the referenced row in the referenced table.
+ The value of this property becomes the foreign keys annotation on the table using this schema by creating a list of foreign keys comprising a list of columns in the table and a list of columns in the referenced table. The value of this property is also used to create the value of the referenced rows annotation on each of the rows in the table that uses this schema, which is a pair of the relevant foreign key and the referenced row in the referenced table.
As defined in [[!tabular-data-model]], validators MUST check that, for each row, the combination of cells in the referencing columns reference a unique row within the referenced table through a combination of cells in the referenced columns. For examples, see and .
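A validator's foreign key check could be sketched as follows. The function `check_foreign_key` is hypothetical, rows are assumed to be dicts keyed by column name, the sample data is invented, and the treatment of null cells is not covered here.

```python
from collections import Counter


def check_foreign_key(rows, columns, referenced_rows, referenced_columns):
    """Return the 0-based indexes of rows that do not reference exactly one row."""
    # Count how often each combination of values occurs in the referenced columns.
    targets = Counter(
        tuple(ref[c] for c in referenced_columns) for ref in referenced_rows
    )
    failures = []
    for index, row in enumerate(rows):
        key = tuple(row[c] for c in columns)
        # A valid reference matches one, and only one, referenced row.
        if targets[key] != 1:
            failures.append(index)
    return failures


senior_roles = [{"ref": "90115"}, {"ref": "90334"}]
junior_roles = [{"reportsToSenior": "90115"}, {"reportsToSenior": "99999"}]

failures = check_foreign_key(junior_roles, ["reportsToSenior"], senior_roles, ["ref"])
print(failures)  # [1]
```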
@@ -784,7 +786,7 @@
Schemas
primaryKey
- A column reference property that holds either a single reference to a column description object or an array of references. The value of this property is used to create the value of the primary key annotation for each row within a table that uses this schema by creating a list of the cells in that row that are in the referenced columns.
+ A column reference property that holds either a single reference to a column description object or an array of references. The value of this property becomes the primary key annotation for each row within a table that uses this schema by creating a list of the cells in that row that are in the referenced columns.
As defined in [[!tabular-data-model]], validators MUST check that each row has a unique combination of cells in the indicated columns. For example, if primaryKey is set to ["familyName", "givenName"] then every row must have a unique value for the combination of the familyName and givenName columns.
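The uniqueness check for the ["familyName", "givenName"] example can be sketched like this. The function name and the sample rows are illustrative assumptions; rows are taken to be dicts keyed by column name.

```python
def duplicate_primary_keys(rows, primary_key):
    """Return (first_row, duplicate_row) pairs of 1-based row numbers sharing a key."""
    # primaryKey may be a single column reference or an array of references.
    columns = [primary_key] if isinstance(primary_key, str) else list(primary_key)
    seen = {}
    duplicates = []
    for number, row in enumerate(rows, start=1):
        key = tuple(row[c] for c in columns)
        if key in seen:
            duplicates.append((seen[key], number))
        else:
            seen[key] = number
    return duplicates


people = [
    {"familyName": "Smith", "givenName": "Ann"},
    {"familyName": "Smith", "givenName": "Bob"},
    {"familyName": "Smith", "givenName": "Ann"},
]
clashes = duplicate_primary_keys(people, ["familyName", "givenName"])
print(clashes)  # [(1, 3)]
```

Rows 1 and 2 share a family name but differ in given name, so only the combination in rows 1 and 3 is reported.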
@@ -977,27 +979,26 @@
Columns
name
- An atomic property that gives a single canonical name for the column. The value of this property is used to create the value of the name annotation for the described column. This MUST be a string. Conversion specifications MUST use this property as the basis for the names of properties/elements/attributes in the results of conversions.
+ An atomic property that gives a single canonical name for the column. The value of this property becomes the name annotation for the described column. This MUST be a string. Conversion specifications MUST use this property as the basis for the names of properties/elements/attributes in the results of conversions.
For ease of reference within URI template properties, column names are restricted as defined in Variables in [[!URI-TEMPLATE]] with the additional provision that names beginning with "_" are reserved by this specification and MUST NOT be used.
-
The property value of name is that defined within metadata, if it exists. Otherwise, it is the first value from the property value of titles, having the same language tag as default language or und if not specified, percent-encoded as necessary to conform to the syntactic requirements as a string without language, as defined in [[!RFC3986]]. Otherwise, it is the string "_col.[N]" where [N] is the column number.
-
suppressOutput
-
A boolean atomic property. If true, suppresses any output that would be generated when converting cells in this column. The value of this property is used to create the value of the suppress output annotation for the described column.
+
A boolean atomic property. If true, suppresses any output that would be generated when converting cells in this column. The value of this property becomes the suppress output annotation for the described column.
titles
- A natural language property that provides possible alternative names for the column. The value of this property is used to create the value of the titles annotation for the described column.
+ A natural language property that provides possible alternative names for the column. The value of this property becomes the titles annotation for the described column.
+
If there is no name property defined on this column, the first titles string value having the same language tag as the default language (or und if no default language is specified) becomes the name annotation for the described column. This annotation MUST be percent-encoded as necessary to conform to the syntactic requirements defined in [[!RFC3986]].
virtual
-
A boolean atomic property taking a single value which indicates whether the column is a virtual column not present in the original source. The value of this property is used to create the value of the virtual annotation for the described column. If present, a virtual column MUST appear after all other non-virtual column definitions.
+
A boolean atomic property taking a single value which indicates whether the column is a virtual column not present in the original source. The normalized value of this property becomes the virtual annotation for the described column. If present, a virtual column MUST appear after all other non-virtual column definitions.
Virtual columns are useful for inserting cells with default values into an annotated table to control the results of conversions.
We invite comment on whether virtual columns are useful enough to include in the final recommendation in spite of the added complexity.
@@ -1012,6 +1013,7 @@
Columns
+
If the column description has neither name nor titles properties, the string "_col.[N]" where [N] is the column number, becomes the name annotation for the described column.
The description MAY contain any common properties to provide extra metadata about the column as a whole, such as a full description.
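The rules for arriving at the name annotation (name, else the first matching title, else "_col.[N]") can be sketched as below. Several points are assumptions: the function `column_name` is hypothetical, titles is taken to be already normalized into a mapping from language tag to a list of strings, all characters outside the unreserved set are percent-encoded, and a column with titles but none in the relevant language falls back to "_col.[N]", which the text does not address.

```python
from urllib.parse import quote


def column_name(description: dict, number: int, default_language: str = "und") -> str:
    """Derive the name annotation for a column description (illustrative sketch)."""
    if "name" in description:
        return description["name"]
    # Assumed shape: {"en": ["Tree Species", ...], "und": [...]}
    titles = description.get("titles", {})
    candidates = titles.get(default_language, [])
    if candidates:
        return quote(candidates[0], safe="")
    # Neither name nor a usable title: fall back to the column number.
    return f"_col.{number}"


print(column_name({"name": "GID"}, 1))                               # GID
print(column_name({"titles": {"en": ["Tree Species"]}}, 2, "en"))    # Tree%20Species
print(column_name({}, 3))                                            # _col.3
```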
@@ -1088,7 +1090,7 @@
Inherited Properties
aboutUrl
-
A URI template property that MAY be used to create a unique identifier for each cell within a row when mapping data to other formats. There are no compatibility restrictions on this property. The value of this property is used to create the value of the about URL annotation for the described column, and the about URL annotation for the cell.
+
A URI template property that MAY be used to create a unique identifier for each cell within a row when mapping data to other formats. There are no compatibility restrictions on this property. The value of this property becomes the about URL annotation for the described column.
A value for this property is compatible with an inherited value if they are identical, or if the value is a subtype within the datatype hierarchy defined in , including if the inherited value is explicitly specified as the base of this value.
-
The value of this property is used to create the value of the datatype annotation for the described column.
We invite comment on whether datatype should allow for a "union" of types for a cell; this would allow for a set of datatypes that could be matched against the string value of a cell, choosing the first match; e.g., to match either a date or datetime.
default
- An atomic property holding a single string that is used to create a default value for the cell in cases where the original string value is an empty string. This default value MAY be used when converting the table into other formats, or when the table is displayed. If not specified, the default for the default property is the empty string, "". A value for this property is compatible with an inherited value only if they are identical. The value of this property is used to create the value of the default annotation for the described column.
+ An atomic property holding a single string that is used to create a default value for the cell in cases where the original string value is an empty string. This default value MAY be used when converting the table into other formats, or when the table is displayed. If not specified, the default for the default property is the empty string, "". A value for this property is compatible with an inherited value only if they are identical. The value of this property becomes the default annotation for the described column.
lang
- An atomic property giving a single string language code as defined by [[!BCP47]]. Indicates the language of the value within the cell. A value for this property is compatible with an inherited value if it is a sub-language of the inherited value; for example en-US is compatible with en but not fr. The value of this property is used to create the value of the lang annotation for the described column.
+ An atomic property giving a single string language code as defined by [[!BCP47]]. Indicates the language of the value within the cell. A value for this property is compatible with an inherited value if it is a sub-language of the inherited value; for example en-US is compatible with en but not fr. The value of this property becomes the lang annotation for the described column.
null
- An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null. If not specified, the default for the null property is the empty string. A value for this property is compatible with an inherited value if it is a subset of the inherited value. The value of this property is used to create the value of the null annotation for the described column.
+ An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null. If not specified, the default for the null property is the empty string. A value for this property is compatible with an inherited value if it is a subset of the inherited value. The value of this property becomes the null annotation for the described column.
ordered
-
A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true) or unordered (if false). The default is false. This property is irrelevant if the separator is null or undefined, but this is not an error. A value for this property is compatible with an inherited value only if they are identical. The value of this property is used to create the value of the ordered annotation for the described column, and the ordered annotation for the described cell.
+
A boolean atomic property taking a single value which indicates whether a list that is the value of the cell is ordered (if true) or unordered (if false). The default is false. This property is irrelevant if the separator is null or undefined, but this is not an error. A value for this property is compatible with an inherited value only if they are identical. The value of this property becomes the ordered annotation for the described column, and the ordered annotation for the described cell.
propertyUrl
-
An URI template property that MAY be used to create a URI for a property if the table is mapped to another format. There are no compatibility restrictions on this property. The value of this property is used to create the value of the property URL annotation for the described column, and the property URL annotation for the cell.
+
A URI template property that MAY be used to create a URI for a property if the table is mapped to another format. There are no compatibility restrictions on this property. The value of this property becomes the property URL annotation for the described column.
propertyUrl is typically defined on a column description. If defined on a schema description, table description or table group description, care must be taken to ensure that transformed cell values maintain an appropriate semantic relationship, for example by including the name of the column in the generated URL by using _name in the template.
required
-
A boolean atomic property taking a single value which indicates whether the cell must have a non-null value. The default is false. A value for this property is compatible with an inherited value only if they are identical. The value of this property is used to create the value of the required annotation for the described column.
+
A boolean atomic property taking a single value which indicates whether the cell must have a non-null value. The default is false. A value for this property is compatible with an inherited value only if they are identical. The value of this property becomes the required annotation for the described column.
separator
- An atomic property that MUST have a single string value that is the character used to separate items in the string value of the cell. If null or unspecified, the cell does not contain a list. Otherwise, application MUST split the string value of the cell on the specified separator character and parse each of the resulting strings separately. The cell's value will then be a list. Conversion specifications MUST use the separator to determine the conversion of a cell into the target format. See for more details. A value for this property is compatible with an inherited value only if they are identical. The value of this property is used to create the value of the separator annotation for the described column.
+ An atomic property that MUST have a single string value that is the character used to separate items in the string value of the cell. If null or unspecified, the cell does not contain a list. Otherwise, applications MUST split the string value of the cell on the specified separator character and parse each of the resulting strings separately. The cell's value will then be a list. Conversion specifications MUST use the separator to determine the conversion of a cell into the target format. See Parsing Cells in [[!tabular-data-model]] for more details. A value for this property is compatible with an inherited value only if they are identical. The value of this property becomes the separator annotation for the described column.
textDirection
- An atomic property that MUST have a single string value that is one of "rtl" or "ltr" (the default). Indicates whether the text within cells should be displayed by default as left-to-right or right-to-left text. The value of this property is used to create the value of the text direction annotation for the column, and the text direction annotation for the cell. A value for this property is compatible with an inherited value only if they are identical. See Bidirectional Tables in [[!tabular-data-model]] for details.
+ An atomic property that MUST have a single string value that is one of "rtl" or "ltr" (the default). Indicates whether the text within cells should be displayed by default as left-to-right or right-to-left text. The value of this property becomes the text direction annotation for the column. A value for this property is compatible with an inherited value only if they are identical. See Bidirectional Tables in [[!tabular-data-model]] for details.
valueUrl
-
A URI template property that is used to map the values of cells into URLs. See for details. There are no compatibility restrictions on this property. The value of this property is used to create the value of the value URL annotation for the described column, and the value URL annotation for the cell.
This allows the value of a cell to identify one or more RDF resources instead of being a literal value, as defined in [[rdf-concepts]]. For example, if the value were "{#reference}", each cell value of a column named reference would be used to create a URI such as http://example.com/#1234, if 1234 were a cell value of that column.
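As a rough illustration of the "{#reference}" example, the following Python sketch expands only the fragment form of a URI template and resolves the result against a base URL. The function name expand_fragment_template and the base URL http://example.com/ are assumptions made for illustration; this is not a full URI template processor and it ignores every other expression type.

```python
import re
from urllib.parse import quote, urljoin


def expand_fragment_template(template, row, base):
    """Expand `{#name}` expressions with values from `row`, then resolve against `base`.

    Hypothetical helper: only the fragment form `{#name}` is handled.
    """

    def substitute(match):
        value = row.get(match.group(1))
        if value is None:
            # An undefined variable contributes nothing to the expansion.
            return ""
        # Fragment expansion leaves reserved characters unescaped.
        return "#" + quote(str(value), safe=":/?#[]@!$&'()*+,;=")

    expanded = re.sub(r"\{#(\w+)\}", substitute, template)
    return urljoin(base, expanded)


# A row whose "reference" column holds the cell value 1234.
row = {"reference": "1234"}
url = expand_fragment_template("{#reference}", row, "http://example.com/")
print(url)
```

With the row shown, the sketch yields the URI given in the example text.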
Dialect descriptions do not provide a mechanism for handling CSV files in which there are multiple tables within a single file (eg separated by empty lines).
- The default dialect description for CSV files is:
-
+
+ The default dialect description for CSV files is:
+
{
@@ -1512,7 +1512,7 @@
Example
Datatypes
- Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. See for details of how string values are parsed against datatypes.
+ Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell. See [[!tabular-data-model]] for a description of annotations on a datatype.
Built-in Datatypes
@@ -1521,10 +1521,10 @@
Built-in Datatypes
the datatypes defined in [[!xmlschema11-2]] as derived from and including anyAtomicType
-
the datatype number which is exactly equivalent to double
-
the datatype binary which is exactly equivalent to base64Binary
-
the datatype datetime which is exactly equivalent to dateTime
-
the datatype any which is exactly equivalent to anyAtomicType
+
the datatype number which is mapped to double in the data model
+
the datatype binary which is mapped to base64Binary in the data model
+
the datatype datetime which is mapped to dateTime in the data model
+
the datatype any which is mapped to anyAtomicType in the data model
the datatype xml, a sub-type of string, which indicates the value is an XML fragment
the datatype html, a sub-type of string, which indicates the value is an HTML fragment
the datatype json, a sub-type of string, which indicates the value is serialized JSON
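The four shorthand names listed above can be pictured as a simple lookup table. The sketch below is illustrative only: the names DATATYPE_ALIASES and resolve_datatype are assumptions, and any name that is not one of the four shorthands is passed through unchanged.

```python
# Shorthand datatype names and the datatypes they are mapped to in the data model.
DATATYPE_ALIASES = {
    "number": "double",
    "binary": "base64Binary",
    "datetime": "dateTime",
    "any": "anyAtomicType",
}


def resolve_datatype(name):
    """Return the datatype that `name` maps to, or `name` itself if it is not a shorthand."""
    return DATATYPE_ALIASES.get(name, name)


print(resolve_datatype("number"))
print(resolve_datatype("integer"))
```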
@@ -1543,67 +1543,67 @@
Derived Datatypes
base
- An atomic property that contains a single string: a term defined in the default context representing a built-in datatype URL, as listed above. If this property is missing, its default is string. All values of the datatype MUST be valid values of the base datatype.
+ An atomic property that contains a single string: a term defined in the default context representing a built-in datatype URL, as listed above. If this property is missing, its default is string. All values of the datatype MUST be valid values of the base datatype. The value of this property becomes the base annotation for the described datatype.
format
- An atomic property that contains either a single string or an object that defines the format of a value of this type, used when parsing a string value as described in .
+ An atomic property that contains either a single string or an object that defines the format of a value of this type, used when parsing a string value as described in Parsing Cells in [[!tabular-data-model]]. The value of this property becomes the format annotation for the described datatype.
length
- A numeric atomic property that contains a single integer that is the exact length of the value. See for details.
+ A numeric atomic property that contains a single integer that is the exact length of the value. The value of this property becomes the length annotation for the described datatype. See Length Constraints in [[!tabular-data-model]] for details.
minLength
- An atomic property that contains a single integer that is the minimum length of the value. See for details.
+ An atomic property that contains a single integer that is the minimum length of the value. The value of this property becomes the minimum length annotation for the described datatype. See Length Constraints in [[!tabular-data-model]] for details.
maxLength
- A numeric atomic property that contains a single integer that is the maximum length of the value. See for details.
+ A numeric atomic property that contains a single integer that is the maximum length of the value. The value of this property becomes the maximum length annotation for the described datatype. See Length Constraints in [[!tabular-data-model]] for details.
minimum
- An atomic property that contains a single number that is the minimum valid value (inclusive); equivalent to minInclusive. See for details.
+ An atomic property that contains a single number or string that is the minimum valid value (inclusive); equivalent to minInclusive. The value of this property becomes the minimum annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
maximum
- An atomic property that contains a single number that is the maximum valid value (inclusive); equivalent to maxInclusive. See for details.
+ An atomic property that contains a single number or string that is the maximum valid value (inclusive); equivalent to maxInclusive. The value of this property becomes the maximum annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
minInclusive
- An atomic property that contains a single number that is the minimum valid value (inclusive). See for details.
+ An atomic property that contains a single number or string that is the minimum valid value (inclusive). The value of this property becomes the minimum annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
maxInclusive
- An atomic property that contains a single number that is the maximum valid value (inclusive). See for details.
+ An atomic property that contains a single number or string that is the maximum valid value (inclusive). The value of this property becomes the maximum annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
minExclusive
- An atomic property that contains a single number that is the minimum valid value (exclusive). See for details.
+ An atomic property that contains a single number or string that is the minimum valid value (exclusive). The value of this property becomes the minimum exclusive annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
maxExclusive
- An atomic property that contains a single number that is the maximum valid value (exclusive). See for details.
+ An atomic property that contains a single number or string that is the maximum valid value (exclusive). The value of this property becomes the maximum exclusive annotation for the described datatype. See Value Constraints in [[!tabular-data-model]] for details.
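The relationship between the inclusive, exclusive, and alias properties can be sketched as follows. This is a minimal illustration that assumes a datatype description has already been loaded into a Python dict and that the bounds are plain numbers; the function name within_bounds is hypothetical, and string-valued bounds (for example for dates) and conflicts between properties are not handled.

```python
def within_bounds(value, datatype):
    """Check a numeric `value` against the value constraints in `datatype` (a dict)."""
    # `minimum` and `maximum` are treated as aliases of the inclusive bounds.
    min_inclusive = datatype.get("minInclusive", datatype.get("minimum"))
    max_inclusive = datatype.get("maxInclusive", datatype.get("maximum"))
    min_exclusive = datatype.get("minExclusive")
    max_exclusive = datatype.get("maxExclusive")

    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True


# A derived datatype accepting integers from 1 (inclusive) up to 10 (exclusive).
datatype = {"base": "integer", "minimum": 1, "maxExclusive": 10}
results = [within_bounds(v, datatype) for v in (0, 1, 9, 10)]
print(results)
```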
- The length, minLength and maxLength properties indicate the exact, minimum and maximum lengths of values of a datatype.
-
-
- Applications MUST raise an error if both length and minLength are specified and they do not have the same value. Similarly, applications MUST raise an error if both length and maxLength are specified and they do not have the same value. Applications MUST raise an error if length, maxLength, or minLength are specified and the base datatype is not string or one of its subtypes, or a binary type.
-
-
- The length of a value is determined as follows:
-
-
-
if the value is null, its length is zero
-
if the value is a string or one of its subtypes, its length is the number of characters in the value
-
if the value is of a binary type, its length is the number of bytes in the binary value
-
-
-
-
Value Constraints
-
- The minimum, maximum, minInclusive, maxInclusive, minExclusive, and maxExclusive properties indicate limits on values of a datatype. These apply to numeric, date/time, and duration types.
-
-
- In all ways, including the errors described below, the minimum property is equivalent to the minInclusive property and the maximum property is equivalent to the maxInclusive property. Applications MUST raise an error if both minimum and minInclusive are specified and they do not have the same value. Similarly, applications MUST raise an error if both maximum and maxInclusive are specified and they do not have the same value.
-
-
- Applications MUST raise an error if both minInclusive and minExclusive are specified, or if both maxInclusive and maxExclusive are specified. Applications MUST raise an error if both minInclusive and maxInclusive are specified and maxInclusive is less than minInclusive, or if both minInclusive and maxExclusive are specified and maxExclusive is less than or equal to minInclusive. Similarly, applications MUST raise an error if both minExclusive and maxExclusive are specified and maxExclusive is less than minExclusive, or if both minExclusive and maxInclusive are specified and maxInclusive is less than or equal to minExclusive.
-
-
- Applications MUST raise an error if minimum, minInclusive, maximum, maxInclusive, minExclusive, or maxExclusive are specified and the base datatype is not a numeric, date/time, or duration type.
-
-
- Validation against these properties is as defined in [[!xmlschema11-2]].
-
-
-
-
-
-
Parsing cells
-
- Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The null, required, default, separator, datatype, and lang properties provide the information needed to parse the string value of a cell into its (semantic) value. This is used:
-
-
-
by validators to check that the data in the table is in the expected format,
-
by converters to parse the values before mapping them into values in the target of the conversion,
-
when displaying data, to map it into formats that are meaningful for those viewing the data (as opposed to those publishing it), and
-
when inputting data, to turn entered values into representations in a consistent format.
a single value with an associated optional datatype or language, or
-
a sequence of such values.
-
-
- The process of parsing the string value of a cell into a single value or a list of values is as follows:
-
-
-
unless the datatype is string, json, xml, html, anyAtomicType, or any, replace all carriage return (#xD), line feed (#xA), and tab (#x9) characters with space characters.
-
unless the datatype is string, json, xml, html, anyAtomicType, any, or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
-
if the resulting string is an empty string, apply the remaining steps to the string given by the default property.
-
if the separator property is not null and the resulting string is an empty string, the cell value is an empty list. If the required property is true, add an error to the list of errors for the cell.
-
if the separator property is not null, the cell value is a list of values created by:
-
-
if the normalized string is an empty string, apply the remaining steps to the string given by the default property.
-
if the normalized string is the same as any one of the values of the null property, then the resulting value is null.
-
split the normalized string at the character specified by the separator property.
-
unless the datatype is string, anyAtomicType, or any, strip leading and trailing whitespace from these strings.
-
applying the remaining steps to each of the strings in turn.
-
-
-
if the string is an empty string, apply the remaining steps to the string given by the default property.
-
if the string is the same as any one of the values of the null property, then the resulting value is null. If the separator property is null and the required property is true, add an error to the list of errors for the cell.
-
validate the string based on the datatype, using the format property if one is specified, as described below, and then against the constraints described in ; if there are any errors, add them to the list of errors for the cell; the resulting value is typed as a string with the language provided by the lang property.
-
otherwise, if there are no errors, parse the string using the format if one is specified, as described below; the resulting value is typed according to the datatype and if the datatype is string, or there is no datatype, it has the language provided by the lang property.
-
-
-
Parsing examples
-
- When no metadata is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".
-
-
- If a datatype is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:
-
-
-"datatype": "integer"
-
-
- for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value, but also have a validation error listed in its errors annotation.
-
-
- Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:
-
- In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.
-
-
- Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:
-
- In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string, anyAtomicType, or any, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.
-
-
- Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:
-
- and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.
-
-
- Also, with this configuration, if the string value of the cell were "" (ie it was an empty cell) the value of the cell would be an empty list.
-
-
- A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value"1 5 7.0" were in a column named values, defined with:
-
- then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.
-
-
-
-
Formats for numeric types
-
- It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
-
-
- If the datatype is a numeric type, the format property indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:
-
-
-
decimalChar
-
An atomic property containing a single character string whose value is used to represent a decimal point within the number. The default value is ".".
-
groupChar
-
An atomic property containing a single character string whose value is used to group digits within the number. The default value is ",".
-
pattern
-
An atomic property containing a regular expression string, in the syntax and interpreted as defined by [[!ECMASCRIPT]].
-
-
- Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
-
-
- If the format property is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.
-
-
- When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:
-
-
-
an optional + or - sign,
-
followed by a decimal digit (0-9),
-
followed by any number of decimal digits (0-9) and the character specified as the groupChar,
-
followed by an optional decimalChar followed by one or more decimal digits (0-9),
-
followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9), or
-
followed by an optional percent (%) or per-mille (‰) sign.
-
-
- or that are one of the special values:
-
-
-
NaN,
-
INF, or
-
-INF.
-
-
- Implementations MUST add a validation error to the errors annotation for the cell if the string being parsed:
-
does not match the regular expression defined in the pattern property, if there is one,
-
contains the decimalChar, if the datatype is integer or one of its sub-values,
-
contains an exponent, if the datatype is decimal or one of its sub-values, or
-
is one of the special values NaN, INF, or -INF, if the datatype is decimal or one of its sub-values.
-
-
- Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.
-
-
-
-
Formats for booleans
-
- Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.
-
-
- If the datatype for a cell is boolean, the format property provides the true and false values expected, separated by |. For example if format is Y|N then cells must hold either Y or N with Y meaning true and N meaning false.
-
-
- The resulting cell value will be one or more boolean true or false values.
-
-
-
-
Formats for dates and times
-
- Dates and times are commonly represented in tabular data in formats other than those defined in [[!xmlschema11-2]].
-
-
- If the datatype is a date or time type, the format property indicates the expected format for that date or time.
-
-
- The supported date and time formats listed here are expressed in terms of the date field symbols defined in [[!UAX35]] and MUST be interpreted by implementations as defined in that specification.
-
-
- The following date formats MUST be recognised by implementations:
-
-
-
yyyy-MM-dd e.g., 2015-03-22
-
yyyyMMdd e.g., 20150322
-
dd-MM-yyyy e.g., 22-03-2015
-
d-M-yyyy e.g., 22-3-2015
-
MM-dd-yyyy e.g., 03-22-2015
-
M-d-yyyy e.g., 3-22-2015
-
dd/MM/yyyy e.g., 22/03/2015
-
d/M/yyyy e.g., 22/3/2015
-
MM/dd/yyyy e.g., 03/22/2015
-
M/d/yyyy e.g., 3/22/2015
-
dd.MM.yyyy e.g., 22.03.2015
-
d.M.yyyy e.g., 22.3.2015
-
MM.dd.yyyy e.g., 03.22.2015
-
M.d.yyyy e.g., 3.22.2015
-
-
- The following time formats MUST be recognised by implementations:
-
-
-
HH:mm:ss e.g., 15:02:37
-
HHmmss e.g., 150237
-
HH:mm e.g., 15:02
-
HHmm e.g., 1502
-
-
- The following date/time formats MUST be recognised by implementations:
-
-
-
yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
-
yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
-
any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37
-
-
- Implementations MUST also recognise date, time, and date/time formats that end with timezone markers consisting of between one and three xs or Xs, possibly after a single space. These MUST be interpreted as follows:
-
-
-
X e.g., -08, +0530, or Z (minutes are optional)
-
XX e.g., -0800, +0530, or Z
-
XXX e.g., -08:00, +05:30, or Z
-
x e.g., -08 or +0530 (Z is not permitted)
-
xx e.g., -0800 or +0530 (Z is not permitted)
-
xxx e.g., -08:00 or +05:30 (Z is not permitted)
-
-
- For example, formats could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.
-
-
- The cell value will one or more dates/time values extracted using the format.
-
-
- For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.
-
-
-
-
Formats for durations
-
- Durations MUST be formatted and interpreted as defined in [[!xmlschema11-2]], using the [[!ISO8601]] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.
-
-
- If the datatype is a duration type, the format property provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].
-
-
- Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
-
-
- The cell value will be one or more durations extracted using the format.
-
-
-
-
Formats for other types
-
- If the datatype is not numeric, boolean, a date/time type, or a duration type, the format property provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].
-
-
- Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
-
-
- Values that are labelled as html, xml, or json are not validated against those formats.
-
-
- Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the format property to specify a regular expression against which such values will be tested.
-
diff --git a/publishing-snapshots/WD-syntax-2015-04/datatypes.png b/publishing-snapshots/WD-syntax-2015-04/datatypes.png
new file mode 100644
index 00000000..a907a8cb
Binary files /dev/null and b/publishing-snapshots/WD-syntax-2015-04/datatypes.png differ
diff --git a/publishing-snapshots/WD-syntax-2015-04/datatypes.svg b/publishing-snapshots/WD-syntax-2015-04/datatypes.svg
new file mode 100644
index 00000000..2e7059ef
--- /dev/null
+++ b/publishing-snapshots/WD-syntax-2015-04/datatypes.svg
@@ -0,0 +1,2730 @@
+
+
+
diff --git a/syntax/datatypes.key b/syntax/datatypes.key
new file mode 100644
index 00000000..f1bbbbe7
Binary files /dev/null and b/syntax/datatypes.key differ
diff --git a/syntax/datatypes.pdf b/syntax/datatypes.pdf
new file mode 100644
index 00000000..e905a00f
Binary files /dev/null and b/syntax/datatypes.pdf differ
diff --git a/syntax/datatypes.png b/syntax/datatypes.png
new file mode 100644
index 00000000..a907a8cb
Binary files /dev/null and b/syntax/datatypes.png differ
diff --git a/syntax/datatypes.svg b/syntax/datatypes.svg
new file mode 100644
index 00000000..2e7059ef
--- /dev/null
+++ b/syntax/datatypes.svg
@@ -0,0 +1,2730 @@
+
+
+
diff --git a/syntax/index.html b/syntax/index.html
index 37202f6a..f534b419 100644
--- a/syntax/index.html
+++ b/syntax/index.html
@@ -111,10 +111,10 @@
- Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model or infoset for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.
+ Tabular data is routinely transferred on the web in a variety of formats, including variants on CSV, tab-delimited files, fixed field formats, spreadsheets, HTML tables, and SQL dumps. This document outlines a data model, or infoset, for tabular data and metadata about that tabular data that can be used as a basis for validation, display, or creating other formats. It also contains some non-normative guidance for publishing tabular data as CSV and how that maps into the tabular data model.
- An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The syntax for that metadata is defined in [[!tabular-metadata]].
+ An annotated model of tabular data can be supplemented by separate metadata about the table. This specification defines how implementations should locate that metadata, given a file containing tabular data. The standard syntax for that metadata is defined in [[!tabular-metadata]]. Note, however, that applications may have other means to create annotated tables, e.g., through some application-specific APIs; this model does not depend on the specifics described in [[!tabular-metadata]].
@@ -165,7 +165,7 @@
Introduction
Tabular Data Models
- This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.
+ This section defines an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables with which they are associated. The values of these annotations may be lists, structured objects, or atomic values. Core annotations are those that affect the behavior of processors defined in this specification, but other annotations may also be present on any of the components of the model.
Annotations may be described directly in [[!tabular-metadata]], be embedded in a tabular data file, or created during the process of generating an annotated table.
@@ -180,8 +180,8 @@
Table groups
id — an identifier for this group of tables, or null if this is undefined.
-
notes — a list of notes on the group of tables, as described in [[!tabular-metadata]], which may be an empty list.
-
resources — the list of tables in the group of tables. A group of tables MUST have one or more tables.
+
notes — any number of additional annotations on the group of tables. This annotation may be empty.
+
tables — the list of tables in the group of tables. A group of tables MUST have one or more tables.
Groups of tables MAY in addition have any number of annotations which provide information about the group of tables. Annotations on a group of tables may include:
@@ -191,9 +191,9 @@
Table groups
information about the source or provenance of the group of tables.
links to other groups of tables (eg to those that provide similar data from a different time period).
-
- These arise from common properties defined on table group descriptions within metadata documents, as defined in [[!tabular-metadata]].
-
+
+
When originating from [[!tabular-metadata]], these annotations arise from common properties defined on table group descriptions within metadata documents.
+
Tables
@@ -205,7 +205,7 @@
Tables
direction — the direction in which the columns in the table should be displayed, as described in .
foreign keys — a list of foreign keys on the table, as defined in [[!tabular-metadata]], which may be an empty list.
id — an identifier for this table, or null if this is undefined.
-
notes — a list of notes on the table, as described in [[!tabular-metadata]], which may be an empty list.
+
notes — any number of additional annotations on the table. This annotation may be empty.
rows — the list of rows in the table. A table MUST have one or more rows and the order of the rows within the list is significant and MUST be preserved by applications.
suppress output — a boolean that indicates whether or not this table should be suppressed in any output generated from converting the group of tables that this table belongs to into another format, as described in .
transformations — a (possibly empty) list of specifications for converting this table into other formats, as defined in [[!tabular-metadata]].
@@ -219,9 +219,7 @@
Tables
information about the source or provenance of the data in the table, or
links to other tables (eg to indicate tables that include related information).
-
- These arise from common properties defined on table descriptions within metadata documents, as defined in [[!tabular-metadata]].
-
+
When originating from [[!tabular-metadata]], these annotations arise from common properties defined on table descriptions within metadata documents.
Columns
@@ -229,28 +227,28 @@
Columns
A column represents a vertical arrangement of cells within a table. The core annotations of a column are:
-
about URL — the expected about URLURI template used to create a URL identifier for each value of cell in this column relative to the row in which it is contained, as defined in [[!tabular-metadata]].
+
about URL — the about URL URI template used to create a URL identifier for the value of each cell in this column relative to the row in which it is contained, as defined in [[!tabular-metadata]].
cells — the list of cells in the column. A column MUST contain one cell from each row in the table. The order of the cells in the list MUST match the order of the rows in which they appear within the rows for the associated table.
-
datatype — the expected datatype for the values of cells in this column, as defined in [[!tabular-metadata]].
-
default — the default value for cells whose string value is an empty string, as defined in [[!tabular-metadata]].
-
lang — the expected language for the values of cells in this column, as defined in [[!tabular-metadata]].
+
datatype — the expected datatype for the values of cells in this column, as defined in [[!tabular-metadata]].
lang — the code for the expected language for the values of cells in this column, expressed in the format defined by [[!BCP47]].
name — the name of the column.
-
null — the string or strings which cause the value of cells having string value matching any of these values to be null, as defined in [[!tabular-metadata]].
+
null — the string or strings which cause the value of cells having string value matching any of these values to be null.
number — the position of the column amongst the columns for the associated table, starting from 1.
-
property URL — the expected property URL URI template used to create a URL identifier for the property of the value of each cell in this column relative to the row in which it is contained, as defined in [[!tabular-metadata]].
+
ordered — a boolean that indicates whether the order of values of a cell should be preserved or not.
+
property URL — the expected property URL URI template used to create a URL identifier for the property of the value of each cell in this column relative to the row in which it is contained, as defined in [[!tabular-metadata]].
required — a boolean that indicates that values of cells in this column MUST NOT be empty.
-
separator — a string value used to create multiple values of cells in this column by splitting the string value on the separator, as defined in [[!tabular-metadata]].
+
separator — a string value used to create multiple values of cells in this column by splitting the string value on the separator.
source number — the position of the column in the file at the url of the table, or null.
suppress output — a boolean that indicates whether or not this column should be suppressed in any output generated from converting the table, as described in .
text direction — the indicator of the text direction for the values of cells in this column, as described in and [[!tabular-metadata]].
-
titles — any number of human-readable titles for the column, each of which has an associated language.
-
value URL — the expected value URL URI template used to create the URL identifier for the value of each cell in this column, as defined in [[!tabular-metadata]].
+
text direction — the indicator of the text direction for the values of cells in this column, as described in .
+
titles — any number of human-readable titles for the column, each of which has an associated language. The titles are represented as an object whose properties MUST be language codes as defined by [[!BCP47]] and whose values are arrays of strings in that language.
+
value URL — the expected value URL URI template used to create the URL identifier for the value of each cell in this column, as defined in [[!tabular-metadata]].
virtual — a boolean that indicates whether the column is a virtual column. Virtual columns are used to extend the source data with additional empty columns to support more advanced conversions; when this annotation is false, the column is a real column, which exists in the source data for the table.
- Columns MAY in addition have any number of other annotations, such as a description. These arise from common properties defined on column descriptions within metadata documents, as defined in [[!tabular-metadata]].
-
+ Columns MAY in addition have any number of other annotations, such as a description. When originating from [[!tabular-metadata]], these annotations arise from common properties defined on table group descriptions within metadata documents.
about URL — a URL for the entity that this cell provides information about, or null.
+
about URL — an absolute URL for the entity about which this cell provides information, or null.
column — the column in which the cell appears; the cell MUST be in the cells for that column.
errors — a (possibly empty) list of validation errors generated while parsing the value of the cell.
ordered — a boolean that, if the value of this cell is a list, indicates whether the order of that list should be preserved or not.
-
property URL — a URL for the property that this cell provides, or null.
+
property URL — an absolute URL for the property associated with this cell, or null.
row — the row in which the cell appears; the cell MUST be in the cells for that row.
string value — a string that is the original syntactic representation of the value of the cell, e.g., how the cell appears within a CSV file; this may be an empty string.
text direction — which direction the text within the cell should be displayed, as described in .
-
value — the semantic value of the cell; this MAY be of a datatype other than a string, MAY be a list, and MAY be null. For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date. By default, if the string value is an empty string, the semantic value of the cell is null. See Parsing Cells in [[!tabular-metadata]] for details about how to compute the cell value.
-
value URL — a URL for this cell's value, or null.
+
value — the semantic value of the cell; this MAY be a list of values, each of which MAY have a datatype other than a string, MAY have a language and MAY be null. For example, annotations might enable a processor to understand the string value of the cell as representing a number or a date. By default, if the string value is an empty string, the semantic value of the cell is null.
+
value URL — an absolute URL for this cell's value, or null.
The presence or absence of quotes around a value within a CSV file is a syntactic detail that is not reflected in the tabular data model. In other words, there is no distinction in the model between the second value in a,,z and the second value in a,"",z.
@@ -308,6 +306,63 @@
Cells
Neither this specification nor [[!tabular-metadata]] defines a method to specify such annotations. Implementations MAY define a method for adding annotations to cells by interpreting notes on the table.
+
+
Datatypes
+
+ Columns and Cells within tables may be annotated with a datatype which indicates the type of the values obtained by parsing the string value of the cell.
+
the datatypes defined in [[!xmlschema11-2]] as derived from and including anyAtomicType
+
the datatype xml, a sub-type of string, which indicates the value is an XML fragment
+
the datatype html, a sub-type of string, which indicates the value is an HTML fragment
+
the datatype json, a sub-type of string, which indicates the value is serialized JSON
+
+
+
+ Diagram showing the built-in datatypes, based on [[!xmlschema11-2]]; names in parentheses denote aliases to the [[!xmlschema11-2]] terms (see the diagram in SVG or PNG formats)
+
+
base — a string representing the datatype identifier from the set defined above. All values of the datatype MUST be valid values of the base datatype.
+
format — a string or object that defines the format of a value of this type, used when parsing a cell string value as described in .
+
length — a number that specifies the exact length of a cell string value as described in .
+
minimum length — a number that specifies the minimum length of a cell string value as described in .
+
maximum length — a number that specifies the maximum length of a cell string value as described in .
+
minimum — a number that specifies the minimum valid value (inclusive) of a cell string value as described in .
+
maximum — a number that specifies the maximum valid value (inclusive) of a cell string value as described in .
+
minimum exclusive — a number that specifies the minimum valid value (exclusive) of a cell string value as described in .
+
maximum exclusive — a number that specifies the maximum valid value (exclusive) of a cell string value as described in .
+
+
+ Datatypes MAY have any number of additional annotations. The annotations on a datatype provide metadata about the datatype such as title or description. These arise from common properties defined on datatype descriptions within metadata documents, as defined in [[!tabular-metadata]].
+
+ Validation of cell string values against these datatypes is as defined in [[!xmlschema11-2]].
+
+
+
Locating Metadata
@@ -533,6 +588,304 @@
Creating Annotated Tables
In the case of starting with a metadata file, UMM will describe a table or group of tables, and no other metadata files will be retrieved. Thus the metadata file must provide all applicable metadata aside from that embedded within the tabular data files themselves.
+
+
Parsing Cells
+
+ Unlike many other data formats, tabular data is designed to be read by humans. For that reason, it's common for data to be represented within tabular data in a human-readable way. The
+ datatype,
+ default,
+ lang,
+ null,
+ required, and
+ separator annotations provide the information needed to parse the string value of a cell into its (semantic) value. This is used:
+
+
+
by validators to check that the data in the table is in the expected format,
+
by converters to parse the values before mapping them into values in the target of the conversion,
+
when displaying data, to map it into formats that are meaningful for those viewing the data (as opposed to those publishing it), and
+
when inputting data, to turn entered values into representations in a consistent format.
+
+
The process of parsing a cell creates a cell with annotations based on the original string value, parsed value and other column annotations and adds the cell to the list of cells in a row and cells in a column:
+
+
The raw string value becomes the string value annotation on the cell.
a single value with an associated optional datatype or language, or
+
a sequence of such values.
+
+
+ The process of parsing the string value into a single value or a list of values is as follows:
+
+
+
unless the datatype base is string, json, xml, html or anyAtomicType, replace all carriage return (#xD), line feed (#xA), and tab (#x9) characters with space characters.
+
unless the datatype base is string, json, xml, html, anyAtomicType, or normalizedString, strip leading and trailing whitespace from the string value and replace all instances of two or more whitespace characters with a single space character.
+
if the resulting string is an empty string, apply the remaining steps to the string given by the column default annotation.
+
if the column separator annotation is not null and the resulting string is an empty string, the cell value is an empty list. If the column required annotation is true, add an error to the list of errors for the cell.
if the normalized string is an empty string, apply the remaining steps to the string given by the column default annotation.
+
if the normalized string is the same as any one of the values of the column null annotation, then the resulting value is null.
+
split the normalized string at the character specified by the column separator annotation.
+
unless the datatype base is string or anyAtomicType, strip leading and trailing whitespace from these strings.
+
apply the remaining steps to each of the strings in turn.
+
+
+
if the string is an empty string, apply the remaining steps to the string given by the column default annotation.
+
if the string is the same as any one of the values of the column null annotation, then the resulting value is null. If the column separator annotation is null and the column required annotation is true, add an error to the list of errors for the cell.
+
validate the string based on the datatype, using the datatype format annotation if one is specified, as described below, and then against the constraints described in ; if there are any errors, add them to the list of errors for the cell; the resulting value has a datatype annotation of string with the language annotation provided by the column lang annotation.
+
otherwise, if there are no errors, parse the string using the datatype format if one is specified, as described below; the resulting value sets a datatype annotation according to the datatype base and if the datatype base is string, or there is no datatype, it sets the language annotation from the column lang annotation.
+
+
The final value (or values) become the value annotation on the cell.
+
If there is an about URL annotation on the column, it becomes the about URL annotation on the cell, after being transformed into an absolute URL as described in URI Template Properties of [[!tabular-metadata]].
+ When no datatype annotation is available, the value of a cell is the same as its string value. For example, a cell with a string value of "99" would similarly have the (semantic) value "99".
+
+
+ If a datatype base is provided for the cell, that is used to create a (semantic) value for the cell. For example, if the metadata contains:
+
+
+ "datatype": "integer"
+
+
+ for the cell with the string value "99" then the value of that cell will be the integer 99. A cell whose string value was not a valid integer (such as "one" or "1.0") would be assigned that string value as its (semantic) value, but also have a validation error listed in its errors annotation.
+
+
+ Sometimes data uses special codes to indicate unknown or null values. For example, a particular column might contain a number that is expected to be between 1 and 10, with the string 99 used in the original tabular data file to indicate a null value. The metadata for such a column would include:
+
+ In this case, a cell with a string value of "5" would have the (semantic) value of the integer 5; a cell with a string value of "99" would have the value null.
+
+
+ Similarly, a cell may be assigned a default value if the string value for the cell is empty. A configuration such as:
+
+ In this case, a cell whose string value is "" would be assigned the value of the integer 5. A cell whose string value contains only whitespace, such as a single tab character, would also be assigned the value of the integer 5: when the datatype is something other than string or anyAtomicType, leading and trailing whitespace is stripped from string values before the remainder of the processing is carried out.
+
+
+ Cells can contain sequences of values. For example, a cell might have the string value "1 5 7.0". In this case, the separator is a space character. The appropriate configuration would be:
+
+ and this would mean that the cell's value would be an array containing two integers and a string: [1, 5, "7.0"]. The final value of the array is a string because it is not a valid integer; the cell's errors annotation will also contain a validation error.
+
+
+ Also, with this configuration, if the string value of the cell were "" (i.e., it was an empty cell) the value of the cell would be an empty list.
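The parsing steps and examples above can be sketched in Python. This is a simplified illustration rather than a conforming implementation: the function name `parse_cell` and its keyword parameters are hypothetical, only the `integer` datatype base is validated, and the datatype format annotation and language handling are left out.

```python
import re

# Datatype bases whose whitespace is left untouched by the first two steps.
KEEP_ALL = {"string", "json", "xml", "html", "anyAtomicType"}
KEEP_INNER = KEEP_ALL | {"normalizedString"}


def parse_cell(string_value, base="string", null=("",), default="",
               separator=None, required=False):
    """Return (value, errors) for one cell; hypothetical helper."""
    errors = []
    text = string_value
    if base not in KEEP_ALL:
        text = re.sub(r"[\r\n\t]", " ", text)
    if base not in KEEP_INNER:
        text = re.sub(r"\s{2,}", " ", text.strip())
    if text == "":
        text = default

    def parse_one(item):
        if item == "":
            item = default
        if item in null:
            if separator is None and required:
                errors.append("required value is null")
            return None
        if base == "integer":
            if re.fullmatch(r"[+-]?[0-9]+", item):
                return int(item)
            errors.append(f"{item!r} is not a valid integer")
        return item  # invalid or string-typed values stay as strings

    if separator is not None:
        if text == "":
            if required:
                errors.append("required cell is empty")
            return [], errors
        if text in null:
            return None, errors
        items = text.split(separator)
        if base not in {"string", "anyAtomicType"}:
            items = [item.strip() for item in items]
        return [parse_one(item) for item in items], errors
    return parse_one(text), errors
```

For instance, `parse_cell("1 5 7.0", base="integer", separator=" ")` yields the list `[1, 5, "7.0"]` together with one validation error, and `parse_cell("99", base="integer", null=("99",))` yields a null value.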
+
+
+ A cell value can be inserted into a URL created using a URI template property such as valueUrl. For example, if a cell with the string value "1 5 7.0" were in a column named values, defined with:
+
+ then after expansion of the URI template, the resulting valueUrl would be ?values=1.0,5.0,7.0. The canonical representations of the decimal values are used within the URL.
+
+
+
+
Formats for numeric types
+
+ It is not uncommon for numbers within tabular data to be formatted for human consumption, which may involve using commas for decimal points, grouping digits in the number using commas, or adding currency symbols or percent signs to the number.
+
+
+ If the datatype base is a numeric type, the datatype format annotation indicates the expected format for that number. Its value MUST be either a single string or an object with one or more of the properties:
+
+
+
decimalChar
+
A single character string whose value is used to represent a decimal point within the number. The default value is ".".
+
groupChar
+
A single character string whose value is used to group digits within the number. The default value is ",".
+
pattern
+
A regular expression string, in the syntax and interpreted as defined by [[!ECMASCRIPT]].
+
+
+ Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
+
+
+ If the datatype format annotation is a single string, this is interpreted in the same way as if it were an object with a pattern property whose value is that string.
+
+
+ When parsing the string value of a cell against this format specification, implementations MUST recognise and parse numbers that consist of:
+
+
+
an optional + or - sign,
+
followed by a decimal digit (0-9),
+
followed by any number of decimal digits (0-9) and the character specified as the groupChar,
+
followed by an optional decimalChar followed by one or more decimal digits (0-9),
+
followed by an optional exponent, consisting of an E followed by an optional + or - sign followed by one or more decimal digits (0-9), or
+
followed by an optional percent (%) or per-mille (‰) sign.
+
+
+ or that are one of the special values:
+
+
+
NaN,
+
INF, or
+
-INF.
+
+
+ Implementations MUST add a validation error to the errors annotation for the cell if the string being parsed:
+
contains an exponent, if the datatype base is decimal or one of its sub-types, or
+
is one of the special values NaN, INF, or -INF, if the datatype base is decimal or one of its sub-types.
+
+
+ Implementations MUST use the sign, exponent, percent, and per-mille signs when parsing the string value of a cell to provide the value of the cell. For example, the string value "-25%" must be interpreted as -0.25 and the string value "1E6" as 1000000.
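As a rough sketch of the numeric rules above, the following Python function recognises the sign, grouping character, decimal character, exponent, and percent or per-mille sign. The name `parse_number`, its parameters, and the partial `DECIMAL_FAMILY` set are assumptions made for illustration; the `pattern` property is not handled.

```python
import re
from decimal import Decimal

SPECIAL = {"NaN", "INF", "-INF"}
PER_MILLE = "\u2030"
# Partial, illustrative list of decimal and its sub-types (an assumption).
DECIMAL_FAMILY = {"decimal", "integer", "long", "int", "short", "byte"}


def parse_number(text, base="decimal", decimal_char=".", group_char=","):
    """Return (value, errors); value falls back to the original string."""
    errors = []
    if text in SPECIAL:
        if base in DECIMAL_FAMILY:
            errors.append(f"{text} is not allowed for {base}")
            return text, errors
        return float(text.replace("INF", "inf")), errors
    pattern = re.compile(
        r"(?P<sign>[+-]?)"
        r"(?P<int>[0-9](?:[0-9]|" + re.escape(group_char) + r")*)"
        r"(?:" + re.escape(decimal_char) + r"(?P<frac>[0-9]+))?"
        r"(?:E(?P<exp>[+-]?[0-9]+))?"
        r"(?P<scale>[%" + PER_MILLE + r"]?)"
    )
    match = pattern.fullmatch(text)
    if match is None:
        errors.append(f"{text!r} is not a recognised number")
        return text, errors
    if match["exp"] and base in DECIMAL_FAMILY:
        errors.append(f"exponent is not allowed for {base}")
        return text, errors
    digits = match["int"].replace(group_char, "")
    literal = match["sign"] + digits + "." + (match["frac"] or "0")
    if match["exp"]:
        literal += "E" + match["exp"]
    value = Decimal(literal)
    if match["scale"] == "%":
        value /= 100
    elif match["scale"] == PER_MILLE:
        value /= 1000
    return value, errors
```

Using `Decimal` rather than `float` is a design choice of this sketch so that values such as -0.25 are represented exactly.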
+
+
+
+
Formats for booleans
+
+ Boolean values may be represented in many ways aside from the standard 1 and 0 or true and false.
+
+
+ If the datatype base for a cell is boolean, the datatype format annotation provides the true and false values expected, separated by |. For example if format is Y|N then cells must hold either Y or N with Y meaning true and N meaning false.
+
+
+ The resulting cell value will be one or more boolean true or false values.
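A minimal Python sketch of this rule follows. The function name `parse_boolean` is hypothetical, and treating only `true`, `false`, `1`, and `0` as valid when no format is given is an assumption based on the standard values mentioned above.

```python
def parse_boolean(text, fmt=None):
    """Return (value, errors) for a boolean cell string."""
    if fmt is None:
        true_values, false_values = {"true", "1"}, {"false", "0"}
    else:
        # The format holds the true value, then '|', then the false value.
        true_text, false_text = fmt.split("|", 1)
        true_values, false_values = {true_text}, {false_text}
    if text in true_values:
        return True, []
    if text in false_values:
        return False, []
    # Invalid values keep their string form and record an error.
    return text, [f"{text!r} is not a valid boolean"]
```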
+
+
+
+
Formats for dates and times
+
+ Dates and times are commonly represented in tabular data in formats other than those defined in [[!xmlschema11-2]].
+
+
+ If the datatype base is a date or time type, the datatype format annotation indicates the expected format for that date or time.
+
+
+ The supported date and time formats listed here are expressed in terms of the date field symbols defined in [[!UAX35]] and MUST be interpreted by implementations as defined in that specification.
+
+
+ The following date formats MUST be recognised by implementations:
+
+
+
yyyy-MM-dd e.g., 2015-03-22
+
yyyyMMdd e.g., 20150322
+
dd-MM-yyyy e.g., 22-03-2015
+
d-M-yyyy e.g., 22-3-2015
+
MM-dd-yyyy e.g., 03-22-2015
+
M-d-yyyy e.g., 3-22-2015
+
dd/MM/yyyy e.g., 22/03/2015
+
d/M/yyyy e.g., 22/3/2015
+
MM/dd/yyyy e.g., 03/22/2015
+
M/d/yyyy e.g., 3/22/2015
+
dd.MM.yyyy e.g., 22.03.2015
+
d.M.yyyy e.g., 22.3.2015
+
MM.dd.yyyy e.g., 03.22.2015
+
M.d.yyyy e.g., 3.22.2015
+
+
+ The following time formats MUST be recognised by implementations:
+
+
+
HH:mm:ss e.g., 15:02:37
+
HHmmss e.g., 150237
+
HH:mm e.g., 15:02
+
HHmm e.g., 1502
+
+
+ The following date/time formats MUST be recognised by implementations:
+
+
+
yyyy-MM-ddTHH:mm:ss e.g., 2015-03-15T15:02:37
+
yyyy-MM-ddTHH:mm e.g., 2015-03-15T15:02
+
any of the date formats above, followed by a single space, followed by any of the time formats above, e.g., M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37
+
+
+ Implementations MUST also recognise date, time, and date/time formats that end with timezone markers consisting of between one and three xs or Xs, possibly after a single space. These MUST be interpreted as follows:
+
+
+
X e.g., -08, +0530, or Z (minutes are optional)
+
XX e.g., -0800, +0530, or Z
+
XXX e.g., -08:00, +05:30, or Z
+
x e.g., -08 or +0530 (Z is not permitted)
+
xx e.g., -0800 or +0530 (Z is not permitted)
+
xxx e.g., -08:00 or +05:30 (Z is not permitted)
+
+
+ For example, formats could include yyyy-MM-ddTHH:mm:ssXXX for 2015-03-15T15:02:37Z or 2015-03-15T15:02:37-05:00, or HH:mm x for 15:02 -05.
+
+
+ The cell value will be one or more date/time values extracted using the format.
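One way to handle the date and time formats listed above is to translate each format string into a regular expression. The Python sketch below does this for the field symbols yyyy, MM, M, dd, d, HH, mm, and ss; the names `format_to_regex` and `parse_date_time` are hypothetical, and timezone markers (x and X) are deliberately not covered.

```python
import re
from datetime import date, datetime, time

# Regex fragment for each supported field symbol; longer symbols first.
FIELDS = [
    ("yyyy", r"(?P<year>[0-9]{4})"),
    ("MM", r"(?P<month>[0-9]{2})"),
    ("dd", r"(?P<day>[0-9]{2})"),
    ("HH", r"(?P<hour>[0-9]{2})"),
    ("mm", r"(?P<minute>[0-9]{2})"),
    ("ss", r"(?P<second>[0-9]{2})"),
    ("M", r"(?P<month>[0-9]{1,2})"),
    ("d", r"(?P<day>[0-9]{1,2})"),
]


def format_to_regex(fmt):
    """Translate a format such as 'd/M/yyyy' into a compiled regex."""
    parts, i = [], 0
    while i < len(fmt):
        for symbol, fragment in FIELDS:
            if fmt.startswith(symbol, i):
                parts.append(fragment)
                i += len(symbol)
                break
        else:
            parts.append(re.escape(fmt[i]))  # literal such as '-', '/', 'T'
            i += 1
    return re.compile("".join(parts))


def parse_date_time(text, fmt):
    """Return a date, time, or datetime, or None if text does not match."""
    match = format_to_regex(fmt).fullmatch(text)
    if match is None:
        return None
    f = {name: int(value) for name, value in match.groupdict().items()}
    try:
        if "year" in f and "hour" in f:
            return datetime(f["year"], f["month"], f["day"],
                            f["hour"], f["minute"], f.get("second", 0))
        if "year" in f:
            return date(f["year"], f["month"], f["day"])
        return time(f["hour"], f["minute"], f.get("second", 0))
    except ValueError:  # e.g. month 13 or day 32
        return None
```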
+
+
+ For simplicity, this version of this standard does not support abbreviated or full month or day names, or double digit years. Future versions of this standard may support other date and time formats, or general purpose date/time pattern strings. Authors of schemas SHOULD use appropriate regular expressions, along with the string datatype, for dates and times that use a format other than that specified here.
+
+
+
+
Formats for durations
+
+ Durations MUST be formatted and interpreted as defined in [[!xmlschema11-2]], using the [[!ISO8601]] format -?PnYnMnDTnHnMnS. For example, the duration P1Y1D is used for a year and a day; the duration PT2H30M for 2 hours and 30 minutes.
+
+
+ If the datatype base is a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].
+
+
+ Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
+
+
+ The cell value will be one or more durations extracted using the format.
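The -?PnYnMnDTnHnMnS shape can be checked with a single regular expression, as in the Python sketch below. The name `parse_duration` and the dictionary it returns are assumptions chosen for illustration, and the sketch does not apply a datatype format annotation.

```python
import re

DURATION = re.compile(
    r"(?P<sign>-?)P"
    r"(?:(?P<years>[0-9]+)Y)?"
    r"(?:(?P<months>[0-9]+)M)?"
    r"(?:(?P<days>[0-9]+)D)?"
    r"(?:T(?:(?P<hours>[0-9]+)H)?"
    r"(?:(?P<minutes>[0-9]+)M)?"
    r"(?:(?P<seconds>[0-9]+(?:\.[0-9]+)?)S)?)?"
)


def parse_duration(text):
    """Return a dict of the components present, or None if invalid."""
    match = DURATION.fullmatch(text)
    if match is None or text.endswith(("P", "T")):
        return None  # 'P' alone or a dangling 'T' is not a duration
    fields = match.groupdict()
    negative = fields.pop("sign") == "-"
    parts = {name: float(v) if "." in v else int(v)
             for name, v in fields.items() if v is not None}
    parts["negative"] = negative
    return parts
```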
+
+
+
+
Formats for other types
+
+ If the datatype base is not numeric, boolean, a date/time type, or a duration type, the datatype format annotation provides a regular expression for the string values, in the syntax and processed as defined by [[!ECMASCRIPT]].
+
+
+ Authors are encouraged to be conservative in the regular expressions that they use, sticking to the basic features of regular expressions that are likely to be supported across implementations.
+
+
+ Values that are labelled as html, xml, or json are not validated against those formats.
+
+
+ Metadata creators who wish to check the syntax of HTML, XML, or JSON within tabular data should use the datatype format annotation to specify a regular expression against which such values will be tested.
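Testing a value against such a regular expression might look like the Python sketch below. Two assumptions are worth stating: the function name `check_format` is hypothetical, and Python's `re` syntax is used as a stand-in for the [[!ECMASCRIPT]] syntax, which differs in some advanced features. The sketch also assumes the whole string must match.

```python
import re


def check_format(text, pattern):
    """Return a list of errors; empty when the whole string matches."""
    if re.fullmatch(pattern, text) is None:
        return [f"{text!r} does not match {pattern!r}"]
    return []
```

Sticking to basic constructs such as character classes and quantifiers, as the advice above suggests, keeps patterns portable between regular expression engines.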
+
+
+
Displaying Tables
@@ -670,7 +1023,7 @@
Grammar
-
Parsing Tabular Data
+
Parsing Tabular Data
As described in , there may be many formats which an application might interpret into the tabular data model described in , including using different separators or fixed format tables, multiple tables within a single file, or ones that have metadata lines before a table header.
@@ -908,7 +1261,7 @@
Parsing Tabular Data
This parsing algorithm does not account for the possibility of there being more than one area of tabular data within a single CSV file.
-
+
Bidirectionality in CSV Files
Bidirectional content does not alter the definition of rows or the assignment of cells to columns. Whether or not a CSV file contains right-to-left characters, the first column's content is the first cell of each row, which is the text prior to the first occurrence of a comma within that row.