TARQL Mapping Language

Richard Cyganiak edited this page Aug 7, 2013 · 4 revisions

Note: This specification is ahead of the implementation and describes various features and behaviours that are not currently found in the implementation.

This is work towards a specification for a SPARQL-based data mapping language called TARQL (Transformation SPARQL). TARQL can be used to convert data from RDF, CSV, TSV and JSON (and potentially XML and relational databases, although that is not yet part of the spec) to RDF.

A design goal for this first version of TARQL is implementability on top of Apache ARQ without creating a new parser; therefore it often abuses SPARQL notation to achieve effects that really ought to have their own keyword or language construct.

TARQL Datasets

Standard SPARQL RDF datasets consist of a default graph and zero or more named graphs. TARQL datasets additionally have a default table and zero or more named tables.

All the named items are named with an IRI. The same name (IRI) cannot designate both a named table and a named graph.

A table here is a header (ordered set of SPARQL variables) followed by a list of SPARQL bindings (mappings from variables to RDF terms). The variables used in each binding must be a subset of the header variables. This is the same data structure as produced by the VALUES construct of SPARQL 1.1.
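As an illustration only, the table structure can be sketched in Python. The Table class, its field names, and the choice to represent variables and RDF terms as plain strings are assumptions made for this sketch; they are not part of TARQL.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A table: an ordered header plus a list of bindings."""
    header: list[str]                                  # variable names without the leading "?"
    bindings: list[dict[str, str]] = field(default_factory=list)

    def add(self, binding: dict[str, str]) -> None:
        # A binding may leave header variables unbound, but it must not
        # use a variable that is missing from the header.
        extra = set(binding) - set(self.header)
        if extra:
            raise ValueError(f"variables not in header: {sorted(extra)}")
        self.bindings.append(binding)


table = Table(header=["name", "age"])
table.add({"name": '"Alice"', "age": '"42"'})
table.add({"name": '"Bob"'})                           # ?age is unbound in this row
```

A binding that mentions a variable outside the header, such as ?city here, is rejected with a ValueError.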

Specifying the dataset

The input to TARQL mapping evaluation is a TARQL dataset. The dataset is specified using the FROM and FROM NAMED clauses. A TARQL processor may also provide other means equivalent to these two clauses, for example, command line options that implicitly provide additional FROM and FROM NAMED clauses; in this specification we will just treat these like additional FROM and FROM NAMED clauses and not discuss them further.

The dataset is built from a set of FROM and FROM NAMED clauses by dereferencing the provided IRI, identifying whether the retrieved representation is a table format (CSV and TSV) or a graph format (RDF syntaxes and JSON), and adding the respective default graph or default table (for FROM) or a named graph or named table (FROM NAMED) to the input dataset. Multiple FROM clauses are handled by forming the union of the graphs or the concatenation of the tables.
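The following Python sketch shows one way to read the dataset construction described above. It makes several assumptions that the text does not: the format is guessed from the file extension rather than from the retrieved representation, graphs are sets of triples, tables are lists of rows, and load is a hypothetical loader supplied by the caller.

```python
TABLE_FORMATS = {"csv", "tsv"}            # the table formats named in the text


def kind_of(iri):
    """Classify an IRI as "table" or "graph" by file extension (an assumption)."""
    path = iri.split("#", 1)[0]
    extension = path.rsplit(".", 1)[-1].lower()
    return "table" if extension in TABLE_FORMATS else "graph"


def build_dataset(from_iris, from_named_iris, load):
    """Build a dataset; load(iri) is a hypothetical loader that returns a
    set of triples for graph formats and a list of rows for table formats."""
    dataset = {"default_graph": set(), "default_table": [],
               "named_graphs": {}, "named_tables": {}}
    for iri in from_iris:
        if kind_of(iri) == "table":
            dataset["default_table"].extend(load(iri))   # concatenation of tables
        else:
            dataset["default_graph"] |= load(iri)        # union of graphs
    for iri in from_named_iris:
        name = iri.split("#", 1)[0]
        # The same name cannot designate both a named table and a named graph.
        if name in dataset["named_graphs"] or name in dataset["named_tables"]:
            raise ValueError(f"name already in use: {name}")
        slot = "named_tables" if kind_of(iri) == "table" else "named_graphs"
        dataset[slot][name] = load(iri)
    return dataset
```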

Multiple queries in a mapping

Syntactically, a TARQL query is a concatenation of one or more SPARQL queries. There are restrictions on the query forms:

  • DESCRIBE queries are not allowed at all.
  • Any number of ASK queries are allowed.
  • Either a SELECT query or one or more CONSTRUCT queries are allowed, but not both SELECT and CONSTRUCT in the same mapping.
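
The restrictions above could be checked along these lines. This is a sketch, assuming the query forms of a mapping are already available as a list of strings; the function name check_query_forms is hypothetical.

```python
def check_query_forms(forms):
    """Validate the query forms of a mapping, given in query order.

    Returns the kind of result the mapping would produce.
    """
    forms = [form.upper() for form in forms]
    if not forms:
        raise ValueError("a mapping needs at least one query")
    if "DESCRIBE" in forms:
        raise ValueError("DESCRIBE queries are not allowed")
    selects = forms.count("SELECT")
    constructs = forms.count("CONSTRUCT")
    if selects and constructs:
        raise ValueError("SELECT and CONSTRUCT cannot be mixed in one mapping")
    if selects > 1:
        raise ValueError("at most one SELECT query is allowed")
    if selects:
        return "result set"
    if constructs:
        return "graph"
    return "boolean"                      # only ASK queries remain
```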

Some clauses that can occur in a SPARQL query have an effect on subsequent queries in a mapping:

  • BASE is passed on as the default base to the next query, but can be changed in each query.
  • PREFIX is cumulative, that is, prefixes declared in a previous query are recognized in subsequent queries. Prefixes can be re-defined in each query.
  • FROM NAMED is cumulative, that is, additions to the dataset made in previous queries are recognized in subsequent ones. A FROM NAMED clause with an IRI that occurred previously is an error.
  • FROM is not carried through to subsequent queries. The default graph or default table needs to be declared for each query.
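
A sketch of these carry-over rules in Python, assuming each query has been reduced to a dictionary with optional "base", "prefixes", "from" and "from_named" entries. That representation and the function name carry_over are assumptions, and the handling of FROM is still an open question in this specification.

```python
def carry_over(queries):
    """Yield the effective settings for each query in a mapping, in order."""
    base, prefixes, named = None, {}, []
    for query in queries:
        base = query.get("base", base)                         # inherited unless changed
        prefixes = {**prefixes, **query.get("prefixes", {})}   # cumulative, re-definable
        for iri in query.get("from_named", []):
            if iri in named:
                raise ValueError(f"FROM NAMED <{iri}> occurred previously")
            named.append(iri)                                  # cumulative
        yield {
            "base": base,
            "prefixes": dict(prefixes),
            "from": list(query.get("from", [])),               # not carried over
            "from_named": list(named),
        }
```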

Issue: Handling of FROM requires thought. If FROM is specified via command line (like in ARQ's --data), it presumably should apply to all queries in a mapping. But if a FROM is present on every query, the intention is very likely to override/replace the one from the previous query. Ideally we'd specify it once on top of the query and that's it. This also interacts interestingly with the re-use of previous query results via the default graph.

If a query has no graph-producing FROM clause, then the query's default graph is the union of the results of all previous CONSTRUCT queries in the mapping. This allows previous results to be re-used in subsequent queries.

Issue: This re-use would be cleaner and more modular if we could store query results in a named table or named graph, and then use that named item in later queries. Maybe we could abuse a magic PREFIX to indicate a target graph or target table?

The mapping result

A mapping containing CONSTRUCT queries evaluates to an RDF graph. A mapping containing a SELECT query evaluates to a SPARQL result set. A mapping containing only ASK queries evaluates to a single boolean value. In a mapping with multiple CONSTRUCT queries, the result is the graph union of the query results. For multiple ASK queries, the result is the conjunction (logical AND) of the query results.

For evaluation to a graph or to a SPARQL result set, if any ASK queries are present, then they are evaluated, and if the result of any of them is false, then the overall query execution fails. This can be used to embed assertions on the inputs that must be fulfilled for the mapping to work.
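
The combination rules can be sketched as follows, assuming each query has already been evaluated to a (form, value) pair where ASK yields a bool, CONSTRUCT a set of triples and SELECT a list of rows. The names evaluate_mapping and MappingFailed are hypothetical.

```python
class MappingFailed(Exception):
    """Raised when an embedded ASK assertion is false."""


def evaluate_mapping(results):
    """Combine per-query results, given as (form, value) pairs in query order."""
    asks = [value for form, value in results if form == "ASK"]
    graphs = [value for form, value in results if form == "CONSTRUCT"]
    selects = [value for form, value in results if form == "SELECT"]
    if not graphs and not selects:
        return all(asks)                  # ASK only: logical AND
    if not all(asks):
        raise MappingFailed("an ASK query evaluated to false")
    if selects:
        return selects[0]                 # the single SELECT result set
    return set().union(*graphs)           # graph union of all CONSTRUCT results
```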

Details of CSV/TSV loading

<table.csv#header=present> and <table.csv#header=absent> in FROM or FROM NAMED indicate whether the header row contains variable names. For FROM NAMED, the resulting table name is <table.csv>, not <table.csv#header=xxx>. By default, the tool will assume that a header is absent.

If the header is absent, then the variables ?A, ?B and so on are used in the resulting table.
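
A small Python sketch of both rules. Two points are assumptions, since the text does not settle them: fragments other than header=present and header=absent are left untouched, and column names continue spreadsheet-style (AA, AB, ...) after Z.

```python
def split_header_flag(iri):
    """Return (table name, header present?) for a FROM or FROM NAMED IRI."""
    name, _, fragment = iri.partition("#")
    if fragment == "header=present":
        return name, True
    if fragment in ("", "header=absent"):
        return name, False                # default: header is absent
    return iri, False                     # unrelated fragment, kept as is (assumption)


def default_variables(count):
    """Variable names A, B, ..., Z, then AA, AB, ... (continuation is an assumption)."""
    names = []
    for index in range(1, count + 1):
        name = ""
        while index > 0:
            index, remainder = divmod(index - 1, 26)
            name = chr(ord("A") + remainder) + name
        names.append(name)
    return names
```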

If the header is present, then the first row contains the variable names to be used for the resulting table, and it will not produce a binding (row in the table). To translate strings to variable names, replace spaces with underscores, and remove any characters not allowed in variable names. Duplicate variable names are disambiguated by appending the lowest positive integer number that results in a unique name, processing variable names left to right. For example:

(?A ?A1 ?A) => (?A2 ?A1 ?A3)
(?A1 ?A1) => (?A11 ?A12)

A rough approximation of the characters allowed in variable names is: a-zA-Z0-9_ and any Unicode codepoint from U+00C0 upwards. But see the full grammar for details.
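
The header translation can be sketched in Python as below. The character test uses the rough approximation given above rather than the full grammar, and the duplicate handling is one reading of the rule that reproduces both examples: a suffix is skipped if the resulting name matches any original header name or any name already assigned. Variable names are written without the leading question mark.

```python
from collections import Counter


def to_variable_name(text):
    """Replace spaces with underscores and drop characters not allowed in names."""
    text = text.replace(" ", "_")
    return "".join(
        c for c in text
        if (c.isascii() and (c.isalnum() or c == "_")) or ord(c) >= 0xC0
    )


def disambiguate(names):
    """Rename duplicates left to right by appending the lowest free positive integer."""
    counts = Counter(names)
    taken = set(names)                    # original names are never reused
    result = []
    for name in names:
        if counts[name] == 1:
            result.append(name)           # unique names stay unchanged
            continue
        suffix = 1
        while f"{name}{suffix}" in taken:
            suffix += 1
        renamed = f"{name}{suffix}"
        taken.add(renamed)
        result.append(renamed)
    return result


print(disambiguate(["A", "A1", "A"]))     # ['A2', 'A1', 'A3']
print(disambiguate(["A1", "A1"]))         # ['A11', 'A12']
```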

Details of JSON loading

JSON documents specified in FROM or FROM NAMED are parsed as RDF graphs using the algorithm specified here.

The algorithm requires a vocabulary IRI as input. This IRI is taken from the prefix mapping of the json: prefix. If that prefix is not defined, the IRI http://tarql.org/json/ will be used. Note that prefixes are defined (and can be re-defined between queries) using the PREFIX clause.

Strings, numbers, true and false are translated to RDF literals in the obvious way. Nulls are ignored.

Objects are translated by generating a triple for each key-value pair. The subject is a fresh blank node. The predicate IRI is generated from the key by percent-encoding any character not in the iunreserved production of RFC 3987 and appending the result to the vocabulary IRI. The object is obtained by recursively translating the value. The subject blank node is returned as the result of the translation.

Arrays are translated to an rdf:List whose members are the translated values. The first blank node in the list is returned as the translation result.

A triple of this form completes the document:

<documentIRI> rdf:value _:translatedRootJSONObjectOrArray.

Note: The rdf:value triple is intended to make querying for the root object easy. The choice of the rdf:value property, rather than something like json:root, avoids clashes with keys that might exist in the document. By choosing a sort-of-fitting term from the rdf: namespace, we make it more likely that users don't have to declare an extra namespace.
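
The translation rules can be sketched in Python over plain tuples, without an RDF library. Several choices here are assumptions: triples are (subject, predicate, object) tuples, blank nodes are labelled strings, literals are wrapped in a hypothetical Literal type without datatype handling, nulls inside arrays are skipped, and an empty array becomes rdf:nil. The percent-encoding is also stricter than the rule above: urllib.parse.quote encodes non-ASCII characters, whereas iunreserved allows many of them.

```python
import itertools
from typing import NamedTuple
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class Literal(NamedTuple):
    """Keeps literal values distinct from IRIs and blank node labels."""
    value: object


def json_to_triples(doc, document_iri, vocab="http://tarql.org/json/"):
    """Translate a parsed JSON document into a list of triples."""
    triples = []
    ids = itertools.count()

    def bnode():
        return f"_:b{next(ids)}"

    def translate(value):
        if isinstance(value, dict):
            subject = bnode()
            for key, member in value.items():
                if member is None:
                    continue                          # nulls are ignored
                predicate = vocab + quote(key, safe="")
                triples.append((subject, predicate, translate(member)))
            return subject
        if isinstance(value, list):
            head = RDF + "nil"
            # Build the rdf:List back to front so each node can point at its rest.
            for member in reversed([m for m in value if m is not None]):
                node = bnode()
                triples.append((node, RDF + "first", translate(member)))
                triples.append((node, RDF + "rest", head))
                head = node
            return head
        return Literal(value)                         # string, number, true or false

    if doc is not None:
        triples.append((document_iri, RDF + "value", translate(doc)))
    return triples
```

The input is expected to be the output of json.loads, and the last triple in the returned list is the rdf:value triple for the document.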

Example JSON (from the JSON website):

{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}

The same in Turtle, parsed with the default vocabulary IRI:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix json: <http://tarql.org/json/>.
<> rdf:value [
  json:menu [
    json:id "file";
    json:value "File";
    json:popup [
      json:menuitem (
        [json:value "New"; json:onclick "CreateNewDoc()";]
        [json:value "Open"; json:onclick "OpenDoc()";]
        [json:value "Close"; json:onclick "CloseDoc()";]
      )
    ]
  ]
].

Note that lists can be queried like this:

?parentObject json:key/rdf:rest*/rdf:first ?listMember

Issue: Should we introduce a shortcut for the above? json:key/tarql:list? jsonl:key? json:key/rdf:List? json:key/rdfs:member?

Query evaluation

An empty group ({}) indicates that the default table is to be inserted there as VALUES. This includes empty WHERE, UNION and OPTIONAL clauses. Read {} as DEFAULT TABLE.

An empty named graph (GRAPH <x> {}) indicates that the named table is to be inserted there. Read GRAPH <x> {} as NAMED TABLE <x>.
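
To make the substitution concrete, the Python sketch below renders a table as the SPARQL 1.1 VALUES block that would stand in for {} or GRAPH <x> {}. It assumes header variables are given without the leading question mark and that the RDF terms in the bindings are already serialized in SPARQL syntax; the function name values_clause is hypothetical, and a processor need not rewrite query text at all.

```python
def values_clause(header, bindings):
    """Render a table (header plus bindings) as a SPARQL VALUES block."""
    variables = " ".join("?" + name for name in header)
    rows = []
    for binding in bindings:
        # Unbound variables are written as UNDEF, as in SPARQL 1.1.
        cells = " ".join(binding.get(name, "UNDEF") for name in header)
        rows.append(f"  ({cells})")
    return "VALUES (" + variables + ") {\n" + "\n".join(rows) + "\n}"


print(values_clause(["A", "B"], [{"A": '"x"', "B": '"y"'}, {"A": '"z"'}]))
```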

The rdf:rest*/rdf:first property path expression MUST produce elements in order of path length. This is to make JSON lists work as expected.
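
The required ordering corresponds to walking the list from its head, as in this Python sketch. It assumes triples are (subject, predicate, object) tuples and that the list is well-formed, with exactly one rdf:first and one rdf:rest per node; the function name list_members is hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def list_members(triples, head):
    """Yield the members of an rdf:List in order of increasing path length."""
    first = {s: o for s, p, o in triples if p == RDF + "first"}
    rest = {s: o for s, p, o in triples if p == RDF + "rest"}
    node, seen = head, set()
    while node in first and node not in seen:   # the seen set guards against cycles
        seen.add(node)
        yield first[node]                       # reached via rdf:rest{n}/rdf:first
        node = rest.get(node)
```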

Issue: The SPARQL spec has this expression as an example for property paths, but notes that this expression doesn't guarantee the order of results. Why is this so?

Pseudo variables

The ?ROWNUM pseudo variable holds the number of the current row. Counting starts at 1 and skips empty rows. Note that rows skipped via FILTER are still counted.
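
A Python sketch of this numbering, assuming rows are lists of strings and that a row counts as empty when every cell is blank; the text does not define emptiness, so that test is an assumption. The list comprehension plays the role of a FILTER.

```python
def with_rownum(rows):
    """Pair each non-empty row with its row number, counting from 1."""
    number = 0
    for row in rows:
        if not any(cell.strip() for cell in row):
            continue                      # empty rows are skipped and not counted
        number += 1
        yield number, row


rows = [["Alice", "30"], ["", ""], ["Bob", "17"], ["Carol", "45"]]
numbered = list(with_rownum(rows))
# Filtering happens after numbering, so Carol keeps row number 3
# even though Bob's row is filtered out.
adults = [(n, row) for n, row in numbered if int(row[1]) >= 18]
print(adults)                             # [(1, ['Alice', '30']), (3, ['Carol', '45'])]
```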

Open issues

  • Saving results into named tables and named graphs instead of only allowing a single (“default”) result?
  • XML as input?
  • Relational databases as input?
  • Function library?
  • How to create blank nodes that need to be used in multiple bindings? Especially if across queries? BNODE() creates distinct blank nodes in every binding.
  • Supporting tree-shaped outputs (XML and JSON) would be great, but becomes very complex if done with a single CONSTRUCT, and how to deal with shared blank nodes and ordering?
  • Use case: A directory contains 100s of XML files in a subdirectory structure. Can we process them all in a single mapping?
  • Use case: We have a list of CSV URLs in a file (RDF or CSV). Can we load and process all files in a single mapping?

Towards a SPARQL extension for data transformations

The language specified here “abuses” SPARQL notations to make SPARQL work for tasks it was not originally designed for. A better approach would be to extend SPARQL with new keywords and constructs. We have not chosen this approach as it would involve messing with the SPARQL grammar and creating a new parser, and our main goal of experimenting with SPARQL syntax for transformation tasks can be achieved within the current SPARQL grammar.

Nevertheless, it's easy to imagine a SPARQL extension that adds elegant native language constructs for the features of TARQL. Useful additional SPARQL features might include:

  • Parameters to FROM and FROM NAMED to configure the parser, e.g., with a JSON vocabulary IRI, a flag that indicates presence of CSV headers, or even a keyword or IRI to select the parser.
  • Semantics for multiple queries in a single file. This is obviously useful in the case of multiple CONSTRUCT queries, but defining useful semantics for combinations of any kind of query is imaginable.
  • Inserting streams of bindings into a query at arbitrary points, in the same way as VALUES does in SPARQL 1.1 but where the values might come from an external file or service.
  • Extending RDF datasets to also cover tables, and inserting these tables into the query as streams of bindings (as we do with the {} and GRAPH <x> {} notations).

Below are some syntax doodles—not to be taken too seriously:

FROM JSON <x.json> AS :x; VOCABULARY=<http://example.com/myjsonvocab#>

FROM <x.json> AS :x OPTIONS {
  tarql:reader tarql:JSON;
  tarql:vocabulary <http://example.com/myjsonvocab#>;
}

FROM JSON <x.json> { VOCAB <http://example.com/myjsonvocab#> } INTO :x

FROM CSV <x.csv> { HEADERS (?foo ?bar ?baz) ESCAPE "\\" } INTO :table1

FROM <jdbc:mysql:///iswc> D2RQ USER "root" MAPPING <mappings/d2rq-iswc.ttl> INTO :iswc