There are many languages built to query data. SQL is typically used to query relational data. There is SPARQL for RDF data. Cypher is used to do pattern matching in graph data. The list could go on. Compilers convert languages like these to Gremlin so that it becomes possible to use them in any context that Gremlin is used. In other words, a Gremlin Compiler enables a particular query language to work on any TinkerPop-enabled graph system.
The SPARQL-Gremlin compiler, transforms SPARQL queries into Gremlin traversals. It uses the Apache Jena SPARQL processor ARQ, which provides access to a syntax tree of a SPARQL query.
The goal of this work is to bridge the query interoperability gap between the two famous, yet fairly disconnected, graph communities: Semantic Web (which relies on the RDF data model) and Graph database (which relies on property graph data model).
Note
|
The foundational research work on SPARQL-Gremlin compiler (aka Gremlinator) can be found in the Gremlinator paper. This paper presents the graph query language semantics of SPARQL and Gremlin, and a formal mapping between SPARQL pattern matching graph patterns and Gremlin traversals. |
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>sparql-gremlin</artifactId>
<version>x.y.z</version>
</dependency>
The SPARQL-Gremlin compiler converts SPARQL queries into Gremlin so that they can be executed across any TinkerPop-enabled graph system. To use this compiler in the Gremlin Console, first install and activate the "tinkerpop.sparql" plugin:
gremlin> :install org.apache.tinkerpop sparql-gremlin x.y.z
==>Loaded: [org.apache.tinkerpop, sparql-gremlin, x.y.z]
gremlin> :plugin use tinkerpop.sparql
==>tinkerpop.sparql activated
Installing this plugin will download appropriate dependencies and import certain classes to the console so that they may be used as follows:
graph = TinkerFactory.createModern() g = traversal(SparqlTraversalSource).with(graph) (1) g.sparql("""SELECT ?name ?age WHERE { ?person v:name ?name . ?person v:age ?age } ORDER BY ASC(?age)""") (2)
-
Define
g
as aTraversalSource
that uses theSparqlTraversalSource
- by default, thetraversal()
method usually returns aGraphTraversalSource
which includes the standard Gremlin starts steps likeV()
orE()
. In this case, theSparqlTraversalSource
enables starts steps that are specific to SPARQL only - in this case thesparql()
start step. -
Execute a SPARQL query against the TinkerGraph instance. The
SparqlTraversalSource
uses a TraversalStrategy to transparently converts that SPARQL query into a standard Gremlin traversal and then when finally iterated, executes that against the TinkerGraph.
The SPARQL-Gremlin compiler supports the following prefixes to traverse the graph:
Prefix | Purpose |
---|---|
|
access to vertex id, label or property value |
|
out-edge traversal |
|
property traversal |
Note that element IDs and labels are treated like normal properties, hence they can be accessed using the same pattern:
g.sparql("""SELECT ?name ?id ?label WHERE { ?element v:name ?name . ?element v:id ?id . ?element v:label ?label .}""")
The SPARQL-Gremlin compiler currently supports translation of the SPARQL 1.0 specification, especially SELECT
queries, though there is an on-going effort to cover the entire SPARQL 1.1 query feature spectrum. The supported
SPARQL query types are:
-
Union
-
Optional
-
Order-By
-
Group-By
-
STAR-shaped or neighbourhood queries
-
Query modifiers, such as:
-
Filter with restrictions
-
Count
-
LIMIT
-
OFFSET
-
The current implementation of SPARQL-Gremlin compiler (i.e. SPARQL-Gremlin) does not support the following cases:
-
SPARQL queries with variables in the predicate position are not currently covered, with an exception of the following case:
g.sparql("""SELECT * WHERE { ?x ?y ?z . }""")
-
A SPARQL Union query with un-balanced patterns, i.e. a gremlin union traversal can only be generated if the input SPARQL query has the same number of patterns on both the side of the union operator. For instance, the following SPARQL query cannot be mapped, since a union is executed between different number of graph patterns (two patterns
union
1 pattern).
g.sparql("""SELECT *
WHERE {
{?person e:created ?software .
?person v:name "josh" .}
UNION
{?software v:lang "java" .} }""")
-
A non-Group key variable cannot be projected in a SPARQL query. This is a SPARQL language limitation rather than that of Gremlin/TinkerPop. Apache Jena throws the exception "Non-group key variable in SELECT" if this occurs. For instance, in a SPARQL query with GROUP-BY clause, only the variable on which the grouping is declared, can be projected. The following query is valid:
g.sparql("""SELECT ?age
WHERE {
?person v:label "person" .
?person v:age ?age .
?person v:name ?name .} GROUP BY (?age)""")
Whereas, the following SPARQL query will be invalid:
g.sparql("""SELECT ?person
WHERE {
?person v:label "person" .
?person v:age ?age .
?person v:name ?name .} GROUP BY (?age)""")
-
In a SPARQL query with an ORDER-BY clause, the ordering occurs with respect to the first projected variable in the query. It is possible to choose any number of variable to be projected, however, the first variable in the selection will be the ordering decider. For instance, in the query:
g.sparql("""SELECT ?name ?age
WHERE {
?person v:label "person" .
?person v:age ?age .
?person v:name ?name . } ORDER BY (?age)""")
the result set will be ordered according to the ?name
variable (in ascending order by default) despite having passed
?age
in the order by. Whereas, for the following query:
g.sparql("""SELECT ?age ?name
WHERE {
?person v:label "person" .
?person v:age ?age .
?person v:name ?name . } ORDER BY (?age)""")
the result set will be ordered according to the ?age
(as it is the first projected variable). Finally, for the
select all case (SELECT *
):
g.sparql("""SELECT *
WHERE { ?person v:label "person" . ?person v:age ?age . ?person v:name ?name . } ORDER BY (?age)""")
the the variable encountered first will be the ordering decider, i.e. since we have ?person
encountered first,
the result set will be ordered according to the ?person
variable (which are vertex id).
-
In the current implementation,
OPTIONAL
clause doesn’t work under nesting withUNION
clause (i.e. multiple optional clauses with in a union clause) andORDER-By
clause (i.e. declaring ordering over triple patterns within optional clauses). Everything else with SPARQLOPTIONAL
works just fine.
The following section presents examples of SPARQL queries that are currently covered by the SPARQL-Gremlin compiler.
Select all vertices with the label person
.
g.sparql("""SELECT * WHERE { ?person v:label "person" .}""")
Select the values of the properties name
and age
for each person
vertex.
g.sparql("""SELECT ?name ?age WHERE { ?person v:label "person" . ?person v:name ?name . ?person v:age ?age . }""")
Select only those persons who created a project.
g.sparql("""SELECT ?name ?age WHERE { ?person v:label "person" . ?person v:name ?name . ?person v:age ?age . ?person e:created ?project . }""")
Select only those persons who are older than 30.
g.sparql("""SELECT ?name ?age WHERE { ?person v:label "person" . ?person v:name ?name . ?person v:age ?age . FILTER (?age > 30) }""")
Select the distinct names of the created projects.
g.sparql("""SELECT DISTINCT ?name WHERE { ?person v:label "person" . ?person v:age ?age . ?person e:created ?project . ?project v:name ?name . FILTER (?age > 30)}""")
Select the distinct names of all Java projects.
g.sparql("""SELECT DISTINCT ?name WHERE { ?person v:label "person" . ?person v:age ?age . ?person e:created ?project . ?project v:name ?name . ?project v:lang ?lang . FILTER (?age > 30 && ?lang = "java") }""")
Select all persons who have developed a software in java using union.
g.sparql("""SELECT * WHERE { {?person e:created ?software .} UNION {?software v:lang "java" .} }""")
Return the names of the persons who have created a software in java and optionally python.
g.sparql("""SELECT ?person
WHERE {
?person v:label "person" .
?person e:created ?software .
?software v:lang "java" .
OPTIONAL {?software v:lang "python" . }}""")
Select all vertices with the label person
and order them by their age.
g.sparql("""SELECT ?age ?name WHERE { ?person v:label "person" . ?person v:age ?age . ?person v:name ?name . } ORDER BY (?age)""")
Select all vertices with the label person
and group them by their age.
g.sparql("""SELECT ?age WHERE { ?person v:label "person" . ?person v:age ?age . } GROUP BY (?age)""")
Count the number of projects which have been created by persons under the age of 30 and group them by age. Return only the top two.
g.sparql("""SELECT (COUNT(?project) as ?p)
WHERE {
?person v:label "person" .
?person v:age ?age . FILTER (?age < 30)
?person e:created ?project .
} GROUP BY (?age) LIMIT 2""")
Accessing the Meta-Property of a graph element. Meta-Property can be perceived as the reified statements in an RDF graph.
g = traversal(SparqlTraversalSource).with(graph) g.sparql("""SELECT ?name ?startTime WHERE { ?person v:name "daniel" . ?person p:location ?location . ?location v:value ?name . ?location v:startTime ?startTime }""")
STAR-shaped queries are the queries that form/follow a star-shaped execution plan. These in terms of graph traversals
can be perceived as path queries or neighborhood queries. For instance, getting all the information about a specific
person
or software
.
g.sparql("""SELECT ?age ?software ?lang ?name WHERE { ?person v:name "josh" . ?person v:age ?age . ?person e:created ?software . ?software v:lang ?lang . ?software v:name ?name . }""")
The sparql()
-step takes a SPARQL query and returns a result. That result can be further processed by standard Gremlin
steps as shown below:
g = traversal(SparqlTraversalSource).with(graph) g.sparql("SELECT ?name ?age WHERE { ?person v:name ?name . ?person v:age ?age }") g.sparql("SELECT ?name ?age WHERE { ?person v:name ?name . ?person v:age ?age }").select("name") g.sparql("SELECT * WHERE { }").out("knows").values("name") g.withSack(1.0f).sparql("SELECT * WHERE { }"). repeat(outE().sack(mult).by("weight").inV()). times(2). sack()
Mixing SPARQL with Gremlin steps introduces some interesting possibilities for complex traversals.