Skip to content

Latest commit

 

History

History
435 lines (347 loc) · 13.3 KB

compilers.asciidoc

File metadata and controls

435 lines (347 loc) · 13.3 KB

Gremlin Compilers

There are many languages built to query data. SQL is typically used to query relational data. There is SPARQL for RDF data. Cypher is used to do pattern matching in graph data. The list could go on. Compilers convert languages like these to Gremlin so that it becomes possible to use them in any context that Gremlin is used. In other words, a Gremlin Compiler enables a particular query language to work on any TinkerPop-enabled graph system.

SPARQL-Gremlin

gremlintron

The SPARQL-Gremlin compiler, transforms SPARQL queries into Gremlin traversals. It uses the Apache Jena SPARQL processor ARQ, which provides access to a syntax tree of a SPARQL query.

The goal of this work is to bridge the query interoperability gap between the two famous, yet fairly disconnected, graph communities: Semantic Web (which relies on the RDF data model) and Graph database (which relies on property graph data model).

Note
The foundational research work on SPARQL-Gremlin compiler (aka Gremlinator) can be found in the Gremlinator paper. This paper presents the graph query language semantics of SPARQL and Gremlin, and a formal mapping between SPARQL pattern matching graph patterns and Gremlin traversals.
<dependency>
   <groupId>org.apache.tinkerpop</groupId>
   <artifactId>sparql-gremlin</artifactId>
   <version>x.y.z</version>
</dependency>

The SPARQL-Gremlin compiler converts SPARQL queries into Gremlin so that they can be executed across any TinkerPop-enabled graph system. To use this compiler in the Gremlin Console, first install and activate the "tinkerpop.sparql" plugin:

gremlin> :install org.apache.tinkerpop sparql-gremlin x.y.z
==>Loaded: [org.apache.tinkerpop, sparql-gremlin, x.y.z]
gremlin> :plugin use tinkerpop.sparql
==>tinkerpop.sparql activated

Installing this plugin will download appropriate dependencies and import certain classes to the console so that they may be used as follows:

graph = TinkerFactory.createModern()
g = traversal(SparqlTraversalSource).with(graph)                                                       (1)
g.sparql("""SELECT ?name ?age
            WHERE { ?person v:name ?name . ?person v:age ?age }
            ORDER BY ASC(?age)""")                                                                     (2)
  1. Define g as a TraversalSource that uses the SparqlTraversalSource - by default, the traversal() method usually returns a GraphTraversalSource which includes the standard Gremlin starts steps like V() or E(). In this case, the SparqlTraversalSource enables starts steps that are specific to SPARQL only - in this case the sparql() start step.

  2. Execute a SPARQL query against the TinkerGraph instance. The SparqlTraversalSource uses a TraversalStrategy to transparently converts that SPARQL query into a standard Gremlin traversal and then when finally iterated, executes that against the TinkerGraph.

Prefixes

The SPARQL-Gremlin compiler supports the following prefixes to traverse the graph:

Prefix Purpose

v:<id|label|<name>>

access to vertex id, label or property value

e:<label>

out-edge traversal

p:<name>

property traversal

Note that element IDs and labels are treated like normal properties, hence they can be accessed using the same pattern:

g.sparql("""SELECT ?name ?id ?label
	WHERE {
	?element v:name ?name .
	?element v:id ?id .
	?element v:label ?label .}""")

Supported Queries

The SPARQL-Gremlin compiler currently supports translation of the SPARQL 1.0 specification, especially SELECT queries, though there is an on-going effort to cover the entire SPARQL 1.1 query feature spectrum. The supported SPARQL query types are:

  • Union

  • Optional

  • Order-By

  • Group-By

  • STAR-shaped or neighbourhood queries

  • Query modifiers, such as:

    • Filter with restrictions

    • Count

    • LIMIT

    • OFFSET

Limitations

The current implementation of SPARQL-Gremlin compiler (i.e. SPARQL-Gremlin) does not support the following cases:

  • SPARQL queries with variables in the predicate position are not currently covered, with an exception of the following case:

g.sparql("""SELECT * WHERE { ?x ?y ?z . }""")
  • A SPARQL Union query with un-balanced patterns, i.e. a gremlin union traversal can only be generated if the input SPARQL query has the same number of patterns on both the side of the union operator. For instance, the following SPARQL query cannot be mapped, since a union is executed between different number of graph patterns (two patterns union 1 pattern).

g.sparql("""SELECT *
            WHERE {
                {?person e:created ?software .
                ?person v:name "josh" .}
                UNION
                {?software v:lang "java" .} }""")
  • A non-Group key variable cannot be projected in a SPARQL query. This is a SPARQL language limitation rather than that of Gremlin/TinkerPop. Apache Jena throws the exception "Non-group key variable in SELECT" if this occurs. For instance, in a SPARQL query with GROUP-BY clause, only the variable on which the grouping is declared, can be projected. The following query is valid:

g.sparql("""SELECT ?age
            WHERE {
                ?person v:label "person" .
                ?person v:age ?age .
                ?person v:name ?name .} GROUP BY (?age)""")

Whereas, the following SPARQL query will be invalid:

g.sparql("""SELECT ?person
            WHERE {
              ?person v:label "person" .
              ?person v:age ?age .
              ?person v:name ?name .} GROUP BY (?age)""")
  • In a SPARQL query with an ORDER-BY clause, the ordering occurs with respect to the first projected variable in the query. It is possible to choose any number of variable to be projected, however, the first variable in the selection will be the ordering decider. For instance, in the query:

g.sparql("""SELECT ?name ?age
            WHERE {
                ?person v:label "person" .
                ?person v:age ?age .
                ?person v:name ?name . } ORDER BY (?age)""")

the result set will be ordered according to the ?name variable (in ascending order by default) despite having passed ?age in the order by. Whereas, for the following query:

g.sparql("""SELECT ?age ?name
            WHERE {
                ?person v:label "person" .
                ?person v:age ?age .
                ?person v:name ?name . } ORDER BY (?age)""")

the result set will be ordered according to the ?age (as it is the first projected variable). Finally, for the select all case (SELECT *):

g.sparql("""SELECT *
            WHERE { ?person v:label "person" . ?person v:age ?age . ?person v:name ?name . } ORDER BY (?age)""")

the the variable encountered first will be the ordering decider, i.e. since we have ?person encountered first, the result set will be ordered according to the ?person variable (which are vertex id).

  • In the current implementation, OPTIONAL clause doesn’t work under nesting with UNION clause (i.e. multiple optional clauses with in a union clause) and ORDER-By clause (i.e. declaring ordering over triple patterns within optional clauses). Everything else with SPARQL OPTIONAL works just fine.

Examples

The following section presents examples of SPARQL queries that are currently covered by the SPARQL-Gremlin compiler.

Select All

Select all vertices in the graph.

g.sparql("""SELECT * WHERE { }""")

Match Constant Values

Select all vertices with the label person.

g.sparql("""SELECT * WHERE {  ?person v:label "person" .}""")

Select Specific Elements

Select the values of the properties name and age for each person vertex.

g.sparql("""SELECT ?name ?age
WHERE {
  ?person v:label "person" .
  ?person v:name ?name .
  ?person v:age ?age . }""")

Pattern Matching

Select only those persons who created a project.

g.sparql("""SELECT ?name ?age
WHERE {
  ?person v:label "person" .
  ?person v:name ?name .
  ?person v:age ?age .
  ?person e:created ?project . }""")

Filtering

Select only those persons who are older than 30.

g.sparql("""SELECT ?name ?age
WHERE {
  ?person v:label "person" .
  ?person v:name ?name .
  ?person v:age ?age .
    FILTER (?age > 30) }""")

Deduplication

Select the distinct names of the created projects.

g.sparql("""SELECT DISTINCT ?name
WHERE {
  ?person v:label "person" .
  ?person v:age ?age .
  ?person e:created ?project .
  ?project v:name ?name .
    FILTER (?age > 30)}""")

Multiple Filters

Select the distinct names of all Java projects.

g.sparql("""SELECT DISTINCT ?name
WHERE {
  ?person v:label "person" .
  ?person v:age ?age .
  ?person e:created ?project .
  ?project v:name ?name .
  ?project v:lang ?lang .
    FILTER (?age > 30 && ?lang = "java") }""")

Union

Select all persons who have developed a software in java using union.

g.sparql("""SELECT *
WHERE {
  {?person e:created ?software .}
  UNION
  {?software v:lang "java" .} }""")

Optional

Return the names of the persons who have created a software in java and optionally python.

g.sparql("""SELECT ?person
WHERE {
  ?person v:label "person" .
  ?person e:created ?software .
  ?software v:lang "java" .
  OPTIONAL {?software v:lang "python" . }}""")

Order By

Select all vertices with the label person and order them by their age.

g.sparql("""SELECT ?age ?name
WHERE {
  ?person v:label "person" .
  ?person v:age ?age .
  ?person v:name ?name .
} ORDER BY (?age)""")

Group By

Select all vertices with the label person and group them by their age.

g.sparql("""SELECT ?age
WHERE {
  ?person v:label "person" .
  ?person v:age ?age .
} GROUP BY (?age)""")

Mixed/complex/aggregation-based queries

Count the number of projects which have been created by persons under the age of 30 and group them by age. Return only the top two.

g.sparql("""SELECT (COUNT(?project) as ?p)
WHERE {
  ?person v:label "person" .
  ?person v:age ?age . FILTER (?age < 30)
  ?person e:created ?project .
} GROUP BY (?age) LIMIT 2""")

Meta-Property Access

Accessing the Meta-Property of a graph element. Meta-Property can be perceived as the reified statements in an RDF graph.

g = traversal(SparqlTraversalSource).with(graph)
g.sparql("""SELECT ?name ?startTime
WHERE {
  ?person v:name "daniel" .
  ?person p:location ?location .
  ?location v:value ?name .
  ?location v:startTime ?startTime }""")

STAR-shaped queries

STAR-shaped queries are the queries that form/follow a star-shaped execution plan. These in terms of graph traversals can be perceived as path queries or neighborhood queries. For instance, getting all the information about a specific person or software.

g.sparql("""SELECT ?age ?software ?lang ?name
WHERE {
  ?person v:name "josh" .
  ?person v:age ?age .
  ?person e:created ?software .
  ?software v:lang ?lang .
  ?software v:name ?name . }""")

With Gremlin

The sparql()-step takes a SPARQL query and returns a result. That result can be further processed by standard Gremlin steps as shown below:

g = traversal(SparqlTraversalSource).with(graph)
g.sparql("SELECT ?name ?age WHERE { ?person v:name ?name . ?person v:age ?age }")
g.sparql("SELECT ?name ?age WHERE { ?person v:name ?name . ?person v:age ?age }").select("name")
g.sparql("SELECT * WHERE { }").out("knows").values("name")
g.withSack(1.0f).sparql("SELECT * WHERE { }").
  repeat(outE().sack(mult).by("weight").inV()).
    times(2).
  sack()

Mixing SPARQL with Gremlin steps introduces some interesting possibilities for complex traversals.