Skip to content

Bug in the definition of the exists() function #251

@borismotik

Description

@borismotik

Hello,

The semantics of the exists() function, used to implement EXISTS { } expressions, seems to contain a bug, which makes implementing the function hard. I would like to first outline what I perceive as a problem, and then I'd like to suggest a solution, which I will present in two steps. Finally, I will outline some general inconsistencies in the definition of the semantics of SPARQL 1.2 that I have noticed as I was writing this.

The Problem

As per the current SPARQL 1.2 draft, EXISTS expressions are compiled into the exists() function, the semantics of which is defined as follows:

The value exists(P), given D(G) is true if and only if eval(D(G), substitute(P, μ)) is a non-empty sequence.

This makes intuitive sense, but there are several problems.

  1. The second argument of eval() in this definition is not of correct type. The definition of eval() is given in the first sentence of Section 18.6.2 as follows:

    We define eval(D(G), A) as the evaluation of an algebraic query expression A with respect to a dataset D having active graph G.

Thus, the second argument of A is an algebraic query expression; however, the result of substitute() is a graph pattern, which is a sort of a type error.

  1. One might try to fix the issue by saying something like this:

    The value exists(P), given D(G) is true if and only if eval(D(G), A) is a non-empty sequence, where A := translate(substitute(P, μ)).

In other words, for each μ, we apply the replacement to P, translate the result into an algebraic query expression, and evaluate the latter. Such a of exists() now accepts correctly typed arguments, but it may be very hard to implement correctly. Translation of graph patterns is usually applied as a preprocessing step, and not as part of query evaluation. Moreover, since the domain of μ can differ between calls to the same instance of exists(), the algebraic query expression obtained for each μ can turn out to be quite different, which may prevent advance compilation.

  1. The definition of substitute() is syntactic and it does not take into account variable scope, which can have non-obvious consequences. Consider the following query.

    SELECT * WHERE { ?X :R ?Y . FILTER(EXISTS { ?X rdf:type :C . MINUS { ?X :S ?Y } }) }

As per the definition of variable scope in Section 18.3.1, only variable ?X is in scope of the graph pattern { ?X rdf:type :C . MINUS { ?X :S ?Y } }. In all other parts of SPARQL, this actually means that ?Y inside MINUS and in ?X :R ?Y are different. For example, query

SELECT * WHERE { ?X :R ?Y . { ?X rdf:type :C . MINUS { ?X :S ?Y } } }

is equivalent to

SELECT * WHERE { ?X :R ?Y . { ?X rdf:type :C . MINUS { ?X :S ?Y1 } } }

in the sense that both return exactly the same answers. But then, it would be quite natural to expect the same to apply to the query with FILTER(EXISTS { ... }) -- that is, it is natural (at least in my opinion) to expect the query to be equivalent to the following one

SELECT * WHERE { ?X :R ?Y . FILTER(EXISTS { ?X rdf:type :C . MINUS { ?X :S ?Y1 } }) }

Alas, this is not the case: if we apply the modified evaluation of exists() I highlighted above, substitution μ would always contain a value for ?Y; moreover, substitute() replaces variables without any restriction on variable scope, so ?Y inside MINUS should be replaced.

All this makes implementing exists() hard, an in fact it casts doubt (at least for me) as to what the actual semantics of EXISTS { ... } should be.

A Potential Solution -- simplified version

A possible solution is to make exists() accept an algebraic query expression as an argument. This would allow systems to compile the graph pattern under EXISTS { ... } as part of the preprocessing in the same way as all other query parts, which eliminates the need to deal with graph patterns during query evaluation: the query evaluation engine can now deal only with algebraic query expressions. The semantics of exists() could be defined as follows.

The value exists(A), given D(G) is true if and only if eval(D(G), A) contains at least one substitution μ' such that μ and μ' are compatible.

I am referring above to the notion of compatibility from the definition of Join in Section 18.6. This definition is actually quite close to the current one in its intent: my new definition of exists() produces the same results as would be the case if we modified the current definition (with the fix I suggested) so that substitute(P, μ) replaces in P only variable occurrences that are in scope of P. This would allow substitute() to "see" that variable ?Y inside MINUS in my example is different to the variable ?Y inside the triple pattern ?X :R ?Y. Hence, the impact of this change on implementations would be minimal, but the benefit would be clarity and a clear separation of graph pattern compilation and algebraic evaluation.

A Potential Solution -- full version

There is, however, a problem with the fix I presented above: examples with inner FILTERs from Section 8.3.3 of the current spec no longer work. I simplify one of the examples from this section as follows.

SELECT * WHERE { ?X :R ?Y FILTER(EXISTS { ?X :S ?Z . FILTER(?Y < ?Z) }) }

As per Section 18.3.1, variable ?Z is not in scope for { ?X :S ?Z . FILTER(?Y < ?Z) }; thus, the simplified fix for the semantics of exists() would make the query equivalent to the following query, which is clearly undesirable.

SELECT * WHERE { ?X :R ?Y FILTER(EXISTS { ?X :S ?Z . FILTER(?Y1 < ?Z) }) }

The good news is that we can just copy the approach used in OPTIONAL, which has to deal with a similar issue. The LeftJoin operator accepts two algebraic query expressions A1 and A2 and an expression F, and it returns substitutions that satisfy A1, A2, and F.

By analogy to OPTIONAL, we can define exists() to accept an algebraic query expression A and an expression F as arguments, and define its semantics as follows:

The value exists(A, F), given D(G) is true if and only if eval(D(G), A) contains at least one substitution μ' such that μ and μ' are compatible and F is true when evaluated in merge(μ, μ').

Furthermore, one can translate EXISTS { ... } analogously to how OPTIONAL is treated in Section 18.3.2.6. Concretely, Section 18.3.2.2 should be modified as follows:

Let FS := empty set
For each form FILTER(expr) in the group graph pattern
    For each occurrence of EXISTS{P} in expr
        Let A := Translate(P)
        If A is of the form Filter(F, A1)
            Replace EXISTS{P} in expr with exists(A1, F)
        Else 
            Replace EXISTS{P} in expr with exists(A, true)
        End
    End
    For each occurrence of NOT EXISTS{P} in expr
        ... similar to the above, but add fn:not in front of exists
    End
    FS := FS ∪ {expr}
End

This produces the intended results on the examples in Section 18.3.1, as well as on most similar queries used in practice.

I understand that the current definition that uses substitute() reflects the desire to use a form sideways information passing: one clearly does not want to implement exists(A, F) by evaluating A in full and then filtering out substitutions that are compatible with μ and that satisfy F; rather, one wants to evaluate A as a "simple check" given the current values in μ. Note, however, that the above definition does not preclude an efficient implementation based on sideways information passing: its objective is to just clearly specify the desired output. I have personally implemented the proposed semantics of exists(A, F) using sideways information passing in a reasonably efficient manner. If desired, I can provide more information on possible implementation avenues.

Inconsistencies in Section 18

As I was writing this, I noticed that Section 18.6.2 says LeftJoin accepts graph patterns P1 and P2, but the definition of LeftJoin in Section 18.6 says that the arguments of LeftJoin are algebraic query expressions.

Also, Section 18.3.2.2 currently says the following.

In expr, replace NOT EXISTS{P} with fn:not(exists(translate(P)))

Hence, the argument of exists() is an algebraic query expression (which is the type of object returned by translate()). However, as I already pointed out earlier, the definition of exists() in Section 18.6.2 says that the argument of exists() is a graph pattern.

There seem to be other inconsistencies of this type in the current version of the specification, so it might be worth double-checking all the details. In particular, in my view, the specification needs to clearly distinguish pre-translation and post-translation objects. Graph patterns are pre-translation, whereas algebraic query expressions are post-translation objects. Section 18.3 should tell us how to get from the former to the latter, and Section 18.6.2 should then talk exclusively about algebraic query expressions.

I would be happy to provide any clarification about any of these issues -- just please let me know.

Regards,

Boris

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions