Bug in the definition of the exists() function

Hello,

The semantics of the `exists()` function, used to implement `EXISTS { }` expressions, seems to contain a bug, which makes implementing the function hard. I would like to first outline what I perceive as a problem, and then I'd like to suggest a solution, which I will present in two steps. Finally, I will outline some general inconsistencies in the definition of the semantics of SPARQL 1.2 that I have noticed as I was writing this.

**The Problem**

As per the current SPARQL 1.2 draft, `EXISTS` expressions are compiled into the `exists()` function, the semantics of which is defined as follows:

    The value exists(P), given D(G) is true if and only if eval(D(G), substitute(P, μ)) is a non-empty sequence.

This makes intuitive sense, but there are several problems.

1. The second argument of `eval()` in this definition is not of correct type. The definition of `eval()` is given in the first sentence of Section 18.6.2 as follows:

    We define eval(D(G), A) as the evaluation of an algebraic query expression A with respect to a dataset D having active graph G.

Thus, the second argument of A is an **algebraic query expression**; however, the result of `substitute()` is a **graph pattern**, which is a sort of a *type error*.

2. One might try to fix the issue by saying something like this:

    The value exists(P), given D(G) is true if and only if eval(D(G), A) is a non-empty sequence, where A := translate(substitute(P, μ)).

In other words, for each μ, we apply the replacement to `P`, translate the result into an algebraic query expression, and evaluate the latter. Such a of `exists()` now accepts correctly typed arguments, but it may be very hard to implement correctly. Translation of graph patterns is usually applied as a preprocessing step, and not as part of query evaluation. Moreover, since the domain of μ can differ between calls to the same instance of `exists()`, the algebraic query expression obtained for each μ can turn out to be quite different, which may prevent advance compilation.

3. The definition of `substitute()` is syntactic and it does not take into account variable scope, which can have non-obvious consequences. Consider the following query.

    SELECT * WHERE { ?X :R ?Y . FILTER(EXISTS { ?X rdf:type :C . MINUS {  ?X :S ?Y } }) }

As per the definition of variable scope in Section 18.3.1, only variable ?X is in scope of the graph pattern { ?X rdf:type :C . MINUS {  ?X :S ?Y } }. In all other parts of SPARQL, this actually means that ?Y inside MINUS and in ?X :R ?Y are different. For example, query

    SELECT * WHERE { ?X :R ?Y . { ?X rdf:type :C . MINUS { ?X :S ?Y } } }

is equivalent to

    SELECT * WHERE { ?X :R ?Y . { ?X rdf:type :C . MINUS { ?X :S ?Y1 } } }

in the sense that both return exactly the same answers. But then, it would be quite natural to expect the same to apply to the query with `FILTER(EXISTS { ... })` -- that is, it is natural (at least in my opinion) to expect the query to be equivalent to the following one

    SELECT * WHERE { ?X :R ?Y . FILTER(EXISTS { ?X rdf:type :C . MINUS { ?X :S ?Y1 } }) }

Alas, this is not the case: if we apply the modified evaluation of `exists()` I highlighted above, substitution μ would always contain a value for `?Y`; moreover, `substitute()` replaces variables without any restriction on variable scope, so `?Y` inside MINUS should be replaced.

All this makes implementing `exists()` hard, an in fact it casts doubt (at least for me) as to what the actual semantics of `EXISTS { ... }` should be.


**A Potential Solution -- simplified version**

A possible solution is to make `exists()` accept an algebraic query expression as an argument. This would allow systems to compile the graph pattern under `EXISTS { ... }` as part of the preprocessing in the same way as all other query parts, which eliminates the need to deal with graph patterns during query evaluation: the query evaluation engine can now deal only with algebraic query expressions. The semantics of `exists()` could be defined as follows.

    The value exists(A), given D(G) is true if and only if eval(D(G), A) contains at least one substitution μ' such that μ and μ' are compatible.

I am referring above to the notion of compatibility from the definition of Join in Section 18.6. This definition is actually quite close to the current one in its intent: my new definition of `exists()` produces the same results as would be the case if we modified the current definition (with the fix I suggested) so that `substitute(P, μ)` replaces in `P` only variable occurrences that are in scope of `P`. This would allow `substitute()` to "see" that variable `?Y` inside `MINUS` in my example is different to the variable `?Y` inside the triple pattern `?X :R ?Y`. Hence, the impact of this change on implementations would be minimal, but the benefit would be clarity and a clear separation of graph pattern compilation and algebraic evaluation.


**A Potential Solution -- full version**

There is, however, a problem with the fix I presented above: examples with inner `FILTER`s from Section 8.3.3 of the current spec no longer work. I simplify one of the examples from this section as follows.

    SELECT * WHERE { ?X :R ?Y FILTER(EXISTS { ?X :S ?Z . FILTER(?Y < ?Z) }) }

As per Section 18.3.1, variable `?Z` is *not* in scope for `{ ?X :S ?Z . FILTER(?Y < ?Z) }`; thus, the simplified fix for the semantics of `exists()` would make the query equivalent to the following query, which is clearly undesirable.

    SELECT * WHERE { ?X :R ?Y FILTER(EXISTS { ?X :S ?Z . FILTER(?Y1 < ?Z) }) }

The good news is that we can just copy the approach used in `OPTIONAL`, which has to deal with a similar issue. The LeftJoin operator accepts two algebraic query expressions A1 and A2 and an expression F, and it returns substitutions that satisfy A1, A2, and F.

By analogy to OPTIONAL, we can define `exists()` to accept an algebraic query expression A and an expression F as arguments, and define its semantics as follows:

    The value exists(A, F), given D(G) is true if and only if eval(D(G), A) contains at least one substitution μ' such that μ and μ' are compatible and F is true when evaluated in merge(μ, μ').

Furthermore, one can translate `EXISTS { ... }` analogously to how `OPTIONAL` is treated in Section 18.3.2.6. Concretely, Section 18.3.2.2 should be modified as follows:

    Let FS := empty set
    For each form FILTER(expr) in the group graph pattern
        For each occurrence of EXISTS{P} in expr
            Let A := Translate(P)
            If A is of the form Filter(F, A1)
                Replace EXISTS{P} in expr with exists(A1, F)
            Else 
                Replace EXISTS{P} in expr with exists(A, true)
            End
        End
        For each occurrence of NOT EXISTS{P} in expr
            ... similar to the above, but add fn:not in front of exists
        End
        FS := FS ∪ {expr}
    End

This produces the intended results on the examples in Section 18.3.1, as well as on most similar queries used in practice.

I understand that the current definition that uses `substitute()` reflects the desire to use a form sideways information passing: one clearly does not want to implement `exists(A, F)` by evaluating A in full and then filtering out substitutions that are compatible with μ and that satisfy F; rather, one wants to evaluate A as a "simple check" given the current values in μ. Note, however, that the above definition does not preclude an efficient implementation based on sideways information passing: its objective is to just clearly specify the desired output. I have personally implemented the proposed semantics of `exists(A, F)` using sideways information passing in a reasonably efficient manner. If desired, I can provide more information on possible implementation avenues.


**Inconsistencies in Section 18**

As I was writing this, I  noticed that Section 18.6.2 says LeftJoin accepts graph patterns P1 and P2, but the definition of LeftJoin in Section 18.6 says that the arguments of LeftJoin are algebraic query expressions.

Also, Section 18.3.2.2 currently says the following.

    In expr, replace NOT EXISTS{P} with fn:not(exists(translate(P)))

Hence, the argument of `exists()` is an algebraic query expression (which is the type of object returned by `translate()`). However, as I already pointed out earlier, the definition of `exists()` in Section 18.6.2 says that the argument of `exists()` is a graph pattern.

There seem to be other inconsistencies of this type in the current version of the specification, so it might be worth double-checking all the details. In particular, in my view, the specification needs to clearly distinguish pre-translation and post-translation objects. Graph patterns are pre-translation, whereas algebraic query expressions are post-translation objects. Section 18.3 should tell us how to get from the former to the latter, and Section 18.6.2 should then talk exclusively about algebraic query expressions.



I would be happy to provide any clarification about any of these issues -- just please let me know.

Regards,

Boris

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug in the definition of the exists() function #251

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bug in the definition of the exists() function #251

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions