easier langtag and datatype agnostic matching of literals #34

Open
joernhees opened this issue Apr 3, 2019 · 19 comments
Labels
query Extends the Query spec

Comments

@joernhees

Why?

For exploration or candidate generation i'm often doing things like the following in order to get results independent of language tag or datatype of the literal:

?s ?p ?l .
FILTER(STR(?l) = "XYZ") .

The problem with this is that many SPARQL endpoints don't seem to optimize for such queries (i.e., VALUES ?l {"XYZ"@en "XYZ"@de "XYZ"@fr ...10more... "XYZ" } ?s ?p ?l . tends to be a lot faster than the above FILTER clause!). While this could be seen as a common query plan optimization problem of SPARQL engines (failing to identify the perfect string lookup and using their existing indices to quickly answer the query), i'd like to honor this "frequent special case" with its own little syntax add-on, maybe also making it a lot easier to identify and optimize for.
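
To make the difference between the two query shapes concrete, here is a minimal Python sketch. It is not any engine's implementation: the `Literal` type, the sample triples, and both function names are assumptions made purely for illustration.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Toy RDF literal: lexical form plus optional language tag or datatype."""
    lex: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:a", "rdfs:label", Literal("XYZ", lang="en")),
    ("ex:b", "rdfs:label", Literal("XYZ", lang="de")),
    ("ex:c", "ex:code", Literal("XYZ")),
    ("ex:d", "ex:code", Literal("XYZ", datatype="ex:myDT")),
    ("ex:e", "rdfs:label", Literal("ABC", lang="en")),
]


def match_by_str(triples, wanted):
    """Like FILTER(STR(?l) = wanted): compare lexical forms only."""
    return [s for s, _, o in triples if o.lex == wanted]


def match_by_values(triples, candidates):
    """Like VALUES ?l { ... }: match only the exact literals listed."""
    allowed = set(candidates)
    return [s for s, _, o in triples if o in allowed]


by_str = match_by_str(TRIPLES, "XYZ")
by_values = match_by_values(
    TRIPLES,
    [Literal("XYZ", lang="en"), Literal("XYZ", lang="de"), Literal("XYZ")],
)
print(by_str)
print(by_values)
```

In this toy data the VALUES-style lookup misses `ex:d`, because its datatype was not among the enumerated candidates. The enumeration is only as complete as the list of tags and datatypes the query author thought of.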

For a syntax extension some things come to mind, such as "XYZ"@*, "XYZ"^^* and/or "XYZ"*, but not decided at all.

Previous work

Could be related to #13 and #17, but they seem to focus on other aspects.

Considerations for backward compatibility

How do current endpoints deal with "XYZ"@* ?

@cygri

cygri commented Apr 3, 2019

If there was standard full text search that allowed matching any language, would that satisfy the requirement or would you still be asking for this addition?

#17 is about using the "XYZ"@en form with a wildcard instead of XYZ. The proposal here is about using the form with a wildcard instead of @en. So the two issues relate and should be considered together.

I hope "..."@* wouldn't match IRIs?

@VladimirAlexiev
Contributor

I think that FTS would fill the bill. The query would be written like eg

?x luc:label """  "foo bar" """

However, not all repos have FTS in all situations, and you need to configure an index, and standardizing FTS is a lot harder.

@cygri

cygri commented Apr 3, 2019

But FTS would be a better solution to the stated problem. If the query author just wants a quick match and is not interested in the language tag, they are probably also not interested in details like capitalisation. The proposal here will only match if capitalisation in the query matches capitalisation in the data. That makes me think the proposal is actually not a very good solution to the stated problem. If FTS is available, that’s a better solution. If it is not available, maybe the SPARQL 1.1 way with FILTER(str(?x)="...") is good enough? Do we really want to add a semi-solution?

@joernhees
Author

hmm, full text search would maybe be a solution for some, but in my cases i actually don't want to do any kind of fuzzy lookup, just a simple perfect string match. It's very common (at least from my experience) that you have some kind of identifier (i'll mention names and uuids here as example) from some system and now want to look it up in your RDF store's literals. Especially when doing this on many loaded datasets you don't know how people modeled these identifying strings in RDF literals... maybe they used a language tag or not, maybe they used a specific datatype or not. I simply want to be able to say "i don't care about those RDF specifics".

You're right, my example above should be extended to the following to not match IRIs:

?s ?p ?l .
FILTER(isLiteral(?l) && STR(?l) = "XYZ") .

The further i extend this, the more obvious my argument becomes however: it's ever harder to correctly identify this and optimize the simple lookup we're talking about here, that nearly every backend will actually have an index for and could do in O(log(n)) time rather than a full table scan that this ends up with very frequently.
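
As a rough illustration of the O(log(n)) claim, the following Python sketch looks up a string in a sorted term table with binary search and then applies the isLiteral check, next to the full scan it would replace. The term table layout and the function names are hypothetical, and real stores organize their indices differently.

```python
import bisect

# Hypothetical term table: (string, kind, extra), sorted by the string.
# kind is "literal" or "iri"; extra is a language tag, a datatype, or None.
TERMS = sorted(
    [
        ("XYZZY", "literal", None),
        ("XYZ", "iri", None),
        ("XYZ", "literal", None),
        ("XYZ", "literal", "de"),
        ("XYZ", "literal", "en"),
        ("XYZ", "literal", "ex:myDT"),
        ("ABC", "literal", "en"),
    ],
    key=lambda t: t[0],
)
KEYS = [t[0] for t in TERMS]


def lookup_literals(wanted):
    """Binary-search the sorted strings, then keep literals only (isLiteral)."""
    lo = bisect.bisect_left(KEYS, wanted)
    hi = bisect.bisect_right(KEYS, wanted)
    return [t for t in TERMS[lo:hi] if t[1] == "literal"]


def scan_literals(wanted):
    """Full scan: the unoptimized way to evaluate the same FILTER."""
    return [t for t in TERMS if t[1] == "literal" and t[0] == wanted]


hits = lookup_literals("XYZ")
print(hits)
```

Both functions return the same four literals here; only the amount of data touched differs.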

Talking about "XYZ"@*, "XYZ"^^* and/or "XYZ"* i'm however not sure what developers would intuitively expect them to match: Would you expect "XYZ"@* to match plain literals? Would you expect "XYZ"@* to match "XYZ"^^<myDT>? What if i tell you that <myDT> is a subclass of rdf:langString? It becomes even more obscure when talking about "XYZ"^^*... would you expect this to match plain literals? Would you expect it to match "XYZ"@en (actually one could argue that this string has the datatype rdf:langString)?

@cygri

cygri commented Apr 4, 2019

Would you expect "XYZ"@* to match plain literals? Would you expect "XYZ"@* to match "XYZ"^^<myDT>?

The way to teach this is to say: “lang("XYZ") and lang("XYZ"^^<myDT>) return an empty string, and * matches an empty string, so "XYZ"@* matches both.” It's a bit of a fudge but not too bad.
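
A small Python sketch of that reading, under the assumption that the wildcard is applied to the value of lang(). The `Literal` type and `matches_lang_pattern` are hypothetical names; nothing here is specified syntax or semantics.

```python
from fnmatch import fnmatchcase
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Toy RDF literal: lexical form plus optional language tag or datatype."""
    lex: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def lang(lit):
    """Mirror of SPARQL's lang(): the tag, or "" when there is none."""
    return lit.lang or ""


def matches_lang_pattern(lit, lex, pattern):
    """Hypothetical semantics for "lex"@pattern, for example "XYZ"@*."""
    return lit.lex == lex and fnmatchcase(lang(lit), pattern)


print(matches_lang_pattern(Literal("XYZ"), "XYZ", "*"))
print(matches_lang_pattern(Literal("XYZ", datatype="ex:myDT"), "XYZ", "*"))
print(matches_lang_pattern(Literal("XYZ", lang="en"), "XYZ", "*"))
```

Because the function only ever receives literals, IRIs are out of scope by construction in this sketch.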

What if i tell you that <myDT> is a subclass of rdf:langString?

It's not possible to subclass rdf:langString.

It becomes even more obscure when talking about "XYZ"^^*

I am not advocating for that syntax. I don't think it's a good idea.

Another option would be ~"XYZ" to match any literal with lexical form XYZ.

@joernhees
Author

It's not possible to subclass rdf:langString.

hmm, couldn't easily find why not, any pointer?

Apart from that i'd also favor "XYZ"@* over the other forms.

@dbooth-boston
Collaborator

Small philosophical comment: IMHO the root of the difficulty here is that literals can have two attributes -- type and language -- but they are disjoint and not modeled as explicit triples. That was a dirty design choice that is biting us now. I hope that any new solution at least approximates a cleaner solution.

@cygri

cygri commented Apr 10, 2019

@joernhees What I meant is that literals of <myDT> cannot have language tags (cf. RDF Concepts) and lang("XYZ"^^<myDT>)="" (cf. definition of lang). So I don't see how asserting that <myDT> is a subclass of rdf:langString could ever be a useful thing to do.

@joernhees
Author

oh, i was thinking along the lines of "what, i cannot make the statement <myDT> rdfs:subClassOf rdf:langString . ?" and not the RDF concepts route ;) ("foo"@en^^<myDT> would've definitely been even more fun ^^)

@arthurpsmith

I have run into this so many times, this (language-independent string matching) would be my highest priority for 1.2. That said, there are a ton of similar issues in SPARQL execution caused by FILTER having second-class status (that is, only being applied AFTER a list of matching triples is created, rather than being applied proactively to generate a collection of triples). Case-insensitive matching is another just mentioned here, also substring matching (STRSTARTS, STRENDS, CONTAINS) etc. Hoping to match regexes might be a step too far, but if we could move these basic string FILTER actions into the main triple-matching syntax that could be a huge win (similar issues for date and numeric comparison operations too I think) - and I don't think requiring "full text search" is the only way to do that.

@afs
Collaborator

afs commented Apr 16, 2019

@arthurpsmith -- good to hear about the issues you've encountered.

caused by FILTER having second-class status (that is, only being applied AFTER a list of matching triples is created, rather than being applied proactively to generate a collection of triples)

This would seem to be more about a lack of indexing/optimization. The SPARQL spec can define what the correct results are, but how that is achieved is up to the implementation. Any implementation has to balance a number of factors such as time to load vs amount of indexing done.

Case insensitive versions of STRSTARTS, STRENDS, CONTAINS and string equality sound like a good idea.

@JervenBolleman added the "query" (Extends the Query spec) label Apr 16, 2019

@arthurpsmith

@afs - yes, it could be done now with smart optimization and use of indexes - are you aware of any existing sparql servers that do this? I'm mostly familiar with blazegraph and haven't seen any sign of this being possible. I think that if some of the simpler string matching were made directly part of the 1.2 spec, it would be a more obvious target for index-based optimization, rather than having to parse FILTER syntax.

@arthurpsmith

arthurpsmith commented Apr 16, 2019

Not sure if people have seen this over in the EasierRDF ideas list:
w3c/EasierRDF#22

@VladimirAlexiev
Contributor

It's not about FILTER parsing, it's the fact that repos often need to scan all literals.
You could use some prefix index (trie) for strstarts(), but would need another for case-insensitive; and things get much harder for regex().

I tried this query:

select * {
  ?x rdfs:label ?y
   filter(strstarts(?y,"Sofi"))
}

select * {
  ?x rdfs:label ?y
   filter(regex(?y,"^Sofi"))
}

  • this count returns 325 ?!?

select count(*) {
  ?x rdfs:label ?y
   filter(strstarts(?y,"Sofi"))
}

  • this count returns nothing ?!?!?!?!

select count(*) {
  ?x rdfs:label ?y
   filter(regex(?y,"^Sofi"))
}

@joernhees
Author

joernhees commented Apr 17, 2019

Hmm, i think this is a bit more complicated...

Yes, FILTER parsing with a good query optimizer should be enough, but practical implementations show us that it actually often doesn't seem to be implemented. In other words: a simple "language agnostic lookup" of a string in practice often goes unoptimized, even though all necessary indices are there already. That's why i'd like to make this common use-case a lot easier to write and identify.

For the different counts see #51, a thing you find "not interesting to standardize because SPARQL servers don't work like this" ;)

@afs
Collaborator

afs commented Apr 17, 2019

STRSTARTS works on rdf:langString, REGEX needs a str().

https://www.w3.org/TR/sparql11-query/#func-arg-compatibility
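
The linked section defines when two arguments of a function such as STRSTARTS are compatible. The Python sketch below is one reading of those three rules. The `Literal` type and the helper names are assumptions for illustration, and raising `TypeError` merely stands in for a SPARQL type error.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Toy RDF literal: lexical form plus optional language tag or datatype."""
    lex: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def is_simple_or_xsd_string(lit):
    """True for a simple literal or a literal typed as xsd:string."""
    return lit.lang is None and lit.datatype in (None, XSD_STRING)


def compatible(arg1, arg2):
    """Argument compatibility as read from SPARQL 1.1, section 17.4.3.1.1."""
    # Rule 1: both are simple literals or xsd:string literals.
    if is_simple_or_xsd_string(arg1) and is_simple_or_xsd_string(arg2):
        return True
    # Rule 2: both carry language tags, and the tags are identical.
    if arg1.lang is not None and arg2.lang is not None:
        return arg1.lang == arg2.lang
    # Rule 3: first has a language tag, second is simple or xsd:string.
    return arg1.lang is not None and is_simple_or_xsd_string(arg2)


def strstarts(arg1, arg2):
    """STRSTARTS on lexical forms; incompatible arguments raise an error."""
    if not compatible(arg1, arg2):
        raise TypeError("incompatible arguments")
    return arg1.lex.startswith(arg2.lex)


print(strstarts(Literal("Sofia", lang="bg"), Literal("Sofi")))
print(compatible(Literal("Sofia"), Literal("Sofi", lang="bg")))
```

Under these rules a language-tagged label can be tested against a plain "Sofi" without wrapping it in str(), while the reverse pairing is not compatible.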

@arthurpsmith

arthurpsmith commented Apr 17, 2019

So I can think of two ways of pushing this sort of thing out of FILTERs and into triple matching syntax:

  1. Similar to what was suggested above, allow wild-cards and some simple regex indicators in literals - "XYZ"@* for any language, "XY"&* for strstarts, &*"YZ" for strends, "XYZ"&i for case independent perhaps? I think the wildcards would need to be outside the string quotes to avoid breaking backwards compatibility regarding searches containing those literal characters.
  2. Allow literals to be subjects of triples in SPARQL queries, where the predicates come from some registered function list - this would look something like:
    ?s ?p ?literal . ?literal <&STRSTARTS> "XYZ" .

In both cases the query optimizer would have to recognize the situation and use an appropriate index, or fall back to the old FILTER approach. I think the second approach may be more flexible as it introduces potentially a mechanism for easily adding new match functions, allowing matches ('<', '!=', etc.) for other kinds of literals, etc.
(edited - '?' obviously doesn't work as wildcard in SPARQL!)
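
The second approach could be sketched in Python roughly as follows: a registry maps function names to evaluators, some of which are backed by an index while the rest fall back to a scan, as a FILTER would. The registry, the function names, and the sample triples are all hypothetical and only illustrate the dispatch idea.

```python
import bisect

# Invented sample triples with plain string objects.
TRIPLES = [
    ("ex:a", "rdfs:label", "XYZ"),
    ("ex:b", "rdfs:label", "XYZZY"),
    ("ex:c", "rdfs:label", "ABC"),
    ("ex:d", "rdfs:label", "AXYZ"),
]
BY_OBJECT = sorted(TRIPLES, key=lambda t: t[2])
OBJECT_KEYS = [t[2] for t in BY_OBJECT]


def scan(test):
    """Fallback evaluator: check every triple, like a plain FILTER."""
    def run(arg):
        return [t for t in TRIPLES if test(t[2], arg)]
    return run


def strstarts_indexed(arg):
    """Index-backed evaluator: binary search on the sorted object strings."""
    lo = bisect.bisect_left(OBJECT_KEYS, arg)
    hi = lo
    while hi < len(OBJECT_KEYS) and OBJECT_KEYS[hi].startswith(arg):
        hi += 1
    return BY_OBJECT[lo:hi]


# Hypothetical registry of match functions usable as predicates on literals.
REGISTRY = {
    "STRSTARTS": strstarts_indexed,
    "STRENDS": scan(lambda lex, arg: lex.endswith(arg)),
    "CONTAINS": scan(lambda lex, arg: arg in lex),
}


def evaluate(function_name, arg):
    """Evaluate `?s ?p ?literal . ?literal <function_name> arg`."""
    return REGISTRY[function_name](arg)


starts = evaluate("STRSTARTS", "XYZ")
ends = evaluate("STRENDS", "XYZ")
print(starts)
print(ends)
```

Adding a new match function then means adding one registry entry, with or without an index behind it.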

@TallTed
Member

TallTed commented Apr 29, 2019

@VladimirAlexiev - I'm not sure what you ran into with your testing against DBpedia; probably it was at least partly Virtuoso's Anytime Query functionality, that returns partial results upon timeout. I just tested your 4 queries, with a timeout of 3000 seconds (vs the endpoint's default of 30 seconds), and all returned 1465. The links below are all live, and should deliver the same results to you.

  1. filter(strstarts(?y,"Sofi")) query and result of 1465 rows

  2. filter(regex(?y,"^Sofi")) query and result of 1465 rows

  3. count of filter(strstarts(?y,"Sofi")) query and result of 1465

  4. count of filter(regex(?y,"^Sofi")) query and result of 1465

@VladimirAlexiev
Contributor

@TallTed which sort of proves the point, that repos need special indexing to handle this in a fast way.

@joernhees I agree with this proposal because it gives repos incentive to index str(), as opposed to random regex patterns which is a lot harder.

Re #51: I agree with you that timing out an expensive count is better than returning a wrong count. But here we're discussing ways to make querying faster.

8 participants