easier langtag and datatype agnostic matching of literals #34

joernhees · 2019-04-03T16:18:56Z

Why?

For exploration or candidate generation i'm often doing things like the following in order to get results independent of language tag or datatype of the literal:

?s ?p ?l .
FILTER(STR(?l) = "XYZ") .

The problem with this is that many SPARQL endpoints don't seem to optimize for such queries (i.e., VALUES ?l {"XYZ"@en "XYZ"@de "XYZ"@fr ...10more... "XYZ" } ?s ?p ?l . tends to be a lot faster than the above FILTER clause!). While this could be seen as a common query plan optimization problem of SPARQL engines (failing to identify the perfect string lookup and using their existing indices to quickly answer the query), i'd like to honor this "frequent special case" with its own little syntax add-on, maybe also making it a lot easier to identify and optimize for.
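As a rough illustration of the VALUES workaround mentioned above, the enumeration can be generated on the client side. This is only a sketch: the helper names `sparql_string` and `values_clause` are made up for this example, and the list of language tags is an assumption the caller has to supply, which is exactly why the workaround never covers unknown tags or datatypes.

```python
def sparql_string(lexical: str) -> str:
    """Quote a lexical form as a SPARQL short string literal (hypothetical helper)."""
    escaped = (
        lexical.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def values_clause(var: str, lexical: str, langs: list[str]) -> str:
    """Build a VALUES clause listing the string once per language tag, plus untagged."""
    quoted = sparql_string(lexical)
    terms = [f"{quoted}@{lang}" for lang in langs] + [quoted]
    return f"VALUES ?{var} {{ {' '.join(terms)} }}"


clause = values_clause("l", "XYZ", ["en", "de", "fr"])
print(clause)  # VALUES ?l { "XYZ"@en "XYZ"@de "XYZ"@fr "XYZ" }
```

The clause only matches the tags that were listed, so it is an approximation of the FILTER(STR(?l) = "XYZ") query, not an equivalent.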

For a syntax extension some things come to mind, such as "XYZ"@*, "XYZ"^^* and/or "XYZ"*, but not decided at all.

Previous work

Could be related to #13 and #17, but they seem to focus on other aspects.

Considerations for backward compatibility

How do current endpoints deal with "XYZ"@* ?

cygri · 2019-04-03T16:38:16Z

If there was standard full text search that allowed matching any language, would that satisfy the requirement or would you still be asking for this addition?

#17 is about using the "XYZ"@en form with a wildcard instead of XYZ. The proposal here is about using the form with a wildcard instead of @en. So the two issues relate and should be considered together.

I hope "..."@* wouldn't match IRIs?

VladimirAlexiev · 2019-04-03T18:43:24Z

I think that FTS would fill the bill. The query would be written like eg

?x luc:label """  "foo bar" """

However, not all repos have FTS in all situations, and you need to configure an index, and standardizing FTS is a lot harder.

cygri · 2019-04-03T21:28:21Z

But FTS would be a better solution to the stated problem. If the query author just wants a quick match and is not interested in the language tag, they are probably also not interested in details like capitalisation. The proposal here will only match if capitalisation in the query matches capitalisation in the data. That makes me think the proposal is actually not a very good solution to the stated problem. If FTS is available, that’s a better solution. If it is not available, maybe the SPARQL 1.1 way with FILTER(str(?x)="...") is good enough? Do we really want to add a semi-solution?

joernhees · 2019-04-04T11:36:12Z

hmm, full text search would maybe be a solution for some, but in my cases i actually don't want to do any kind of fuzzy lookup, just a simple perfect string match. It's very common (at least from my experience) that you have some kind of identifier (i'll mention names and uuids here as example) from some system and now want to look it up in your RDF store's literals. Especially when doing this on many loaded datasets you don't know how people modeled these identifying strings in RDF literals... maybe they used a language tag or not, maybe they used a specific datatype or not. I simply want to be able to say "i don't care about those RDF specifics".

You're right, my example above should be extended to the following to not match IRIs:

?s ?p ?l .
FILTER(isLiteral(?l) && STR(?l) = "XYZ") .

The further i extend this, the more obvious my argument becomes however: it's ever harder to correctly identify this and optimize the simple lookup we're talking about here, that nearly every backend will actually have an index for and could do in O(log(n)) time rather than a full table scan that this ends up with very frequently.
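To make the index argument concrete, here is a toy model of the difference between the FILTER-style scan and a lookup keyed on the lexical form alone. It is a sketch under assumptions: literals are modelled as plain tuples, all objects are literals, and the dictionary stands in for whatever index a real store keeps (a hash map gives average constant-time lookup, a tree index the O(log(n)) mentioned above). It does not describe how any particular engine is implemented.

```python
from collections import defaultdict

# Each literal is modelled as (lexical form, language tag or None, datatype IRI or None).
triples = [
    ("ex:a", "rdfs:label", ("XYZ", "en", None)),
    ("ex:b", "rdfs:label", ("XYZ", None, None)),
    ("ex:c", "ex:code", ("XYZ", None, "ex:myDT")),
    ("ex:d", "rdfs:label", ("xyz", "de", None)),
]


def scan(triples, lexical):
    """FILTER-style evaluation: visit every triple and compare STR(?l)."""
    return [t for t in triples if t[2][0] == lexical]


def build_index(triples):
    """Index triples by the lexical form only, ignoring language tag and datatype."""
    index = defaultdict(list)
    for t in triples:
        index[t[2][0]].append(t)
    return index


index = build_index(triples)
hits = index.get("XYZ", [])  # one lookup instead of a pass over all triples
print([subject for subject, _, _ in hits])
```

Both paths return the same three triples here; the point is only that the second one never touches the non-matching data.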

Talking about "XYZ"@*, "XYZ"^^* and/or "XYZ"* i'm however not sure what developers would intuitively expect them to match: Would you expect "XYZ"@* to match plain literals? Would you expect "XYZ"@* to match "XYZ"^^<myDT>? What if i tell you that <myDT> is a subclass of rdf:langString? It becomes even more obscure when talking about "XYZ"^^*... would you expect this to match plain literals? Would you expect it to match "XYZ"@en (actually one could argue that this string has the datatype rdf:langString)?

cygri · 2019-04-04T13:02:31Z

Would you expect "XYZ"@* to match plain literals? Would you expect "XYZ"@* to match "XYZ"^^<myDT>?

The way to teach this is to say: “lang("XYZ") and lang("XYZ"^^<myDT>) return an empty string, and * matches an empty string, so "XYZ"@* matches both.” It's a bit of a fudge but not too bad.
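The rule described above can be written down as a small model. This is a sketch of the proposed reading only, since "XYZ"@* is not existing SPARQL syntax; the `Literal` class and the functions `lang`, `lang_pattern_matches` and `matches` are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    lexical: str
    language: Optional[str] = None   # e.g. "en"; None for plain and typed literals
    datatype: Optional[str] = None   # e.g. "ex:myDT"; None if not modelled


def lang(lit: Literal) -> str:
    """Mirror SPARQL lang(): the tag, or "" when the literal has none."""
    return lit.language or ""


def lang_pattern_matches(pattern: str, tag: str) -> bool:
    """Assumed semantics: "*" matches any tag, including the empty string."""
    if pattern == "*":
        return True
    return pattern.lower() == tag.lower()


def matches(lit: Literal, lexical: str, pattern: str) -> bool:
    """Model of "lexical"@pattern: the lexical form must be equal, the tag must fit."""
    return lit.lexical == lexical and lang_pattern_matches(pattern, lang(lit))


print(matches(Literal("XYZ"), "XYZ", "*"))                      # plain literal
print(matches(Literal("XYZ", datatype="ex:myDT"), "XYZ", "*"))  # typed literal
print(matches(Literal("XYZ", "en"), "XYZ", "*"))                # language-tagged
```

Under this reading all three calls succeed, while a concrete pattern such as "en" still rejects literals without a tag.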

What if i tell you that <myDT> is a subclass of rdf:langString?

It's not possible to subclass rdf:langString.

It becomes even more obscure when talking about "XYZ"^^*

I am not advocating for that syntax. I don't think it's a good idea.

Another option would be ~"XYZ" to match any literal with lexical form XYZ.

joernhees · 2019-04-09T13:49:18Z

It's not possible to subclass rdf:langString.

hmm, couldn't easily find why not, any pointer?

Apart from that i'd also favor "XYZ"@* over the other forms.

dbooth-boston · 2019-04-09T18:47:59Z

Small philosophical comment: IMHO the root of the difficulty here is that literals can have two attributes -- type and language -- but they are disjoint and not modeled as explicit triples. That was a dirty design choice that is biting us now. I hope that any new solution at least approximates a cleaner solution.

cygri · 2019-04-10T10:09:24Z

@joernhees What I meant is that literals of <myDT> cannot have language tags (cf. RDF Concepts) and lang("XYZ"^^<myDT>)="" (cf. definition of lang). So I don't see how asserting that <myDT> is a subclass of rdf:langString could ever be a useful thing to do.

joernhees · 2019-04-10T12:18:35Z

oh, i thought along the "what, i cannot make the statement <myDT> rdfs:subClassOf rdf:langString . ?" and not the RDF concepts route ;) ("foo"@en^^<myDT> would've definitely been even more fun ^^)

arthurpsmith · 2019-04-15T21:03:28Z

I have run into this so many times, this (language-independent string matching) would be my highest priority for 1.2. That said, there are a ton of similar issues in SPARQL execution caused by FILTER having second-class status (that is, only being applied AFTER a list of matching triples is created, rather than being applied proactively to generate a collection of triples). Case-insensitive matching is another just mentioned here, also substring matching (STRSTARTS, STRENDS, CONTAINS) etc. Hoping to match regexes might be a step too far, but if we could move these basic string FILTER actions into the main triple-matching syntax that could be a huge win (similar issues for date and numeric comparison operations too I think) - and I don't think requiring "full text search" is the only way to do that.

afs · 2019-04-16T09:52:44Z

@arthurpsmith -- good to hear about the issues you've encountered.

caused by FILTER having second-class status (that is, only being applied AFTER a list of matching triples is created, rather than being applied proactively to generate a collection of triples)

This would seem to be more about a lack of indexing/optimization. The SPARQL spec can define what the correct results are, but how that is achieved is up to the implementation. Any implementation has to balance a number of factors such as time to load vs amount of indexing done.

Case insensitive versions of STRSTARTS, STRENDS, CONTAINS and string equality sound like a good idea.

arthurpsmith · 2019-04-16T13:19:54Z

@afs - yes, it could be done now with smart optimization and use of indexes - are you aware of any existing sparql servers that do this? I'm mostly familiar with blazegraph and haven't seen any sign of this being possible. I think making some of the simpler string matching directly part of the 1.2 spec would make it a more obvious target for index-based optimization, rather than having to parse FILTER syntax.

arthurpsmith · 2019-04-16T20:30:59Z

Not sure if people have seen this over in the EasierRDF ideas list:
w3c/EasierRDF#22

VladimirAlexiev · 2019-04-17T11:58:13Z

It's not about FILTER parsing, it's the fact that repos often need to scan all literals.
You could use some prefix index (trie) for strstarts(), but would need another for case-insensitive; and things get much harder for regex().
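A small sketch of why the two cases need separate indexes. A sorted list with binary search is used here in place of a trie, which is a simplification chosen for brevity; the sample labels and the function names `build_prefix_index` and `prefix_lookup` are made up for the example.

```python
from bisect import bisect_left

labels = ["Sofia", "sofia", "Sofie", "Sochi", "Split", "SOFIA Observatory"]


def build_prefix_index(values, key=lambda s: s):
    """Sort (key, original) pairs so that all keys sharing a prefix are adjacent."""
    return sorted((key(v), v) for v in values)


def prefix_lookup(index, prefix):
    """Binary-search to the first key >= prefix, then walk while the prefix holds."""
    start = bisect_left(index, (prefix,))
    hits = []
    for k, original in index[start:]:
        if not k.startswith(prefix):
            break
        hits.append(original)
    return hits


# Index for case-sensitive strstarts().
exact_index = build_prefix_index(labels)
exact_hits = prefix_lookup(exact_index, "Sofi")

# A second index, keyed on the case-folded string, for case-insensitive matching.
folded_index = build_prefix_index(labels, key=str.casefold)
folded_hits = prefix_lookup(folded_index, "Sofi".casefold())

print(exact_hits)
print(folded_hits)
```

The case-sensitive index cannot answer the case-insensitive question because "SOFIA Observatory" and "sofia" sort far away from "Sofia" in it, so the second index has to be built and maintained separately. Nothing comparable works for arbitrary regex() patterns.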

I tried this query:

select * {
  ?x rdfs:label ?y
   filter(strstarts(?y,"Sofi"))
}

https://query.wikidata.org/: query timeout after about 100 results; the results are in raw format together with the exception
http://dbpedia.org/sparql; no timeout, 186 results
this functionally equivalent query returns 190 on DBpedia:

select * {
  ?x rdfs:label ?y
   filter(regex(?y,"^Sofi"))
}

this count returns 325 ?!?

select count(*) {
  ?x rdfs:label ?y
   filter(strstarts(?y,"Sofi"))
}

this count returns nothing ?!?!?!?!

select count(*) {
  ?x rdfs:label ?y
   filter(regex(?y,"^Sofi"))
}

joernhees · 2019-04-17T12:17:00Z

Hmm, i think this is a bit more complicated...

Yes, FILTER parsing with a good query optimizer should be enough, but practical implementations show us that it actually often doesn't seem to be implemented. In other words: a simple "language agnostic lookup" of a string in practice often goes unoptimized, even though all necessary indices are there already. That's why i'd like to make this common use-case a lot easier to write and identify.

For the different counts see #51, a thing you find "not interesting to standardize because SPARQL servers don't work like this" ;)

afs · 2019-04-17T12:31:05Z

STRSTARTS works on rdf:langString, REGEX needs a str().

https://www.w3.org/TR/sparql11-query/#func-arg-compatibility

arthurpsmith · 2019-04-17T14:40:25Z

So I can think of two ways of pushing this sort of thing out of FILTERs and into triple matching syntax:

Similar to what was suggested above, allow wild-cards and some simple regex indicators in literals - "XYZ"@* for any language, "XY"&* for strstarts, &*"YZ" for strends, "XYZ"&i for case independent perhaps? I think the wildcards would need to be outside the string quotes to avoid breaking backwards compatibility regarding searches containing those literal characters.
Allow literals to be subjects of triples in SPARQL queries, where the predicates come from some registered function list - this would look something like:
?s ?p ?literal . ?literal <&STRSTARTS> "XYZ" .

In both cases the query optimizer would have to recognize the situation and use an appropriate index, or fall back to the old FILTER approach. I think the second approach may be more flexible as it introduces potentially a mechanism for easily adding new match functions, allowing matches ('<', '!=', etc.) for other kinds of literals, etc.
(edited - '?' obviously doesn't work as wildcard in SPARQL!)

TallTed · 2019-04-29T21:18:53Z

@VladimirAlexiev - I'm not sure what you ran into with your testing against DBpedia; probably it was at least partly Virtuoso's Anytime Query functionality, that returns partial results upon timeout. I just tested your 4 queries, with a timeout of 3000 seconds (vs the endpoint's default of 30 seconds), and all returned 1465. The links below are all live, and should deliver the same results to you.

VladimirAlexiev · 2019-04-30T10:00:40Z

@TallTed which sort of proves the point, that repos need special indexing to handle this in a fast way.

@joernhees I agree with this proposal because it gives repos incentive to index str(), as opposed to random regex patterns which is a lot harder.

Re #51: I agree with you that timing out an expensive count is better than returning a wrong count. But here we're discussing ways to make querying faster.

JervenBolleman added the query Extends the Query spec label Apr 16, 2019

cygri mentioned this issue May 8, 2019

String matching using wildcards #85

Open

joernhees mentioned this issue Jul 1, 2019

direction agnostic querying w3c/rdf-dir-literal#19

Closed

afs mentioned this issue Aug 29, 2019

Evolution approach to compound literals. w3c/rdf-dir-literal#22

Open

afs mentioned this issue Apr 12, 2023

Idea: Datatype triple patterns w3c/sparql-query#56

Closed

redmer mentioned this issue Apr 17, 2023

Datatype triple patterns #182

Open
