easier langtag and datatype agnostic matching of literals #34
If there were a standard full-text search that allowed matching any language, would that satisfy the requirement, or would you still be asking for this addition? #17 is about using the
I think that FTS would fill the bill. The query would be written like e.g. `?x luc:label """ "foo bar" """`. However, not all repos have FTS in all situations, you need to configure an index, and standardizing FTS is a lot harder.
But FTS would be a better solution to the stated problem. If the query author just wants a quick match and is not interested in the language tag, they are probably also not interested in details like capitalisation. The proposal here will only match if the capitalisation in the query matches the capitalisation in the data. That makes me think the proposal is actually not a very good solution to the stated problem. If FTS is available, that's a better solution. If it is not available, maybe the SPARQL 1.1 way with
Hmm, full-text search would maybe be a solution for some, but in my cases I actually don't want any kind of fuzzy lookup, just a simple perfect string match. It's very common (at least in my experience) that you have some kind of identifier (names and UUIDs, for example) from some system and now want to look it up in your RDF store's literals. Especially when doing this across many loaded datasets, you don't know how people modeled these identifying strings as RDF literals: maybe they used a language tag or not, maybe they used a specific datatype or not. I simply want to be able to say "I don't care about those RDF specifics". You're right, my example above should be extended to the following so as not to match IRIs:
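A sketch of what such an extended SPARQL 1.1 workaround might look like (illustrative only; `?s`/`?p` and the value `"XYZ"` are placeholders, and many engines will still answer this with a scan rather than an index lookup):

```sparql
# Match the lexical form "XYZ" regardless of language tag or datatype,
# while excluding IRIs and blank nodes (sketch of the SPARQL 1.1 workaround):
SELECT ?s ?p WHERE {
  ?s ?p ?l .
  FILTER(isLiteral(?l) && STR(?l) = "XYZ")
}
```

`STR()` returns the lexical form of any literal, so `"XYZ"`, `"XYZ"@en`, and `"XYZ"^^xsd:string` all pass the filter, while `isLiteral()` keeps IRIs with that local name out of the results.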
The further I extend this, the more obvious my argument becomes, however: it gets ever harder for an engine to correctly identify this pattern and optimize the simple lookup we're talking about here, which nearly every backend actually has an index for and could answer in O(log n) time, rather than the full table scan that this very frequently ends up as. Talking about
The way to teach this is to say: “
It's not possible to subclass
I am not advocating for that syntax. I don't think it's a good idea. Another option would be
Hmm, I couldn't easily find why not; any pointer? Apart from that, I'd also favor
Small philosophical comment: IMHO the root of the difficulty here is that literals can have two attributes -- type and language -- but they are disjoint and not modeled as explicit triples. That was a dirty design choice that is biting us now. I hope that any new solution at least approximates a cleaner solution.
@joernhees What I meant is that literals of
Oh, I thought along the lines of "what I cannot make the statement
I have run into this so many times; this (language-independent string matching) would be my highest priority for 1.2. That said, there are a ton of similar issues in SPARQL execution caused by FILTER having second-class status (that is, only being applied AFTER a list of matching triples is created, rather than being applied proactively to generate a collection of triples). Case-insensitive matching is another, just mentioned here; also substring matching (STRSTARTS, STRENDS, CONTAINS), etc. Hoping to match regexes might be a step too far, but if we could move these basic string FILTER actions into the main triple-matching syntax, that could be a huge win (similar issues exist for date and numeric comparison operations too, I think), and I don't think requiring "full text search" is the only way to do that.
@arthurpsmith -- good to hear about the issues you've encountered.
This would seem to be more about a lack of indexing/optimization. The SPARQL spec can define what the correct results are, but how that is achieved is up to the implementation. Any implementation has to balance a number of factors, such as time to load vs. amount of indexing done. Case-insensitive versions of STRSTARTS, STRENDS, CONTAINS, and string equality sound like a good idea.
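For concreteness, the case-insensitive workaround available today (a sketch, not from this thread; `rdfs:label` and the prefix `"sofi"` are illustrative) wraps the value in `LCASE`, which is exactly the kind of computed expression a plain literal index cannot serve:

```sparql
# Case-insensitive prefix match via LCASE (valid SPARQL 1.1).
# An engine would need a case-folded index to avoid scanning
# every rdfs:label value here.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x WHERE {
  ?x rdfs:label ?y .
  FILTER(STRSTARTS(LCASE(STR(?y)), "sofi"))
}
```

A dedicated case-insensitive operator would at least make the intent explicit to the optimizer instead of hiding it inside a function composition.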
@afs - yes, it could be done now with smart optimization and use of indexes. Are you aware of any existing SPARQL servers that do this? I'm mostly familiar with Blazegraph and haven't seen any sign of this being possible. I think making some of the simpler string matching directly part of the 1.2 spec would make it a more obvious target for index-based optimization, rather than having to parse FILTER syntax.
Not sure if people have seen this over in the EasierRDF ideas list: |
It's not about FILTER parsing, it's the fact that repos often need to scan all literals. I tried these queries:

```sparql
select * {
  ?x rdfs:label ?y
  filter(strstarts(?y, "Sofi"))
}
```

```sparql
select (count(*) as ?c) {
  ?x rdfs:label ?y
  filter(strstarts(?y, "Sofi"))
}
```

```sparql
select (count(*) as ?c) {
  ?x rdfs:label ?y
  filter(regex(?y, "^Sofi"))
}
```
Hmm, I think this is a bit more complicated... Yes. For the different counts see #51, a thing you find "not interesting to standardize because SPARQL servers don't work like this" ;)
https://www.w3.org/TR/sparql11-query/#func-arg-compatibility |
So I can think of two ways of pushing this sort of thing out of FILTERs and into triple matching syntax:
In both cases the query optimizer would have to recognize the situation and use an appropriate index, or fall back to the old FILTER approach. I think the second approach may be more flexible, as it potentially introduces a mechanism for easily adding new match functions, allowing matches ('<', '!=', etc.) for other kinds of literals, etc.
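Purely as an illustration of that second direction (hypothetical syntax invented here, not anything actually proposed in this thread), a named match function in the object position might look like:

```sparql
# HYPOTHETICAL syntax sketch -- not valid SPARQL 1.1.
# The object position names a match function instead of a concrete
# RDF term, which an engine could map directly onto a prefix index.
SELECT ?x WHERE {
  ?x rdfs:label MATCH(strstarts, "Sofi") .
}
```

The appeal would be that a small, closed set of such functions (prefix, suffix, containment, ordered comparisons) is far easier to back with indexes than arbitrary FILTER expressions or regexes.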
@VladimirAlexiev - I'm not sure what you ran into with your testing against DBpedia; probably it was at least partly Virtuoso's Anytime Query functionality, which returns partial results upon timeout. I just tested your 4 queries with a timeout of 3000 seconds (vs. the endpoint's default of 30 seconds), and all returned 1465. The links below are all live and should deliver the same results for you.
@TallTed which sort of proves the point: repos need special indexing to handle this in a fast way. @joernhees I agree with this proposal because it gives repos an incentive to index str(), as opposed to random regex patterns, which is a lot harder. Re #51: I agree with you that timing out an expensive count is better than returning a wrong count. But here we're discussing ways to make querying faster.
Why?

For exploration or candidate generation I'm often doing things like the following in order to get results independent of the language tag or datatype of the literal:

```sparql
?s ?p ?l . FILTER(str(?l) = "XYZ")
```

The problem with this is that many SPARQL endpoints don't seem to optimize for such queries (i.e.,

```sparql
VALUES ?l { "XYZ"@en "XYZ"@de "XYZ"@fr ...10more... "XYZ" } ?s ?p ?l .
```

tends to be a lot faster than the above FILTER clause!). While this could be seen as a common query-plan optimization problem of SPARQL engines (failing to identify the perfect string lookup and to use their existing indices to quickly answer the query), I'd like to honor this "frequent special case" with its own little syntax add-on, maybe also making it a lot easier to identify and optimize for.

For a syntax extension some things come to mind, such as `"XYZ"@*`, `"XYZ"^^*` and/or `"XYZ"*`, but nothing is decided at all.

Previous work

Could be related to #13 and #17, but they seem to focus on other aspects.

Considerations for backward compatibility

How do current endpoints deal with `"XYZ"@*`?
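For reference on the backward-compatibility question: the SPARQL 1.1 grammar's LANGTAG production only admits alphanumeric subtags after the `@`:

```
LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
```

so a conforming parser today should reject `"XYZ"@*` (and likewise `"XYZ"^^*`, since the datatype position requires an IRI) as a syntax error. That suggests the proposed forms occupy currently unused syntax space and would not silently change the meaning of any existing valid query, though individual endpoints may of course differ in how strictly they parse.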