Problem querying large hdt dataset in fuseki #61
Comments
Hi, it is also worth noting that if you generate your HDT files with hdt-java, you won't have this problem. Best,
I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.
Thanks for your hints, @mn120110d!
Oh, that's a useful workaround. It does increase query execution time, though, since the endpoint first has to extract all triples and then remove most of them through the filter instead of directly looking in the index for relevant ones.
Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to HDT? To me there seem to be four different ways that could happen:
Given that the C++ implementation seems to be better maintained, my guess would be the first option.
Interesting thought. Can you expand a bit on that?
In RDF 1.0, a plain literal and the same literal typed as xsd:string were two distinct terms. Now let's say the HDT file encodes such literals without a datatype, but the hdt-jena layer expects them to have the datatype xsd:string.
OK, I see. When I inspected the hdt file with hdt-it!, I found the string |
You're welcome, @larsgsvensson. I agree that the solution with the filter increases execution time, but I haven't found a better way to do it. Regarding your question:
Here is an issue that mentions both versions and their differences: I believe the answer to your question would be this part of the conversation:
So my guess would be that no matter what tool you use, the generated HDT should be correct, but for now you'll have to adjust your queries to support different versions. I hope that someone more competent can give a better and more detailed explanation. :)
I'm curious about two questions here:
Trying to answer these might lead to more clues about where the problem is. I think that the hdt-java version was originally created using a Jena version earlier than 3.0, which changed the semantics of plain literals to follow RDF 1.1. In fact it must have been, since the project started around 2012 and Jena 3.0 was released in July 2015. So maybe when the dependency was eventually updated to Jena 3.0+, some code paths were left that expected the old RDF 1.0-style plain literals. Or maybe the hdt-cpp tools assume RDF 1.0-style literals while hdt-java is all RDF 1.1.
I'm not sure this is related to file size or to which library generates the file. It seems like an incompatibility in the fuseki implementation. I was able to recreate the issue with the bulk instrument file from permid.org (14 MB gzipped). Using HDT-it! with the raw triples (which include typed string literals) and default settings, I get an HDT file for which fuseki returns 0 rows for the queries:
and
If I remove the types from the file using sed
the following query works in fuseki, returning one row
Importantly, I get the same results whether using HDT-it! (which relies on the C++ library) or the Java CLI. I'm using the current build of HDT-it! from the website on a Mac and the current trunk from GitHub for the CLI.
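The sed invocation was stripped from the comment above. A plausible sketch of the kind of rewrite described, assuming N-Triples input (file names are placeholders, not from the thread):

```shell
# Remove explicit xsd:string datatype annotations so the literals become
# plain literals (the exact command from the comment was not preserved).
sed 's|\^\^<http://www.w3.org/2001/XMLSchema#string>||g' typed.nt > plain.nt
```

Using `|` as the `s` delimiter avoids having to escape the slashes in the datatype IRI.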
After some time I managed to get a closer look at this (or rather at the Server.java implementation that uses part of hdt-jena). Having set up a TPF server using Server.java, I tried to query an HDT file where all literals carry an explicit datatype.
I think this doesn't conform to RDF 1.1 §3.3:
The implementation in
@larsgsvensson It is indeed possible to store simple literals in HDT. This means that the same RDF 1.1 term can be represented by two distinct terms in an HDT.
@wouterbeek Yes, sure it is. The point I'm aiming at is that if I convert an RDF triple containing such a literal, and there are two ways to store that literal, then I must be able to query it exactly as I stored it, or the search will return no results. As I see it, the only way to accomplish this is to mandate a single canonical representation for such literals in HDT files.
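To make that concrete, here is a sketch (the predicate and literal are placeholders, not from the thread) of the two triple patterns that, under RDF 1.1 semantics, denote the same term and should therefore match the same stored data:

```sparql
# Under RDF 1.1 these two patterns denote the same RDF term, so both
# should match, no matter which surface form the HDT file stored:
?s <http://example.org/name> "example" .
?s <http://example.org/name> "example"^^<http://www.w3.org/2001/XMLSchema#string> .
```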
@larsgsvensson I agree with you that the best solution would be to always store such literals in a single canonical form. Unfortunately, existing files are not guaranteed to do so. Instead, what is needed is HDT-specific code that transforms between the two representations. (Notice that this would not invalidate existing HDTs that use either form.)
@larsgsvensson I've created an issue for this in the proper place: rdfhdt/hdt-cpp#173. You can close this current issue if there are no other open questions.
Thanks @wouterbeek. In rdfhdt/hdt-cpp#173 I suggested doing it the other way round, based on the convention most implementations appear to use.
Just for documentation purposes, this is my current workaround:
I.e. I replace the direct literal match with a filter on the object's lexical form.
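The workaround query itself was lost in the copy. Based on the filter approach described earlier in the thread, it presumably has this shape (the predicate is a placeholder; the literal is the one from the issue report):

```sparql
# Instead of matching the literal directly in the triple pattern
# (which fails when the stored form differs), bind the object and
# compare lexical forms; str() matches both the plain and the
# xsd:string-typed representation:
SELECT ?s WHERE {
  ?s ?p ?o .
  FILTER (str(?o) = "A. F. W. Sommer")
}
```

The trade-off, as noted above, is that the endpoint first extracts all candidate triples and only then discards the non-matching ones.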
This might or might not be the right project for this issue...
I'm trying to query a large dataset (5GB .hdt file, 266 M-Triples) and have a problem searching for untyped literals. SPARQL queries with typed literals or URIs in the object position run fine. Also, when I create a small dataset (13 triples), SPARQL queries for typed literals run fine, so I assume that it's an issue with the hdt file size. The hdt files were created using hdt-cpp.
I have integrated HDT support into Fuseki as described and the service as a whole works fine.
The problem looks like this: I first run a DESCRIBE query in order to get some triples. That query returns:
When I then try to query for a literal like this:
I get zero results.
Adding the datatype `xsd:string` to the literal doesn't help either:
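The queries themselves were stripped from this page; judging from the description, they would look roughly like this (the variable names are mine, and the literal is the one later inspected with hdt-it!):

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Plain-literal form: returns zero results against the large HDT file.
SELECT * WHERE { ?s ?p "A. F. W. Sommer" }

# Typed form, which reportedly also returns zero results:
# SELECT * WHERE { ?s ?p "A. F. W. Sommer"^^xsd:string }
```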
If I inspect the hdt file using hdt-it!, a search for "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> returns 231 hits, so the data is obviously present.
As a verification, I created a dataset consisting only of this one entity and configured fuseki to run a separate service with that dataset (15 triples) in a single named graph. With that configuration, SPARQL queries for untyped literals work so I guess that it's a problem with the hdt file size.
Any insights are much appreciated.
Thanks,
Lars