Problem querying large hdt dataset in fuseki #61

Closed
larsgsvensson opened this issue Nov 2, 2017 · 16 comments

@larsgsvensson

This might or might not be the right project for this issue...

I'm trying to query a large dataset (5 GB .hdt file, 266M triples) and have a problem searching for untyped literals. SPARQL queries with typed literals or URIs in the object position run fine. Also, when I create a small dataset (13 triples), SPARQL queries for untyped literals run fine, so I assume that it's an issue with the hdt file size. The hdt files were created using hdt-cpp.

I have integrated HDT support into Fuseki as described and the service as a whole works fine.

The problem looks like this: I first run a DESCRIBE in order to get some triples:

PREFIX gndo:  <http://d-nb.info/standards/elementset/gnd#>
PREFIX rdau:  <http://rdaregistry.info/Elements/u/> 
PREFIX dct:   <http://purl.org/dc/terms/> 
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX dcterm: <http://purl.org/dc/terms/> 
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#> 
PREFIX skos:  <http://www.w3.org/2004/02/skos/core#> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX gnd:   <http://d-nb.info/gnd/> 
PREFIX dc:    <http://purl.org/dc/elements/1.1/> 
PREFIX dnbt: <http://d-nb.info/standards/elementset/dnb#>

DESCRIBE <http://d-nb.info/1000000354>
FROM <http://d-nb.info/dnb-all>

That query returns

@prefix gndo:  <http://d-nb.info/standards/elementset/gnd#> .
@prefix dnbt:  <http://d-nb.info/standards/elementset/dnb#> .
@prefix rdau:  <http://rdaregistry.info/Elements/u/> .
@prefix dct:   <http://purl.org/dc/terms/> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterm: <http://purl.org/dc/terms/> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
@prefix gnd:   <http://d-nb.info/gnd/> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

<http://d-nb.info/1000000354>
        a                <http://purl.org/ontology/bibo/Collection> ;
        dc:identifier    "(OCoLC)723788590" , "(DE-101)1000000354" ;
        dc:publisher     "A. F. W. Sommer" ;
        dc:subject       "830"^^dnbt:ddc-subject-category , "B"^^dnbt:ddc-subject-category ;
        dc:title         "Neuere Gedichte" ;
        dcterms:creator  gnd:118569317 ;
        dcterms:medium   <http://rdaregistry.info/termList/RDACarrierType/1044> ;
        rdau:P60163      "Wien" ;
        rdau:P60327      "August Friedrich Ernst Langbein" ;
        rdau:P60333      "Wien : A. F. W. Sommer" ;
        rdau:P60493      "1814" ;
        rdau:P60539      "30 cm" ;
        <http://www.w3.org/2002/07/owl#sameAs>
                <http://hub.culturegraph.org/resource/DNB-1000000354> .

When I then try to query for a literal like this:

SELECT ?entity
FROM <http://d-nb.info/dnb-all>
WHERE {
  ?entity ?p "A. F. W. Sommer"
}

I get zero results.
Adding the datatype `xsd:string` to the literal:

SELECT ?entity
FROM <http://d-nb.info/dnb-all>
WHERE {
  ?entity ?p "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string>
}

does not help either.

If I inspect the hdt file using hdt-it!, a search for "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> returns 231 hits, so the data is obviously present.

As a verification, I created a dataset consisting only of this one entity and configured fuseki to run a separate service with that dataset (15 triples) in a single named graph. With that configuration, SPARQL queries for untyped literals work so I guess that it's a problem with the hdt file size.

Any insights are much appreciated.

Thanks,

Lars

@mn120110d

mn120110d commented Nov 2, 2017

Hi,
I encountered the same problem when generating hdt files with hdt-cpp. Only a query that contains an explicit cast and a filter will give you results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .

Also, it is worth noticing that if you generate your hdt files with hdt-java, you won’t have this problem.

Best,
Nevena

@osma

osma commented Nov 2, 2017

I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.

@larsgsvensson

Thanks for your hints, @mn120110d !

Only a query that contains an explicit cast and a filter will give you results; in your case that would be:
?entity ?p ?o .
filter (?o = xsd:string("A. F. W. Sommer")) .

Oh, that's a useful workaround. It does increase query execution time, though, since the endpoint first has to extract all triples and then remove most of them through the filter instead of directly looking in the index for relevant ones.

Also, it is worth noticing that if you generate your hdt files with hdt-java, you won’t have this problem.

Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to hdt? To me there seem to be four different ways that could happen:

  1. The spec is unambiguous; the C++ implementation gets it right and the Java implementation gets it wrong.
  2. The spec is unambiguous; the Java implementation gets it right and the C++ implementation gets it wrong.
  3. The spec is unambiguous and neither the Java nor the C++ implementation gets it right.
  4. The spec is ambiguous and both the Java and the C++ implementations get it right while still producing different results.

Given that the C++ implementation seems to be better maintained, my guess would be the first option.

@larsgsvensson

@osma

I wonder whether this has to do with the change in semantics of plain literals introduced in RDF 1.1? Just a thought.

Interesting thought. Can you expand a bit on that?

@osma

osma commented Nov 3, 2017

In RDF 1.0, the plain literal "foo" is different from "foo"^^xsd:string. In RDF 1.1 they are the same since all plain literals (i.e. no language tag and no datatype) have an implicit datatype of xsd:string.

Now let's say the HDT file encodes such literals without a data type, but the hdt-jena layer expects them to have the data type xsd:string (or vice versa). They would have a different encoding and thus wouldn't match.
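To make that concrete (a plain-Java illustration of string encodings, not actual hdt-jena code): the two serializations denote the same RDF 1.1 value, but they are different byte strings, so a byte-for-byte dictionary lookup can never match one against the other.

```java
// Illustration only: two serializations of the same RDF 1.1 literal value.
// A dictionary that compares encoded strings treats them as distinct terms.
public class LiteralEncodingMismatch {
    // RDF 1.0 style: plain literal, no datatype suffix
    static final String PLAIN = "\"A. F. W. Sommer\"";
    // RDF 1.1 style: explicit xsd:string datatype
    static final String TYPED =
        "\"A. F. W. Sommer\"^^<http://www.w3.org/2001/XMLSchema#string>";

    public static void main(String[] args) {
        // Semantically equal in RDF 1.1, but not equal as dictionary keys
        System.out.println(PLAIN.equals(TYPED)); // false
    }
}
```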

@larsgsvensson

OK, I see.

When I inspected the hdt file with hdt-it!, I found the string "A. F. W. Sommer"^^<http://www.w3.org/2001/XMLSchema#string> (with datatype), but adding the datatype in the SPARQL query didn't help. So the only possibility would be that the hdt-jena implementation strips the datatype from the literal in the SPARQL query but that further down the line it's expected.

@mn120110d

mn120110d commented Nov 6, 2017

You're welcome, @larsgsvensson .

I agree that the solution with filter increases execution time, but I haven't found a better way to do it.

Regarding your question:

Does that mean that the Java and the C++ implementations produce different results when I convert the same source file to hdt?

Here is an issue that mentions both versions and their differences:
#58

I believe the answer to your question would be this part of the conversation:

regardless if a .hdt was created with C++ or Java, shouldn't it be the exact same format?

in theory yes, in practice not that easy without dedicated development teams :/

So my guess would be that no matter what tool you use, the generated hdt should be correct, but for now you'll have to adjust your queries to support different versions. I hope someone more competent can give a better and more detailed explanation. :)

@osma

osma commented Nov 6, 2017

I'm curious about two questions here:

  • Is this really dependent on the input file size? @larsgsvensson mentioned that the problem was with a large input file, but not with a very small subset. If that is true, where is the limit? There's a rather large gap between 266M triples and 13 triples...
  • If the file generated with hdt-java works but the hdt-cpp version doesn't, what's the difference? Is the datatype of plain literals encoded differently in hdt files generated by the two tools?

Trying to answer these might lead to more clues about where the problem is.

I think that the hdt-java version was originally created using a Jena version earlier than 3.0, which changed the semantics of plain literals to follow RDF 1.1. In fact it must have been, since the project started around 2012 and Jena 3.0 was released in July 2015. So maybe when the dependency was eventually updated to Jena 3.0+, some code paths were left that expected the old RDF 1.0 style plain literals. Or maybe the hdt-cpp tools assume RDF 1.0 style literals while hdt-java is all RDF 1.1.

@TRnonodename

I'm not sure this is related to file size or which library generates the file. It seems like an incompatibility in the fuseki implementation.

I was able to recreate the issue with the bulk instrument file from permid.org (14 MB gzipped).

Using HDT-it! with the raw triples (which include typed string literals) and default settings, I get an HDT file for which fuseki returns 0 rows for the queries:

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }

and

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR"^^<http://www.w3.org/2001/XMLSchema#string> . }

If I remove the types from the file using sed

gzcat OpenPermID-bulk-instrument-20171106_072520.ntriples.gz | sed 's/\^\^<http:\/\/www.w3.org\/2001\/XMLSchema#string>//' > stripped.nt

the following query works in fuseki, returning one row

select ?instr where { ?instr <http://permid.org/ontology/common/hasName> "Jardine Strategic Holdings IDR" . }

Importantly, I get the same results whether using HDT-it (which relies on the CPP library) or the java-cli. I'm using the current build of HDT-it from the website on a mac and the current trunk from github for the CLI.

@larsgsvensson

larsgsvensson commented May 4, 2018

After some time I managed to get a closer look at this (or rather at the Server.java implementation that uses part of hdt-jena).

Having set up a TPF server using Server.java, I tried to query an HDT file where all literals have datatypes; in particular, the default datatype is xsd:string. It didn't work. I dug a bit through the source code and found that access to the HDT dictionary is handled by org.rdfhdt.hdtjena.NodeDictionary. Here, the method #nodeToStr strips the datatype off the literal if the datatype is xsd:string, which explains why the querying doesn't work:

public static String nodeToStr(Node node) {
  if(node==null || node.isVariable()) {
    return "";
  }else if(node.isURI()) {
    return node.getURI();
  } else if(node.isLiteral()) {
    RDFDatatype t = node.getLiteralDatatype();
    if(t==null || XSDDatatype.XSDstring.getURI().equals(t.getURI())) {
      // String
      return "\""+node.getLiteralLexicalForm()+"\"";
    } else if(RDFLangString.rdfLangString.equals(t)) {
      // Lang
      return "\""+node.getLiteralLexicalForm()+"\"@"+node.getLiteralLanguage();
    } else {
      // Typed
      return "\""+node.getLiteralLexicalForm()+"\"^^<"+t.getURI()+">";
    }
  } else {
    return node.toString();
  }
}

I think this doesn't conform to RDF 1.1 §3.3:

Please note that concrete syntaxes MAY support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string.

The implementation in NodeDictionary turns the syntactic sugar into a norm.
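The resulting failure mode can be simulated with a HashMap standing in for the HDT dictionary (an illustrative sketch only; the real lookup goes through the HDT dictionary component, not a Java map, and `nodeToStr` below is a simplified stand-in for the hdt-jena method):

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryLookupDemo {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    // Mimics NodeDictionary#nodeToStr for literals: an xsd:string
    // datatype is stripped before the dictionary lookup.
    static String nodeToStr(String lexicalForm, String datatypeUri) {
        if (datatypeUri == null || XSD_STRING.equals(datatypeUri)) {
            return "\"" + lexicalForm + "\"";
        }
        return "\"" + lexicalForm + "\"^^<" + datatypeUri + ">";
    }

    public static void main(String[] args) {
        // hdt-cpp retains the datatype, so the stored key is the typed form
        Map<String, Long> dictionary = new HashMap<>();
        dictionary.put("\"A. F. W. Sommer\"^^<" + XSD_STRING + ">", 42L);

        // The query side strips the datatype, so the lookup misses
        String queryKey = nodeToStr("A. F. W. Sommer", XSD_STRING);
        System.out.println(dictionary.containsKey(queryKey)); // false -> zero results
    }
}
```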

@wouterbeek

@larsgsvensson It is indeed possible to store simple literals in HDT. This means that the same RDF 1.1 term can be represented by two distinct terms in an HDT.

@larsgsvensson

@wouterbeek Yes, sure it is. The point I'm aiming at is that if I convert the RDF triple
:subject :predicate "object"^^xsd:string .
to hdt using the cpp implementation, I cannot query it using the Java implementation since the cpp converter will retain the datatype xsd:string on all literals whereas the Java implementation strips the xsd:string from the literal before querying the NodeDictionary. That way the literal will never be found.

If there are two ways to store the literal, then I must be able to query them exactly as I stored them, or the search for "rdf-hdt" in the object position must deliver the same result as the search for "rdf-hdt"^^xsd:string (i.e. the implementation must consider
:subject :predicate "object"^^xsd:string .
and
:subject :predicate "object" .
equivalent in every respect). If the implementation lets me store it with an xsd:string datatype but doesn't let me query it that way, it means I shouldn't be allowed to store it that way in the first place.

As I see it, the only way to accomplish this is to mandate that in HDT files the datatype xsd:string is always added if not already present, or it's always removed if present. Then the implementations accessing the HDT file would need to be adjusted accordingly.
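Either convention could be enforced by a small normalization pass over the serialized terms at HDT creation time. A hypothetical sketch of both directions (the helper names are mine, not part of hdt-java or hdt-cpp):

```java
// Illustration of the two possible canonical forms for string literals
// in an HDT dictionary. Terms are in N-Triples-like serialization.
public class LiteralNormalizer {
    static final String XSD_STRING_SUFFIX =
        "^^<http://www.w3.org/2001/XMLSchema#string>";

    // Convention A: always remove an explicit xsd:string datatype
    static String stripXsdString(String term) {
        if (term.endsWith(XSD_STRING_SUFFIX)) {
            return term.substring(0, term.length() - XSD_STRING_SUFFIX.length());
        }
        return term;
    }

    // Convention B: always add xsd:string to plain literals.
    // A plain literal starts and ends with a quote (no datatype, no language tag).
    static String addXsdString(String term) {
        if (term.length() >= 2 && term.startsWith("\"") && term.endsWith("\"")) {
            return term + XSD_STRING_SUFFIX;
        }
        return term;
    }
}
```

Whichever convention is chosen, both the writer (hdt-cpp, hdt-java) and the query side (NodeDictionary) would have to apply the same one, so that stored keys and query keys always agree.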

@wouterbeek

@larsgsvensson I agree with you that the best solution would be to always store "..."^^<http://www.w3.org/2001/XMLSchema#string> in HDT, and never store "...".

Unfortunately, "..." is a legal RDF term in Turtle, TriG, N-Triples, and N-Quads. This means that any compliant parser is allowed to emit "...", and that this should not be fixed in the parser (i.e., Serd).

Instead, what is needed is HDT-specific code that transforms "..." into "..."^^<http://www.w3.org/2001/XMLSchema#string> upon HDT file creation. If somebody is able to implement this in a pull request, it would be very welcome.

(Notice that this would not invalidate existing HDTs that use "...". It would just guarantee that newly created HDTs are not ambiguous.)

@wouterbeek

@larsgsvensson I've created an issue for this in the proper place: rdfhdt/hdt-cpp#173

You can close this current issue if there are no other hdt-java specific components to it.

@larsgsvensson

Thanks @wouterbeek. In rdfhdt/hdt-cpp#173 I suggested doing it the other way round, since it seems that most implementations use the "..." form when reading.
I don't think there are any other hdt-java issues here so I'll close.

@larsgsvensson

Just for documentation purposes, this is my current workaround:

  1. Since hdt-java depends on hdt-jena, update org.rdfhdt.hdtjena.NodeDictionary so that the methods NodeDictionary#nodeToStr(Node) and NodeDictionary#nodeToStr(Node, PrefixMapping) are non-static. Update test cases accordingly.
  2. In HdtBasedRequestProcessorForTPFs, rewrite the constructor as follows:
public HdtBasedRequestProcessorForTPFs(final String hdtFile) throws IOException {
	this.datasource = HDTManager.mapIndexedHDT(hdtFile, null); // listener=null
	this.dictionary = new NodeDictionary(this.datasource.getDictionary()) {
		@Override
		public String nodeToStr(final Node node) {
			if (node == null || node.isVariable()) {
				return "";
			} else if (node.isURI()) {
				return node.getURI();
			} else if (node.isLiteral()) {
				final RDFDatatype t = node.getLiteralDatatype();
				if (t == null) {
					// String
					return "\"" + node.getLiteralLexicalForm() + "\"";
				} else if (RDFLangString.rdfLangString.equals(t)) {
					// Lang
					return "\"" + node.getLiteralLexicalForm() + "\"@"
							+ node.getLiteralLanguage();
				} else {
					// Typed
					return "\"" + node.getLiteralLexicalForm() + "\"^^<"
							+ t.getURI() + ">";
				}
			} else {
				return node.toString();
			}
		}
	};
}

I.e., I replace nodeToStr(Node) with an implementation that keeps the datatype even when it is xsd:string.
