rdf2hdt can't handle more than 4 billion triples #135
Comments
I got it: rdf2hdt does not handle more than 2^32 triples, because it uses a 32-bit integer as the triple index. This index is quite easy to fix by switching to a 64-bit type.
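For illustration only (this is not the actual hdt-cpp code, and the variable names are hypothetical), here is a minimal sketch of why a 32-bit triple index stops working past 2^32 ≈ 4.29 billion triples and how widening it to 64 bits avoids the wrap-around:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // 2^32 = 4,294,967,296, i.e. roughly 4.29 billion distinct values.
    // A 32-bit counter therefore wraps around once a dataset such as
    // Wikidata (~4.65 billion triples) exceeds that limit.
    uint32_t index32 = 4294967295u; // largest value a 32-bit index can hold
    index32 += 1;                   // wraps to 0: later triples would overwrite earlier ones
    std::cout << "32-bit index after overflow: " << index32 << "\n";

    // Widening the index to 64 bits (e.g. uint64_t, or size_t on 64-bit
    // platforms) gives headroom far beyond any current RDF dataset.
    uint64_t index64 = 4294967295u;
    index64 += 1;                   // 4,294,967,296: no wrap-around
    std::cout << "64-bit index after the same increment: " << index64 << "\n";
    return 0;
}
```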
👍 for fixing this.
I wonder why HDT was not designed to handle large datasets (i.e. streaming content) from the beginning, considering that compressing large datasets is where this tool shines.
This is required but probably not enough to handle big datasets such as Wikidata (4.65 B triples). This should partially address issue rdfhdt#135.
@wouterbeek The paper is not accessible. Yay for open data :)
If this is the same project... LOD-a-lot is a single HDT file with more than 28 billion unique triples, using 16 GB of RAM? How did you manage to build a single HDT with 28 billion triples? I can't even imagine how much RAM the conversion process consumed!
Here's your open data: https://aic.ai.wu.ac.at/qadlod/lodalot/iswc2017/ The HDT itself takes 16 GB; the generation takes much more: "Note that HDT creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM."
Is this issue supposed to be fixed by the branch referenced above?
I haven't had the time to check yet, but given that @wouterbeek has managed to build an HDT file for Wikidata using this branch, I assume so.
@Arkanosis I can confirm this.
64-bit support has now been added to the development branch.
Hello,
I'm trying to get an HDT file from the Wikidata Turtle dump of 1 November 2017 using the current HEAD revision of rdf2hdt (7d92f3f).
Everything looked okay until the “Removing duplicate triples” step, which goes from 0 % to ~95 % progress (~91 % to ~94 % overall progress, respectively) and then starts over. It started after roughly 20 hours of runtime; it has now been running for almost 37 hours, that is, around 17 hours spent “Removing duplicate triples”.
I've lost all hope of having the conversion finish, but if I can give you useful information from the still-running process before I kill it (it's eating ~200 GiB of RAM and I can't let it run indefinitely), please let me know.
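Incidentally, the “Removing duplicate triples” step is conceptually a sort-and-deduplicate pass over ID-encoded triples. Below is a self-contained sketch of what such a pass looks like; this is not the actual hdt-cpp implementation, and the TripleID layout is an assumption, but it illustrates why both the element type and the indices must be 64-bit wide once a dataset exceeds 2^32 triples:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

// (subject, predicate, object) encoded as numeric IDs; 64-bit so that
// datasets with more than 2^32 triples (or terms) remain addressable.
using TripleID = std::array<uint64_t, 3>;

int main() {
    std::vector<TripleID> triples = {
        {1, 2, 3}, {1, 2, 3}, {4, 5, 6}, {1, 2, 3},  // toy data with duplicates
    };

    // Sorting groups identical triples together; std::unique then keeps the
    // first of each run and erase() trims the leftover tail.
    std::sort(triples.begin(), triples.end());
    triples.erase(std::unique(triples.begin(), triples.end()), triples.end());

    std::cout << "unique triples: " << triples.size() << "\n";  // prints 2
    return 0;
}
```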