rdf2hdt can't handle more than 4 billion triples #135
Comments
I got it: rdf2hdt does not handle more than 2^32 triples, because it uses a 32-bit integer as the triple index. This index is quite easy to fix by switching to a 64-bit type.
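For illustration only (this is not the actual hdt-cpp code, and the variable names are hypothetical), here is a minimal sketch of why a 32-bit triple index stops working past 2^32 ≈ 4.29 billion triples and how widening it to 64 bits avoids the wrap-around:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // 2^32 = 4,294,967,296, i.e. roughly 4.29 billion distinct values.
    // A 32-bit counter therefore wraps around once a dataset such as
    // Wikidata (~4.65 billion triples) exceeds that limit.
    uint32_t index32 = 4294967295u; // largest value a 32-bit index can hold
    index32 += 1;                   // wraps to 0: later triples would overwrite earlier ones
    std::cout << "32-bit index after overflow: " << index32 << "\n";

    // Widening the index to 64 bits (e.g. uint64_t, or size_t on 64-bit
    // platforms) gives headroom far beyond any current RDF dataset.
    uint64_t index64 = 4294967295u;
    index64 += 1;                   // 4,294,967,296: no wrap-around
    std::cout << "64-bit index after the same increment: " << index64 << "\n";
    return 0;
}
```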
👍 for fixing this.
I wonder why HDT was not designed to handle large datasets (i.e. streaming content) from the beginning, considering that compressing large datasets is where this tool shines.
This is required but probably not enough to handle big datasets such as Wikidata (4.65 B triples). This should partially address issue rdfhdt#135.
@wouterbeek The paper is not accessible. Yay for open data :)
If this is the same project... LOD-a-lot is a single HDT file with more than 28 billion unique triples, using 16 GB of RAM? How did you manage to build a single HDT with 28 billion triples? I can't even imagine how much RAM the conversion process consumed!
Here's your open data: https://aic.ai.wu.ac.at/qadlod/lodalot/iswc2017/ The HDT itself takes 16 GB; the generation takes much more: "Note that HDT creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM."
Is this issue supposed to be fixed by the branch referenced above?
I haven't had the time to check yet, but given that @wouterbeek has managed to build an HDT file for Wikidata using this branch, I assume so.
@Arkanosis I can confirm this.
64-bit support has now been added to the development branch.
Hello,
I'm trying to get an HDT file from the Wikidata Turtle dump of 1 November 2017 using the current HEAD revision of rdf2hdt (7d92f3f).
Everything looked okay until the “Removing duplicate triples” step, which goes from 0 % to ~95 % progress (~91 % to ~94 % overall progress, respectively) and then starts over. It started after roughly 20 hours of runtime; it has now been running for almost 37 hours, that is, around 17 hours spent “Removing duplicate triples”.
I've lost all hope of having the conversion finish, but if I can give you useful information from the still-running process before I kill it (it's eating ~200 GiB of RAM and I can't let it run indefinitely), please let me know.
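Incidentally, the “Removing duplicate triples” step is conceptually a sort-and-deduplicate pass over ID-encoded triples. Below is a self-contained sketch of what such a pass looks like; this is not the actual hdt-cpp implementation, and the TripleID layout is an assumption, but it illustrates why both the element type and the indices must be 64-bit wide once a dataset exceeds 2^32 triples:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

// (subject, predicate, object) encoded as numeric IDs; 64-bit so that
// datasets with more than 2^32 triples (or terms) remain addressable.
using TripleID = std::array<uint64_t, 3>;

int main() {
    std::vector<TripleID> triples = {
        {1, 2, 3}, {1, 2, 3}, {4, 5, 6}, {1, 2, 3},  // toy data with duplicates
    };

    // Sorting groups identical triples together; std::unique then keeps the
    // first of each run and erase() trims the leftover tail.
    std::sort(triples.begin(), triples.end());
    triples.erase(std::unique(triples.begin(), triples.end()), triples.end());

    std::cout << "unique triples: " << triples.size() << "\n";  // prints 2
    return 0;
}
```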