
rdf2hdt can't handle more than 4 billion triples #135

Closed
Arkanosis opened this issue Nov 7, 2017 · 10 comments

@Arkanosis

Arkanosis commented Nov 7, 2017

Hello,

I'm trying to build an HDT file from the Wikidata Turtle dump of 1 November 2017 using the current HEAD revision of rdf2hdt (7d92f3f).

Everything looked fine until the “Removing duplicate triples” step, which goes from 0 % to ~95 % progress (~91 % to ~94 % overall) and then starts over. That step began after roughly 20 hours of runtime; the process has now been running for almost 37 hours, which means around 17 hours spent “Removing duplicate triples”.

I've lost all hope of seeing the conversion finish, but if I can extract any useful information from the still-running process before I kill it (it's eating ~200 GiB of RAM and I can't let it run indefinitely), please let me know.

@Arkanosis
Author

Arkanosis commented Nov 7, 2017

I got it: rdf2hdt cannot handle more than 2^32 triples (because it uses unsigned ints as indexes), and Wikidata now has more than 4.6 billion triples, while 2^32 is only about 4.29 billion, so the index in TriplesList::removeDuplicates overflows.

That index is easy enough to fix by using std::vector<TripleID>::size_type instead of unsigned int as the index type, but I'm afraid 32-bit indexes are used in other places where they don't cause a crash but silently produce wrong results :-/
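
For reference, here is a minimal sketch of the failure mode and of the size_type fix. The TripleID struct and loop bodies below are placeholders rather than the actual hdt-cpp code; only the change of index type mirrors what TriplesList::removeDuplicates needs.

```cpp
#include <cstdint>
#include <vector>

// Placeholder for hdt::TripleID, just so the sketch compiles on its own.
struct TripleID {
    uint64_t subject, predicate, object;
};

// Buggy pattern: unsigned int is 32 bits on common platforms, so once the
// vector holds more than 2^32 elements the index wraps back to 0 before it
// can ever reach triples.size() -- the loop "starts over" forever.
void removeDuplicatesBuggy(std::vector<TripleID> &triples) {
    for (unsigned int i = 1; i < triples.size(); ++i) {
        // ... compare triples[i] with triples[i - 1] and drop duplicates ...
    }
}

// Fixed pattern: the vector's own size_type (64 bits on 64-bit platforms)
// can index every element the vector is able to hold.
void removeDuplicatesFixed(std::vector<TripleID> &triples) {
    for (std::vector<TripleID>::size_type i = 1; i < triples.size(); ++i) {
        // ... same comparison logic ...
    }
}
```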

Arkanosis changed the title from rdf2hdt's “Removing duplicate triples” stuck in infinite loop to rdf2hdt can't handle more than 4 billion triples Nov 7, 2017
@akuckartz

👍 for fixing this.

@sharpaper

I wonder why HDT wasn't designed to handle large datasets from the beginning (i.e. by streaming content), considering that compressing large datasets is where this tool shines.
On the same note, being able to handle large datasets is the only thing that will help HDT spread as a format.

Arkanosis added a commit to Arkanosis/hdt-cpp that referenced this issue Nov 7, 2017
This is required but probably not enough to handle big datasets such as Wikidata (4.65 B triples).

This should partially address issue rdfhdt#135.
@sharpaper

@wouterbeek The paper is not accessible. Yay for open data :)

@sharpaper

sharpaper commented Nov 7, 2017

If this is the same project... LOD-a-lot is a single HDT file with more than 28 billion unique triples, using 16 GB of RAM? How did you manage to build a single HDT with 28 billion triples? I can't even imagine how much RAM the conversion process consumed!

@mielvds
Member

mielvds commented Nov 8, 2017

Here's your open data: https://aic.ai.wu.ac.at/qadlod/lodalot/iswc2017/

The HDT itself takes 16 GB; the generation takes much more: “Note that HDT creation took 64 h & 170 GB RAM. HDT-FoQ took 8 h & 250 GB RAM.”

@v4ss4llo

Is this issue supposed to be fixed by the develop-64 branch in the source?

@Arkanosis
Author

I haven't had time to check yet, but given that @wouterbeek has managed to build an HDT file for Wikidata using this branch, I assume so.

@wouterbeek
Contributor

@Arkanosis I can confirm this: develop-64 was used to create the Wikidata file.

@wouterbeek
Contributor

64-bit support has now been added to the develop branch.
