"Can't build dictionary" #5
Comments
The generator is probably exhausted in a wrapper before the actual DAWG building (because the C++ library needs sorted data). There are a couple of places where error reporting could be better; I'll add a couple of extra checks soon so we can debug this further.
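A possible workaround sketch under that assumption (bundleSubGenerator is the reporter's generator; everything else here is assumed):

    import dawg

    # Materialize and sort the (key, value) pairs up front so a one-shot
    # generator can't be exhausted inside the wrapper before building.
    data = sorted(bundleSubGenerator())
    d = dawg.BytesDAWG(data)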
Could you try it with the latest master? ...and wow, 3GB is a pretty huge DAWG!
Yeah, imagine how big they'd be as a dictionary! Anyway, I'm getting a nearly identical traceback:
This means the error happens in the DictionaryBuilder::Build method of the C++ library. I have no idea if it is a 32-bit vs 64-bit issue or some other limitation on maximum DAWG sizes in dawgdic, so I think it would be better to debug this issue at the C/C++ level (see https://code.google.com/p/dawgdic/wiki/Readme). There is a command-line utility in the dawgdic library named "dawgdic-build". It accepts a file with newline-separated keys; you could try building the dictionary with this utility to isolate the issue. The following code should write the necessary file for a BytesDAWG (I didn't check it):
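A sketch of such a script (untested; it assumes BytesDAWG's internal key format of UTF-8 key + b'\x01' separator + base64-encoded value, and a hypothetical pairs iterable of (unicode, bytes) tuples):

    import base64

    def write_dawgdic_input(pairs, path):
        # dawgdic-build expects sorted, newline-separated keys;
        # BytesDAWG keys embed the payload after a b'\x01' separator.
        lines = [key.encode('utf8') + b'\x01' + base64.b64encode(value)
                 for key, value in pairs]
        lines.sort()
        with open(path, 'wb') as f:
            f.write(b'\n'.join(lines) + b'\n')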
If the command-line utility also fails, I think raising an issue here: https://code.google.com/p/dawgdic/issues/list may be a good idea. Otherwise the issue is in the wrapper and I'll try to debug it further (though right now I can't see any obvious place where this could fail in the wrapper). By the way, what is the data? For example, if the values are generally not repeated and don't have many common parts, it may be better to store them outside the DAWG (e.g. in a plain array, as sketched below) because the DAWG won't be able to compress them. If the values are unique, this would "explode" the DAWG: the compression will be very limited.
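A minimal sketch of the "plain array" idea (it assumes dawg.IntDAWG and placeholder data; the DAWG then compresses only the keys, not the payloads):

    import dawg

    keys = [u'key one', u'key two']          # placeholder keys
    values = [b'payload-1', b'payload-2']    # placeholder, mostly unique values

    # Map each key to an index into a plain list kept outside the DAWG.
    index = dawg.IntDAWG(zip(keys, range(len(values))))
    print(values[index[u'key one']])         # -> b'payload-1'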
Thanks, I'll give this a try. The keys are lists of five 'words' (usually but not always English), with many repeats. DAWGs are giving me huge memory savings and hopefully letting me avoid an on-disk solution. DAWGs also allow very fast lookups of all the keys sharing the same first four words (a prefix query, sketched below), which should turn a multi-week computation into a multi-day one. Thanks again!
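A sketch of that prefix query, assuming space-separated words and a BytesDAWG d built earlier (the key format is an assumption):

    # All (key, value) pairs whose key starts with the first four words.
    prefix = u'alpha beta gamma delta '      # hypothetical key format
    matches = d.items(prefix)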
Thanks for using it, that's a great story! Please let me know if you figure something out. If you can afford to provide a data sample on which DAWG fails (my email is kmike84@gmail.com if the data is private), I can also give it a try (though transferring large amounts of data may become an issue in that case).
...and a side note: you may also give https://github.com/kmike/marisa-trie a try; the wrapper has a similar API, and the MARISA-Trie data structure also compresses textual data quite well (it is not just a usual succinct trie). The underlying C++ library for marisa-trie is by the same author (Susumu Yata) as dawgdic's C++ library (which is used in this wrapper), and it seems Susumu invested more time in marisa-trie.
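A sketch of what the similar API looks like (marisa_trie.BytesTrie maps unicode keys to bytes payloads, much like BytesDAWG; the keys here are placeholders):

    import marisa_trie

    trie = marisa_trie.BytesTrie([(u'key one', b'value1'),
                                  (u'key two', b'value2')])
    trie[u'key one']        # -> [b'value1']
    trie.items(u'key')      # prefix query, as with BytesDAWG.items()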
I was getting the same error using dawgdic-build, so I ported my code over to use marisa-trie. Anyway, thanks for the help and for both of these great modules.
Thanks! In my understanding, the DAWG in the dawgdic library is basically a minimized double-array trie, while MARISA-Trie is a recursive LOUDS trie. The "recursive" part makes MARISA-Trie look more like a partially minimized DAWG (it is no longer just a trie), and the LOUDS storage scheme is more memory-efficient than double-array. So in my understanding DAWG has the minimal number of states, but MARISA-Trie has a more efficient state representation; for some data DAWG may take less memory, for other data MARISA-Trie may win. MARISA-Trie also natively supports memory-mapped IO (see the sketch below), which makes it especially good for huge data sets. I can't see how this issue can be resolved in the Python wrapper (it seems to be an intended limitation of dawgdic), so I'm closing this ticket.
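A sketch of the memory-mapped IO mentioned above (the file name is a placeholder):

    import marisa_trie

    trie = marisa_trie.BytesTrie([(u'key', b'value')])
    trie.save('data.marisa')

    # Map the file instead of loading it into RAM.
    trie2 = marisa_trie.BytesTrie()
    trie2.mmap('data.marisa')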
I'm building BytesDAWGs of increasing size. While the first several are fine, once they exceed a certain size they all fail with the "Can't build dictionary" exception (see below for traceback). Prior to version 0.5 this was causing a segmentation fault, but the new version yields a Python exception.

It's happening after iteration through the data (bundleSubGenerator() generates a sequence of (unicode, bytes) tuples, and completes), but before the function returns the dawg. These are pretty huge data structures, somewhere over 3GB when written to disk using the dwg.save() method. There are about 55000 keys, each with between 20 and 50 characters, and each with a potentially large number of values, between 8 and 300 bytes.

Is this simply an issue of size? It looks like it's simply getting a 'false' from a function in the dawgdic library, but I can't figure out what the problem is.

Thanks for the wonderful library. I hope this issue at least has a work-around.
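For reference, a minimal sketch of the failing pattern as described (bundleSubGenerator and dwg are names from the report; everything else is assumed):

    import dawg

    def bundleSubGenerator():
        # Yields (unicode, bytes) tuples: ~55000 keys of 20-50 characters,
        # each value 8-300 bytes.
        yield (u'word1 word2 word3 word4 word5', b'payload')

    dwg = dawg.BytesDAWG(bundleSubGenerator())  # fails past a certain size
    dwg.save('bundle.dawg')                     # smaller builds save fine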