"Can't build dictionary" #5

mcmahanp · 2012-10-10T20:07:03Z

I'm building ByteDAWGs of increasing size, while the first several are fine, once they exceed a certain size they all fail with the "Can't build dictionary" exception (see below for traceback). Prior to version 0.5 this was causing a segmentation fault, but the new version yields a python exception.
It's happing after iteration through the data (bundleSubGenerator() generates a sequence of (unicode,bytes) tuples, and completes it), but before the function returns the dawg.

These are pretty huge data structures, somewhere over 3GB when written to disk using the dwg.save() method. There are about 55000 keys, each with between 20 and 50 characters, and each with a potentially large number of values, between 8 and 300 bytes.

Is this simply an issue of size? It looks like it's simply getting a 'false' from a function in the dawgdic library, but I can't figure out what the problem is.

Thanks for the wonderful library. I hope this issue at least has a work-around.

---------------------------------------------------------------------------
---> 38         dwg = dawg.BytesDAWG(bundleSubGenerator(bundleDBBasePath+str(y)))

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.BytesDAWG.__init__ (src/dawg.cpp:7321)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5318)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG.__init__ (src/dawg.cpp:1739)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG._build_from_iterable (src/dawg.cpp:1978)()

Exception: Can't build dictionary

The text was updated successfully, but these errors were encountered:

kmike · 2012-10-10T21:40:35Z

The generator is probably exhausted in a wrapper before the actual DAWG building (because the C++ library needs sorted data).

There are a couple of places where error reporting may be better, I'll add a couple of extra checks soon so we can debug this further.

kmike · 2012-10-10T21:57:55Z

Could you try it with the latest master?

.. and wow, 3GB is a pretty huge DAWG!

mcmahanp · 2012-10-10T22:38:35Z

Yea, imagine how big they'd be as a dictionary!

Anyway, getting a nearly identical traceback:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
     36     for y in yearRange:
     37         print('starting year %d' % y)
---> 38         dwg = dawg.BytesDAWG(bundleSubGenerator(bundleDBBasePath+str(y)))
     39         print('writing year %d' % y)
     40         dwg.save(bundleDawgBasePath+str(y)+'.dawg')

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.BytesDAWG.__init__ (src/dawg.cpp:7404)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5398)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG.__init__ (src/dawg.cpp:1760)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG._build_from_iterable (src/dawg.cpp:2057)()

Error: Can't build dictionary

kmike · 2012-10-11T08:54:39Z

This mean the error happens in DictionaryBuilder.Build method of C++ library. I have no idea if it is a "32bit vs 64bit" issue or an other limitation on maximum DAWG sizes in dawgdic.

So I think it would be better to try to debug this issue at C/C++ level (see https://code.google.com/p/dawgdic/wiki/Readme ). There is a command-line utility in dawgdic library named "dawgdic-build". It accepts a file with newline-separated keys; you may try building dictionary using this utility in order to localize the issue.

The following code should write a necessary file for BytesDAWG (I didn't check it):

data_iter = bundleSubGenerator(bundleDBBasePath+str(y))
d = BytesDAWG()
with open('keys.txt', 'wt') as f:
    for k,v in sorted(data_iter):
        f.write(d._raw_key(k, v))
        f.write("\n")

If the command-line utility will fail, I think raising an issue here: https://code.google.com/p/dawgdic/issues/list may be a good idea. Otherwise the issue is in a wrapper and I'll try to debug this further (but now I can't see obvious places where may this fail in a wrapper).

By the way, what is the data? For example, if the values are not generally repeated and they don't have many common parts, it may be better to store them outside DAWG (e.g. in a plain array) because DAWG won't be able to compress them. If the values are unique then this would "explode" DAWG - the compression will be very limited.

mcmahanp · 2012-10-11T14:02:38Z

Thanks I'll give this a try.

The keys are lists of five 'words' (usually but not always english), with many repeats. DAWGs are giving me a huge savings in memory and hopefully letting me avoid using an on-disk solution.

DAWGs are also allowing me very fast lookups of all the keys with the same first four words, which should turn a multi-week computation task into a multi-day one.

Thanks again!

kmike · 2012-10-11T14:33:25Z

Thanks for using, that's a great story! Please let me know if you figure something out.

If you can afford providing a data sample on which DAWG fails (my email is kmike84@gmail.com if the data is private) then I can also give it a try (but transferring large amounts of data may become an issue in this case).

kmike · 2012-10-11T19:38:13Z

...and a side note: you may also give https://github.com/kmike/marisa-trie a try; the wrapper has a similar API and MARISA-Trie data structure also compresses textual data quite good (this is not just an usual succint trie). The underlying C++ library for the marisa-trie is by the same author (Susumu Yata) as a dawgdic's C++ library (which is used in this wrapper), and it seems Susumu invested more time in marisa-trie.

mcmahanp · 2012-10-14T14:29:56Z

I was getting the same error using dawgdic-build so I posted an issue over at dawgdic. It looks like it was in fact a size limit I was hitting -- though unclear exactly which.

I ported my code over to use marisa_trie (you're right, it's pretty much a drop-in replacement) and everything's working great. In fact, the tries are considerably smaller than the DAWGs, which I find surprising. I'm not deeply familiar with the two data structures, but it was my impression that DAWGS would be at most as big as their corresponding trie.

Anyway, thanks for the help and for both of these great modules.

kmike · 2012-10-14T20:48:20Z

Thanks!

In my understanding DAWG in a dawgdic library is basically a minimized Double-Array Trie; MARISA-Trie is a recursive LOUDS-Trie. "Recursive" part made MARISA-Trie look more like partially minimized DAWG (this is no longer just a trie); the LOUDS storage scheme is more memory efficient than Double-Array. So in my understanding DAWG has minimal number of states, but MARISA-Trie has more efficient state representation; for some data DAWG may take less memory, for some data MARISA-Trie may win. MARISA-Trie also natively supports memory mapped IO that make it especially good for huge data sets.

I can't see how this issue can be resolved in a Python wrapper (it seems to be an intended limitation of a dawgdic), so I'm closing this ticket.

kmike closed this as completed Oct 14, 2012

kmike mentioned this issue Apr 30, 2013

RecordDAWG only stores a very small subset of data when reading from generator function #9

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Can't build dictionary" #5

"Can't build dictionary" #5

mcmahanp commented Oct 10, 2012

kmike commented Oct 10, 2012

kmike commented Oct 10, 2012

mcmahanp commented Oct 10, 2012

kmike commented Oct 11, 2012

mcmahanp commented Oct 11, 2012

kmike commented Oct 11, 2012

kmike commented Oct 11, 2012

mcmahanp commented Oct 14, 2012

kmike commented Oct 14, 2012

"Can't build dictionary" #5

"Can't build dictionary" #5

Comments

mcmahanp commented Oct 10, 2012

kmike commented Oct 10, 2012

kmike commented Oct 10, 2012

mcmahanp commented Oct 10, 2012

kmike commented Oct 11, 2012

mcmahanp commented Oct 11, 2012

kmike commented Oct 11, 2012

kmike commented Oct 11, 2012

mcmahanp commented Oct 14, 2012

kmike commented Oct 14, 2012