Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Can't build dictionary" #5

Closed
mcmahanp opened this issue Oct 10, 2012 · 9 comments
Closed

"Can't build dictionary" #5

mcmahanp opened this issue Oct 10, 2012 · 9 comments

Comments

@mcmahanp
Copy link

I'm building ByteDAWGs of increasing size, while the first several are fine, once they exceed a certain size they all fail with the "Can't build dictionary" exception (see below for traceback). Prior to version 0.5 this was causing a segmentation fault, but the new version yields a python exception.
It's happing after iteration through the data (bundleSubGenerator() generates a sequence of (unicode,bytes) tuples, and completes it), but before the function returns the dawg.

These are pretty huge data structures, somewhere over 3GB when written to disk using the dwg.save() method. There are about 55000 keys, each with between 20 and 50 characters, and each with a potentially large number of values, between 8 and 300 bytes.

Is this simply an issue of size? It looks like it's simply getting a 'false' from a function in the dawgdic library, but I can't figure out what the problem is.

Thanks for the wonderful library. I hope this issue at least has a work-around.

---------------------------------------------------------------------------
---> 38         dwg = dawg.BytesDAWG(bundleSubGenerator(bundleDBBasePath+str(y)))

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.BytesDAWG.__init__ (src/dawg.cpp:7321)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5318)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG.__init__ (src/dawg.cpp:1739)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG._build_from_iterable (src/dawg.cpp:1978)()

Exception: Can't build dictionary

@kmike
Copy link
Member

kmike commented Oct 10, 2012

The generator is probably exhausted in a wrapper before the actual DAWG building (because the C++ library needs sorted data).

There are a couple of places where error reporting may be better, I'll add a couple of extra checks soon so we can debug this further.

@kmike
Copy link
Member

kmike commented Oct 10, 2012

Could you try it with the latest master?

.. and wow, 3GB is a pretty huge DAWG!

@mcmahanp
Copy link
Author

Yea, imagine how big they'd be as a dictionary!

Anyway, getting a nearly identical traceback:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
     36     for y in yearRange:
     37         print('starting year %d' % y)
---> 38         dwg = dawg.BytesDAWG(bundleSubGenerator(bundleDBBasePath+str(y)))
     39         print('writing year %d' % y)
     40         dwg.save(bundleDawgBasePath+str(y)+'.dawg')

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.BytesDAWG.__init__ (src/dawg.cpp:7404)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.CompletionDAWG.__init__ (src/dawg.cpp:5398)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG.__init__ (src/dawg.cpp:1760)()

/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/DAWG-0.5-py2.7-macosx-10.5-x86_64.egg/dawg.so in dawg.DAWG._build_from_iterable (src/dawg.cpp:2057)()

Error: Can't build dictionary

@kmike
Copy link
Member

kmike commented Oct 11, 2012

This mean the error happens in DictionaryBuilder.Build method of C++ library. I have no idea if it is a "32bit vs 64bit" issue or an other limitation on maximum DAWG sizes in dawgdic.

So I think it would be better to try to debug this issue at C/C++ level (see https://code.google.com/p/dawgdic/wiki/Readme ). There is a command-line utility in dawgdic library named "dawgdic-build". It accepts a file with newline-separated keys; you may try building dictionary using this utility in order to localize the issue.

The following code should write a necessary file for BytesDAWG (I didn't check it):

data_iter = bundleSubGenerator(bundleDBBasePath+str(y))
d = BytesDAWG()
with open('keys.txt', 'wt') as f:
    for k,v in sorted(data_iter):
        f.write(d._raw_key(k, v))
        f.write("\n")

If the command-line utility will fail, I think raising an issue here: https://code.google.com/p/dawgdic/issues/list may be a good idea. Otherwise the issue is in a wrapper and I'll try to debug this further (but now I can't see obvious places where may this fail in a wrapper).

By the way, what is the data? For example, if the values are not generally repeated and they don't have many common parts, it may be better to store them outside DAWG (e.g. in a plain array) because DAWG won't be able to compress them. If the values are unique then this would "explode" DAWG - the compression will be very limited.

@mcmahanp
Copy link
Author

Thanks I'll give this a try.

The keys are lists of five 'words' (usually but not always english), with many repeats. DAWGs are giving me a huge savings in memory and hopefully letting me avoid using an on-disk solution.

DAWGs are also allowing me very fast lookups of all the keys with the same first four words, which should turn a multi-week computation task into a multi-day one.

Thanks again!

@kmike
Copy link
Member

kmike commented Oct 11, 2012

Thanks for using, that's a great story! Please let me know if you figure something out.

If you can afford providing a data sample on which DAWG fails (my email is kmike84@gmail.com if the data is private) then I can also give it a try (but transferring large amounts of data may become an issue in this case).

@kmike
Copy link
Member

kmike commented Oct 11, 2012

...and a side note: you may also give https://github.com/kmike/marisa-trie a try; the wrapper has a similar API and MARISA-Trie data structure also compresses textual data quite good (this is not just an usual succint trie). The underlying C++ library for the marisa-trie is by the same author (Susumu Yata) as a dawgdic's C++ library (which is used in this wrapper), and it seems Susumu invested more time in marisa-trie.

@mcmahanp
Copy link
Author

I was getting the same error using dawgdic-build so I posted an issue over at dawgdic. It looks like it was in fact a size limit I was hitting -- though unclear exactly which.

I ported my code over to use marisa_trie (you're right, it's pretty much a drop-in replacement) and everything's working great. In fact, the tries are considerably smaller than the DAWGs, which I find surprising. I'm not deeply familiar with the two data structures, but it was my impression that DAWGS would be at most as big as their corresponding trie.

Anyway, thanks for the help and for both of these great modules.

@kmike
Copy link
Member

kmike commented Oct 14, 2012

Thanks!

In my understanding DAWG in a dawgdic library is basically a minimized Double-Array Trie; MARISA-Trie is a recursive LOUDS-Trie. "Recursive" part made MARISA-Trie look more like partially minimized DAWG (this is no longer just a trie); the LOUDS storage scheme is more memory efficient than Double-Array. So in my understanding DAWG has minimal number of states, but MARISA-Trie has more efficient state representation; for some data DAWG may take less memory, for some data MARISA-Trie may win. MARISA-Trie also natively supports memory mapped IO that make it especially good for huge data sets.

I can't see how this issue can be resolved in a Python wrapper (it seems to be an intended limitation of a dawgdic), so I'm closing this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants