Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Ability to have byte string keys #3

Open
dan-blanchard opened this issue Sep 25, 2012 · 2 comments
Open

Feature Request: Ability to have byte string keys #3

dan-blanchard opened this issue Sep 25, 2012 · 2 comments

Comments

@dan-blanchard
Copy link
Contributor

I'm trying to store a large number of 4-grams in memory to speed up computation as part of a larger program I'm working on, and I think I found a reasonable way to do it using your DAWG library, but I ran into an issue because you currently don't allow byte string keys (I'm not sure if this is a limitation of your wrapper or the underlying library).

The idea I had was to have one DAWG that maps from words to unique numbers, and then store the 4-gram counts in another DAWG by converting the 4-grams to struct-packed byte strings of the format 'LLLL' where each packed item in the byte string is one of the unique numbers from the word DAWG. However, I can't currently do this because the keys are required to be unicode.

Is this something that would be feasible to add in the future?

@kmike
Copy link
Member

kmike commented Sep 25, 2012

Underlying dawgdic C++ library doesn't allow zero bytes in keys (because it uses null-terminated strings) - that's why the wrapper disallows arbitrary bytestring keys. It stores utf8-encoded unicode and utf8 is guaranteed not to have zero bytes. Struct-packed 'LLLL' numbers will have zero bytes for sure.

But the wrapper supports binary values (in BytesDAWG and RecordDAWG). This is implemented using a hack: binary data is encoded/decoded to/from base64.

I don't know the best solution for N-Gram storage.

You may try https://github.com/kmike/marisa-trie - its base storage is not as efficient and fast as DAWG but it supports perfect hashing (each key is assigned an unique number "for free" - storing unique numbers in DAWG may be costly) and it also supports binary keys (including keys containing zero bytes).

Another option is to just store N-grams in DAWG unpacked (e.g. whitespace-separated); I'd not be surprised if this will be quite efficient.

@kmike
Copy link
Member

kmike commented Sep 25, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants