
The construction algorithm does not scale well. #6

Closed
aolieman opened this issue Apr 14, 2015 · 22 comments

@aolieman

It is noted in the documentation that the construction algorithm is not suitable for more than a couple of thousand keywords. In my experience, the memory requirements go through the roof at around 10K keywords.

I am, however, quite motivated to (somehow) use Aho-Corasick search in python with > 100K keywords. Several workarounds are possible for my current use-case, but if it is possible to improve the scalability of Acora, I'd much rather spend my time on contributing to that.

So, where do we start?
It seems that tree construction is possible in linear runtime: http://link.springer.com/chapter/10.1007%2F11496656_15

@scoder
Owner

scoder commented Apr 15, 2015

Yes, it's sad that the construction is such a serious bottleneck. The problem is that it (a) unfolds the tree from a nondeterministic automaton into a deterministic one using powerset construction, so it suffers from state explosion, and (b) does so in Python dicts, which are not a memory-efficient representation of a tree. There are certainly tree constructions available that perform substantially better and produce more efficient trees. The paper you referenced seems to provide such a way. There's a pre-print version here, BTW:
http://cs.haifa.ac.il/~landau/gadi/shiri.pdf

The tree construction in acora is done entirely in Python, though. It's in _nfa2dfa.py. If you can get a Python implementation of a better tree construction running, it should be possible to integrate it with the search engine without major changes.
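For readers following along: the textbook Aho-Corasick construction builds the keyword trie and then wires up failure links in a single breadth-first pass, which is linear in the total length of the keywords and avoids the powerset construction entirely. A minimal pure-Python sketch of that classic algorithm (illustrative names, not acora's actual code):

```python
from collections import deque

def build_automaton(keywords):
    """Trie plus failure links: the classic Aho-Corasick construction.

    Runs in time linear in the total keyword length -- no
    NFA-to-DFA powerset step, hence no state explosion.
    """
    root = {'next': {}, 'fail': None, 'out': []}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node['next'].setdefault(
                ch, {'next': {}, 'fail': None, 'out': []})
        node['out'].append(kw)
    # Breadth-first pass: a node's failure link points to the node
    # for the longest proper suffix of its path that is in the trie.
    queue = deque()
    for child in root['next'].values():
        child['fail'] = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node['next'].items():
            fail = node['fail']
            while fail is not root and ch not in fail['next']:
                fail = fail['fail']
            child['fail'] = fail['next'].get(ch, root)
            child['out'] += child['fail']['out']  # inherit suffix matches
            queue.append(child)
    return root

def find_all(root, text):
    """Yield (keyword, start_position) for every occurrence in text."""
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node['next']:
            node = node['fail']
        node = node['next'].get(ch, root)
        for kw in node['out']:
            yield kw, i - len(kw) + 1
```

With e.g. `build_automaton(['he', 'she', 'his', 'hers'])`, scanning `'ushers'` reports `she`, `he` and `hers`, including the overlapping matches.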

@aolieman
Author

Thanks for the explanation! Good news that the tree construction is Python-only; that gives me a much better chance at providing an alternative than if you had coded it in C.

Guess I have to do some homework on the implementation specifics in the paper I referenced. I'm pretty sure the NFA->DFA transform is the bigger problem of the two you mentioned. Using dictionaries may not be memory-efficient for this purpose, but that alone wouldn't cause the exponential memory growth we see now.

Using the method described in the paper should be possible with a Python-only trie, but if we're going for scalability it is probably good to use e.g. marisa-trie. How would you feel about such a dependency? The most important criteria, in my view, are covered: it's pip-installable, supports Unicode, and can be pickled.

By the way, the current workaround I'm using is to split my set of keywords into subsets of at most 10K keywords and to construct a separate tree for each. Then I apply a very basic form of federated search to get results from each tree. It works okay, but selecting the longest matches is a bit of a hassle.
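The splitting workaround above can be sketched roughly like this (the per-subset matcher is stubbed out with a naive scan here for self-containment; in practice each chunk would get its own acora tree, and all names are illustrative):

```python
def chunked(keywords, size=10000):
    """Split the keyword set into subsets of at most `size` entries."""
    kws = sorted(keywords)
    return [kws[i:i + size] for i in range(0, len(kws), size)]

def naive_find_all(keywords, text):
    """Stand-in for searching one per-subset tree."""
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            yield kw, start
            start = text.find(kw, start + 1)

def federated_search(chunks, text):
    """Merge the matches from all subsets, keeping only the longest
    match at each start position (the selection hassle mentioned above)."""
    best = {}  # start position -> longest keyword starting there
    for chunk in chunks:
        for kw, start in naive_find_all(chunk, text):
            if len(kw) > len(best.get(start, '')):
                best[start] = kw
    return sorted(best.items())
```

For example, with the keywords split into two chunks, searching `'ushers'` keeps `hers` at position 2 rather than the shorter `he` found by a different chunk.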

@scoder
Owner

scoder commented Apr 17, 2015

Adding a dependency on marisa-trie might be premature optimisation. And it seems it doesn't support byte strings out of the box, so that would have to be faked with Latin-1 decoding/encoding. I'd say, let's see how far we can get without it.

Splitting the construction is only a workaround. If algorithmic improvements can be implemented, they are far preferable.

@aolieman
Author

Of course only an algorithmic improvement would be sensible to include in Acora. I just mentioned my workaround for readers who stumble across this issue report in the meantime. Once I extend the workaround with satisfactory selection of the longest overlapping matches, it could be added as a recipe in the docs, though.

Given my current schedule, I'll only get around to giving the alternative tree construction algorithm a serious go after the summer.

I guess you're also right that a dependency on marisa-trie would be premature optimisation. I might as well try implementing the algorithm described in the paper using only basic Python datatypes, and then see where to go from there based on (memory) profiling.

@pombredanne

@aolieman Any progress?

@pombredanne

@scoder @aolieman Reading the algorithm from @shirki, I am under the impression that there is a big hidden constant in the linear construction.
See also http://www.cs.ucr.edu/~stelo/cpm/cpm05/cpm05_4_5_Dori.ppt

You have to:

  1. sort the inputs
  2. construct the trie
  3. construct a generalized suffix tree on the reversed inputs
  4. compute the proper ancestors on that suffix tree to create the failure links
  5. create the failure links in the trie from 2. using the result of 4., by mapping the suffix tree back onto the trie

@shirki: am I missing something? Would you have an example of code by chance?
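The first two steps are straightforward; the suffix-tree machinery in steps 3-5 exists only to find the failure links in linear time. A sketch of what those steps compute, with the failure links done by brute force here for clarity (illustrative names, not code from the paper):

```python
def sorted_trie(keywords):
    """Steps 1-2: sort the inputs and build a plain trie.

    Nodes are keyed by their full prefix string so that failure
    links are easy to look up below."""
    nodes = {'': {}}  # prefix -> {char: child prefix}
    for kw in sorted(keywords):      # step 1: sort inputs
        for i, ch in enumerate(kw):  # step 2: insert into the trie
            prefix, child = kw[:i], kw[:i + 1]
            nodes[prefix][ch] = child
            nodes.setdefault(child, {})
    return nodes

def naive_failure_links(nodes):
    """What steps 3-5 compute: for every node, the longest proper
    suffix that is also a trie node. Done here by brute force
    (quadratic overall); the generalized suffix tree over the
    reversed inputs is what makes this part linear in the paper."""
    fail = {}
    for prefix in nodes:
        if not prefix:
            continue  # the root has no failure link
        for i in range(1, len(prefix) + 1):
            if prefix[i:] in nodes:  # always terminates: '' is a node
                fail[prefix] = prefix[i:]
                break
    return fail
```

For the keywords `['he', 'she']`, for instance, the failure link of node `she` is `he` and that of `sh` is `h`.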

@scoder
Owner

scoder commented Jan 3, 2016

Note that this code is currently being rewritten. I'll soon have a correct implementation available for testing.

@pombredanne

@scoder Thanks! ... Unrelated... Do you think an id could be added to each string stored in the automaton?

@aolieman
Author

aolieman commented Jan 3, 2016

@scoder that's wonderful! Looking forward to your implementation.

@pombredanne Regarding the addition of IDs: what would be the use case? Given that your keywords are unique, I see no problem mapping them to IDs external to the automaton (e.g. using a dict or database).

@pombredanne

@aolieman This all depends on your definition of what a keyword is. It can be an arbitrarily long string, and keywords do not have to be unique either. For long strings, and many of them, storing an external dict with the string as key is a terribly bad solution: at best you end up storing the strings twice; at worst you exhaust RAM.

@pombredanne

@aolieman ... which is why most trie-like structures allow you to store an id (or more) in a dict-like fashion; pyahocorasick, esmre, datrie, dawg, and marisa-trie all take this approach.

@aolieman
Author

aolieman commented Jan 3, 2016

@pombredanne I see that you've opened #8 to discuss this issue. It's a legitimate request of course, although I wonder what you mean by non-unique keywords.

@pombredanne

@aolieman You say keywords, but I think of arbitrary strings. Say I want to index the text of 1000 web pages, each treated as a "keyword" of possibly 1000 words. I may not know ahead of time whether, by coincidence, two of these strings are duplicates.

@pombredanne

@aolieman Another case where ids matter is storing short, bona-fide "keywords" in multiple languages, say English and Dutch, in the same trie. Two words may be spelled the same in both languages, and knowing whether a word is defined in one or both languages matters. I would not want to have to dedupe each language ahead of time to pick one unique keyword. There again, it does not make sense to me space-wise to store the words both in the trie and in another dictionary; if anything, one of the attractions of a trie is some compression.

@scoder
Owner

scoder commented Jan 9, 2016

I uploaded the rewrite in a new branch: https://github.com/scoder/acora/tree/rewrite
It's clearly better than the old implementation, so please give it a try. I'm considering removing some of the backwards-compatibility code that now gets somewhat in the way. I can't say yet whether that would improve things further, although I would hope so. It might allow further tuning of the initial tree construction.

@pombredanne

@scoder Thanks! Can you explain in a nutshell how the new implementation works?

@scoder
Owner

scoder commented Jan 9, 2016

The main difference is that it avoids state explosion by no longer doing the NFA-to-DFA transformation. Instead, it correctly implements Aho-Corasick and only eliminates multi-step fallbacks by merging them into the edges of the final graph. I also implemented a graphviz dot dump that helps with visualising the generated tree, so you can take a look at what you get.
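To illustrate "merging multi-step fallbacks into the edges": while the failure links are wired up breadth-first, each node can also copy its fallback node's outgoing edges, so matching does exactly one dict lookup per input character, yet the number of states stays equal to the trie size. A hypothetical sketch of that idea, not the actual rewrite:

```python
from collections import deque

def build_merged(keywords):
    """Trie with failure links whose fallback chains are folded
    directly into the edges as they are computed: every node ends
    up answering any relevant character in a single lookup, with
    no powerset blow-up in the state count."""
    root = {'next': {}, 'out': []}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node['next'].setdefault(ch, {'next': {}, 'out': []})
        node['out'].append(kw)
    queue = deque()
    for child in root['next'].values():
        child['fail'] = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node['next'].items():
            # One lookup suffices: the parent's fallback node has
            # already had its own fallback edges merged in.
            child['fail'] = node['fail']['next'].get(ch, root)
            child['out'] += child['fail']['out']
            queue.append(child)
        # Fold the fallback node's remaining edges into this node.
        for ch, target in node['fail']['next'].items():
            node['next'].setdefault(ch, target)
    return root

def search(root, text):
    """One dict lookup per character; a miss falls back to the root."""
    node = root
    for i, ch in enumerate(text):
        node = node['next'].get(ch, root)
        for kw in node['out']:
            yield kw, i - len(kw) + 1
```

Note that `search` never walks a failure chain at match time; all the fallback work happens once, during construction.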

@scoder
Owner

scoder commented Jan 22, 2016

Any comments on the new implementation?

@aolieman
Author

Despite my excitement, I haven't had a chance to try the new implementation. I'll likely find the time next week.

@aolieman
Author

aolieman commented Mar 2, 2016

@scoder the improvements are amazing! Sorry that it took me so long to find this out.

The Cython implementation now performs excellently in terms of time and space complexity. I ran some tests with a set of 200K keywords that failed to build with the previous implementation. On my laptop the tree builds in 10-20 s, using ~600 MB of RAM.

Just for kicks I also tested the pure Python implementation. Space complexity is still an issue there, but I was positively surprised that the tree can be built at all; it takes about 90 s and ~7 GB of RAM in my case. There was, however, no discernible difference in search runtime compared with Cython (~1 s to find 1000 matches in the same number of sentences). This is not what I expected, given the docs, earlier benchmarks, and common sense. (Edit: I'm not certain that the search actually used pure Python after all.)

In conclusion: thank you very much @scoder! And I think the rewrite deserves to be on PyPI.

@aolieman
Author

This issue can be closed. @scoder would you mind doing a new release with the improvements?

@scoder
Owner

scoder commented Mar 17, 2016

Released as acora 2.0.
