
wordscanner.py

Scan through characters (grids, arrays, whatever) seeking words (English or otherwise).

This is a proof-of-concept to demonstrate the utility of using a Python set() as an alternative to Trie or DAWG structures for finding word segments (based on a given dictionary, English or otherwise).

Overview

Given a character grid (or any character stream) where you can scan through individual characters, find all matching English words (according to an English dictionary with, say, roughly 100,000 entries). This is a standard exercise in natural language processing (with many examples and utilities in NLTK).

This is typically done with a Trie, i.e., a prefix tree that can be implemented in Python using a nested dict:

{'a': 
    {'p': 
        {'p': 
            {'l': 
                {'e': 
                    {'$': '$'}
                }
            }
        }, 
    'r': 
        {'e': 
            {'$': '$'}
        }
    }
}
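Such a Trie can be built in a few lines. The build_trie helper below is illustrative only (it is not code from wordscanner.py), but it produces exactly the nested-dict shape shown above:

def build_trie(words):
    """Build a nested-dict Trie, marking complete words with '$'."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = '$'
    return trie

print(build_trie(["apple", "are"]))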

The advantage of a Trie is that it can store a large dictionary of words in a way that supports pruning the space of candidate matches (e.g., if I'm scanning letters and find an "a", it matches the above Trie, but if the next character is not a "p" or "r", I can stop searching along that path). In short, a Trie (or DAWG) allows for efficient pruning and matching of words.
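To make that pruning concrete, here is a minimal sketch of such a scan over the nested dict (again illustrative, not wordscanner.py's actual code):

def find_words_from(text, trie):
    """Return every dictionary word that starts at text[0], abandoning
    the path as soon as the running prefix falls out of the Trie."""
    matches = []
    node = trie
    for i, ch in enumerate(text):
        if ch not in node:
            break              # prune: no dictionary word continues this way
        node = node[ch]
        if '$' in node:
            matches.append(text[:i + 1])
    return matches

trie = {'a': {'p': {'p': {'l': {'e': {'$': '$'}}}}, 'r': {'e': {'$': '$'}}}}
print(find_words_from("apples", trie))   # ['apple']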

Additionally, there are numerous "Super-fast, efficient" Trie implementations. I used datrie for this example.

Hypothesis

Rather than use a DAWG or Trie (even a "Super-fast, efficient" Trie), it is preferable to use a Python set() with the following algorithm:

almost_words = set()
for word in dictionary:
    for i in range(1, len(word)):
        almost_words.add(word[:i])   # every proper prefix of the word

In other words, it is a flattened structure containing every prefix that a Trie or DAWG would encode along its paths. For the above example of ["apple", "are"], it would look like this:

{'a', 'ap', 'app', 'appl', 'ar'}
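Scanning then proceeds exactly as with a Trie, except that each pruning step is a set membership test. A minimal sketch, assuming dictionary is a set of the full words (this is not the actual wordscanner.py implementation):

def find_words_from(text, dictionary, almost_words):
    """Return every dictionary word starting at text[0], pruning as
    soon as no dictionary word extends the current prefix."""
    matches = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in dictionary:
            matches.append(prefix)
        if prefix not in almost_words:
            break              # prune: the prefix is a dead end
    return matches

dictionary = {"apple", "are"}
almost_words = {"a", "ap", "app", "appl", "ar"}
print(find_words_from("apples", dictionary, almost_words))   # ['apple']

Each step costs a hash lookup or two on a short string rather than a pointer traversal, which is presumably why CPython's heavily optimized set() holds its own against a C Trie.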

For natural language processing tasks, this is nearly as memory efficient as a Trie or DAWG, with equal or better performance. I created test scripts to demonstrate this, using the included 1000x1000 character grid as well as a generated 10,000 x 10,000 character grid.

Trie vs set() for N=1,000,000

Scan through a 1000x1000 character grid. Interestingly, datrie performed the slowest, most likely due to the initial construction of the Trie itself. It was noticeably slower than the nested dict, but it was the most memory efficient. For larger grids, the slower construction of the Trie should not matter, as it is a one-time cost that depends only on the size of the dictionary.

$ /usr/bin/time ./test_w_datrie.py
3155
3.50user 0.03system 0:03.54elapsed 99%CPU (0avgtext+0avgdata 38028maxresident)k
0inputs+0outputs (0major+8880minor)pagefaults 0swaps

$ /usr/bin/time ./test_w_trie.py
3155
2.00user 0.05system 0:02.05elapsed 99%CPU (0avgtext+0avgdata 81704maxresident)k
0inputs+0outputs (0major+20976minor)pagefaults 0swaps

$ /usr/bin/time ./test_w_set.py
3155
1.45user 0.01system 0:01.47elapsed 99%CPU (0avgtext+0avgdata 41860maxresident)k
0inputs+0outputs (0major+9539minor)pagefaults 0swaps

Trie vs set() for N=100,000,000

Scan through a 10,000 x 10,000 character grid. As expected, datrie outperformed the naive nested-dict Trie, but failed to overtake the much simpler Python set() implementation.

$ /usr/bin/time ./test_w_datrie.py
9787
127.80user 0.16system 2:07.98elapsed 99%CPU (0avgtext+0avgdata 134288maxresident)k
0inputs+0outputs (0major+58289minor)pagefaults 0swaps

$ /usr/bin/time ./test_w_trie.py
9787
161.65user 0.12system 2:41.80elapsed 99%CPU (0avgtext+0avgdata 179600maxresident)k
0inputs+0outputs (0major+54897minor)pagefaults 0swaps

$ /usr/bin/time ./test_w_set.py
9787
120.05user 0.25system 2:00.42elapsed 99%CPU (0avgtext+0avgdata 139864maxresident)k
0inputs+0outputs (0major+76902minor)pagefaults 0swaps

Trie vs set() for Japanese (N=1,000,000)

Interestingly, for Japanese characters (scanning through a random grid of mostly kanji), the performance difference was more pronounced. This is useful to know because Japanese (like Chinese) does not mark obvious word boundaries, so Japanese language parsers would benefit from using a set() rather than a Trie. For 日本.txt, I extracted all kanji and kana entries from EDICT.

$ /usr/bin/time ./test_j_set.py
4140
2.44user 0.07system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 115208maxresident)k
0inputs+0outputs (0major+31642minor)pagefaults 0swaps

$ /usr/bin/time ./test_j_trie.py
4140
3.89user 0.14system 0:04.04elapsed 99%CPU (0avgtext+0avgdata 286012maxresident)k
0inputs+0outputs (0major+71679minor)pagefaults 0swaps

Trie vs set() for Japanese (N=100,000,000)

The same performance benefit of set() holds up even for extremely large input grids. The 10,000 x 10,000 character grid was not added to git due to space constraints, but it can easily be generated with the built-in gen_grid_random() function in wordscanner.py, as sketched below.
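For example (the exact signature of gen_grid_random() may differ; the arguments below are an assumption for illustration):

import wordscanner

# Hypothetical call: argument names and order are assumed, not
# confirmed against wordscanner.py; check the function definition.
grid = wordscanner.gen_grid_random(10000, 10000)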

$ /usr/bin/time ./test_j_set.py
50220
177.84user 0.42system 2:58.54elapsed 99%CPU (0avgtext+0avgdata 507412maxresident)k
0inputs+0outputs (0major+145801minor)pagefaults 0swaps

$ /usr/bin/time ./test_j_trie.py
50220
250.44user 0.56system 4:11.86elapsed 99%CPU (0avgtext+0avgdata 680960maxresident)k
0inputs+0outputs (0major+184571minor)pagefaults 0swaps

Conclusion

For natural language processing, a naive set() of prefixes is faster than a Trie, and is nearly as memory efficient as a DAWG or libdatrie.
