An optimized tree of integer tokens with counts and extra data in each node
This package provides an optimized tree structure containing integer tokens and extra data associated with them. This is intended for supporting certain machine learning applications involving strings and grammatical learning.
For now, install from the github repository.
This package includes tests. They can be run using, for example,
nose, which can be installed with
pip3 install nose.
To run the tests, just use the following command in the top-level directory:
Adding data to a tree
To create a tree and add sequences, use something like the following:
    from tokentree import TokenTree

    t = TokenTree()
    t.add((1, 2, 3))
    t.add((1, 3, 8, 10))
The tree is designed to accommodate millions or even hundreds of millions of values. Insertion time is linear in the length of the sequence. The sequences can consist of arbitrary integers, positive or negative. Tokens must fit in a 64-bit integer.
It is possible to store extra data with each node. Functions for
merging extra data with each node are provided as arguments to the
constructor (see below). The extra data is provided as the
extra keyword argument to add().
It is also possible to specify a count or weight for each addition.
This value will be added to the count of each node in the path. The
value is stored as a floating point number and defaults to 1. The
optional count is specified with the
count keyword argument to add().
The following illustrates adding with extra data and weight:
t.add(seq, count=3.5, extra="noun")
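The count semantics described above can be sketched in plain Python. This is only an illustration of how a weight is added to every node on the path, not the package's actual implementation; the names SketchNode and sketch_add are hypothetical and do not exist in tokentree.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node (not part of tokentree)."""

    def __init__(self, token=None):
        self.token = token      # None for the root (empty sequence)
        self.count = 0.0        # counts are stored as floats
        self.children = {}      # token -> SketchNode


def sketch_add(root, seq, count=1.0):
    """Add `count` to the root and to every node on the path spelled by `seq`."""
    node = root
    node.count += count
    for token in seq:
        child = node.children.get(token)
        if child is None:
            child = SketchNode(token)
            node.children[token] = child
        child.count += count
        node = child
    return node


root = SketchNode()
sketch_add(root, (1, 2, 3))                  # default weight of 1
sketch_add(root, (1, 3, 8, 10), count=3.5)   # explicit weight

print(root.count)                            # 4.5
print(root.children[1].count)                # 4.5
print(root.children[1].children[2].count)    # 1.0
print(root.children[1].children[3].count)    # 3.5
```

Each call touches one node per token, which matches the statement that insertion time is linear in the length of the sequence.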
Creating a tree
The tree is created using the
TokenTree() constructor. This
constructor accepts the following keyword arguments:
merge_extra: a function to merge the
extra argument in a newly added node with the information already in a tree node. This should return the new data. Default is
None. The function is called as
merge_extra(seq, i, old_data, extra, count). For the first call,
old_data will be None.
seq is the sequence being added;
i is the position in the sequence (0 means at the tree root,
len(seq) means inserting the last node).
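As an illustration, a merge function with the signature given above might accumulate a weight per label. The function name merge_extra_labels and the choice of a dict as the stored data are assumptions made for this sketch. The function is called directly here to simulate what a tree would do; in real use it would presumably be passed as TokenTree(merge_extra=merge_extra_labels).

```python
def merge_extra_labels(seq, i, old_data, extra, count):
    """Hypothetical merge function: keep a running weight per label.

    Returns the new data to store in the node, as the constructor
    documentation requires. `old_data` is None on the first call.
    """
    merged = dict(old_data) if old_data is not None else {}
    if extra is not None:
        merged[extra] = merged.get(extra, 0.0) + count
    return merged


# Simulate three additions reaching the same node (position 2 of (1, 2)).
data = merge_extra_labels((1, 2), 2, None, "noun", 1.0)
data = merge_extra_labels((1, 2), 2, data, "noun", 2.5)
data = merge_extra_labels((1, 2), 2, data, "verb", 1.0)
print(data)  # {'noun': 3.5, 'verb': 1.0}
```

Returning a fresh dict instead of mutating old_data keeps the function free of side effects, which is a design choice of this sketch rather than a requirement stated by the package.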
Enumerating sequences from the tree
The tree works as an iterator. This makes it very easy to enumerate through all paths in the tree. The enumerator will also enumerate through all prefixes of sequences that were added to the tree. It is guaranteed that the prefix of a sequence will always be enumerated before any longer sequences with the same prefix. In effect, the iteration is in depth-first order. Within each prefix, the sequences of the next length are enumerated in ascending order of their token values.
The following illustrates enumerating the sequences. For each
iteration, the iterator returns
(seq, node), where
seq is a
tuple of integers, and
node is the node at that sequence.
    from tokentree import TokenTree

    t = TokenTree()
    t.add((1, 2))
    ...
    for seq, node in t:
        print(node.get_count(), seq)
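The enumeration order described above (prefixes before longer sequences, siblings in ascending token order) can be sketched with nested dictionaries. The helpers build and walk are hypothetical and only model the ordering; they are not the package's iterator. This sketch does not yield the empty sequence, and whether the real iterator does is not stated here.

```python
def build(sequences):
    """Build a nested-dict trie: each key is a token, each value its children."""
    root = {}
    for seq in sequences:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
    return root


def walk(children, prefix=()):
    """Yield every path depth-first, visiting siblings in ascending token order."""
    for token in sorted(children):
        seq = prefix + (token,)
        yield seq                               # the prefix comes first...
        yield from walk(children[token], seq)   # ...then everything below it


trie = build([(1, 3, 8, 10), (1, 2, 3), (-5,)])
order = list(walk(trie))
for seq in order:
    print(seq)
```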
Looking up a specific sequence
The find function looks up a specific sequence in the tree. It is
possible to look up added sequences and their prefixes. It returns
None if the given sequence is not in the tree.
    from tokentree import TokenTree

    t = TokenTree()
    t.add((1, 2))
    ...
    node = t.find(seq)
    if node is None:
        print("Not found")
    else:
        print(node.get_count(), node.get_token(), node.get_extra())
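The lookup behaviour can be modelled as a walk down the tree, one token at a time. The following is a minimal sketch over a nested-dict trie, assuming a hypothetical helper sketch_find; it is not how the package implements find.

```python
def sketch_find(root, seq):
    """Follow `seq` token by token; return the reached node or None."""
    node = root
    for token in seq:
        node = node.get(token)
        if node is None:
            return None      # the path breaks here: sequence not in the tree
    return node


# Trie holding the sequences (1, 2) and (1, 3, 8, 10).
trie = {1: {2: {}, 3: {8: {10: {}}}}}

print(sketch_find(trie, (1, 3)) is not None)   # True: prefix of an added sequence
print(sketch_find(trie, (1, 2)) is not None)   # True: an added sequence
print(sketch_find(trie, (4,)) is None)         # True: never added
```

A node without children is an empty dict in this sketch, so the result is compared against None rather than tested for truthiness.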
The get_token_count(token) method on the tree returns the number
of times that the given token has been added to the tree (in any
position in any sequence) or, when
count is used when adding, the
sum of all counts for all occurrences of the token.
This function merely retrieves precomputed data, so it is quite fast.
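One way such precomputed per-token totals could be maintained is to update a table at insertion time, so that a query is a single dictionary lookup. The class TokenCountSketch below is hypothetical and only models this bookkeeping. Returning 0.0 for a token that was never added is an assumption of the sketch, not documented behaviour of the package.

```python
from collections import defaultdict


class TokenCountSketch:
    """Hypothetical model of per-token totals kept up to date on every add."""

    def __init__(self):
        self._token_counts = defaultdict(float)

    def add(self, seq, count=1.0):
        # Every occurrence of a token contributes the weight of the addition.
        for token in seq:
            self._token_counts[token] += count

    def get_token_count(self, token):
        # Lookup only; .get avoids creating entries for unseen tokens.
        return self._token_counts.get(token, 0.0)


counts = TokenCountSketch()
counts.add((1, 2, 3))
counts.add((1, 3, 8, 10), count=3.5)

print(counts.get_token_count(1))   # 4.5
print(counts.get_token_count(8))   # 3.5
```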
Tree-level information functions
The get_root() method returns the root node of the tree. The root
node always exists and corresponds to the empty sequence.
The get_count() method returns the sum of all weights added to the
tree. It is the same as the value of
get_count() on the root node.
Node-level information and iteration functions
A node is mostly opaque for applications. However, the following functions can be used to query information about a node.
The have_children() method on a node returns
True if the node
has children, and False otherwise.
The get_token() method on a node returns the token of the node. The token
is always an integer value (but can be negative).
The get_count() method on a node returns the count value of the
node. The count is the sum of the
count values of all sequences
that have been added that have the sequence leading to the node as
their prefix (including the count value for the sequence of the node).
Iterating over children of a node
It is possible to iterate over the children of the node by simply using the node as an iterator. The following illustrates this:
    from tokentree import TokenTree

    t = TokenTree()
    t.add([1, 3, 4])
    ...
    node = t.find([1, 3])
    for child in node:
        print(child.get_count(), child.get_token())
The official repository of this project is on github.
Please email ylo at clausal.com if you wish to contribute or have patches or suggestions.
Credit and linking to the project's website and/or citing any future papers on the project would be highly appreciated.