ToC
- Learning Objective
You will know the different kinds of index indices and how to use them for searching.
- Difficulty
Average
- Duration
1.5 h
- Prerequisites
tutorial-sequences
,tutorial-iterators
SeqAn provides a common interface, called the Virtual String Tree Iterator (VSTreeIterator VSTree Iterator
), which lets you traverse the IndexEsa
, IndexWotd
and IndexDfi
as a suffix tree (glossary-suffix-tree
definition), the IndexQGram
as a suffix trie, and the FMIndex
as a prefix trie. In the first part of this tutorial we will concentrate on the TopDownIterator TopDown Iterator
which is one of the two index iterator specializations (besides the BottomUpIterator BottomUp Iterator
). The second part will then deal with the DFS.
For index based pattern search or algorithms traversing only the upper parts of the suffix tree the TopDownIterator TopDown Iterator
or TopDownHistoryIterator TopDown History Iterator
is the best solution. Both provide the functions TopDownIterator#goDown
and TopDownIterator#goRight
to go down to the first child node or go to the next sibling. The TopDownHistoryIterator TopDown History Iterator
additionally provides TopDownHistoryIterator#goUp
to go back to the parent node. The child nodes in IndexEsa
indices are lexicographically sorted from first to last. For IndexWotd
and IndexDfi
indices this holds for all children except the first.
In the next example we want to use the TopDownIterator TopDown Iterator
to efficiently search a text for exact matches of a pattern. We therefore want to use TopDownIterator#goDown
which has an overload to go down an edge beginning with a specific character.
Important
The following examples show how to iterate IndexEsa
, IndexWotd
or IndexDfi
, i.e. Index
specializations representing suffix trees. The result of the iteration will look different on Index
specializations representing tries, e.g. FMIndex
or IndexQGram
. Indeed, the topology of an Index
changes depending on the chosen tree or trie specialization. Note that any suffix tree edge can be labeled by more than one character, whereas any trie edge is always labeled by exactly one character.
First we create an index of the text "How much wood would a woodchuck chuck?"
demos/tutorial/index_iterators/index_search.cpp
Afterwards we create the TopDownIterator TopDown Iterator
using the metafunction Iterator, which expects two arguments, the type of the container to be iterated and a specialization tag (see the VSTree Iterator hierarchy and the tutorial-iterators
Tutorial for more details).
demos/tutorial/index_iterators/index_search.cpp
The main search can then be implemented using the functions VSTreeIterator#repLength
and VSTreeIterator#representative
. Since TopDownIterator#goDown
might cover more than one character it is necessary to compare parts of the pattern against the representative of the iterator. The search can now be implemented as follows. The algorithm descends the suffix tree along edges beginning with the corresponding pattern character. In each step the unseen
edge characters have to be verified.
demos/tutorial/index_iterators/index_search.cpp
If all pattern characters could successfully be compared we end in the topmost node who's leaves point to text positions starting with the pattern. Thus, the suffixes represented by this node are the occurrences of our pattern and can be retrieved with VSTreeIterator#getOccurrences
.
demos/tutorial/index_iterators/index_search.cpp
Program output:
demos/tutorial/index_iterators/index_search.cpp.stdout
Alternatively, we could have used TopDownIterator#goDown
to go down the path of a pattern instead single characters:
demos/tutorial/index_iterators/index_search2.cpp
demos/tutorial/index_iterators/index_search2.cpp.stdout
- Type
Review
- Objective
Copy the code into a demo program and replace the text with a string set containing the strings
"How much"
,"wood would"
and" a woodchuck chuck?"
.- Solution
demos/tutorial/index_iterators/iterator_solution1.cpp
- The difference is the format of the positions of the found occurrences.
Here, we need a
Pair
to indicate the string within theStringSet
and a position within the string.
- Type
Review
- Objective
Write a program that traverses the nodes of the suffix tree of
"tobeornottobe"
in the order shown here:- :align: center
- width
300px
At each node print the text of the edges from the root to the node. You may only use the functions
TopDownIterator#goDown
,TopDownIterator#goRight
,TopDownHistoryIterator#goUp
andVSTreeIterator#isRoot
to navigate andVSTreeIterator#representative
which returns the string that represents the node the iterator points to.- Hint
- Use a
TopDownHistoryIterator TopDown History Iterator
. The code skeleton could look like this:
demos/tutorial/index_iterators/iterator_assignment2.cpp
- Use a
- Solution
One iteration step of a preorder DFS can be described as follows:
- if possible, go down one node
if not:
- if possible, go to the next sibling
if not:
- go up until it is possible to go to a next sibling
- stop the whole iteration after reaching the root node
Thus, the DFS walk can be implemented in the following way:
demos/tutorial/index_iterators/iterator_solution2.cpp
- Type
Review
- Objective
Modify the program to efficiently skip nodes with representatives longer than 3. Move the whole program into a template function whose argument specifies the index type and call this function twice, once for the
IndexEsa
and once for theIndexWotd
index.- Solution
- We modify the DFS traversal to skip the descent if we walk into a node whose representative is longer than 3.
We then proceed to the right and up as long as the representative is longer than 3.
demos/tutorial/index_iterators/index_assignment4.cpp
demos/tutorial/index_iterators/index_assignment4.cpp.stdout
The tree traversal in assignment 2 is equal to a the tree traversal in a full depth-first search (dfs) over all suffix tree nodes beginning either in the root (preorder dfs) or in a leaf node (postorder dfs). A preorder traversal (figure-stree-preorder
) halts in a node when visiting it for the first time whereas a postorder traversal (figure-stree-postorder
) halts when visiting a node for the last time. The following two figures give an example in which order the tree nodes are visited.
Since these traversals are frequently needed SeqAn provides special iterators which will we describe next.
We want to construct the suffix tree of the string "abracadabra" and output the substrings represented by tree nodes in preorder dfs. In order to do so, we create the string "abracadabra" and an index specialized with the type of this string.
demos/tutorial/index_iterators/index_preorder.cpp
The StringTreeConcept#Iterator
metafunction expects two arguments, the type of the container to be iterated and a specialization tag, as described earlier. In this example we chose a TopDownHistoryIterator TopDown History Iterator
whose signature in the second template argument is TopDown< ParentLinks<Preorder> >
.
demos/tutorial/index_iterators/index_preorder.cpp
As all DFS suffix tree iterators implement the VSTreeIterator VSTree Iterator
, they can be used via VSTreeIterator#goNext
, VSTreeIterator#atEnd
, etc.
demos/tutorial/index_iterators/index_preorder.cpp
Program output:
demos/tutorial/index_iterators/index_preorder.cpp.stdout
Tip
There are currently 2 iterators in SeqAn supporting a DFS search:
Iterator | Preorder | Postorder |
---|---|---|
BottomUpIterator |
no | yes |
TopDownHistoryIterator |
yes | yes |
If solely a postorder traversal is needed the BottomUpIterator BottomUp Iterator
should be preferred as it is more memory efficient. Please note that the BottomUp Iterator is only applicable to IndexEsa
indices.
Tip
A relaxed suffix tree (see glossary-suffix-tree
) is a suffix tree after removing the $ characters and empty edges. For some bottom-up algorithms it would be better not to remove empty edges and to have a one-to-one relationship between leaves and suffices. In that cases you can use the tags PreorderEmptyEdges or PostorderEmptyEdges instead of Preorder or Postorder or EmptyEdges for the TopDown Iterator.
Note that the VSTreeIterator#goNext
is very handy as it simplifies the tree traversal in assignment 2 greatly.
- Type
Review
- Objective
Write a program that constructs an index of the
StringSet
"tobeornottobe", "thebeeonthecomb", "beingjohnmalkovich" and outputs the strings corresponding to suffix tree nodes in postorder DFS.- Solution
- First we have to create a
StringSet
ofCharString
(shortcut forString<char>
) and append the 3 strings to it. This could also be done by using
StringConcept#resize
and then assigning the members withoperator[]
. The first template argument of the index class has to be adapted and is now a StringSet.demos/tutorial/index_iterators/index_assignment1.cpp
To switch to postorder DFS we have two change the specialization tag of
ParentLinks
fromPreorder
toPostorder
. Please note that theTopDownHistoryIterator
always starts in the root node, which is the last postorder DFS node. Therefore, the iterator has to be set explicitly to the first DFS node viaVSTreeIterator#goBegin
.demos/tutorial/index_iterators/index_assignment1.cpp
Alternatively to a
TopDownHistoryIterator
you also could have used aBottomUpIterator
with the same result. The BottomUp Iterator automatically starts in the first DFS node as it supports no random access.demos/tutorial/index_iterators/index_assignment1.cpp
Program output:
demos/tutorial/index_iterators/index_assignment1.cpp.stdout
As a last assignment lets try out one of the specialised iterators, which you can find at the bottom of this page. Look there for the specialisation which iterates over all maximal unique matches (MUMS).
- Type
Review
- Objective
Write a program that outputs all maximal unique matches (MUMs) between
"CDFGHC"
and"CDEFGAHC"
.- Solution
Again, we start to create a
StringSet
ofCharString
and append the 2 strings.demos/tutorial/index_iterators/index_assignment2.cpp
After that we simply use the predefined iterator for searching MUMs, the
MumsIterator
. Its constructor expects the index and optionally a minimum MUM length as a second parameter. The set of all MUMs can be represented by a subset of suffix tree nodes. The iterator will halt in every node that is a MUM of the minimum length. The corresponding match is the node'sVSTreeIterator#representative
.demos/tutorial/index_iterators/index_assignment2.cpp
Program output:
demos/tutorial/index_iterators/index_assignment2.cpp.stdout
In the previous subsection we have seen how to walk through a suffix tree. We now want to know what can be done with a suffix tree iterator. As all iterators are specializations of the general VSTree Iterator class, they inherit all of its functions. There are various functions to access the node the iterator points at (some we have already seen), so we concentrate on the most important ones.
VSTreeIterator#representative
returns the substring that represents the current node, i.e. the concatenation of substrings on the path from the root to the current node
VSTreeIterator#getOccurrence
returns a position where the representative occurs in the text
VSTreeIterator#getOccurrences
returns a string of all positions where the representative occurs in the text
VSTreeIterator#isRightTerminal
tests if the representative is a suffix in the text (corresponds to the shaded nodes in the
glossary-suffix-tree
figures)VSTreeIterator#isLeaf
tests if the current node is a tree leaf
TopDownIterator#parentEdgeLabel
returns the substring that represents the edge from the current node to its parent (only TopDownHistory Iterator)
Important
There is a difference between the functions isLeaf and isRightTerminal. In a relaxed suffix tree (see glossary-suffix-tree
) a leaf is always a suffix, but not vice versa, as there can be internal nodes a suffix ends in. For them isLeaf returns false and isRightTerminal returns true.
Some algorithms require to store auxiliary information (e.g. weights, scores) to the nodes of a suffix tree. To attain this goal SeqAn provides so-called property maps, simple Strings of a property type. Before storing a property value, these strings must first be resized with StringTreeConcept#resizeVertexMap
. The property value can then be assigned or retrieved via VSTreeIterator#assignProperty
, VSTreeIterator#getProperty
, or VSTreeIterator#property
. It is recommended to call StringTreeConcept#resizeVertexMap
prior to every call of VSTreeIterator#assignProperty
to ensure that the property map has sufficient size. The following example iterates over all nodes in preorder dfs and recursively assigns the node depth to each node. First we create a String
of int
to store the node depth for each suffix tree node.
demos/tutorial/index_iterators/index_property_maps.cpp
The main loop iterates over all nodes in preorder DFS, i.e. parents are visited prior children. The node depth for the root node is 0 and for all other nodes it is the parent node depth increased by 1. The functions VSTreeIterator#assignProperty
, VSTreeIterator#getProperty
and VSTreeIterator#property
must be called with a StringTreeConcept#VertexDescriptor
. The vertex descriptor of the iterator node is returned by VSTreeIterator#value
and the descriptor of the parent node is returned by TopDownIterator#nodeUp
.
demos/tutorial/index_iterators/index_property_maps.cpp
At the end we again iterate over all nodes and output the calculated node depth.
demos/tutorial/index_iterators/index_property_maps.cpp
Program output:
demos/tutorial/index_iterators/index_property_maps.cpp.stdout
Tip
In SeqAn there is already a function TopDownHistoryIterator#nodeDepth
defined to return the node depth.
By now, we know the following iterators (n = text size, σ = alphabet size, d = tree depth):
Iterator specialization | Description | Space | Index tables |
---|---|---|---|
BottomUpIterator |
postorder dfs | 𝒪(d) | SA, LCP |
TopDownIterator |
can go down and go right | 𝒪(1) | SA, Lcp, Childtab |
TopDownHistoryIterator |
can also go up, preorder/postorder dfs | 𝒪(d) | SA, Lcp, Childtab |
Besides the iterators described above, there are some application-specific iterators in SeqAn:
Iterator specialization | Description | Space | Index tables |
---|---|---|---|
MaxRepeatsIterator |
maximal repeats | 𝒪(n) | SA, Lcp, Bwt |
SuperMaxRepeatsIterator |
supermaximal repeats | 𝒪(d + σ) | SA, Lcp, Childtab, Bwt |
SuperMaxRepeatsFastIterator |
supermaximal repeats (optimized for ESA) | 𝒪(σ) | SA, Lcp, Bwt |
MumsIterator |
maximal unique matches | 𝒪(d) | SA, Lcp, Bwt |
MultiMemsIterator |
multiple maximal exact matches (w.i.p.) | 𝒪(n) | SA, Lcp, Bwt |
Given a string s a repeat is a substring r that occurs at 2 different positions i and j in s. The repeat can also be identified by the triple (i,j,|r|). A maximal repeat is a repeat that cannot be extended to the left or to the right, i.e. s[i-1]≠s[j-1] and s[i+|r|]≠s[j+|r|]. A supermaximal repeat r is a maximal repeat that is not part of another repeat. Given a set of strings s1, ..., sm a MultiMEM (multiple maximal exact match) is a substring r that occurs in each sequence si at least once and cannot be extended to the left or to the right. A MUM (maximal unique match) is a MultiMEM that occurs exactly once in each sequence. The following examples demonstrate the usage of these iterators:
DemoMaximalUniqueMatches Demo Maximal Unique Matches
DemoSupermaximalRepeats Demo Supermaximal Repeats
DemoMaximalRepeats Demo Maximal Repeats