<p>Prerequisite: Trie</p>
<p>A suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees help solve many string problems, such as pattern matching, counting the distinct substrings of a string, and finding the longest palindromic substring. This tutorial covers the following points:</p>
<ul>
<li>Compressed Trie</li>
<li>Suffix Tree Construction (Brute Force)</li>
<li>Brief description of Ukkonen's Algorithm</li>
</ul>
<p><br />Before moving on to suffix trees, let's first understand what a compressed trie is. <br />
Consider the following set of strings:  {$ "banana", "nabd", "bcdef", "bcfeg", "aaaaaa", "aaabaa"$ }<br /> 
A standard trie for the above set of strings will look like:<br />
<img alt="enter image description here" src="https://he-s3.s3.amazonaws.com/media/uploads/5a87821.png" /><br />
And a compressed trie for the given set of strings will look like:<br />
<img alt="enter image description here" src="https://he-s3.s3.amazonaws.com/media/uploads/639747e.png" /><br />
As the images above show, in a compressed trie every chain of edges that passes through nodes with a single child is merged into one edge, and the edge labels are concatenated. This means that each internal node in a compressed trie has at least two children. The trie also has at most $N$ leaves, where $N$ is the number of strings inserted. A tree with $N$ leaves in which every internal node has at least two children has at most $N-1$ internal nodes, so there are at most $2N-1$ nodes in the trie. A compressed trie therefore has $O(N)$ nodes, whereas a standard trie has one node per character and can need $O(N^2)$ nodes when the strings are the $N$ suffixes of a string of length $N$. <br />
This is one reason to prefer compressed tries over standard tries. </p>
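<p>The node counts can be checked with a small sketch. The names <code>TrieNode</code>, <code>build_trie</code>, <code>compress</code> and <code>count_nodes</code> are hypothetical helpers written for this illustration. The code builds a standard trie for the example strings, merges the single-child chains, and compares the sizes:</p>

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label -> child node
        self.is_end = False  # True if an inserted string ends at this node


def build_trie(words):
    """Build a standard trie: one node per character."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def compress(node):
    """Merge every chain of single-child nodes below `node` into one edge."""
    merged = {}
    for label, child in node.children.items():
        while len(child.children) == 1 and not child.is_end:
            (next_label, next_child), = child.children.items()
            label += next_label      # concatenate the edge labels
            child = next_child
        compress(child)
        merged[label] = child
    node.children = merged


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


words = ["banana", "nabd", "bcdef", "bcfeg", "aaaaaa", "aaabaa"]
root = build_trie(words)
standard_size = count_nodes(root)
compress(root)
compressed_size = count_nodes(root)
print(standard_size, compressed_size)  # 27 10
```

<p>With $N = 6$ strings the compressed trie has 10 nodes, which is within the bound of $2N-1 = 11$.</p>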
<p>Before constructing suffix trees, there is one more concept to understand: the implicit suffix tree. An implicit suffix tree of a string of length $N$ has at most $N$ leaves, whereas a proper suffix tree has exactly $N$ leaves, one per suffix. Leaves go missing when one suffix is a prefix of another suffix, because the shorter suffix then ends in the middle of a path instead of at a leaf. The following example makes this clear.<br />
Consider the string $"banana"$.<br />
Implicit Suffix Tree for the above string is shown in image below:<br />
<img alt="enter image description here" src="https://he-s3.s3.amazonaws.com/media/uploads/71cea69.png" /><br />
To avoid getting an implicit suffix tree, we append a special character that does not occur anywhere else in the string. Suppose we append $\$$ to the given string, so that the new string is $"banana\$"$. Its suffix tree is:<br />
<img alt="enter image description here" src="https://he-s3.s3.amazonaws.com/media/uploads/a55f8db.png" /><br />
Now let's move on to the construction of suffix trees.<br />
As mentioned previously, a suffix tree is a compressed trie of all the suffixes of a given string, so the brute force approach is to treat every suffix as a separate string and insert them into the trie one by one. However, the brute force approach takes $O(N^2)$ time, which is too slow for large values of $N$. </p>
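<p>The effect of the terminal character can be checked directly on the suffixes. In the sketch below, <code>hidden_suffixes</code> is a hypothetical helper that lists the suffixes that are prefixes of some other suffix, which are exactly the ones that do not get a leaf of their own:</p>

```python
def hidden_suffixes(s):
    """Suffixes of s that are a prefix of another suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(hidden_suffixes("banana"))   # ['ana', 'na', 'a']
print(hidden_suffixes("banana$"))  # []
```

<p>For $"banana"$ three of the six suffixes are hidden, so the implicit suffix tree has only three leaves. Once the terminal character is appended, no suffix is a prefix of another and every suffix ends at its own leaf.</p>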
<p>A Python implementation of the brute force approach is given below:</p>

In [1]:
class SuffixTree(object):
    
    class Node(object):
        def __init__(self, lab):
            self.lab = lab # label on path leading to this node
            self.out = {}  # outgoing edges; maps characters to nodes
    
    def __init__(self, s):
        """ Make suffix tree, without suffix links, from s in quadratic time.
            Edge labels are stored as strings, so space is also quadratic in
            the worst case. """
        s += '$'
        self.root = self.Node(None)
        self.root.out[s[0]] = self.Node(s) # trie for just longest suf
        # add the rest of the suffixes, from longest to shortest
        for i in range(1, len(s)):
            # start at root; we’ll walk down as far as we can go
            cur = self.root
            j = i
            while j < len(s):
                if s[j] in cur.out:
                    child = cur.out[s[j]]
                    lab = child.lab
                    # Walk along edge until we exhaust edge label or
                    # until we mismatch
                    k = j+1 
                    while k-j < len(lab) and s[k] == lab[k-j]:
                        k += 1
                    if k-j == len(lab):
                        cur = child # we exhausted the edge
                        j = k
                    else:
                        # we fell off in middle of edge
                        cExist, cNew = lab[k-j], s[k]
                        # create “mid”: new node bisecting edge
                        mid = self.Node(lab[:k-j])
                        mid.out[cNew] = self.Node(s[k:])
                        # original child becomes mid’s child
                        mid.out[cExist] = child
                        # original child’s label is curtailed
                        child.lab = lab[k-j:]
                        # mid becomes new child of original parent
                        cur.out[s[j]] = mid
                else:
                    # Fell off tree at a node: make new edge hanging off it
                    cur.out[s[j]] = self.Node(s[j:])
    
    def followPath(self, s):
        """ Follow path given by s.  If we fall off tree, return None.  If we
            finish mid-edge, return (node, offset) where 'node' is child and
            'offset' is label offset.  If we finish on a node, return (node,
            None). """
        cur = self.root
        i = 0
        while i < len(s):
            c = s[i]
            if c not in cur.out:
                return (None, None) # fell off at a node
            child = cur.out[s[i]]
            lab = child.lab
            j = i+1
            while j-i < len(lab) and j < len(s) and s[j] == lab[j-i]:
                j += 1
            if j-i == len(lab):
                cur = child # exhausted edge
                i = j
            elif j == len(s):
                return (child, j-i) # exhausted query string in middle of edge
            else:
                return (None, None) # fell off in the middle of the edge
        return (cur, None) # exhausted query string at internal node
    
    def hasSubstring(self, s):
        """ Return true iff s appears as a substring """
        node, off = self.followPath(s)
        return node is not None
    
    def hasSuffix(self, s):
        """ Return true iff s is a suffix """
        node, off = self.followPath(s)
        if node is None:
            return False # fell off the tree
        if off is None:
            # finished on top of a node
            return '$' in node.out
        else:
            # finished at offset 'off' within an edge leading to 'node'
            return node.lab[off] == '$'

In [2]:
stree = SuffixTree('there would have been a time for such a word')

In [3]:
stree.hasSubstring('nope')

False

In [4]:
stree.hasSubstring('such a word')

True

In [5]:
stree.hasSuffix('would have been')

False

<p>As mentioned previously, the above code is too slow for large values of $N$. Ukkonen's Algorithm comes to the rescue here.
<br />Ukkonen's Algorithm constructs the suffix tree in $O(N)$ time in the worst case.</p>
<p>Ukkonen's Algorithm divides the construction of the suffix tree into phases, and each phase is further divided into extensions. In the $i^{th}$ phase, the $i^{th}$ character is introduced into the trie: all the suffixes of the string $S[1..i]$ are inserted, and inserting the $j^{th}$ suffix in a phase is called the $j^{th}$ extension of that phase. So the $i^{th}$ phase has $i$ extensions, and overall there are $N$ such phases, where $N$ is the length of the given string. At this point it looks like at least an $O(N^2)$ task, but the algorithm exploits the fact that these are suffixes of the same string and introduces several tricks that bring the time complexity down to $O(N)$. <br /></p>
<p>Let's see how to perform the $j^{th}$ extension of the $i^{th}$ phase, in which the string $S[j..i]$ is to be inserted. Before phase $i$ starts, $i-1$ phases are already complete, which means we have a trie containing all suffixes of the string $S[1..i-1]$. So we search for the end of the path of the string $S[j..i-1]$ in the trie. There are 3 possibilities, and each corresponds to one rule that has to be followed.  </p>
<ol>
<li>The path of $S[j..i-1]$ ends at a leaf node. In that case the $i^{th}$ character is appended to the last edge label and no new node is created.</li>
<li>The path of $S[j..i-1]$ ends in the middle of an edge label whose next character is not equal to $S[i]$, or it ends at an internal node that has no outgoing edge starting with $S[i]$. In this case new nodes are created. If the path ends at an internal node, a new leaf node is created; if the path ends in the middle of an edge label, a new internal node and a new leaf node are created.</li>
<li>The path of $S[j..i-1]$ ends in the middle of an edge label whose next character is equal to $S[i]$, or it ends at an internal node that has an outgoing edge starting with $S[i]$. In that case $S[j..i]$ is already present in the trie, so nothing is done.</li>
</ol>
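<p>Before any of the optimizations, the phases and extensions can be sketched on an uncompressed trie by walking down from the root in every extension. This is only an illustration, not Ukkonen's Algorithm itself: the function names are hypothetical, the trie is stored as nested dictionaries, and because there are no edge labels, rules 1 and 2 both reduce to adding one child.</p>

```python
def build_implicit_suffix_trie(s):
    """Uncompressed implicit suffix trie as nested dicts (char -> child)."""
    root = {}
    for i in range(len(s)):            # phase i+1 introduces s[i]
        for j in range(i + 1):         # extension j+1 inserts s[j:i+1]
            node = root
            for ch in s[j:i]:          # walk the path of s[j:i] from the root
                node = node[ch]
            if s[i] not in node:       # rule 1 or rule 2: add the character
                node[s[i]] = {}
            # otherwise rule 3: s[j:i+1] is already present, do nothing
    return root


def paths(node, prefix=""):
    """Yield every string spelled by a path that starts at the root."""
    for ch, child in node.items():
        yield prefix + ch
        yield from paths(child, prefix + ch)


trie = build_implicit_suffix_trie("banana")
print("nan" in set(paths(trie)))  # True
```

<p>The three nested loops make the cubic cost of this naive version visible: $N$ phases, up to $N$ extensions per phase, and a walk of up to $N$ characters per extension.</p>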
<p>These are the 3 extension rules used to perform the extensions of a phase.
<br />Note that we are still doing $N$ phases, and in the $i^{th}$ phase we perform $i$ extensions.
<br />For every extension we need to find the path of the string $S[j..i-1]$ in the trie built in the previous phases. If we walk down from the root every time, the time complexity is $O(N^3)$. To avoid that we use suffix links, which are explained next.<br /></p>
<p><strong>Suffix Links:</strong> Suppose a string $X$ is present in the trie and its path from the root ends at a node $v$, and the string $aX$, where $a$ is any character, is also present in the trie and its path from the root ends at a node $w$. Then the link from $w$ to $v$ is called a suffix link.<br /></p>
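<p>The definition can be illustrated without building the tree. The sketch below relies on one fact about suffix trees of strings with a unique terminal character: the internal nodes correspond to the substrings that are followed by at least two different characters. The helper names are hypothetical, and the brute force enumeration is only meant for short strings.</p>

```python
def branching_substrings(s):
    """Substrings of s that are followed by at least two distinct characters."""
    followers = {}
    for start in range(len(s)):
        for end in range(start, len(s)):
            # s[start:end] (possibly empty) is followed by s[end]
            followers.setdefault(s[start:end], set()).add(s[end])
    return {sub for sub, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map the path aX of each internal node to the path X it links to."""
    nodes = branching_substrings(s)   # "" stands for the root
    return {path: path[1:] for path in nodes if path}


links = suffix_links("banana$")
print(sorted(links.items()))  # [('a', ''), ('ana', 'na'), ('na', 'a')]
```

<p>For $"banana\$"$ the links form the chain ana, na, a, root, and each target is itself an internal node.</p>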
<p>How does a suffix link help? In the $j^{th}$ extension of phase $i$ we have to find the end of the path of the string $S[j..i-1]$, and in the next extension the end of the path of the string $S[j+1..i-1]$. Before phase $i$ we have completed $i-1$ phases, which means the strings $S[j..i-1]$ and $S[j+1..i-1]$ are already in the trie. Clearly $S[j..i-1]$ is nothing but $S[j]S[j+1..i-1]$, so there is a suffix link from the node at the end of the path $S[j..i-1]$ to the node at the end of the path $S[j+1..i-1]$. Instead of walking down from the root for the $(j+1)^{th}$ extension of the $i^{th}$ phase, we can follow the suffix link.<br />
Using suffix links, a phase can be processed in $O(\text{number of nodes})$ time, and a compressed trie has $O(N)$ nodes. So each phase now takes $O(N)$ time and there are $N$ such phases, which makes the overall complexity $O(N^2)$ at this point.<br /></p>
<p>Before going further, there is one more problem: the edge labels. If the edge labels are stored as strings, the space complexity is $O(N^2)$ no matter how few nodes there are. So instead of storing a string for each edge label, store two variables that denote the start index and the end index of the label in the string. That way each label takes constant space and the overall space complexity is $O(N)$.</p>
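<p>A rough comparison of the two representations, as a sketch. It assumes a string whose characters are all distinct, so that every suffix is a leaf edge hanging directly off the root. The tuple layout <code>(start, end)</code> with an exclusive end is a choice made for this example.</p>

```python
text = "abcdef$"

# Assumption for the example: all characters are distinct, so the suffix tree
# is a root with one leaf edge per suffix, labelled text[start:len(text)].
edges = [(start, len(text)) for start in range(len(text))]


def label(edge):
    """Recover the label of an edge from its (start, end) index pair."""
    start, end = edge
    return text[start:end]


chars_as_strings = sum(len(label(edge)) for edge in edges)  # storing copies
ints_as_indices = 2 * len(edges)                            # storing indices
print(label(edges[3]))                    # def$
print(chars_as_strings, ints_as_indices)  # 28 14
```

<p>The string representation grows as $N(N+1)/2$ characters here, while the index representation needs two integers per edge.</p>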
<p>There are several more tricks that help in bringing down the complexity to linear. <br /></p>
<p>In any phase, the extension rules are applied in order: rule 1 is applied in the first few extensions, rule 2 in the next few, and rule 3 in the rest.<br /></p>
<p>If rule 3 is applied for the first time in extension $j$ of the $i^{th}$ phase, then rule 3 will also be applied in all the extensions after it, i.e. in extensions $j+1$ to $i$. So it is safe to end a phase as soon as rule 3 applies.<br /></p>
<p>Once a leaf node is created it always remains a leaf node; only the label of the edge between the leaf and its parent keeps growing through applications of rule 1. Moreover, the end index (discussed earlier) is the same for all the leaf nodes, so all the rule 1 extensions of a phase can be applied in constant time by maintaining a single global end index shared by all the leaf nodes.<br /></p>
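<p>One possible way to share the end index is to let every leaf edge hold a reference to the same mutable object. The classes <code>GlobalEnd</code> and <code>LeafEdge</code> below are hypothetical names for this sketch, which replays only the first three phases of $"banana"$, where each new character creates a leaf directly under the root.</p>

```python
class GlobalEnd:
    """Mutable box holding the end index shared by every leaf edge."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, start, end):
        self.start = start  # index where the label starts in the text
        self.end = end      # the shared GlobalEnd, not a copy of its value

    def label(self, text):
        return text[self.start:self.end.value]


text = "banana"
end = GlobalEnd()
leaves = []
for i in range(3):                   # 'b', 'a', 'n' are all new characters
    end.value = i + 1                # rule 1 for every existing leaf at once
    leaves.append(LeafEdge(i, end))  # rule 2 creates the new leaf

print([leaf.label(text) for leaf in leaves])  # ['ban', 'an', 'n']
end.value = len(text)                # later phases only move the shared end
print([leaf.label(text) for leaf in leaves])  # ['banana', 'anana', 'nana']
```

<p>A single assignment to <code>end.value</code> extends every leaf label, which is why rule 1 costs constant time per phase rather than constant time per leaf.</p>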
<p>New leaf nodes are created when rule 2 is applied, and every extension in which rule 2 is applied in some phase $i-1$ becomes a rule 1 extension in the next phase $i$ and in all later phases.<br /></p>
<p>Since there are at most $N$ leaves, rule 2 is applied at most $N$ times, which means all the phases together can be completed in $O(N)$ time. </p>
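<p>These observations can be checked on a small input without building the tree. The sketch below decides which rule each extension would use by plain string searching, under the following assumption: the path of $S[j..i-1]$ ends at a leaf exactly when that string occurs only once in $S[1..i-1]$, and rule 3 applies exactly when $S[j..i]$ already occurs in $S[1..i-1]$. The function names are hypothetical and the code is far slower than the real algorithm.</p>

```python
def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern, overlaps included."""
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def extension_rules(s):
    """Return, for each phase, the rule used by each of its extensions."""
    phases = []
    for i in range(len(s)):              # phase i+1 introduces s[i]
        built, ch = s[:i], s[i]
        rules = []
        for j in range(i + 1):           # extension j+1 inserts s[j:i+1]
            beta = s[j:i]
            if beta and count_occurrences(built, beta) == 1:
                rules.append(1)          # path of beta ends at a leaf
            elif beta + ch in built:
                rules.append(3)          # beta + ch is already present
            else:
                rules.append(2)          # a new leaf has to be created
        phases.append(rules)
    return phases


phases = extension_rules("banana$")
print(phases[4])                               # [1, 1, 1, 3, 3]
print(phases[6])                               # [1, 1, 1, 2, 2, 2, 2]
print(sum(rules.count(2) for rules in phases)) # 7
```

<p>In every phase the rules appear in the order 1, 2, 3, and rule 2 is used 7 times in total, once for each leaf of the suffix tree of $"banana\$"$.</p>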