Sequence sets and the Burrows-Wheeler Transform
=====

The <a href="https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform">Burrows-Wheeler Transform</a> (**BWT**) is traditionally used to improve the compressibility of large texts. In bioinformatics it also serves as a search index: once the index is built, an arbitrarily long string can be searched for a substring $W$ in $O(\ |W|\ )$ time. The BWT is generally applicable to strings of symbols in which each symbol has a well-defined, unique position, such as genomes and chromosomes.

Every possible substring of the original string of symbols can be represented by a range of BWT entries.

[figure]

Spiral's sequence set format extends the BWT to support multiple strings with ambiguous position. Because it builds on the memory-efficient BWT representation, a sequence set is extremely compact, which makes it suitable for searching for substrings and overlaps in large numbers of similar strings (such as sequencing reads).

Every possible substring of every original string, as well as sequences that overlap two or more original strings, can be represented by a range of sequence set entries.

[figure]

Conventions
-----

Terminology and definitions used here are similar to <a href="http://bioinformatics.oxfordjournals.org/content/25/14/1754">doi:10.1093/bioinformatics/btp324</a>, _Fast and accurate short read alignment with Burrows–Wheeler transform_ by Heng Li and Richard Durbin, with some important differences.

We use common programming conventions for indexing and ranges:
 * Indexing is zero-based: the first element of a sequence is element 0
 * Ranges are represented as <a href="http://mathworld.wolfram.com/Half-ClosedInterval.html">half-closed intervals</a> and written [start, end): start is included, end is excluded
 
For example this range of integers:

> 2, 3, 4, 5, 6

...is represented by the interval [2,7). The value at index 2 is 4. There are five items in the list, equivalent to _end - start = 7 - 2 = 5_.
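
Python's built-in `range` and slicing follow the same half-closed convention, so the example above can be checked directly. This is only an illustration of the notation; the variable name `values` is made up for the example.

```python
# range(2, 7) is the half-closed interval [2, 7): 2 is included, 7 is not.
values = list(range(2, 7))

print(values)                # [2, 3, 4, 5, 6]
print(values[2])             # 4, the value at (zero-based) index 2
print(len(values) == 7 - 2)  # True: number of items equals end - start
```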

Definitions
-----

For an alphabet of symbols $\Sigma = \{ A, C, G, T \}$

$\alpha = $ one symbol of alphabet $\Sigma$

${X}$ = source string $\alpha_{0}\alpha_{1}\ldots\alpha_{n-1}$

$\$ =$ end of string

${n}$ = number of symbols in ${X}$

${i} = 0,1,\ldots,{n-1}$

${X}[i] = \alpha_i$

${X}[i,j+1) = \alpha_i\ldots\alpha_j$ (substring)

${X_i} = X[i,n)$ (suffix)

$X[n] = \$$ (string terminator)

$S(i)$ = the index into $X$ at which the $i$th lexicographically smallest suffix starts (the suffix array)

$B$ is the BWT string: list of symbols that precede the first symbol in the sorted suffix list

>$B[i] = \$$ when $S(i) = 0$

>$B[i] = X[S(i) - 1]$ otherwise
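
As a minimal sketch of these two definitions, $S$ and $B$ can be computed by sorting the suffixes directly. The helper names `suffix_array` and `bwt` are hypothetical, the terminator is assumed to be stored as the last character of the string, and the naive sort is far too slow for real genomes.

```python
def suffix_array(text):
    """Start indexes of the suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sa):
    """B[i] is the symbol preceding suffix S(i), or '$' when S(i) == 0."""
    return ''.join('$' if start == 0 else text[start - 1] for start in sa)

X = "ATTGCTAC$"  # assumption: '$' is stored as the final character
S = suffix_array(X)
B = bwt(X, S)

print(S)  # [8, 6, 0, 7, 4, 3, 5, 2, 1]
print(B)  # CT$AGTCTA
```

Handling the $S(i) = 0$ case explicitly keeps the code in step with the definition instead of relying on Python's negative indexing.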

$W =$ a potential substring of $X$

Bounds:

>$\underline{R}(W) = \min\{k : W \text{ is a prefix of } X_{S(k)}\}$

>$\overline{R}(W) = \max\{k : W \text{ is a prefix of } X_{S(k)}\}$

For empty string $W = \$$:

>$\underline{R}(W) = 0$

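>$\overline{R}(W) = n$ (the index of the last sorted suffix: with $n$ symbols in $X$ there are $n + 1$ suffixes, counting the one that consists of $\$$ alone)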

_Note that Li and Durbin define $\underline{R}(W) = 1$ for empty string W, but this requires off-by-one fix-ups later._

Set of positions of all occurrences of $W$ in $X$:

>$\{S(k):\underline{R}(W) \le k \le \overline{R}(W)\}$

Is W a substring of X?

> If $\underline{R}(W) > \overline{R}(W)$ then $W$ is not a substring of $X$.

> If $\underline{R}(W) = \overline{R}(W)$ then $W$ matches exactly one BWT entry.

> If $\underline{R}(W) < \overline{R}(W)$ then $W$ matches all BWT entries between (inclusive).

$SA$ interval $= \big[\ \underline{R}(W), \overline{R}(W)\ \big] =$ the range of BWT entries that match $W$ (inclusive). 

When using half-closed intervals, _end - start = number of matches_:

$SA$ interval $= \big[\ \underline{R}(W), \overline{R}(W) + 1\ \big)$ (exclusive)
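
The bounds can be checked by brute force straight from their definitions, without any BWT machinery. The sketch below is an assumption-laden illustration: `sa_interval` is a hypothetical helper, the terminator is stored as the last character of the text, and sorting every suffix is only practical for tiny inputs. It relies on the fact that all suffixes sharing a prefix are adjacent in sorted order.

```python
from bisect import bisect_left

def sa_interval(text, query):
    """Half-closed SA interval [start, end) of query, found by brute force."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    # Every suffix that sorts before the query cannot have it as a prefix.
    start = bisect_left(suffixes, query)
    # Suffixes that start with the query form one contiguous run from there.
    end = start
    while end < len(suffixes) and suffixes[end].startswith(query):
        end += 1
    return start, end

X = "ATTGCTAC$"
for query in ["GCT", "GA", "T", ""]:
    start, end = sa_interval(X, query)
    print("{!r}: [{}, {}) has {} match(es)".format(query, start, end, end - start))
```

An empty interval (start equal to end) corresponds to the case $\underline{R}(W) > \overline{R}(W)$ in the inclusive notation.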

Backward search in $O(\ |W|\ )$ time:

>$C(\alpha) =$ the number of symbols in $X[0,n)$ (every symbol of $X$ except the terminator $\$$) that are lexicographically smaller than $\alpha$

>$O(\alpha,i) =$ the number of occurrences of $\alpha$ in $B[0,i]$ (inclusive), with $O(\alpha,-1) = 0$

If $W$ is a substring of $X$:

>$\underline{R}(\alpha{W}) = C(\alpha) + O(\alpha,\underline{R}(W)-1)+1$

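>$\overline{R}(\alpha{W}) = C(\alpha) + O(\alpha, \overline{R}(W))$

$\alpha W$ is a substring of $X$ if and only if $\underline{R}(\alpha W) \le \overline{R}(\alpha W)$.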
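
The recurrence can be applied iteratively, consuming the query from its last symbol to its first. The following is a minimal sketch rather than an efficient implementation: `backward_search` is a hypothetical name, the text is assumed to end with `$`, and $O$ is recomputed by rescanning $B$ on every call, so this version does not achieve $O(\ |W|\ )$ time (a real index precomputes or samples the occurrence counts).

```python
def backward_search(text, query):
    """Inclusive SA interval (lo, hi) of query in text; lo > hi means no match.

    Assumes text ends with the terminator '$'.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = ''.join(text[i - 1] if i > 0 else '$' for i in sa)
    body = text[:-1]  # every symbol except the terminator
    C = dict((a, sum(1 for x in body if x < a)) for a in set(body))

    def O(a, i):
        # Occurrences of a in bwt[0..i] inclusive; O(a, -1) is 0.
        return bwt[:i + 1].count(a)

    lo, hi = 0, len(text) - 1  # the empty string matches every entry
    for a in reversed(query):
        if a not in C:
            return (1, 0)  # symbol never occurs: report an empty interval
        lo = C[a] + O(a, lo - 1) + 1
        hi = C[a] + O(a, hi)
        if lo > hi:
            break  # no suffix starts with this part of the query
    return lo, hi

print(backward_search("ATTGCTAC$", "GCT"))  # (5, 5)
print(backward_search("ATTGCTAC$", "T"))    # (6, 8)
print(backward_search("ATTGCTAC$", "GA"))   # (5, 4)
```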

An example BWT implementation
-----

The following code demonstrates a simple (but inefficient) method for implementing and querying the BWT. It is intended to illustrate the general BWT approach. Real implementations use vastly more efficient methods for BWT production and search.

In [52]:
# For string X
X = "ATTGCTAC$"

# Sort all suffixes
suffixes = sorted([X[i:] for i in range(len(X))])

print "# suffix"
for i, suffix in enumerate(suffixes):
    print "{i} {suffix}".format(i=i, suffix=suffix)

# suffix
0 $
1 AC$
2 ATTGCTAC$
3 C$
4 CTAC$
5 GCTAC$
6 TAC$
7 TGCTAC$
8 TTGCTAC$


In [45]:
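# Calculate S: the start index in X of each sorted suffix.
# Every suffix runs to the end of X, so its start index follows from its length.
S = []
for suffix in suffixes:
    S.append(len(X) - len(suffix))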
    
print S

[8, 6, 0, 7, 4, 3, 5, 2, 1]


In [46]:
# C(a) =  # of symbols in X[0,n−1) that are lexicographically smaller than a.

# Precalculate the C(a) table. This lets us look up C(a) without knowing B.
Ca = {}

# all unique symbols in X except for $
symbols = ''.join(sorted(list(set(X)))[1:])

for symbol in symbols:
    smaller = [x for x in X[:-1] if x < symbol]
    print symbol + ': ' + str(smaller)
    Ca[symbol] = len(smaller)

print '\n', Ca

A: []
C: ['A', 'A']
G: ['A', 'C', 'A', 'C']
T: ['A', 'G', 'C', 'A', 'C']

{'A': 0, 'C': 2, 'T': 5, 'G': 4}


In [47]:
# B: X[S(i)-1], or '$' when S(i) == 0
def B(i):
    if S[i] == 0:
        return '$'
    return X[S[i]-1]

# n == |X| == |B| == |S|. X includes the '$' terminator in this code,
# so n counts it too.
n = len(X)

# String representation of B
B_str = ''.join([B(i) for i in range(n)])

print B_str
print n

CT$AGTCTA
9


In [48]:
# O(a,i): number of occurrences of a in B up to index i (inclusive)
def O(a, i):
    count = 0
    for base in B_str[:i+1]:
        if base == a:
            count += 1
    return count

# r underbar of the empty string is 0
r_cache = {'': 0}

# r underbar: lower limit of substring W in BWT
def r(w):
    # Fill the cache for every suffix of w, shortest first, so that each
    # recursive call r(W) below is answered from the cache.
    for aW in [w[i:] for i in range(len(w))][::-1]:
        if aW not in r_cache:
            a = aW[0]
            W = aW[1:]
            r_cache[aW] = Ca[a] + O(a, r(W) - 1) + 1
    return r_cache[w]

# R overbar of the empty string is n - 1, the index of the last BWT entry
R_cache = {'': n - 1}

# R overbar: upper limit of substring W in BWT
def R(w):
    # Same cache-filling scheme as r(): shortest suffix of w first.
    for aW in [w[i:] for i in range(len(w))][::-1]:
        if aW not in R_cache:
            a = aW[0]
            W = aW[1:]
            R_cache[aW] = Ca[a] + O(a, R(W))
    return R_cache[w]

# SA value: compute the inclusive interval [i, j] for W
def calc_SA(w):
    return [r(w), R(w)]

# Format the SA interval of W as a half-closed [inclusive, exclusive) range
def SA(W):
    i, j = calc_SA(W)
    return "'{}': [{}, {})".format(W, i, j + 1)

In [49]:
# Let's find SA values for some substrings
print "# suffix"
for i, suffix in enumerate(suffixes):
    print "{i} {suffix}".format(i=i, suffix=suffix)

print "\nB = " + B_str + "\n"

for symbol in symbols:
    print symbol + ':', SA(symbol)

queries = [
    'GCT', # exactly one match
    'GA',  # not in X
    'T',   # more than one match
    '',    # empty string, full range
]

print
for q in queries:
    print "SA('" + q + "') = " + str(SA(q))

# suffix
0 $
1 AC$
2 ATTGCTAC$
3 C$
4 CTAC$
5 GCTAC$
6 TAC$
7 TGCTAC$
8 TTGCTAC$

B = CT$AGTCTA

A: 'A': [1, 3)
C: 'C': [3, 5)
G: 'G': [5, 6)
T: 'T': [6, 9)

SA('GCT') = 'GCT': [5, 6)
SA('GA') = 'GA': [5, 5)
SA('T') = 'T': [6, 9)
SA('') = '': [0, 9)


In [40]:
# Turn BWT entry numbers back into indexes in X by looking them up in S
def find_it(query):
    begin, end = calc_SA(query)

    print '\n{} is found at:'.format(query)
    for i in range(begin, end + 1):
        print ' {} ({})'.format(S[i], X[S[i]:S[i]+len(query)])
        
print X

find_it('A')
find_it('TG')
find_it('TAC')

ATTGCTAC$

A is found at:
 6 (A)
 0 (A)

TG is found at:
 2 (TG)

TAC is found at:
 5 (TAC)
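
The positions reported above can be cross-checked against a plain left-to-right scan of `X`. This sketch uses only `str.find`; the helper name `naive_positions` is made up for the example. Note that it returns positions in increasing order, whereas the BWT lookup returns them in suffix-sorted order.

```python
def naive_positions(text, query):
    """All start indexes of query in text, found by repeated str.find calls."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        start = text.find(query, start + 1)
    return positions

X = "ATTGCTAC$"
for query in ["A", "TG", "TAC"]:
    print("{}: {}".format(query, naive_positions(X, query)))
```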


The Biograph sequence set format
-----