# JC2BIM 2021: TP Burrows-Wheeler Transform
>Téo Lemane, Claire Lemaitre

In [None]:
from typing import List, Dict, Tuple, Callable
from utils import (
    print_table,    # Pretty-printing. Usage: print_table(text, suffix_array, bwt, rank)
    naive_matching, # Naive pattern matching, for comparison. Usage: naive_matching(pattern, text)
    timeit          # Function timer. Usage: timeit(func, *args)
)

## 1. Construction

### 1.1 Suffix Array

**Q1)** Write a function `suffix_array(seq, term)` which returns the suffix array of `seq`. 

`term` is the character marking the end of the sequence. Do not forget to append it to your sequence: this is crucial for the BWT.

(Hints: [sorted](https://docs.python.org/3/howto/sorting.html#sorting-basics), [key](https://docs.python.org/3/howto/sorting.html#key-functions), [lambda](https://docs.python.org/3.10/reference/expressions.html?#lambda))
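
As a small illustration of these hints, on made-up data unrelated to sequences: `sorted` accepts any iterable, and `key` is called on each element to produce the value used for comparison. The sketch below sorts the positions of a list by the word found at each position.

```python
words = ["pear", "apple", "fig"]

# Sort the positions 0..len(words)-1, comparing the words found at those
# positions rather than the positions themselves.
order = sorted(range(len(words)), key=lambda i: words[i])

print(order)  # [1, 2, 0]
```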

In [None]:
sa_t = List[int] # a suffix array is a list of integers

def suffix_array(seq: str, term: str="$") -> sa_t:
    pass

In [None]:
from data import (
    seq_test, # test sequence
    sa_test   # expected suffix array
)

print_table(seq_test+"$", sa_test, None, None)

### 1.2 BWT

Reminder:

$BWT_S[i] = \left\{
    \begin{array}{rll}
         S[SA[i]-1] & \mbox{if} & SA[i]>0 \\
         '\$' & \mbox{if} & SA[i]=0
    \end{array}\right.$

**Q2)** Write a function `bwt(seq, sa)` which returns the BWT of `seq`.
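
One possible way to cross-check a result is to compute the BWT by another route. The sketch below does not use the suffix array: it sorts all cyclic rotations of the text and keeps the last column, which gives the same string when the terminator is unique and smaller than every other character. The name `bwt_by_rotations` is hypothetical and not part of this TP. It needs quadratic memory, so it is only suitable for short test strings.

```python
def bwt_by_rotations(text: str) -> str:
    """Return the BWT of `text`, which must already end with its terminator."""
    # All cyclic rotations of the text, in lexicographic order.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

print(bwt_by_rotations("banana$"))  # annb$aa
```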

In [None]:
bwt_t = List[str] # bwt is a list of characters

def bwt(seq: str, sa: sa_t) -> bwt_t:
    pass

In [None]:
from data import bwt_test # expected bwt of seq_test
print_table(seq_test+"$", sa_test, bwt_test, None)

### 1.3 FM-index: BWT + 2 tables (rank and occ) and the LF-mapping function

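**Q3)** Write a function `rank_occ(bwt, lexi)` which returns the rank array and the occurrence table: `rank[i]` is the number of occurrences of `bwt[i]` in `bwt[:i]`, and `occ[c]` is the total number of occurrences of `c` in the BWT.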

**Example:**  
BWT = $T_0T_1G_0\$_0G_1C_0C_1A_0A_1A_2A_3A_4C_2A_5C_3$  
rank = [0, 1, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4, 2, 5, 3]  
occ = {'$': 1, 'A': 6, 'C': 4, 'G': 2, 'T': 2}
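
The example above can be reproduced with a single pass over the BWT. This is only a sketch, written at module level with hypothetical names (`example_bwt`, `seen`, `ranks`, `totals`). It ignores `lexi`, so a character that never appears in the BWT gets no entry, whereas a `lexi`-driven version could give it a count of 0.

```python
from collections import Counter

example_bwt = "TTG$GCCAAAAACAC"

seen = Counter()  # occurrences of each character met so far
ranks = []
for ch in example_bwt:
    ranks.append(seen[ch])  # number of earlier occurrences of ch
    seen[ch] += 1

totals = dict(sorted(seen.items()))  # keys in lexicographic order

print(ranks)   # [0, 1, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4, 2, 5, 3]
print(totals)  # {'$': 1, 'A': 6, 'C': 4, 'G': 2, 'T': 2}
```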

In [None]:
rank_t = List[int]     # rank is a list of integers 
occ_t = Dict[str, int] # occ is a dictionary

def rank_occ(bwt: bwt_t, lexi: str="$ACGT") -> Tuple[rank_t, occ_t]:
    pass

In [None]:
from data import (
    rank_test, # expected rank of bwt_test
    occ_test   # expected occ of bwt_test
)
print_table(seq_test+"$", sa_test, bwt_test, rank_test)

**Q4)** Write a function `LF(c, i, occ, lexi)` which returns the position of $c_i$ in $F$ (the first column, i.e. the sorted characters of the sequence), where $c_i$ is the occurrence of character $c$ with rank $i$ in $L$ (the BWT).
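
Because $F$ is sorted, all characters smaller than $c$ come before the block of $c$, and occurrences of $c$ keep the same relative order in $F$ and $L$. A minimal sketch of this arithmetic, assuming that `lexi` lists the alphabet in sorted order and that ranks are 0-based as in the example of Q3 (`lf_position` and `occ_example` are hypothetical names):

```python
def lf_position(c: str, i: int, occ: dict, lexi: str = "$ACGT") -> int:
    """Row of F holding the occurrence of `c` with rank `i` (0-based)."""
    # Every character that sorts before c fills the rows of F above c's block.
    smaller = sum(occ.get(x, 0) for x in lexi[:lexi.index(c)])
    return smaller + i

occ_example = {"$": 1, "A": 6, "C": 4, "G": 2, "T": 2}
print(lf_position("C", 2, occ_example))  # 9  (1 '$' + 6 'A' + rank 2)
print(lf_position("$", 0, occ_example))  # 0
```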

In [None]:
def LF(c: str, i: int, occ: occ_t, lexi: str="$ACGT") -> int:
    pass

**Q5)** Using the LF-mapping property, write a function `reverse(bwt, rank, occ)` which returns the original sequence from its BWT.
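
The idea can be sketched as a walk that starts on row 0 (the row of $F$ holding the terminator) and repeatedly jumps from a character in $L$ to the same character in $F$, reading the text from its last character to its first. The block below is self-contained and therefore rebuilds the rank and block-start tables itself; `invert_bwt` is a hypothetical name, and returning the text without its terminator is an assumption.

```python
def invert_bwt(last: str, lexi: str = "$ACGT") -> str:
    """Rebuild the text (without terminator) from its BWT `last`.

    Assumes the terminator is lexi[0] and occurs exactly once.
    """
    counts, rank = {}, []
    for ch in last:
        rank.append(counts.get(ch, 0))  # earlier occurrences of ch
        counts[ch] = counts.get(ch, 0) + 1

    first, total = {}, 0  # first[c]: row of F where the block of c starts
    for ch in lexi:
        first[ch] = total
        total += counts.get(ch, 0)

    row, out = 0, []
    for _ in range(len(last) - 1):  # one step per non-terminator character
        ch = last[row]
        out.append(ch)                 # characters come out last to first
        row = first[ch] + rank[row]    # LF-mapping
    return "".join(reversed(out))

print(invert_bwt("annb$aa", lexi="$abn"))  # banana
```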

In [None]:
def reverse(bwt: bwt_t, rank: rank_t, occ: occ_t) -> str:
    pass

## 2. Pattern matching with the FM-index

### 2.1 Useful functions
**Q6.1)** Write a function `find_first(c, i, j, bwt)` which returns the position of the first `c` in $BWT[i,j]$.  
**Q6.2)** Write a function `find_last(c, j, i, bwt)` which returns the position of the last `c` in $BWT[i,j]$.

In [None]:
def find_first(c: str, i: int, j: int, bwt: bwt_t) -> int:
    pass

def find_last(c: str, j: int, i: int, bwt: bwt_t) -> int:
    pass

### 2.2 Pattern matching function
**Q7)** Write a function `backward_search(pattern, bwt, rank, occ)` which returns `True` if `pattern` is found, `False` otherwise.
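
A minimal sketch of the backward search, under the same assumptions as before (terminator first in `lexi`, 0-based ranks). It keeps an inclusive range of rows `[lo, hi]` whose suffixes start with the part of the pattern read so far, and narrows it one character at a time from the end of the pattern. The list comprehension plays the role of `find_first` and `find_last`; `fm_contains` is a hypothetical name, and the tables are rebuilt inside the function only to keep the block self-contained.

```python
def fm_contains(pattern: str, last: str, lexi: str = "$ACGT") -> bool:
    """True if `pattern` occurs in the text whose BWT is `last`."""
    counts, rank = {}, []
    for ch in last:
        rank.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    first, total = {}, 0  # first[c]: row of F where the block of c starts
    for ch in lexi:
        first[ch] = total
        total += counts.get(ch, 0)

    lo, hi = 0, len(last) - 1  # every row matches the empty pattern
    for c in reversed(pattern):
        if c not in first:
            return False
        # Rows of the current range whose BWT character is c.
        hits = [r for r in range(lo, hi + 1) if last[r] == c]
        if not hits:
            return False
        lo = first[c] + rank[hits[0]]   # LF of the first c in the range
        hi = first[c] + rank[hits[-1]]  # LF of the last c in the range
    return True

print(fm_contains("ana", "annb$aa", lexi="$abn"))  # True
print(fm_contains("nab", "annb$aa", lexi="$abn"))  # False
```

The linear scan over `[lo, hi]` keeps the sketch short; practical FM-indexes replace it with precomputed occurrence counts.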

In [None]:
def backward_search(pattern: str, bwt: bwt_t, rank: rank_t, occ: occ_t) -> bool:
    pass

## 3. Application to real data

TODO : describe the data

**Q8)** Apply your indexing data structure and your pattern matching function to search for the XX k-mers of `seq_queries` in `seq`.

In [None]:
from data import seq, seq_queries

**Q9)** Perform the same queries using the function `naive_matching(pattern, text)`, and compare the results and running times.

If you need to check your FM-index, the expected tables for `seq` can be imported:

In [None]:
from data import sa_seq, bwt_seq, rank_seq, occ_seq