# Minimum Skew Problem

We're going to solve the Minimum Skew Problem from chapter 1.7 of the book. I won't repeat the biology here, but the general idea is that the reverse strand has fewer Gs and the forward strand has fewer Cs. This means that as you walk along the reverse strand you'll see more Cs than Gs, and as you walk along the forward strand you'll see more Gs than Cs, so the running difference between Gs and Cs falls along the reverse strand and rises along the forward strand.

First let's calculate the running difference between the number of Gs and the number of Cs along a given genome:

In [2]:
genome = 'CCTATCGGTGGATTAGCATGTCCCTGTACGTTTCGCCGCGAACTAGTTCACACGGCTTGATGGCAAATGGTTTTTCCGGCGACCGTAATCGTCCACCGAG'

In [3]:
diff = 0  # running (#G - #C) over the part of the genome read so far
skew = []
for c in genome:
    if c == 'G':
        diff += 1
    elif c == 'C':
        diff -= 1
    skew.append(diff)

We stored the difference at each point in the genome in the array `skew`. Now let's find the minimum:

In [4]:
m = min(skew)

Now that we know the minimum, let's find all the positions in the genome where the skew has this value:

In [5]:
answers = []
for i, d in enumerate(skew):
    if d == m:
        answers.append(i + 1)  # report 1-based positions
print(answers)

[53, 97]


What is the complexity of the algorithm in terms of the length of the genome?

Assuming the length of the genome is $n$, we visited each position once to build the skew array, then scanned the skew array (which also has length $n$) once to find its minimum and once more to collect the positions where that minimum occurs. That is three linear passes, so $O(n)$ operations in total, linear in the size of the genome.
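
The same steps can be packaged as one function. This is only a sketch of one possible arrangement, not the book's reference solution: the name `minimum_skew` is made up for this example, the running sums are built with `itertools.accumulate`, and positions are reported 1-based as in the cells above.

```python
from itertools import accumulate

def minimum_skew(genome):
    """Return the 1-based positions where the running (#G - #C) is smallest."""
    # Each base contributes +1 for G, -1 for C and 0 for anything else.
    steps = ((c == 'G') - (c == 'C') for c in genome)
    skew = list(accumulate(steps))
    if not skew:
        return []  # an empty genome has no positions to report
    lowest = min(skew)
    return [i + 1 for i, s in enumerate(skew) if s == lowest]

print(minimum_skew('CATGGGCATCGGCCATACGCC'))  # [21]
```

It still makes a constant number of passes over the genome, so the running time stays linear.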

# Hamming Distance

The Hamming distance between two strings is the number of positions at which their characters differ. The strings have to be of the same length for the Hamming distance to make sense. The Hamming distance is an old concept and has applications in error correction and in communication channels. Let's try to implement it.

In [7]:
p = 'GGGCCGTTGGT'
q = 'GGACCGTTGAC'

In [8]:
def hamming_distance(p, q):
    l = len(p)
    d = 0
    for i in range(l):
        if p[i] != q[i]:
            d += 1
    return d

In [9]:
hamming_distance(p, q)

3

It's indeed very straightforward to implement.
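
For comparison, here is a more compact sketch of the same idea. The name `hamming_distance_checked` is hypothetical, and the explicit length check is an addition of this example (the function above simply assumes both strings have the same length).

```python
def hamming_distance_checked(p, q):
    """Count the positions at which p and q differ."""
    if len(p) != len(q):
        raise ValueError('strings must have the same length')
    # zip pairs up the characters; each mismatch (True) counts as 1 in the sum.
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance_checked('GGGCCGTTGGT', 'GGACCGTTGAC'))  # 3
```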

# Approximate Matching

Equipped with the concept of Hamming distance, we can do approximate matching between strings, e.g. we say two strings approximately match if there are at most $d$ mismatches between them. We choose $d$ based on how much error seems reasonable for our setting and application.

In [10]:
def approximate_match(p, q, d):
    # True when p and q differ in at most d positions.
    return hamming_distance(p, q) <= d

Now let's try to find all substrings of a large string (text) that approximately match a shorter one (pattern).

In [11]:
text = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC'
pattern = 'ATTCTGGA'

In [14]:
def find_approximate_matches(text, pattern, d):
    l = len(pattern)
    for i in range(0, len(text) - l + 1):
        if approximate_match(pattern, text[i: i + l], d):
            print(i)

In [15]:
find_approximate_matches(text, pattern, 3)

6
7
26
27
78


What is the complexity of this algorithm?

With the text being of length $n$ and the pattern being of length $k$, there are $n - k + 1$ substrings to compare the pattern to. Each comparison takes $O(k)$, so the total complexity is $O((n - k + 1)\times k)$. For a fixed $k$, this is $O(n)$.
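
The worst case stays the same, but a window can be abandoned as soon as it has more than $d$ mismatches. The sketch below shows that idea. The name `approximate_match_positions` is made up for this example, and it returns the positions as a list instead of printing them.

```python
def approximate_match_positions(text, pattern, d):
    """Return the start positions where pattern matches text with at most d mismatches."""
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + k]):
            if a != b:
                mismatches += 1
                if mismatches > d:
                    break  # too many mismatches, give up on this window early
        else:
            # The inner loop finished without a break: at most d mismatches.
            positions.append(i)
    return positions

print(approximate_match_positions('ACGTACGT', 'ACG', 0))  # [0, 4]
```

Returning a list makes the result easy to reuse, for example to count the matches with `len`.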

# Frequent Words Problem

Now assume we're given a text and we want to find the most frequent sequence of length $k$ in it. The simplest approach is to keep track of the number of times each sequence of length $k$ is seen in the text and choose the most common one.

In [16]:
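def frequent_words(text, k):
    # Count every kmer (substring of length k) in text.
    frequency = {}
    for i in range(0, len(text) - k + 1):
        kmer = text[i: i + k]
        if kmer not in frequency:
            frequency[kmer] = 0
        frequency[kmer] += 1
    # Return one kmer with the highest count (the first one seen if there is a tie).
    return max(frequency, key=frequency.get)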
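
When several kmers tie for the highest count, it can be useful to get all of them back. Here is a possible sketch using `collections.Counter`. The name `most_frequent_kmers` and the choice to return a sorted list are assumptions of this example.

```python
from collections import Counter

def most_frequent_kmers(text, k):
    """Return every kmer that reaches the highest count in text, sorted alphabetically."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []  # text is shorter than k
    best = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == best)

print(most_frequent_kmers('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4))  # ['CATG', 'GCAT']
```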

# Frequent Words with Mismatch and Reverse Complement Problem

Finally, we try to find the most common kmer with up to one mismatch allowed. We assume a kmer and its reverse complement to be the same.


Let's find the reverse complement first. This is not the most elegant solution:

In [61]:
def reverse_complement(seq):
    rc = ''
    for c in seq[-1::-1]:
        if c == 'G':
            rc += 'C'
        if c == 'C':
            rc += 'G'
        if c == 'A':
            rc += 'T'
        if c == 'T':
            rc += 'A'
    return rc

In [62]:
reverse_complement('ATCG')

'CGAT'
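
A shorter alternative is a translation table. This is a sketch only: the names `COMPLEMENT` and `reverse_complement_translate` are made up here. Unlike the loop above, which silently drops any character other than A, C, G and T, this version leaves such characters unchanged.

```python
# Map each base to its complement: A <-> T and C <-> G.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement_translate(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement_translate('ATCG'))  # CGAT
```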

Now let's calculate the neighborhood for $d = 1$. We just need to generate all sequences that have one base changed compared to the input:

In [36]:
def generate_neighborhood(seq):
    neighborhood = {}
    bases = ['A', 'C', 'G', 'T']
    for i in range(len(seq)):
        for d in bases:
            if seq[i] != d:
                tmp = seq[:i] + d + seq[i + 1:]
                neighborhood[tmp] = True
    neighborhood[seq] = True
    return neighborhood

In [46]:
generate_neighborhood('GATG')

{'AATG': True,
 'CATG': True,
 'TATG': True,
 'GCTG': True,
 'GGTG': True,
 'GTTG': True,
 'GAAG': True,
 'GACG': True,
 'GAGG': True,
 'GATA': True,
 'GATC': True,
 'GATT': True,
 'GATG': True}

Now we can find the most frequent kmers with mismatches considered. For each kmer in the text, just increment the counts of all its neighbors, then choose the strings with the highest count.

In [63]:
def frequent_words(text, k, d):
    # Note: d is accepted but not used; generate_neighborhood is fixed at one mismatch.
    frequency = {}
    neighborhoods = {}
    for i in range(0, len(text) - k + 1):
        kmer = text[i: i + k]
        if kmer not in neighborhoods:
            # Cache the neighbors of the kmer together with those of its reverse complement.
            neighborhoods[kmer] = generate_neighborhood(kmer)
            neighborhoods[kmer].update(generate_neighborhood(reverse_complement(kmer)))
        for seq in neighborhoods[kmer]:
            if seq not in frequency:
                frequency[seq] = 0
            frequency[seq] += 1
    # Print every sequence that reaches the highest count.
    m_key = max(frequency, key=frequency.get)
    m_value = frequency[m_key]
    for key in frequency:
        if frequency[key] == m_value:
            print(key)

In [64]:
frequent_words('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)

ATGT
ACAT
ATGC
GCAT
TCAT
ATGA


What is the complexity here?

We still have $O(n)$ kmers in the text. For $d = 1$, each kmer has $3\times k + 1$ neighbors (counting itself), or $O(k)$ in short. Every neighbor of every kmer in the text has to be built and looked up in the dictionary, and each of those steps takes $O(k)$ time, so the total running time is $O(n\times k\times k)$.

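How do we calculate the number of neighbors for arbitrary $d$?

$\quad$ We have to sum the number of neighbors with exactly $i$ mismatches over all values of $i$ from zero to $d$.

For a given $i$ there are ${k \choose i} \times 3^i$ neighbors with exactly $i$ mismatches, so the whole neighborhood contains $\sum_{i=0}^{d} {k \choose i} \times 3^i$ sequences. For $d = 1$ this gives $1 + 3\times k$, the same count as before.

Why $3^i$ rather than $4^i$?

$\quad$ Because each of the $i$ selected positions has to change to one of the $3$ other bases. If a selected position were allowed to keep its original base, the resulting sequence would differ in fewer than $i$ positions and would already be counted for a smaller value of $i$.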
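
A quick way to sanity-check a neighbor count is to enumerate the neighborhood by brute force and compare its size with a closed-form count. The sketch below does that under two stated assumptions: sequences contain only the bases A, C, G and T, and the count is taken to be $\sum_{i=0}^{d} {k \choose i} \times 3^i$ (choose $i$ positions, then switch each one to one of the 3 other bases). The names `neighbors` and `neighborhood_size` are made up for this example, and `math.comb` needs Python 3.8 or newer.

```python
from itertools import combinations, product
from math import comb

BASES = 'ACGT'

def neighbors(seq, d):
    """Return the set of sequences within Hamming distance d of seq."""
    result = set()
    for i in range(min(d, len(seq)) + 1):
        # Pick which i positions to change ...
        for positions in combinations(range(len(seq)), i):
            # ... and, for each one, list the 3 bases that differ from the original.
            choices = [[b for b in BASES if b != seq[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(seq)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                result.add(''.join(chars))
    return result

def neighborhood_size(k, d):
    """Closed-form size of the d-neighborhood of a sequence of length k."""
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))

print(len(neighbors('GATG', 1)), neighborhood_size(4, 1))  # 13 13
print(len(neighbors('GATG', 2)), neighborhood_size(4, 2))  # 67 67
```

The neighborhood grows quickly with $d$, so this kind of enumeration is only practical for small $k$ and $d$.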