# Bioinformatics Algorithms, Week 3

This week, we are switching gears from looking for the *DnaA* boxes in bacterial genomes, to looking for **regulatory motifs** responsible for maintaining circadian rhythms. Regulatory motifs are short nucleotide sequences, typically upstream of genes, which are bound by transcription factors that may activate or repress transcription. 

(A promoter is a kind of regulatory motif in DNA; activators and repressors are proteins. A *DnaA* box is not a promoter, but it is a regulatory motif found near the promoter. DNA polymerase is not an activator.)

### Identifying the evening element
In [Harmer et al. (2000)](https://doi.org/10.1126/science.290.5499.2110), Steve Kay's team used a DNA microarray to identify genes expressed at different times of the day (transcriptomes). They then extracted the upstream regions (L = 1000 bps) of 500 genes that exhibited circadian behavior, and looked for Frequent Words. They found the sequence AAAATATCT 46 times. The expected number of occurrences of a 9-mer in 500 random DNA strings, each of length 1000 bps is 1.892, since:

- The probability of randomly generating any 9-mer is $0.25^9$.
- In a 1000 bp sequence, the 9-mer can occupy $992$ (i.e. $1000-9+1$) positions.
- There are $500$ such sequences.
- Therefore, $0.25^9 \times 992 \times 500 = 1.89208984375$ is the expected number.

So it turns out that the sequence AAAATATCT was **about 24 times more frequent** than expected. Harmer et al. (2000) named this promoter sequence the **evening element** and verified that it does, indeed, govern circadian behavior. The idea is that a single activator, or a set of activators that recognize the same promoter AAAATATCT, will activate the transcription of all 500 genes associated with circadian behavior at the same time in a plant, so that all necessary proteins are available for the plant to express circadian behavior.  
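
The arithmetic above can be reproduced with a short sketch. It assumes each nucleotide is drawn independently with probability 0.25; the function name `expected_kmer_count` is made up for this illustration and is not from the paper or the course.

```python
def expected_kmer_count(k, seq_len, n_seqs, p_nt=0.25):
    """Expected occurrences of one fixed k-mer in n_seqs random strings of length seq_len."""
    positions = seq_len - k + 1              # possible start positions per string
    return (p_nt ** k) * positions * n_seqs

expected = expected_kmer_count(k=9, seq_len=1000, n_seqs=500)
observed = 46                                # count reported for AAAATATCT

print(expected)             # 1.89208984375
print(observed / expected)  # roughly 24.3
```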

Perhaps because maintaining a correct circadian rhythm is a matter of life or death for plants, the evening element is remarkably well-conserved in plants. However, regulatory motifs in other genes and organisms may not be so conserved. Fruit fly **immunity genes** have very similar, but not identical, 12-mer promoters, where a transcription factor called **NF-kB** binds and activates transcription. 

```
1   T C G G G G g T T T t t
2   c C G G t G A c T T a C
3   a C G G G G A T T T t C
4   T t G G G G A c T T t t
5   a a G G G G A c T T C C
6   T t G G G G A c T T C C
7   T C G G G G A T T c a t
8   T C G G G G A T T c C t
9   T a G G G G A a c T a C
10  T C G G G t A T a a C C
```

The uppercase letters indicate the most common nucleotide in each column.

When we have a case like this, where there are numerous (>2) differences between strings and our *k*-mers get quite long, the **Frequent Words** solution that we spent the last two weeks developing will no longer help us because the algorithm will be too slow. Moreover, unlike *DnaA* boxes, which occur in clumps near the *ori*, regulatory motifs tend to be scattered throughout the genome, which means we might be dealing with windows that are tens or hundreds of thousands of base pairs wide. We need another approach.

### Motif Enumeration
We can first try a brute-force, **exhaustive search**, which considers every possible candidate and checks whether each candidate solves the problem. While such a search is guaranteed to return a correct solution, it can be very slow because the number of candidates grows rapidly with *k* and *d*, so it is not guaranteed to return an answer in a reasonable amount of time. The function below does not entirely follow the pseudocode given in the course, because I could not make sense of the pseudocode.

**Implanted Motif Problem**: *Find all (k, d)-motifs in a collection of strings.*
- **Input**: A collection of strings *Dna*, and integers *k* and *d*.
- **Output**: All (*k*, *d*)-motifs, which are *k*-mers appearing in every string in *Dna* with at most *d* mismatches.


In [1]:
def hamming_distance(s1, s2):
    diff = [i for i in range(len(s1)) if s1[i] != s2[i]]
    return len(diff)

def neighbors(pattern, d):
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return {"A", "C", "G", "T"}
    neighborhood = set()
    suffix_neighbors = neighbors(pattern[1:], d)
    for text in suffix_neighbors:
        if hamming_distance(pattern[1:], text) < d:
            nts = ["A", "C", "G", "T"]
            for nt in nts:
                neighborhood.add(nt+text)
        else:
            neighborhood.add(pattern[0]+text)
    return neighborhood
    
def kmer_in_all_dna(kmer, kmers_dict):
    for array in kmers_dict.values():
        if kmer not in array:
            return False
    return True
    
def motif_enumeration(dna, k, d):
    patterns = set()
    kmers_dict = {}
    all_kmers = set()

    for pattern in dna:
        kmers = {pattern[i:i+k] for i in range(len(pattern)-k+1)}
        kmers_dict[pattern] = set()
        for kmer in kmers:
            kmer_neighbors = neighbors(kmer, d)
            for i in kmer_neighbors:
                kmers_dict[pattern].add(i) # d-neighbors of the k-mer
                all_kmers.add(i)
            kmers_dict[pattern].add(kmer) # the k-mer itself
            all_kmers.add(kmer)
                        
    for kmer in all_kmers:
        if kmer_in_all_dna(kmer, kmers_dict):
            patterns.add(kmer)
    
    return patterns

In [2]:
dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
k = 3 
d = 1
motif_enumeration(dna, k, d)

{'ATA', 'ATT', 'GTT', 'TTT'}
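
As a sanity check on the output above, here is a minimal sketch of an even more naive approach: test every one of the $4^k$ possible *k*-mers directly against the definition of a (*k*, *d*)-motif. The names `min_hamming`, `brute_force_motifs`, and `sample_dna` are hypothetical helpers for this illustration only, and the approach is only practical for small *k*.

```python
from itertools import product

def min_hamming(pattern, text):
    """Smallest Hamming distance between pattern and any k-mer of text."""
    k = len(pattern)
    return min(
        sum(a != b for a, b in zip(pattern, text[i:i + k]))
        for i in range(len(text) - k + 1)
    )

def brute_force_motifs(dna, k, d):
    """All k-mers that appear in every string of dna with at most d mismatches."""
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return {c for c in candidates if all(min_hamming(c, text) <= d for text in dna)}

sample_dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(sorted(brute_force_motifs(sample_dna, 3, 1)))  # ['ATA', 'ATT', 'GTT', 'TTT']
```

Both approaches agree on this input, which is some reassurance that the neighborhood-based function is doing what the problem statement asks.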

### Motif Scoring
The **Implanted Motif** problem has some limitations. If even a single string in *Dna* does not contain the regulatory motif, a (*k*, *d*)-motif does not exist at all. 

A different way to think about the problem might be to score individual instances of the motif depending on how similar they are to an "ideal" motif. But since the ideal motif is unknown, we select a *k*-mer from each string, and score these *k*-mers based on how similar they are to **each other**.

Suppose we have a list of *t* strings *Dna*, each of length *n*, and we select one *k*-mer from each string. The selected *k*-mers form a $t \times k$ matrix *Motifs*.

```
               k

   [[T C G G G G g T T T t t]
    [c C G G t G A c T T a C]
    [a C G G G G A T T T t C]
    [T t G G G G A c T T t t]
t   [a a G G G G A c T T C C]
    [T t G G G G A c T T C C]
    [T C G G G G A T T c a t]
    [T C G G G G A T T c C t]
    [T a G G G G A a c T a C]
    [T C G G G t A T a a C C]]
```

By varying the choice of *k*-mers from each string in *Dna*, we can construct a large number of different *Motifs* matrices. Our goal is to obtain the most "conserved" matrix, with the fewest differences across the *k*-mers in the matrix. To quantify the similarity of the *k*-mers, we can simply count, at each of the *k* positions, the number of nucleotides that are not the most common (plurality) nucleotide in that column.

In [49]:
def score(motifs):
    k = len(motifs[0])
    
    score_dicts = []
    for i in range(k):
        nt_counts = {"A":0, "T":0, "G":0, "C":0}
        for motif in motifs:
            nt_counts[motif[i]] += 1
        score_dicts.append(nt_counts)

    scores = []
    for d in score_dicts:
        dict_score = 0
        for key, value in d.items():
            if key != max(d, key=d.get):
                dict_score += value
        scores.append(dict_score)
    return sum(scores)

In [50]:
dna = ["TCGGGGGTTTTT",
    "CCGGTGACTTAC",
    "ACGGGGATTTTC",
    "TTGGGGACTTTT",
    "AAGGGGACTTCC",
    "TTGGGGACTTCC",
    "TCGGGGATTCAT",
    "TCGGGGATTCCT",
    "TAGGGGAACTAC",
    "TCGGGTATAACC"]

score(dna)

30

But, currently, we ignore the difference between a column having 6C, 2A, and 2T versus a column having 6C and 4T. Both end up with a score of 4, yet the latter is actually more conserved than the former! This is important because many regulatory motifs have a few "strongly conserved" positions and many "weakly conserved" positions that allow one of several nucleotides with no clear preference for one over the other.

To account for the degree to which having the "right" nucleotide matters, we turn to **entropy**. It is a measure of uncertainty for probability distributions, and is defined as: 

$$ H(p_1, \dots, p_N) = -\sum_{i=1}^{N} p_i \log_2 p_i $$
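
To see how entropy separates the two columns discussed above, here is a minimal sketch that applies the formula to a single column. The helper name `column_entropy` and the two example columns are assumptions made for this illustration; both columns have the same number of non-plurality nucleotides.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy (in bits) of the nucleotide distribution in one motif column."""
    total = len(column)
    # Counter only holds nucleotides that actually occur, so log2(0) never arises
    probs = [n / total for n in Counter(column).values()]
    return -sum(p * math.log2(p) for p in probs)

col_a = "CCCCCCAATT"  # 6 C, 2 A, 2 T
col_b = "CCCCCCTTTT"  # 6 C, 4 T

print(round(column_entropy(col_a), 3))  # 1.371
print(round(column_entropy(col_b), 3))  # 0.971
```

The lower entropy of the second column reflects that it is more conserved, which the simple count cannot show.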

In [79]:
import math
def safe_log2(value):
    # by convention 0 * log2(0) is taken to be 0, so return 0 for a zero probability
    if value > 0:
        return math.log2(value)
    return 0
        
def entropy(row):
    plogp = [value*safe_log2(value) for value in row]
    ent = -1*sum(plogp)
    return ent

def total_entropy(matrix):
    pf = profile(matrix)
    entropies = [ entropy(row) for row in pf.T ]
    return sum(entropies)

In [81]:
total_entropy(dna)

9.916290005356972

Entropy offers a better way of scoring motif matrices than just counting the non-plurality nucleotides, and in practice entropy is used more often, but for simplicity's sake, we'll use the **score** function instead.

We can also obtain a **count** matrix, and if we divide all elements in **count** by *t*, the number of strings in *Dna*, we obtain the **profile** matrix.

In [54]:
import numpy as np

def count(motifs):
    k = len(motifs[0])
    counts = [[0]*k for i in range(4)] # empty matrix to fill
    nt_index_map = {"A": 0, "C": 1, "G": 2, "T": 3}
    for motif in motifs:
        for i in range(k):
            nt = motif[i]
            counts[nt_index_map[nt]][i] += 1
    return counts

def profile(motifs):
    counts = np.array(count(motifs))
    return counts/len(motifs)

In [55]:
profile(dna)

array([[0.2, 0.2, 0. , 0. , 0. , 0. , 0.9, 0.1, 0.1, 0.1, 0.3, 0. ],
       [0.1, 0.6, 0. , 0. , 0. , 0. , 0. , 0.4, 0.1, 0.2, 0.4, 0.6],
       [0. , 0. , 1. , 1. , 0.9, 0.9, 0.1, 0. , 0. , 0. , 0. , 0. ],
       [0.7, 0.2, 0. , 0. , 0.1, 0.1, 0. , 0.5, 0.8, 0.7, 0.3, 0.4]])

Finally, using the **profile** matrix, we can find out what the **consensus string** is. 

In [60]:
def consensus(motifs):
    pf = profile(motifs)
    index_nt_map = {0: "A", 1: "C", 2: "G", 3: "T"}
    max_values = np.argmax(pf, axis=0)
    
    return "".join([index_nt_map[i] for i in max_values])

In [61]:
consensus(dna)

'TCGGGGATTTCC'

Super cool! The consensus string is not actually found in any of the 10 strings in *Dna*, but it still counts. 

### The Motif Finding Problem

Now that we've learned how to evaluate a collection of *k*-mers, we can formulate the **Motif Finding** problem.

**Motif Finding Problem**: *Given a collection of strings, find a set of k-mers, one from each string, that minimizes the score of the resulting motif.*
- **Input**: A collection of strings *Dna* and an integer *k*.
- **Output**: A collection **Motifs** of *k*-mers, one from each string in *Dna*, minimizing *Score*(*Motifs*) among all possible choices of *k*-mers.
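
To get a feel for how many candidate *Motifs* collections a brute-force search would have to score, here is a rough sketch. It assumes every string in *Dna* has the same length *n*; the function name `count_motif_choices` and the second set of values are hypothetical, chosen only for illustration.

```python
def count_motif_choices(n, k, t):
    """Number of ways to pick one k-mer from each of t strings of length n."""
    return (n - k + 1) ** t

# five strings of length 12 with k = 3
print(count_motif_choices(n=12, k=3, t=5))  # 100000

# hypothetical larger case: ten upstream regions of 1000 bp with k = 12
print(count_motif_choices(n=1000, k=12, t=10))
```

The count is exponential in the number of strings *t*, which is what makes exhaustive scoring impractical for realistic inputs.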

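So, this is an optimization problem. A brute-force algorithm would be way too slow: with *t* strings of length *n*, there are (*n* − *k* + 1)<sup>*t*</sup> ways to choose *Motifs*, and scoring each choice takes about *k* · *t* steps, for a running time of O(*n*<sup>*t*</sup> · *k* · *t*). So we have to figure out a faster algorithm. Given the same matrix as before: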

```
               k

   [[T C G G G G g T T T t t]
    [c C G G t G A c T T a C]
    [a C G G G G A T T T t C]
    [T t G G G G A c T T t t]
t   [a a G G G G A c T T C C]
    [T t G G G G A c T T C C]
    [T C G G G G A T T c a t]
    [T C G G G G A T T c C t]
    [T a G G G G A a c T a C]
    [T C G G G t A T a a C C]]
```

We can actually calculate the score row by row instead of column by column, by counting in each row the nucleotides that differ from the column consensus (in this matrix, the lowercase letters). It also happens that the score per row is just the Hamming distance between the row and the consensus string for this matrix, and so the overall *Score*(*Motifs*) equals the sum of Hamming distances.

<img src="http://bioinformaticsalgorithms.com/images/Motifs/motifs_score_consensus.png" width=500px />

Alright. The instructors define a **function *d*(*Pattern*, *Motifs*) as the sum of Hamming distances between *Pattern* and each *k*-mer in *Motifs***. And because *Score*(*Motifs*) = *d*(*Consensus*(*Motifs*), *Motifs*), rather than looking for a set *Motifs* that minimizes *Score*, we can look for a string *Pattern* that minimizes *d* over all possible sets *Motifs* in *Dna*.
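
The equality between the column-wise score and the sum of row-wise Hamming distances to the consensus can be checked on the NF-kB matrix with a minimal sketch. The helper names `column_score`, `consensus_string`, and `mismatches` are made up for this illustration and are independent of the functions defined earlier.

```python
from collections import Counter

nfkb_motifs = ["TCGGGGGTTTTT", "CCGGTGACTTAC", "ACGGGGATTTTC", "TTGGGGACTTTT",
               "AAGGGGACTTCC", "TTGGGGACTTCC", "TCGGGGATTCAT", "TCGGGGATTCCT",
               "TAGGGGAACTAC", "TCGGGTATAACC"]

def column_score(motifs):
    """Column by column: count nucleotides that are not the most common one."""
    return sum(len(col) - Counter(col).most_common(1)[0][1] for col in zip(*motifs))

def consensus_string(motifs):
    """Most common nucleotide in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

def mismatches(p, q):
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(p, q))

cons = consensus_string(nfkb_motifs)
row_total = sum(mismatches(cons, m) for m in nfkb_motifs)  # row by row
print(cons, column_score(nfkb_motifs), row_total)  # TCGGGGATTTCC 30 30
```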

**Equivalent Motif Finding Problem**: *Given a collection of strings, find a pattern and a collection of k-mers (one from each string) that minimizes the distance between all possible patterns and all possible collections of k-mers.*
- **Input**: A collection of strings *Dna* and an integer *k*.
- **Output**: A *k*-mer *Pattern* and a collection of *k*-mers, one from each string in *Dna*, minimizing *d*(*Pattern*, *Motifs*) among all possible choices of *Pattern* and *Motifs*.

In [None]:
# stolen code from comment section
def hamming_dist(p,q):
    # Hamming distance is only defined for strings of equal length
    if len(p) != len(q):
        raise ValueError("Unable to compute: strings differ in length.")
    dist = 0
    for i in range(len(p)):
        if p[i] != q[i]:
            dist += 1
    return dist
    
def closest_match(dna,pattern):
    """
    given a dna fragment and a pattern with length=k, 
    find a substring k-mer in dna fragment that has the 
    smallest hamming distance from the pattern.
    """
    distances = {}
    for i in range(0,len(dna)-len(pattern)+1):
        kmer = dna[i:i+len(pattern)]
        kmer_dist = hamming_dist(kmer,pattern)
        distances[kmer] = kmer_dist
    min_val = min(distances, key=distances.get)
    return min_val

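def median_string(dnas,pattern):
    """
    Despite its name, this returns a list of closest matches rather than a median string:
    within each dna in dnas, find the k-mer that is the closest match to the given pattern.
    For example, if the pattern is "AAA" and a dna contains the substring "AAA", that substring is the closest match.
    """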
    kmers = []
    for d in dnas:
        min_val = closest_match(d,pattern)
        kmers.append(min_val)
    return kmers

print(*median_string(["TTACCTTAAC","GATATCTGTC","ACGGCGTTCG","CCCTAAAGAG","CGTCAGAGGT"],"AAA"))


```
DistanceBetweenPatternAndStrings(Pattern, Dna)
    k ← |Pattern|
    distance ← 0
    for each string Text in Dna
        HammingDistance ← ∞
        for each k-mer Pattern’ in Text
            if HammingDistance > HammingDistance(Pattern, Pattern’)
                HammingDistance ← HammingDistance(Pattern, Pattern’)
        distance ← distance + HammingDistance
    return distance
```

In [None]:
# my own implementation
def DistanceBetweenPatternAndStrings(pattern, dnas):
    k = len(pattern)
    distance = 0
    for dna in dnas:
        # smallest Hamming distance between pattern and any k-mer of this string
        min_distance = float('inf')
        for i in range(len(dna)-k+1):
            kmer = dna[i:i+k]
            # hamming_distance is the function defined in the first code cell
            current_hamming = hamming_distance(pattern, kmer)
            if min_distance > current_hamming:
                min_distance = current_hamming
        distance += min_distance
    return distance

In [None]:
pattern = "TTACG"
dnas = ["GCCCGTACGCGCCGGGCATTAATTGAGGGGCCAAAACCTTGACGAGGGAAAAGTGCCTACATAACTAGCCGTAACATCTGTCCGACTGCGAAACAAGGGAAGAGC",
"AAAGTACGTTATATATGCTCCCGCCACATACGGCGGGCTAATTACAATGCAAATGCCGCTTAATGATTGCAATGGTGATAACAGTATGGCAATGTTTAAGGGACG",
"GCGCGATACCACCGTGAGGCGGAGATGAGTGGCGTTCGTCCGTAACTAGTGAGCATGTTACCACGGGTCCAATAGATCCCATAGGACCCTCAATTATGGTGGAGA",
"CTTCTGTGCACGGATATTTTACCCGGGAGAGATGCGGCTCTGGTTGCGACAATAACCGGGGTGGGTTTAAGTTGGAACTGGTGACGATAGTTAATAGACCATCGT", 
"TCAAATAGCTCCTGGATCAGGATTGATTAGCGTATGACGGGTCACCGCGCTCATGCATTCAGCGTATAGAAGTCCCCTCCGACGCAAACCCGCGCCGAGACGCTC",
"AATCGAATGGGGGCGGTTTTGCACCAGTTCGATAGGTAGTGCCCCGAATACACACGTCGTGCTCATAACATTCGCTGCCTTTTGGGCTGGTAATTTTGGAAATCT",
"CGGGAGGTATCCACTGAATCCGACCAGCCCTGGGTTTTGGTTCACCCGCTTCTCGTCTTGGAGTACACGTTTGCTGCGCGCGATTGTCATTTTGGATATGTCAAC",
"CTGAGACCAATCGGCGCATATTAGTTGACCGATGTACCAAACTGATAACACACCCAGCTTCCTTCATCCAATCATCGGCTAGTGAGCTAGAGCCGGCCCGGCCCT",
"CAAGCAGCATCCCCTCAAGAGCCCATTTTAAAGCGCGAGCCATTTAGGGCCGTTTTTCCTGTTTCCTCTCTCGTAAGCAGGTTGCTCCTACCTACAACAGACCAA",
"GTAGAAGTGATGACTGTCGGCTGTTGTTGTGACTGGTCTCAACCGTGTGGCGTCAAGATAGTGCGGTATCCTCTTCGATTATCAGATTTTCTGGGAGAGATTTAC",
"CGCCACATGATGAATATAACATCTGTAGGGGTGTGTCGACAATGGCCTTGATGTCTCTTTCCCCGACGTCTGTGAAACAAAATTTTGTCGGCAGGTGGGCCCACT",
"GACAAGTATTCACGCTCGTTTATTTGGGTTGAAGATTTAGCCTATCGACAACAGCCGTGGGGGGCGGTGGGCGGCAGAACTAACTCTATACTCGGATACCTTATA",
"AAGGACACTGGTACGATCCTAGAAGGGGCAATGAGAATATGGACACTGGTCTAGAATGGATCCGGTAAGCTCCTCATAAGCCACTATCGGGATGGCCGGATTCTC",
"GCTGCAGGCCGCTTTTCTCCGTTTAGGCCCCACCTGACGGGGTAACGAGTTTAGCGAGCCCTCTTTGAAAGGGGGAACGCATTTGATGGTTGCAGTAGACATAGT",
"CAGATTTACATAACATCTAATTATATAGGCACTACCTACATGTGGTGATGTGCACGTGAACACGGAGTGGGCGCATTCCGCGTTATGTATGCCAGAGACCTTTCA",
"CAATGGGTCTTTTATAACAAGGTTATACAATATTAGAGAATACCAAACGACGGCCCGGTATACCTGAGTTGCTCCAAACCGACTGATTCTGGTTGAACAAAGGTT",
"TACTACCATGAGTTTATCGTCAGAGTATTCGCACTCGTAGACGAGATTCTGAAGCGCATTAACTGCTCCTTGGGCTGGGTTCCAAGGTCGGTTAGTCTTAGGGAT",
"TGTAAGCGCCGCCTACAGCGTTACTCCTCTACATCAATCATAGTTGGTAGAAAGTCAGTAGGTACCCAACGACACAATGTTTATAGATCCTAATGTGAGTGGCTC",
"CCTTACGTTGAGGTATGCTTACTTCGGCAAATGAACCGTTCGAACAAACCGGGTCAAATAGCTAGAGAGACTCAGAGTGGAGGCCATAATATATGGAGAGACATG",
"GGGTAATTACTCTAATATTCTCAAATCACGGCAATACGGAGATAAACCTCTTTACCGACAGAATCTTAGGAAGCACATAATAACTAGGTAGGTAACATTATCAAC",
"CCCATTCAGGTACCATGAAAAAAATTTATTTTCATGGTTGCCAAAATCCACGTTACACATGCCGGCAGCTTGTGGGCAACTGAGCAAGTCCGAGGACGTGAAGCT",
"TTCATTACTGTGCATCAGGCACTGTACGTCTACCCAACTTAAGCTTATGATAAACGCTTCAGACCTTGGGTATTACCTCTGCTTGCTGACCTGGATAGCGCAAAT",
"TAGTGCAGCTTGCAAGTAGAAAACTATATTCGGGGTCCGCTCACTATGAGGGTATACGATCCTACTGGGAGGCCTCTATGAAACACACCGTTAGGGGAGCCCATA",
"CAGCCCATACATAGAAGAGGACGCACTCTACGTGTTGCATCTAGTAGTCAGCACTAATTACAACTCTATCCATAGTAGCAAAGGAATAAACGACAAAAGGTCTAA",
"GTATGCTTCGTCTGTCTCACATAAACAGCGGATACTGTAAGTAACAAACAGATTGACCATAGCGTCAGGTCCGCTAGCACCCCCTGGCATTGCGGGACGAGTCTC",
"CAGTCTTTAACCAGGACCAACAATCCTCGTAAACCTGTGGCTCTCTCCAGGTCTGTTTATTTATTGCATCAGCGTCTCGCACCTTGTGCACAAACGACGTCAATG",
"GCAATGTAGATTTAACATTGTCGACAGTTACCCAGAGCCTGAGACACCTTTCGTCTTTCTAGGCCACGACTTCGCCAGGTTAGATAAGTGAGCCCGTGTGGAAGG",
"CGCATAACAGACAAATGTAACTGTGTACGTCTAAGGCGCCCTTTTTGTAGTCATCCCACCGCCACTATGGCTTGCTCCGTTCTAGCTGTGGTAGACGGCGCCCCA",
"CACCGAGCATGAATTATCAGTTCTTTTGACTGGGAGGACTGATTGGTGGATAACCTGAACGGTGGATTACGGGTAACTCACTTAGCATGACCCCACTAGGTTATA",
"CAGGTGTCCGTTTGTCAATACTCTCCAGAATGTAGGAAGTTCGTACTTACTCAGGCTCTTCCGGTTGCCTCAACCCCTCTAAATTGTCACGACGTTTGTCGAGTA",
"AGTAACCCGAAAGCCTATATGTTGAATTCCTGGTGTTGCAAGTATATGCCAGAGCAGATTTTGGAAGATCCCGCCAGGCGCGGCTTGTTAAACTATTTACTATTC",
"CGATCGGGGTAAAGCAGCTTCAAGCCCTCCGTTAAGGCGCAGTCTACTGCATCTTCGTCAGTAATCTTAAGGCAGCGCATCGTTTGAGTACTCGTGGATTAACGG",
"CCTGAAGCCCGACGAGAAATGGGAACAACTTTTAGCCGTTCGAACGTTCTGCATCGGCGGAAGCTTGATACGTATCTTGCGCATTTGGCGACCAGACAAGCTAAG",
"TTGGCAACGAACCTGCAGATTTCTAACGGAACGTTCTTCCGCGCCATTGAAGTATGACTCGCCTACCAGTTTCGGAATGTCGTAGCATTACCTACTTTACTGGTC"]
        

DistanceBetweenPatternAndStrings(pattern, dnas)

```
MedianString(Dna, k)
    distance ← ∞
    Patterns ← AllStrings(k)
    for i ← 0 to |Patterns| - 1
        Pattern ← Patterns[i]
        if distance > DistanceBetweenPatternAndStrings(Pattern, Dna)
            distance ← DistanceBetweenPatternAndStrings(Pattern, Dna)
            Median ← Pattern
    return Median
```

In [None]:
def medianString(dnas, k):
    distance = float('inf')
    patterns = neighbors("A" * k, k) # every k-mer is within k mismatches of "AA...A", so this is all 4^k k-mers
    for pattern in patterns:
        current_distance = DistanceBetweenPatternAndStrings(pattern, dnas)
        if distance > current_distance:
            distance = current_distance
            median = pattern
    return median

In [None]:
k=6

dnas = [
    "TATGCCATCGGCCCAAGAGCAGGTGTGCGGCTGAATGAGTCA",
    "GATAGTTGTTAATGACTACTCTTTATCGGGACCCGTCATCCG",
    "CTGGGCCGTAGTAAACGTGGCGCCATCGGTATAGCACGTAAT",
    "CGTACCGCCTGCATCTTTATCGGAACTCCGATCCAACCCATC",
    "ATCGGTGATCAGCTTTGCCCTCCCGGCTGGGAGTACCTACAA",
    "AGCAAATTAGGCGATGGCACGACGCTCAGCCCTACGATCGGC",
    "GGCGGATCGTCCGGTCGGAGCCTGATCGGCCTTGTTTCACAC",
    "AGTGCCTAACCTGTCGTGATCGGAAAACCGAACATGGCAGGT",
    "TTGAGTCCGGAAAATCCACCAGGAAATATTAATGACATCGGT",
    "TCAAGGTCTCATTCGCTAGACCCAATCGGCCGACATTCGCGG"]
medianString(dnas, k)

In [None]:
k=7
dnas = ["CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC", 
        "GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC",
        "GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG"]

def medianString_all(dnas, k):
    medians = []
    distance = float('inf')
    patterns = neighbors("A" * k, k) # all 4^k k-mers
    for pattern in patterns:
        current_distance = DistanceBetweenPatternAndStrings(pattern, dnas)
        if distance > current_distance:
            distance = current_distance
            medians = [pattern] # strictly better: discard the earlier, worse candidates
        elif distance == current_distance:
            medians.append(pattern) # tie: keep it alongside the current best
            
    return medians
    
medianString_all(dnas, k)

In [None]:
from itertools import product

def hamming_distance(pattern1, pattern2):
    """Calculate the Hamming distance between two strings."""
    return sum(c1 != c2 for c1, c2 in zip(pattern1, pattern2))

def median_string(dna, k):
    """Find the median string with the smallest total Hamming distance to all strings in the given DNA sequences."""
    distance = float('inf')
    median = None
    
    # Generate all possible k-mers
    for pattern in [''.join(p) for p in product('ACGT', repeat=k)]:
        # Calculate the total distance between the current pattern and all strings in DNA
        total_distance = sum(min(hamming_distance(pattern, dna_str[j:j+k]) for j in range(len(dna_str) - k + 1)) for dna_str in dna)
        
        # Update the median and distance if the total distance is smaller
        if total_distance < distance:
            distance = total_distance
            median = pattern
    
    return median

In [None]:
median_string(dnas, k)

In [None]:
def most_probable_kmer(text, k, profile):
    max_prob = -1
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            if kmer[j] == "A":
                prob *= profile[0][j]
            elif kmer[j] == "C":
                prob *= profile[1][j]
            elif kmer[j] == "G":
                prob *= profile[2][j]
            elif kmer[j] == "T":
                prob *= profile[3][j]
        if prob > max_prob:
            max_prob = prob
            max_kmer = kmer
    return max_kmer

In [None]:
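# line 1 is the text, line 2 is k, and the remaining lines are the profile rows (A, C, G, T)
with open("mostprobable.txt", "r") as f:
    lines = [line for line in f.read().splitlines() if line.strip()]

text = lines[0]
k = int(lines[1])
# named profile_matrix so that the profile() function defined earlier is not overwritten
profile_matrix = [[float(i) for i in line.split()] for line in lines[2:]]
print(profile_matrix)

In [None]:
most_probable_kmer(text, k, profile_matrix)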

Implement **GreedyMotifSearch**.

    Input: Integers k and t, followed by a space-separated collection of strings Dna.
    Output: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t). If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring first.


Sample Input:

3 5
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG

Sample Output:

CAG CAG CAA CAA CAA
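
The tie-breaking rule in the problem statement matters most when the profile contains zeros, because then many *k*-mers can share a probability of 0. Below is a minimal sketch of that situation; the function name `first_most_probable_kmer` and the dictionary-based profile layout are assumptions made for this illustration, not the course's reference implementation.

```python
def first_most_probable_kmer(text, k, profile):
    """profile maps each nucleotide to a list of k column probabilities."""
    best_kmer, best_prob = None, -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = 1.0
        for j, nt in enumerate(kmer):
            prob *= profile[nt][j]
        if prob > best_prob:  # strict '>' keeps the earliest k-mer on ties
            best_kmer, best_prob = kmer, prob
    return best_kmer

# profile built from the single motif "GGC", without pseudocounts
profile_ggc = {"A": [0, 0, 0], "C": [0, 0, 1], "G": [1, 1, 0], "T": [0, 0, 0]}

# no 3-mer in this string is "GGC", so every 3-mer has probability 0
print(first_most_probable_kmer("AAGAATCAGTCA", 3, profile_ggc))  # AAG
```

Starting the running maximum below zero and comparing with a strict `>` is what makes the first *k*-mer win when all probabilities are tied.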

In [None]:
k = 3
t = 5
dnas = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC", "CACGTCAATCAC", "CAATAATATTCG"]

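# the sample values above are overwritten by the contents of the input file
with open("rosalind_ba2d.txt", "r") as f:
    lines = f.read().splitlines()

# the first line holds k and t; each remaining non-empty line is one dna string
k, t = (int(x) for x in lines[0].split())
dnas = [line for line in lines[1:] if line]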
print(k)
print(t)
print(dnas)

In [None]:
import numpy as np

def score(dna):
    score_dicts = []
    scores = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "T":0, "G":0, "C":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        score_dicts.append(nt_counts)
    for d in score_dicts:
        dict_score = 0
        for key, value in d.items():
            if key != max(d, key=d.get):
                dict_score += value
        scores.append(dict_score)
    return sum(scores)


def profile(dna):
    score_dicts = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "C":0, "G":0, "T":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        # divide only after the column is fully counted (dividing inside the counting loop accumulates floating point error)
        for nt, n in nt_counts.items():
            nt_counts[nt] = n/len(dna)
        score_dicts.append(nt_counts)
    score_profile = np.array([list(i.values()) for i in score_dicts])
    score_profile = score_profile.T # transpose so rows are A, C, G, T and columns are positions
    return score_profile


def most_probable_kmer(text, k, profile):
    max_prob = -1
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            if kmer[j] == "A":
                prob *= profile[0][j]
            elif kmer[j] == "C":
                prob *= profile[1][j]
            elif kmer[j] == "G":
                prob *= profile[2][j]
            elif kmer[j] == "T":
                prob *= profile[3][j]
        if prob > max_prob:
            max_prob = prob
            max_kmer = kmer
    return max_kmer


def greedy_motif_search(k, t, dnas):
    # arbitrarily choose first k chars from each dna string
    best_motifs = [dna[:k] for dna in dnas] 
    # arbitrarily start with first dna string and find all k-mers in this string
    dna0_motifs = [dnas[0][i:i+k] for i in range(len(dnas[0])-k+1)] 
    
    for motif in dna0_motifs:
        current_motifs = [motif] # initialize a current_motifs list for each motif.
        for i in range(1, t): # for each dna string aside from the first,
            current_dna = dnas[i]
            prof = profile(current_motifs) # get the probability profile given previous current_motifs
            motifi = most_probable_kmer(current_dna, k, prof) # and find the most probable k-mer in the dna string given prev probs
            current_motifs.append(motifi) # and update current_motifs
        if score(current_motifs) < score(best_motifs): # if current_motifs has lower disagreement
            best_motifs = current_motifs
    return best_motifs

In [None]:
print(" ".join(greedy_motif_search(k, t, dnas)))

In [None]:
def laplace_profile(dna):
    score_dicts = []
    for i in range(len(dna[0])):
        nt_counts = {"A":1, "C":1, "G":1, "T":1} # start every count at 1 (pseudocount)
        for seq in dna:
            nt_counts[seq[i]] += 1
        # divide only after the column is fully counted (dividing inside the counting loop accumulates floating point error)
        for nt, n in nt_counts.items():
            nt_counts[nt] = n/(len(dna)+4) # +4 since we add 1 to each nucleotide
        score_dicts.append(nt_counts)
    score_profile = np.array([list(i.values()) for i in score_dicts])
    score_profile = score_profile.T # transpose so rows are A, C, G, T and columns are positions
    return score_profile


def greedy_motif_search_pseudocounts(k, t, dnas):
    # arbitrarily choose first k chars from each dna string
    best_motifs = [dna[:k] for dna in dnas] 
    # arbitrarily start with first dna string and find all k-mers in this string
    dna0_motifs = [dnas[0][i:i+k] for i in range(len(dnas[0])-k+1)] 
    
    for motif in dna0_motifs:
        current_motifs = [motif] # initialize a current_motifs list for each motif.
        for i in range(1, t): # for each dna string aside from the first,
            current_dna = dnas[i]
            prof = laplace_profile(current_motifs) # get the probability profile given previous current_motifs
            motifi = most_probable_kmer(current_dna, k, prof) # and find the most probable k-mer in the dna string given prev probs
            current_motifs.append(motifi) # and update current_motifs
        if score(current_motifs) < score(best_motifs): # if current_motifs has lower disagreement
            best_motifs = current_motifs
    return best_motifs
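
The only difference from the earlier greedy search is the profile: every nucleotide count starts at 1, so no entry in the profile is ever exactly 0. Here is a minimal sketch of that effect on a single column; the helper name `laplace_column` is hypothetical and handles one column at a time purely for illustration.

```python
def laplace_column(column):
    """Pseudocount frequencies for one motif column (each count starts at 1)."""
    counts = {nt: 1 for nt in "ACGT"}
    for nt in column:
        counts[nt] += 1
    total = len(column) + 4  # one extra count per nucleotide
    return {nt: n / total for nt, n in counts.items()}

# a column that has only seen a single G so far
print(laplace_column("G"))  # {'A': 0.2, 'C': 0.2, 'G': 0.4, 'T': 0.2}
```

Without pseudocounts the same column would give A, C, and T a probability of 0, which zeroes out the probability of any *k*-mer that differs from the first motif at that position.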

In [None]:
prof = [[0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
        [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
        [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
        [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]]
k = 6
text = "GAGCTA"
def calculate_probability(text, k, profile):
    prob = 1
    for i in range(k):
        if text[i] == "A":
            prob *= profile[0][i]
        elif text[i] == "C":
            prob *= profile[1][i]
        elif text[i] == "G":
            prob *= profile[2][i]
        elif text[i] == "T":
            prob *= profile[3][i]
    return prob

calculate_probability(text, k, prof)
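
As a hand check of the cell above, the probability of GAGCTA is the product of one profile entry per position (row G, column 1; row A, column 2; and so on), so the call should return approximately this value, up to floating point rounding:

$$ 0.1 \times 0.3 \times 1.0 \times 0.4 \times 0.5 \times 0.9 = 0.0054 $$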