<img align="left" style="padding-right:10px;" src="figures/cartel.jpg">
<!--COURSE_INFORMATION-->
## This notebook contains material from the course [Biology Meets Programming](https://www.coursera.org/learn/bioinformatics/home/welcome) by University of California on Coursera


### The content is available [on GitHub](https://github.com/vencejo/Curso_BiologyMeetsProgramming).

<!--NAVIGATION-->
< | [Contents](Index.ipynb) | [4.2 Randomized Motif Search](4.2%20Randomized%20Motif%20Search.ipynb) >


### What is the probability that the sun will not rise tomorrow?

In 1650, after the Scots proclaimed Charles II as king during the English Civil War, Oliver Cromwell made a famous appeal to the Church of Scotland. Urging them to see the error of their royal alliance, he pleaded,

    I beseech you, in the bowels of Christ, think it possible that you may be mistaken.

The Scots rejected the appeal, and Cromwell invaded Scotland in response. His quotation later inspired the statistical maxim called Cromwell’s rule, which states that we should not use probabilities of 0 or 1 unless we are talking about logical statements that can only be true or false. In other words, we should allow a small probability for extremely unlikely events, such as “this book was written by aliens” or “the sun will not rise tomorrow”. We cannot speak to the likelihood of the former event, but in the 18th century, the French mathematician Pierre-Simon Laplace actually estimated the probability that the sun will not rise tomorrow (1/1826251), given that it has risen every day for the past 5000 years. Although this estimate was ridiculed by his contemporaries, Laplace’s approach to this question now plays an important role in statistics.

In any observed data set, there is the possibility, especially with low-probability events or small data sets, that an event with nonzero probability does not occur. Its observed frequency is therefore zero; however, setting the empirical probability of the event equal to zero represents an inaccurate oversimplification that may cause problems. By artificially adjusting the probability of rare events, these problems can be mitigated.
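To make the adjustment concrete, here is a minimal sketch that compares the raw observed frequency with Laplace's estimate for a two-outcome event (the event either occurs or it does not). The function names and the data (0 occurrences in 10 observations) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def empirical_probability(occurrences, trials):
    """Observed frequency: occurrences / trials."""
    return Fraction(occurrences, trials)

def rule_of_succession(occurrences, trials):
    """Laplace's estimate for a two-outcome event: (occurrences + 1) / (trials + 2)."""
    return Fraction(occurrences + 1, trials + 2)

# Hypothetical data: an event that never occurred in 10 observations.
occurrences, trials = 0, 10
print(empirical_probability(occurrences, trials))  # 0
print(rule_of_succession(occurrences, trials))     # 1/12
```

The adjusted estimate is small but never exactly zero, which is what Cromwell's rule asks for.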


### Laplace’s Rule of Succession

Cromwell’s rule is relevant to the calculation of the probability of a string based on a profile matrix. For example, consider the following Profile:

<img align="center" style="padding-right:10px;" src="figures/fig54.png">

A single zero in a profile column forces the probability of any string that uses that symbol at that position to zero, no matter how well the rest of the string matches the profile. In order to improve this unfair scoring, bioinformaticians often substitute zeroes with small numbers called pseudocounts. The simplest approach to introducing pseudocounts, called Laplace’s Rule of Succession, is similar to the principle that Laplace used to calculate the probability that the sun will not rise tomorrow. In the case of motifs, pseudocounts often amount to adding 1 (or some other small number) to each element of Count(Motifs). For example, say that we have the following motif, count, and profile matrices:

<img align="center" style="padding-right:10px;" src="figures/fig55.png">

Laplace’s Rule of Succession adds 1 to each element of Count(Motifs), updating the two matrices to the following:

<img align="center" style="padding-right:10px;" src="figures/fig56.png">
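As a rough illustration of the effect, the sketch below scores one string against a profile built without and with pseudocounts. With a pseudocount of 1 and t motifs, each profile entry becomes (count + 1) / (t + 4). The helper names `count_matrix`, `profile_matrix`, and `pr` are hypothetical and are not the functions required by the course. The motifs are the sample input used in the challenges below, and the pattern "CACGAA" is an arbitrary choice that contains a symbol never seen at its position.

```python
def count_matrix(motifs, pseudocount=0):
    """Count each nucleotide per position, starting every entry at `pseudocount`."""
    k = len(motifs[0])
    count = {symbol: [pseudocount] * k for symbol in "ACGT"}
    for motif in motifs:
        for j, symbol in enumerate(motif):
            count[symbol][j] += 1
    return count

def profile_matrix(motifs, pseudocount=0):
    """Divide each count by the column total (t + 4 * pseudocount)."""
    count = count_matrix(motifs, pseudocount)
    total = len(motifs) + 4 * pseudocount
    return {symbol: [c / total for c in row] for symbol, row in count.items()}

def pr(text, profile):
    """Probability of `text` under `profile`: product of per-position entries."""
    p = 1.0
    for j, symbol in enumerate(text):
        p *= profile[symbol][j]
    return p

motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
pattern = "CACGAA"  # 'A' never appears at the fifth position of the motifs

p_plain = pr(pattern, profile_matrix(motifs))
p_pseudo = pr(pattern, profile_matrix(motifs, pseudocount=1))
print(p_plain)   # 0.0
print(p_pseudo)  # small but nonzero (roughly 0.00076)
```

Without pseudocounts the single unseen symbol wipes out the whole product. With them, the string keeps a small probability that still reflects how well its other positions match.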

Code Challenge (2 points): Write a function CountWithPseudocounts(Motifs) that takes a list of strings Motifs as input and returns the count matrix of Motifs with pseudocounts as a dictionary of lists. Then add this function to Motifs.py. (Hint: how can you solve this problem by making a very small change to your original function Count(Motifs)?)


Sample Input:

AACGTA
CCCGTT
CACCTT
GGATTA
TTCCGG

Sample Output:

{'T': [2, 2, 1, 2, 5, 3], 'A': [2, 3, 2, 1, 1, 3], 'C': [3, 2, 5, 3, 1, 1], 'G': [2, 2, 1, 3, 2, 2]}

```python
def CountWithPseudocounts(Motifs):
    # Start every entry at 1 (the pseudocount) instead of 0.
    count = {}
    k = len(Motifs[0])
    for symbol in "ACGT":
        count[symbol] = [1] * k

    # Add the observed occurrences of each symbol at each position.
    for motif in Motifs:
        for j in range(k):
            count[motif[j]][j] += 1

    return count

```



Code Challenge (3 points): Now that you have written a function CountWithPseudocounts(Motifs), write a function ProfileWithPseudocounts(Motifs) that takes a list of strings Motifs as input and returns the profile matrix of Motifs with pseudocounts as a dictionary of lists. Then add this function to Motifs.py. Make sure to use CountWithPseudocounts(Motifs) as a subroutine!


Sample Input:

AACGTA
CCCGTT
CACCTT
GGATTA
TTCCGG

Sample Output:

{'A': [0.2222222222222222, 0.3333333333333333, 0.2222222222222222, 0.1111111111111111, 0.1111111111111111, 0.3333333333333333], 'T': [0.2222222222222222, 0.2222222222222222, 0.1111111111111111, 0.2222222222222222, 0.5555555555555556, 0.3333333333333333], 'G': [0.2222222222222222, 0.2222222222222222, 0.1111111111111111, 0.3333333333333333, 0.2222222222222222, 0.2222222222222222], 'C': [0.3333333333333333, 0.2222222222222222, 0.5555555555555556, 0.3333333333333333, 0.1111111111111111, 0.1111111111111111]}

```python
def ProfileWithPseudocounts(Motifs):
    profile = {}
    count = CountWithPseudocounts(Motifs)
    k = len(Motifs[0])
    for symbol in "ACGT":
        profile[symbol] = []
        for j in range(k):
            # Each column sums to t + 4: t observed symbols plus one pseudocount per nucleotide.
            total = count["A"][j] + count["C"][j] + count["G"][j] + count["T"][j]
            profile[symbol].append(count[symbol][j] / total)

    return profile

```

STOP and Think: How would you use Laplace’s Rule of Succession to address the shortcomings of GreedyMotifSearch? 



### An improved greedy motif search

<img align="center" style="padding-right:10px;" src="figures/fig57.png">

<img align="center" style="padding-right:10px;" src="figures/fig58.png">

<img align="center" style="padding-right:10px;" src="figures/fig59.png">

<img align="center" style="padding-right:10px;" src="figures/fig60.png">
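One way to see why pseudocounts help the greedy search in particular: early in each iteration the profile is built from a single motif, so without pseudocounts every k-mer that is not an exact copy of that motif gets probability zero. The sketch below checks this for a profile built from the one motif "TTC". The helper names are hypothetical, and exact fractions are used only to keep the numbers readable.

```python
from fractions import Fraction

def profile_from_motifs(motifs, pseudocount):
    """Profile as exact fractions: (count + pseudocount) / (t + 4 * pseudocount)."""
    k = len(motifs[0])
    total = len(motifs) + 4 * pseudocount
    profile = {}
    for symbol in "ACGT":
        profile[symbol] = [
            Fraction(sum(m[j] == symbol for m in motifs) + pseudocount, total)
            for j in range(k)
        ]
    return profile

def pr(text, profile):
    """Probability of `text` under `profile`."""
    p = Fraction(1)
    for j, symbol in enumerate(text):
        p *= profile[symbol][j]
    return p

plain = profile_from_motifs(["TTC"], pseudocount=0)
smoothed = profile_from_motifs(["TTC"], pseudocount=1)

# TTC matches exactly, ATC has one mismatch, AAG has three mismatches.
for candidate in ["TTC", "ATC", "AAG"]:
    print(candidate, pr(candidate, plain), pr(candidate, smoothed))
```

Without pseudocounts, "ATC" and "AAG" both score 0, so the search cannot tell a near match from a poor one. With pseudocounts they score 4/125 and 1/125, and the near match wins.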



Code Challenge (3 points): Write a function GreedyMotifSearchWithPseudocounts(Dna, k, t) that takes a list of strings Dna followed by integers k and t and returns the result of running GreedyMotifSearch, where each profile matrix is generated with pseudocounts. Then add this function to Motifs.py. (Hint: Ideally, you should only need an extremely small modification to your original GreedyMotifSearch function.)


Sample Input:

3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG

Sample Output:

TTC
ATC
TTC
ATC
TTC

```python
# Input:  A list of strings Dna, and integers k and t (where t is the number of strings in Dna)
# Output: The best motifs found by GreedyMotifSearch(Dna, k, t) when every profile is built with pseudocounts
def GreedyMotifSearchWithPseudocounts(Dna, k, t):
    BestMotifs = []
    for i in range(0, t):
        BestMotifs.append(Dna[i][0:k])

    n = len(Dna[0])
    for i in range(n-k+1):
        Motifs = []
        Motifs.append(Dna[0][i:i+k])
        for j in range(1, t):
            P = ProfileWithPseudocounts(Motifs[0:j])
            Motifs.append(ProfileMostProbablePattern(Dna[j], k, P))
        if Score(Motifs) < Score(BestMotifs):
            BestMotifs = Motifs

    return BestMotifs

# Copy all needed subroutines here.  These subroutines are the same used by GreedyMotifSearch(),
# except that you should replace Count(Motifs) and Profile(Motifs) with the new functions
# CountWithPseudocounts(Motifs) and ProfileWithPseudocounts(Motifs).
def ProfileMostProbablePattern(Text, k, Profile):
    """ Why pMax starts at -1
    (taken from Cris Lawrence's comment on the assignment):
    I too did this in PyCharm and got the expected answer but failed the test until
    I also set p to -1 initially as someone else did.
    I guess the issue is that by setting p = 0 for starters,
    you could wind up with just an empty string if all possibilities
    occur with 0 probability.  The algorithm spec says return the first best.
    If all have 0 probability, that would be the first 5-mer in the sequence.
    Setting the probability to less than 0 guarantees you return something.
    I guess that's why an initial value for p of 0 doesn't work. """
    pMax = -1
    mostProbPattern = ""
    for i in range(0, len(Text)-k + 1):
        pattern = Text[i:i+k]
        p = Pr(pattern, Profile )
        if p > pMax:
            mostProbPattern = pattern
            pMax = p

    return mostProbPattern

def Pr(Text, Profile):
    p = 1
    for i in range(len(Text)):
        p *= Profile[Text[i]][i]
    return p

def Score(Motifs):
    score = 0
    k = len(Motifs[0])
    t = len(Motifs)
    consensus = Consensus(Motifs)

    for j in range(k):
        cont = 0
        for i in range(t):
            if Motifs[i][j] != consensus[j]:
                cont += 1
        score += cont

    return score

def CountWithPseudocounts(Motifs):
    count = {}
    k = len(Motifs[0])
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
            count[symbol].append(1)

    t = len(Motifs)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1

    return count



def ProfileWithPseudocounts(Motifs):
    profile = {}
    count = CountWithPseudocounts(Motifs)
    k = len(Motifs[0])
    for symbol in "ACGT":
        profile[symbol] = []
        for j in range(k):
            total = count["A"][j] + count["C"][j] + count["G"][j] + count["T"][j]
            profile[symbol].append(count[symbol][j]/total)

    return profile

def Consensus(Motifs):
    k = len(Motifs[0])
    count = CountWithPseudocounts(Motifs)
    consensus = ""
    for j in range(k):
        m = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if count[symbol][j] > m:
                m = count[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol

    return consensus
```





Exercise Break (1 point): Apply GreedyMotifSearch with pseudocounts to find motifs in the DosR dataset with k-mer length equal to 15 (click [here](dnas/DosR.txt) to download).

Sample Input:

3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG

Sample Output:

['TTC', 'ATC', 'TTC', 'ATC', 'TTC']
2

Below is the best set of Motifs in the DosR dataset with k = 15 using GreedyMotifSearch with pseudocounts (with a score of 35):
GGACTTCAGGCCCTA



You have now seen the power of pseudocounts illustrated on a small example. Running GreedyMotifSearch with pseudocounts to solve the Subtle Motif Problem returns a collection of 15-mers Motifs with Score(Motifs) = 41 and Consensus(Motifs) = "AAAAAtAgaGGGGtt". Thus, Laplace’s Rule of Succession has provided a significant improvement over the original GreedyMotifSearch, which returned the consensus string "gttAAAtAgaGatGtG" with Score(Motifs) = 58.

You may be satisfied with the performance of GreedyMotifSearch, but you should know by now that your authors are never satisfied. Can we design an even more accurate motif finding algorithm?



