# Consensus and Profile

## Problem

A matrix is a rectangular table of values divided into rows and columns. An $m×n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.

Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4×n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on (see below).

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

                  A T C C A G C T
                  G G G C A A C T
                  A T G G A T C T
    DNA Strings   A A G C A A C C
                  T T G G A A C T
                  A T G C C A T T
                  A T G G C A C T

              A   5 1 0 0 5 5 0 0
    Profile   C   0 0 1 4 2 0 6 1
              G   1 1 6 3 0 1 0 0
              T   1 5 0 0 0 1 1 6

    Consensus     A T G C A A C T
        
        
_Given_: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

_Return_: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

**Sample Dataset**

        >Rosalind_1
        ATCCAGCT
        >Rosalind_2
        GGGCAACT
        >Rosalind_3
        ATGGATCT
        >Rosalind_4
        AAGCAACC
        >Rosalind_5
        TTGGAACT
        >Rosalind_6
        ATGCCATT
        >Rosalind_7
        ATGGCACT
        
**Sample Output**

        ATGCAACT
        A: 5 1 0 0 5 5 0 0
        C: 0 0 1 4 2 0 6 1
        G: 1 1 6 3 0 1 0 0
        T: 1 5 0 0 0 1 1 6

_______________________
## Solution

Starting with this notebook, we define modular, general-purpose functions that we can reuse in later problems. First, a helper function, 'number_to_symbol', maps the integers 0 to 3 to the four bases 'ACGT'. Next, 'get_profile' takes a list of sequences and returns their profile matrix, optionally normalized (that is, with counts converted to per-column frequencies), which will be useful for other problems. Finally, 'consensus_string' takes a profile, normalized or not, and returns the consensus sequence. As in previous notebooks, we also explore Biopython's built-in functionality that achieves similar results.

We start by reading in the sample data, and creating the simple 'number_to_symbol' helper function:

In [1]:
sample_string = """
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""
sample = []
for x in sample_string.split('>')[1:]:
    sample.append(''.join(x.split('\n')[1:]))

# Helper function mapping (0,1,2,3) to ('A','C','G','T')
def number_to_symbol(n):
    if n == 0:
        return 'A'
    elif n == 1:
        return 'C'
    elif n == 2:
        return 'G'
    else:
        return 'T'
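
The parsing loop in the cell above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the name 'parse_fasta' and the demo input are assumptions made for illustration. It shows that sequences wrapped over several lines are joined back together.

```python
def parse_fasta(text):
    """Return the sequences found in a FASTA-formatted string, in order."""
    sequences = []
    for record in text.split('>')[1:]:
        # The first line of each record is the header; the rest is sequence,
        # possibly wrapped over several lines.
        lines = record.splitlines()
        sequences.append(''.join(line.strip() for line in lines[1:]))
    return sequences

# Hypothetical input: the first record is wrapped over two lines
demo = """>Rosalind_1
ATCC
AGCT
>Rosalind_2
GGGCAACT
"""
print(parse_fasta(demo))  # ['ATCCAGCT', 'GGGCAACT']
```

Splitting on '>' discards the record ids; if the ids are needed later, the header line could be returned alongside each sequence.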

Now we move on to the two main functions. The 'get_profile' function takes a list of sequences and a boolean indicating whether the output should be normalized. It initializes a $4×n$ matrix of zeros, $n$ being the number of nucleotides per sequence. Reading the input list column by column, we fill this matrix with the count of each nucleotide, using the 'number_to_symbol' helper function to keep the solution concise. When normalization is requested, each count is divided by the number of sequences.

In [2]:
import numpy as np

def get_profile(seqs, normalize = True):
    seq_length = len(seqs[0])
    # Initialize profile matrix with zeros
    profile = np.zeros((4,seq_length))
    # Divide counts by the number of sequences only when normalizing
    denom = len(seqs) if normalize else 1
    # Replace values with appropriate column counts
    for i in range(seq_length):
        col = [row[i] for row in seqs]
        for j in range(4):
            profile[j][i] = col.count(number_to_symbol(j))/denom
    return profile

print('Profile without normalization: \n', get_profile(sample,normalize=False).astype(int))
np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})
print('\nProfile with normalization: \n', get_profile(sample))
np.set_printoptions()

Profile without normalization: 
 [[5 1 0 0 5 5 0 0]
 [0 0 1 4 2 0 6 1]
 [1 1 6 3 0 1 0 0]
 [1 5 0 0 0 1 1 6]]

Profile with normalization: 
 [[0.71 0.14 0.00 0.00 0.71 0.71 0.00 0.00]
 [0.00 0.00 0.14 0.57 0.29 0.00 0.86 0.14]
 [0.14 0.14 0.86 0.43 0.00 0.14 0.00 0.00]
 [0.14 0.71 0.00 0.00 0.00 0.14 0.14 0.86]]
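
For comparison, the same counts can be obtained without NumPy by transposing the sequences with 'zip'. This is only a sketch under stated assumptions: the function name 'count_profile' and its dictionary return type are hypothetical choices, not part of the notebook's API.

```python
def count_profile(seqs, alphabet='ACGT'):
    """Return {symbol: [count at position 0, count at position 1, ...]}."""
    if len(set(len(s) for s in seqs)) > 1:
        raise ValueError('all sequences must have the same length')
    # zip(*seqs) yields one tuple per column of the collection
    columns = list(zip(*seqs))
    return {sym: [col.count(sym) for col in columns] for sym in alphabet}

demo_seqs = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC',
             'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
counts = count_profile(demo_seqs)
for sym in 'ACGT':
    print(sym, counts[sym])
```

On the sample data this should reproduce the unnormalized profile printed above, one list per base.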


The 'consensus_string' function takes a profile as its argument and returns the consensus string, once again using the helper function for readability. When several symbols tie for the maximum in a column, the first one in the order A, C, G, T is chosen.

In [3]:
def consensus_string(profile):
    consensus = ''
    seq_length = len(profile[0])
    for i in range(seq_length):
        col = [row[i] for row in profile]
        consensus += number_to_symbol(col.index(max(col)))
    return consensus

profile = get_profile(sample, normalize=False)
print('Consensus string: %s' % consensus_string(profile))


Consensus string: ATGCAACT
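
The problem statement notes that ties can produce several valid consensus strings, of which any one is accepted. As an illustration only, the sketch below enumerates all of them; the function 'all_consensus_strings' and the two-column demo profile are hypothetical and not part of the original solution.

```python
from itertools import product

def all_consensus_strings(counts, alphabet='ACGT'):
    """Return every consensus string allowed by ties in a count profile.

    counts is a sequence of rows (one per symbol of alphabet), each row
    holding the per-position counts.
    """
    choices = []
    for col in zip(*counts):
        best = max(col)
        # Keep every symbol whose count equals the column maximum
        choices.append([alphabet[k] for k, v in enumerate(col) if v == best])
    return [''.join(p) for p in product(*choices)]

# Hypothetical two-column profile for the strings 'AC' and 'GC'
demo_counts = [[1, 0],   # A
               [0, 2],   # C
               [1, 0],   # G
               [0, 0]]   # T
print(all_consensus_strings(demo_counts))  # ['AC', 'GC']
```

The number of results is the product of the tie sizes over all columns, so this is practical only for profiles with few tied positions.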


We leave the return types of both functions unchanged so that they can be reused later. To present the results in the required format, a little manipulation is needed:

In [4]:
profile = get_profile(sample, normalize=False)
ans_str = consensus_string(profile) + '\n'
for i in range(4):
    ans_str += number_to_symbol(i) + ': '
    for x in profile[i]:
        ans_str += str(int(x)) + ' '
    ans_str += '\n'
print('Output in the desired format: \n%s' %ans_str)        

Output in the desired format: 
ATGCAACT
A: 5 1 0 0 5 5 0 0 
C: 0 0 1 4 2 0 6 1 
G: 1 1 6 3 0 1 0 0 
T: 1 5 0 0 0 1 1 6 
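
The loop above leaves a trailing space at the end of each profile line. If that matters for submission, one possible variant builds each line with 'join' instead. This is a sketch: the name 'format_answer' is hypothetical, and the counts are written out as plain lists so that the example stands alone.

```python
def format_answer(consensus, counts, alphabet='ACGT'):
    """Build the answer text: the consensus, then one 'X: n n n' line per symbol."""
    lines = [consensus]
    for sym, row in zip(alphabet, counts):
        # int() makes float counts (e.g. from a NumPy array) print as 5, not 5.0
        lines.append(sym + ': ' + ' '.join(str(int(x)) for x in row))
    return '\n'.join(lines)

demo_counts = [[5, 1, 0, 0, 5, 5, 0, 0],
               [0, 0, 1, 4, 2, 0, 6, 1],
               [1, 1, 6, 3, 0, 1, 0, 0],
               [1, 5, 0, 0, 0, 1, 1, 6]]
print(format_answer('ATGCAACT', demo_counts))
```

Because the function only iterates over rows, it should accept the array returned by 'get_profile' with normalize=False as well as nested lists.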



____________
We now turn to Biopython. As we have seen in previous notebooks, Biopython provides convenient methods for reading FASTA (and many other) formats. We start by parsing the FASTA file into an iterator of records, each consisting of an id and a sequence. We then extract the sequences and pass them as instances to create a motif object. This object exposes the results we are looking for, such as a consensus string and a counts matrix:

In [5]:
from Bio import motifs
from Bio import SeqIO

# Note: Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed here
seqs = SeqIO.parse('sample_cons.txt', 'fasta')
instances = [x.seq for x in seqs]
m = motifs.create(instances)
print('Consensus string: \n%s' % m.consensus)
print('\nProfile: \n',m.counts)

Consensus string: 
ATGCAACT

Profile: 
         0      1      2      3      4      5      6      7
A:   5.00   1.00   0.00   0.00   5.00   5.00   0.00   0.00
C:   0.00   0.00   1.00   4.00   2.00   0.00   6.00   1.00
G:   1.00   1.00   6.00   3.00   0.00   1.00   0.00   0.00
T:   1.00   5.00   0.00   0.00   0.00   1.00   1.00   6.00

