# [Consensus and Profile](http://rosalind.info/problems/cons/)
![Rosalind](logo.jpg)

## Problem
A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.
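As a small illustration of this notation, the sketch below stores a matrix as a Python list of rows. The matrix values and the helper name `entry` are made up for the example; the one point worth noting is that the mathematical indices start at 1 while Python indices start at 0.

```python
# A 2 x 3 matrix stored as a list of rows (values are arbitrary, for illustration only).
A = [
    [1, 2, 3],
    [4, 5, 6],
]

m = len(A)       # number of rows
n = len(A[0])    # number of columns

def entry(matrix, i, j):
    """Return A_{i,j} using the 1-based indices of the mathematical notation."""
    return matrix[i - 1][j - 1]

print(m, n)            # 2 3
print(entry(A, 2, 3))  # 6
```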

Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on (see below).
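One way to read this definition is column by column: transpose the collection, then count each symbol in every column. The following is a minimal sketch under the assumption that all strings have equal length and contain only A, C, G and T; the function name `profile_by_columns` is hypothetical and is not the implementation used later in this notebook.

```python
def profile_by_columns(dna_strings):
    """Return {'A': [...], 'C': [...], 'G': [...], 'T': [...]} of per-position counts."""
    # zip(*strings) transposes the collection: column j holds the j-th symbol of every string
    columns = list(zip(*dna_strings))
    return {nt: [col.count(nt) for col in columns] for nt in "ACGT"}

strings = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
           "TTGGAACT", "ATGCCATT", "ATGGCACT"]
P = profile_by_columns(strings)
print(P["A"])  # [5, 1, 0, 0, 5, 5, 0, 0]
```

Row $P_{1,j}$ of the definition corresponds to `P["A"][j - 1]` here, since the dictionary is keyed by symbol and Python lists are 0-based.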

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
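To make the remark about ties concrete, here is a hedged sketch that lists every consensus string a profile allows. The function name `all_consensus` and the tiny tied profile are hypothetical, chosen only for illustration; the problem itself asks for just one consensus string, and the number of possibilities can grow quickly when many columns are tied.

```python
from itertools import product

def all_consensus(profile):
    """Return every consensus string allowed by a profile given as a dict of count lists."""
    n = len(profile["A"])
    choices = []
    for j in range(n):
        best = max(profile[nt][j] for nt in "ACGT")
        # Keep every symbol that reaches the column maximum (more than one on a tie)
        choices.append([nt for nt in "ACGT" if profile[nt][j] == best])
    return ["".join(p) for p in product(*choices)]

# Hypothetical profile of the two strings "AC" and "GC": position 1 is a tie between A and G.
tied = {"A": [1, 0], "C": [0, 2], "G": [1, 0], "T": [0, 0]}
print(all_consensus(tied))  # ['AC', 'GC']
```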

```
                            A T C C A G C T
                            G G G C A A C T
                            A T G G A T C T
        DNA Strings         A A G C A A C C
                            T T G G A A C T
                            A T G C C A T T
                            A T G G C A C T
                            ---------------
                        *A* 5 1 0 0 5 5 0 0
        Profile         *C* 0 0 1 4 2 0 6 1
                        *G* 1 1 6 3 0 1 0 0
                        *T* 1 5 0 0 0 1 1 6
                            ---------------
        Consensus           A T G C A A C T
```

#### Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

#### Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

### Sample Dataset
```
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
```
### Sample Output
```
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
```

In [4]:
def ReadFASTA(filename):
    '''Extract sequence names and sequences from a FASTA text file
            Input   : filename
            Output  : dict with sequence name as key and FASTA sequence as value
    '''
    FASTA = dict()
    seqname = None
    with open(filename) as file:
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                # Header line: start a new record named after the text following '>'
                seqname = line[1:]
                FASTA[seqname] = ''
            elif line and seqname is not None:
                # Sequence line: append to the current record (a sequence may wrap over several lines)
                FASTA[seqname] += line
    return FASTA
    
fasta = ReadFASTA('data/rosalind_cons.txt')
dna_list = list(fasta.values())
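For trying the parsing logic without a data file, the sketch below applies the same idea to in-memory text through `io.StringIO`. The function name `parse_fasta` and the two-record input are assumptions made for this example; the first record is deliberately split over two lines to show that wrapped sequences are joined.

```python
import io

def parse_fasta(handle):
    """Parse FASTA records from any iterable of lines into {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            name = line[1:]           # header: text after '>' becomes the record name
            records[name] = ""
        elif name is not None:
            records[name] += line     # sequence lines are concatenated

    return records

text = ">Rosalind_1\nATCC\nAGCT\n>Rosalind_2\nGGGCAACT\n"
print(parse_fasta(io.StringIO(text)))
# {'Rosalind_1': 'ATCCAGCT', 'Rosalind_2': 'GGGCAACT'}
```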


In [17]:
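from numpy import zeros

def profile(dna_strings):
    # NumPy variant: returns a 4 x n integer array with rows ordered A, C, G, T.
    # The dict-based profile() in the next cell redefines this name, and the
    # remaining cells (consensus and output formatting) expect that dict version.
    matrix = zeros((4, len(dna_strings[0])), dtype=int)
    nt_index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    for dna in dna_strings:
        for index, nt in enumerate(dna):
            matrix[nt_index[nt], index] += 1
    return matrix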
                        

In [6]:
def profile(dna_strings):
    # Dict variant: maps each nucleotide to its list of per-position counts
    n = len(dna_strings[0])
    matrix = {key: [0] * n for key in 'ACGT'}
    for dna in dna_strings:
        for index, nt in enumerate(dna):
            matrix[nt][index] += 1
    return matrix

matrix = profile(dna_list)
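A quick sanity check on a profile is that every column sums to the number of input strings, because each string contributes exactly one symbol per position. The sketch below assumes the dict layout used above and strings made only of A, C, G and T; the helper name `column_sums_ok` and the toy data are hypothetical.

```python
def column_sums_ok(dna_strings, matrix):
    """Check that each profile column sums to the number of input strings."""
    n_strings = len(dna_strings)
    n_cols = len(dna_strings[0])
    return all(
        sum(matrix[nt][j] for nt in "ACGT") == n_strings
        for j in range(n_cols)
    )

toy_strings = ["ACGT", "ACGA"]
toy_matrix = {"A": [2, 0, 0, 1], "C": [0, 2, 0, 0], "G": [0, 0, 2, 0], "T": [0, 0, 0, 1]}
print(column_sums_ok(toy_strings, toy_matrix))  # True
```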

In [16]:
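def consensus(matrix):
    con_seq = ""
    for i in range(len(matrix['A'])):
        # Track the most frequent nucleotide seen so far at position i;
        # the strict '>' keeps the first of A, C, G, T when counts tie
        base, count = '', -1
        for nt in 'ACGT':
            if matrix[nt][i] > count:
                base, count = nt, matrix[nt][i]
        con_seq += base
    return con_seq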

sequence = consensus(matrix)

In [18]:
# Format the counts as "A: 5 1 0 ..." lines, one per nucleotide
output = [sequence]
for nt in 'ACGT':
    output.append(nt + ': ' + ' '.join(str(val) for val in matrix[nt]))

# Print and write the output
print('\n'.join(output))
with open('output/010_CONS.txt', 'w') as output_data:
    output_data.write('\n'.join(output))

AAGACAGCGCCTTCACAAAACCACAGGGCCTCATGGGCACTCGACGGGACGAAGGCGGATAAGTTTAGGCCCGAAACCAGGCGAACAAATTCCCCCTCGTAGATAGAAAAGAAGCAACCCTATCTGCCGAGCAGAGAGCAACATTCCTAATCGCCAGAGGGGTCCCGACGTGCTCTTATAACTATAGAAACGGACCAAAAAAAACGCCTATACAATTTCCGAACCAATCCAATGGACTATCCTGTTTAAGCGCGCAACGTTATTGACCGGAGAAAGATCCACAACAGGGTATCCTCGCGGGCACACACAGATGGTTTAGTCTCGCGCACACCTACTTTATACCCCATCGCTCTCCAAAATAAACGACGGGACTCACTTTGCTGCGAATAACAACTTGAGTAAGCGAGCATGACCCAGCTGGGAAAACAAACAGCAAGATTACACCGAACCCCGCGATGAAAAGACAAGGAAGGCCACAAGACACCGTGGAAGCGCAAGAGTGTGACAGACCCTCCCAAACAAACTGACACCGGAAACGCACAAATAACATTACGAGGGCGTCCGAGTCAAGCCCCTTATCCAGTACTCGGACAAGATACCCGTTAACAATAGGAACTGAATCTTTTGACAATGGCACACAGCACCACAAGTGCTTAAGACAAGCATACTGAGGCACCCAAGGCGAGAGCTCACAAAATACATTTAATTATAGCCAAGCAAACCTAAGAATCGGGGGCCCCGAGTCAACATATCTGAGACGGCAACTATAGCGCTGGCATAGTAGAAACTAAGACAAAACCCCGGGGCGTGACCGTCCACGCGTGAGTTGCGGAAAATCAGGGCAAGCAGGACAAGATCCAATACAAAAAAGTCCAGCCCCTTTCAAACTATCTAAGTTATGCCTATGCCGACTTGGTCAATGCGAATATCGACGATGAATAAACAATTTGAGTCCAGCGAAGGGAACTGCCCCCTCTAGAATTTGAAGCGTAAGCTAG
A