CONS - Consensus and Profile
Finding a Most Likely Common Ancestor
A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.

Say that we have a collection of DNA strings, all of the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that 'A' occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the $j$th position, and so on for 'G' and 'T' in rows 3 and 4 (see below).
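
As a minimal sketch of this definition, the snippet below builds a profile for a made-up collection of three strings. The `strings` list and the use of `collections.Counter` are illustrative assumptions, not part of the original solution, and positions in the code are 0-based whereas the text counts from 1.

```python
from collections import Counter

# Hypothetical toy collection; all strings share the same length n = 4.
strings = ["ACGT", "ACGA", "TCGT"]

# zip(*strings) yields one tuple per position (column) across all strings.
columns = [Counter(column) for column in zip(*strings)]

# profile[symbol][j] = number of strings with `symbol` at position j (0-based).
profile = {symbol: [col[symbol] for col in columns] for symbol in "ACGT"}

for symbol, counts in profile.items():
    print(symbol + ":", *counts)
# A: 2 0 0 1
# C: 0 3 0 0
# G: 0 0 3 0
# T: 1 0 0 2
```

Each column of the printed matrix sums to 3, the number of strings, which is a quick sanity check on any profile.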

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
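
To illustrate how ties produce several consensus strings, here is a small sketch that enumerates all of them for a made-up profile. The helper name `all_consensus` and the sample `profile` are hypothetical. Rosalind accepts any single consensus string, so this enumeration is only for illustration.

```python
from itertools import product

# Hypothetical profile for the two strings "AC" and "AT": the second position is a tie.
profile = {"A": [2, 0], "C": [0, 1], "G": [0, 0], "T": [0, 1]}

def all_consensus(profile):
    """Return every consensus string allowed by `profile`."""
    n = len(next(iter(profile.values())))
    choices = []
    for j in range(n):
        best = max(counts[j] for counts in profile.values())
        # Keep every symbol that reaches the column maximum.
        choices.append([s for s, counts in profile.items() if counts[j] == best])
    # One consensus string per combination of tied symbols.
    return ["".join(combo) for combo in product(*choices)]

print(all_consensus(profile))  # ['AC', 'AT']
```

The number of consensus strings is the product of the number of tied symbols in each column, so it can grow quickly for long strings with many ties.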

In [None]:
from collections import OrderedDict

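def crt_fasta(file):
    """Yield each sequence in the FASTA file at path `file` as one string."""
    with open(file) as fin:
        parts = None                      # lines of the record being read
        for line in fin:
            line = line.strip()
            if line.startswith('>'):      # header line: emit the previous record
                if parts is not None:
                    yield ''.join(parts)
                parts = []
            elif line and parts is not None:
                parts.append(line)        # sequence line of the current record
        if parts is not None:             # emit the last record at end of file
            yield ''.join(parts)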
                
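def matrix_profile(seqs):
    """Count how often each nucleotide occurs at each position of `seqs`."""
    mat_prf = OrderedDict.fromkeys(('A', 'C', 'G', 'T'))
    for seq in seqs:
        if mat_prf['A'] is None:          # the first sequence fixes the length n
            for key in mat_prf:
                mat_prf[key] = [0] * len(seq)
        for index, char in enumerate(seq):
            mat_prf[char][index] += 1     # assumes only A, C, G, T and equal lengths

    return mat_prf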

def con_str(mat_prf):
    """Build one consensus string from the profile `mat_prf`."""
    op = []
    for index in range(len(mat_prf['A'])):
        max_repeats, max_char = 0, ''              # reset the running maximum for this column
        for char in mat_prf:                       # cycle over the keys A, C, G, T
            if mat_prf[char][index] >= max_repeats:  # '>=' means a tie goes to the later key
                max_repeats = mat_prf[char][index]
                max_char = char
        op.append(max_char)                        # append the most frequent char for this column

    return ''.join(op)                             # join the chars into a single string

file_name = 'rosalind_cons.txt'

m_fasta_seq = crt_fasta(file_name)
mat_profile = matrix_profile(m_fasta_seq)
consensus_string = con_str(mat_profile)
print(consensus_string)
    
for key, chars in mat_profile.items():
    print('{}: {}'.format(key, ' '.join(str(char) for char in chars)))
