In [1]:
import numpy as np
from Bio import SeqIO

## Reading the fasta files

The file Promoters.fasta contains the Registry parts we want to check, plus one part (the last one) that certainly has an RBS.

In [2]:
seqs = []
for seq_record in SeqIO.parse("Promoters.fasta", "fasta"):
    print 'Reading ' + str(seq_record.id) + "..."
    seqs.append( seq_record ) 

Reading BBa_R0010...
Reading BBa_K914003...
Reading BBa_R0040...
Reading BBa_K876002...
Reading BBa_K611012...


Just an example of a sequence and its id.

In [3]:
print seqs[0].id
print seqs[0].seq

BBa_R0010
caatacgcaaaccgcctctccccgcgcgttggccgattcattaatgcagctggcacgacaggtttcccgactggaaagcgggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatttcacaca


We now read all possible RBSs.

In [4]:
SetPOssB = [29, 30, 31, 32, 33, 34, 35, 64]
SetPOssJ = range(0,40)

RBSseqs = []

for seq_record in SeqIO.parse("../Registry_AllParts.fasta", "fasta"):
    if ( str(seq_record.id)[:-2] == "BBa_B00" and int( str(seq_record.id)[-2:] ) in SetPOssB ): 
        RBSseqs.append( seq_record )
    elif ( str(seq_record.id)[:-2] == "BBa_J611" and int( str(seq_record.id)[-2:] ) in SetPOssJ ) :
        RBSseqs.append( seq_record )
    elif ( seq_record.id == "BBa_J01010" or seq_record.id == "BBa_J01080" ):
        RBSseqs.append( seq_record )

## Strict comparison

Comparing with the pLacI part, BBa_R0010 (seqs[0]):

In [5]:
idx = []
for j in range(len(RBSseqs)):
    if RBSseqs[j].seq in seqs[0].seq : idx.append(j)

print idx

[]


Same thing with the pRham part, BBa_K914003 (seqs[1]):

In [6]:
idx = []
for j in range(len(RBSseqs)):
    if RBSseqs[j].seq in seqs[1].seq : idx.append(j)

print idx

[]


Next, the third and fourth parts, BBa_R0040 and BBa_K876002:

In [7]:
idx = []
for j in range(len(RBSseqs)):
    if RBSseqs[j].seq in seqs[2].seq : idx.append(j)

print idx

[]


In [9]:
idx = []
for j in range(len(RBSseqs)):
    if RBSseqs[j].seq in seqs[3].seq : idx.append(j)

print idx

[]


None of the parts above contains a substring that exactly matches any of the ribosome binding sites available in the Registry. The last part is our sanity check:

In [10]:
idx = []
for j in range(len(RBSseqs)):
    if RBSseqs[j].seq in seqs[4].seq : idx.append(j)

print idx

[5]


We found an RBS. Let's look up its id:

In [11]:
for j in idx: print RBSseqs[j].id

BBa_B0034


The code above is general: if a part matches more than one RBS, all of them are displayed. However, the empty results only mean that none of those RBSs appears verbatim in the other parts. A similar sequence may still be present there. Let's try aligning them.
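The strict-comparison cells above repeat the same loop once per part. As a minimal sketch (not the notebook's own code), the loop can be wrapped in a helper that checks every part against every RBS in one pass. The name `find_exact_matches` and the toy sequences are hypothetical, and plain strings stand in for the Biopython records; with the real data one would pass `str(record.seq)`.

```python
def find_exact_matches(parts, rbs_seqs):
    """Return {part_id: [rbs_id, ...]} for RBSs found verbatim in each part."""
    hits = {}
    for part_id, part_seq in parts.items():
        hits[part_id] = [rbs_id for rbs_id, rbs_seq in rbs_seqs.items()
                         if rbs_seq in part_seq]
    return hits

# Toy data (hypothetical, not taken from the Registry files)
parts = {"part_A": "ccgtaaagaggagaaattgc", "part_B": "ccgtacgtacgtacgt"}
rbs_seqs = {"rbs_1": "aaagaggagaaa", "rbs_2": "tcacacagg"}

hits = find_exact_matches(parts, rbs_seqs)
for part_id in sorted(hits):
    print(part_id, hits[part_id])
```

An empty list for a part means the same thing as the empty `idx` above: no RBS occurs in it as an exact substring.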

## Constructing an alignment algorithm

In [20]:
one_sequence = 'actgatcgattgatcgatcgatcg'
another_sequence   = 'tttagatcgatctttgatc'
 
# a match is defined by five pieces of information: the two sequences,
# a start position in each, and the length of the compared stretch
def score_match(subject, query, subject_start, query_start, length):
    score = 0
    # for each base in the match
    for i in range(0,length):
        # first figure out the matching base from both sequences
        subject_base = subject[subject_start + i]
        query_base = query[query_start + i]
        # then adjust the score up or down depending on 
        # whether or not they are the same
        if subject_base == query_base:
            score = score + 1
        else:
            score = score - 1
    return score
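
As a quick check of the scoring rule, the following sketch applies it to the two toy sequences defined above. `score_match` is restated compactly so the block runs on its own; the start positions and lengths are arbitrary choices for illustration.

```python
def score_match(subject, query, subject_start, query_start, length):
    # +1 for every matching base, -1 for every mismatch
    return sum(1 if subject[subject_start + i] == query[query_start + i] else -1
               for i in range(length))

one_sequence = 'actgatcgattgatcgatcgatcg'
another_sequence = 'tttagatcgatctttgatc'

# 'gatcgat' against 'gatcgat': seven matches
perfect = score_match(one_sequence, another_sequence, 3, 4, 7)
# 'gatcgattg' against 'gatcgatct': seven matches, two mismatches
extended = score_match(one_sequence, another_sequence, 3, 4, 9)
print(perfect, extended)
```

Extending the window past the shared region lowers the score from 7 to 5, because each mismatch costs as much as a match gains.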

In [21]:
# the arguments are the five bits of information that define a match
def pretty_print_match(subject, query, subject_start, query_start, length):
    
    seqcmp = ""
    
    # first print the start/stop positions for the subject sequence
    seqcmp += str(subject_start) + (' ' * length) + str(subject_start+length) + '\n'
 
    # then print the bit of the subject that matches
    seqcmp += ' ' + subject[subject_start:subject_start+length] + '\n'
 
    # then print the bit of the query that matches
    seqcmp += ' ' + query[query_start:query_start+length] + '\n'
 
    # finally print the start/stop positions for the query
    seqcmp += str(query_start) + (' ' * length) + str(query_start+length) + '\n'
 
    seqcmp += 'n--------------------n'
    
    return seqcmp

In [22]:
def try_all_matches(subject, query, score_limit):
    setMatches = []
    setMatches_seqs = []
    
    qlen = len(query)
    slen = len(subject)
    for subject_start in range(0,slen):
        for query_start in range(0,qlen+1):
            for length in range(qlen-5,qlen+1):
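                # note: the strict '<' skips any window that ends on the last base of
                # either sequence, so a full-length match of the query is never scored
                if (subject_start + length < slen and query_start + length < qlen):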
                    score = score_match(subject, query, subject_start, query_start, length)
                    # only keep the match if the score reaches the limit
                    if (score >= score_limit):
                        setMatches.append( float(score)/qlen )
                        setMatches_seqs.append( pretty_print_match(subject, query, subject_start, query_start, length) )
    
    return np.array( setMatches ), setMatches_seqs
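
The bounds check in `try_all_matches` uses a strict `<`, so a window that ends exactly at the last base of the subject or the query is never scored. The sketch below compares that strict check with an inclusive `<=` on toy data. It is an illustration under assumptions, not a replacement for the cell above: `best_match`, its `inclusive` flag, and both toy sequences are hypothetical.

```python
def score_match(subject, query, subject_start, query_start, length):
    # +1 for every matching base, -1 for every mismatch
    return sum(1 if subject[subject_start + i] == query[query_start + i] else -1
               for i in range(length))

def best_match(subject, query, min_length, inclusive):
    """Return (score, subject_start, query_start, length) of the first best window."""
    slen, qlen = len(subject), len(query)
    best = None
    for subject_start in range(slen):
        for query_start in range(qlen):
            for length in range(min_length, qlen + 1):
                subject_end = subject_start + length
                query_end = query_start + length
                if inclusive:
                    fits = subject_end <= slen and query_end <= qlen
                else:
                    # strict bounds, as in try_all_matches
                    fits = subject_end < slen and query_end < qlen
                if fits:
                    score = score_match(subject, query,
                                        subject_start, query_start, length)
                    if best is None or score > best[0]:
                        best = (score, subject_start, query_start, length)
    return best

subject = 'ccaaagaggagaaacc'   # toy subject (hypothetical)
query = 'aaagaggagaaa'         # toy 12-base query (hypothetical)

strict = best_match(subject, query, 7, inclusive=False)
relaxed = best_match(subject, query, 7, inclusive=True)
print(strict)
print(relaxed)
```

With the strict check the best window covers only 11 of the 12 query bases, so the normalized score tops out at 11/12 even when the whole query is present in the subject. With the inclusive check the full 12-base match is scored.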

## Aligning and checking RBSs

Since we know for sure that seq 5 has an RBS, let's try out our alignment algorithm on it.

In [27]:
setMatches, setMatches_seqs = try_all_matches(seqs[4].seq, RBSseqs[5].seq, 10)

for idx in np.argwhere(setMatches == np.amax(setMatches)):
    print "Score: ", setMatches[idx]
    print setMatches_seqs[idx]

Score:  [ 0.91666667]
0           11
 aaagaggagaa
 aaagaggagaa
0           11
n--------------------n
Score:  [ 0.91666667]
1021           1032
 aaagaggagaa
 aaagaggagaa
0           11
n--------------------n


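The RBS matches around position 1021, which seems reasonable. Only 11 of its 12 bases are reported (score 11/12), even though the strict comparison found the whole RBS in this part: the bounds check in try_all_matches never scores a window that reaches the last base of the query. The match at the very beginning is rather strange, though.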
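
The cell above selects the top-scoring matches with `np.argwhere`, which yields one-element arrays that are then used to index a Python list. Depending on the NumPy version, that kind of indexing may not be accepted. A hedged alternative sketch in plain Python follows; `top_matches` and the sample data are hypothetical.

```python
def top_matches(scores, descriptions):
    """Return (score, description) pairs that share the highest score."""
    if len(scores) == 0:
        return []
    best = max(scores)
    return [(s, d) for s, d in zip(scores, descriptions) if s == best]

# Sample data (hypothetical)
scores = [0.5, 0.75, 0.75, 0.25]
descriptions = ['match a', 'match b', 'match c', 'match d']

for score, text in top_matches(scores, descriptions):
    print("Score:", score)
    print(text)
```

The helper also returns an empty list when no match reached the score limit, which avoids calling a maximum on an empty collection.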

We also know there shouldn't be any other RBS there. Let's try to find matches with RBS 4 instead of 5.

In [29]:
setMatches, setMatches_seqs = try_all_matches(seqs[4].seq, RBSseqs[4].seq, 7)

for idx in np.argwhere(setMatches == np.amax(setMatches)):
    print "Score: ", setMatches[idx]
    print setMatches_seqs[idx]

Score:  [ 0.63636364]
456       463
 tcacaca
 tcacaca
0       7
n--------------------n
Score:  [ 0.63636364]
834         843
 tcacactgg
 tcacacagg
0         9
n--------------------n


The score drops significantly!
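
The normalized score is the raw score divided by the query length, so two alignments of different lengths can tie. Below is a small sketch of the arithmetic for the two alignments printed above. The query length of 11 is an assumption inferred from 7/11 ≈ 0.636, not read from the data, and `raw_score` is a hypothetical helper.

```python
def raw_score(a, b):
    # +1 per matching position, -1 per mismatch (same rule as score_match)
    return sum(1 if x == y else -1 for x, y in zip(a, b))

query_length = 11  # assumption: inferred from the printed score 7/11

short = raw_score('tcacaca', 'tcacaca')       # 7 matches
longer = raw_score('tcacactgg', 'tcacacagg')  # 8 matches, 1 mismatch

normalized = round(short / float(query_length), 8)
print(short, longer)
print(normalized)
```

Both alignments reach a raw score of 7: the longer one gains two extra matches but loses them again to a single mismatch and the base it replaces.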

Now let's try the other sequences.

In [30]:
setMatches, setMatches_seqs = try_all_matches(RBSseqs[5].seq, RBSseqs[4].seq, 7)

setMatches.shape

(0,)

In [31]:
for j in range(len(RBSseqs)):
    setMatches, setMatches_seqs = try_all_matches(seqs[0].seq, RBSseqs[j].seq, 8)
    if setMatches.shape[0] > 0:
        print 'RBS index: ' + str(j)
        for idx in np.argwhere(setMatches == np.amax(setMatches)):
            print "Score: ", setMatches[idx]
            print setMatches_seqs[idx]

In [32]:
for j in range(len(RBSseqs)):
    setMatches, setMatches_seqs = try_all_matches(seqs[1].seq, RBSseqs[j].seq, 8)
    if setMatches.shape[0] > 0:
        print 'RBS index: ' + str(j)
        for idx in np.argwhere(setMatches == np.amax(setMatches)):
            print "Score: ", setMatches[idx]
            print setMatches_seqs[idx]

In [33]:
for j in range(len(RBSseqs)):
    setMatches, setMatches_seqs = try_all_matches(seqs[2].seq, RBSseqs[j].seq, 8)
    if setMatches.shape[0] > 0:
        print 'RBS index: ' + str(j)
        for idx in np.argwhere(setMatches == np.amax(setMatches)):
            print "Score: ", setMatches[idx]
            print setMatches_seqs[idx]

In [44]:
for j in range(len(RBSseqs)):
    setMatches, setMatches_seqs = try_all_matches(seqs[3].seq, RBSseqs[j].seq, 9)
    if setMatches.shape[0] > 0:
        print 'RBS index: ' + str(j)
        for idx in np.argwhere(setMatches == np.amax(setMatches)):
            print "Score: ", setMatches[idx]
            print setMatches_seqs[idx]

RBS index: 23
Score:  [ 0.75]
175           186
 aaaaagtggaa
 aaagagtggaa
0           11
n--------------------n
RBS index: 37
Score:  [ 0.75]
175           186
 aaaaagtggaa
 aaagagtggaa
0           11
n--------------------n


So, did we find something at position 175? Both RBS 23 and RBS 37 align there with a single mismatch.