<h1 id="toctitle">Exercise solutions</h1>
<ul id="toc"/>

## BLAST processor

Let's look at the solutions side by side to see the difference between map/filter and comprehensions. 

First, filtering out comment lines:

In [1]:
# filtering out comment lines using a function and filter
def comment_filter(line): 
    return not line.startswith('#') 
hit_lines = list(filter(comment_filter, open('blast_result.txt')))  # list() because filter is lazy on Python 3

# filtering out comment lines using a list comprehension
hit_lines = [l for l in open('blast_result.txt') if not l.startswith('#')]
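
To try the comparison without `blast_result.txt` at hand, here is a minimal sketch on a few made-up lines (the `sample_lines` list is hypothetical, not real BLAST output). It only checks that the two approaches pick the same lines; `filter` is wrapped in `list()` because it returns a lazy iterator on Python 3.

```python
# hypothetical sample lines standing in for the contents of blast_result.txt
sample_lines = [
    "# a comment line\n",
    "# another comment line\n",
    "query1\tsubjectA\t85.5\n",
    "query1\tsubjectB\t40.0\n",
]

def comment_filter(line):
    return not line.startswith('#')

# filter() is lazy on Python 3, so wrap it in list() to get the lines
hits_from_filter = list(filter(comment_filter, sample_lines))

# the equivalent list comprehension
hits_from_comprehension = [l for l in sample_lines if not l.startswith('#')]

print(hits_from_filter == hits_from_comprehension)  # True
print(len(hits_from_comprehension))                 # 2
```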

Selecting only the hit lines where the number of mismatches is less than 20:

In [3]:
# selecting low-mismatch hits using a filter function
def mismatch_filter(hit_string): 
    mismatch_count = int(hit_string.split("\t")[4]) 
    return mismatch_count < 20 
f = list(filter(mismatch_filter, hit_lines))  # list() so that len() also works on Python 3
print(len(f))

# selecting low-mismatch hits using a list comprehension
few_mismatch_hits = [line for line in hit_lines if int(line.split("\t")[4]) < 20 ]
print(len(few_mismatch_hits))

# same but more readable
few_mismatch_hits = [
    line
    for line 
    in hit_lines 
    if int(line.split("\t")[4]) < 20 ]
print(len(few_mismatch_hits))

25
25
25
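
The solutions refer to fields by position: subject is column 1, percent id is column 2, hit length is column 3, mismatches is column 4 and query start is column 6. As a sketch, the same filter can be written with named constants so the indices explain themselves. The `field` helper, the constant names and the sample lines are all hypothetical and only serve the illustration.

```python
# column positions as used in the solutions in this notebook
SUBJECT, PERCENT_ID, LENGTH, MISMATCHES, QUERY_START = 1, 2, 3, 4, 6

# hypothetical hit lines with 12 tab-separated fields each; values are made up
sample_hits = [
    "q1\tsubjA\t90.0\t200\t15\t0\t5\t204\t1\t200\t1e-50\t300\n",
    "q1\tsubjB\t60.0\t180\t70\t2\t9\t188\t3\t182\t1e-10\t90\n",
    "q1\tsubjC\t75.0\t150\t19\t1\t2\t151\t1\t150\t1e-30\t180\n",
]

def field(hit_string, index):
    """Return one tab-separated field of a hit line (hypothetical helper)."""
    return hit_string.rstrip("\n").split("\t")[index]

few_mismatch_hits = [
    line
    for line in sample_hits
    if int(field(line, MISMATCHES)) < 20
]
print(len(few_mismatch_hits))  # 2
```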


Second part: sort the hits by percent id (the sorting step is the same for both approaches), then take the first ten and map each one to its subject string:

In [5]:
# functions to get % id and subject
def get_percent_id(hit_string): 
    return float(hit_string.split("\t")[2])
def get_subject(hit_string): 
    return hit_string.split("\t")[1] 

hits_sorted_by_percent_id = sorted(hit_lines, key=get_percent_id) 
low_id_hits = hits_sorted_by_percent_id[0:10] 
print(list(map(get_subject, low_id_hits)))  # list() so the subjects are printed on Python 3

# same using a list comprehension
subjects = [ 
    l.split("\t")[1] 
    for l 
    in hits_sorted_by_percent_id[0:10] 
]
print(subjects)

['gi|336287915|gb|AEI30246.1|', 'gi|336287919|gb|AEI30248.1|', 'gi|336287881|gb|AEI30229.1|', 'gi|336287897|gb|AEI30237.1|', 'gi|336287895|gb|AEI30236.1|', 'gi|336287917|gb|AEI30247.1|', 'gi|336287921|gb|AEI30249.1|', 'gi|336287923|gb|AEI30250.1|', 'gi|336287885|gb|AEI30231.1|', 'gi|336287889|gb|AEI30233.1|']
['gi|336287915|gb|AEI30246.1|', 'gi|336287919|gb|AEI30248.1|', 'gi|336287881|gb|AEI30229.1|', 'gi|336287897|gb|AEI30237.1|', 'gi|336287895|gb|AEI30236.1|', 'gi|336287917|gb|AEI30247.1|', 'gi|336287921|gb|AEI30249.1|', 'gi|336287923|gb|AEI30250.1|', 'gi|336287885|gb|AEI30231.1|', 'gi|336287889|gb|AEI30233.1|']
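
Because `sorted` orders from smallest to largest by default, slicing `[0:10]` picks the hits with the lowest percent id. The sketch below shows this on a few made-up lines (the `sample_hits` values are hypothetical) and adds `reverse=True` to get the highest ones instead.

```python
# hypothetical hit lines with query, subject and percent id; values are made up
sample_hits = [
    "q1\tsubjA\t90.0",
    "q1\tsubjB\t60.0",
    "q1\tsubjC\t75.0",
    "q1\tsubjD\t42.5",
]

def get_percent_id(hit_string):
    return float(hit_string.split("\t")[2])

# ascending sort: lowest percent id first
ascending = sorted(sample_hits, key=get_percent_id)
lowest_two = [l.split("\t")[1] for l in ascending[0:2]]

# reverse=True flips the order: highest percent id first
descending = sorted(sample_hits, key=get_percent_id, reverse=True)
highest_two = [l.split("\t")[1] for l in descending[0:2]]

print(lowest_two)   # ['subjD', 'subjB']
print(highest_two)  # ['subjA', 'subjC']
```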


Final part - we can filter and map in a single comprehension:

In [7]:
from __future__ import division
# solution using filter and map
# this requires the following two functions
def cox1_filter(hit_string): 
    subject = hit_string.split("\t")[1] 
    # the `in` test already evaluates to True or False
    return "COX1" in subject 
        
def start_ratio(hit_string): 
    query_start = int(hit_string.split("\t")[6]) 
    hit_length = int(hit_string.split("\t")[3]) 
    return query_start / hit_length 

cox1_hits = filter(cox1_filter, hit_lines)
print(list(map(start_ratio, cox1_hits)))  # list() so the ratios are printed on Python 3

[0.02262443438914027, 0.009009009009009009, 0.02262443438914027, 0.02262443438914027, 0.02262443438914027, 0.02262443438914027, 0.007797270955165692, 0.04308390022675737]


In [8]:
# solution using a list comprehension
# none of the functions above are used
ratios = [ 
    int(l.split("\t")[6]) / int(l.split("\t")[3]) 
    for l in hit_lines 
    if "COX1" in l.split("\t")[1] 
] 
print(ratios) 

[0.02262443438914027, 0.009009009009009009, 0.02262443438914027, 0.02262443438914027, 0.02262443438914027, 0.02262443438914027, 0.007797270955165692, 0.04308390022675737]
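
The comprehension above splits every line three times. One possible variation, sketched here on made-up lines (the `sample_hits` values are hypothetical), is to feed the comprehension from an inner generator expression so that each line is split only once. The sketch assumes Python 3 division; on Python 2 it needs the `from __future__ import division` line used earlier.

```python
# hypothetical hit lines; only the first seven fields are included, values are made up
sample_hits = [
    "q1\tCOX1_speciesA\t90.0\t200\t15\t0\t5",
    "q1\tCOX2_speciesB\t60.0\t180\t70\t2\t9",
    "q1\tCOX1_speciesC\t75.0\t150\t19\t1\t3",
]

ratios = [
    int(fields[6]) / int(fields[3])                     # query start / hit length
    for fields in (l.split("\t") for l in sample_hits)  # split each line once
    if "COX1" in fields[1]                              # keep COX1 subjects only
]
print(ratios)  # [0.025, 0.02]
```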


## Primer search

It doesn't matter what approach we use to generate the kmers. Let's reuse the recursive one we've seen before:



In [9]:
def generate_primers(length): 
    if length == 1: 
        return ['A', 'T', 'G', 'C'] 
    else: 
        result = [] 
        for seq in generate_primers(length - 1): 
            for base in ['A', 'T', 'G', 'C']: 
                result.append(seq + base) 
        return result 

In [10]:
generate_primers(2)

['AA',
 'AT',
 'AG',
 'AC',
 'TA',
 'TT',
 'TG',
 'TC',
 'GA',
 'GT',
 'GG',
 'GC',
 'CA',
 'CT',
 'CG',
 'CC']

In [11]:
generate_primers(3)

['AAA',
 'AAT',
 'AAG',
 'AAC',
 'ATA',
 'ATT',
 'ATG',
 'ATC',
 'AGA',
 'AGT',
 'AGG',
 'AGC',
 'ACA',
 'ACT',
 'ACG',
 'ACC',
 'TAA',
 'TAT',
 'TAG',
 'TAC',
 'TTA',
 'TTT',
 'TTG',
 'TTC',
 'TGA',
 'TGT',
 'TGG',
 'TGC',
 'TCA',
 'TCT',
 'TCG',
 'TCC',
 'GAA',
 'GAT',
 'GAG',
 'GAC',
 'GTA',
 'GTT',
 'GTG',
 'GTC',
 'GGA',
 'GGT',
 'GGG',
 'GGC',
 'GCA',
 'GCT',
 'GCG',
 'GCC',
 'CAA',
 'CAT',
 'CAG',
 'CAC',
 'CTA',
 'CTT',
 'CTG',
 'CTC',
 'CGA',
 'CGT',
 'CGG',
 'CGC',
 'CCA',
 'CCT',
 'CCG',
 'CCC']
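
Since the approach doesn't matter, one non-recursive alternative is the standard library's `itertools.product`. This is only a sketch: the name `generate_primers_product` is made up here to avoid clashing with the function above. It produces the k-mers in the same order as the recursive version, because the last position varies fastest in both.

```python
from itertools import product

def generate_primers_product(length):
    """Hypothetical alternative to generate_primers, built on itertools.product."""
    return [''.join(bases) for bases in product('ATGC', repeat=length)]

two_mers = generate_primers_product(2)
print(len(two_mers))   # 16
print(two_mers[:4])    # ['AA', 'AT', 'AG', 'AC']
```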

We can rewrite the function as a generator: instead of building up a list and returning it after the loop, we `yield` each value inside the loop:

In [27]:
def generate_primers(length): 
    if length == 1: 
        for base in ['A', 'T', 'G', 'C']: 
            yield base 
    else: 
        for seq in generate_primers(length - 1): 
            for base in ['A', 'T', 'G', 'C']: 
                yield seq + base 
                
p = generate_primers(5)
p

<generator object generate_primers at 0x7fa3181306e0>

The result is a generator object, which we can iterate over:

In [26]:
# print only the first ten primers, then stop iterating
for i, primer in enumerate(p):
    if i >= 10:
        break
    print(primer)

AAAAA
AAAAT
AAAAG
AAAAC
AAATA
AAATT
AAATG
AAATC
AAAGA
AAAGT


We can use the same trick to generate pairs:

In [37]:
def generate_pairs(length): 
    for forward in generate_primers(length): 
        for reverse in generate_primers(length): 
            yield forward, reverse 
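
The number of pairs grows quickly: there are 4^5 = 1024 primers of length 5, so 1024 × 1024 = 1,048,576 pairs. A generator produces them one at a time instead of holding them all in memory. The sketch below repeats both generators so it runs on its own, and uses length 2 to keep the numbers small; the filter criteria are made up for illustration.

```python
def generate_primers(length):
    if length == 1:
        for base in ['A', 'T', 'G', 'C']:
            yield base
    else:
        for seq in generate_primers(length - 1):
            for base in ['A', 'T', 'G', 'C']:
                yield seq + base

def generate_pairs(length):
    for forward in generate_primers(length):
        for reverse in generate_primers(length):
            yield forward, reverse

# count the pairs without ever storing them: 16 * 16 for length 2
pair_count = sum(1 for pair in generate_pairs(2))
print(pair_count)  # 256

# keep only the pairs matching some (made-up) criteria
matching = [
    (f, r)
    for f, r in generate_pairs(2)
    if f == 'AG' and r.startswith('T')
]
print(matching)  # [('AG', 'TA'), ('AG', 'TT'), ('AG', 'TG'), ('AG', 'TC')]
```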

        

Again, we can iterate over the result of calling this function to find pairs matching particular criteria, e.g.:

In [40]:
for forward, reverse in generate_pairs(5):
    if 'AGTG' in forward and 'TCAT' in reverse:
        # print the pair as a tuple so the output looks the same on Python 2 and 3
        print((forward, reverse))

('AAGTG', 'ATCAT')
('AAGTG', 'TTCAT')
('AAGTG', 'TCATA')
('AAGTG', 'TCATT')
('AAGTG', 'TCATG')
('AAGTG', 'TCATC')
('AAGTG', 'GTCAT')
('AAGTG', 'CTCAT')
('AGTGA', 'ATCAT')
('AGTGA', 'TTCAT')
('AGTGA', 'TCATA')
('AGTGA', 'TCATT')
('AGTGA', 'TCATG')
('AGTGA', 'TCATC')
('AGTGA', 'GTCAT')
('AGTGA', 'CTCAT')
('AGTGT', 'ATCAT')
('AGTGT', 'TTCAT')
('AGTGT', 'TCATA')
('AGTGT', 'TCATT')
('AGTGT', 'TCATG')
('AGTGT', 'TCATC')
('AGTGT', 'GTCAT')
('AGTGT', 'CTCAT')
('AGTGG', 'ATCAT')
('AGTGG', 'TTCAT')
('AGTGG', 'TCATA')
('AGTGG', 'TCATT')
('AGTGG', 'TCATG')
('AGTGG', 'TCATC')
('AGTGG', 'GTCAT')
('AGTGG', 'CTCAT')
('AGTGC', 'ATCAT')
('AGTGC', 'TTCAT')
('AGTGC', 'TCATA')
('AGTGC', 'TCATT')
('AGTGC', 'TCATG')
('AGTGC', 'TCATC')
('AGTGC', 'GTCAT')
('AGTGC', 'CTCAT')
('TAGTG', 'ATCAT')
('TAGTG', 'TTCAT')
('TAGTG', 'TCATA')
('TAGTG', 'TCATT')
('TAGTG', 'TCATG')
('TAGTG', 'TCATC')
('TAGTG', 'GTCAT')
('TAGTG', 'CTCAT')
('GAGTG', 'ATCAT')
('GAGTG', 'TTCAT')
('GAGTG', 'TCATA')
('GAGTG', 'TCATT')
('GAGTG', 'T
