<h1 id="toctitle">Exercise solutions</h1>
<ul id="toc"/>

## Transforming data between structures

First of all, how do we calculate the similarity score from our description? The genes are stored as sets, so we can use the `set` methods `intersection()` and `union()`:

In [1]:
from __future__ import division

gene_sets = { 
        'arsenic' : {1,2,3,4,5,6,8,12}, 
        'cadmium' : {2,12,6,4}, 
        'copper' : {7,6,10,4,8}, 
        'mercury' : {3,2,4,5,1} 
} 
 
set1 = gene_sets['arsenic'] 
set2 = gene_sets['mercury'] 
print(set1.intersection(set2))
print(set1.union(set2))
len(set1.intersection(set2)) / len(set1.union(set2)) 

set([1, 2, 3, 4, 5])
set([1, 2, 3, 4, 5, 6, 8, 12])


0.625
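
Since we'll need this calculation for every pair of conditions, it may be convenient to wrap it in a function. The ratio of intersection size to union size is commonly called the Jaccard index. The sketch below is one possible way to write it: the name `jaccard` is made up for this illustration, and returning `0.0` for two empty sets is an assumption rather than part of the exercise.

```python
from __future__ import division  # keeps / as true division on Python 2


def jaccard(set_a, set_b):
    """Return len(A & B) / len(A | B), the score computed above (hypothetical helper)."""
    union = set_a | set_b  # the | operator is equivalent to set_a.union(set_b)
    if not union:
        # Assumption: two empty sets get a score of 0.0 instead of a ZeroDivisionError
        return 0.0
    return len(set_a & set_b) / len(union)  # & is equivalent to intersection()


arsenic = {1, 2, 3, 4, 5, 6, 8, 12}
mercury = {3, 2, 4, 5, 1}
print(jaccard(arsenic, mercury))  # 0.625
```

The `&` and `|` operators do the same job as `intersection()` and `union()`; which one to use is a matter of taste.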

Next, how do we calculate the score for each pair of conditions? We can iterate over the dict once with `items()`:

In [2]:
for condition, geneset in gene_sets.items():
    print(condition, geneset)

('mercury', set([1, 2, 3, 4, 5]))
('copper', set([8, 10, 4, 6, 7]))
('arsenic', set([1, 2, 3, 4, 5, 6, 8, 12]))
('cadmium', set([12, 2, 4, 6]))


So to get pairs we just need two nested loops. Let's throw in an extra `if` to avoid comparing a condition to itself:

In [3]:
for condition1, set1 in gene_sets.items(): 
    for condition2, set2 in gene_sets.items(): 
        if condition1 != condition2: 
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            print(condition1, condition2, similarity) 

('mercury', 'copper', 0.1111111111111111)
('mercury', 'arsenic', 0.625)
('mercury', 'cadmium', 0.2857142857142857)
('copper', 'mercury', 0.1111111111111111)
('copper', 'arsenic', 0.3)
('copper', 'cadmium', 0.2857142857142857)
('arsenic', 'mercury', 0.625)
('arsenic', 'copper', 0.3)
('arsenic', 'cadmium', 0.5)
('cadmium', 'mercury', 0.2857142857142857)
('cadmium', 'copper', 0.2857142857142857)
('cadmium', 'arsenic', 0.5)


We still get each comparison both ways (i.e. mercury vs. copper and copper vs. mercury), but we'll ignore that for now.
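
If the duplicated comparisons do matter, one option is `itertools.combinations`, which yields each unordered pair exactly once. This is only a sketch of an alternative: the `pair_scores` dict and its tuple keys are choices made for this illustration, not the structure the exercise asks for.

```python
from __future__ import division
from itertools import combinations

gene_sets = {
    'arsenic': {1, 2, 3, 4, 5, 6, 8, 12},
    'cadmium': {2, 12, 6, 4},
    'copper': {7, 6, 10, 4, 8},
    'mercury': {3, 2, 4, 5, 1},
}

pair_scores = {}  # hypothetical container keyed by (condition1, condition2)
# Sorting the condition names gives a predictable order for the pairs
for condition1, condition2 in combinations(sorted(gene_sets), 2):
    set1, set2 = gene_sets[condition1], gene_sets[condition2]
    pair_scores[(condition1, condition2)] = len(set1 & set2) / len(set1 | set2)

for pair in sorted(pair_scores):
    print(pair, pair_scores[pair])
```

Four conditions give six unordered pairs, half of the twelve lines printed by the nested loops.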

Next, how to store the results? The goal is to eventually be able to type

```python
similarity_matrix['arsenic']['cadmium']
```

so it's tempting to just create a `similarity_matrix` dict and then do this:

In [4]:
similarity_matrix = {} 
for condition1, set1 in gene_sets.items(): 
    for condition2, set2 in gene_sets.items(): 
        if condition1 != condition2: 
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            similarity_matrix[condition1][condition2] = similarity 

KeyError: 'mercury'

but we get a `KeyError` because the second layer of dicts doesn't exist when we try to add a value to it. Some languages create the missing layer automatically; this feature has the fantastic name of **autovivification**. A plain Python dict doesn't do it.
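
To see why the error names the outer key, it can help to split the assignment into the two steps Python performs. The sketch below uses a made-up score of `0.11` and a `failed` flag purely for illustration.

```python
similarity_matrix = {}

try:
    # similarity_matrix['mercury']['copper'] = 0.11 is carried out in two steps:
    inner = similarity_matrix['mercury']  # step 1: look up the inner dict (fails here)
    inner['copper'] = 0.11                # step 2: store the value (never reached)
    failed = False
except KeyError as error:
    failed = True
    print('lookup failed for key', error)
```

The lookup in step 1 is an ordinary read of a missing key, so it is the outer key that ends up in the error message.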

We can either create the key at the start of each outer loop:

In [5]:
similarity_scores = {}
for condition1, set1 in gene_sets.items():
    similarity_scores[condition1] = {}
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            similarity_scores[condition1][condition2] = similarity
similarity_scores

{'arsenic': {'cadmium': 0.5, 'copper': 0.3, 'mercury': 0.625},
 'cadmium': {'arsenic': 0.5,
  'copper': 0.2857142857142857,
  'mercury': 0.2857142857142857},
 'copper': {'arsenic': 0.3,
  'cadmium': 0.2857142857142857,
  'mercury': 0.1111111111111111},
 'mercury': {'arsenic': 0.625,
  'cadmium': 0.2857142857142857,
  'copper': 0.1111111111111111}}

In [6]:
similarity_scores['arsenic']['cadmium']

0.5

or use `defaultdict` to create the second level of dicts as needed. The function that creates a dict is just called `dict`, so `defaultdict(dict)` will create a dict of dicts where the second level appears automatically when needed:

In [7]:
from collections import defaultdict
similarity_scores = defaultdict(dict) 
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            similarity_scores[condition1][condition2] = similarity
similarity_scores

defaultdict(dict,
            {'arsenic': {'cadmium': 0.5, 'copper': 0.3, 'mercury': 0.625},
             'cadmium': {'arsenic': 0.5,
              'copper': 0.2857142857142857,
              'mercury': 0.2857142857142857},
             'copper': {'arsenic': 0.3,
              'cadmium': 0.2857142857142857,
              'mercury': 0.1111111111111111},
             'mercury': {'arsenic': 0.625,
              'cadmium': 0.2857142857142857,
              'copper': 0.1111111111111111}})

In [8]:
similarity_scores['arsenic']['cadmium']

0.5
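
A third possibility, not part of the solution above, is the `setdefault()` method of ordinary dicts, which returns the value for a key and inserts a default first if the key is missing. The following is a minimal sketch of the same loop written that way.

```python
from __future__ import division

gene_sets = {
    'arsenic': {1, 2, 3, 4, 5, 6, 8, 12},
    'cadmium': {2, 12, 6, 4},
    'copper': {7, 6, 10, 4, 8},
    'mercury': {3, 2, 4, 5, 1},
}

similarity_scores = {}
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1 & set2) / len(set1 | set2)
            # setdefault returns the inner dict, adding an empty one first if needed
            similarity_scores.setdefault(condition1, {})[condition2] = similarity

print(similarity_scores['arsenic']['cadmium'])  # 0.5
```

The result is a plain dict, so looking up a condition that doesn't exist still raises a `KeyError`. A `defaultdict(dict)` would instead quietly add an empty inner dict for it.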

## Listing kmer positions

First we'll set up the variables we need:

In [9]:
dna = 'aattggaattggaattg'
k = 4

The keys of our dict are going to be kmers and the values are going to be lists of positions, so for our `defaultdict` we want the default value factory to be the `list()` function:

In [10]:
from collections import defaultdict

d = defaultdict(list)

Now we iterate over the kmers using a range, and append the positions without worrying about whether the target list exists yet:

In [11]:
for start in range(len(dna) - k + 1):
    kmer = dna[start:start + k]
    d[kmer].append(start)
    
d

defaultdict(list,
            {'aatt': [0, 6, 12],
             'attg': [1, 7, 13],
             'gaat': [5, 11],
             'ggaa': [4, 10],
             'tgga': [3, 9],
             'ttgg': [2, 8]})
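
The same steps can be packaged as a reusable function. This is a sketch only: the name `kmer_positions` is made up for this illustration, and converting the result to a plain dict is a design choice rather than a requirement of the exercise.

```python
from collections import defaultdict


def kmer_positions(dna, k):
    """Map each kmer of length k to the list of its start positions (hypothetical helper)."""
    positions = defaultdict(list)
    for start in range(len(dna) - k + 1):
        positions[dna[start:start + k]].append(start)
    # Return a plain dict so that looking up an absent kmer later raises a
    # KeyError instead of silently adding an empty list
    return dict(positions)


positions = kmer_positions('aattggaattggaattg', 4)
print(positions['aatt'])          # [0, 6, 12]
print(positions.get('cccc', []))  # [] because 'cccc' never occurs
```

Using `get()` with a default, or testing with `in`, lets us ask about a kmer without changing the dict.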
