# Data structures


## Basic collection types

### Lists and dicts

What do we already know about data structures?

Lists store ordered collections of elements:

In [1]:
l = [1,2,3]
print(l[1])
for e in l:
    print(e+1) 

2
2
3
4


Dicts store key/value pairs (items) for rapid lookup by key; internally they are hash tables:

In [2]:
enzymes = { 
'EcoRI' : 'GAATTC',
'AvaII' : 'GGACC',
'BisI' : 'GCNGC' 
}

enzymes['AvaII']

'GGACC'

### Tuples

Tuples appear to be similar to lists:

In [5]:
t = (4, 5, 6)
print(t[1])
for e in t:
    print(e+1) 

5
5
6
7


until we try to change one of the values:

In [6]:
t[1] = 9

TypeError: 'tuple' object does not support item assignment

#### Tuples are immutable; they cannot be changed once created

This lets Python make some time/memory optimizations.

Tuples are useful for storing heterogeneous data (think records / rows from a table / simple objects):

In [7]:
t1 = ('actgctagt', 'ABC123', 1)
t2 = ('ttaggttta', 'XYZ456', 1)
t3 = ('cgcgatcgt', 'HIJ789', 5)

More to say about tuples later...

### Sets

{1,2,3,4,6}

Sets are like lists but with
- no order, and
- fast lookup

A set is an unordered collection of unique items, written as comma-separated values inside braces `{ }`.

Because the items have no position, indexing and slicing with `[ ]` do not work.

In [3]:
a = {'apple', 'pear', 'banana'}
b = {'apple', 'pear', 'banana', 'apple'}
b

{'apple', 'banana', 'pear'}

Think of sets as either like
- unordered lists with rapid lookup, or
- dicts without values
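
The properties listed above can be checked directly. This is a minimal sketch reusing the fruit names from the previous cell; the variable names `fruit`, `indexing_works` and `fruit_sorted` are illustrative only.

```python
fruit = {'apple', 'pear', 'banana', 'apple'}

# duplicates are discarded when the set is built
print(len(fruit))            # 3

# membership tests use hashing, like looking up a dict key
print('pear' in fruit)       # True
print('kiwi' in fruit)       # False

# sets have no positions, so indexing raises a TypeError
try:
    fruit[0]
    indexing_works = True
except TypeError:
    indexing_works = False
print(indexing_works)        # False

# convert to a sorted list when a defined order is needed
fruit_sorted = sorted(fruit)
print(fruit_sorted)          # ['apple', 'banana', 'pear']
```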

## A closer look at lists

Hopefully we are all familiar with the idea of lists of numbers and strings:

In [3]:
[1,2,3,4]
['a', 'b', 'c']

['a', 'b', 'c']

A slightly more exotic idea: we can have lists of file objects:

In [4]:
files = [open("blast_result.txt"), open("sequences.fasta")]
files

[<_io.TextIOWrapper name='blast_result.txt' mode='r' encoding='UTF-8'>,
 <_io.TextIOWrapper name='sequences.fasta' mode='r' encoding='UTF-8'>]



If we create a list where each element is also a list, we have a two-dimensional list or list-of-lists:

In [5]:
list_of_lists = [[1,2,3],[4,5,6],[7,8,9]]
list_of_lists

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

More readably:

In [6]:
list_of_lists = [[1,2,3],
                 [4,5,6],
                 [7,8,9]]

Each element is just a normal list:

In [7]:
list_of_lists[1]

[4, 5, 6]

and we can use two brackets to address one of the inner elements:

In [8]:
list_of_lists[1][2]

6

Where might this be useful? Imagine storing a multiple sequence alignment:

In [9]:
aln = [['A', 'T', '-', 'T', 'G'], 
       ['A', 'A', 'T', 'A', 'G'], 
       ['T', '-', 'T', 'T', 'G'], 
       ['A', 'A', '-', 'T', 'A']]  

We could get a single aligned sequence

In [10]:
aln[2]

['T', '-', 'T', 'T', 'G']

or a single column (don't worry about list comprehensions if you haven't seen them yet):

In [11]:
#get the fourth column
[seq[3] for seq in aln]

['T', 'A', 'T', 'T']
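
If every column is needed rather than just one, the built-in `zip()` can transpose the whole alignment in one step. The sketch below assumes all rows have the same length; `columns` and `gaps_per_column` are names made up for this example.

```python
aln = [['A', 'T', '-', 'T', 'G'],
       ['A', 'A', 'T', 'A', 'G'],
       ['T', '-', 'T', 'T', 'G'],
       ['A', 'A', '-', 'T', 'A']]

# zip(*aln) groups the first element of every row, then the second, and so on
columns = [list(col) for col in zip(*aln)]

print(len(columns))   # 5
print(columns[3])     # ['T', 'A', 'T', 'T']

# with columns in hand, per-column summaries are straightforward
gaps_per_column = [col.count('-') for col in columns]
print(gaps_per_column)  # [0, 1, 2, 0, 0]
```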

### Lists of dicts and tuples

We can build lists of other things too. Imagine we have a collection of DNA sequence records. We could store this as a list of dicts:

In [12]:
# a list of dicts
records = [
    {'seq' : 'actgctagt', 'accession' : 'ABC123', 'genetic_code' : 1},
    {'seq' : 'ttaggttta', 'accession' : 'XYZ456', 'genetic_code' : 1},
    {'seq' : 'cgcgatcgt', 'accession' : 'HIJ789', 'genetic_code' : 5}
]

for record in records:
    print('accession number : ' + record['accession'])
    print('genetic code: ' + str(record['genetic_code'])) 

accession number : ABC123
genetic code: 1
accession number : XYZ456
genetic code: 1
accession number : HIJ789
genetic code: 5


## A taxonomy of data structures

We have looked at
- lists of lists
- lists of dicts

It's also possible to build lists of sets. 

How about other data structures?

#### Sets

Elements in sets have to be immutable (so they can be hashed) so we can't build
- sets of lists
- sets of dicts
- sets of sets

We can build sets of tuples, provided the tuples themselves contain only hashable values (though I'm not sure why we would).
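
A short sketch of what the hashing rule allows and forbids. The `(chromosome, position)` pairs are invented data, used only to suggest one possible reason for a set of tuples. The built-in `frozenset`, an immutable version of `set`, is not otherwise covered in this section; it is shown here as one way around the "no sets of sets" restriction.

```python
# tuples of hashable values can be set members
# (hypothetical (chromosome, position) pairs)
positions = {(1, 100), (1, 250), (2, 75), (1, 100)}
print(len(positions))          # 3, the duplicate pair is dropped
print((1, 250) in positions)   # True

# lists are mutable and therefore unhashable, so this fails
try:
    bad = {[1, 100], [1, 250]}
    list_set_ok = True
except TypeError:
    list_set_ok = False
print(list_set_ok)             # False

# frozensets are immutable, so they can be nested inside a set
set_of_sets = {frozenset({2, 4}), frozenset({4, 2}), frozenset({7})}
print(len(set_of_sets))        # 2
```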

#### Tuples

We can build tuples where the individual elements are lists/sets/dicts/tuples, but they tend not to be very useful.

## Fun with Dicts

Dicts of things turn out to be **very** useful. They allow us to 
- attach names to other data structures, and
- rapidly look up other data structures using those names.

### Dicts of sets 
Imagine we have run a gene expression experiment in which we subject some cells to various metal elements and record which genes are overexpressed in response. The data might look like this:

In [15]:
gene_sets = {
    'arsenic' : {1,2,3,4,5,6,8,12},
    'cadmium' : {2,12,6,4},
    'copper' : {7,6,10,4,8},
    'mercury' : {3,2,4,5,1}
}
gene_sets['copper']




{4, 6, 7, 8, 10}

This data structure leverages the features of dicts (rapidly look up a gene set from the metal name) and sets (rapidly check membership). E.g. is gene number 3 over-expressed in response to arsenic?

In [16]:
3 in gene_sets['arsenic']

True

Which conditions is gene 5 over-expressed in response to?

In [17]:
for metal, genes in gene_sets.items(): 
    if 5 in genes: 
        print(metal)

mercury
arsenic


Or even more concisely (wait for comprehensions....):

In [18]:
[metal for metal, genes in gene_sets.items() if 5 in genes]

['mercury', 'arsenic']

Now, a more interesting question: are there any conditions whose genes are a subset of another condition's genes? 

In [19]:
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if set1.issubset(set2) and condition1 != condition2:
            print(condition1 + ' is a subset of ' + condition2)

cadmium is a subset of arsenic
mercury is a subset of arsenic


Notice how we use the features of both dicts (to get hold of the condition names) and sets (using the `issubset()` method).
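
Other set operations combine with the dict in the same way. The following sketch asks three more questions of the same data; the names `shared`, `any_condition` and `copper_only` are illustrative.

```python
gene_sets = {
    'arsenic': {1, 2, 3, 4, 5, 6, 8, 12},
    'cadmium': {2, 12, 6, 4},
    'copper': {7, 6, 10, 4, 8},
    'mercury': {3, 2, 4, 5, 1},
}

# genes over-expressed under every condition
shared = set.intersection(*gene_sets.values())
print(shared)                    # {4}

# genes over-expressed under at least one condition
any_condition = set.union(*gene_sets.values())
print(sorted(any_condition))     # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12]

# genes that respond to copper but not to arsenic
copper_only = gene_sets['copper'] - gene_sets['arsenic']
print(sorted(copper_only))       # [7, 10]
```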

### List of tuples

Remember the tuples we used for storing DNA sequence records? Here they are collected in a list:

In [20]:
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5)
]

This is great for iterating over all records:

In [23]:
for record in records:
    (sequence, accession, code) = record
    print("looking at record " + accession + " with genetic code " + str(code))
    # do something with the record

looking at record ABC123 with genetic code 1
looking at record XYZ456 with genetic code 1
looking at record HIJ789 with genetic code 5


but not great for finding a specific record:

In [24]:
for record in records:
    if record[1] == 'XYZ456':
        print("Found it!")
        (sequence, accession, code) = record
        # do something with the record

Found it!


### Dict of tuples
Here's the same data as a dict of tuples. We turn the accession into the key:

In [26]:
records = {
    'ABC123' : ('actgctagt', 1),
    'XYZ456' : ('ttaggttta', 1),
    'HIJ789' : ('cgcgatcgt', 5)
}
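
If the data already exists as a list of tuples, it does not need to be retyped. A dict comprehension is one way to re-key it by accession. This is a sketch that assumes accessions are unique (a repeated accession would silently overwrite the earlier record); `record_list` and `records_by_accession` are names chosen for this example.

```python
record_list = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5),
]

# unpack each tuple and move the accession into the key position
records_by_accession = {
    accession: (sequence, code)
    for sequence, accession, code in record_list
}

print(records_by_accession['XYZ456'])   # ('ttaggttta', 1)
print(len(records_by_accession))        # 3
```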

Now it's just as easy to iterate over all records:

In [27]:
for accession, record in records.items():
    (sequence, code) = record
    print("looking at record " + accession + " with genetic code " + str(code))
    # do something with the record

looking at record HIJ789 with genetic code 5
looking at record ABC123 with genetic code 1
looking at record XYZ456 with genetic code 1


and it's also easy to retrieve a specific record by accession:

In [28]:
accession = 'XYZ456'
my_record = records.get(accession)
(this_sequence, this_code) = my_record
print("looking at record " + accession + " with genetic code " + str(this_code))
# do something with the record

looking at record XYZ456 with genetic code 1
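
One caveat with `get()`: for an unknown key it returns `None` rather than raising an error, and unpacking `None` then fails with a `TypeError`. A minimal sketch of a guarded lookup follows; the helper `describe()` and the accession `'NOPE000'` are made up for illustration.

```python
records = {
    'ABC123': ('actgctagt', 1),
    'XYZ456': ('ttaggttta', 1),
    'HIJ789': ('cgcgatcgt', 5),
}

# get() returns None for an unknown key instead of raising KeyError
print(records.get('NOPE000'))   # None

def describe(accession):
    """Return a one-line summary, or a message if the accession is unknown."""
    record = records.get(accession)
    if record is None:
        return "no record for " + accession
    sequence, code = record
    return accession + " has genetic code " + str(code)

print(describe('XYZ456'))    # XYZ456 has genetic code 1
print(describe('NOPE000'))   # no record for NOPE000
```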


## Some special data structures from the standard library

### `collections.Counter`

Common scenario number one: we want to count the number of times each unique element occurs in a collection of things. E.g. counting bases in a DNA sequence:

In [29]:
dna = 'aattggaattggaattg'
base_counts = {}
for base in dna:
    current_count = base_counts.get(base, 0)
    base_counts[base] = current_count + 1

print(base_counts)

{'a': 6, 't': 6, 'g': 5}


`collections.Counter` is a special dict class for doing this. Construct it by passing a list (or string, etc.) as the argument:

In [30]:
import collections

dna = 'aattggaattggaattg'
base_counter = collections.Counter(dna)

print(base_counter)
print(base_counter.most_common())
print(base_counter.get('t'))

Counter({'a': 6, 't': 6, 'g': 5})
[('a', 6), ('t', 6), ('g', 5)]
6
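
A few more `Counter` behaviours that are handy for sequence work, sketched on the same DNA string. The dinucleotide count at the end is an extra illustration, not part of the original example; `pairs` is a name chosen here.

```python
import collections

dna = 'aattggaattggaattg'
base_counter = collections.Counter(dna)

# a base that never occurs has a count of zero rather than raising KeyError
print(base_counter['c'])             # 0

# most_common(n) returns only the n highest counts
top = base_counter.most_common(1)
print(top[0][1])                     # 6

# any iterable of hashable items can be counted, e.g. overlapping dinucleotides
pairs = collections.Counter(dna[i:i + 2] for i in range(len(dna) - 1))
print(pairs['aa'], pairs['tt'], pairs['gg'])   # 3 3 2
```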


### `collections.defaultdict`

Common scenario number two: we want a dict that supplies a default value for new keys. Let's tackle the exact same problem. Previously we used zero as the default value each time we retrieved a key with `get()`:

In [31]:
dna = 'aattggaattggaattg'
base_counts = {}
for base in dna:
    current_count = base_counts.get(base, 0)
    base_counts[base] = current_count + 1

print(base_counts)

{'a': 6, 't': 6, 'g': 5}


If we use a `defaultdict`, we can supply a function that creates the value for a key the first time we ask for it:

In [32]:
import collections

def return_zero():
    return 0

# note no parens after function name
dd = collections.defaultdict(return_zero)
dd['banana']

0

In [33]:
# or with a lambda
dd = collections.defaultdict(lambda : 0)
dd['apple']

0

Now we can manipulate values in the dict without worrying about whether there is already a value there:

In [34]:
dna = 'aattggaattggaattg'
base_counts = collections.defaultdict(lambda : 0)
for base in dna:
    base_counts[base] = base_counts[base] + 1
print(base_counts)


# or more concisely
base_counts = collections.defaultdict(lambda : 0)
for base in dna:
    base_counts[base] +=  1
print(base_counts)


defaultdict(<function <lambda> at 0x7f2254448510>, {'a': 6, 't': 6, 'g': 5})
defaultdict(<function <lambda> at 0x7f22544480d0>, {'a': 6, 't': 6, 'g': 5})
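
Because calling `int()` returns `0` and calling `list()` returns an empty list, the built-in types themselves can serve as the factory function. The sketch below repeats the base count with `int`, then uses `list` to group accessions by genetic code; the grouping example and the name `by_code` are additions for illustration, reusing the sequence records from earlier.

```python
import collections

dna = 'aattggaattggaattg'

# int() returns 0, so int can replace the lambda
base_counts = collections.defaultdict(int)
for base in dna:
    base_counts[base] += 1
print(dict(base_counts))   # {'a': 6, 't': 6, 'g': 5}

# list() returns [], so every new key starts with an empty list to append to
records = [
    ('actgctagt', 'ABC123', 1),
    ('ttaggttta', 'XYZ456', 1),
    ('cgcgatcgt', 'HIJ789', 5),
]
by_code = collections.defaultdict(list)
for sequence, accession, code in records:
    by_code[code].append(accession)
print(dict(by_code))       # {1: ['ABC123', 'XYZ456'], 5: ['HIJ789']}
```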


## Exercise

### Transforming data between structures

Use the heavy metal gene expression data.
 
The similarity score between two conditions is the number of over-expressed genes in common (the intersection) divided by the total number of over-expressed genes (the union).
 
Write a program that will start with the dict of sets and produce a pairwise similarity matrix stored as a dict of dicts. 
 
We should be able to get the score for a given pair of conditions like this:

```python
score = similarity_matrix['arsenic']['cadmium']
```

Hints

- take a look at [the documentation for sets](https://docs.python.org/3/library/stdtypes.html#set).
- this is really three problems: (1) how to calculate similarity for any two given sets of genes, (2) how to generate all pairs of sets of genes, and (3) how to store the resulting scores.
- think about what the final data structure will look like... you can figure this out from the code fragment above. Draw the final data structure if it helps.



In [None]:

gene_sets = { 
        'arsenic' : {1,2,3,4,5,6,8,12}, 
        'cadmium' : {2,12,6,4}, 
        'copper' : {7,6,10,4,8}, 
        'mercury' : {3,2,4,5,1} 
} 
 
set1 = gene_sets['arsenic'] 
set2 = gene_sets['mercury'] 
print(set1.intersection(set2))
print(set1.union(set2))
len(set1.intersection(set2)) / len(set1.union(set2)) 
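
The calculation in the cell above can be wrapped in a small function so that it is written only once. This is a sketch, assuming Python 3 division; the function name `similarity()` and the choice to return `0.0` for two empty sets are assumptions made here, not part of the exercise.

```python
def similarity(set1, set2):
    """Number of shared genes divided by the number of genes in either set."""
    union = set1 | set2
    if not union:
        return 0.0   # assumption: two empty sets score zero
    return len(set1 & set2) / len(union)

gene_sets = {
    'arsenic': {1, 2, 3, 4, 5, 6, 8, 12},
    'cadmium': {2, 12, 6, 4},
    'copper': {7, 6, 10, 4, 8},
    'mercury': {3, 2, 4, 5, 1},
}

print(similarity(gene_sets['arsenic'], gene_sets['mercury']))   # 0.625
print(similarity(gene_sets['arsenic'], gene_sets['cadmium']))   # 0.5
```

The operators `&` and `|` are equivalent to `intersection()` and `union()` when both operands are sets.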

Next, how do we calculate the score for each pair of conditions? We can iterate over the dict once with `items()`:

In [None]:
for condition, geneset in gene_sets.items():
    print(condition, geneset)

So to get pairs we just need two nested loops. Let's throw in an extra `if` to avoid comparing a condition to itself:

In [38]:
for condition1, set1 in gene_sets.items(): 
    for condition2, set2 in gene_sets.items(): 
        if condition1 != condition2: 
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            print(condition1, condition2, similarity) 

cadmium mercury 0.2857142857142857
cadmium copper 0.2857142857142857
cadmium arsenic 0.5
mercury cadmium 0.2857142857142857
mercury copper 0.1111111111111111
mercury arsenic 0.625
copper cadmium 0.2857142857142857
copper mercury 0.1111111111111111
copper arsenic 0.3
arsenic cadmium 0.5
arsenic mercury 0.625
arsenic copper 0.3


We still get each comparison both ways (e.g. mercury vs. copper and copper vs. mercury), but ignore that for now.
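
If the duplicated comparisons do matter, `itertools.combinations` from the standard library yields each unordered pair exactly once. The sketch below stores the scores in a dict keyed by a tuple of condition names; `pair_scores` is a name chosen for this example, and this layout is an alternative to the dict of dicts the exercise asks for, not a replacement for it.

```python
from itertools import combinations

gene_sets = {
    'arsenic': {1, 2, 3, 4, 5, 6, 8, 12},
    'cadmium': {2, 12, 6, 4},
    'copper': {7, 6, 10, 4, 8},
    'mercury': {3, 2, 4, 5, 1},
}

pair_scores = {}
# combinations(..., 2) gives every pair of items once and never pairs an item with itself
for (condition1, set1), (condition2, set2) in combinations(gene_sets.items(), 2):
    pair_scores[(condition1, condition2)] = len(set1 & set2) / len(set1 | set2)

print(len(pair_scores))   # 6 pairs instead of 12

# key order follows the dict's insertion order (Python 3.7+)
print(pair_scores[('arsenic', 'cadmium')])   # 0.5
```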

Next, how to store the results? The goal is to eventually be able to type

```python
similarity_matrix['arsenic']['cadmium']
```

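so it's tempting to just create a plain `similarity_matrix` dict and assign to `similarity_matrix[condition1][condition2]`. That fails with a `KeyError` on the first assignment, because the inner dict for `condition1` does not exist yet. A `defaultdict(dict)` (named `similarity_scores` here) creates each inner dict on demand: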

In [39]:
from collections import defaultdict
similarity_scores = defaultdict(dict) 
for condition1, set1 in gene_sets.items():
    for condition2, set2 in gene_sets.items():
        if condition1 != condition2:
            similarity = len(set1.intersection(set2)) / len(set1.union(set2))
            similarity_scores[condition1][condition2] = similarity
similarity_scores

defaultdict(dict,
            {'arsenic': {'cadmium': 0.5, 'copper': 0.3, 'mercury': 0.625},
             'cadmium': {'arsenic': 0.5,
              'copper': 0.2857142857142857,
              'mercury': 0.2857142857142857},
             'copper': {'arsenic': 0.3,
              'cadmium': 0.2857142857142857,
              'mercury': 0.1111111111111111},
             'mercury': {'arsenic': 0.625,
              'cadmium': 0.2857142857142857,
              'copper': 0.1111111111111111}})

In [40]:
similarity_scores['arsenic']['cadmium']

0.5
