# Dictionaries

- A dictionary (also known as associative array, hash or map) is another type of data container
- very useful for storing "paired data"

In [1]:
PhoneNumbers = {}   # Create an empty dictionary 
PhoneNumbers['John'] = '463673'
PhoneNumbers['Mary'] = '279943'

True to the dictionary analogy, values in dictionaries are looked up according to their keys, rather than by their position (as would happen in a list).

We extract values from the dictionary using square brackets `[]`

In [2]:
#prints out Mary's phone number
print(PhoneNumbers['Mary'])

279943


The contents of dictionaries can be modified

In [3]:
PhoneNumbers['Mary'] = '123'
print(PhoneNumbers['Mary'])

123


## Creating dictionaries

There are several ways to create dictionaries: 

The most direct way is like this:

```python
MyDictionary = {key:value, nextkey:nextvalue}
```

For example

In [4]:
PhoneNumbers = {'John':'463673', 'Mary':'279943'}

In this example, the keys are strings, and the values are strings.

💡 Python allows a statement to be split across lines if the split occurs within `()`, `[]` or `{}`.

In [5]:
PhoneNumbers = {
    'John':'463673',
    'Mary':'279943',
    'William':'393493',
    'Bob':'236333', 
    'Linda':'976866'}

This is also valid, and often helps to make long list or dictionary entries more readable.

💡 A third way of creating a dictionary is handy when you have two lists, one containing the keys and the other containing the values.

```python
Names = ['John','Mary']
Numbers = ['463673','279943']
PhoneNumbers = dict(zip(Names,Numbers))
```

If we are not sure whether a key is in a dictionary or not, we can test it:

In [6]:
print('Robbie' in PhoneNumbers)
print('Mary' in PhoneNumbers)

False
True


It is often convenient to combine dictionaries and lists, making a dictionary of lists. For instance, you could associate lists of contact data with particular persons.

```python
Contacts = {}
# [PhoneNumber, City, YearOfBirth]
Contacts['John'] = ['463673', 'Zurich', 1984]
Contacts['Mary'] = ['279943', 'Paris', 1983]
Contacts['William'] = ['393493', 'London', 1990]
```

You can also create a dictionary of dictionaries.
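
A dictionary of dictionaries might look like the following minimal sketch. It reuses the contact data from above; the field names (`phone`, `city`, `year_of_birth`, `email`) and the e-mail address are illustrative assumptions.

```python
# Each person maps to an inner dictionary; the field names are illustrative
Contacts = {
    'John': {'phone': '463673', 'city': 'Zurich', 'year_of_birth': 1984},
    'Mary': {'phone': '279943', 'city': 'Paris', 'year_of_birth': 1983},
}

# Chain two lookups: first the person, then the field
print(Contacts['Mary']['city'])

# Add a new field for an existing person (made-up address)
Contacts['John']['email'] = 'john@example.com'
print(sorted(Contacts['John'].keys()))
```

Compared with a dictionary of lists, the inner keys document what each value means, so you do not have to remember that the city is at position 1.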

In summary, dictionaries are more flexible than lists: in a dictionary, you can create an entry for, e.g., key 16 without having any data for keys 0 to 15. This lets you fill in data as they become known.

Importantly, dictionaries provide efficient lookup: finding a value by its key takes roughly the same time no matter how many entries the dictionary holds.
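
The "key 16 without keys 0 to 15" point can be tried out directly. This is a small sketch; the variable names and the numbers are made up for illustration.

```python
# A list needs positions 0..15 to exist before position 16 can be assigned
measurements_list = []
try:
    measurements_list[16] = 0.73
except IndexError as error:
    print('List:', error)

# A dictionary accepts any key right away
measurements = {}
measurements[16] = 0.73
measurements[3] = 0.12
print(measurements)
print(len(measurements))   # only the entries we actually stored
```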

## Important dictionary functions
### The .get() function

The `.get()` function extracts a value from a dictionary, returning a default value (`None` unless you specify one) if the entry doesn't exist.

```python
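codon_table['ATG']           # raises a KeyError if the key is missing
codon_table.get('ATG')       # same value if the key exists, None otherwise

codon_table.get('ATG', 0)    # returns 0 instead of None if the key is missing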
```
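
To see the difference in practice, here is a short sketch. It uses a two-entry stand-in for `codon_table` so that it runs on its own; the key `'NNN'` and the counting example are illustrative assumptions.

```python
# Small stand-in table so the example is self-contained (not the full codon table)
codon_table = {'ATG': 'M', 'TGG': 'W'}

print(codon_table.get('ATG'))        # key exists
print(codon_table.get('NNN'))        # key missing, no default given
print(codon_table.get('NNN', '?'))   # key missing, default supplied

# .get() with a default of 0 shortens the "have we seen this before?" pattern
letter_count = {}
for letter in 'GATTACA':
    letter_count[letter] = letter_count.get(letter, 0) + 1
print(letter_count)
```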

### The .keys() function

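Returns the keys of a dictionary. In Python 3 the result is a view object; wrap it in `list()` if you need a real list. Since Python 3.7 the keys come back in the order in which they were inserted; older versions make no guarantee about the order.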

### The .values() function

Returns the values of a dictionary, again as a view object in Python 3. The values come back in the same order as the corresponding keys.
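
A quick sketch of what `.keys()` and `.values()` return in Python 3 (assumed here): view objects that follow later changes to the dictionary, which can be turned into lists with `list()`.

```python
PhoneNumbers = {'John': '463673', 'Mary': '279943'}

keys = PhoneNumbers.keys()                   # a view, not a list
key_list = list(PhoneNumbers.keys())         # a snapshot as a real list
value_list = list(PhoneNumbers.values())
print(key_list, value_list)

# The view reflects later changes, the list snapshot does not
PhoneNumbers['Bob'] = '236333'
print(len(keys), len(key_list))
```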

## Listing keys and values

Keys and values are usually listed with a `for` loop. In Python pseudocode:

```python
for MyItem in MyCollection:
   do a command with MyItem
   do another command
resume operation of main commands
```

We can loop through a dictionary in two ways:

In [7]:
# traversing through keys
for person in PhoneNumbers.keys():
    print(person, '->', PhoneNumbers[person])
    
# the short version:
#for person in PhoneNumbers:
#   print(person, PhoneNumbers[person])

John -> 463673
Mary -> 279943
William -> 393493
Bob -> 236333
Linda -> 976866


In [8]:
# we can also get keys and values directly
for key, value in PhoneNumbers.items():
    print(key, '->', value)

John -> 463673
Mary -> 279943
William -> 393493
Bob -> 236333
Linda -> 976866


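It is important to note that dictionaries were historically “unordered” and did not remember the sequence of their items (i.e. the order in which key:value pairs were added to the dictionary), so the order in which items were returned from loops could appear random. Since Python 3.7, dictionaries are guaranteed to iterate in insertion order, as in the outputs above. Even so, values are looked up by key, not by position, and if you need a particular order (e.g. alphabetical) you have to sort explicitly.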
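
The following sketch checks the iteration order, assuming Python 3.7 or later, where the language guarantees that dictionaries iterate in insertion order. The dictionary name `numbers` is made up for this example.

```python
numbers = {}
numbers['William'] = '393493'
numbers['Bob'] = '236333'
numbers['Linda'] = '976866'
print(list(numbers))          # keys in the order they were added

numbers['Bob'] = '000000'     # updating a value does not move the key
del numbers['William']
numbers['William'] = '393493' # a re-inserted key goes to the end
print(list(numbers))
```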

## Sorting

Dictionaries become even more useful in combination with sorting: we can sort the keys and then retrieve the associated values in that order. Note that `sorted()` returns a new sorted list and leaves the dictionary itself unchanged.

```python
SortedKeys = sorted(PhoneNumbers.keys())
```

We can then loop over the sorted keys (we could also have written `sorted(PhoneNumbers)` without `.keys()`, because iterating over a dictionary yields its keys):

```python
for PersonSorted in SortedKeys:
    print(PersonSorted, PhoneNumbers[PersonSorted])
```

To get a list containing sorted values we do:

```python
SortedValues = sorted(PhoneNumbers.values())
```

If we need the key-value pairs sorted by values it is a bit more complicated:

```python
SortedValuesAsPairs = sorted(PhoneNumbers.items(), key=lambda x: x[1])
```

Again we get back a list, this time of `(key, value)` tuples.

The sorting order can be reversed by passing the `reverse=True` parameter to `sorted()`.
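
As a brief sketch of `reverse=True`, using a shortened phone book so the example runs on its own:

```python
PhoneNumbers = {'John': '463673', 'Mary': '279943', 'Bob': '236333'}

# Keys in descending alphabetical order
keys_descending = sorted(PhoneNumbers, reverse=True)
print(keys_descending)

# Key-value pairs sorted by value, largest first
# (the numbers are strings here, so they are compared character by character)
pairs_by_value = sorted(PhoneNumbers.items(), key=lambda x: x[1], reverse=True)
for name, number in pairs_by_value:
    print(name, '->', number)
```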

## Key points

- Dictionary (also known as associative array, hash or map) is another type of container
- Collection of names, or keys, with each key pointing to an associated value
- The keys for a dictionary must be unique
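- Keys are often integers or strings, but any immutable (hashable) value, such as a tuple, can be used
- Elements are accessed by key, not by position; since Python 3.7, iteration follows insertion order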
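
Two of these points can be checked with a few lines. The sketch below uses made-up names and an illustrative number (not a real measurement); it shows that assigning to an existing key replaces its value, and that keys are not limited to integers and strings as long as they are hashable.

```python
# Keys are unique: assigning to an existing key replaces the value
ages = {'Mary': 30}
ages['Mary'] = 31
print(len(ages), ages['Mary'])

# A tuple is hashable and can serve as a key (489 is an illustrative number)
distances = {('Zurich', 'Paris'): 489}
print(distances[('Zurich', 'Paris')])

# A list is mutable and therefore cannot be used as a key
list_key_failed = False
try:
    bad = {['Zurich', 'Paris']: 489}
except TypeError as error:
    list_key_failed = True
    print('TypeError:', error)
```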

## Further Reading

See `OrderedDict` from the `collections` module, which remembers the order in which entries were added (plain dictionaries do the same since Python 3.7): https://docs.python.org/3/library/collections.html

## Exercise: Codon Table

We have a dictionary `codon_table` containing the codon table; it looks up the amino acid encoded by a codon.

```python
codon_table = {  
  'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',  
  'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',  
  'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',  
  'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',  
  'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',  
  'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',  
  'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',  
  'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*', 'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}  
```

For instance, `print(codon_table['ATG'])` will print out `M`.

1. Write a script that prints out the protein sequence encoded by the DNA 'GTGCGGCCACCT'. Print out the position, the codon and the growing amino acid sequence. Hint: look at the function `range()`
2. Convert it into a function (if you did not write it as a function)
3. Make your script safer by also allowing both upper and lower case letters, check that only valid letters occur, ...
 💡 Use the Cheat Sheet to find out which string function to use for converting letters to upper or lower case
4. Make a count table of how many codons encode the same amino acid
5. Print out a sorted count table
6. Write a backtranslator (protein -> DNA). Make up an example amino acid sequence, backtranslate it into DNA and translate it again into a protein (using the function you wrote above).

## Exercise: Counting insects

You found the following script on the internet. Try to understand it. How does the input file need to be formatted?
1. Make up an input file and run the script.
2. Modify the script so that it reads in only the second column of each line.

```python
# CountInsects.py
import sys

def count_names(lines):
    '''Count unique lines of text, returning dictionary.'''

    result = {}                   # Create an empty dictionary to fill
    for name in lines:            # Handle input values one at a time
        name = name.strip()       # Remove surrounding whitespace and the newline
        if name in result:        # If we have seen this name before...
            result[name] = result[name] + 1    # add one to its count
        else:                     # If it is the first time we see that name
            result[name] = 1

    return result


reader = open(sys.argv[1], 'r')
lines = reader.readlines()
reader.close()

# Count distinct values
count = count_names(lines)
for name in count:
   print(name, count[name])  
```

## Solutions for Exercise Codon Table

In [9]:
codon_table = {  
  'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',  
  'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',  
  'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',  
  'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',  
  'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',  
  'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',  
  'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',  
  'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*', 'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}

In [10]:
print(codon_table['ATG'])

M


### 1. Write a script that prints out the protein sequence encoded by the DNA 'GTGCGGCCACCT'. Print out the position, the codon and the growing amino acid sequence.

In [11]:
dna = 'GTGCGGCCACCT'

aaseq = ''

for pos in range(0, len(dna)-2, 3):   # stop early so only complete 3-letter codons are read
    codon = dna[pos:pos+3]
    aaseq += codon_table[codon]
    print(pos, codon, aaseq)

0 GTG V
3 CGG VR
6 CCA VRP
9 CCT VRPP


### 2. Convert it into a function (if you did not write it as a function)

In [12]:
def Translate_DNA(dna):

   codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

   aaSeq = ''
   for pos in range(0, len(dna)-2, 3):   # stop early so only complete 3-letter codons are read
      codon = dna[pos:pos+3]
      aaSeq += codon_table[codon]
   return aaSeq



print(Translate_DNA('GTGCGGCCACCT'))

VRPP


### 3. Make your script safer by also allowing both upper and lower case letters, check that only valid letters occur, the length of the DNA sequence is a multiple of 3...

In [13]:
def Safe_Translate_DNA(dna):
   "Translates DNA into amino acid and checks that input sequence is a valid DNA sequence"
   
   from sys import exit
   
   codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

   dnaUpper = dna.upper()
   if len(dnaUpper) % 3 != 0 :
       exit('DNA sequence length is not a multiple of 3')
   for letter in dnaUpper:
      if letter not in ['G','A','T','C']:
          exit('Invalid letter(s) other than GATC present in sequence') # sys.exit stops execution
   aaSeq = ''
   for pos in range(0, len(dnaUpper)-1, 3):
      codon = dnaUpper[pos:pos+3]
      aaSeq += codon_table[codon]
   return aaSeq


print(Safe_Translate_DNA('GTGCGGCCACCT'))

VRPP


### 4. Make a count table of how many codons encode the same amino acid

In [14]:
codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

codon_count = {}

for codon in codon_table:
    aa = codon_table[codon]
    if aa in codon_count:     # If we have seen this AA before...
        codon_count[aa] = codon_count[aa] + 1    # add one to its count
    else:                                      # If it is the first time we see this AA
        codon_count[aa] = 1

for ele in codon_count:
    print(ele, codon_count[ele])

I 3
M 1
T 4
N 2
K 2
S 6
R 6
L 6
P 4
H 2
Q 2
V 4
A 4
D 2
E 2
G 4
F 2
Y 2
_ 3
C 2
W 1


### 5. Print out a sorted count table

In [15]:
# Prints out number of different codons encoding the same amino acid 

codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

codon_count = {}

for codon in codon_table:
    aa = codon_table[codon]
    if aa in codon_count:     # If we have seen this AA before...
        codon_count[aa] = codon_count[aa] + 1    # add one to its count
    else:                                      # If it is the first time we see this AA
        codon_count[aa] = 1

sorted_codon_count = sorted(codon_count.items(), key=lambda x: x[1], reverse=True)

for ele in sorted_codon_count:
    print(ele[0], ele[1])

S 6
R 6
L 6
T 4
P 4
V 4
A 4
G 4
I 3
_ 3
N 2
K 2
H 2
Q 2
D 2
E 2
F 2
Y 2
C 2
M 1
W 1
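
As an alternative to the hand-written counting loop, the standard library offers `collections.Counter`. This is a swapped-in technique, not the solution above; the sketch uses a reduced stand-in for `codon_table` so that it stays short and runs on its own.

```python
from collections import Counter

# Reduced stand-in table (the full codon_table works the same way)
codon_table = {'ATG': 'M', 'TGG': 'W', 'AAA': 'K', 'AAG': 'K',
               'ATA': 'I', 'ATC': 'I', 'ATT': 'I'}

# Counter tallies how often each amino acid occurs among the values
codon_count = Counter(codon_table.values())

# most_common() returns (amino acid, count) pairs, highest count first
for aa, count in codon_count.most_common():
    print(aa, count)
```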


### 6. Write a backtranslator (protein -> DNA). Make up an example amino acid sequence, backtranslate it into DNA and translate it again into a protein (using the function you wrote above).

In [16]:
# Backtranslates an amino acid sequence

def Backtranslate_DNA(aa_seq):
    
    codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}
    
    rev_codon_table = {}
    
    for codon in codon_table:
        aa = codon_table[codon]
        rev_codon_table[aa] = codon

    nt_seq = ''
    
    for aa in aa_seq:
        nt_seq += rev_codon_table[aa]

    return nt_seq    
  
 
aa = 'VRPP'
aa_backtr = Backtranslate_DNA(aa)

print(aa)
print(aa_backtr)
print(Safe_Translate_DNA(aa_backtr))

VRPP
GTTCGTCCTCCT
VRPP
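
Note that the reverse table above keeps only one codon per amino acid, because each assignment to `rev_codon_table[aa]` overwrites the previous one. If all synonymous codons should be kept, a dictionary of lists is one option. The sketch below uses a reduced stand-in table, and the names `codons_for` and `backtranslate_first` are hypothetical.

```python
# Reduced stand-in table so the example runs on its own
codon_table = {'GTT': 'V', 'GTG': 'V', 'CGT': 'R', 'CGG': 'R',
               'CCT': 'P', 'CCA': 'P'}

# Map each amino acid to the list of all codons that encode it
codons_for = {}
for codon, aa in codon_table.items():
    codons_for.setdefault(aa, []).append(codon)

def backtranslate_first(aa_seq):
    '''Backtranslate using the first codon stored for each amino acid.'''
    return ''.join(codons_for[aa][0] for aa in aa_seq)

print(codons_for['V'])
print(backtranslate_first('VRPP'))
```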


## Solution for Exercise Counting insects

```text
$ more SeenInsects.txt
Ladybug
Mealworm
Fruitfly
Bumblebee
Hoverfly
Fruitfly

$ python CountInsects.py SeenInsects.txt
Ladybug 1
Hoverfly 1
Mealworm 1
Bumblebee 1
Fruitfly 2
```
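
For the second task (reading only the second column), one possible sketch follows. It assumes that the columns are separated by whitespace and that the insect name is in the second column; the helper name `second_column` and the sample lines are made up for illustration.

```python
def second_column(lines):
    '''Return the second whitespace-separated field of each line.'''
    values = []
    for line in lines:
        fields = line.split()          # split on any whitespace
        if len(fields) >= 2:           # skip empty or one-column lines
            values.append(fields[1])
    return values

def count_names(names):
    '''Count how often each name occurs, returning a dictionary.'''
    result = {}
    for name in names:
        result[name] = result.get(name, 0) + 1
    return result

# Made-up sample input: observation date, then insect name
lines = ['2021-05-01 Ladybug\n', '2021-05-01 Fruitfly\n',
         '\n', '2021-05-02 Fruitfly\n']
count = count_names(second_column(lines))
for name in count:
    print(name, count[name])
```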