# Python primer 7, Data Science I

## Topics to cover

- Dictionaries
- More work with regular expressions
- Input/Output: using loops to process many files at once.

## Useful materials
- Haddock and Dunn chapters and 9, bit of 11 
- regular expression cheat sheet (python-regular-expressions-cheat-sheet-1.pdf)
- slides from class
- [python for biologists dictionaries tutorial](https://pythonforbiologists.com/tutorial/dictionaries.html)


## Creating Dictionaries

Dictionaries (also known as associative arrays, or hashes) are data structures that store pairs of objects. Each pair consists of a key and a value. Structures that store paired data are common to many programming languages for good reason: we often want one piece of information to be directly connected to another (names and phone numbers; gene name and sequence; site identifier and measurement; codons and their associated amino acids). We can pair such data using lists, but this turns out to be quite slow (and clumsy) for larger data, because finding an element in a list means scanning through it one position at a time. Dictionaries instead locate each value directly from its key (by hashing the key), which lets you look up or access pairs of information much more quickly. Items are accessed by key rather than by position. Since Python 3.7 a dictionary does remember the order in which its items were added, but it is not a sorted structure.


Dictionaries can be created by listing key:value pairs within curly brackets, with each pair (also referred to as an item) separated by a comma. The example below shows how to hard-code a dictionary (in this case, names paired with phone numbers) consisting of three items:

In [1]:
Pbook = { 'Ken Stephens':'2599855888','Mick Collins':'3544333321', 'Jen Miles':'9875842194' }


While Python has strict indentation rules, a statement can be split across lines when it is contained within curly brackets. This makes things much easier to read:

In [2]:
Pbook = { 
        'Ken Stephens':'2599855888', 
        'Mick Collins':'3544333321', 
        'Jen Miles':'9875842194' 
    }

The entire contents of a dictionary can be printed (although this is seldom useful):

In [3]:
print(Pbook)

{'Ken Stephens': '2599855888', 'Mick Collins': '3544333321', 'Jen Miles': '9875842194'}


Values are extracted from a dictionary by specifying the **dictionary name** and a key. The key is specified within `[]`. The example below prints '3544333321', the value stored for the key 'Mick Collins'.


In [4]:
print(Pbook['Mick Collins'])

3544333321
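To see why looking values up by key is convenient, here is a minimal sketch that stores the same phone book both as a list of pairs and as a dictionary. The names `pairs`, `lookup_in_list`, and `phone_dict` are made up for this illustration and are not used elsewhere in the primer.

```python
# A list of (name, number) pairs: finding a number means scanning the list.
pairs = [('Ken Stephens', '2599855888'),
         ('Mick Collins', '3544333321'),
         ('Jen Miles', '9875842194')]

def lookup_in_list(pair_list, wanted):
    """Return the value paired with `wanted`, or None if it is absent."""
    for name, number in pair_list:
        if name == wanted:
            return number
    return None

# The same data as a dictionary: the key leads straight to its value.
phone_dict = dict(pairs)

list_result = lookup_in_list(pairs, 'Jen Miles')   # walks the list pair by pair
dict_result = phone_dict['Jen Miles']              # direct lookup by key

print(list_result)
print(dict_result)
```

Both approaches give the same answer. With three entries the difference is invisible, but the list version has to check more and more pairs as the data grows, while the dictionary lookup does not.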


Here is another dictionary, with western cities as keys and mid-October temperatures as values.

In [1]:
Ctemp = { 
        'Tucson': 95, 
        'Truckee': 65, 
        'Reno': 74,
        'Laramie': 50,
        'Flagstaff': 75,
        'Bozeman': 70,
        'Ketchum': 65 
    }

## Useful dictionary methods.

Before we think about building dictionaries while processing real data, let's use the above example dictionary to demonstrate some commonly used methods to manipulate and extract information from dictionaries.

- `.pop()`
- `.get()`
- `.items()`
- `.keys()`
- `.values()`
- `.update()`
- `.copy()`

We can print a specific value, or assign it to a new variable, simply by using its key:


In [6]:
print(Ctemp['Reno'])    # prints 74 
Rtemp = Ctemp['Reno']   # assigns 74 to Rtemp

74


Key:value pairs can be removed from a dictionary using `.pop()`.  

In [7]:
Ctemp.pop('Tucson')  # will remove Tucson key and 95 value, returns value, dictionary no longer has this key:value pair
print(Ctemp)
Ctemp['Tucson']=95  # adds that key:value pair back.
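As a small sketch of two details of `.pop()`: it hands back the value it removes, and it accepts an optional second argument that is returned when the key is missing (without it, a missing key raises `KeyError`). The dictionary `temps` and the city 'Boise' are stand-ins chosen for this example.

```python
temps = {'Tucson': 95, 'Truckee': 65, 'Reno': 74}

# .pop() removes the pair and returns the value that was stored.
removed = temps.pop('Tucson')
print(removed)
print('Tucson' in temps)   # membership test on keys

# With a second argument, a missing key returns that default
# instead of raising KeyError.
missing = temps.pop('Boise', None)
print(missing)
```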

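`.keys()` returns a `dict_keys` view of the keys, `.values()` returns a `dict_values` view of the values, and `.items()` returns a `dict_items` view of (key, value) pairs. These views are not lists, but they can be looped over just like lists.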

In [9]:
vlist=Ctemp.values()
for v in vlist:
    print(v)  ## prints values




65
74
50
75
70
65
95


In [10]:
klist=Ctemp.keys()
for k in klist:
    print(k)  ## prints keys


Truckee
Reno
Laramie
Flagstaff
Bozeman
Ketchum
Tucson
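The objects returned by `.keys()`, `.values()`, and `.items()` stay linked to the dictionary. The sketch below, using a made-up two-city dictionary named `temps`, contrasts such a view with a true list made by wrapping the call in `list()`.

```python
temps = {'Truckee': 65, 'Reno': 74}

key_view = temps.keys()         # a view: stays linked to the dictionary
key_list = list(temps.keys())   # a real list: a snapshot taken right now

temps['Laramie'] = 50           # add a pair after both were created

print(key_view)   # the view reflects the new key
print(key_list)   # the list does not
```

Use `list()` when you need list features such as indexing (`key_list[0]`), which views do not support.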


In [7]:
ilist=Ctemp.items()
for i in ilist:
    print(i)    ## prints items (key:value) pairs

Ctemp.get('Tucson') ## returns 95, the value for the Tucson key

('Tucson', 95)
('Truckee', 65)
('Reno', 74)
('Laramie', 50)
('Flagstaff', 75)
('Bozeman', 70)
('Ketchum', 65)


95
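The main reason to use `.get()` rather than square brackets is how it treats missing keys. A minimal sketch, again with a made-up `temps` dictionary and a city ('Boise') that is deliberately absent:

```python
temps = {'Tucson': 95, 'Reno': 74}

found = temps.get('Tucson')          # key present: returns its value
absent = temps.get('Boise')          # key absent: returns None, no error
fallback = temps.get('Boise', 'NA')  # key absent: returns the supplied default

print(found, absent, fallback)

# Square brackets, by contrast, raise KeyError for a missing key.
try:
    temps['Boise']
except KeyError:
    print("Boise is not a key in temps")
```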

`.update()` adds a key:value pair to the dictionary, or overwrites the value if the key already exists. Specify both the key and value within curly brackets, surrounded by parentheses. Note below that Bridgeport is a string, so it is enclosed in "", and 62 is an integer.


In [2]:
Ctemp.update({"Bridgeport": 62})
print(Ctemp)

{'Tucson': 95, 'Truckee': 65, 'Reno': 74, 'Laramie': 50, 'Flagstaff': 75, 'Bozeman': 70, 'Ketchum': 65, 'Bridgeport': 62}
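`.update()` can also take several pairs at once, and plain assignment with `[]` is a common alternative for a single pair. A short sketch with a made-up `temps` dictionary (the new values are invented for illustration):

```python
temps = {'Reno': 74, 'Laramie': 50}

# Several pairs at once: 'Bridgeport' is new, 'Reno' already exists.
temps.update({'Bridgeport': 62, 'Reno': 78})
print(temps)

# Assigning to a key does the same job for one pair.
temps['Moab'] = 80
print(temps)
```

An existing key keeps its position in the dictionary and only its value changes; new keys are added at the end.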


`.copy()` returns a new (shallow) copy of the dictionary, which can be assigned to another variable. This is useful for making a backup prior to modifying a dictionary.

In [3]:
Ntemp = Ctemp.copy()
print(Ntemp)

{'Tucson': 95, 'Truckee': 65, 'Reno': 74, 'Laramie': 50, 'Flagstaff': 75, 'Bozeman': 70, 'Ketchum': 65, 'Bridgeport': 62}
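Why not simply write `Ntemp = Ctemp`? Plain assignment gives the same dictionary a second name, so changes made through either name show up in both. The sketch below uses made-up names (`original`, `alias`, `duplicate`) to show the difference.

```python
original = {'Reno': 74, 'Laramie': 50}

alias = original              # same dictionary, second name
duplicate = original.copy()   # new dictionary holding the same pairs

duplicate['Reno'] = 0         # changes only the copy
alias['Laramie'] = 0          # changes the original too

print(original)
print(duplicate)
```

Note that `.copy()` is shallow: if the values are themselves lists or dictionaries, the copy and the original still share those inner objects.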


Getting the length of a dictionary (its number of key:value pairs) works the same way as getting the length of a list:

In [7]:
Length_Ctemp=len(Ctemp)
print(Length_Ctemp)

8


### Nested Dictionaries.

The use of nested dictionaries (dictionaries within dictionaries) adds flexibility, although things do tend to get complicated. Below is a short example of how to build nested dictionaries and how to access specific elements.

In [33]:
#Making nested dictionary

Bengals = {'Offense': {9: 'Burrow', 1: "Chase", 83: "Boyd", 75: "Brown", 5: "Higgins"}, 
        'Defense': {29: 'Taylor-Britt', 55: "Wilson", 20: "Turner", 21: "Hilton", 33: "Scott"}} 

# Accessing element using key 
print("Offense dict: ", Bengals['Offense']) 
print("Offense, number 1: ", Bengals['Offense'][1]) 
print("Defense, number 29: ", Bengals['Defense'][29]) 

#iterating through nested dictionaries
for i in Bengals:
    print(i, Bengals[i])
    print(i, Bengals[i].keys())
    print(i, Bengals[i].values())

Offense dict:  {9: 'Burrow', 1: 'Chase', 83: 'Boyd', 75: 'Brown', 5: 'Higgins'}
Offense, number 1:  Chase
Defense, number 29:  Taylor-Britt
Offense {9: 'Burrow', 1: 'Chase', 83: 'Boyd', 75: 'Brown', 5: 'Higgins'}
Offense dict_keys([9, 1, 83, 75, 5])
Offense dict_values(['Burrow', 'Chase', 'Boyd', 'Brown', 'Higgins'])
Defense {29: 'Taylor-Britt', 55: 'Wilson', 20: 'Turner', 21: 'Hilton', 33: 'Scott'}
Defense dict_keys([29, 55, 20, 21, 33])
Defense dict_values(['Taylor-Britt', 'Wilson', 'Turner', 'Hilton', 'Scott'])
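As a further sketch, an inner dictionary can be modified through two sets of square brackets, and two nested `for` loops reach every innermost pair. The `roster` dictionary here is a cut-down stand-in for the example above.

```python
roster = {'Offense': {9: 'Burrow', 1: 'Chase'},
          'Defense': {55: 'Wilson', 21: 'Hilton'}}

# Add a pair to one of the inner dictionaries.
roster['Offense'][5] = 'Higgins'

# Outer loop: each unit and its inner dictionary.
# Inner loop: each jersey number and name, sorted by number.
for unit, players in roster.items():
    for number, name in sorted(players.items()):
        print(unit, number, name)

offense_numbers = sorted(roster['Offense'])   # sorted keys of one inner dictionary
print(offense_numbers)
```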


## Building dictionaries on the fly with real data

We will rarely construct dictionaries by hardcoding them, as above. Instead, we will populate them with key:value pairs as we loop through data. As with lists, we will often want to start with an empty dictionary that we populate within some form of loop. This is done simply with empty curly brackets.

In [13]:
Dict = {}
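A minimal sketch of the pattern: start with an empty dictionary and fill it inside a loop, here by counting how often each base occurs in a sequence. The sequence `seq` is made up, and `base_count` is just an illustrative name.

```python
seq = "ATCGGCAGCTACATACC"   # made-up DNA sequence

base_count = {}             # start empty
for base in seq:
    # .get() returns the current count, or 0 the first time a base is seen
    base_count[base] = base_count.get(base, 0) + 1

print(base_count)
```

Each new base becomes a key the first time it is encountered; after that, its value is simply incremented.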

Let's look at some example code below that reads in a .fasta file containing alternating lines of identifiers and DNA sequence data. Within the `for` loop, we process two lines at a time, assigning them to different variables. We then build the dictionary `ID_Seq` within the loop, using ID as keys and Seq as values. I like to name dictionaries with two-part names that suggest the variables used for keys and values, but as with scalars and lists, you can name dictionaries anything you want.

In [9]:
import glob
file=glob.glob("SG_ref.fasta.short") # note, glob.glob always returns a list, even if it only has one item.
print(file) # prints the list of matching file names
IN = open(file[0], 'r') # see above comment, accessing first and only item in file list

['SG_ref.fasta.short']


In [10]:
ID_Seq={}           # initializing the ID_seq dictionary

for Line in IN:
    ID = Line.strip('\n')
    Seq = IN.readline()
    Seq = Seq.strip('\n')
    ID_Seq[ID]=Seq      # building the ID_seq dictionary
IN.close()          # done reading, so close the file handle
# work on the dictionary outside of the loop
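The same two-lines-at-a-time pattern can be tried without any file on disk. The sketch below is an illustration under stated assumptions: `io.StringIO` stands in for an open file, the two records are made up, and the names `fake_file`, `handle`, and `id_seq` are hypothetical. It also uses a `with` block, which closes the handle automatically.

```python
import io

# Stand-in for an open file: two made-up records in the alternating
# ID line / sequence line layout.
fake_file = io.StringIO(">seq_A\nATCGGC\n>seq_B\nGGACGC\n")

id_seq = {}
with fake_file as handle:                    # handle is closed when the block ends
    for line in handle:                      # the loop reads the ID line
        seq_id = line.strip()
        seq = handle.readline().strip()      # readline() grabs the sequence line
        id_seq[seq_id] = seq                 # ID is the key, sequence is the value

print(id_seq)
```

With a real file, the only change would be to replace `fake_file` with a call to `open()` on the file name.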

## Iterating over dictionaries.

Much like we iterate over lists, we can iterate through dictionaries to access, and operate on, both key and value information within loops. The example code below illustrates using a `for` loop to iterate through the dictionary created above. The `sorted()` function returns the dictionary's keys as a sorted list, which ensures that the same order is followed each time you loop through the dictionary.

In [12]:
for thing in sorted(ID_Seq.keys()):
    print("Dictionary key is: ", thing)
    print("Dictionary value is: ",ID_Seq[thing])

Dictionary key is:  >scaffold_1
Dictionary value is:  ATCGGCAGCTACATACCCTCCCTTATCCTGCAAAGCAGCTGGGATCAATCAAAGTATAAAATCAAAACCTTTTTACAGATCGGAAGA
Dictionary key is:  >scaffold_10
Dictionary value is:  GGACGCTCTCGGCTTCAGAACGGCACAACTTCGCTCTCACGGCTCGCAAGTGGCACTCGCGGTGAACTGCTCCGGAGCGTTTGCCGT
Dictionary key is:  >scaffold_100
Dictionary value is:  AGCTGTTCAGTATGAAATGGTTCTCAGAAGCAGCAGGCTCTTTGCCAAAGGGAGAGCTCCAATTAATACAAAACCAGGCTGCCCA
Dictionary key is:  >scaffold_101
Dictionary value is:  AGCCTGCTGCCTTTCCTCTGGAGGGGGTTATTTTCTCCGCACTCCAGAAAGGATCTCTCCCGACATCTTCCAATGTGCTCCCTGAG
Dictionary key is:  >scaffold_102
Dictionary value is:  AGCCTGCTGCCTTTCGTCCGGAGTGGGTTATTTTCTCCGCACTCCAGAAACGATCTCTCCCGACTTCTTCCAATGTGCTCCCTGAGA
Dictionary key is:  >scaffold_103
Dictionary value is:  AGCTGTTCAGTATGAAATGGTTCTCAGAAGCAGCAGGCTCTTTGCCAAAGGGAGAGCTCCAATTAATGCTTCTTATCTGCAACATCC
Dictionary key is:  >scaffold_104
Dictionary value is:  ACCTGCACAGAAAGAATCGATGCCTCAGAACGGGAAAGCACGGAACCAGAAGAGAGAATAAAACGGCATTCAGATTGGGAATTACT

Each iteration through the loop above, "thing" is assigned a key. Hence, the code will print to screen each key followed by its value on the following line.

Here are some examples, based on the little Ctemp dictionary created above, of iterating through dictionaries.


In [14]:
Ctemp = { 
        'Tucson': 95, 
        'Truckee': 65, 
        'Reno': 74,
        'Laramie': 50,
        'Flagstaff': 75,
        'Bozeman': 70,
        'Ketchum': 65 
    }   

for city in Ctemp.keys():
    print("Ctemp key is: ",city)
    print("Ctemp value is: ", Ctemp[city])

Ctemp key is:  Tucson
Ctemp value is:  95
Ctemp key is:  Truckee
Ctemp value is:  65
Ctemp key is:  Reno
Ctemp value is:  74
Ctemp key is:  Laramie
Ctemp value is:  50
Ctemp key is:  Flagstaff
Ctemp value is:  75
Ctemp key is:  Bozeman
Ctemp value is:  70
Ctemp key is:  Ketchum
Ctemp value is:  65


Above we iterate through the dictionary, assigning each key in turn to the `city` variable. The first print statement prints each dictionary key (assigned to `city` in this case), and the second prints the value assigned to that key. If you look at the output above, you will see that the keys come out in the order they were added to the dictionary (guaranteed since Python 3.7), not in sorted order.

To iterate through a dictionary in sorted order, we can use the `sorted()` function on its keys. The only difference from the loop above is that the items are now visited in alphabetical order of their keys.


In [15]:
for city in sorted(Ctemp.keys()):
    print("Ctemp key is: ",city)
    print("Ctemp value is: ", Ctemp[city])
    

Ctemp key is:  Bozeman
Ctemp value is:  70
Ctemp key is:  Flagstaff
Ctemp value is:  75
Ctemp key is:  Ketchum
Ctemp value is:  65
Ctemp key is:  Laramie
Ctemp value is:  50
Ctemp key is:  Reno
Ctemp value is:  74
Ctemp key is:  Truckee
Ctemp value is:  65
Ctemp key is:  Tucson
Ctemp value is:  95
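`sorted()` orders by key by default. If you would rather order by value, one option is to pass the dictionary's own `.get` method as the `key=` argument, so each key is ranked by the value it maps to. A sketch with a shortened, made-up `temps` dictionary:

```python
temps = {'Tucson': 95, 'Truckee': 65, 'Reno': 74, 'Laramie': 50}

# Rank each city by its temperature rather than by its name.
by_temp = sorted(temps, key=temps.get)
for city in by_temp:
    print(city, temps[city])

# reverse=True flips the order, warmest first.
warmest_first = sorted(temps, key=temps.get, reverse=True)
print(warmest_first)
```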


It is often useful to iterate over both keys and values (items) in a dictionary, accessing each through its own variable name. This is fairly straightforward as well. The example below uses the same Ctemp dictionary as above, with the `.items()` dictionary method and separate variable names for keys and values:

In [17]:
for city, temp in Ctemp.items():
    print("Key item is: ", city) # do something with keys
    print("Value item is: ", temp) # do something with values

Key item is:  Tucson
Value item is:  95
Key item is:  Truckee
Value item is:  65
Key item is:  Reno
Value item is:  74
Key item is:  Laramie
Value item is:  50
Key item is:  Flagstaff
Value item is:  75
Key item is:  Bozeman
Value item is:  70
Key item is:  Ketchum
Value item is:  65


Notice in the above example, variables are assigned between `for` and `in` with the first variable name corresponding with keys, and the second corresponding with values. As usual, the names of those variables are arbitrary.
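Having both variables available in the loop makes it easy to act on one based on the other. As a sketch, the code below builds a second dictionary holding only the cities at or above a threshold. The 70 degree cutoff and the names `temps` and `warm` are arbitrary choices for illustration.

```python
temps = {'Tucson': 95, 'Truckee': 65, 'Reno': 74, 'Laramie': 50,
         'Flagstaff': 75, 'Bozeman': 70, 'Ketchum': 65}

warm = {}                            # will hold only the warm cities
for city, temp in temps.items():     # city gets the key, temp gets the value
    if temp >= 70:
        warm[city] = temp

print(warm)
```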


To finish, here is an example chunk of code to open a file handle, build a dictionary, and then loop through that dictionary.

In [20]:
import glob  # also imported above; repeated so this cell runs on its own
import re
file=glob.glob("no60_intron_IME_data.fasta.short") # accessing file in current directory using glob as above.
IN = open(file[0], 'r')

ID_Seq={}           # initializing the ID_seq dictionary
readCTR=0

for Line in IN:
    ID = Line.strip('\n') # first line corresponds to an ID
    Seq = IN.readline() #reading second line .fa, a DNA seq
    Seq = Seq.strip('\n')
    ID_Seq[ID]=Seq      # building the ID_seq dictionary
    readCTR += 1
IN.close()          # done reading, so close the file handle
# work on the dictionary outside of the loop (below)

for id, seq in ID_Seq.items(): # using .items, working on both value and key of dictionary at same time
    
    if re.search("CDS", id):
        print("Sequence is coding")
    else:
        print("Seq not CDS: ", seq)

print("The total number of items in ID_Seq dictionary is: ", readCTR)
       

Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Seq not CDS:  GTGGGTATACTTTCGTCGATTCTTCTTTTTCTTTATGTTTTAAGCTGTTTGTTTCTTTCGATTTTGTGATTCGATCTTTTAATCCGATGTTTCATTCGTCGATCCGAAAACTAGATTGTTCTCTGTATGATGATTTTCTGCTCAATCTTTGTTTGTGATTCTCTTTTGCAAGCCTATGAATTTCATTTGGACTTTCATTTTATGGTTCTGACTACTGAGTGATGCGTATATGTGTTTAGTGAGTTGACCTTGTAAGCTTTGGTTAATTTTTCTTAAATTCTCCAATCGAAATGACCTTGTTGCTGGAACTAGAAATTTGACGTGTAGCAAAACTTGAGAACAGATCATGAAGTTTTTTTTTTTTTTTTTGTATTGTGCATGATTTACTTGATGAAAGAATGTTTTTTTCTTGCTTTTATTGACGATTGAATTGATGGGTCTTACGCGTGTGTATTCATGAGTTTCATCTTCTGTAACTACCTTTTTCTGAAAGCACTATCTATGTTTTGGTATGTCCTTAATTTGCCTAACCACCACGTGATTGCTCATTTGCTCTTTCATTTGAATCTAG
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is coding
Sequence is cod