# Lesson 4: In-class exercises -- ANSWERS
---

Sarah Middleton (http://sarahmid.github.io/)

http://github.com/sarahmid/python-tutorials

---

**Instructions: For each problem, write code in the provided code block. Don't forget to run your code to make sure it works.**

---

**1\. Simple list and dictionary practice**

Using the data below, write code to accomplish the following tasks.

|    Name   | Favorite Food |
|:---------:|:-------------:|
|  Wilfred  |     Steak     |
|  Manfred  |      Duck     |
| Wadsworth |   Spaghetti   |
|   Jeeves  |   Ice cream   |
| Mitsworth |      Tuna     |

**(A)** Make a list of all the names, then loop through the list and print each name out.

In [None]:
names = ["Wilfred", "Manfred", "Wadsworth", "Jeeves", "Mitsworth"]

for name in names:
    print name

**(B)** Below, some of the names and foods have already been added to a dictionary. Fill in the missing entries using the `dict[key] = value` syntax. Then loop through the dictionary and print each name and food combination in the format:

    <NAME>'s favorite food is <FOOD>

In [None]:
favFoods = {"Wilfred":"Steak", "Manfred":"Duck", "Wadsworth":"Spaghetti"}

# add your code below:
favFoods["Jeeves"] = "Ice cream"
favFoods["Mitsworth"] = "Tuna"

for person in favFoods:
    print person + "'s favorite food is " + favFoods[person]
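
In Python 2 (which this lesson uses), a plain dictionary does not promise any particular order when you loop over it, so the names may print in a different order than they were added. If a predictable order matters, one option is to loop over `sorted(favFoods)`. A minimal sketch, written so it runs under Python 2 or 3; the `lines` list is a helper introduced here only for illustration:

```python
favFoods = {"Wilfred": "Steak", "Manfred": "Duck", "Wadsworth": "Spaghetti",
            "Jeeves": "Ice cream", "Mitsworth": "Tuna"}

lines = []  # hypothetical helper list, collected so the result is easy to inspect
for person in sorted(favFoods):  # sorted() returns the keys in alphabetical order
    lines.append(person + "'s favorite food is " + favFoods[person])

for text in lines:
    print(text)
```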

**(C)** In the dictionary from part (B), change Wilfred's favorite food to pizza.

In [None]:
favFoods["Wilfred"] = "Pizza"

print favFoods

---

**2\. Duplicate removal**

Read in the file `genes.txt` and print **only the unique gene IDs** (remove the duplicates). Do not assume repeat IDs appear consecutively in the file. 

*Hint: see the practice exercises from Lesson 4 for an example of how to remove duplicates using a list.*

In [None]:
inFile = "genes.txt"
geneList = []
ins = open(inFile, 'r')

for line in ins:
    geneID = line.rstrip('\r\n')
    if geneID not in geneList:    # keep track of IDs we've already seen using a list
        geneList.append(geneID)   # add to list & print only if this ID hasn't been seen before
        print geneID

ins.close()
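
The list-based check above works, but `geneID not in geneList` scans the whole list each time, which can get slow for large files. A possible variation is to track seen IDs in a set (fast membership checks) while still printing in first-seen order. This is only a sketch: the `lines` list stands in for the contents of `genes.txt`, and the IDs in it are made up.

```python
# Stand-in for the lines of genes.txt; these IDs are made up for illustration.
lines = ["geneA\n", "geneB\n", "geneA\r\n", "geneC\n", "geneB\n"]

seen = set()      # IDs we have already encountered
uniqueIDs = []    # unique IDs in the order they first appeared

for line in lines:
    geneID = line.rstrip('\r\n')
    if geneID not in seen:
        seen.add(geneID)
        uniqueIDs.append(geneID)
        print(geneID)
```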

---

**3\. Split practice**

Read in the file `init_sites.txt` and compute the average CDS length (i.e. average the values in the 7th column). Your answer should be 236.36.

In [None]:
fileName = "init_sites.txt"
totalLen = 0
numLines = 0
ins = open(fileName, 'r')
ins.readline() #skip the first (header) line

for line in ins:
    line = line.rstrip('\n')
    lineParts = line.split()
    totalLen = totalLen + int(lineParts[6]) #all file input is read as string; must convert to int
    numLines = numLines + 1
    
print float(totalLen)/numLines
ins.close()
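
To see the split-and-index step in isolation, here is a small sketch that runs without the input file. The `rows` list and its column names are made up for illustration and are not the real contents of `init_sites.txt`; only the idea (7th column is index 6) carries over.

```python
# Made-up rows standing in for a file: a header line, then whitespace-separated columns.
rows = [
    "id chr start end strand name cds_len\n",
    "g1 chr1 100 400 + foo 300\n",
    "g2 chr1 500 650 - bar 150\n",
    "g3 chr2 10 160 + baz 150\n",
]

totalLen = 0
numLines = 0
for row in rows[1:]:                      # rows[1:] skips the header
    parts = row.rstrip('\r\n').split()    # split() with no argument splits on any whitespace
    totalLen += int(parts[6])             # the 7th column is index 6
    numLines += 1

average = float(totalLen) / numLines      # float() avoids integer division in Python 2
print("%.2f" % average)
```

Formatting with `"%.2f"` rounds the printed value to two decimal places.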

---

**4\. The "many counters" problem**

Write a script that reads a file of sequences and tallies how many sequences there are of each length. Use `sequences3.txt` as input to test your code. After reading through all the sequences, print the sequence length that was the most common.

*Hint: you can use a dictionary to keep track of all the tallies, e.g.:*
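
One way such a tally dictionary might look, as a small sketch on made-up sequences: each key is a sequence length and each value is how many sequences of that length have been seen so far. It uses `dict.get(key, 0)`, which returns 0 when the key is not yet in the dictionary.

```python
seqs = ["ATGC", "GG", "TTAA", "AC", "CCGG"]   # made-up sequences for illustration

lenDict = {}
for seq in seqs:
    seqLen = len(seq)
    lenDict[seqLen] = lenDict.get(seqLen, 0) + 1   # start at 0 if this length is new

print(lenDict)
```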

In [None]:
inFile = "sequences3.txt"
lenDict = {}
ins = open(inFile, 'r')

for line in ins:
    line = line.rstrip('\r\n') #important to do here: '\n' and '\r' each count as a character and would otherwise inflate the sequence length!
    seqLen = len(line)
    if seqLen not in lenDict:
        lenDict[seqLen] = 1
    else:
        lenDict[seqLen] += 1
ins.close()

# loop through the dictionary to find the sequence length with the greatest number of occurrences.
# also print the number of times each sequence length occurred, for informational purposes
maxFreqSeenSoFar = 0
mostFreqLength = None    # "None" is just a null value for initialization

for name in lenDict:
    print name, ":", lenDict[name]
    
    if lenDict[name] > maxFreqSeenSoFar:
        maxFreqSeenSoFar = lenDict[name]
        mostFreqLength = name
    
print ""
print "Most frequent length:", mostFreqLength, "(occurred", maxFreqSeenSoFar, "times)"
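
As an alternative to the explicit loop, Python's built-in `max()` can pick the key with the largest value when given `key=lenDict.get`. A short sketch on made-up tallies (if two lengths tie, `max()` returns just one of them):

```python
lenDict = {4: 3, 2: 2, 7: 1}   # made-up tallies: sequence length -> count

# max() compares the keys by their counts instead of by the keys themselves
mostFreqLength = max(lenDict, key=lenDict.get)
maxFreq = lenDict[mostFreqLength]

print("Most frequent length: %d (occurred %d times)" % (mostFreqLength, maxFreq))
```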


---

# Homework exercise
---

**Codon table**

For this question, use `codon_table.txt`, which contains a list of all possible codons and their corresponding amino acids. We will be using this info to translate a nucleotide sequence into amino acids. Each part of this question builds off the previous parts.

**(A)** Thinkin' question (short answer, not code): If we want to create a codon dictionary and use it to translate nucleotide sequences, would it be better to use the codons or amino acids as keys? 
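
One way to reason about this, sketched in code: a dictionary looks things up by key, and keys must be unique. Translation starts from a codon, and several codons can encode the same amino acid, so codons are the natural keys. The entries below are a hand-picked subset of the standard genetic code with one-letter amino acid codes, not the contents of `codon_table.txt`.

```python
# Codons as keys: every codon is unique, and lookup goes codon -> amino acid.
codon2aa = {"GCT": "A", "GCC": "A", "ATG": "M"}
print(codon2aa["GCC"])

# Amino acids as keys: GCT and GCC both map to "A", so the second overwrites the first.
aa2codon = {}
for codon, aa in [("GCT", "A"), ("GCC", "A"), ("ATG", "M")]:
    aa2codon[aa] = codon

print(len(codon2aa))   # all three codons are kept
print(len(aa2codon))   # one entry was lost to the key collision
```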

**(B)** Read in `codon_table.txt` (note that it has a header line) and use it to create a codon dictionary. Then use `raw_input()` to prompt the user to enter a single codon (e.g. ATG) and print the amino acid corresponding to that codon to the screen.

In [None]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header

for line in ins:
    line = line.rstrip('\n')
 
    # since I know there are exactly 2 values on every line, I can use this shorthand 
    # notation to automatically "unpack" the returned list into named variables.
    (codon, aa) = line.split() 
    if codon not in codon2aa:
        codon2aa[codon] = aa
    else:
        print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
        
ins.close()

request = raw_input("Codon to translate: ").upper() #read & convert to uppercase
if request in codon2aa:
    print request, "-->", codon2aa[request]
else:
    print "Did not recognize that codon."
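
The unpacking shorthand used above only works when `split()` returns exactly as many values as there are names on the left. A small sketch of both cases follows; the line format `"ATG M"` is an assumption made for illustration, and the `try`/`except` is just one possible way to handle a malformed line.

```python
line = "ATG M\n"                              # assumed format: codon, whitespace, amino acid
(codon, aa) = line.rstrip('\n').split()       # exactly 2 values, so unpacking succeeds

bad = "ATG M start\n"                         # 3 values: unpacking into 2 names fails
try:
    (codon2, aa2) = bad.rstrip('\n').split()
    unpacked = True
except ValueError:
    unpacked = False
    print("Could not unpack: expected exactly 2 values on the line")
```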

**(C)** Now we will adapt the code in (B) to translate a longer sequence. Instead of prompting the user for a single codon, allow them to enter a longer sequence. First, check that the sequence they entered has a length that is a multiple of 3 (Hint: use the mod operator, `%`), and print an error message if it is not. If it is valid, then go on to translate every three nucleotides to an amino acid. Print the final amino acid sequence to the screen.

In [None]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header

for line in ins:
    line = line.rstrip('\n')
    (codon, aa) = line.split() 
    if codon not in codon2aa:
        codon2aa[codon] = aa
    else:
        print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
        
ins.close()

# get user input
request = raw_input("Sequence to translate (multiple of 3): ").upper()
protSeq = ""

if (len(request) % 3) == 0:
    
    # this indexing/slicing is tricky! definitely try this sort of thing out in the 
    # interpreter to make sure you get it right.
    for i in range(0,len(request),3):
        codon = request[i:i+3]
        if codon in codon2aa:
            protSeq += codon2aa[codon]
        else:
            print "Warning! Unrecognized codon (" + codon + "). Using X as a placeholder."
            protSeq += "X"
    
    print "Your protein sequence is:", protSeq

else:
    print "Please enter a sequence length that is a multiple of 3."
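
To make the stepping and slicing in the loop above concrete, here is the same pattern on a short made-up sequence. `range(0, len(seq), 3)` produces the start index of each codon, and `seq[i:i+3]` takes three characters starting there (the end index is not included).

```python
seq = "ATGGCCTAA"   # made-up 9-nucleotide sequence

starts = list(range(0, len(seq), 3))   # start index of each codon
codons = []
for i in starts:
    codons.append(seq[i:i+3])          # slices include the start index but not the end

print(starts)
print(codons)

# A slice that runs past the end of the string is silently shortened,
# which is why the length is checked with % before translating.
leftover = "ATGGC"[3:6]
print(leftover)
```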

**[ OPTIONAL ] (D)** Now, instead of taking user input, you will apply your translator to a set of sequences stored in a file. Read in the sequences from `sequences3.txt` (assume each line is a separate sequence), translate each one to amino acids, and write the results to a new file called `proteins.txt`, one protein sequence per line.

In [None]:
inFile = "codon_table.txt"
codon2aa = {}
ins = open(inFile, 'r')
ins.readline() #skip header

for line in ins:
    line = line.rstrip('\n')
    (codon, aa) = line.split() 
    if codon not in codon2aa:
        codon2aa[codon] = aa
    else:
        print "Warning! Multiple entries found for the same codon (" + codon + "). Skipping."
        
ins.close()

# read file of sequences
inFile = "sequences3.txt"
outFile = "proteins.txt"
ins = open(inFile, 'r')
outs = open(outFile, 'w')
lineNum = 1 #just used for nicer error message

for line in ins:
    line = line.rstrip('\r\n') #strip '\r' too, since a leftover '\r' would change the sequence length
    protSeq = "" #best to define this within the loop so it's re-created for each separate sequence.
    
    if (len(line) % 3) == 0:
        for i in range(0,len(line),3):
            codon = line[i:i+3]
            if codon in codon2aa:
                protSeq += codon2aa[codon]
            else:
                print "Warning! Unrecognized codon ("+codon+"). Using X as a placeholder."
                protSeq += "X"
                
        outs.write(protSeq + '\n') # write to output file
        
    else:
        print "Line "+str(lineNum)+" - Encountered sequence length that is not a multiple of 3. Skipping."
        
    lineNum += 1
    
outs.close()
ins.close()
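
Parts (B) through (D) repeat the same translation logic. One possible way to reduce the repetition is to wrap it in a function. This is only a sketch: `translate` and `miniTable` are hypothetical names, the table is a hand-picked subset of the standard genetic code (one-letter codes, `*` for stop) rather than the contents of `codon_table.txt`, and unlike the answers above it substitutes X without printing a warning.

```python
def translate(seq, table):
    """Return the protein string for seq, or None if len(seq) is not a multiple of 3."""
    if len(seq) % 3 != 0:
        return None
    protSeq = ""
    for i in range(0, len(seq), 3):
        protSeq += table.get(seq[i:i+3], "X")   # X marks an unrecognized codon
    return protSeq

# Hand-picked subset of the standard genetic code, for illustration only.
miniTable = {"ATG": "M", "GCC": "A", "TGG": "W", "TAA": "*"}

print(translate("ATGGCCTGGTAA", miniTable))   # every codon is in the table
print(translate("ATGNNN", miniTable))         # NNN is not in the table
print(translate("ATGG", miniTable))           # length 4 is not a multiple of 3
```

With a function like this, the file loop in part (D) would only need to call `translate(line, codon2aa)` and check whether the result is `None` before writing it out.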