In [1]:
#Generator Statements and Generators (Nov. 5) Submission
#Nov 24, 2020
#Zeke Van Dehy

# Generator statements practice #

Let's review some basic ways to use generator statements.

# Within a function #

One example that we've seen in class is use of a generator statement with a function that can handle an iterable as an argument.

If I have a list: 

```aminos = ["ALA","CYS","ASP","GLU","PHE","GLY","HIS","ILE","LYS","LEU","MET","ASN","PRO","GLN","ARG","SER","THR","VAL","TRP","TYR"]```

and I use .writelines(aminos) to write each item in the list to a file, I will get an output file where the individual items are all smashed together on one line.

To get around this, I can write a generator statement that uses a format statement to reformat that list on the fly as I'm writing it.

```with open("test.out","a") as fo:
    fo.writelines("{}\n".format(amino) for amino in aminos)```
    
Following this example, create some code that:

- makes the list of all possible codons
- then writes each codon to an output file on a separate line

You can use generator statements with other kinds of functions, too, if they are set up to take an iterable as input. 

In [3]:
aminos = ["ALA","CYS","ASP","GLU","PHE","GLY","HIS","ILE","LYS","LEU","MET","ASN","PRO","GLN","ARG","SER","THR","VAL","TRP","TYR"]
with open("test.out","w") as fo:
    fo.writelines("{}\n".format(amino) for amino in aminos)
    fo.writelines("\n")
    
def ref_codons():
    codons = []
    for base1 in ["A","T","G","C"]:
        for base2 in ["A","T","G","C"]:
            for base3 in ["A","T","G","C"]:
                codons.append(base1+base2+base3)
    return codons

with open("test.out","a") as fo:
    fo.writelines("{}\n".format(codon) for codon in ref_codons());

# In a list comprehension #

We've also used list comprehensions before. A list comprehension will create a new list from an existing list (or from an iterable that's behaving like a list, for instance, file contents).

A simple list comprehension has three parts. Transformation, iteration, and filter. Example:

```bytens = [i * 5 for i in range(1,20) if i % 2 == 0]```

makes the list [10, 20, 30, 40, 50, 60, 70, 80, 90]

```compdict = {"A":"T", "T":"A", "G":"C", "C":"G"}
    dnaseq = "ATCGATCGTACG"
    complement = [compdict.get(char) for char in dnaseq if char in ["A","T","G","C"]]```
    
makes the list ['T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'A', 'T', 'G', 'C']

Following the rules that anticodons are the reverse complement of codons, and that in RNA sequences (like tRNA) T is replaced with U, write some code that:

- takes your list of codons
- turns it into a list of anticodons with a comprehension
- zips the list of codons and the list of anticodons together into a dict

Hint: anticodon = codon.replace("T","U")[::-1]

In [7]:
bytens = [i * 5 for i in range(1,20) if i % 2 == 0]
print(bytens)
compdict = {"A":"T", "T":"A", "G":"C", "C":"G"}
dnaseq = "ATCGATCGTACG"
complement = [compdict.get(char) for char in dnaseq if char in ["A","T","G","C"]]
print(complement)

[10, 20, 30, 40, 50, 60, 70, 80, 90]
['T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'A', 'T', 'G', 'C']


# Another example #

Remember the weird-looking ASCII-encoded quality score strings that come along with the sequence in a FASTQ file? In lab on Thurs. we're going to make use of these to do quality-based sequence filtering.

We can convert them to quality score numbers using a built-in function called ord().

ord() retrieves the integer that is associated with a Unicode character. FASTQ encoding typically starts with Unicode character 33, so we can use the equation ord(x) - 33 to get the correct integer value for character x.

Given the Unicode string pre-populated for you in the cell below, write code that:

- converts the Unicode string to a list of integer values
- calculates an average from the list (you can do this with sum() and len()) -- the value you get should be 35.37

In [None]:
qualstring = "@@@@@@@@@FDFFFHFHHAGIJJJGGGHCGGIIJIFAFEGHDGHGE@HGGHGICHEHGHIJJHHFDBDDCCDCCBDDCDCA:>CDDEDCCACA>CA@:>:CBABBD"

# In a dict() comprehension #

We can also use a generator in a dict() comprehension. As an example of this, we can parse the file aamw.txt, which we have used before. It's made up of comment lines beginning with # followed by the data in two columns separated by whitespace.

```with open("aamw.txt") as aamw:
    aamws = {line.split()[0]:line.split()[1] for line in aamw}```

returns

```{'I': '131.1736', 'L': '131.1736', 'K': '146.1882', 'M': '149.2124', 'F': '165.1900', 'T': '119.1197', 'W': '204.2262', 'V': '117.1469', 'R': '174.2017', 'H': '155.1552', 'A': '89.0935', 'N': '132.1184', 'D': '133.1032', 'C': '121.1590', 'E': '147.1299', 'Q': '146.1451', 'G': '75.0669', 'P': '115.1310', 'S': '105.0930', 'Y': '181.1894'}```

Given the dict of codon:anticodon pairs you made before, can you write a comprehension that makes a new dictionary of anticodon:codon? Remember you can iterate over the dictionary to get tuple pairs with dict.items()

# Generators #

A generator looks like a function, but it returns its values one by one with the yield keyword, instead of only once with return.

A simple generator could be something like this:

```def addfive(num):
    for i in range(0,num):
        yield i+5```

And to call it, we would use a for keyword:

```for num in addfive(5):
    print(num)```

In [10]:
def addfive(num):
    for i in range(0,num):
        yield i+5

for num in addfive(5):
    print(num)

5
6
7
8
9


# Fibonacci 3: Son of Fibonacci #

In the cell below, take the Fibonacci function that you have used before and convert it into a generator. 

Either within the generator code or within the call, limit the number of Fibonacci numbers that can be printed 

In [14]:
def fib():
    a=0
    b=1
    while True:
        yield a
        a,b = b, a+b

for i in fib():
    print(i)
    if(i > 100):
        break

0
1
1
2
3
5
8
13
21
34
55
89
144


# Powerball 3: Son of Powerball #

In the cell below, take the lottery function you have previously made (it's reviewed in the slides) and convert it into a generator.

Call it with a loop that prints out the numbers one at a time.

In [26]:
import random

def quickpicks():
    for i in range(0,6):
        yield random.randint(1,40)
    yield random.randint(1,15)

winning = quickpicks()
for i in winning:
    if i==0:
        print("The first number is... {:3d}".format(i))
    else:
        print("The next number is...  {:3d}".format(i))

The next number is...    5
The next number is...   22
The next number is...   30
The next number is...   12
The next number is...    8
The next number is...   10
The next number is...    9


In [34]:
def read_FASTA(file):
    desc = ""
    seq = ""
    for line in file.readlines():
        if line[0] == ">": #if description
            if desc != "" and seq != "": #if not the first line
                yield desc, seq #save the last line's contents
                seq = "" #reset seq
            desc = line #(re)set desc to new description
        else:
            seq += line #append this line to seq (if first line of seq, will append to "")
    return dict(zip(desc,seq)) #return list of tuples

with open("mockfaa.txt") as fo:
    seqs = read_FASTA(fo)
    for (d, s) in seqs:
        print(d, s)

>gi|544163660|ref|YP_008563067.1| ribosomal protein S12 (chloroplast) [Solanum lycopersicum]
 MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGI
GHNSQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK

>gi|544163593|ref|YP_008563068.1| photosystem II protein D1 (chloroplast) [Solanum lycopersicum]
 MTAILERRESESLWGRFCNWITSTENRLYIGWFGVLMIPTLLTATSVFIIAFIAAPPVDIDGIREPVSGS
LLYGNNIISGAIIPTSAAIGLHFYPIWEAASVDEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMR
PWIAVAYSAPVAAATAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSL
FSAMHGSLVTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPV
VGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHERNAHNFPLDLAAIEAPS
TNG

>gi|544163594|ref|YP_008563069.1| maturase K (chloroplast) [Solanum lycopersicum]
 MEEIHRYLQPDSSQQHNFLYPLIFQEYIYALAQDHGLNRNRSILLENSGYNNKFSFLIVKRLITRMDQQN
HLIISTNDSNKNPFLGCNKSLYSQMISEGFACIVEIPFSIRLISSLSSFEGKKIFKSHNLRSIHSTFPFL
EDNFSHLNYVLDILIPYPVHLEILVQTLRYWVKDASSLHLLRFFLHEYCNLNSLITSKKPGYSFSKKNQR
FFFFLYNSYVYECESTFVFLRNQSSH

# FastQ Parser as Generator #

We have worked with the FASTQ parser a couple of times before, but here is a reminder of the way you step through the groups of four lines in a fastq file and assign them to the variables you want:

```with open("sequences.fastq") as fastq:
	fqlist = []
	i = 1
	for line in fastq:
		if i % 4 == 1:
			fqkey = line
			i = i+1
		elif i % 4 == 2:
			fqseq = line
			i = i+1
		elif i % 4 == 3:
			i = i+1
		elif i % 4 == 0:
			fqqual = line
			fqlist.append((fqkey,fqseq,fqqual))
			fqseq, fqqual, fqkey = None, None, None
			i = i+1```
            
Reformulate this code to turn it into a generator that yields name, sequence and quality values instead of appending them to a list.

In [40]:
def fastq(filename):
    with open(filename) as fastq:
        i = 1
        for line in fastq:
            if i % 4 == 1:
                fqkey = line
                i = i+1
            elif i % 4 == 2:
                fqseq = line
                i = i+1
            elif i % 4 == 3:
                i = i+1
            elif i % 4 == 0:
                fqqual = line
                yield (fqkey,fqseq,fqqual)
                fqseq, fqqual, fqkey = None, None, None
                i = i+1


for (key,seq,qual) in fastq("sequences.fastq"):
    print(key,seq)

@SRR1391072.1 HISEQ:120:D1GR1ACXX:5:1101:1563:1989 length=100
 ACATGTTGCCGTNATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACGNNNNCATNAACGAGGNNNNNNAANANNNNNNACGACAA

@SRR1391072.2 HISEQ:120:D1GR1ACXX:5:1101:2058:1997 length=100
 CTTGTACGGCGATTTAAGCAATGAACTGTGTGACACCATTGGTGAGTACCAAGTGGATTTCCGGGTGTGNNNCCATCACCAAGATNTTTGGAGNANNCTGCTCTCT

@SRR1391072.3 HISEQ:120:D1GR1ACXX:5:1101:2679:1989 length=100
 GCCAATAATCAGNTGCGGCTCTTGCGGTCAATGTCGGGCATTTTGATGATCCTATCGAACGAGAAGGTNNNNCCCNTTATCTTNNNNNNATNCNNNNNNTGGGGAC

@SRR1391072.4 HISEQ:120:D1GR1ACXX:5:1101:4135:1999 length=100
 GATCAGAGGATGGTTACTACTAAGGAAGGCAATGGACACCTCTGGATGAGGCAAGGACTGAACATCAGGNNNATGTCAGGGCCACNGCTCAGGNANNAAGTGATGT

@SRR1391072.5 HISEQ:120:D1GR1ACXX:5:1101:4181:1999 length=100
 TAGCTTTACCTACTGGGCCCTGTGGGCGGTGGTAAATCGTCGCTGGCAGAAAAGCTCAAAGCATTAATGNNNCAAATGCCGATTTNCGTGCTCNCNNCCAACGGGA

@SRR1391072.6 HISEQ:120:D1GR1ACXX:5:1101:4788:1996 length=100
 TGACCAACATCCAGTAACAGTTGGACACTGTATACATTTGGCGAGAACCCACGGAAGGCAGATAACTTANNNAAGNGTTTTT

# Call the FastQ generator #

- Remember that you need to call it with an iterator
- Remember that if you're returning three variables name,seq,qual, you will need three variables in the call or otherwise you'll get a tuple.