In [None]:
# Lists, Matrices, Tuples -- In Class Oct 10
# Zeke Van Dehy
# Oct 27, 2020

# Getting data into a structure #

We've already covered how to make lists many times.

We've also seen tuples as the return values of functions, or as the outcome of zipping together individual lists. Let's try a few more operations of this kind with a little file of amino acid molecular weights, aamw.txt.

aamw.txt has 22 lines. The top two lines are comments. It is formatted in two columns separated by whitespace. I included it for you on the Canvas -- open it up and see.

For this file, what would we need to do to read the lines and return the values as two lists? Put your code in a function, and have it return the two lists. Show how you'd call the function to get the two lists as separate objects.

In [10]:
def myFunction():
    weights = []
    aas = []
    with open("aamw.txt") as molweights:
        for line in molweights.readlines():
            if(line[0] != "#"):
                mw = line.split()
                aas.append(mw[0])
                weights.append(float(mw[1]))
    return aas, weights

aas, weights = myFunction()
for i, weight in enumerate(weights):
    print(aas[i],weight)
    


I 131.1736
L 131.1736
K 146.1882
M 149.2124
F 165.19
T 119.1197
W 204.2262
V 117.1469
R 174.2017
H 155.1552
A 89.0935
N 132.1184
D 133.1032
C 121.159
E 147.1299
Q 146.1451
G 75.0669
P 115.131
S 105.093
Y 181.1894


# Getting it back out again #

Now, how would we look up the molecular weight of an amino acid in the second list, based on the position of its code in the first list? Use .index(). We first learned this on Oct. 3.

In [11]:
def getWeight(aa):
    aas, weights = myFunction()
    return weights[aas.index(aa)]
getWeight("W")

204.2262

# What if we zip the lists together? #

Now create some code that zips the two lists you made above together into a list of tuples. Put your code in a function, and have it return the list of tuples.

The list of tuples should look like this:

```[('I', '131.1736'), ('L', '131.1736'), ('K', '146.1882'), ('M', '149.2124'), ('F', '165.1900'), ('T', '119.1197'), ('W', '204.2262'), ('V', '117.1469'), ('R', '174.2017'), ('H', '155.1552'), ('A', '89.0935'), ('N', '132.1184'), ('D', '133.1032'), ('C', '121.1590'), ('E', '147.1299'), ('Q', '146.1451'), ('G', '75.0669'), ('P', '115.1310'), ('S', '105.0930'), ('Y', '181.1894')]```

In [14]:
def myTuplesFunction():
    aas, weights = myFunction()
    return list(zip(aas, weights))
print(myTuplesFunction())
    

[('I', 131.1736), ('L', 131.1736), ('K', 146.1882), ('M', 149.2124), ('F', 165.19), ('T', 119.1197), ('W', 204.2262), ('V', 117.1469), ('R', 174.2017), ('H', 155.1552), ('A', 89.0935), ('N', 132.1184), ('D', 133.1032), ('C', 121.159), ('E', 147.1299), ('Q', 146.1451), ('G', 75.0669), ('P', 115.131), ('S', 105.093), ('Y', 181.1894)]


# How do we get values back out of a list of tuples? #

To get an individual value back out of a list of tuples, we have to write a condition that looks at one specific position inside each tuple. We can't .index() on just one position of a tuple, so we have to do this another way.

In the amino acid molecular weight list of tuples above, I could get the index of a specific element (tuple) in the list, based on whether it matches a particular amino acid code, by writing:

```next(i for i,aa in enumerate(molweights) if aa[0] == "C")```

# Generator expressions and the next() function #

Notice that this expression looks almost exactly like a list comprehension, which we saw last week. But it's called a "generator expression": the same syntax as a list comprehension, written without the [], and here passed to the built-in *next()* function, which pulls the first value out of the generator. We can use these when we don't want to store another list in memory and/or we don't need to use a list method or slice operator on the result.

This recipe will work to look up individual values as long as there are not multiple pairs that have "C" in position [0] in the tuple. (This is the same as having to have a unique key in a dictionary for those who already know what that is).
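
A minimal sketch of that behavior, using a small made-up list of pairs rather than the class file (the names `pairs`, `first_match`, and `missing` are illustrative, and the second "C" entry is invented to show the duplicate case). It suggests that next() stops at the first match, and that passing a second argument to next() gives a fallback value instead of a StopIteration error when nothing matches.

```python
# Hypothetical sample data: "C" appears twice on purpose, with a fake second weight.
pairs = [("A", 89.0935), ("C", 121.159), ("G", 75.0669), ("C", 999.0)]

# next() pulls one value from the generator and stops, so the first "C" wins.
first_match = next(weight for code, weight in pairs if code == "C")
print(first_match)

# With no match, next() raises StopIteration unless a default is supplied.
# The generator expression needs its own parentheses when a default is passed.
missing = next((weight for code, weight in pairs if code == "X"), None)
print(missing)
```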

How would I get the *molecular weight* value that corresponds to a particular amino acid here? Try your code and make sure that it works (by checking the file to make sure you get the right value). We'll use this statement in the next problem, so if you can't get it to work, ask.

In [27]:
molweights = myTuplesFunction()
# print(molweights)

# aa = input("Which amino acid? ")
aa = "C"


print(next(tup[1] for i, tup in enumerate(molweights) if tup[0] == aa))


121.159


# Use this in a calculation #

Say I have the list of molecular weight tuples that we created above, and I have a protein sequence string, "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN" (our old friend human insulin).

For each character in this sequence, get the value of the molecular weight that corresponds to the character. Append it to a list. Sum the list to get the total molecular weight of the protein.

In [43]:
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
molweights = myTuplesFunction()

def getWeight(sequence):
    allWeights = []
    for p in sequence:
        weight = next(tup[1] for i, tup in enumerate(molweights) if tup[0] == p)
        allWeights.append(weight)
    return sum(allWeights)


print(getWeight(protein))

13944.519099999992


# Use this with map() #

Now, let's scale it up. The file mockfaa.txt, which you have seen before, has multiple proteins in it. We can get them out and into a list, using the fasta parser that we've seen before in class. Make the molecular weight calculator into a function and use map() to get a result for each of the sequences in the mockfaa.txt file.
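
As a rough sketch of the map() pattern, here is a version that uses a made-up three-entry weight table and made-up short sequences in place of the real files (`toy_weights`, `toy_sequences`, and `total_weight` are illustrative names, not part of the class code). In Python 3, map() returns a lazy iterator, so it is wrapped in list() to collect the results.

```python
# Hypothetical stand-ins for the aamw.txt tuples and the mockfaa.txt sequences.
toy_weights = [("G", 75.0669), ("A", 89.0935), ("S", 105.093)]
toy_sequences = ["GA", "SAG", "GGG"]

def total_weight(sequence):
    # Look up the weight of each residue and add the weights together.
    return sum(next(w for code, w in toy_weights if code == aa) for aa in sequence)

# map() applies the function to every sequence; list() collects the results.
totals = list(map(total_weight, toy_sequences))
for seq, total in zip(toy_sequences, totals):
    print(seq, round(total, 4))
```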

*DO WE NEED TO GO OVER THE FASTA PARSER?*

In [45]:
def read_FASTA(file):
    contents = [] #list of tuples (desc,seq)
    desc = ""
    seq = ""
    for line in file.readlines():
        if line[0] == ">": #if description
            if desc != "" and seq != "": #if not the first line
                contents.append((desc,seq.replace("\n",""))) #save the last line's contents
                seq = "" #reset seq
            desc = line #(re)set desc to new description
        else:
            seq += line #append this line to seq (if first line of seq, will append to "")
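    if desc != "" and seq != "": #save the final record; no later ">" line will trigger it
        contents.append((desc,seq.replace("\n","")))
    return contents #return list of tuples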

with open("InClass-Files-Oct20/mockfaa.txt") as mockfaaFile:
    for (desc, sequence) in read_FASTA(mockfaaFile):
        print(desc)
        print(getWeight(sequence))
        print("-----")
    

>gi|544163660|ref|YP_008563067.1| ribosomal protein S12 (chloroplast) [Solanum lycopersicum]

15935.872599999988
-----
>gi|544163593|ref|YP_008563068.1| photosystem II protein D1 (chloroplast) [Solanum lycopersicum]

45291.684000000045
-----
>gi|544163594|ref|YP_008563069.1| maturase K (chloroplast) [Solanum lycopersicum]

69127.50150000004
-----
>gi|544163595|ref|YP_008563070.1| ribosomal protein S16 (chloroplast) [Solanum lycopersicum]

11947.497799999997
-----
>gi|544163596|ref|YP_008563071.1| photosystem II protein K (chloroplast) [Solanum lycopersicum]

8011.152799999998
-----
>gi|544163597|ref|YP_008563072.1| photosystem II protein I (chloroplast) [Solanum lycopersicum]

4798.424099999999
-----
>gi|544163598|ref|YP_008563073.1| ATP synthase CF1 alpha subunit (chloroplast) [Solanum lycopersicum]

64526.667400000144
-----
>gi|544163599|ref|YP_008563074.1| ATP synthase CF0 subunit I (chloroplast) [Solanum lycopersicum]

24168.386099999985
-----
>gi|544163600|ref|YP_008563075.1| ATP 

# Fastq sorting -- basic -- values into lists #

Below we've got an example of the simple fastq code. This one isn't a function or a generator or anything fancy. It's a simple loop inside a with block. Think about this code and do two things with it.

1) Change the code so that it uses enumerate() and still gets the correct lines into the correct variables
2) Fill in the blank at the bottom to get the values into lists.

In [7]:
import pprint
pp = pprint.PrettyPrinter()
with open("sequences.fastq") as fastq:
    fqnames = []
    fqseqs = []
    fqquals = []
    for i,line in enumerate(fastq.readlines()): #enumerate supplies i, so no manual i = i+1
        if i % 4 == 0: #name line
            fqname = line
            fqnames.append(fqname)
        elif i % 4 == 1: #sequence line
            fqseq = line
            fqseqs.append(fqseq)
        elif i % 4 == 2:
            pass #ignore +
        elif i % 4 == 3: #quality line
            fqqual = line
            fqquals.append(fqqual)
    pp.pprint(fqnames[:10])
    pp.pprint(fqseqs[:10])
    pp.pprint(fqquals[:10])

['@SRR1391072.1 HISEQ:120:D1GR1ACXX:5:1101:1563:1989 length=100\n',
 '@SRR1391072.2 HISEQ:120:D1GR1ACXX:5:1101:2058:1997 length=100\n',
 '@SRR1391072.3 HISEQ:120:D1GR1ACXX:5:1101:2679:1989 length=100\n',
 '@SRR1391072.4 HISEQ:120:D1GR1ACXX:5:1101:4135:1999 length=100\n',
 '@SRR1391072.5 HISEQ:120:D1GR1ACXX:5:1101:4181:1999 length=100\n',
 '@SRR1391072.6 HISEQ:120:D1GR1ACXX:5:1101:4788:1996 length=100\n',
 '@SRR1391072.7 HISEQ:120:D1GR1ACXX:5:1101:5911:1991 length=100\n',
 '@SRR1391072.8 HISEQ:120:D1GR1ACXX:5:1101:6379:1991 length=100\n',
 '@SRR1391072.9 HISEQ:120:D1GR1ACXX:5:1101:6269:1993 length=100\n',
 '@SRR1391072.10 HISEQ:120:D1GR1ACXX:5:1101:6745:1989 length=100\n']
['ACATGTTGCCGTNATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACGNNNNCATNAACGAGGNNNNNNAANANNNNNNACGACAA\n',
 'CTTGTACGGCGATTTAAGCAATGAACTGTGTGACACCATTGGTGAGTACCAAGTGGATTTCCGGGTGTGNNNCCATCACCAAGATNTTTGGAGNANNCTGCTCTCT\n',
 'GCCAATAATCAGNTGCGGCTCTTGCGGTCAATGTCGGGCATTTTGATGATCCTATCGAACGAGAAGGTNNNNCCCNTTATCTTNNNNNNAT

# Less basic -- do it in a function, and return a list of 3 part tuples #

What do we really want out of this file?

1) We want the comment string, sequence, and quality scores.
2) We want the three items that pertain to each unique sequence to be linked together.
3) In the Fastq Sorting problem, we want to be able to search based on whether there is a known barcode tag in the first 6 characters of the sequence.

We solved the first part in the cell above. Solve the next part by changing the code. Make the existing code into a function that returns a list of three-part tuples.
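
One possible shape for the function version, sketched against a tiny in-memory file so it runs without sequences.fastq. The function name `read_fastq` and the two fake records are assumptions made for illustration. Unlike the cells in this notebook, this sketch strips the trailing newline from each line, so len() on a sequence counts only the bases.

```python
import io

def read_fastq(handle):
    """Return a list of (name, sequence, quality) tuples from an open fastq handle."""
    records = []
    name = seq = ""
    for i, line in enumerate(handle):
        line = line.rstrip("\n")
        if i % 4 == 0:
            name = line
        elif i % 4 == 1:
            seq = line
        elif i % 4 == 3:
            # The quality line completes the record; the "+" line (i % 4 == 2) is skipped.
            records.append((name, seq, line))
    return records

# Two made-up records in an in-memory file, standing in for sequences.fastq.
fake_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n")
records = read_fastq(fake_fastq)
print(records)
```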

In [12]:
with open("sequences.fastq") as fastq:
    threeParts = [] #list of tuples (fqname, fqseq, fqqual)
    fqname = ""
    fqseq = ""
    fqqual = ""
    for i,line in enumerate(fastq.readlines()): #enumerate supplies i, so no manual i = i+1
        if i % 4 == 0:
            fqname = line
        elif i % 4 == 1:
            fqseq = line
        elif i % 4 == 2:
            pass #ignore +
        elif i % 4 == 3:
            fqqual = line
            threeParts.append((fqname, fqseq, fqqual))
    pp.pprint(threeParts[:10])
        

[('@SRR1391072.1 HISEQ:120:D1GR1ACXX:5:1101:1563:1989 length=100\n',
  'ACATGTTGCCGTNATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACGNNNNCATNAACGAGGNNNNNNAANANNNNNNACGACAA\n',
  'BBBBBBBBCFFD#2ADHHHJHHGIJJIJIJJJIJJJJJJJEGGIHJJJIIIJJJJJIIJJJJJJJJJJ####,,;#,;?ABBB#######################\n'),
 ('@SRR1391072.2 HISEQ:120:D1GR1ACXX:5:1101:2058:1997 length=100\n',
  'CTTGTACGGCGATTTAAGCAATGAACTGTGTGACACCATTGGTGAGTACCAAGTGGATTTCCGGGTGTGNNNCCATCACCAAGATNTTTGGAGNANNCTGCTCTCT\n',
  '@@@@@@@@@DDADDBH?CDB?<FFDBED?H???C?F1?DFA>F;<??DD@B9B*??B==B.B(5;;>EE#####################################\n'),
 ('@SRR1391072.3 HISEQ:120:D1GR1ACXX:5:1101:2679:1989 length=100\n',
  'GCCAATAATCAGNTGCGGCTCTTGCGGTCAATGTCGGGCATTTTGATGATCCTATCGAACGAGAAGGTNNNNCCCNTTATCTTNNNNNNATNCNNNNNNTGGGGAC\n',
  'CCCCCCCCCFFF#2CFHHHJJJJJJJJFIJJJFHIJJJJJJJJJGHIIJIJIJJJJJHHHFFFDDEDC######################################\n'),
 ('@SRR1391072.4 HISEQ:120:D1GR1ACXX:5:1101:4135:1999 length=100\n',
  'GATCAGAGGATGGTTACTACTAAGGAAGGCAATG

# Now, search the tuples for the sequences you want #

Our goal in the first version of the FastqSorting exercise was just to find out:

1) how often did each barcode tag occur
2) in sequences that were long enough (>= 80)

One way to go about this would be to make a list of matches for each barcode, and then just get the length of each list with len. Because you want to make a list and use len() rather than get a value on the fly, in this case you'll use a comprehension placed in square brackets [] instead of a generator expression inside a next(). Otherwise, the statement syntax is the same. Think about:

- What are you iterating over? (the iteration)
- What do you want to get back from what you're iterating over? (the transformation)
- When do you want to get that item or items back? (the filter)
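
The three questions above map onto the three parts of a comprehension. Here is a small sketch with made-up reads and a made-up length cutoff (`reads`, `min_length`, and `matches` are illustrative names; the exercise itself uses a cutoff of 80 and the real fastq tuples):

```python
# Hypothetical reads as (name, sequence) pairs; real reads would come from the fastq file.
reads = [("r1", "ATCACGTTTT"), ("r2", "ATCACGTT"), ("r3", "CGATGTAAAA"), ("r4", "ATCACGAAAAAA")]
min_length = 10  # illustrative cutoff only

matches = [
    seq                       # the transformation: what to keep
    for name, seq in reads    # the iteration: what to loop over
    if len(seq) >= min_length and seq.startswith("ATCACG")  # the filter
]
print(matches)
print(len(matches))
```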

In [25]:
barcodes = ['ATCACG','CGATGT','TTAGGC','TGACCA','ACATGT','GCCAAT','CAGATC','ACTTGA','GATCAG','TAGCTT','GGCTAG','CTTGTA']
#threeParts is the list of tuples found in the fastq file from the previous cell
bCounts = []
for barcode in barcodes:
    desired = [(i, parts[1]) for i, parts in enumerate(threeParts) if len(parts[1]) >= 80 and parts[1].startswith(barcode)]
    count = len(desired)
    bCounts.append(count)
    #I am not sure why, but I get one more "TTAGGC" than I do in the previous lab...
    #(possibly because parts[1] still ends in "\n", so len() counts one extra character)
#     if barcode == "TTAGGC":
#         pp.pprint(desired)
print(barcodes)
print(bCounts)

['ATCACG', 'CGATGT', 'TTAGGC', 'TGACCA', 'ACATGT', 'GCCAAT', 'CAGATC', 'ACTTGA', 'GATCAG', 'TAGCTT', 'GGCTAG', 'CTTGTA']
[81, 73, 85, 78, 71, 83, 55, 64, 76, 75, 81, 84]


# Matrices are lists of lists #

What could we do if we wanted to associate a list of sequence entries with each specific barcode?

- we could make a list of tuples
- we could make a list of lists

Let's look at lists of lists. In practical terms, they're not much different to use than lists of tuples.

The simplest list of lists is matrix = [[]], an outer list holding a single empty inner list.

A comprehension that would populate this matrix is:

```[[i + "," + j for i in ["A","a"] for j in ["A","a"]]]```
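
For comparison, here is a sketch of two ways to nest the same pieces. With both for clauses inside one pair of inner brackets, the result is a single inner list of four entries. Putting a complete comprehension inside another gives one inner list per row, which is the shape usually meant by a matrix. The variable names are illustrative.

```python
alleles = ["A", "a"]

# Both for clauses inside one inner list: a single row of four entries.
one_row = [[i + "," + j for i in alleles for j in alleles]]
print(one_row)

# A comprehension nested inside a comprehension: one inner list per value of i.
grid = [[i + "," + j for j in alleles] for i in alleles]
print(grid)

# Row 1, column 0 of the 2-by-2 grid.
print(grid[1][0])
```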

How would you make a multiplication table using a comprehension and the range() function?        

In [37]:
example = [[i + "," + j] for i in ["A","a"] for j in ["A","a"]]
print(example)
mTable = [[i*j for i in range(1,10)] for j in range(1,10)]
pp.pprint(mTable)

[['A,A'], ['A,a'], ['a,A'], ['a,a']]
[[1, 2, 3, 4, 5, 6, 7, 8, 9],
 [2, 4, 6, 8, 10, 12, 14, 16, 18],
 [3, 6, 9, 12, 15, 18, 21, 24, 27],
 [4, 8, 12, 16, 20, 24, 28, 32, 36],
 [5, 10, 15, 20, 25, 30, 35, 40, 45],
 [6, 12, 18, 24, 30, 36, 42, 48, 54],
 [7, 14, 21, 28, 35, 42, 49, 56, 63],
 [8, 16, 24, 32, 40, 48, 56, 64, 72],
 [9, 18, 27, 36, 45, 54, 63, 72, 81]]


In [None]:
import random
i = random.randint(0, 99999)
print(f'{i:05d}')

# Retrieval from Matrices #

In the first cell below, set up a comprehension that would populate a 5*5 matrix with random integers.

Now, to get a cell out of a matrix, you simply call mat[i][j]. If you have a pre-built matrix (say, populated by zeroes) you can also use this syntax to set or change a value.
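
A short sketch of getting and setting cells in a zero-filled matrix (the names `size`, `mat`, and `shared` are illustrative). It also shows a common pitfall: building the matrix by multiplying a row repeats one row object rather than making independent rows.

```python
size = 5  # illustrative dimension

# Build each row separately so the rows are independent lists.
mat = [[0 for _ in range(size)] for _ in range(size)]

mat[2][3] = 7          # set the cell at row 2, column 3
print(mat[2][3])       # get the same cell back
print(mat[3][2])       # a different cell is still 0

# Pitfall: [[0] * size] * size repeats ONE row object, so a change shows up in every row.
shared = [[0] * size] * size
shared[0][0] = 1
print(shared[4][0])
```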

In the second cell, write a nested loop that takes your 5*5 random matrix, and retrieves each value. Print out the i and j index along with the value held at that position.

We'll do more with matrices next week.