In [1]:
#In Class Oct. 29 Submission
#Nov 17, 2020
#Zeke Van Dehy

# PDB Structure Files #

This week we're going to work with protein structure files. The standard format is the PDB file, which we'll go over in detail in class. 

Skills practiced:
- opening files
- loops
- filtering data with conditionals
- slicing
- making lists
- using a module (statistics)
- math operations
- writing to files

# A Simple Filter #

The two types of records in a PDB file that contain atoms are ATOM and HETATM records. The first thing I want you to do is open the file, read in the lines, and write a loop that searches for only ATOM and HETATM records. 

If a line passes your filter, then write it to a new file. This means that you will need to define a new file object that is empty, open it for appending, and use .write() to write each matching line to the file. That will look something like this.

Of course you have to think about where the parts of this block go relative to the whole loop.

```
with open("out.pdb","a") as fo:
    fo.write(line)
```
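
One possible arrangement of the filter and the write step is sketched below. The function name `write_atom_records` and the small demo file are made up for illustration. The sketch opens the output in `"w"` mode rather than `"a"`, so re-running the cell does not duplicate lines; that is a design choice, not a requirement of the exercise.

```python
import os
import tempfile

def write_atom_records(in_path, out_path):
    """Copy ATOM and HETATM lines from in_path to out_path; return how many were written."""
    written = 0
    # Open both files once, outside the loop; "w" starts from an empty output file.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            # The record name occupies the first six columns, padded with spaces.
            if line[0:6].strip() in ("ATOM", "HETATM"):
                fout.write(line)
                written += 1
    return written

# Tiny made-up input, used only to exercise the function.
demo_lines = [
    "HEADER    DEMO\n",
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N\n",
    "HETATM    2  O   HOH A 101       1.000   2.000   3.000  1.00  0.00           O\n",
    "END\n",
]
tmp_dir = tempfile.mkdtemp()
demo_in = os.path.join(tmp_dir, "demo.pdb")
demo_out = os.path.join(tmp_dir, "out.pdb")
with open(demo_in, "w") as fh:
    fh.writelines(demo_lines)

n_written = write_atom_records(demo_in, demo_out)
print(n_written)
```

For the class file, the call would presumably be `write_atom_records("1nap.pdb", "out.pdb")`.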

In [4]:
with open("1nap.pdb") as fo:
    count = 0
    for line in fo.readlines():
        row = [
            line[0:6].strip(), #record name
            line[6:11].strip(), #atom serial number
            line[12:16].strip(), #atom name
            line[17] #first letter of the residue name (the alternate location indicator is line[16])
        ]
        if count < 10 and (row[0] == "ATOM" or row[0] == "HETATM"):
            print(row)
            count+=1

['ATOM', '1', 'N', 'A']
['ATOM', '2', 'CA', 'A']
['ATOM', '3', 'C', 'A']
['ATOM', '4', 'O', 'A']
['ATOM', '5', 'CB', 'A']
['ATOM', '6', 'N', 'G']
['ATOM', '7', 'CA', 'G']
['ATOM', '8', 'C', 'G']
['ATOM', '9', 'O', 'G']
['ATOM', '10', 'CB', 'G']


# A slightly less simple filter #

Now we would like to get the amino acid sequence out of the PDB file. This is stored in lines that start with SEQRES. The only problem is that it is stored in the three-letter amino acid code. 

- Use a filter to get these lines. 
- Add the split line contents to a list, then filter the list (so you're not trying to translate SEQRES)
- Use a dictionary to translate the three letter code to a one-letter sequence string. Where could we find the information to make this code if we don't have the amino acids memorized?
- Let's try to use comprehension statements to do this efficiently.

```
threetoone = {"ALA":"A","CYS":"C","ASP":"D","GLU":"E","PHE":"F","GLY":"G","HIS":"H","ILE":"I","LYS":"K","LEU":"L","MET":"M","ASN":"N","PRO":"P","GLN":"Q","ARG":"R","SER":"S","THR":"T","VAL":"V","TRP":"W","TYR":"Y"}
```

In [5]:
threetoone = {"ALA":"A","CYS":"C","ASP":"D","GLU":"E","PHE":"F","GLY":"G","HIS":"H","ILE":"I","LYS":"K","LEU":"L","MET":"M","ASN":"N","PRO":"P","GLN":"Q","ARG":"R","SER":"S","THR":"T","VAL":"V","TRP":"W","TYR":"Y"}
with open("1nap.pdb") as fo:
    for line in fo.readlines():
        header = line[0:6]
        if header != "SEQRES":
            continue
        aas = line[19:].split() #residue names start at column 20; split() ignores repeated spaces
        ones = [threetoone[aa] for aa in aas]
        print(line[:19], "".join(ones))

SEQRES   1 A   70   AELRCLCIKTTSG
SEQRES   2 A   70   IHPKNIQSLEVIG
SEQRES   3 A   70   KGTHCNQVEVIAT
SEQRES   4 A   70   LKDGRKICLDPDA
SEQRES   5 A   70   PRIKKIVQKKLAG
SEQRES   6 A   70   DESAD
SEQRES   1 B   70   AELRCLCIKTTSG
SEQRES   2 B   70   IHPKNIQSLEVIG
SEQRES   3 B   70   KGTHCNQVEVIAT
SEQRES   4 B   70   LKDGRKICLDPDA
SEQRES   5 B   70   PRIKKIVQKKLAG
SEQRES   6 B   70   DESAD
SEQRES   1 C   70   AELRCLCIKTTSG
SEQRES   2 C   70   IHPKNIQSLEVIG
SEQRES   3 C   70   KGTHCNQVEVIAT
SEQRES   4 C   70   LKDGRKICLDPDA
SEQRES   5 C   70   PRIKKIVQKKLAG
SEQRES   6 C   70   DESAD
SEQRES   1 D   70   AELRCLCIKTTSG
SEQRES   2 D   70   IHPKNIQSLEVIG
SEQRES   3 D   70   KGTHCNQVEVIAT
SEQRES   4 D   70   LKDGRKICLDPDA
SEQRES   5 D   70   PRIKKIVQKKLAG
SEQRES   6 D   70   DESAD
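
The per-line output above can also be collected into one sequence string per chain. The sketch below is one way to do that, assuming the chain ID sits in column 12 of each SEQRES line (index 11) as in the PDB format. The function name `chain_sequences`, the demo lines, and the use of `"X"` for residues missing from the dictionary are assumptions made for illustration.

```python
threetoone = {"ALA":"A","CYS":"C","ASP":"D","GLU":"E","PHE":"F","GLY":"G","HIS":"H",
              "ILE":"I","LYS":"K","LEU":"L","MET":"M","ASN":"N","PRO":"P","GLN":"Q",
              "ARG":"R","SER":"S","THR":"T","VAL":"V","TRP":"W","TYR":"Y"}

def chain_sequences(lines):
    """Return {chain ID: one-letter sequence} built from SEQRES lines."""
    seqs = {}
    for line in lines:
        if line[0:6] != "SEQRES":
            continue
        chain_id = line[11]          # column 12 holds the chain identifier
        residues = line[19:].split() # three-letter codes start at column 20
        # Unknown residues become "X" (an assumption, not part of the exercise).
        letters = "".join(threetoone.get(res, "X") for res in residues)
        seqs[chain_id] = seqs.get(chain_id, "") + letters
    return seqs

# Made-up SEQRES lines, only to show the behaviour.
demo_seqres = [
    "SEQRES   1 A    5  ALA GLU LEU",
    "SEQRES   2 A    5  ARG CYS",
    "SEQRES   1 B    2  GLY MSE",
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
]
sequences = chain_sequences(demo_seqres)
print(sequences)
```

Using `.get(res, "X")` instead of `threetoone[res]` avoids a `KeyError` when a file contains a modified residue such as MSE.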


# Dealing with legacy lines #

PDB files are a legacy format. They were developed back in the dark days when FORTRAN ruled the land and everyone was working with formatted 80 character lines.

As such, sometimes you don't get the fields you expect when you split PDB lines on whitespace.

Try splitting all the lines in the example file 2dc3orig.pdb that is included for today.

- How many fields do you expect, based on the description of the ATOM and HETATM lines shown in the slides?
- Do you actually get that many fields from all of the lines in the files? Test this by trying to print line.split()[#] where # is the last field you think you should get.
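
To see why whitespace splitting can fail, here is a small sketch that builds two fixed-width ATOM lines and compares `split()` with column slices. The helper `make_atom_line` and the coordinate values are invented for this demonstration; they are not taken from 2dc3orig.pdb, and the element column is left out to keep the helper short.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width ATOM line (hypothetical helper, for this demo only)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )

# Small coordinates leave spaces between the 8-character coordinate fields.
roomy = make_atom_line(1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504)
# Large negative coordinates fill their fields completely, so they run together.
crowded = make_atom_line(2, "CA", "ALA", "A", 1, 1.5, -100.123, -200.456)

print(len(roomy.split()), roomy.split())
print(len(crowded.split()), crowded.split())

# Slicing by column still recovers each coordinate from the crowded line.
coords = (float(crowded[30:38]), float(crowded[38:46]), float(crowded[46:54]))
print(coords)
```

The second line yields fewer fields because the three coordinates fuse into a single token, while the slices are unaffected.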

# Using slices to split ATOM lines #

Fortunately, we have another way to split the lines -- we can use slices defined based on the standard for the PDB format.

Review the format of the PDB ATOM lines shown in the powerpoint. Can you make a loop that reads each of these lines and puts the required slice of the line into the appropriate list? This is another one of those problems where parallel lists will work for your purpose and you don't exactly need a fancier data structure, but you could also do this with a list of lists or tuples if you wanted to. 

Here are the names that I used to define my empty lists to start this loop. There need to be 12.

types,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element = [],[],[],[],[],[],[],[],[],[],[],[]

In [35]:
types,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element = [],[],[],[],[],[],[],[],[],[],[],[]

with open("1nap.pdb") as fo:
    for line in fo.readlines():
        if line[0:6].strip() == "ATOM":
            types.append(line[0:6].strip()) #record name
            number.append(line[6:11].strip()) #atom serial number
            atom.append(line[12:16].strip()) #atom name
            amino.append(line[17]) #first letter of the residue name (the alternate location indicator is line[16])
            residue.append(line[17:20].strip()) #residue name
            chain.append(line[21]) #chain identifier

            #x,y,z coords
            xs.append(float(line[30:38]))
            ys.append(float(line[38:46]))
            zs.append(float(line[46:54]))
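
The prompt mentions that a list of tuples would also work. A minimal sketch of that alternative follows, using `collections.namedtuple` so each field has a name. The `Atom` type, the `parse_atom_line` function, and the demo line are illustrative assumptions; the slice positions follow the standard ATOM column layout.

```python
from collections import namedtuple

Atom = namedtuple("Atom", "record serial name res_name chain res_seq x y z")

def parse_atom_line(line):
    """Slice one ATOM/HETATM line into a named tuple of typed fields."""
    return Atom(
        record=line[0:6].strip(),     # columns 1-6: record name
        serial=int(line[6:11]),       # columns 7-11: atom serial number
        name=line[12:16].strip(),     # columns 13-16: atom name
        res_name=line[17:20].strip(), # columns 18-20: residue name
        chain=line[21],               # column 22: chain identifier
        res_seq=int(line[22:26]),     # columns 23-26: residue sequence number
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
    )

# Demo line assembled field by field so the column widths are explicit.
demo = (
    "ATOM  "    # record name (6)
    "    1"     # serial (5)
    " "         # blank
    " N  "      # atom name (4)
    " "         # alternate location indicator
    "ALA"       # residue name (3)
    " "         # blank
    "A"         # chain
    "   1"      # residue sequence number (4)
    "    "      # insertion code + 3 blanks
    "  11.104"  # x (8)
    "   6.134"  # y (8)
    "  -6.504"  # z (8)
)
atom_record = parse_atom_line(demo)
print(atom_record)

# A whole file would become a list of these records.
atoms = [parse_atom_line(line) for line in [demo] if line.startswith("ATOM")]
```

With this layout, `atom_record.chain` or `atom_record.x` replaces indexing into parallel lists.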

# Manipulating ATOM lines #

Let's say we want to ask the user if they want to select/do something with only one of the chains in the protein structure.

- First you need to make your lists of stuff like we did above.
- Then, you need to use set() on the chains list to see what the unique chain identifiers are.
- Then you want to filter for the chain ID the user selects.

Try this out below. Pick a chain ID (for example, chain C), and then extract the atom numbers that correspond to that chain from your lists.
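
As a compact sketch of the steps above, `set()` gives the unique chain IDs and a list comprehension over `zip()` does the filtering. The short lists here are toy stand-ins for the real `chain` and `number` lists, and `selected` stands in for the user's answer.

```python
# Toy parallel lists standing in for the ones built from the PDB file.
chain = ["A", "A", "B", "C", "C"]
number = ["1", "2", "3", "4", "5"]

# set() removes duplicates; sorted() gives a stable order for display.
unique_chains = sorted(set(chain))
print(unique_chains)

selected = "C"  # would come from input() in the notebook

# Keep the atom number wherever the matching chain entry equals the selection.
selected_numbers = [num for num, ch in zip(number, chain) if ch == selected]
print(selected_numbers)
```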

In [36]:
unique = set()
for i in range(0,len(types)):
    if types[i] == "ATOM":
        unique.add(chain[i])
print(unique)
        
my_chain = input("which chain would you like to see?").strip() #drop stray spaces around the answer

#print type, number, atom, amino and residue for the given chain
for i,typ in enumerate(types):
    if chain[i] == my_chain and typ == "ATOM":
        print(typ,number[i],atom[i],amino[i],residue[i])

{'A', 'B', 'D', 'C'}


which chain would you like to see? A


ATOM 1 N A ALA
ATOM 2 CA A ALA
ATOM 3 C A ALA
ATOM 4 O A ALA
ATOM 5 CB A ALA
ATOM 6 N G GLU
ATOM 7 CA G GLU
ATOM 8 C G GLU
ATOM 9 O G GLU
ATOM 10 CB G GLU
ATOM 11 CG G GLU
ATOM 12 CD G GLU
ATOM 13 OE1 G GLU
ATOM 14 OE2 G GLU
ATOM 15 N L LEU
ATOM 16 CA L LEU
ATOM 17 C L LEU
ATOM 18 O L LEU
ATOM 19 CB L LEU
ATOM 20 CG L LEU
ATOM 21 CD1 L LEU
ATOM 22 CD2 L LEU
ATOM 23 N A ARG
ATOM 24 CA A ARG
ATOM 25 C A ARG
ATOM 26 O A ARG
ATOM 27 CB A ARG
ATOM 28 CG A ARG
ATOM 29 CD A ARG
ATOM 30 NE A ARG
ATOM 31 CZ A ARG
ATOM 32 NH1 A ARG
ATOM 33 NH2 A ARG
ATOM 34 N C CYS
ATOM 35 CA C CYS
ATOM 36 C C CYS
ATOM 37 O C CYS
ATOM 38 CB C CYS
ATOM 39 SG C CYS
ATOM 40 N L LEU
ATOM 41 CA L LEU
ATOM 42 C L LEU
ATOM 43 O L LEU
ATOM 44 CB L LEU
ATOM 45 CG L LEU
ATOM 46 CD1 L LEU
ATOM 47 CD2 L LEU
ATOM 48 N C CYS
ATOM 49 CA C CYS
ATOM 50 C C CYS
ATOM 51 O C CYS
ATOM 52 CB C CYS
ATOM 53 SG C CYS
ATOM 54 N I ILE
ATOM 55 CA I ILE
ATOM 56 C I ILE
ATOM 57 O I ILE
ATOM 58 CB I ILE
ATOM 59 CG1 I ILE
ATOM 60 CG2 I ILE
ATO

# Only the backbone #

Only certain types of atoms belong to the protein backbone. The atom name is stored in the third field of the ATOM record, so if you followed my scheme of lists above, you extracted it into the atom list.

- Find a way to count the total number of atoms in your PDB file.
- Then find a way to count the number of atoms that are one of the backbone atom types: N, CA, C, O, CB.
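
One way to do both counts is with `collections.Counter`, sketched here on a toy `atom` list. The list contents are made up, and the set of backbone names is the one given in the bullet above.

```python
from collections import Counter

# Toy atom-name list standing in for the one extracted from the PDB file.
atom = ["N", "CA", "C", "O", "CB", "CG", "N", "CA", "C", "O"]
backbone_names = {"N", "CA", "C", "O", "CB"}

total_atoms = len(atom)
# Count only the names that belong to the backbone set.
backbone_counts = Counter(name for name in atom if name in backbone_names)
n_backbone = sum(backbone_counts.values())

print(total_atoms, n_backbone)
print(backbone_counts)
```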

In [37]:
total_atoms = 0
atom_counts = {"N":0,"CA":0,"C":0,"O":0,"CB":0}

for i in range(0,len(types)):
    if types[i] != "ATOM":
        continue
    total_atoms += 1
    if atom[i] in atom_counts:
        atom_counts[atom[i]] += 1
print(total_atoms)
print(atom_counts)

1989
{'N': 261, 'CA': 261, 'C': 261, 'O': 261, 'CB': 242}


# Change the coordinates #

We've seen that the protein x, y, z coordinates are the locations of atoms in a three-dimensional space with xyz axes. These can be visualized in a program like PyMOL. 

# Coordinate math: is it in the middle? #

Use your lists of xs, ys and zs to do some math.

- Find the midpoint between the smallest x value and the largest x. Do the same for ys and zs.
- Find the average x value, and do the same for ys and zs.
- First do this using sum() and math operators.
- Then try loading the statistics module.
- Calculate mean() and median() of xs, ys, and zs.
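
The min/max midpoint in the first bullet is generally not the same number as the mean or the median. The sketch below computes all three for a toy list; the values and the `midpoint` helper are made up to make the difference visible.

```python
import statistics

def midpoint(values):
    """Halfway between the smallest and largest value."""
    return (min(values) + max(values)) / 2

xs = [1.0, 2.0, 9.0]  # toy values, not from the PDB file

mid_x = midpoint(xs)                 # (1.0 + 9.0) / 2
mean_by_hand = sum(xs) / len(xs)     # mean from sum() and len()
mean_x = statistics.mean(xs)         # same mean from the statistics module
median_x = statistics.median(xs)     # middle value of the sorted list

print(mid_x, mean_by_hand, mean_x, median_x)
```

The same calls apply unchanged to `ys` and `zs`.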

In [43]:
import statistics

#middle element of each sorted list: with an odd number of atoms this is the median, not the min/max midpoint
print(sorted(xs)[len(xs)//2], sorted(ys)[len(ys)//2], sorted(zs)[len(zs)//2])
#mean from sum() and len()
print(sum(xs)/len(xs),sum(ys)/len(ys),sum(zs)/len(zs))
#median and mean from the statistics module
print(statistics.median(xs),statistics.median(ys),statistics.median(zs))
print(statistics.mean(xs),statistics.mean(ys),statistics.mean(zs))

7.353 18.286 18.844
7.403605329311214 18.311941176470587 19.12982101558572
7.353 18.286 18.844
7.403605329311212 18.311941176470587 19.129821015585723


# Coordinate math: move it to the origin #

Depending on how you define the "average" x y and z coordinate of your protein (or where you locate the center of mass) you will probably find that the center of the molecule is not at 0,0,0 (the center of the coordinate system). Can you modify each x, y and z coordinate to shift the molecule to the middle of the coordinate system? You can base this on whatever definition of "center" you choose. Mean, median, or halfway between min and max.

Here, show how to do the calculation and to modify the coordinate lists. You'll learn how to write out information from your lists in a formatted PDB next time.
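
Because the definition of "center" is a choice, one option is to pass it in as a function. This is a minimal sketch with toy coordinates; `recenter` and `midpoint` are hypothetical helper names, and any function that maps a list to a single number (such as `statistics.mean` or `statistics.median`) could be used in their place.

```python
import statistics

def midpoint(values):
    """Halfway between the smallest and largest value."""
    return (min(values) + max(values)) / 2

def recenter(xs, ys, zs, center_fn=statistics.mean):
    """Return new coordinate lists shifted so center_fn of each axis is at 0."""
    cx, cy, cz = center_fn(xs), center_fn(ys), center_fn(zs)
    return (
        [x - cx for x in xs],
        [y - cy for y in ys],
        [z - cz for z in zs],
    )

# Toy coordinates, not from the PDB file.
xs = [1.0, 2.0, 9.0]
ys = [0.0, 4.0, 8.0]
zs = [-2.0, 0.0, 2.0]

new_xs, new_ys, new_zs = recenter(xs, ys, zs, center_fn=midpoint)
print(new_xs, new_ys, new_zs)
print(midpoint(new_xs), midpoint(new_ys), midpoint(new_zs))
```

The original lists are left untouched, which makes it easy to compare before and after.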

In [51]:
#define center as median
import statistics
center = (statistics.median(xs),statistics.median(ys),statistics.median(zs))
print(center)
newXs, newYs, newZs = [],[],[]
for i in range(0,len(xs)):
    newXs.append(xs[i]-center[0])
    newYs.append(ys[i]-center[1])
    newZs.append(zs[i]-center[2])
print((statistics.median(newXs),statistics.median(newYs),statistics.median(newZs)))

(7.353, 18.286, 18.844)
(0.0, 0.0, 0.0)
