# Course Project: Draw a protein structure in 3D

In this project you will plot the backbone structure of proteins in 3D. To do this you will fetch atomic coordinate data from the Protein Data Bank (PDB), parse the data, and use the [matplotlib](https://matplotlib.org/) library to plot the backbone structure from $C_{\alpha}$ atoms.

Along the way, you will use all of the techniques you learned during the course, but don't worry: the project progresses at the same pace as the course. You can already start making useful progress after the first day. During the course you will have the opportunity to work on the project.

---

## Software prerequisites

This project involves plotting, so we need a couple of plotting-related libraries to be installed:
 * [Matplotlib](https://matplotlib.org/): a general-purpose visualisation library for Python
 * [ipympl](https://github.com/matplotlib/ipympl): Matplotlib integration for Jupyter

Please ask your instructors for help installing these with Anaconda Navigator or the `pip` command.

---

## Background

[Proteins](https://en.wikipedia.org/wiki/Protein) are made from _chains_ of amino acids. Each amino acid is distinguished by its _side chain_, but all amino acids share a common, repeating [_backbone_](https://foldit.fandom.com/wiki/Protein_backbone) structure. The central carbon atom (labeled $C_{\alpha}$ in the diagram below) of each amino acid backbone can be taken as representative of that amino acid and used to draw line segments between the amino acids that make up a folded protein chain. This is what you will do in this project.

![Amino acid structure](https://bio.libretexts.org/@api/deki/files/16705/peptide_bond.png)
Source: https://bio.libretexts.org/Courses/University_of_California_Davis/BIS_2A%3A_Introductory_Biology_-_Molecules_to_Cell/MASTER_RESOURCES/Proteins

Several formats are now used for storage and interchange of protein structure data. The oldest (and easiest to parse) is the [Brookhaven Protein Data Bank](https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html) file format, which was designed in the 1970s for use with punched cards.

![Punched card](https://swift.cmbi.umcn.nl/teach/B1M/IMAGE/pc1.jpg)
Source: https://swift.cmbi.umcn.nl/teach/B1M/BIOINF_4.html


Here (on the second line) is an example of a PDB record for a single atom. The first line labels column numbers (these numbers are not part of a PDB file).

> ```
> 1   5    10        20        30        40        50        60        70
> ATOM    278  CA  THR A  38       6.008  -5.254 -13.600  1.00 33.49           C  
> ```

You can read this record character by character, or rather column by column:
* Columns 1-6 contain the record name, in this case `ATOM`.
* Columns 13-16 contain the atom name. In this case `CA` which means this is a $C_\alpha$ atom.
* Columns 18-20 contain the amino acid name. In this case it is `THR`, which means this atom is in a threonine residue.
* Column 22 contains the chain name, in this case `A`.
* Columns 23-26 contain the sequence number (where in the sequence this amino acid appears).
* Columns 31-38 contain the `x` coordinate.
* Columns 39-46 contain the `y` coordinate.
* Columns 47-54 contain the `z` coordinate.
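
PDB columns are numbered from 1, while Python string indices start at 0, so columns `first` to `last` correspond to the slice `record[first - 1:last]`. Here is a minimal sketch of that mapping. The helper name `field` is an assumption made for illustration and is not part of the project templates.

```python
# Hypothetical helper for illustration only; the name `field` is an assumption.
def field(record, first, last):
    "Return the text in 1-based, inclusive columns first..last of a record."
    return record[first - 1:last]

record = "ATOM    278  CA  THR A  38       6.008  -5.254 -13.600  1.00 33.49           C  "

print(repr(field(record, 1, 6)))    # record name
print(repr(field(record, 18, 20)))  # amino acid name
print(repr(field(record, 22, 22)))  # chain name
print(repr(field(record, 23, 26)))  # sequence number
```

The extracted fields keep their padding spaces, so you will usually want to strip them before comparing values.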

The records for the backbone atoms of a single amino acid, together with its first side-chain carbon (`CB`), might look like this:

```
ATOM    270  N   PRO A  37       9.573  -6.468 -11.879  1.00 36.70           N  
ATOM    271  CA  PRO A  37       9.555  -6.634 -13.332  1.00 31.03           C  
ATOM    272  C   PRO A  37       8.253  -6.068 -13.942  1.00 29.37           C  
ATOM    273  O   PRO A  37       8.081  -5.921 -15.152  1.00 34.30           O  
ATOM    274  CB  PRO A  37       9.656  -8.129 -13.521  1.00 25.97           C  
```

This amino acid is a proline; it is residue number 37 in chain 'A'. A protein may be made up of multiple _chains_ with different names. For example, the [human haemoglobin structure 1HHO](https://www.rcsb.org/structure/1HHO) contains 2 chains named 'A' and 'B'. You will be plotting a single chain in this project.

---

## Part 1: Parsing Brookhaven PDB formatted data

The first step in this project is to extract coordinates from PDB formatted data. Let's start off by writing a predicate to detect "ATOM" records. A _predicate_ is a function that returns a boolean value: it answers a question. Here is an example of a predicate:

```python
def is_string_empty(string):
    return len(string) == 0
```
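
As a small sketch of how a predicate is typically used, the example below calls `is_string_empty()` in a condition and then uses it to select items from a list. The `words` list is made-up data for illustration.

```python
def is_string_empty(string):
    return len(string) == 0

words = ["alpha", "", "beta", ""]  # made-up example data

# A predicate can be used directly in a condition...
if is_string_empty(words[1]):
    print("The second word is empty")

# ...or to select items from a list.
non_empty = [word for word in words if not is_string_empty(word)]
print(non_empty)
```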

Write a predicate function, using the template below, that takes 1 argument and returns `True` when that argument is the string "ATOM", ignoring any surrounding spaces.

In [None]:
def is_atom(record_id):
    "Predicate: True if record_id is for an ATOM record"
    return _

# Test your implementation here
assert is_atom('ATOM'), "Expected True, got " + str(is_atom('ATOM'))
assert not is_atom('REMARK'), "Expected False, got " + str(is_atom('REMARK'))
assert is_atom('ATOM '), "Expected True, got " + str(is_atom('ATOM '))
assert is_atom('ATOM  '), "Expected True, got " + str(is_atom('ATOM  '))
assert not is_atom('HETATM'), "Expected False, got " + str(is_atom('HETATM'))

You're welcome to modify the template above if you believe you can write the function more clearly or more concisely.

We also need a predicate to check if a record is a $C_\alpha$ atom. Fill in the template for `is_ca()` below.

In [None]:
def is_ca(atom_name):
    "Predicate: True if ATOM is a CA"
    return _

# Test your implementation here
assert is_ca('CA'), "Expected True, got " + str(is_ca('CA'))
assert not is_ca(' N  '), "Expected False, got " + str(is_ca(' N  '))
assert not is_ca(''), "Expected False, got " + str(is_ca(''))
assert not is_ca('    '), "Expected False, got " + str(is_ca('    '))
assert is_ca(' CA '), "Expected True, got " + str(is_ca(' CA '))

Well done! You will also need a predicate to detect if an atom is in the chain we're interested in. This one will be a little different because now we need to accept 2 arguments: the chain we want, and the chain in the record. For example, we want chain 'A' but we're looking at a record in chain 'C'. Fill in the template below and make sure it passes the test cases.

In [None]:
def is_in_chain(wanted_chain_id, observed_chain_id):
    "Predicate: True if observed_chain_id is the same as wanted_chain_id"
    return _

# Test your implementation here
assert is_in_chain('A', 'A'), "Expected True, got " + str(is_in_chain('A', 'A'))
assert not is_in_chain('A', 'C'), "Expected False, got " + str(is_in_chain('A', 'C'))
assert is_in_chain('1', ' 1 '), "Expected True, got " + str(is_in_chain('1', ' 1 '))
assert is_in_chain('D', ' D'), "Expected True, got " + str(is_in_chain('D', ' D'))
assert not is_in_chain('D', ' A'), "Expected False, got " + str(is_in_chain('D', ' A'))

Finally, you should combine all of the above predicates into a single predicate that returns `True` only when all 3 of them are `True`. The template below builds that predicate for a given chain.

In [None]:
def is_ca_atom_in_chain(chain):
    def predicate(record):
        record_id = record[:6]      # columns 1-6
        atom_name = record[12:16]   # columns 13-16
        chain_id = record[21]       # column 22
        return _
    return predicate

assert is_ca_atom_in_chain("A")("ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  ")
assert not is_ca_atom_in_chain("B")("ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  ")
assert not is_ca_atom_in_chain("B")("HEADER    OXYGEN TRANSPORT                        10-JUN-83   1HHO              ")
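
The template above is a function that builds and returns another function, and the inner function remembers the `chain` argument it was created with. Here is a minimal sketch of the same pattern on an unrelated example. The name `is_longer_than` is hypothetical and is not part of the project.

```python
def is_longer_than(minimum_length):
    "Build a predicate that is True for strings longer than minimum_length."
    def predicate(string):
        # minimum_length is remembered from the enclosing call
        return len(string) > minimum_length
    return predicate

is_longer_than_3 = is_longer_than(3)
print(is_longer_than_3("ATOM"))  # True: 4 characters
print(is_longer_than_3("END"))   # False: 3 characters

# The built-in filter() accepts such a predicate directly
print(list(filter(is_longer_than(3), ["ATOM", "END", "HETATM"])))
```

Building the predicate once and reusing it is convenient when the same test has to be applied to every record in a file.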

---

## Part 2: Data wrangling

In the last part you wrote predicates that will allow you to _filter_ only the records you're interested in. In part 2 you will do that and extract the atomic $x$, $y$, $z$ coordinates.

Here is some example data for you to parse:

In [None]:
example = """HEADER    OXYGEN TRANSPORT                        10-JUN-83   1HHO              
TITLE     STRUCTURE OF HUMAN OXYHAEMOGLOBIN AT 2.1 ANGSTROMS RESOLUTION         
ATOM      1  N   VAL A   1       5.287  16.725   4.830  1.00 77.31           N  
ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  
ATOM      3  C   VAL A   1       7.198  18.266   5.104  1.00 81.71           C  
ATOM      4  O   VAL A   1       7.301  19.067   4.161  1.00 77.16           O  
ATOM      5  CB  VAL A   1       5.498  17.697   7.118  1.00 51.33           C  
ATOM      6  CG1 VAL A   1       6.457  16.822   7.917  1.00 78.39           C  
ATOM      7  CG2 VAL A   1       5.211  18.976   7.922  1.00 48.23           C  
ATOM      8  N   LEU A   2       8.272  17.653   5.632  1.00 67.33           N  
ATOM      9  CA  LEU A   2       9.698  18.050   5.442  1.00 27.11           C  
ATOM     10  C   LEU A   2      10.047  19.267   6.283  1.00 33.71           C  
ATOM     11  O   LEU A   2       9.566  20.404   6.099  1.00 55.97           O  
ATOM     12  CB  LEU A   2      10.129  18.317   4.001  1.00 30.38           C  
ATOM     13  CG  LEU A   2      10.208  17.036   3.175  1.00 29.73           C  
ATOM     14  CD1 LEU A   2      10.270  17.355   1.684  1.00 58.48           C  
ATOM     15  CD2 LEU A   2      11.398  16.220   3.605  1.00 47.81           C  
TER    1070      ARG A 141                                                      
ATOM   1071  N   VAL B   1       9.445 -18.730  -3.132  1.00 58.52           N  
ATOM   1072  CA  VAL B   1      10.737 -18.800  -3.820  1.00 25.93           C  
ATOM   1073  C   VAL B   1      10.769 -19.768  -5.014  1.00 38.27           C  
ATOM   1074  O   VAL B   1       9.926 -20.653  -5.223  1.00 66.47           O  
ATOM   1075  CB  VAL B   1      11.679 -19.318  -2.800  1.00 79.18           C  
ATOM   1076  CG1 VAL B   1      10.768 -19.342  -1.582  1.00 62.58           C  
ATOM   1077  CG2 VAL B   1      12.250 -20.696  -3.202  1.00 24.78           C  
ATOM   1078  N   HIS B   2      11.763 -19.464  -5.811  1.00 68.56           N  
ATOM   1079  CA  HIS B   2      12.288 -20.223  -6.906  1.00 63.28           C  
ATOM   1080  C   HIS B   2      13.768 -20.310  -6.626  1.00 75.35           C  
ATOM   1081  O   HIS B   2      14.296 -21.361  -6.237  1.00 71.83           O  
ATOM   1082  CB  HIS B   2      12.177 -19.327  -8.096  1.00 57.31           C  
ATOM   1083  CG  HIS B   2      12.390 -20.089  -9.415  1.00100.00           C  
ATOM   1084  ND1 HIS B   2      12.159 -19.476 -10.670  1.00 96.98           N  
ATOM   1085  CD2 HIS B   2      12.796 -21.402  -9.616  1.00 80.07           C  
ATOM   1086  CE1 HIS B   2      12.432 -20.425 -11.667  1.00 97.15           C  
ATOM   1087  NE2 HIS B   2      12.824 -21.614 -11.003  1.00 95.62           N  
TER    2194      HIS B 146                                                      
HETATM 2195  P   PO4 A 142      -0.011   6.346   0.005  0.50 59.85           P  
HETATM 2196  O1  PO4 A 142       0.239   7.341  -1.195  0.50 62.66           O  
HETATM 2197  O2  PO4 A 142       1.222   5.424   0.381  0.50 53.79           O  
HETATM 2198  O3  PO4 A 142      -0.259   7.347   1.196  0.50 62.61           O  
HETATM 2199  O4  PO4 A 142      -1.250   5.403  -0.344  0.50 53.40           O  
END                                                                             
"""

Use the above code cell to explore this data a little. Do you notice that all of the records are smooshed together into one string? Our first task will be to separate the records. Thankfully, records are delimited by _newline_ characters (you can write these in strings as `"\n"`). Write and test a function below to split out records from the example input.
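
As a small illustration on made-up text (not PDB data), here is how newline-delimited text can be broken apart with the built-in string methods `split()` and `splitlines()`. Note that they treat a trailing newline differently.

```python
text = "first\nsecond\nthird\n"  # made-up example text

print(text.split("\n"))   # ['first', 'second', 'third', '']
print(text.splitlines())  # ['first', 'second', 'third']
```

The empty string at the end of the `split("\n")` result is worth keeping in mind when you count records.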

In [None]:
def extract_records(pdb):
    "Take raw PDB formatted text and extract all valid records."
    return _

# Test your implementation here
assert 42 == len(extract_records(example)), "Expected: 42, got: " + str(len(extract_records(example)))

Excellent! Now that we can look at individual records, it's time to filter the records for $C_{\alpha}$ `ATOM` records using the predicates you wrote earlier. Write a function called `filter_records()` below that uses your predicates. This function should accept the list of records from `extract_records()` and the chain identifier you're interested in as arguments, and return a list of the records that satisfy the predicates.

In [None]:
def filter_records(records, chain):
    "Filter input records for those that are CA ATOMs in chain"
    return _

# Test your implementation here
assert 2 == len(filter_records(extract_records(example), 'A')), "Expected 2, got " + str(len(filter_records(extract_records(example), 'A')))
assert 2 == len(filter_records(extract_records(example), 'B')), "Expected 2, got " + str(len(filter_records(extract_records(example), 'B')))
assert ['2', '9'] == [r[10] for r in filter_records(extract_records(example), 'A')], "Expected [2, 9], got " + str([r[10] for r in filter_records(extract_records(example), 'A')])
assert ['1072', '1079'] == [r[7:11] for r in filter_records(extract_records(example), 'B')], "Expected ['1072', '1079'], got " + str([r[7:11] for r in filter_records(extract_records(example), 'B')])

You've almost finished parsing the data! Only 1 thing left to do now that you have a list of only the records you're interested in: extract the coordinates :)

Write a function below that takes a list of records as an argument and returns a list of coordinate triples like: $[[x_1, y_1, z_1], [x_2, y_2, z_2], \ldots, [x_n, y_n, z_n]]$
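
Fields sliced out of a record are still strings, so they need converting before they can be used as numbers. Here is a minimal sketch that uses the occupancy (columns 55-60) and temperature factor (columns 61-66) fields rather than the coordinates. Those column ranges come from the PDB format specification linked in the Background section, not from the column list above. `float()` ignores the surrounding spaces.

```python
record = "ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  "

occupancy_text = record[54:60]    # columns 55-60
temperature_text = record[60:66]  # columns 61-66

# float() accepts leading and trailing whitespace
occupancy = float(occupancy_text)
temperature_factor = float(temperature_text)

print(repr(occupancy_text), occupancy)
print(repr(temperature_text), temperature_factor)
```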

In [None]:
def extract_coordinates(records):
    "Extract 3D coordinate values from filtered PDB records"
    coords = []
    for record in records:
        x = _
        y = _
        z = _
        coords.append([x, y, z])
    return coords

#  Test your implementation here
from solutions.pdb import compare_coordinate_lists
expectedA = [[5.776, 17.899, 5.595], [9.698, 18.050, 5.442]]
expectedB = [[10.737, -18.800, -3.820], [12.288, -20.223, -6.906]]
assert all(compare_coordinate_lists(extract_coordinates(filter_records(extract_records(example), 'A')), expectedA))
assert all(compare_coordinate_lists(extract_coordinates(filter_records(extract_records(example), 'B')), expectedB))

Well done! You've now finished parsing the coordinate data :) And given you've passed all of the test cases you can be reasonably confident that your code is correct.

---

## Part 3: Plotting the 3D coordinates

Finally! You've finished wrangling data. It's time for the payoff :) Today you'll finish up by fetching PDB data from the internet and plotting it with matplotlib. Make sure you run the following code cell without editing it so that plots display interactively in the notebook...

In [None]:
%matplotlib widget

Then we'll import the libraries we need...

In [None]:
from urllib import request # For getting PDB data over the internet

import matplotlib.pyplot as plt # For plotting

Let's start by writing the function to download PDB data over the internet. We can use the Protein Data Bank in Europe (PDBe) repository for this. The URL template looks like this:

<pre>
<span style="color:blue;">https://www.ebi.ac.uk/pdbe/entry-files/download/pdb</span><span style="color:black;">$PDBID</span><span style="color:blue;">.ent</span>
</pre>

You should replace the `$PDBID` part with a valid PDB accession code in _lower case_. Here are a few examples:
* [1HHO](https://www.ebi.ac.uk/pdbe/entry/pdb/1hho): Human Oxyhaemoglobin
* [1MAZ](https://www.ebi.ac.uk/pdbe/entry/pdb/1maz): BCL-XL, an inhibitor of programmed cell death
* [1CPB](https://www.ebi.ac.uk/pdbe/entry/pdb/1cpb): Bovine Carboxypeptidase B
* [1SRX](https://www.ebi.ac.uk/pdbe/entry/pdb/1srx): E. coli Thioredoxin-S2

In [None]:
def fetch_pdb(code):
    "Fetch PDB data from PDBe. Code should be e.g. '1HHO'"
    url = "https://www.ebi.ac.uk/pdbe/entry-files/download/pdb" + code.lower() + ".ent"
    with request.urlopen(url) as pdbfile:
        pdb = _.read().decode() # Note the decode() is necessary. If you're not sure why, ask an instructor.
    return _
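
On the `decode()` note in the template above: `urlopen(...).read()` returns a `bytes` object rather than a `str`, and the string handling you wrote in Part 2 expects text. Here is a minimal offline sketch, with a made-up byte string standing in for downloaded data.

```python
raw = b"HEADER    EXAMPLE\nEND\n"  # made-up stand-in for downloaded data

print(type(raw))     # <class 'bytes'>
text = raw.decode()  # bytes -> str (UTF-8 by default)
print(type(text))    # <class 'str'>
print(text.split("\n"))
```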

Now you should be able to fetch one of the examples and extract the $C_\alpha$ coordinates with your `extract_coordinates()` function.

The final thing to do is massage our coordinate data ready for plotting and plot the coordinates! When you're finished you should have an interactive plot something like this (for 1HHO):

![1HHO](images/1hho_plotted.png)

In [None]:
def plot_coordinates(pdb_id, chain_id):
    "Download a PDB entry and plot the CA backbone trace of one chain in 3D"
    # Download the entry, then parse and filter it down to CA coordinates
    records = extract_records(fetch_pdb(pdb_id))
    coordinates = extract_coordinates(filter_records(records, chain_id))
    fig = plt.figure()
    ax = plt.axes(projection="3d")
    x, y, z = zip(*coordinates)  # Regroup [x, y, z] triples into x, y and z sequences
    ax.plot(x, y, z, label=pdb_id + '-' + chain_id)
    ax.legend()
    plt.show()
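
The `zip(*coordinates)` call in `plot_coordinates()` regroups the data: it turns a list of `[x, y, z]` triples into three sequences, one per axis, which is the shape `ax.plot()` expects. A minimal sketch using the two chain 'A' coordinates from the Part 2 example:

```python
coordinates = [[5.776, 17.899, 5.595], [9.698, 18.050, 5.442]]

# Unpack the list so each [x, y, z] triple becomes one argument to zip()
x, y, z = zip(*coordinates)

print(x)  # (5.776, 9.698)
print(y)  # (17.899, 18.05)
print(z)  # (5.595, 5.442)
```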

In [None]:
plot_coordinates('1HHO', 'A')