# Course Project: Draw a protein structure in 3D

In this project you will plot the backbone structure of proteins in 3D. To do this you will fetch PDB data from PDBe (Protein Data Bank in Europe), parse the data, and use the [matplotlib](https://matplotlib.org/) library to plot the backbone structure from $C_{\alpha}$ atoms.

Along the way, you will use all of the techniques you learned during the course, but don't worry: the project progresses at the same pace as the course. You can already start making useful progress after the first day. During the course you will have the opportunity to work on the project.

---

## Software prerequisites

This project involves plotting, so we need a couple of plotting-related libraries installed:
 * [Matplotlib](https://matplotlib.org/): a general-purpose visualisation library for Python
 * [ipympl](https://github.com/matplotlib/ipympl): Matplotlib integration for Jupyter

Please ask your instructors for help installing these with Anaconda Navigator or the `pip` command.

---

## Background

[Proteins](https://en.wikipedia.org/wiki/Protein) are made from _chains_ of amino acids. Each amino acid is distinguished by its _side chain_, but all share a common, repeating [_backbone_](https://foldit.fandom.com/wiki/Protein_backbone) structure. The central carbon atom of each amino acid's backbone (the $C_\alpha$ atom) can be taken as representative of that amino acid, and joining these atoms with line segments traces out the chain. This is what you will do in this project.

![Amino acid structure](https://bio.libretexts.org/@api/deki/files/16705/peptide_bond.png)
Source: https://bio.libretexts.org/Courses/University_of_California_Davis/BIS_2A%3A_Introductory_Biology_-_Molecules_to_Cell/MASTER_RESOURCES/Proteins

Several formats are now used to store and exchange protein structure data. The oldest (and easiest to parse) is the [Brookhaven Protein Data Bank](https://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html) file format, which was designed in the 1970s for use with punched paper cards.

![Punched card](https://swift.cmbi.umcn.nl/teach/B1M/IMAGE/pc1.jpg)
Source: https://swift.cmbi.umcn.nl/teach/B1M/BIOINF_4.html


Here is an example of a PDB record for a single atom:

> ```
> 1   5    10        20       30        40        50        60        70
> ATOM    278  CA  THR A  38       6.008  -5.254 -13.600  1.00 33.49           C  
> ```

You can read this letter-by-letter (or, more precisely, column-by-column). Above the PDB record I've labelled some column numbers.
* Columns 1-6 hold the record name, in this case `ATOM`.
* Columns 13-16 hold the atom name, in this case `CA`, which means this is a $C_\alpha$ atom.
* Column 22 holds the chain name, in this case `A`.
* Columns 31-38 hold the $x$ coordinate.
* Columns 39-46 hold the $y$ coordinate.
* Columns 47-54 hold the $z$ coordinate.
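
As a rough sketch of how these column ranges map onto Python string slices: PDB columns are counted from 1 and include both ends, while slices count from 0 and exclude the end, so columns M-N become the slice `[M-1:N]`. The variable names here are illustrative only.

```python
# The example ATOM record shown above.
record = "ATOM    278  CA  THR A  38       6.008  -5.254 -13.600  1.00 33.49           C  "

# Columns M-N (1-based, inclusive) correspond to the slice [M-1:N].
record_name = record[0:6]    # columns 1-6
atom_name = record[12:16]    # columns 13-16
chain_name = record[21]      # column 22
x_text = record[30:38]       # columns 31-38

# repr() makes the padding spaces visible.
print(repr(record_name), repr(atom_name), repr(chain_name), repr(x_text))
```

Note that the extracted fields keep their padding spaces, which matters when you compare them with other strings.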

The records for the backbone atoms of a single amino acid (plus `CB`, the first carbon of its side chain) might look like this:

```
ATOM    270  N   PRO A  37       9.573  -6.468 -11.879  1.00 36.70           N  
ATOM    271  CA  PRO A  37       9.555  -6.634 -13.332  1.00 31.03           C  
ATOM    272  C   PRO A  37       8.253  -6.068 -13.942  1.00 29.37           C  
ATOM    273  O   PRO A  37       8.081  -5.921 -15.152  1.00 34.30           O  
ATOM    274  CB  PRO A  37       9.656  -8.129 -13.521  1.00 25.97           C  
```

This amino acid is a proline; it's the 37th residue in the 'A' chain.

---

## Part 1: Parsing Brookhaven PDB formatted data

The first step in this project is to extract coordinates from PDB-formatted data. Let's start off by writing a predicate to detect "ATOM" records. A _predicate_ is a function that returns a boolean value: it answers a question. Here is an example of a predicate:

```python
def is_string_empty(string):
    return len(string) == 0
```

Write a predicate function, using the template below, that takes one argument and returns `True` when that argument is a string containing "ATOM", even if it is padded with spaces.

In [None]:
def is_atom(record_id):
    "Predicate: True if record_id is for an ATOM record"
    if _:
        return True
    
    return False

# Test your implementation here
assert is_atom('ATOM') == True, f"I expected True, got {is_atom('ATOM')}"
assert is_atom('REMARK') == False, f"I expected False, got {is_atom('REMARK')}"
assert is_atom('ATOM ') == True, f"I expected True, got {is_atom('ATOM ')}"
assert is_atom('ATOM  ') == True, f"I expected True, got {is_atom('ATOM  ')}"

You're welcome to modify the template above if you believe you can write the function more clearly or more concisely.
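
As one possible illustration of a more concise style, here is a sketch of a predicate for a different record type. The name `is_remark()` is hypothetical and not part of the project; it only shows that a comparison already produces a boolean, so it can be returned directly.

```python
def is_remark(record_id):
    "Predicate: True if record_id is for a REMARK record (hypothetical example)"
    # strip() removes any padding spaces, and the comparison itself
    # evaluates to True or False, so no if statement is needed.
    return record_id.strip() == "REMARK"
```

Whether the shorter form is clearer is a matter of taste; both styles pass the same tests.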

We also need a predicate to check if a record is for $C_\alpha$ atoms. Fill in the template for `is_ca()` below.

In [None]:
def is_ca(atom_name):
    "Predicate: True if ATOM is a CA"
    ...

# Test your implementation here
assert is_ca('CA') == True, f"I expected True, got {is_ca('CA')}"
assert is_ca(' N  ') == False, f"I expected False, got {is_ca(' N  ')}"
assert is_ca(' CA ') == True, f"I expected True, got {is_ca(' CA ')}"

Well done! Finally, we need a predicate to detect if an atom is in the chain we're interested in. This one will be a little different because now we need to accept 2 arguments: the chain we want, and the chain in the record. For example, we want chain 'A' but we're looking at a record in chain 'C'. Fill in the template below and make sure it passes the test cases.

In [None]:
def is_in_chain(wanted_chain_id, observed_chain_id):
    "Predicate: True if observed_chain_id is the same as wanted_chain_id"
    ...

# Test your implementation here
assert is_in_chain('A', 'A') == True, f"I expected True, got {is_in_chain('A', 'A')}"
assert is_in_chain('A', 'C') == False, f"I expected False, got {is_in_chain('A', 'C')}"
assert is_in_chain('1', ' 1 ') == True, f"I expected True, got {is_in_chain('1', ' 1 ')}"
assert is_in_chain('D', ' D') == True, f"I expected True, got {is_in_chain('D', ' D')}"
assert is_in_chain('D', ' A') == False, f"I expected False, got {is_in_chain('D', ' A')}"

---

## Part 2: More parsing Brookhaven PDB formatted data

In the last part you wrote predicates that will allow you to _filter_ only the records you're interested in. Today we will do that and extract the atomic $x$, $y$, $z$ coordinates.

Here is some example data for you to parse:

In [None]:
example = """HEADER    OXYGEN TRANSPORT                        10-JUN-83   1HHO              
TITLE     STRUCTURE OF HUMAN OXYHAEMOGLOBIN AT 2.1 ANGSTROMS RESOLUTION         
ATOM      1  N   VAL A   1       5.287  16.725   4.830  1.00 77.31           N  
ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  
ATOM      3  C   VAL A   1       7.198  18.266   5.104  1.00 81.71           C  
ATOM      4  O   VAL A   1       7.301  19.067   4.161  1.00 77.16           O  
ATOM      5  CB  VAL A   1       5.498  17.697   7.118  1.00 51.33           C  
ATOM      6  CG1 VAL A   1       6.457  16.822   7.917  1.00 78.39           C  
ATOM      7  CG2 VAL A   1       5.211  18.976   7.922  1.00 48.23           C  
ATOM      8  N   LEU A   2       8.272  17.653   5.632  1.00 67.33           N  
ATOM      9  CA  LEU A   2       9.698  18.050   5.442  1.00 27.11           C  
ATOM     10  C   LEU A   2      10.047  19.267   6.283  1.00 33.71           C  
ATOM     11  O   LEU A   2       9.566  20.404   6.099  1.00 55.97           O  
ATOM     12  CB  LEU A   2      10.129  18.317   4.001  1.00 30.38           C  
ATOM     13  CG  LEU A   2      10.208  17.036   3.175  1.00 29.73           C  
ATOM     14  CD1 LEU A   2      10.270  17.355   1.684  1.00 58.48           C  
ATOM     15  CD2 LEU A   2      11.398  16.220   3.605  1.00 47.81           C  
"""

Use the above code cell to explore this data a little. Do you notice that the records are all smooshed together? Our first task will be to separate the records. Thankfully, records are delimited by _newline_ characters (you can write these in strings like `"\n"`). Write and test a function below to split out records from the example input.
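
To get a feel for how splitting on newlines behaves, here is a small sketch on a made-up two-line string (not real PDB data). It assumes the text ends with a newline, as the example above does.

```python
# A tiny stand-in for PDB text: two records, each ending in a newline.
text = "first record\nsecond record\n"

pieces = text.split("\n")
# The final newline leaves an empty string at the end of the list.
print(pieces)

# str.splitlines() is an alternative that does not produce that empty entry.
lines = text.splitlines()
print(lines)
```

Keep the trailing empty string in mind when you count records or index into them.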

In [None]:
def split_records(pdb):
    "Take raw PDB formatted text and extract records."
    return pdb.rstrip("\n").split(_)  # rstrip("\n") drops the final newline so there is no empty last record

# Test your implementation here
assert len(split_records(example)) == 17, f"I expected 17, got {len(split_records(example))}"

Excellent! Now that we can look at individual records, it's time to filter them using the predicates you wrote last time. Write a function called `filter_records()` below that uses your predicates. This function should accept the list of records from `split_records()` and the chain identifier you're interested in as arguments, and return a list of the records that satisfy the predicates.

In [None]:
def filter_records(records, chain):
    "Filter input records for those that are CA ATOMs in chain"
    _
    for record in _:
        record_id = record[:6]
        chain_id = record[21]
        atom_name = record[12:16]  # columns 13-16 hold the atom name
        if is_atom(_) and is_ca(_) and is_in_chain(chain, _):
            _
    return _

#  Test your implementation here
assert len(filter_records(split_records(example), 'A')) == 2, f"I expected 2, got {len(filter_records(split_records(example), 'A'))}"
assert filter_records([split_records(example)[0]], 'A') == [], f"I expected [], got {filter_records([split_records(example)[0]], 'A')}"
assert filter_records(split_records(example), 'B') == [], f"I expected [], got {filter_records(split_records(example), 'B')}"
assert filter_records(split_records(example)[:5], 'A') == ["ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  "], f"I expected ['ATOM      2  CA  VAL A   1       5.776  17.899   5.595  1.00 70.91           C  '], got {filter_records(split_records(example)[:5], 'A')}"

You've almost finished parsing the data! Only 1 thing left to do now that you have a list of only the records you're interested in: extract the coordinates :)

Write a function below that takes a list of records as an argument and returns a list of tuples containing coordinates like: $[(x_1, y_1, z_1), (x_2, y_2, z_2), \ldots, (x_n, y_n, z_n)]$
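
The coordinate fields are text padded with spaces, so they need converting to numbers. A minimal sketch of that conversion, using made-up field values rather than a real record:

```python
# Hypothetical fixed-width coordinate fields, padded with spaces.
fields = ["   5.776", "  17.899", "   5.595"]

# float() ignores leading and trailing whitespace.
x, y, z = (float(field) for field in fields)

points = []
points.append((x, y, z))  # one (x, y, z) tuple per atom
print(points)
```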

In [None]:
def extract_coordinates(records):
    "Extract 3D coordinate values from filtered PDB records"
    coords = []
    for record in _:
        x = _(record[30:38])
        y = _(record[38:46])
        z = _(record[46:54])
        _
    _

#  Test your implementation here
from solutions.pdb import compare_coordinate_lists
expected = [(5.776, 17.899, 5.595), (9.698, 18.050, 5.442)]
assert all(compare_coordinate_lists(extract_coordinates(filter_records(split_records(example), 'A')), expected)), f"I expected {expected}, got {extract_coordinates(filter_records(split_records(example), 'A'))}"

Well done! You've now finished parsing data :) And given you've passed all of the test cases you can be reasonably confident that your code is correct.

---

## Part 3: Plotting the 3D coordinates

Finally! You've finished wrangling data. It's time for the payoff :) Today you'll finish up by fetching PDB data from the internet and plotting it with matplotlib. Make sure you run the following code cell without editing it so that plots will display nice and interactively in the notebook...

In [None]:
%matplotlib widget

Then we'll import the libraries we need...

In [None]:
from urllib import request # For getting PDB data over the internet

import matplotlib.pyplot as plt # For plotting

Let's start by writing the function to download PDB data over the internet. We can use the Protein Data Bank in Europe (PDBe) repository for this. The URL template looks like this:

<pre>
<span style="color:blue;">https://www.ebi.ac.uk/pdbe/entry-files/download/pdb</span><span style="color:black;">$PDBID</span><span style="color:blue;">.ent</span>
</pre>

You should replace the `$PDBID` part with a valid PDB accession code in _lower case_. Here are a few examples:
* [1HHO](https://www.ebi.ac.uk/pdbe/entry/pdb/1hho): Human Oxyhaemoglobin
* [1MAZ](https://www.ebi.ac.uk/pdbe/entry/pdb/1maz): BCL-XL, an inhibitor of programmed cell death
* [1CPB](https://www.ebi.ac.uk/pdbe/entry/pdb/1cpb): Bovine Carboxypeptidase B
* [1SRX](https://www.ebi.ac.uk/pdbe/entry/pdb/1srx): E. coli Thioredoxin-S2
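
One way to fill in the template is with an f-string. The sketch below uses a hypothetical helper name, `build_pdbe_url()`, which is not part of the project template, and it only builds the URL without downloading anything.

```python
def build_pdbe_url(code):
    "Build the PDBe download URL for a PDB accession code (hypothetical helper)"
    # lower() converts e.g. '1HHO' to '1hho', as the URL template requires.
    return f"https://www.ebi.ac.uk/pdbe/entry-files/download/pdb{code.lower()}.ent"

print(build_pdbe_url("1HHO"))
```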

In [None]:
def fetch_pdb(code):
    "Fetch PDB data from PDBe. Code should be e.g. '1HHO'"
    url = _
    with request.urlopen(url) as pdbfile:
        pdb = _.read().decode() # Note the decode() is necessary. If you're not sure why, ask an instructor.
    return _

Now you should be able to fetch one of the examples and extract the $C_\alpha$ coordinates with your `extract_coordinates()` function.
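
On the `decode()` call in `fetch_pdb()`: data read from a URL arrives as `bytes`, not as a `str`. The sketch below imitates that with a hard-coded bytes literal instead of a real download, to show the difference.

```python
raw = b"ATOM      2  CA  VAL A   1\n"  # bytes, like the result of read()
text = raw.decode()                     # str, decoded as UTF-8 by default

print(type(raw).__name__, type(text).__name__)

# Splitting bytes with a str separator is an error; the decoded text works.
try:
    raw.split("\n")
except TypeError as error:
    print("Cannot split bytes with a str separator:", error)

records = text.split("\n")
```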

The final thing to do is massage our coordinate data ready for plotting and plot the coordinates! When you're finished you should have an interactive plot something like this (for 1HHO):

![1HHO](images/1hho_plotted.png)

In [None]:
# Download and extract the coordinates for a protein
coordinates = _

In [None]:
fig = plt.figure()
ax = plt.axes(projection="3d")
x, y, z = zip(*coordinates)
ax.plot(x, y, z, label='Fixme') # Fill in the label
ax.legend()
plt.show()
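
If the `zip(*coordinates)` line in the plotting cell looks mysterious, this small sketch with made-up coordinates shows what it does: it regroups a list of $(x, y, z)$ tuples into one sequence of all the $x$ values, one of all the $y$ values, and one of all the $z$ values.

```python
# Hypothetical coordinates for three atoms, one (x, y, z) tuple per atom.
coordinates = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]

# The * unpacks the list so zip() receives each tuple as a separate argument
# and groups the first, second and third elements together.
x, y, z = zip(*coordinates)

print(x)
print(y)
print(z)
```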