# Table of Contents
<ol>
 <li><a href="#introduction">Introduction</a></li>
 <ol>
  <li><a href="#using-the-notebook">Using the notebook</a></li>
 </ol>
    <li><a href="#the-proteins">The proteins</a></li>   
  <ol>
    <li><a href="#1c8q">1C8Q: Human salivary amylase</a></li>
    <li><a href="#4m6u">4M6U: <em>P. putida</em> mandelate racemase</a></li>
    <li><a href="#4ads">4ADS: Plasmodial PLP synthase complex</a></li>
    <li><a href="#1ubq">1UBQ: Ubiquitin (1.8 Å resolution)</a></li>
    <li><a href="#1zqa">1ZQA: DNA polymerase complexed with DNA</a></li>
  </ol>
    <li><a href="#the-code-cells">The code cells</a></li>
</ol>

# Introduction <a class="anchor" id="introduction"></a>
The Protein Data Bank (PDB) stores data for many proteins, each designated by a four-character alphanumeric ID. Some of this data describes 3-D structure, which matters because the structure of a protein determines its function.

So what is the PDB? According to its [website](https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction):

> The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

Even with these experimental methods, predicting the shape of a protein remains challenging, though [huge breakthroughs](https://www.nature.com/articles/d41586-021-02025-4) have been made in this field using [machine learning](https://pubmed.ncbi.nlm.nih.gov/34015749/). This notebook serves as a brief exploration into the world of proteins and protein modeling.

### Using the notebook <a class="anchor" id="using-the-notebook"></a>
Execute both of the [code cells](#the-code-cells) below and use the dropdown box or text box that appears to select a protein.

To execute a cell, click on it and press ```Shift + Enter``` or the ```Run``` button in the notebook toolbar.

Alternatively, run both cells by clicking ```Cell > Run All``` in the notebook toolbar.

After selecting a protein, the protein and an accompanying menu will appear. The protein can be rotated by clicking and dragging. Scroll to zoom in and out. Different representations can be selected by navigating to the ```Extra > Quick``` tab in the protein viewer menu and selecting from the list of representations. A short reference for the protein menu can be found [here](https://github.com/zachmichael14/protein_viewer#gui-reference).

# The Proteins <a class="anchor" id="the-proteins"></a>

Proteins are everywhere and do all manner of important things, and that's an understatement. In fact, proteins are so important that a central job of DNA is to serve as an instruction set for making them.

Proteins are just chains of amino acids arranged in a certain way. The way these amino acids are arranged creates the protein's structure, and this structure determines what the protein can do. 

For instance, a certain sequence of amino acids might create a protein that has a small pocket for a chemical substrate to fit into, thus facilitating some reaction. An enzyme is just a protein that promotes some such reaction, though not necessarily in that way. Proteins with names ending in -ase are typically enzymes.

You can enter a PDB ID below or select one of five proteins from the dropdown box. The five pre-selected proteins available here are:

<ul>
<li><strong>1C8Q: Human salivary amylase</strong><a class="anchor" id="1c8q"></a></li>
    
- Amylase is an enzyme that aids digestion by breaking down starches into sugars. It's made by both the pancreas and the salivary glands; its presence in saliva explains why foods high in starch (rice, for instance) may taste slightly sweet while being chewed.<br><br>
    
<li><strong>4M6U: <em>P. putida</em> mandelate racemase</strong><a class="anchor" id="4m6u"></a></li>
    
- A genetically engineered strain of the bacterium *Pseudomonas putida*, the species from which this protein comes, was the subject of the U.S. Supreme Court case *Diamond v. Chakrabarty* and is widely cited as the first patented living organism. This particular protein interconverts the two mirror-image forms of its substrate, a molecule called mandelic acid.<br><br>

<li><strong>4ADS: Plasmodial PLP synthase complex</strong><a class="anchor" id="4ads"></a></li>
    
- The parasite *Plasmodium berghei* causes malaria in some rodents and is a model organism for studying malaria in humans. Protein 4ADS is a complex (meaning it's composed of multiple individual proteins) that synthesizes the molecule pyridoxal phosphate (PLP), the active form of vitamin B<sub>6</sub>.<br><br>

<li><strong>1UBQ: Ubiquitin (1.8 Å resolution)</strong><a class="anchor" id="1ubq"></a></li>
    
- Ubiquitin is so named because it is found in nearly all eukaryotes (animals, plants, and fungi, among others). In fact, it's so useful that four different genes encode it in humans. Importantly, ubiquitin can be attached to other proteins in a process called ubiquitination (also called ubiquitinylation), which can greatly alter the target protein's function.<br><br>

<li><strong>1ZQA: DNA polymerase complexed with DNA</strong><a class="anchor" id="1zqa"></a></li>
    
- DNA polymerases are a class of proteins that are crucial in the replication and repair of DNA. This particular variant is heavily involved in the DNA repair process in humans. Here, it's pictured with a very small segment of DNA.
</ul>
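
Each entry above is keyed by its four-character PDB ID, and the code cells later in the notebook lowercase that ID before using it. As a rough illustration of that step, here is a minimal sketch that checks whether a string is shaped like an ID and pulls one out of a dropdown-style label. The names `normalize_pdb_id` and `id_from_label` are hypothetical helpers, not part of the notebook, and the pattern only encodes the "four alphanumeric characters" rule stated in the introduction; it does not check that the ID actually exists in the PDB.

```python
import re

# Assumption: an ID is exactly four ASCII letters or digits,
# as described in the introduction. Existence in the PDB is not checked.
PDB_ID_PATTERN = re.compile(r'[0-9A-Za-z]{4}')

def normalize_pdb_id(text):
    """Return a lowercase four-character ID, or None if text is not ID-shaped."""
    candidate = text.strip()
    if PDB_ID_PATTERN.fullmatch(candidate) is None:
        return None
    return candidate.lower()

def id_from_label(label):
    """Pull the ID out of a label such as '1C8Q: Amylase'."""
    return normalize_pdb_id(label.split(':', 1)[0])

print(normalize_pdb_id('1UBQ'))              # 1ubq
print(normalize_pdb_id(' 4ads '))            # 4ads
print(normalize_pdb_id('1ub'))               # None (too short)
print(id_from_label('1C8Q: Amylase'))        # 1c8q
print(id_from_label('Select a protein...'))  # None (placeholder label)
```

Returning `None` for anything that is not ID-shaped lets a caller skip the download step instead of requesting a file that cannot exist.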


# The code cells <a class="anchor" id="the-code-cells"></a>

Run both of the cells below and select a protein from the dropdown menu or enter a PDB ID to view the protein.
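
The first cell defines a debounce decorator so that the text box does not trigger a lookup on every keystroke. To show the effect in isolation, here is a condensed, standalone sketch of the same idea using only `asyncio`. The `fetch` function and the `received` list are stand-ins invented for this illustration, and the short delays are chosen only to keep the demo quick.

```python
import asyncio

def debounce(wait):
    """Delay calls to fn until `wait` seconds pass with no newer call."""
    def decorator(fn):
        pending = None  # the currently scheduled asyncio task, if any
        def debounced(*args, **kwargs):
            nonlocal pending
            async def job():
                await asyncio.sleep(wait)
                fn(*args, **kwargs)
            if pending is not None:
                pending.cancel()  # a newer call supersedes the older one
            pending = asyncio.ensure_future(job())
        return debounced
    return decorator

received = []  # records which inputs actually reached the handler

@debounce(0.2)
def fetch(pdb_id):
    # Stand-in for the real lookup; it only records the argument.
    received.append(pdb_id)

async def simulate_typing():
    # Four quick "keystrokes", each arriving before the 0.2 s wait expires.
    for partial in ['1', '1u', '1ub', '1ubq']:
        fetch(partial)
        await asyncio.sleep(0.01)
    # Pause long enough for the last scheduled call to fire.
    await asyncio.sleep(0.5)

asyncio.run(simulate_typing())
print(received)  # ['1ubq']
```

Only the final, complete input reaches the handler. Inside a notebook an event loop is already running, so `asyncio.run(...)` would raise an error there; `await simulate_typing()` in a cell is the usual substitute.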

In [None]:
""" This code taken from ipywidget docs on debouncing.
    The debounce is used for the protein textbox to avoid
    attempting to retrieve incomplete PDB IDs.
    https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Events.html#debouncing """
import asyncio

class Timer:
    def __init__(self, timeout, callback):
        self._timeout = timeout
        self._callback = callback

    async def _job(self):
        await asyncio.sleep(self._timeout)
        self._callback()

    def start(self):
        self._task = asyncio.ensure_future(self._job())

    def cancel(self):
        self._task.cancel()

def debounce(wait):
    """ Decorator that will postpone a function's
        execution until after `wait` seconds
        have elapsed since the last time it was invoked. """
    def decorator(fn):
        timer = None
        def debounced(*args, **kwargs):
            nonlocal timer
            def call_it():
                fn(*args, **kwargs)
            if timer is not None:
                timer.cancel()
            timer = Timer(wait, call_it)
            timer.start()
        return debounced
    return decorator

In [None]:
from os.path import exists
from pathlib import Path

from Bio.PDB import PDBList
from Bio.PDB.MMCIFParser import MMCIFParser
from Bio.PDB import MMCIF2Dict
import ipywidgets as widgets
import nglview as nv

def visualize_protein(pdb_id, delete_file_after=False):
    """ Parse structure from existing .cif file corresponding to pdb_id,
        then view the protein using nglview GUI. Since get_pdb_file saves
        the file locally, delete_file_after provides the option to
        delete the saved file after viewing. """
    file = f'pdb_data/{pdb_id}.cif'
    
    # Parse molecular structure from file.
    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure(pdb_id, file)
    
    # Display the protein and nglview GUI.
    view = nv.show_biopython(structure, gui=True)
    with widget_output:
        display(view)
        # Clear output when new protein is selected.
        widget_output.clear_output(wait=True)
    
    # Delete stored file if necessary.
    if delete_file_after:
        # `file` is a plain string, so wrap it in a Path before unlinking.
        Path(file).unlink()

def get_pdb_file(pdb_id, data_dir='pdb_data'):
    """ Checks for .cif file in data_dir corresponding to pdb_id. 
        If it doesn't exist, retrieve it from PDB and save it locally. 
        Since the file is saved, this function returns nothing. """
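    file_path = f'{data_dir}/{pdb_id}.cif'
    
    # If the file isn't stored, retrieve and save it locally.
    if not exists(file_path):
        pdb_list = PDBList()
        
        # Save into data_dir as mmCIF so the file name matches file_path.
        pdb_list.retrieve_pdb_file(pdb_id, pdir=data_dir, file_format='mmCif')
        
        # If no file appeared, the download failed (e.g. no structure matches the ID).
        if not exists(file_path):
            raise FileNotFoundError(f'No structure found for PDB ID {pdb_id!r}.')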

# SELECTOR WIDGETS
pdb_dropdown = widgets.Dropdown(
    options=[
        'Select a protein...', 
        '1C8Q: Amylase', 
        '4M6U: Racemase', 
        '4ADS: Synthase', 
        '1UBQ: Ubiquitin', 
        '1ZQA: Polymerase'
    ],
    value='Select a protein...',
    description='PDB ID:',
    disabled=False
)

pdb_textbox = widgets.Text(
    value='',
    placeholder='Enter a 4 character PDB ID...',
    description='PDB ID:',
    disabled=False
)

# SELECTOR EVENT HANDLERS
def handle_dropdown(change):
    # Handle error if user re-selects default selector value.
    if change['new'] == 'Select a protein...':
        with widget_output:
            print('Please select a protein to display.')
            widget_output.clear_output(wait=True)
    else:
        # Extract 4 character ID from selector widget.
        pdb_id = change['new'][0:4].lower()
        get_pdb_file(pdb_id)
        visualize_protein(pdb_id)

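@debounce(1.5)
def handle_textbox(change):
    pdb_id = change['new'].strip().lower()
    
    # Ignore empty or incomplete input; PDB IDs are 4 characters long.
    if len(pdb_id) != 4:
        return
    
    get_pdb_file(pdb_id)
    visualize_protein(pdb_id)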

# Link selectors to event handlers.
pdb_dropdown.observe(handle_dropdown, names='value')
pdb_textbox.observe(handle_textbox, names='value')    
    
# Widget output must be captured in order to be displayed. 
widget_output = widgets.Output()

# Display selector widget, output.
display(widgets.VBox([pdb_dropdown, pdb_textbox, widget_output])) 