# Reproducible PDB viewer (BinderHub version)

The Protein Data Bank (PDB) contains about 200 000 three-dimensional atomic structures of proteins, peptides and nucleic acids. It is served through three main portals: in the USA (https://rcsb.org), Europe (https://www.ebi.ac.uk/pdbe/) and Japan (https://pdbj.org/).

PDB entries are identified by a four-character code: one digit followed by three alphanumeric characters, e.g. "1avx" for the trypsin protein in complex with its inhibitor. The code always refers to the latest version of the atomic structure.
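
As a small illustration, the format of a classic four-character entry code (a digit followed by three letters or digits) can be checked with a regular expression. The helper name `is_pdb_code` is hypothetical and not part of Seamless or of any PDB tool; the check only covers the format and says nothing about whether the entry exists.

```python
import re

# Classic PDB codes: one digit followed by three letters or digits.
PDB_CODE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_pdb_code(code: str) -> bool:
    """Return True if `code` has the classic four-character PDB format."""
    return PDB_CODE.fullmatch(code) is not None


print(is_pdb_code("1avx"))
print(is_pdb_code("avx1"))
```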

Downloading a PDB entry by its code is therefore non-reproducible: the entry may have changed by the time you repeat the download. The PDB does maintain a versioned repository, but only major changes are stored there.

At the RPBS, reproducible distributions of the PDB are maintained. Once in a while, all changed entries of the PDB are downloaded and their checksums computed, creating a time-specific distribution. For each distribution, an index file can be downloaded containing all PDB entries and their checksums. These index files can be wrapped inside a Seamless DeepCell for convenience. 
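
The idea of a time-specific distribution can be sketched in a few lines of plain Python. This is only an illustration: the hash algorithm (SHA3-256 here), the JSON serialization and the function names `build_index` and `distribution_checksum` are assumptions, not the actual RPBS index format or tooling.

```python
import hashlib
import json


def build_index(entries: dict) -> dict:
    """Map each entry code to the hex checksum of its content (bytes)."""
    return {
        code: hashlib.sha3_256(content).hexdigest()
        for code, content in sorted(entries.items())
    }


def distribution_checksum(index: dict) -> str:
    """Checksum of the whole index, identifying one snapshot in time."""
    serialized = json.dumps(index, sort_keys=True, indent=2).encode()
    return hashlib.sha3_256(serialized).hexdigest()


# Two toy snapshots: entry "1avx" is revised between them, "1brs" is not.
snapshot_a = {"1avx": b"structure, version 1", "1brs": b"another structure"}
snapshot_b = {"1avx": b"structure, version 2", "1brs": b"another structure"}

index_a = build_index(snapshot_a)
index_b = build_index(snapshot_b)
print(distribution_checksum(index_a) != distribution_checksum(index_b))
```

Because the distribution checksum changes whenever any entry changes, storing that single checksum is enough to pin down the content of every entry in the snapshot.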

This notebook sets up a molecular web viewer for a reproducible PDB distribution. The user can then select a PDB entry among the ~200 000 entries, and the corresponding molecular structure will be shown. The viewer is guaranteed to show the same structure even if the PDB entry is changed later.

The RPBS also maintains a last-resort buffer server at https://buffer.rpbs.univ-paris-diderot.fr . This buffer server includes the PDB entries.

## Create a new Seamless project

This will set up the Seamless web page generator. Apart from that, the project is empty.

(If Seamless is installed locally, you would use the `seamless-new-project` command instead.)

In [None]:
import os
if not os.path.exists("web/"):
    os.system("python3 ~/seamless-scripts/new-project.py rpdb-viewer")
else:
    # To avoid merge conflicts, remove all existing customized web content
    os.system("rm -f web/webform.json web/index.html web/index.js web/*CONFLICT.*")

# new-project.py will generate a default Notebook for the project. Link it to the current notebook instead
os.system("rm -f rpdb-viewer.ipynb")
os.system("ln -s reproducible-pdb-viewer.ipynb rpdb-viewer.ipynb")

In [None]:
%run -i load-project.py
await load()

## Defining the reproducible PDB distribution

First, we need to define the reproducible PDB distribution. At the RPBS, there is the ***FAIR server*** where you can specify human-level metadata, such as the name of the dataset, the date, the version and/or the format. It returns the checksum of the distribution, the checksum of the ordered entries (keys), and some metadata:

In [None]:
!curl 'https://fair.rpbs.univ-paris-diderot.fr/machine/find_distribution?dataset=pdb&date=2022-11-27&type=deepcell'
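
The same query can be assembled from Python. The sketch below only reuses the endpoint and query parameters from the `curl` command above; the function names are hypothetical, and treating the response as JSON is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FAIR_SERVER = "https://fair.rpbs.univ-paris-diderot.fr"


def find_distribution_url(dataset: str, **metadata: str) -> str:
    """Build the find_distribution URL from human-level metadata."""
    query = urlencode({"dataset": dataset, **metadata})
    return f"{FAIR_SERVER}/machine/find_distribution?{query}"


def query_fair_server(dataset: str, **metadata: str):
    """Contact the FAIR server (assumes the response body is JSON)."""
    url = find_distribution_url(dataset, **metadata)
    with urlopen(url, timeout=30) as response:
        return json.load(response)


url = find_distribution_url("pdb", date="2022-11-27", type="deepcell")
print(url)
```

Calling `query_fair_server("pdb", date="2022-11-27", type="deepcell")` would perform the actual request; it is left out of the example so that the block runs without network access.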

### Seamless API
Seamless has an API where the FAIR server is contacted and the result is stored in a deep cell:

In [None]:
from seamless.highlevel import DeepCell
import json

date = "2022-11-27"
distribution = DeepCell.find_distribution("pdb", date=date, format="mmcif")
print(json.dumps(distribution, indent=2))

ctx.pdb = DeepCell()
ctx.pdb.define(distribution)

In [None]:
print("PDB date:", date)
print("Number of index keys (PDB entries): ", ctx.pdb.nkeys )
# Sizes are divided by powers of ten, so they are reported in decimal units (MB, GB)
pdb_index_size = "{:d} MB".format(int(ctx.pdb.index_size/10**6))
print("Size of the checksum index file: ", pdb_index_size )
if ctx.pdb.content_size is None:
    pdb_size = "<Unknown>"
else:
    pdb_size = "{:d} GB".format(int(ctx.pdb.content_size/10**9))
print("Total size of the Protein Data Bank (mmCIF format):", pdb_size )

### Strong reproducibility

Saving the workflow with `save` or `ctx.save_graph` will lead to strong reproducibility, since it stores the distribution checksums directly.

If you need strongly reproducible *notebook code*, you can embed the distribution checksums inside the code:

```python
ctx.pdb.define({
    "checksum": "57ce3e4487745320f68fa84e2e4cb4c431953b204812cf1f76bb011f032d6380",
    "keyorder": "8fe126582cd6933150d79027927393a86d8426669e48fc39a911c9f895f00e2e",
})
```

In [None]:
print("Download checksum index file...")
await ctx.computation()
print("Done")

## Accessing individual PDB entries

You can now get the checksum of each individual PDB entry.

In [None]:
print(ctx.pdb.data["1avx"])

In addition, the FAIR server maintains, for each checksum, a list of URLs where the data can be downloaded. 

There is no guarantee that the URL will yield the correct data, but because the checksum is known in advance, the download can be verified.

In [None]:
!curl https://fair.rpbs.univ-paris-diderot.fr/machine/access/2b0eeeac3bd3ba8d6e67aa262d9d2279dc672607af7a80414df10da1cb4f9cc2

### Seamless API
Seamless has an API `DeepCell.access(entry)` where: 

- The FAIR server is contacted with the entry's checksum, obtaining the above list of URLs.
- Using the list of URLs, the molecular structure is downloaded.
- The downloaded structure is verified against the checksum
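
The download-then-verify steps above can be sketched in plain Python. This is not how `DeepCell.access` is implemented: the function names are hypothetical, the hash algorithm (SHA3-256 by default here) is an assumption, and the downloads are replaced by in-memory callables so that the example runs offline.

```python
import hashlib
from typing import Callable, Iterable, Optional


def checksum_hex(data: bytes, algorithm: str = "sha3_256") -> str:
    """Hex digest of a buffer; the default algorithm is an assumption."""
    return hashlib.new(algorithm, data).hexdigest()


def first_verified(
    fetchers: Iterable[Callable[[], bytes]],
    expected: str,
    algorithm: str = "sha3_256",
) -> Optional[bytes]:
    """Try each source in turn; return the first buffer matching `expected`."""
    for fetch in fetchers:
        try:
            data = fetch()
        except Exception:
            continue  # unreachable source: try the next one
        if checksum_hex(data, algorithm) == expected.lower():
            return data
    return None  # no source yielded the correct data


def broken_source() -> bytes:
    raise ConnectionError("mirror is down")


good = b"data_1AVX\n# toy stand-in for an mmCIF file\n"
expected = checksum_hex(good)
sources = [lambda: b"modified entry", broken_source, lambda: good]
result = first_verified(sources, expected)
print(result == good)
```

A source that returns modified data or fails outright is skipped, so an unreliable URL can never silently change the result.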

In [None]:
print("Access PDB entry 1avx")
pdb_data = ctx.pdb.access("1avx")
print(pdb_data[:500] + "\n...")

## PDB workflow

The code below defines a workflow where the entry is defined in `ctx.pdb_code`, and the corresponding molecular structure is then stored (as text) in `ctx.pdb_structure`.

We can manipulate the entry and structure with a little Jupyter dashboard.

In [None]:
from seamless.highlevel import Cell, stdlib

ctx.all_pdb_codes = Cell("plain")
await ctx.translation()
ctx.all_pdb_codes.set_checksum(ctx.pdb.keyorder_checksum)

ctx.pdb_code = Cell("str").set("1avx")

ctx.include(stdlib.select)
ctx.pdb_structure = Cell("text")
ctx.select_pdb = ctx.lib.select(
    celltype="text",
    input=ctx.pdb,
    selected=ctx.pdb_code,
    output=ctx.pdb_structure,
)

In [None]:
from IPython.display import display
from ipywidgets import Text, Textarea

w = Text()
ctx.pdb_code.traitlet().link(w)
display(w)
w = Textarea()
ctx.pdb_structure.traitlet().connect(w)
display(w)
await ctx.computation()


## PDB web visualization

The final step is to define a web page where a PDB code is selected and the corresponding molecular structure is visualized.

For this purpose, the Seamless web page generator contains the "bigselect" and "nglviewer" webunits. See their documentation below.

In [None]:
from seamless.highlevel import webunits
webunits.bigselect?

In [None]:
webunits.nglviewer?

In [None]:
# Define the PDB viewer.

# 1. Web selector (with tab completion) of PDB code 
webunits.bigselect(ctx, options=ctx.all_pdb_codes, selected=ctx.pdb_code)

# 2. Define molecular representation (defined in representation.yaml)
ctx.representation = Cell("yaml").share(readonly=False)
ctx.representation.mount("representation.yaml")
ctx.representation2 = Cell("plain")
ctx.representation2 = ctx.representation

# 3. Molecular visualization based on the NGL web viewer
webunits.nglviewer(ctx, ctx.pdb_structure, ctx.representation2, format="cif")

await ctx.computation()

### BinderHub only: unify the Seamless ports
(This allows us to serve Seamless HTTP cells and the Seamless web interface through JupyterLab, because the browser cannot reach localhost ports on a BinderHub instance.)

In [None]:
from seamless import shareserver
cmd = "python3 ~/seamless-scripts/webproxy.py 6543 http://localhost:{0} ws://localhost:{1}".format(shareserver.rest_port, shareserver.update_port)
get_ipython().run_cell_magic('script', 'bash --bg --out webproxy.log', cmd)

In [None]:
%%javascript
let base = window.location.protocol + "//" + window.location.hostname
// location.port is an empty string when the default port is in use
if (window.location.port !== "") {
    base = base + ":" + window.location.port
}
let v = window.location.pathname
let vv = v.split("/")
for (let i = 1; i < vv.length - 1; i++) {
    if ((vv[i] == "lab" || vv[i] == "doc") && (vv[i+1] == "tree" || vv[i+3] == "tree")) {
        window.JUPYTERLAB_URL = base + vv.slice(0, i).join("/") + "/proxy/6543" 
        break
    }
}
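
To make the path handling in the cell above easier to follow, here is a Python mirror of the same logic. It is purely illustrative (the notebook itself needs the JavaScript version); the function name and the example URL are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def jupyterlab_proxy_url(page_url: str, proxy_port: int = 6543) -> Optional[str]:
    """Derive the proxy URL from the URL of the JupyterLab page."""
    parts = urlsplit(page_url)
    base = f"{parts.scheme}://{parts.netloc}"
    segments = parts.path.split("/")

    def seg(k: int) -> Optional[str]:
        # Like a JavaScript array, give None instead of failing out of range.
        return segments[k] if k < len(segments) else None

    for i in range(1, len(segments) - 1):
        if seg(i) in ("lab", "doc") and "tree" in (seg(i + 1), seg(i + 3)):
            # Keep everything before "lab"/"doc", then append the proxy path.
            return base + "/".join(segments[:i]) + f"/proxy/{proxy_port}"
    return None


example = "https://hub.example.org/user/abc/lab/tree/reproducible-pdb-viewer.ipynb"
print(jupyterlab_proxy_url(example))
```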


## PDB viewer

In [None]:
%%javascript
var ele = document.createElement("div")
element.append(ele)
ele.innerHTML = "<b><a href=\"" + window.JUPYTERLAB_URL  + "/status/index.html\" target=\"_blank\"> The PDB viewer can now be opened by clicking here</a></b>"


The web page has been lightly customized by editing `web/webform.json`. Compare with `web/webform-AUTOGEN.json` to observe the modifications.