<a href="https://colab.research.google.com/github/rajaonsonella/csc2541_lab1/blob/main/lab1_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC2541 - Lab 1: Deep Learning for structure-based drug design (SBDD)

Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets that have been identified as having therapeutic potential. Specifically, this approach leverages the 3D structure of the protein target to rationally design a ligand.

In this lab, we aim to give you an introduction to SBDD and the latest deep learning models used in the field. The lab is structured into three parts:

### Table of Contents

1.   Pre-module: Introduction to the Protein Data Bank (PDB) (~30 mins)
2.   Module: Exploring how well ML performs at ligand pose prediction (~50 mins)
3.   Post-module (assignment): Evaluating a state-of-the-art ML model (Boltz2) on pose prediction for various protein-ligand complexes (~60 mins)

# Colab Set Up

In [None]:
from IPython.utils import io
print('Downloading necessary libraries...')
def install_libraries():
  !pip install py3Dmol -q
  !pip install biopython -q
with io.capture_output() as captured:
  install_libraries()

Downloading necessary libraries...


In [None]:
import ipywidgets
from IPython.display import display

# Pre-module: Introduction to the Protein Data Bank (PDB)

The Protein Data Bank (PDB) is a **global repository of 3D structural data for biological macromolecules** like proteins, DNA, RNA, and their complexes with ligands or drugs.


It stores experimentally determined structures obtained through techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). Each entry in the PDB contains atomic coordinates, experimental details, and annotations describing function and interactions of the macromolecule.

Learn more about the history of PDB [here](https://www.rcsb.org/pages/about-us/history) and check out the overall growth of the database [here](https://www.rcsb.org/stats/growth/growth-released-structures)!

**In this pre-module we are going to:**

* Review the basics about proteins
* Learn more about the file format used to store this type of data
* Learn how to programmatically retrieve protein structures and extract key metadata
* Visualize protein structures interactively in 3D using Python tools

## Protein 101 recap

TBD

## The PDBx/mmCIF file format

The files found in the PDB are available in several formats (PDB, mmCIF, XML). We will focus on the PDBx/mmCIF file format, which is the current preferred format.


**Note:** This is a quick intro to the file format; the full documentation can be found at [https://mmcif.wwpdb.org/](https://mmcif.wwpdb.org/)
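
As a rough orientation on the layout: atomic coordinates sit in a `loop_` block, where the lines starting with `_atom_site.` name the columns and each following line describes one atom. The sketch below parses a tiny hand-written fragment (toy values, not taken from a real PDB entry) with plain string splitting. `parse_atom_site` is a hypothetical helper written for illustration only; it assumes unquoted, whitespace-separated values, which does not hold for every real mmCIF file.

```python
# Toy mmCIF fragment: illustrative values only, not copied from a real entry.
TOY_CIF = """\
data_TOY
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
ATOM   1 N  N  GLY A 1 -8.863 16.944 14.289 1.00
ATOM   2 C  CA GLY A 1 -9.929 17.026 13.244 1.00
HETATM 3 ZN ZN ZN  B . 0.000  0.000  7.891  0.33
#
"""


def parse_atom_site(text):
    """Return the _atom_site loop as a list of dicts (hypothetical helper).

    Assumption: values are whitespace-separated and never quoted.
    """
    columns, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            # Column declaration: keep the part after the category name.
            columns.append(line.split(".", 1)[1])
        elif columns and line and not line.startswith(("#", "_", "loop_", "data_")):
            # Data row: pair each value with its column name.
            rows.append(dict(zip(columns, line.split())))
        elif columns and rows:
            # Anything else after the data rows ends the loop.
            break
    return rows


atoms = parse_atom_site(TOY_CIF)
for atom in atoms:
    print(atom["group_PDB"], atom["label_comp_id"],
          atom["Cartn_x"], atom["Cartn_y"], atom["Cartn_z"])
```

Real files add quoting, multi-line values, and many more categories, so a dedicated parser remains the safer choice in practice.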


Let's first download a file and open it to see what the format looks like:

In [None]:
# Download an mmCIF file

# Colab tip -- You can download the file to your local machine by:
# 1) clicking on the folder icon on the colab menu
# 2) locating the file and double-clicking on it

# Let's specify the PDB ID of the file we want to download
pdb_id = "4INS"

import urllib.request
urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_id}.cif", f"{pdb_id}.cif")

# View first few lines
num_lines_to_print = 10
with open(f"{pdb_id}.cif") as f:
    print(f"Printing the first {num_lines_to_print} lines in the file: \n")
    for _ in range(num_lines_to_print):
        print(f.readline().strip())

As you can see, these files are hard to handle by hand, which is why many Python libraries ship built-in parsers that retrieve the data for us so we can focus on the rest. The lines we usually care about are the ones describing the 3D structure of the protein. Let's use one of those Python packages to extract this information more easily.

In [None]:
!pip install biopandas -q

In [None]:
from biopandas.mmcif import PandasMmcif

# Read the mmCIF file
mmcif = PandasMmcif().read_mmcif(f'{pdb_id}.cif')

# Display the different "keys" of the returned dictionary
print(mmcif.df.keys())

We created an object ```mmcif``` whose ```df``` attribute is a dictionary of tables with 3 keys:

* **ATOM** records represent atoms that belong to standard polymer chains such as proteins and nucleic acids.

* **HETATM** records describe atoms from non-polymer molecules like ligands, metal ions (e.g., zinc), and solvent molecules (e.g., water).

* **ANISOU** records contain detailed information on atomic vibrations (anisotropic displacement parameters) derived from crystallography, which provide insight into atomic motion but are optional for most analyses.

Understanding these distinctions helps when filtering or visualizing structural components relevant to your study.
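
This kind of filtering can be sketched on a few toy records. The example assumes the column names `group_PDB`, `label_comp_id`, `label_asym_id` and `label_seq_id` from the mmCIF `_atom_site` category, and uses plain dictionaries with made-up values so that it runs on its own. `split_hetero` and `count_residues` are hypothetical helper names, not part of any library.

```python
# Toy atom records with mmCIF-style column names (made-up values).
records = [
    {"group_PDB": "ATOM", "label_comp_id": "GLY", "label_asym_id": "A", "label_seq_id": "1"},
    {"group_PDB": "ATOM", "label_comp_id": "GLY", "label_asym_id": "A", "label_seq_id": "1"},
    {"group_PDB": "ATOM", "label_comp_id": "ILE", "label_asym_id": "A", "label_seq_id": "2"},
    {"group_PDB": "ATOM", "label_comp_id": "PHE", "label_asym_id": "B", "label_seq_id": "1"},
    {"group_PDB": "HETATM", "label_comp_id": "ZN", "label_asym_id": "C", "label_seq_id": "."},
    {"group_PDB": "HETATM", "label_comp_id": "HOH", "label_asym_id": "D", "label_seq_id": "."},
    {"group_PDB": "HETATM", "label_comp_id": "HOH", "label_asym_id": "D", "label_seq_id": "."},
]


def split_hetero(rows):
    """Separate HETATM rows into water (HOH) and everything else."""
    hetero = [r for r in rows if r["group_PDB"] == "HETATM"]
    waters = [r for r in hetero if r["label_comp_id"] == "HOH"]
    others = [r for r in hetero if r["label_comp_id"] != "HOH"]
    return waters, others


def count_residues(rows):
    """Count distinct (chain, residue number) pairs among ATOM rows."""
    polymer = [r for r in rows if r["group_PDB"] == "ATOM"]
    return len({(r["label_asym_id"], r["label_seq_id"]) for r in polymer})


waters, others = split_hetero(records)
print(len(waters), "water atoms;", len(others), "other hetero atoms")
print(count_residues(records), "polymer residues")
```

Assuming the parsed tables keep these mmCIF column names, the same helpers could be fed each table converted with `DataFrame.to_dict("records")`, although boolean masks applied directly to the DataFrame are the more idiomatic route.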

In [None]:
mmcif.df["ATOM"]

In [None]:
mmcif.df["HETATM"]

Spend some time exploring the tabular data visualization of this protein, and start imagining having to deal with several thousand of these files!

**Quiz:**

* Are there any solvent molecules present in this structure?
* How many amino acids are there in this protein? How many heavy atoms? How many chains?
* What does the "occupancy" column mean? Why is this important?



Now, this is much more readable than the raw CIF format, but it is not easy to get a feel for 3D data by looking at a table. Let's add some visualization to get a real sense of what we are dealing with.

## Visualizing proteins with py3Dmol



To do so, let's use the py3Dmol package created by the Koes group ([see paper](https://doi.org/10.1093/bioinformatics/btu829)).

In [None]:
import py3Dmol
view = py3Dmol.view(query=f'pdb:{pdb_id}')

# Set the style, we have to add separate visualizations for the hetero atoms
view.setStyle({'cartoon':{'color':'spectrum'}})
view.addStyle({'atom': 'ZN'}, {'sphere': {'color': 'grey', 'radius': 1.5}})
view.addStyle({'resn': 'HOH'}, {'sphere': {'color': 'red', 'radius': 0.5}})
view.show()

Let's exclude the water molecules from our visualization from now on. It's important to know they are there because of the experimental nature of the structure, but they clutter the view. On top of that, let's make a few small changes to the rendering so we can see more.

In [None]:
view = py3Dmol.view(query=f'pdb:{pdb_id}')
view.setStyle({'cartoon': {'color':'spectrum'}})
view.addStyle({'atom': 'ZN'}, {'sphere': {'color': 'grey', 'radius': 1.5}})
view.addSurface(py3Dmol.VDW,{'opacity':0.7,'color':'white'}, {'not': {'resn': 'HOH'}})
view.show()

**Quiz:**

* What does having solvent molecules within the crystal imply when modeling protein-ligand interactions?
* Another interesting feature of this protein structure is its two Zn ions. Can AlphaFold2 handle this type of atom? If not, what are the potential consequences of this missing feature?