<a href="https://colab.research.google.com/github/rajaonsonella/csc2541_lab1/blob/main/lab1_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC2541 - Lab 1: Deep Learning for structure-based drug design (SBDD)

Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to predetermined protein targets that have been identified as having therapeutic potential. Specifically, this approach uses the 3D structure of the protein target to design a ligand rationally.

This lab introduces you to SBDD and to the latest deep learning models used in the field. It is structured in three parts:

### Table of Contents

1.   Pre-module: Introduction to the Protein Data Bank (PDB) (~30 mins)
2.   Module: Exploring how well ML performs at ligand pose prediction (~50 mins)
3.   Post-module (assignment): Evaluating a state-of-the-art ML model (Boltz2) on pose prediction for various protein-ligand complexes (~60 mins)

# Colab Set Up

In [None]:
import ipywidgets
from IPython.display import display

# Pre-module: Introduction to the Protein Data Bank (PDB)

The Protein Data Bank (PDB) is a **global repository of 3D structural data for biological macromolecules** like proteins, DNA, RNA, and their complexes with ligands or drugs.


It stores experimentally determined structures obtained through techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). Each entry in the PDB contains atomic coordinates, experimental details, and annotations describing the function and interactions of the macromolecule.

Learn more about the history of the PDB [here](https://www.rcsb.org/pages/about-us/history) and check out the overall growth of the database [here](https://www.rcsb.org/stats/growth/growth-released-structures)!

**In this pre-module we are going to:**

* Review the basics about proteins
* Learn more about the file format used to store this type of data
* Learn how to programmatically retrieve protein structures and extract key metadata
* Visualize protein structures interactively in 3D using Python tools

## The PDBx/mmCIF file format

The files found in the PDB are available in several formats (PDB, mmCIF, XML). We will focus on the PDBx/mmCIF file format, which is the current preferred format.


**Note:** This is a quick intro to the file format; the full documentation can be found at [https://mmcif.wwpdb.org/](https://mmcif.wwpdb.org/)


Let's first download a file and open it to see what the format looks like:

In [None]:
# Download an mmCIF file
import urllib.request

# Colab tip -- You can download the file to your local machine by:
# 1) clicking on the folder icon on the Colab menu
# 2) locating the file and double-clicking on it

# Let's specify the PDB ID of the file we want to download
pdb_id = "4INS"

# Fetch the file from RCSB and save it in the current working directory
urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_id}.cif", f"{pdb_id}.cif")

# View first few lines (reuse pdb_id so the file name follows the ID chosen above)
num_lines_to_print = 10
with open(f"{pdb_id}.cif") as f:
    print(f"Printing the first {num_lines_to_print} lines in the file: \n")
    for _ in range(num_lines_to_print):
        print(f.readline().strip())

Printing the first 10 lines in the file: 

data_4INS
#
_entry.id   4INS
#
_audit_conform.dict_name       mmcif_pdbx.dic
_audit_conform.dict_version    5.397
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic
#
loop_
_database_2.database_id


As you can see, these files are hard to handle by hand, which is why many Python libraries ship built-in parsers that retrieve the data for us so that we can focus on the rest. We will use one of those parsers shortly, but first let's see what the protein structure in this file looks like:

In [None]:
import py3Dmol

# Fetch the structure by its PDB ID and draw it as a cartoon colored along the chain
viewer = py3Dmol.view(query=f'pdb:{pdb_id}')
viewer.setStyle({'cartoon':{'color':'spectrum'}})

<py3Dmol.view at 0x10af41540>

Let's use one of those Python packages to extract information from this file more easily.

In [None]:
from biopandas.mmcif import PandasMmcif

# Read the mmCIF file
mmcif = PandasMmcif().read_mmcif('4INS.cif')

In [None]:
mmcif.df.keys()

dict_keys(['ATOM', 'HETATM', 'ANISOU'])
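
In the mmCIF file, atomic coordinates are stored in a `loop_` table under the `_atom_site` category, whose `group_PDB` column labels each row as `ATOM` or `HETATM`; the `ATOM` and `HETATM` keys above appear to follow that split. The sketch below illustrates the idea on a toy excerpt. The rows are made-up values, not coordinates from 4INS, only a subset of the real `_atom_site` columns is shown, and `parse_loop` is a hypothetical helper rather than the biopandas implementation.

```python
# Toy `_atom_site` loop. The values are invented for illustration only.
toy_loop = """loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM   1 N  N  GLY 0.0 0.0 0.0
ATOM   2 C  CA GLY 1.5 0.0 0.0
HETATM 3 ZN ZN ZN  5.0 5.0 5.0
#
"""


def parse_loop(text):
    """Hypothetical helper: read one whitespace-separated `loop_` table.

    Assumption: no quoted or multi-line values, one row per line.
    """
    columns, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_"):
            # Header line: keep the item name after the category prefix
            columns.append(line.split(".", 1)[1])
        else:
            # Data line: pair each value with its column name
            rows.append(dict(zip(columns, line.split())))
    return columns, rows


columns, rows = parse_loop(toy_loop)

# Group the rows by record type, mirroring the ATOM / HETATM split
tables = {}
for row in rows:
    tables.setdefault(row["group_PDB"], []).append(row)

print(sorted(tables))
print(len(tables["ATOM"]), len(tables["HETATM"]))
```

With this toy input, the two standard residue atoms land in the `ATOM` group and the zinc row lands in `HETATM`.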