In [2]:
from IPython import display

# An introduction to structural bioinformatics tools and databases

## What is structural bioinformatics?

 - Definition:
   - Branch of bioinformatics that leverages computational power & experimental data to predict biomolecular structures and analyse their behavior
 - Importance:
   - Crucial role in the design of drugs and the understanding of disease mechanisms at the molecular level
   - Accelerates the discovery of new drugs by reducing time and cost of experimental approaches
     - Allows for more focused lab assays by prioritizing promising subsets of data
       - Imagine narrowing down millions of compounds to a few in vitro lab experiments

### General objectives in structural bioinformatics
- Predicting 3D structures
  - Nucleic acids, proteins, small molecules, and any combination of these
- Determining structural landmarks in proteins (structural annotation)
  - secondary structures, motifs (supersecondary structures), domains, catalytic/allosteric sites
- Determining folding mechanisms of proteins
  - Process by which a primary sequence becomes a functional 3D structure
  - Defective folding is linked to several disorders
    - e.g. Alzheimer's disease (amyloid beta)
- Investigating molecular interactions
  - e.g. How strongly does a compound bind to a protein receptor?
- Investigating the effect of residue mutations
    - e.g. A single point mutation in the beta-globin gene makes hemoglobin polymerise, giving red blood cells their sickle shape in Sickle Cell Disease (SCD)
    - e.g. Antiretroviral drug resistance in HIV - ARVs lose efficacy over time
- Investigating the dynamics of proteins and their complexes
  - Changes that occur over time, unobserved in static structures
    - cryptic pockets, unknown conformations/interactions, drug stability, etc
- Drug discovery
  - Screening for potential modulators of proteins
    - modulators can be activators or inhibitors of a protein
- There are many more techniques that can give various kinds of insights

### How is structural bioinformatics practised nowadays?

- Based on the premise that structure determines function
  - Parts that are far apart in a protein sequence (1-D) can be brought close together in the folded 3-D structure
    - Distal residue interactions (H-bonds, disulphides, salt-bridges, etc)
  - Knowing shapes of receptors and other molecules facilitates the discovery of (protein or non-protein) binding partners
- Computational power and improved algorithms/techniques
  - Artificial Intelligence
    - AlphaFold can predict the 3-D structures of many proteins with very high accuracy
      - This is promising, even though it has certain limitations
  - Increase in computational power/efficiency and its accessibility
    - More powerful CPUs and GPUs on the market
    - Availability of high performance computing (HPC) clusters to more scientists (e.g. CHPC in South Africa)
    - Open source HPC software enables anyone to set up their own computer cluster
    - Cloud computing (e.g. GPUs on Google Colaboratory) is available to anyone with a Google account

---

## Structural bioinformatics is informed by experimental methods

- It is not just a matter of running completely random simulations on a computer
  - The computer simulations have to be rooted in biology/reality
  - Garbage In Garbage Out
- But there are also highly accurate quantum mechanical (QM) predictions
  - Based on functions that describe a system using the probabilities of finding its electrons at given positions
  - QM is very computationally expensive to use for large systems
  - We won't delve any further into that area

### Experimental techniques for determining molecular (protein) 3D structures

As of September 2024, the RCSB PDB was composed of:
- Experimental models (225,158 models)
  - 83.5% X-ray crystallographic structures
  - 6.3% solution NMR structures
  - 9.9% cryo-EM structures
- Computed Structure Models (1,068,577 models)
  - 100% AlphaFold predictions

#### Nuclear magnetic resonance (NMR) Spectroscopy
- What is it?
  - A highly concentrated protein solution is exposed to a strong magnetic field, and the resulting spectra are used to determine interatomic distances and angles
  - Some databases can report multiple observed conformations for NMR entries
    - Example: 20 models from the calcyclin [1A03](https://www.rcsb.org/structure/1A03) from RCSB 
  - [NMR spectroscopy visualized](https://www.youtube.com/watch?v=RZLew6Ff-JE)
- Advantages
  - Good for studying flexible proteins
  - Proteins can be studied in solution
- Disadvantages
  - Typically limited to small to medium sized proteins

#### X-ray crystallography
- What is it?
  - A highly purified protein is crystallised in a buffer and then exposed to X-rays. The resulting diffraction pattern is used to compute an electron density map, into which a 3D structure is then fitted.
- Advantages
  - High resolution, depending on crystal quality
- Disadvantages
  - Some proteins are difficult to crystallise (e.g. membrane proteins and highly flexible proteins)
  - Difficult cases may require multiple crystallisation attempts, which can increase costs

#### Cryo-Electron Microscopy (cryo-EM)
- What is it?
  - A purified protein solution is frozen and is photographed thousands of times using an electron microscope to produce 2D snapshots that are then computationally processed to generate the 3D structure.
  - [What is Cryo-Electron Microscopy (Cryo-EM)? (YouTube video)](https://www.youtube.com/watch?v=Qq8DO-4BnIY)
  - [A 3 minute introduction to CryoEM](https://www.youtube.com/watch?v=BJKkC0W-6Qk)
- Advantages
  - Suitable for very large proteins/complexes
- Disadvantages
  - Lower resolution, even though the resolution is improving over time
  - Expensive equipment



---

## Structure file formats

### The PDB format

<img src="figures/PDB_format.png">
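
As a rough illustration of the fixed-width layout shown in the figure, the sketch below builds one `ATOM` record and reads it back by column position. The helper names (`make_atom_line`, `parse_atom_line`) and the coordinate values are made up for this example; only the column positions follow the published PDB format. Real files are better handled with a dedicated parser.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z,
                   occupancy, b_factor, element):
    """Build one fixed-width ATOM record (hypothetical helper for illustration)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4}{' '}{res_name:>3} {chain}"
        f"{res_seq:>4}{' '}   {x:>8.3f}{y:>8.3f}{z:>8.3f}"
        f"{occupancy:>6.2f}{b_factor:>6.2f}{'':10}{element:>2}"
    )


def parse_atom_line(line):
    """Slice an ATOM record by column (0-based slices of the 1-based PDB columns)."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


# Illustrative values only, not taken from a real entry.
example_line = make_atom_line(1, " N", "PRO", "A", 1,
                              29.361, 39.686, 5.862, 1.00, 38.10, "N")
atom = parse_atom_line(example_line)
print(example_line)
print(atom)
```

Because the format is column-based rather than delimiter-based, splitting on whitespace can fail when neighbouring fields touch, which is why the sketch slices by position.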

### Visual representations of protein structures

In [3]:
display.Image(url="https://www.compchems.com/pymol_intro/representation_2.webp", width=1100)

#### PyMOL demonstration (standalone tool)
- Load PyMOL using the Anaconda software
- Run PyMOL, using the command: `pymol`
- Fetch an example structure, using the command: `fetch 1HIV, type=pdb`
- Explore different representations (surface, sticks, wireframe, lines, etc)

#### RCSB demonstration (web server with linked database)
- Go to www.rcsb.org
- Search for "1HIV"
- Explore the interface
- Explore different representations (surface, sticks, wireframe, lines, etc)

---

## Bioinformatics databases

- Scientists around the world continuously produce biomolecular data sets
    - e.g. X-ray crystal structures, NMR structures, protein/nucleic acid sequences
    - Some data is publicly available, while some can remain private or be shared upon request
- Experimental data is usually expensive to produce
  - costs a lot of time and money
  - Can be enormous in size as well (e.g. molecular dynamics simulations)
  - Various centers have decided to store, classify and make available these data sets using databases

### What is a database?

- A database is a **computerised archive used to store and organise data** in such a way that information can be retrieved easily via a variety of search criteria.
- Think of the simplest database, which is a table containing rows and columns
    - Each record in a database is called an **entry**
    - Each **entry** (row) contains a number of **fields** (columns)

### Types of databases

- A flat file (e.g. a CSV file or a single spreadsheet) can be used as a database
    - Disadvantages:
        - Cannot store relations
        - Slow, as the entire table needs to be read
        - Uses more storage, as repeated values are stored repeatedly
- Relational databases - store relational information between tables
    - Advantages:
        - Uses less storage space, as each value is stored once and tables are combined only as needed
        - Fast; only the required tables, entries and fields are read
        - Relational tables drive knowledge discovery (associating related data from separate experiments)
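
To make the flat-file versus relational distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names and identifiers (`PROT1`, `STRUC1`, ...) are hypothetical and chosen only for illustration; they do not correspond to any real database schema.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each protein is stored once ...
cur.execute("CREATE TABLE proteins (protein_id TEXT PRIMARY KEY, name TEXT)")
# ... and each structure refers to its protein through protein_id (the relation).
cur.execute(
    """CREATE TABLE structures (
           structure_id TEXT PRIMARY KEY,
           protein_id   TEXT REFERENCES proteins(protein_id),
           method       TEXT,
           resolution   REAL)"""
)

# Hypothetical entries (rows); each value in a row is a field (column).
cur.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [("PROT1", "Example protease"), ("PROT2", "Example kinase")],
)
cur.executemany(
    "INSERT INTO structures VALUES (?, ?, ?, ?)",
    [
        ("STRUC1", "PROT1", "X-ray", 1.8),
        ("STRUC2", "PROT1", "NMR", None),
        ("STRUC3", "PROT2", "X-ray", 2.5),
    ],
)
conn.commit()


def structures_for(protein_name):
    """Join the two tables to list the structures of one protein."""
    rows = cur.execute(
        """SELECT s.structure_id, s.method
           FROM structures AS s
           JOIN proteins AS p ON p.protein_id = s.protein_id
           WHERE p.name = ?
           ORDER BY s.structure_id""",
        (protein_name,),
    )
    return rows.fetchall()


print(structures_for("Example protease"))
```

In a flat file, the protein name would be repeated on every structure row; here it is stored once and joined in only when a query asks for it.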

---

## Computational approaches based on 3D structures

### Protein structure prediction
- **_Ab initio_ modeling** (Latin for "from the beginning")
  - Uses partly stochastic (random) processes to generate starting conformations
  - Employs a set of energy functions based on physical laws to search for conformations that satisfy them
  - The collection of functions is known as the "force field"
  - The search is an exploration of a "conformational landscape"
  - The conformation that best satisfies the functions is a minimum-energy conformation
    - Think of the protein as someone searching around campus for a place to stay
    - Once they find such a place, they can rest
  - Features:
    - does not require a pre-existing template
    - lower accuracy
    - More computationally expensive (time & hardware), and is thus practical mostly for small proteins
    - Difficult to validate predictions, as there would be no reference to compare them to
  - [QUARK](https://seq2fun.dcmb.med.umich.edu//QUARK/), [I-TASSER](https://zhanggroup.org/I-TASSER/) 
- **Homology modeling**:
  - Predicts the 3D structure of a protein using the known structure of a homologous protein as a template.
  - Assumes that proteins with similar sequences have similar structures (mostly true)
  - Features:
    - Highly accurate, as long as a high quality template is available
    - Less computationally demanding, and is thus relatively fast
  - [MODELLER](https://salilab.org/modeller/)
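
The stochastic search over a conformational landscape described for ab initio modeling can be sketched with a toy example. Everything here is an assumption made for illustration: the "conformation" is just a list of torsion angles, the "force field" is a single periodic torsion term per angle, and the search accepts only random moves that lower the energy. Real ab initio tools use far richer energy functions and sampling schemes.

```python
import math
import random


def torsion_energy(angles, k=1.0, n=3):
    """Toy force field: one periodic term k * (1 + cos(n * phi)) per torsion angle."""
    return sum(k * (1.0 + math.cos(n * phi)) for phi in angles)


def random_search(n_angles=5, n_steps=5000, max_step=0.2, seed=0):
    """Start from a random conformation and keep random moves that lower the energy."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    start_energy = torsion_energy(angles)
    energy = start_energy
    for _ in range(n_steps):
        i = rng.randrange(n_angles)            # pick one angle to perturb
        trial = list(angles)
        trial[i] += rng.uniform(-max_step, max_step)
        trial_energy = torsion_energy(trial)
        if trial_energy < energy:              # accept only downhill moves
            angles, energy = trial, trial_energy
    return start_energy, energy, angles


start_energy, final_energy, final_angles = random_search()
print(f"start energy: {start_energy:.3f}  final energy: {final_energy:.3f}")
```

Accepting only downhill moves finds the nearest minimum, not necessarily the global one, which is one reason practical methods combine many random starts with smarter acceptance rules.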

#### The homology modeling approach (MODELLER software)
1. Obtain a target sequence (the "target" is the protein whose structure you want to determine)
2. Search for templates from a structural database using sequence similarity search (e.g. BLAST)
3. Choose a suitable template structure (high quality and high coverage)
4. Align the target sequence to the template sequence
   - derive spatial restraints (positions, bond angles, etc) from template
   - sequence alignment has to span the entirety of the target sequence (with gaps if needed)
5. Build several 3D models from restraints
   - Choose the best one using various quality metrics (local / global)
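
Template selection (step 3) usually weighs sequence identity and coverage. The sketch below computes both from a pairwise alignment, assuming gaps are written as `-`. The function name, the two short sequences and the particular definitions (identity over aligned columns, coverage over target residues) are illustrative choices; other tools may define these percentages differently.

```python
def alignment_stats(aligned_target, aligned_template, gap="-"):
    """Return (percent identity, percent target coverage) for a pairwise alignment."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    # Columns where both sequences have a residue (no gap on either side).
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    identical = sum(1 for a, b in pairs if a == b)
    target_length = sum(1 for a in aligned_target if a != gap)

    identity = 100.0 * identical / len(pairs) if pairs else 0.0
    coverage = 100.0 * len(pairs) / target_length if target_length else 0.0
    return identity, coverage


# Made-up sequences: one gap on each side and one mismatch at the last column.
identity, coverage = alignment_stats("MKT-AYIAK", "MKTLAYI-R")
print(f"identity: {identity:.1f}%  coverage: {coverage:.1f}%")
```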

### Molecular dynamics - the big picture

In [7]:
display.Video("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545671/bin/mmc2.mp4", width=900)

---

### Molecular docking
- The basic principle - finding if and where two entities fit together
  - Think of it as finding the right key to open a door, in the dark
  - It's more like lockpicking - finding a key that will fit the door
- There are two typical approaches in molecular docking:
  - A small molecule docked against a protein
  - A protein docked against another protein
- Flexible docking
  - Most docking tasks are performed with the ligand being flexible
  - One can also make part of a protein flexible, to increase the chances of finding a fit
    - A few receptor residues are made flexible, in addition to the small molecule
    - More computationally costly
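
The "finding where two entities fit" idea can be sketched as a search over poses combined with a scoring function. The toy below is a simplification in every respect: it is 2D, both molecules are rigid, the receptor is a U-shaped set of points, the ligand has two atoms, and the score simply rewards contacts and penalises clashes. The coordinates, cut-offs and function names are invented for illustration and do not reflect how any particular docking program works.

```python
import math

# Hypothetical U-shaped "pocket": two walls (x = 0 and x = 2) and a floor (y = 0).
RECEPTOR = [(0, 0), (1, 0), (2, 0), (0, 1), (2, 1), (0, 2), (2, 2), (0, 3), (2, 3)]
# Rigid two-atom "ligand", given relative to its own origin.
LIGAND = [(0.0, 0.0), (0.0, 1.0)]


def score_pose(dx, dy, clash=0.8, contact=1.5):
    """Score a translated ligand: +1 per contact, -10 per steric clash."""
    score = 0.0
    for lx, ly in LIGAND:
        for rx, ry in RECEPTOR:
            d = math.hypot(lx + dx - rx, ly + dy - ry)
            if d < clash:
                score -= 10.0   # atoms overlap
            elif d <= contact:
                score += 1.0    # favourable contact
    return score


def dock(step=0.5, lo=-4, hi=8):
    """Exhaustively try translations on a grid and return the best-scoring one."""
    poses = [(i * step, j * step)
             for i in range(lo, hi + 1)
             for j in range(lo, hi + 1)]
    return max(poses, key=lambda pose: score_pose(*pose))


best_pose = dock()
best_score = score_pose(*best_pose)
print(f"best translation: {best_pose}, score: {best_score}")
```

Flexible docking extends this idea by also searching over ligand (and sometimes receptor side-chain) conformations, which is why it costs more compute.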

### Molecular dynamics simulations
  - Computer simulation of molecules, usually under physiological conditions
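
At its core, a molecular dynamics simulation repeatedly computes forces and advances positions and velocities by a small time step. The sketch below applies the velocity Verlet integration scheme to a single particle on a harmonic spring (a crude stand-in for a bond vibration). The parameters are unitless and arbitrary, and the function name is made up; production MD engines handle many atoms, full force fields, thermostats and much more.

```python
import math


def simulate_spring(x0=1.0, v0=0.0, k=1.0, mass=1.0, dt=0.01, n_steps=1000):
    """Integrate a 1D harmonic oscillator with the velocity Verlet scheme."""

    def force(x):
        return -k * x  # Hooke's law

    def total_energy(x, v):
        return 0.5 * mass * v * v + 0.5 * k * x * x

    x, v = x0, v0
    f = force(x)
    energies = [total_energy(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        energies.append(total_energy(x, v))
    return x, v, energies


x_end, v_end, energies = simulate_spring()
drift = max(abs(e - energies[0]) for e in energies)
print(f"final position: {x_end:.4f}  (analytic: {math.cos(10.0):.4f})")
print(f"largest energy deviation: {drift:.2e}")
```

With these settings the analytic solution is x(t) = cos(t), so the final position can be compared with cos(10); the small energy deviation is the reason symplectic schemes like this are popular in MD.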

---

## Databases & tools dealing with 3D molecular structures

### Molecular structure databases
  - These are essential, centralized repositories that store and make accessible structural data
    - [RCSB PDB](http://rcsb.org), [UniProt](http://uniprot.org), [AlphaFold DB](https://alphafold.ebi.ac.uk)
  - Allow the **reuse of experimental data**, thus **minimizing costs** and **optimizing the use of resources and research efforts**
  - **Integrate information/annotations** from various sources, which allows new discoveries to be made
    - Annotations can be transferred to close homologs that are not yet fully described
  - Publicly available experimental structures / computed structure models can be **immediately accessed all over the world**
  - Databases are not limited to storage; many also **provide tools** for visualization
    - Example: RCSB provides a sophisticated visualizer
  - Allow for more **powerful queries** (searches) to be made
    - e.g. You could search for different states of the same protein (drug bound/unbound/orthologs/etc)
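
Besides their web interfaces, these databases can be accessed from scripts. The sketch below only builds download URLs and defines a small fetch helper; the URL patterns are assumptions based on how the RCSB file server and the UniProt REST service are commonly addressed at the time of writing, and they may change. The helper names are hypothetical, and `P69905` is used purely as an example accession.

```python
from urllib.request import urlopen


def rcsb_pdb_url(pdb_id):
    """URL of a PDB-format file on the RCSB file server (assumed pattern)."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def uniprot_fasta_url(accession):
    """URL of a FASTA record from the UniProt REST service (assumed pattern)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_text(url, timeout=30):
    """Download a text resource; requires network access."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(rcsb_pdb_url("1hiv"))
print(uniprot_fasta_url("P69905"))
# Example (needs an internet connection):
# pdb_text = fetch_text(rcsb_pdb_url("1HIV"))
```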

### A subset of tools and databases for working with molecular structures
  - Web servers:
    - Protein structure modeling
      - [PRIMO](https://primo.rubi.ru.ac.za), [SWISS-MODEL](https://swissmodel.expasy.org), AlphaFold Server, [QUARK](https://seq2fun.dcmb.med.umich.edu//QUARK/)
    - Database of small molecules
      - [ChEMBL](https://www.ebi.ac.uk/chembl/), [ZINC](http://zinc.docking.org/)
    - Small molecule in silico docking
      - [CB-Dock2](https://cadd.labshare.cn/cb-dock2/index.php)
  - Standalone tools:
    - Visualizers / Molecular Modeling
      - [PyMOL](https://pymol.org), [ChimeraX](https://www.cgl.ucsf.edu/chimerax/), [VMD](https://www.ks.uiuc.edu/Research/vmd/), [Maestro](https://www.schrodinger.com/platform/products/maestro/) (commercial)
    - MD simulation
      - [GROMACS](https://www.gromacs.org), [LAMMPS](https://www.lammps.org/#gsc.tab=0), [AMBER](https://ambermd.org) (commercial), [CHARMM](https://projects.iq.harvard.edu/karplusgroup/charmm) (commercial)
  - Scripting
    - Advanced users often use computer scripting languages to do their custom analysis

**Note: This list of databases and tools is far from exhaustive**

### The RCSB PDB 

- Primary repository for protein 3D structures
  - Experimentally determined 3D structures deposited in the PDB by scientists around the world
  - Computed Structure Models (CSMs) produced by AlphaFold
- Structures
  - Static structures
- Various annotations
  - Quality metrics

### The UniProt database

- World’s leading high-quality, comprehensive and freely accessible resource of **protein sequence and functional information**
- Database entries are mainly organised as **Reviewed** and **Unreviewed**
  - Reviewed entries are **expertly curated** data obtained from the Swiss-Prot database
  - Unreviewed entries are generated by **automatic annotation** using the TrEMBL resource
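
In UniProt FASTA downloads, the reviewed/unreviewed status is visible in the header line: it starts with `sp` for Swiss-Prot entries and `tr` for TrEMBL entries. The sketch below reads that prefix. The function name is made up, the first header is written from memory as an illustration of the layout, and the second header is entirely fictitious.

```python
def parse_uniprot_header(header):
    """Split a UniProt FASTA header of the form >db|accession|entry_name description."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header")
    ids, _, description = header[1:].partition(" ")
    db, accession, entry_name = ids.split("|")
    status = {
        "sp": "Reviewed (Swiss-Prot)",
        "tr": "Unreviewed (TrEMBL)",
    }.get(db, "Unknown")
    # The protein name runs up to the first key=value annotation (OS=...).
    protein_name = description.split(" OS=")[0]
    return {
        "status": status,
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": protein_name,
    }


# Illustrative header in the Swiss-Prot style.
reviewed = parse_uniprot_header(
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
)
# Fictitious header in the TrEMBL style (accession and organism are invented).
unreviewed = parse_uniprot_header(
    ">tr|X0X0X0|X0X0X0_EXAMP Uncharacterized protein OS=Example organism OX=0 PE=4 SV=1"
)
print(reviewed)
print(unreviewed)
```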

### The AlphaFold database

- AlphaFold:
  - AI system developed by Google DeepMind
  - Predicts a protein’s 3D structure from its amino acid sequence.
  - It regularly achieves accuracy competitive with experiment.
- [AlphaFold DB](https://alphafold.ebi.ac.uk)
  - Developed from a partnership between Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI)
  - Freely distributes its predictions to the scientific community.
  - Contains over 200 million predicted structures
  - Offers downloadable proteomes for human and 47 other key organisms important in research and global health.

- Limitations of AlphaFold
  - AlphaFold DB focuses on predicting the structure of single protein chains with naturally occurring sequences. Many other use cases remain active areas of research, for example:
    - The version of AlphaFold used to construct AFDB does not output multi-chain predictions (complexes). In some cases the single-chain prediction may correspond to the structure adopted in the complex.
    - In other cases (especially where the chain is structured only on binding to partner molecules) the missing context from surrounding molecules may lead to an uninformative prediction. A separate version of AlphaFold was trained for complex prediction (AlphaFold Multimer). You can find the open source code on GitHub and make multimer predictions using Google DeepMind’s Colab.
    - For regions that are intrinsically disordered or unstructured in isolation AlphaFold is expected to produce a low-confidence prediction (pLDDT < 50) and the predicted structure will have a ribbon-like appearance. AlphaFold may be of use in identifying such regions, but the prediction makes no statement about the relative likelihood of different conformations (it is not a sample from the Boltzmann distribution).
    - AlphaFold has not been validated for predicting the effect of mutations. 
      - AlphaFold is not expected to produce an unfolded protein structure given a sequence containing a destabilising point mutation.
    - AlphaFold produces only one conformation, even when a protein is known to adopt multiple conformations.
    - AlphaFold does not predict the positions of any non-protein components found in experimental structures
      - Such as cofactors, metals, ligands, ions, DNA/RNA, or post-translational modifications.
      - AlphaFold is trained to predict the structure of proteins as they might appear in the PDB. Therefore backbone and side chain coordinates are frequently consistent with the expected structure in the presence of ions (e.g. for zinc-binding sites) or cofactors (e.g. side chain geometry consistent with heme binding).
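
The pLDDT threshold mentioned above can be applied programmatically, for instance to flag possibly disordered stretches. In AlphaFold DB coordinate files the per-residue pLDDT is stored in the B-factor column, so the scores could be read from there; the sketch below skips file parsing and works on a made-up list of scores. The band labels follow the AlphaFold DB colour legend, while the handling of exact boundary values, the `min_length` parameter and the function names are assumptions of this example.

```python
def confidence_band(plddt):
    """Map a pLDDT score (0-100) to a confidence label."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def low_confidence_regions(plddt_per_residue, threshold=50.0, min_length=3):
    """Return 1-based (start, end) ranges where pLDDT stays below the threshold."""
    regions = []
    start = None
    for index, score in enumerate(plddt_per_residue, start=1):
        if score < threshold:
            if start is None:
                start = index          # a low-confidence run begins here
        else:
            if start is not None and index - start >= min_length:
                regions.append((start, index - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(plddt_per_residue) - start + 1 >= min_length:
        regions.append((start, len(plddt_per_residue)))
    return regions


# Made-up per-residue scores for a 13-residue chain.
scores = [95.1, 92.3, 88.0, 45.2, 40.1, 38.7, 72.5, 49.0, 91.0, 30.0, 25.5, 20.1, 22.2]
regions = low_confidence_regions(scores)
print([confidence_band(s) for s in scores])
print(regions)
```

A low pLDDT stretch is only a hint of disorder; as noted in the limitations, it says nothing about which conformations the region actually samples.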

### The CB-Dock2 web server

- Web server for protein-ligand blind docking and for detecting cavities in protein structures. Essentially, it:
  - Determines where ligands can bind
  - Cavity center, volume, shape and residues
  - Ligand binding affinities and poses

### PyMOL

- PyMOL is a commercial product, but also has an open source version (Open-Source PyMOL)