# Protein Folding

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:
- Mhd Jawad Al Rahwanji, DDDD seminar 2023, Volkamer lab, Saarland University
- Paula Linh Kramer, 2023, Volkamer lab, Saarland University

## Aim of this talktorial

In this notebook, we will learn about protein folding and how to predict protein structures using machine learning. This helps us understand diseases and accelerates drug development.

We will also visualize the predicted structures and compare them to their experimentally determined crystal structures.

### Contents in *Theory*

* Protein Folding
    * Proteins
    * The Folding Problem
* History
    * CASP
    * Breakthroughs
* OmegaFold
    * Inner Workings and Training
    * Performance on benchmarks
    * Orphan protein and antibody predictions
    * OmegaFold model interpretation
    * Computational performance
* Alternative Methods
    * Quantum Approach
    * Diffusion-based Models

### Contents in *Practical*

**Goal: Predict the 3D structure of a protein from a given sequence of amino acids**

* Overview
* Setup
* Processing the input sequence
* Visualizing the output prediction

### References

* [CASP](https://predictioncenter.org/)
* AlphaFold2: [Jumper *et al.*, <i>Nature</i> (2021), <b>596</b>, 583–589](https://doi.org/10.1038/s41586-021-03819-2)
* RoseTTAFold: [Baek *et al.*, <i>Science</i> (2021), <b>373</b>, 871-876](https://doi.org/10.1126/science.abj8754)
* OmegaFold: [Wu *et al.*, <i>bioRxiv</i> (2022)](https://doi.org/10.1101/2022.07.21.500999)
* [Baker lab](https://www.bakerlab.org/)
* Quantum folding: [Robert *et al.*, <i>npj Quantum Inf.</i> (2021), <b>7</b>, 38](https://doi.org/10.1038/s41534-021-00368-4)
* Protein generation: [Watson *et al.*, <i>bioRxiv</i> (2022)](https://doi.org/10.1101/2022.12.09.519842)
* [OmegaFold on Github](https://github.com/HeliXonProtein/OmegaFold)

## Theory

### Protein Folding

#### Proteins

Proteins are the building blocks of life. Our DNA encodes a series of instructions for making proteins, which play a central role in almost all biological processes, from carrying oxygen to building muscles. They give the human body structure, do work around the body, move molecules around, and make new molecules. Proteins are linear chains of amino acids, of which there are 20 standard types. These chains fold in on themselves, creating unique 3D structures made up of elements such as sheets and helices. These specific shapes determine protein function, as they facilitate binding to other molecules.

For example, hemoglobin has a shape well suited to binding oxygen molecules. Moreover, its structure tunes the properties of its iron atoms, causing them to bind and release oxygen at different pressure levels.

<img src="./images/image1.png"  width="900">

**Image 1**: A 3D structure of a protein.

We know of over 200 million proteins, but we only know the exact 3D shape of a fraction of them. Figuring out the sequence that makes up a protein is relatively simple, as it is encoded in the DNA and relayed by RNA. Figuring out a protein's function once its shape is known is also comparatively simple. It is extremely challenging, however, to bridge the gap, that is, to computationally predict the shape a protein folds into based on its 1D amino acid sequence.

<img src="./images/image2.png"  width="1200">

**Image 2**: Protein creation process.

#### The Folding Problem

Predicting how a protein folds has been a challenge for nearly 50 years. A protein can be made up of anywhere between 50 and 2000 amino acids, each of which has a different chemical structure. These residues interact physically with one another, pushing and pulling the chain in different directions before it stabilizes. So the number of ways a protein could theoretically fold before settling into its final 3D structure is astronomical. In 1969, [Cyrus Levinthal](https://en.wikipedia.org/wiki/Cyrus_Levinthal) estimated $10^{300}$ possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds, a contradiction known as [Levinthal’s paradox](https://en.wikipedia.org/wiki/Levinthal%27s_paradox).
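A back-of-the-envelope calculation makes the paradox concrete. The sketch below takes the $10^{300}$ conformations quoted above and assumes, purely for illustration, that one conformation could be sampled every 0.1 picoseconds. That sampling rate is an assumption and not a figure from this talktorial.

```python
import math

# Figure quoted in the text: about 10^300 conformations for a typical protein.
LOG10_CONFORMATIONS = 300

# Assumption for illustration only: 10^13 conformations sampled per second.
LOG10_SAMPLES_PER_SECOND = 13

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_years_for_exhaustive_search(log10_conformations, log10_rate):
    """Return log10 of the years needed to visit every conformation once."""
    log10_seconds = log10_conformations - log10_rate
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


log10_years = log10_years_for_exhaustive_search(
    LOG10_CONFORMATIONS, LOG10_SAMPLES_PER_SECOND
)
print(f"Exhaustive search: roughly 10^{log10_years:.1f} years")
```

Under these assumptions, an exhaustive search would take vastly longer than the milliseconds in which real proteins fold, which suggests that folding cannot be a random search.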

<br/>
<img src="./images/image3.gif"  width="900">

**Image 3**: Protein folding process. Credit: DeepMind. [Source](https://techxplore.com/news/2020-01-alphafold-protein.html).

One experimental way to study protein structures is X-ray crystallography. Another is nuclear magnetic resonance (NMR) spectroscopy. Both of these, as well as newer methods like cryo-electron microscopy, are time-consuming and expensive to perform.

The objective is to predict the 3D structure of a protein given a sequence of amino acids. The search space for this is enormous. Furthermore, we do not have a clear enough understanding of the process to be able to create a scoring function. Solving this problem has many positive implications. It would help us better understand the world around us, and design drugs to treat diseases or enzymes to break down plastic waste.

### History

#### CASP

Critical Assessment of Protein Structure Prediction ([CASP](https://predictioncenter.org/)) is a biennial competition to catalyze research, monitor progress, and establish the state of the art in protein structure prediction. The teams are given amino acid sequences whose structures are known but have not yet been revealed publicly. Their objective is to predict the structures. The predictions are then compared to the ground truths using a metric called the global distance test (GDT). "GDT is computed over the alpha carbon atoms and is reported as a percentage, ranging from 0 to 100. In general, the higher the GDT score, the more closely a model approximates a given reference structure" ([Wiki](https://en.wikipedia.org/wiki/Global_distance_test)). Other measures include the root-mean-square deviation ([RMSD](https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions)) and the template modeling ([TM](https://en.wikipedia.org/wiki/Template_modeling_score)) score. In the early iterations of CASP, teams relied on physical and/or chemical properties at the atom or residue level. Until 2020, no team had achieved a score higher than 50% despite using deep learning techniques.
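To make these metrics more tangible, here is a minimal sketch of RMSD and a GDT_TS-style score over alpha carbon coordinates. It assumes the two structures are already superimposed and matched residue by residue, whereas the actual GDT procedure also searches over superpositions. The function names and toy coordinates are hypothetical.

```python
import math


def ca_distances(model, reference):
    """Per-residue distances (in Å) between matched C-alpha coordinates."""
    if len(model) != len(reference):
        raise ValueError("model and reference need the same number of residues")
    return [math.dist(m, r) for m, r in zip(model, reference)]


def rmsd(model, reference):
    """Root-mean-square deviation over matched C-alpha atoms."""
    distances = ca_distances(model, reference)
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of C-alpha atoms within each distance cutoff.

    Simplification: uses the given superposition instead of searching for
    the superposition that maximizes each fraction.
    """
    distances = ca_distances(model, reference)
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 9.0 Å.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(f"RMSD:   {rmsd(model, reference):.2f} Å")
print(f"GDT_TS: {gdt_ts(model, reference):.2f}")
```

In this toy case a single badly placed residue dominates the RMSD, while the GDT-style score still rewards the three residues that are close to the reference.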

#### Breakthroughs

In [CASP14](https://predictioncenter.org/casp14/), [AlphaFold2](https://doi.org/10.1038/s41586-021-03819-2) achieved a median GDT score of 87% in the most challenging free-modeling category, that is, reconstructing the structure of a protein without using the structural knowledge of any significant template protein. It was trained on 100 thousand proteins from the Protein Data Bank ([PDB](https://www.rcsb.org/)), and incorporates physical and biological knowledge about protein structure to improve prediction. The team was able to predict the positions of thousands of atoms with high accuracy.

<br/>
<img src="./images/image4.jpeg"  width="1800">

**Image 4**: AlphaFold2 network architecture and performance. [Source](https://doi.org/10.1038/s41586-021-03819-2).

Following that, [RoseTTAFold](https://www.science.org/doi/10.1126/science.abj8754) was announced, performing about as well at a lower computational cost. It tracks amino acid positions during folding and handles protein complexes. Later on, the code for both AlphaFold2 and RoseTTAFold was made publicly available [here](https://github.com/deepmind/alphafold) and [here](https://github.com/RosettaCommons/RoseTTAFold), respectively.

<br/>
<img src="./images/image5.jpg"  width="1800">

**Image 5**: RoseTTAFold network architecture and performance. [Source](https://www.science.org/doi/10.1126/science.abj8754).

Recently, another model called [OmegaFold](https://doi.org/10.1101/2022.07.21.500999) was announced, and its code was also released publicly [here](https://github.com/HeliXonProtein/OmegaFold). It achieves performance similar to AlphaFold2 and enables structure prediction for orphan proteins and for antibodies, whose multiple sequence alignments (MSAs) are noisy (discussed below).

All of these end-to-end methods rely on advancements in deep learning and databases, such as the PDB, to achieve impressive performance. Internally, they use a combination of transformers and physics-aware modules that iteratively refine their predictions of atom locations.

### OmegaFold

In this talktorial, we will focus on OmegaFold, as described in the original work by Wu et al. titled "[High-resolution de novo structure prediction from primary sequence](https://doi.org/10.1101/2022.07.21.500999)". It is a novel method that accurately predicts protein structures from a single sequence. As we will see, it uses a pre-trained transformer-based protein language model (OmegaPLM) that makes predictions from sequences of amino acids and a geometry-inspired transformer (Geoformer) trained on their respective conformations. Additionally, a structural head is applied for prediction. Because it needs only a single sequence, it works well on orphan proteins, which have unique functional characteristics, and on antibodies, whose fast-evolving sequences yield noisy MSAs.

To understand the novelty of OmegaFold, we will touch upon its competitors. Both AlphaFold2 and RoseTTAFold take as input multiple sequences that are aligned to the primary one. By utilizing pairwise variations across residues, these algorithms managed to outperform previous attempts. However, their performance quickly drops in the absence of MSAs. So, the authors of OmegaFold sought to remove that requirement and took an alignment-free approach, which scales 10x better (**Fig. 1A**).

<img src="./images/figure1A.png"  width="1800">

**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)

#### Inner Workings and Training

OmegaPLM was trained on unaligned protein sequences in a self-supervised manner, to learn powerful residue and residue-residue representations. Those embeddings capture the structural and functional characteristics of the amino-acid sequences. These are then fed into the Geoformer, which encodes the geometrical and physical interactions between amino acids. In the end, the structural layer predicts a sequence of 3D coordinates of all the heavy atoms.

Three residue-masking strategies were applied to increase OmegaPLM's robustness by training to predict missing residues from what's left of the sequence (**Fig. 2A**). The same hyperparameters used in [ESM-1b](https://github.com/facebookresearch/esm) were used while the attention module was simplified. After pretraining on sequences in UniRef50, their PLM showed improved performance with reduced memory requirements.
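The paragraph above does not spell out the three masking strategies, so the sketch below shows only generic random masking, one common self-supervised scheme, as an illustration of the idea of hiding residues and asking a model to recover them. The `#` mask symbol, the 15% mask fraction, the function name, and the toy sequence are all hypothetical choices and are not taken from OmegaFold.

```python
import random

MASK_TOKEN = "#"  # hypothetical placeholder symbol, not an OmegaFold token


def mask_residues(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues and return (masked sequence, targets).

    `targets` maps each masked position to the residue a language model
    would be trained to predict from the remaining context.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return "".join(masked), targets


toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence for illustration
masked_sequence, targets = mask_residues(toy_sequence)
print(masked_sequence)
print(targets)
```

During pretraining, the loss would compare the model's predictions at the masked positions with the entries in `targets`.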

<img src="./images/figure2A.png"  width="1200">

**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

The purpose of the 50-layer Geoformer is to improve the embeddings such that geometrically consistent coordinates can be predicted from them (**Fig. 2B**). Each Geoformer layer updates the residue representations while enforcing geometric consistency. These latent representations should satisfy properties of [Euclidean geometry](https://en.wikipedia.org/wiki/Euclidean_geometry), which they normally do not. To this end, the nonlinear attention module was simplified and redesigned to correct the embeddings. As a result, OmegaFold encodes a geometrically consistent representation of a protein sequence, which is then mapped to 3D coordinates (**Fig. 1A**).

<img src="./images/figure2B.png"  width="1200">

**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

Similar to AlphaFold2, OmegaFold was trained in a multitask manner on objectives such as contact, atomic distance, and torsion angle prediction. The final model was trained on ~110,000 single-chain sequences from the PDB pre-2021 and all single domains from the SCOP v1.75 database, while allowing at most 40% inter-sequence identity. For validation and tuning, newer structures were used, making sure to eliminate any source of data leakage.

#### Performance on benchmarks

OmegaFold's ability to predict protein structures was assessed on 2 benchmarks: a CASP set with 29 of the most challenging proteins from the free-modeling category in 2 recent CASP experiments, and a CAMEO dataset with 146 of the most recent single-chain proteins (appearing in the first 6 months of the 2022 CAMEO evaluation), spanning a wide range of prediction difficulty levels. For comparison, predictions were also computed with other state-of-the-art (SOTA) methods run in their default mode with MSAs as input. Remarkably, the structures predicted by OmegaFold, with a single sequence as input, were as accurate as those of the advanced MSA-based methods (**Fig. 1B**). On the CAMEO dataset, OmegaFold structures had a mean local distance difference test (LDDT) score of 0.82, comparable to the accuracy of other SOTA methods predicting from MSAs (LDDT is a commonly used metric for structure evaluation). On the more challenging CASP dataset, OmegaFold structures were also quite accurate, with an average TM-score (a common metric for assessing the topological similarity of protein structures) of 0.79, slightly lower than that of other SOTA methods. The relative performance of OmegaFold was also tested against the single-sequence versions of AlphaFold2 and RoseTTAFold on these two datasets. When only a single sequence was given as input, their predicted structures were significantly less accurate than those of OmegaFold (**Fig. 1B**), indicating that the performance of the MSA-based methods drops when evolutionary information is not given.

<img src="./images/figure1B.png"  width="1800">

**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

#### Orphan protein and antibody predictions

OmegaFold was also assessed on challenging antibody and orphan protein structures from the PDB, for which other methods perform poorly. Antibody complementarity-determining regions (CDRs) are the most diverse and variable parts of the molecules. Because of antibodies' fast-evolving nature, MSAs on CDRs, especially on the CDR3 loops of the heavy chain, are extremely noisy despite being highly enriched in amino acid composition. As a result, methods like AlphaFold2 are unreliable there and yield very low predicted LDDT (pLDDT) scores (**Fig. 3A**). Orphan proteins, by definition, lack sequence and structure homology information, and are thus also difficult to predict with MSA-based methods (**Fig. 3B**). On both antibody loops and orphan proteins, OmegaFold achieves significantly higher prediction accuracy than AlphaFold2, likely due to the advantages of its single-sequence prediction method.

<img src="./images/figure3.png"  width="1800">

**Fig. 3**: OmegaPLM performance analysis. (A) OmegaPLM predicting antibodies. (B) Orphan protein performance. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

#### OmegaFold model interpretation

To test whether the Geoformer was boosting the PLM's performance, distance (or contact) maps were calculated using embeddings from OmegaPLM and from each Geoformer layer, and the evolution of contact accuracy and geometric inconsistency was tracked as the number of Geoformer layers increased (**Fig. 2C & D**). The contacts and distances predicted by OmegaPLM alone were found to be reasonably accurate. Within the first 20 Geoformer layers, the geometric inconsistency, measured by violations of the distance triangle inequality, is greatly reduced, and the prediction accuracy of contact pairs is much improved. The remaining Geoformer layers appear to focus on reducing the geometric inconsistency until the prediction accuracy plateaus. Ablation studies on parts of OmegaFold further demonstrated that each component contributes to the performance of the overall model.
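The notion of geometric inconsistency can be illustrated with a small sketch that counts how many residue triplets in a predicted distance matrix break the triangle inequality. This is only one plausible way to quantify violations. The exact measure used by the OmegaFold authors may differ, and the function name and toy matrices are hypothetical.

```python
from itertools import combinations


def triangle_violation_rate(dist, tolerance=1e-9):
    """Fraction of residue triplets whose distances break the triangle inequality.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    A triplet is inconsistent if its longest side exceeds the sum of the
    other two sides, which no arrangement of points in space can produce.
    """
    violations = 0
    total = 0
    for i, j, k in combinations(range(len(dist)), 3):
        sides = (dist[i][j], dist[j][k], dist[i][k])
        longest = max(sides)
        total += 1
        if longest > sum(sides) - longest + tolerance:
            violations += 1
    return violations / total if total else 0.0


# Distances that three real points can realize (a 3-4-5 triangle).
consistent = [[0, 3, 5], [3, 0, 4], [5, 4, 0]]
# Residues 0 and 2 are each 1 Å from residue 1 but supposedly 5 Å apart.
inconsistent = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]

print(triangle_violation_rate(consistent))
print(triangle_violation_rate(inconsistent))
```

A representation refined layer by layer would be expected to drive this rate toward zero, mirroring the trend described above.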

<img src="./images/figure2CD.png"  width="1200">

**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

#### Computational performance

Finally, we touch on efficiency. Since OmegaFold does not require MSAs to achieve high-resolution predictions, its overall runtime is much shorter than that of AlphaFold2, as well as that of the highly optimized ColabFold-AF2 (**Fig. 1C**). In summary, this study leverages a protein language model trained on unaligned sequences to predict protein structures from single amino acid sequences alone. It also provides further evidence that evolutionary information may well be encoded in primary sequences, which can then be used as features for structure prediction.

<img src="./images/figure1C.png"  width="900">

**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold](https://doi.org/10.1101/2022.07.21.500999).

### Alternative Methods

#### Quantum Approach

An alternative approach to protein folding is to use quantum computing. The objective of protein folding can be framed as finding the [psi and phi angles](https://en.wikipedia.org/wiki/Dihedral_angle#Dihedral_angles_of_biological_molecules) at pivotal atoms within a protein sequence. By representing those key points as [qubits](https://en.wikipedia.org/wiki/Qubit) in a quantum system, it is possible to use [hybrid quantum/classical algorithms](https://en.wikipedia.org/wiki/Quantum_algorithm) to find those angle values. In [Robert *et al.*](https://doi.org/10.1038/s41534-021-00368-4), a resource-efficient quantum approach was proposed to fold a polymer chain of N monomers on a lattice. This method shows promise because, with time, [fault-tolerant](https://en.wikipedia.org/wiki/Threshold_theorem) quantum computers are expected to become capable of handling larger inputs. When that happens, these computers could out-scale classical computers, leading to polynomial or exponential reductions in compute time thanks to [quantum superposition](https://en.wikipedia.org/wiki/Quantum_superposition). That being said, this method is currently limited in terms of chain length due to the constraints of present-day quantum hardware.
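Since the phi and psi angles are dihedral angles between consecutive backbone atoms (phi over C, N, CA, C and psi over N, CA, C, N), a short classical sketch of how a dihedral is computed from four 3D points may help. It uses the standard `atan2` formulation, and the sign of the result follows that formula. The helper names and toy coordinates are hypothetical, and nothing here is specific to the quantum algorithm.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, e.g. backbone atoms."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)  # normals of the two planes
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy geometries around a central bond along the x-axis.
cis = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
perpendicular = dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
print(cis, trans, perpendicular)
```

For real structures, the four points would be the coordinates of the corresponding backbone atoms of neighboring residues.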

<br/>
<img src="./images/image6.png"  width="1800">

**Image 6**: Folding algorithm and process of the quantum approach. [Source](https://www.nature.com/articles/s41534-021-00368-4/figures/2).

#### Diffusion-based Models

Another novel and interesting approach uses [diffusion networks](https://en.wikipedia.org/wiki/Diffusion_model) to create a generative model for protein design. This model, RoseTTAFold Diffusion ([RF diffusion](https://doi.org/10.1101/2022.12.09.519842)), was created by fine-tuning the RoseTTAFold network on protein structure denoising tasks. It enables us to design diverse, complex, functional proteins from simple molecular specifications. In protein folding, the question is how amino acids interact with each other when they are part of the same chain. In protein binding assessment and protein design, by contrast, the question is which amino acids would be most suitable to interact with a known surface. So we have to predict a surface that has a compatible topology and a favorable binding profile with respect to the existing surface. This is the difference between designing proteins with docking or binding in mind and plain protein folding. Concretely, the difference lies in the restrictions on amino acid movement that are imposed by being linked through a bond. But the principle, i.e. amino acid interaction, is the same.

## Practical

### Overview

In this talktorial we will be using the OmegaFold model. To do so we will need to:
- Set up OmegaFold locally by following the instructions [here](https://github.com/HeliXonProtein/OmegaFold).
- Process an input sequence to predict a conformation.
- Visualize the prediction structure as well as the ground truth, and compare them.

**Note**: In the `data` folder, a `FASTA` file is available for use. More information about the particular protein it represents can be found in the accompanying `README.md`.

**Note**: In the `data` folder, a `PDB` file of the protein is also available for use.

**Note**: In the `output` folder,  a `PDB` file will be generated for each of the sequences in our input file.

**Note**: The prediction confidence values are put in the place of `b_factors` (temperature factors) in the resulting `PDB` files.
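Because the confidence values occupy the B-factor field, they can be read back with plain fixed-column parsing of `ATOM` records, where the B-factor sits in columns 61 to 66. The sketch below is a minimal illustration: the two-residue fragment, its coordinates, and its confidence values are made up, and the helper name is hypothetical.

```python
# Fixed-column layout of a PDB ATOM record (B-factor in columns 61-66).
ATOM_FORMAT = (
    "ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{seq:4d}    "
    "{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
)

# Hypothetical fragment; coordinates and confidence values are invented.
toy_pdb = "\n".join(
    [
        ATOM_FORMAT.format(serial=1, name=" N", res="MET", chain="A", seq=1,
                           x=27.340, y=24.430, z=2.614, occ=1.0, b=91.5),
        ATOM_FORMAT.format(serial=2, name=" CA", res="MET", chain="A", seq=1,
                           x=26.266, y=25.413, z=2.842, occ=1.0, b=91.5),
        ATOM_FORMAT.format(serial=3, name=" CA", res="LYS", chain="A", seq=2,
                           x=24.913, y=22.903, z=4.512, occ=1.0, b=48.2),
    ]
)


def ca_confidences(pdb_text):
    """Map (chain, residue number) to the B-factor value of each C-alpha atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


confidences = ca_confidences(toy_pdb)
print(confidences)
```

For a real prediction, `toy_pdb` would be replaced by the text of a file from the `output` folder, and low values would point to regions where the prediction is less trustworthy.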

In [1]:
# This will re-direct us to a new jupyter instance with an increased data rate limit allowing protein structure rendering using nglview.
!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

### Setup

This will download the OmegaFold package source files from GitHub (~2 MB) and install the package with `pip`.

In [None]:
import os
!pip install git+https://github.com/HeliXonProtein/OmegaFold.git

Alternatively, this will clone the repository and then set up the OmegaFold package from the local copy.

In [None]:
!git clone https://github.com/HeliXonProtein/OmegaFold
os.chdir('OmegaFold')
!python setup.py install
os.chdir('..')

### Processing the input sequence

This will run the model. On the first run, it will first download the model weights from AWS (~3 GB).

**Note**: The `--device cuda` argument only works if an NVIDIA GPU is available; otherwise, `--device cpu` should be used. See next cell for more info.

In [None]:
!omegafold ./data/rcsb_pdb_6YJ1.fasta ./output --device cuda --num_cycle 20 --model 1

In [None]:
!omegafold ./data/rcsb_pdb_7FVU.fasta ./output --device cuda --num_cycle 20 --model 1
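To avoid editing the `--device` argument by hand, the command can be assembled in Python with a fallback to the CPU. This is a minimal sketch: the helper names are hypothetical, only the flags already used above are passed, and it assumes that PyTorch (installed alongside OmegaFold) reports GPU availability through `torch.cuda.is_available()`.

```python
import shlex


def pick_device():
    """Return "cuda" if PyTorch can see an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch we cannot check, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def omegafold_command(fasta_path, output_dir, num_cycle=20, model=1):
    """Build the same command line as in the cells above."""
    args = [
        "omegafold", fasta_path, output_dir,
        "--device", pick_device(),
        "--num_cycle", str(num_cycle),
        "--model", str(model),
    ]
    return shlex.join(args)


command = omegafold_command("./data/rcsb_pdb_6YJ1.fasta", "./output")
print(command)
```

In a notebook, the resulting string can be run with `!{command}`.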

This will show the model parameters. It provides several options for compute, memory and weights.

In [None]:
!omegafold --help

### Visualizing the output prediction

This will render the ground-truth crystal structure of the first chain (chain 'A') in the first model (model 0).

**Note**: For this structure, there are in fact two chains, A and B. We're only concerned with chain 'A'.

In [1]:
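# Explicit imports instead of a star import; `cealign` is used for the alignment below.
from Bio.PDB import PDBParser, cealign
import nglview as nv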

parser_true = PDBParser()
structure_true = parser_true.get_structure("true", "data/6yj1.pdb")[0]

view_true = nv.show_biopython(structure_true["A"])
view_true





NGLWidget()

This will render the predicted structure of the first chain (chain 'A') in the first model (model 0).

**Note**: There is only one chain in the output file anyway.

In [2]:
parser_pred = PDBParser()
structure_pred = parser_pred.get_structure("pred", "output/0th chain.pdb")[0]

view_pred = nv.show_biopython(structure_pred["A"])
view_pred

NGLWidget()

This outputs the RMSD (in Å) after applying "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path".

In [3]:
cea = cealign.CEAligner()
cea.set_reference(structure_true)
cea.align(structure_pred, transform=True)
cea.rms

5.652865371217041

This will render the two structures together after CE alignment.

In [4]:
view_aligned = nv.show_biopython(structure_true["A"])
view_aligned.update_cartoon(color='blue')
view_aligned.add_structure(nv.BiopythonStructure(structure_pred["A"]))
view_aligned.update_cartoon(color='orange', component=1)
view_aligned

NGLWidget()

## Discussion

Using any of the SOTA structure prediction methods, it is possible to predict the 3D structure of a protein from a sequence of amino acids with high accuracy. However, the story does not end here. While sequencing data continues to accumulate and feed MSA-based methods, OmegaFold fills the interim gap and, importantly, predicts structures from sequences for which MSAs are difficult to construct. Conceptual advances and further algorithmic development along these lines are expected to continue enabling a wide spectrum of protein science applications, such as multi-state conformational sampling (where AlphaFold2 has had some success), prediction of regions that are currently challenging, variant effect prediction, protein-protein interactions, and protein docking. As important as it is to predict the fold, it is equally important to find protein conformations that bind to a binding partner of interest, or the other way around, since binding is what dictates protein function. With both, we may be able to design effective drugs for many diseases and to address pressing challenges through bioengineering.

## Quiz

- Investigate the trade-off between computation time and average prediction quality by changing the number of cycles. How does the prediction quality change?
- Trade computation time for memory consumption by changing the subbatch size. How drastic are the reductions?
- Look into using a different alignment algorithm (other than CE) and observe the RMSD estimate. Does it differ?
- Search the Protein Data Bank (PDB) for another protein and attempt to fold it. Does the model perform as well as, better than, or worse than expected? Compare this to the example in Practical. More here: [PDB](https://www.rcsb.org/)
- A similar task is to look up another protein from the ones published for CASP14. Predict its structure and compare the accuracy to that of AlphaFold2. More here: [CASP14](https://predictioncenter.org/casp14/)
- If you're up for a challenge, in a separate notebook, use AlphaFold2 or RoseTTAFold alongside OmegaFold to predict the same protein and compare. More here: [AlphaFold2's GitHub page](https://github.com/deepmind/alphafold) and [RoseTTAFold's GitHub page](https://github.com/RosettaCommons/RoseTTAFold)
- For more practice, and to appreciate how far the field has come, implement an older method, perhaps a CNN-based or a physics-based one, to predict a protein structure. This is also a challenge. More here: [DeepCov](https://github.com/psipred/DeepCov)