# Protein Folding

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:
- Mhd Jawad Al Rahwanji, DDDD seminar 2023, Volkamer lab, Saarland University
- Paula Linh Kramer, 2023, Volkamer lab, Saarland University

## Aim of this talktorial

In this notebook, we will learn about protein folding and how to predict protein structures using machine learning. This helps us understand diseases and accelerates drug development.

~~Our work here will include comparing the predicted structures to their corresponding crystalline forms.~~

### Contents in *Theory*

* Protein Folding
    * Proteins
    * The Folding Problem
* History
    * CASP
    * Breakthroughs
* Methods
    * Overview
    * OmegaFold
        * OmegaFold model
        * OmegaPLM language model
        * Performance on benchmarks
        * Orphan protein and antibody predictions
        * OmegaFold model interpretation
        * Computational performance
* Alternative Methods
    * Quantum Approach
    * Diffusion-based Models

### Contents in *Practical*

**Goal: Predict the 3D structure of a protein from a given sequence of amino acids**

* Overview
* Setup
* Processing the input sequence
* Visualizing the output prediction

### References

* [CASP](https://predictioncenter.org/)
* AlphaFold2: [Jumper *et al.*, <i>Nature</i> (2021), <b>596</b>, 583–589](https://doi.org/10.1038/s41586-021-03819-2)
* RoseTTAFold: [Baek *et al.*, <i>Science</i> (2021), <b>373</b>, 871–876](https://doi.org/10.1126/science.abj8754)
* OmegaFold: [Wu *et al.*, <i>bioRxiv</i> (2022)](https://doi.org/10.1101/2022.07.21.500999)
* [Baker lab](https://www.bakerlab.org/)
* Quantum folding: [Robert *et al.*, <i>npj Quantum Inf.</i> (2021), <b>7</b>, 38](https://doi.org/10.1038/s41534-021-00368-4)
* Protein generation: [Watson *et al.*, <i>bioRxiv</i> (2022)](https://doi.org/10.1101/2022.12.09.519842)
* [OmegaFold on Github](https://github.com/HeliXonProtein/OmegaFold)

## Theory

### Protein Folding

#### Proteins

Proteins are the building blocks of life. Our DNA encodes a series of instructions for making proteins, which play a central role in almost all biological processes, from carrying oxygen to building muscles. They give the human body structure, do work around the body, move molecules around and make new molecules. Proteins are linear chains of amino acids, of which there are 20 standard types. These chains fold in on themselves, creating unique 3D structures made up of elements such as sheets and helices. These specific shapes determine protein function, as they enable binding to other molecules.

For example, hemoglobin has a shape well suited to binding oxygen. Moreover, it changes the characteristics of its iron atoms, causing them to hold and release oxygen molecules at different pressure levels.

We know of over 200 million proteins, but we only know the exact 3D shape of a fraction of them. Figuring out the sequence that makes up a protein is relatively simple, as it is specified by the DNA via RNA. Figuring out a protein's function once its shape is known is also comparatively simple. Bridging the gap, however, is extremely challenging: it means computationally predicting the 3D shape that a protein folds into from its 1D amino acid sequence.

#### The Folding Problem

Protein structure prediction has been a challenge for nearly 50 years. A protein can be made up of anywhere between 50 and 2000 amino acids, each of which has a different chemical structure. These amino acids interact physically with one another, pushing and pulling the chain in different directions before it stabilizes. As a result, the number of ways a protein could theoretically fold before settling into its final 3D structure is astronomical. In 1969, Cyrus Levinthal estimated $10^{300}$ possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds. This contradiction is sometimes referred to as Levinthal's paradox.
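To get a feel for the scale of this estimate, the following back-of-the-envelope sketch counts conformations for a hypothetical chain. All numbers are illustrative assumptions rather than values taken from Levinthal's work: 150 residues, two backbone torsion angles per residue, 10 discrete states per angle, and a sampling rate of $10^{13}$ conformations per second.

```python
# Illustrative assumptions (not measured values)
n_residues = 150
angles_per_residue = 2        # phi and psi backbone torsion angles
states_per_angle = 10         # coarse discretization of each angle
samples_per_second = 10**13   # hypothetical sampling speed

# Every angle can independently take any of its states
n_conformations = states_per_angle ** (n_residues * angles_per_residue)

seconds_per_year = 60 * 60 * 24 * 365
years_needed = n_conformations // (samples_per_second * seconds_per_year)

# Report orders of magnitude via the number of decimal digits
print(f"conformations: about 10^{len(str(n_conformations)) - 1}")
print(f"years for an exhaustive search: about 10^{len(str(years_needed)) - 1}")
```

Even under these generous assumptions, an exhaustive search would take vastly longer than the age of the universe, which is why proteins cannot be folding by random trial and error.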

One experimental way to study protein structures is X-ray crystallography. Another is nuclear magnetic resonance (NMR) spectroscopy. Both methods, as well as newer ones like cryo-electron microscopy, are time-consuming and expensive to perform.

The objective is to predict the 3D structure of a protein given a sequence of amino acids. The search space for this is enormous. Furthermore, we do not have a clear enough understanding of the process to be able to create a scoring function. Solving this problem has many positive implications. It would help us better understand the world around us and design drugs to treat diseases or enzymes to break down plastic waste.

### History

#### CASP

Critical Assessment of protein Structure Prediction (CASP) is a biennial competition that aims to catalyse research, monitor progress, and establish the state of the art in protein structure prediction. Teams are given amino acid sequences whose structures are known but have not yet been revealed publicly, and their objective is to predict those structures. The predictions are then compared to the ground truths using a metric called the global distance test (GDT). In the early iterations of CASP, teams relied on physical and/or chemical properties at the atom or amino acid level. Until 2020, no team had achieved a score higher than 50%, despite using deep learning techniques.
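To make the GDT metric more concrete, here is a minimal sketch of the commonly used GDT_TS variant: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the fraction of Cα atoms that lie within the cutoff of their reference position. The function name `gdt_ts` is our own, and the sketch assumes that both structures are already superposed and have matching residues. The official procedure additionally searches over many superpositions to maximize each fraction, which is omitted here.

```python
import numpy as np


def gdt_ts(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for two already-superposed (N, 3) C-alpha arrays."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Per-residue distance between predicted and reference positions
    dist = np.linalg.norm(pred - ref, axis=1)
    # Fraction of residues within each cutoff, averaged over all cutoffs
    fractions = [(dist <= cutoff).mean() for cutoff in cutoffs]
    return 100.0 * float(np.mean(fractions))


# Toy example: four residues displaced by 0.5, 1.5, 3.0 and 9.0 angstroms
reference = np.zeros((4, 3))
predicted = np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(gdt_ts(predicted, reference))  # 56.25
```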

#### Breakthroughs

In CASP14, AlphaFold2 achieved a median GDT score of 87% in the most challenging free-modelling category. Its trick was to incorporate protein physics into its 2-stage model in the form of protein similarities. It was trained on about 100 thousand proteins from the Protein Data Bank (PDB). The team was able to predict the positions of thousands of atoms at high resolution.

Following that, RoseTTAFold was announced, performing about as well at lower computational cost. It included a 3rd stage for tracking amino acid positions during folding. Further, it handles protein complexes. Later on, the methods for both RoseTTAFold and AlphaFold2 were made publicly available. More recently, another model called OmegaFold was announced, and its methods were also released publicly. It achieves similar performance to AlphaFold2 and enables prediction on orphan proteins and on antibodies with noisy multiple sequence alignments (MSAs).

### Methods

#### Overview

These state-of-the-art (SOTA) methods are end-to-end models that iteratively refine their predictions. All of them rely on advancements in deep learning and on large databases of protein sequences and structures that enable effective training of large models. Internally, they use transformers, which, thanks to self-attention, excel at processing long sequences with internal dependencies, here combined with geometry-inspired architectural components.
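As a reminder of what self-attention computes, the following minimal NumPy sketch implements single-head scaled dot-product self-attention over a sequence of residue embeddings. It is a generic illustration, not the attention variant used in any of the models above; the embedding sizes and the random projection matrices are arbitrary placeholders.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (length, dim) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise compatibility between all positions, scaled by the key dimension
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mix of all value vectors
    return weights @ v, weights


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))              # 5 residues, 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, attention = self_attention(embeddings, w_q, w_k, w_v)
print(output.shape, attention.shape)  # (5, 4) (5, 5)
```

Because every residue attends to every other residue, distant positions in the chain can influence each other directly, which matters for residues that are far apart in sequence but close in the folded structure.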

#### OmegaFold

In this talktorial we will focus on one method in particular, OmegaFold, as communicated in the original work by Wu et al. titled [High-resolution de novo structure prediction from primary sequence](https://doi.org/10.1101/2022.07.21.500999). According to the authors, it is the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. It uses a new combination of a protein language model that allows making predictions from single sequences and a geometry-inspired transformer model trained on protein structures. It enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and on antibodies that tend to have noisy MSAs due to fast evolution. As such, it fills a much-encountered gap in structure prediction and brings us one step closer to understanding protein folding in nature.

To understand the novelty of OmegaFold, we will touch upon how methods like AlphaFold2 and RoseTTAFold work. Both of these methods need evolutionary data as input in the form of MSAs of homologous sequences aligned to the primary one, a technique which has been a staple of structure prediction methods. By extracting residue-residue covariances from these MSAs, these algorithms have been shown to greatly outperform previous approaches, including physics-based models, homology-based models and CNNs, and to predict structures with atomic-level accuracy for the first time. However, prediction accuracies for all these advanced methods drop sharply in the absence of a multitude of sequence homologs from which to construct MSAs.

The authors of OmegaFold therefore sought to train an algorithm that learns to model protein 3D structures without relying on MSA preprocessing, i.e., alignment-free (**Fig. 1A**). They capitalized on the transformer to achieve that while incorporating geometric intuition into the architecture design. Their model uses a new combination of a large pretrained language model for sequence modelling and a geometry-inspired transformer model for structure prediction. It runs roughly ten times faster than MSA-based methods, with comparable or better accuracy.

##### OmegaFold model

OmegaFold is built atop a deep transformer-based protein language model (PLM) called OmegaPLM. It is trained on a large collection of unaligned and unlabeled protein sequences to learn single- and pairwise-residue embeddings as powerful features that model the distribution of sequences. Through these embeddings, OmegaPLM is able to capture structural and functional information encoded in the amino acid sequences. The embeddings are then fed into Geoformer, a new geometry-inspired transformer neural network, to further distill the structural and physical pairwise relationships between amino acids. Lastly, a structure module predicts the 3D coordinates of all heavy atoms.

<img src="./images/figure1.png"  width="1800">

**Fig. 1**: Overview of OmegaFold and results. (A) Model architecture. (B) Evaluations. (C) Runtime analysis. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)

The key idea behind Geoformer is to make the embeddings from the language model more geometrically consistent: amino acid node and pairwise embeddings should generate consistent coordinates and distance predictions when projected to 3D. Geoformer is similar in principle to the Evoformer module in AlphaFold2, which applied attention to evolutionary variation. It consists of a deep stack of 50 Geoformer layers, whose design is inspired by the fundamental theorem of vector calculus (**Fig. 2B**). Each Geoformer layer encodes information in node representations for the residues and pairwise representations between residues by enforcing their geometric consistency. Intuitively, we can view the representation of each residue as a point in a high-dimensional vector space, and each pairwise representation as a vector pointing from one residue to another, which will be used for predicting 3D coordinates and distances. Ideally, coordinates and pairwise distances should satisfy properties of Euclidean geometry, such as the triangle inequality. However, these properties may not always hold if the representation vectors and predictions are output by neural networks. Based on these geometric insights, the team designed a geometry-inspired nonlinear attention module to sequentially update the representations, attempting to make them consistent. By stacking Geoformer layers upon the PLM, OmegaFold captures the geometry of a protein structure with representations which are then projected onto 3D space with an 8-layer structure module (**Fig. 1A**).

Borrowing from AlphaFold2's training, OmegaFold was trained with several structural objectives, including contact prediction, Frame Aligned Point Error (FAPE) loss (a geometrically restricted version of root-mean-squared deviation (RMSD) of relative atomic positions), and torsion angle prediction. The full model was jointly trained on ~110,000 single-chain structures from the PDB deposited before 2021 and all single domains from the SCOP v1.75 database with at most 40% sequence identity cutoff. The team used later-released structures for validation and hyperparameter selection. They also excluded protein structures that appeared in their test sets, as well as their homologs up to 40% sequence identity, during training.

##### OmegaPLM language model

As a preliminary assessment of the language module, the authors applied 3 strategies for masking out residues and trained the model to optimize the standard Bidirectional Encoder Representations from Transformers (BERT)-style objective, i.e., to predict masked residues from the rest of the sequence using a cross-entropy loss function (**Fig. 2A**). The 3 strategies were: masking out a random 15% of residues in a protein sequence, masking out random consecutive subsequences of 5-8 residues, and masking out half of the entire protein sequence. For convenience, the authors chose the same hyperparameters as those used in the popular transformer network ESM-1b, including the number of layers and the number of heads, but with a more lightweight implementation of the attention module. After pretraining on sequences in UniRef50, the PLM demonstrated improved language modeling and contact prediction performance over vanilla transformer-based networks, with much smaller memory requirements.
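The three masking strategies can be sketched on plain strings as follows. This is a simplified illustration and not the authors' implementation: the mask symbol `#`, the function names, the choice to mask a single span per sequence, and the choice of a contiguous first or second half are all assumptions made for this example.

```python
import random

MASK = "#"  # hypothetical mask symbol used only in this sketch


def mask_random(seq, fraction=0.15, rng=random):
    """Mask a random subset of positions (about 15% by default)."""
    n_masked = max(1, round(len(seq) * fraction))
    positions = set(rng.sample(range(len(seq)), n_masked))
    return "".join(MASK if i in positions else aa for i, aa in enumerate(seq))


def mask_span(seq, min_len=5, max_len=8, rng=random):
    """Mask one random consecutive stretch of 5-8 residues."""
    length = min(rng.randint(min_len, max_len), len(seq))
    start = rng.randint(0, len(seq) - length)
    return seq[:start] + MASK * length + seq[start + length:]


def mask_half(seq, rng=random):
    """Mask either the first or the second half of the sequence."""
    half = len(seq) // 2
    if rng.random() < 0.5:
        return MASK * half + seq[half:]
    return seq[:half] + MASK * (len(seq) - half)


rng = random.Random(42)
sequence = "ACDEFGHIKLMNPQRSTVWY" * 2  # toy 40-residue sequence
for strategy in (mask_random, mask_span, mask_half):
    print(f"{strategy.__name__:12s} {strategy(sequence, rng=rng)}")
```

During pretraining, the model sees the masked sequence and is penalized with a cross-entropy loss for every masked position where it fails to recover the original residue.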

##### Performance on benchmarks

OmegaFold's ability to predict protein structures was assessed on 2 benchmarks: a CASP set with 29 of the most challenging proteins from the free-modelling category in 2 recent CASP experiments, and a CAMEO dataset with 146 of the most recent single-chain proteins (appearing in the first 6 months of the 2022 CAMEO evaluation), spanning a wide range of prediction difficulty levels. For comparison, the authors also computed predictions with other SOTA methods run in their default mode with MSAs as input. Remarkably, the structures predicted by OmegaFold, with a single sequence as input, were about as accurate as those of the advanced MSA-based methods (**Fig. 1B**). On the CAMEO dataset, OmegaFold structures had a mean local distance difference test (LDDT) score of 0.82, comparable to the accuracy of other SOTA methods predicting from MSAs. (LDDT is a commonly used metric for structure evaluation.) On the more challenging CASP dataset, OmegaFold structures were also quite accurate, with an average TM-score (a common metric for assessing the topological similarity of protein structures) of 0.79, slightly lower than that of other SOTA methods. The relative performance of OmegaFold was also tested against the single-sequence versions of AlphaFold2 and RoseTTAFold on these two datasets. When only a single sequence was given as input, their predicted structures were substantially less accurate than those of OmegaFold (**Fig. 1B**), indicating that the performance of the MSA-based methods drops when evolutionary information is not given.
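To illustrate what an LDDT score measures, the sketch below computes a simplified, Cα-only, global variant: for all residue pairs that lie within 15 Å of each other in the reference, it checks how well the predicted structure preserves their distance at tolerances of 0.5, 1, 2 and 4 Å, and averages the four fractions. The function name `lddt_ca` is our own. The full metric works on all atoms and is reported per residue, so values from this sketch are not directly comparable to published scores.

```python
import numpy as np


def lddt_ca(pred, ref, inclusion_radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT (0-1) on two (N, 3) C-alpha coordinate arrays."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # All-against-all distance matrices within each structure
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    # Only score pairs of distinct residues that are close in the reference
    mask = (d_ref < inclusion_radius) & ~np.eye(len(ref), dtype=bool)
    if not mask.any():
        return 0.0
    diff = np.abs(d_ref - d_pred)[mask]
    # Fraction of preserved distances per tolerance, averaged over tolerances
    return float(np.mean([(diff < t).mean() for t in thresholds]))


reference = np.array([[0.0, 0, 0], [3.0, 0, 0], [6.0, 0, 0]])
stretched = np.array([[0.0, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(lddt_ca(reference + 5.0, reference))  # 1.0, translation does not matter
print(lddt_ca(stretched, reference))        # 0.5
```

Because the score only compares distances within each structure, no superposition of the two structures is needed.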

<img src="./images/figure2.png"  width="1800">

**Fig. 2**: OmegaPLM and geometric smoothing. (A) OmegaPLM pretraining routine. (B) Geoformer at work. (C) Geoformer evaluation. (D) Visualization of contact maps. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)

##### Orphan protein and antibody predictions

OmegaFold's performance was also assessed on challenging antibody and orphan protein structures from the PDB, for which other methods perform poorly. Antibody complementarity-determining regions (CDRs) are the most diverse and variable parts of the molecules. Because of antibodies' fast-evolving nature, MSAs of CDRs are extremely noisy, especially for the CDR3 loops on the heavy chain, despite these being highly enriched in amino acid composition. As a result, methods like AlphaFold2 are unreliable there and yield very low predicted LDDT (pLDDT) scores (**Fig. 3A**). Orphan proteins, by definition, lack sequence and structure homology information, and thus are also difficult to predict with MSA-based methods (**Fig. 3B**). On both antibody loops and orphan proteins, OmegaFold achieves much higher prediction accuracy than AlphaFold2, likely due to the advantages of its single-sequence-based prediction method.

<img src="./images/figure3.png"  width="1800">

**Fig. 3**: OmegaPLM performance analysis. (A) OmegaPLM predicting antibodies. (B) Orphan protein performance. Figure credit to [the authors of OmegaFold.](https://doi.org/10.1101/2022.07.21.500999)

##### OmegaFold model interpretation

To test whether Geoformer boosts the PLM's performance, the authors calculated distance (or contact) maps using embeddings from OmegaPLM and from each Geoformer layer, and tracked the evolution of contact accuracy and geometric inconsistency as the number of Geoformer layers increased (**Fig. 2C and D**). The contacts and distances predicted by OmegaPLM alone were found to be reasonably accurate. Within the first 20 Geoformer layers, the geometric inconsistency, measured by violations of the distance triangle inequality, is greatly reduced, and the prediction accuracy of contact pairs is much improved. The remaining Geoformer layers appear to focus on reducing the geometric inconsistency until the prediction accuracy plateaus. Ablation studies on the component parts of OmegaFold demonstrated that the components contribute to the performance of the overall model.
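One simple way to quantify geometric inconsistency in a predicted distance matrix is to count how often the triangle inequality is violated. The sketch below does this for every ordered triple of distinct residues. It is our own illustration under the assumption of a symmetric (N, N) distance matrix; the function name and the exact measure are hypothetical and not taken from the OmegaFold paper.

```python
import numpy as np


def triangle_violation_rate(dist, tol=1e-6):
    """Fraction of residue triples (i, j, k) with d[i, k] > d[i, j] + d[j, k]."""
    d = np.asarray(dist, dtype=float)
    n = len(d)
    if n < 3:
        return 0.0
    # excess[i, j, k] > 0 means the direct path i -> k is longer
    # than the detour i -> j -> k, which is impossible in Euclidean space
    excess = d[:, None, :] - (d[:, :, None] + d[None, :, :])
    idx = np.arange(n)
    i, j, k = idx[:, None, None], idx[None, :, None], idx[None, None, :]
    distinct = (i != j) & (j != k) & (i != k)
    return float((excess[distinct] > tol).mean())


# Distances derived from real coordinates are always consistent
coords = np.random.default_rng(0).normal(size=(6, 3))
consistent = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(triangle_violation_rate(consistent))  # 0.0

# A "predicted" matrix that no set of 3D points could produce
inconsistent = np.array([[0.0, 1.0, 5.0],
                         [1.0, 0.0, 1.0],
                         [5.0, 1.0, 0.0]])
print(triangle_violation_rate(inconsistent))  # 2 of 6 ordered triples violate
```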

##### Computational performance

To conclude, we touch on efficiency. Since OmegaFold does not require MSAs to achieve high-resolution performance, its overall runtime is much shorter than that of AlphaFold2, as well as that of the highly optimized ColabFold-AF2 (**Fig. 1C**). In summary, this study leverages a protein language model trained on unaligned sequences to predict protein structures from single amino acid sequences alone. It provides further evidence that evolutionary information may well be encoded in primary sequences, which can then be used as features for structure prediction.

### Alternative Methods

#### Quantum Approach

An alternative approach to protein folding is to use quantum computing. The objective of protein folding can be framed as finding the phi and psi angles at pivotal atoms along a protein chain. By representing those key points as qubits in a quantum system, it is possible to use hybrid quantum-classical algorithms to find those angle values. A more resource-efficient quantum approach was proposed to fold a polymer chain of N monomers on a lattice. This method shows promise because, with time, fault-tolerant quantum computers are expected to become capable of handling larger inputs. When that happens, it is argued that these computers would outscale classical computers, leading to polynomial or exponential reductions in compute time thanks to qubit superposition.

#### Diffusion-based Models

Another novel and interesting approach uses diffusion networks to create a generative model for protein design. This model, RoseTTAFold Diffusion (RFdiffusion), was created by fine-tuning the RoseTTAFold network on protein structure denoising tasks. It enables us to design diverse, complex, functional proteins from simple molecular specifications. In protein folding, the question is how amino acids interact with each other when they are part of the same chain, whereas in protein binding assessment and protein design, the question is which amino acids would be most suitable to interact with a known surface. There, one has to predict a surface that has a compatible topology and a favorable binding profile with respect to the existing surface. The underlying principle, i.e., amino acid interaction, is the same. Concretely, the difference lies in the restrictions on amino acid movement that are imposed by being linked through bonds.

## Practical

### Overview

In this talktorial we will be using the OmegaFold model. To do so:
- Set up OmegaFold locally by following the instructions [here](https://github.com/HeliXonProtein/OmegaFold).
- Use the `FASTA` file available in the `data` folder as input. We will predict the 3D structure of the sequence given in the file.
- Run the prediction. This produces a `PDB` file for each of the sequences in the input file, saved in the output directory we provide, in our case the `output` folder.

**Note**: The prediction confidence values are stored in place of the B-factors (`b_factors`) in the `PDB` files.
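Since the confidence values sit in the B-factor column, they can be read back with a few lines of plain Python. The sketch below relies only on the fixed-column layout of `ATOM` records in the PDB format. The helper name `read_ca_confidence` is our own, and the commented file path assumes the prediction from this talktorial has already been written to the `output` folder.

```python
def read_ca_confidence(pdb_text):
    """Return {(chain_id, residue_number): value} from the B-factor column
    of all C-alpha ATOM records in a PDB-formatted string."""
    confidence = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain ID,
        # 23-26 the residue number and 61-66 the B-factor
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            residue_number = int(line[22:26])
            confidence[(chain_id, residue_number)] = float(line[60:66])
    return confidence


# Usage, once a prediction exists (path as used later in this talktorial):
# with open("output/0th chain.pdb") as handle:
#     per_residue = read_ca_confidence(handle.read())
#     print(min(per_residue.values()), max(per_residue.values()))
```

Plotting these values along the sequence is a quick way to spot regions where the model is less certain.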

### Setup

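In [None]:
import os
# Option 1: install OmegaFold directly from GitHub with pip
!pip install git+https://github.com/HeliXonProtein/OmegaFold.git

In [None]:
# Option 2 (alternative to option 1): clone the repository and install from source
!git clone https://github.com/HeliXonProtein/OmegaFold
os.chdir('OmegaFold')
!python setup.py install
os.chdir('..')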

### Processing the input sequence

In [None]:
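# Predict a structure for every sequence in the FASTA file and write the PDB files to ./output (on the GPU)
!omegafold ./data/rcsb_pdb_6YJ1.fasta ./output --device cuda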
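Before starting a potentially long prediction, it can be useful to check what is actually in the input file. The following sketch parses FASTA text into a dictionary and lists any characters outside the 20 standard amino acid letters. The helper names are our own, and how OmegaFold itself treats non-standard letters is not covered here.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def unexpected_residues(sequence):
    """Return the sorted characters that are not standard amino acid letters."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Usage with the input file of this talktorial:
# with open("./data/rcsb_pdb_6YJ1.fasta") as handle:
#     for header, sequence in read_fasta(handle.read()).items():
#         print(header, len(sequence), unexpected_residues(sequence))
```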

### Visualizing the output prediction

In [None]:
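# nglview may exceed Jupyter's default IOPub data rate limit for larger structures.
# If the viewer stays empty, restart Jupyter from a terminal with a higher limit:
!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10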

In [8]:
import nglview
view = nglview.show_pdbid("6yj1")
view

NGLWidget()

In [9]:
view = nglview.show_file("output/0th chain.pdb")
view

NGLWidget()
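To go beyond a visual comparison of the two views, the predicted and experimental structures can be compared numerically, for example via the RMSD after optimal superposition. The sketch below implements the Kabsch algorithm with NumPy. It assumes two (N, 3) arrays of matching Cα coordinates; extracting and pairing the residues from the two PDB files is left out here, and the function name is our own.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two matched (N, 3) coordinate sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove translation by centering both point sets
    p_centered = p - p.mean(axis=0)
    q_centered = q - q.mean(axis=0)
    # The SVD of the covariance matrix yields the optimal rotation
    u, _, vt = np.linalg.svd(p_centered.T @ q_centered)
    # Flip one axis if needed so the result is a proper rotation, not a reflection
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rotated = p_centered @ rotation.T
    return float(np.sqrt(((p_rotated - q_centered) ** 2).sum(axis=1).mean()))


# Toy check: a rotated and shifted copy should give an RMSD of (almost) zero
rng = np.random.default_rng(0)
target = rng.normal(size=(10, 3))
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = target @ rot_z.T + np.array([1.0, 2.0, 3.0])
print(round(kabsch_rmsd(moved, target), 6))
```

Unlike LDDT or GDT, the RMSD is dominated by the worst-fitting regions, so a single misplaced loop can inflate it considerably.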

## Discussion

Using any of the SOTA structure prediction methods, it is possible to predict the 3D structure of a protein from a sequence of amino acids with high accuracy.

However, the story does not end here. While more sequencing data accumulates to feed MSA-based methods, OmegaFold fills the interim gap and, importantly, predicts structures for sequences for which MSAs are difficult to construct. It is expected that conceptual advances and further algorithmic development along these lines will continue to enable a wide spectrum of protein science applications, ranging from multi-state conformational sampling, where AlphaFold2 has had some success, to regions for which predictions are challenging, variant effect prediction, protein-protein interactions and protein docking. As important as it is to predict the fold, it is equally important to find protein conformations that bind to a binding partner of interest, or the other way around, since that is what dictates protein function. That way, we may be able to design effective drugs for a wide range of diseases or solve pressing challenges with bioengineering.

## Quiz

- Investigate the trade-off between computation time and average prediction quality by changing the number of cycles. How much worse does the prediction get?
- Trade computation time for memory consumption by changing the subbatch size. How drastic are the reductions?
- Search the Protein Data Bank (PDB) for another protein and attempt to fold it. Does the model perform as well, better or worse than expected? Compare to the example in Practical. More here: [PDB](https://www.rcsb.org/)
- Another similar task is to look up a protein from the ones published for CASP14. Predict its structure and compare the accuracy to that of AlphaFold2. More here: [CASP14](https://predictioncenter.org/casp14/)
- If you're up for a challenge, in a separate notebook, use AlphaFold2 or RoseTTAFold alongside OmegaFold to predict the same protein and compare the results. More here: [AlphaFold2's GitHub page](https://github.com/deepmind/alphafold) and [RoseTTAFold's GitHub page](https://github.com/RosettaCommons/RoseTTAFold)
- For more practice, and to appreciate how far the field has come, take on another challenge: implement an older method, perhaps one that uses a CNN or a physics-based one, to predict a protein structure. More here: [DeepCov](https://github.com/psipred/DeepCov)