\setbeamerfont{large}{size=\large}
Predictions of side chain chi angles as well as the final, per-residue accuracy of the structure (pLDDT) are computed with small per-residue networks on the final activations at the end of the network. The estimate of the TM-score (pTM) is obtained from a pairwise error prediction that is computed as a linear projection from the final pair representation. The final loss (which we term the frame-aligned point error (FAPE) (Fig. 3f)) compares the predicted atom positions to the true positions under many different alignments. For each alignment, defined by aligning the predicted frame $(R_k, t_k)$ to the corresponding true frame, we compute the distance of all predicted atom positions $x_i$ from the true atom positions. The resulting $N_{\text{frames}} \times N_{\text{atoms}}$ distances are penalized with a clamped L1-loss. This creates a strong bias for atoms to be correct relative to the local frame of each residue and hence correct with respect to its side chain interactions, as well as providing the main source of chirality for AlphaFold (Suppl. Methods 1.9.3 and Suppl. Fig. 9). cite:jumperHighlyAccurateProtein2021
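As a concrete illustration, a minimal numpy sketch of the clamped FAPE computation (the 10 Å clamp and 10 Å normalisation follow the supplement; array shapes and the function name are mine, not AlphaFold's code):

#+BEGIN_SRC python
import numpy as np

def fape(R_pred, t_pred, x_pred, R_true, t_true, x_true,
         d_clamp=10.0, eps=1e-4, Z=10.0):
    """Clamped frame-aligned point error.

    R_*: (N_frames, 3, 3) rotations; t_*: (N_frames, 3) translations;
    x_*: (N_atoms, 3) atom positions.
    """
    # Express every atom in the local coordinates of every frame:
    # x_local[k, a] = R_k^T (x_a - t_k)
    local_pred = np.einsum('kji,kaj->kai', R_pred,
                           x_pred[None, :, :] - t_pred[:, None, :])
    local_true = np.einsum('kji,kaj->kai', R_true,
                           x_true[None, :, :] - t_true[:, None, :])
    # Robust distance between corresponding local positions
    d = np.sqrt(np.sum((local_pred - local_true) ** 2, axis=-1) + eps)
    # Clamped L1 penalty over all N_frames x N_atoms distances
    return np.mean(np.minimum(d, d_clamp)) / Z
#+END_SRC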
This self-distillation procedure makes effective use of the unlabelled sequence data and significantly improves the accuracy of the resulting network. Additionally, we randomly mask out or mutate individual residues within the MSA and have a Bidirectional Encoder Representations from Transformers (BERT)-style [37] objective to predict the masked elements of the MSA sequences. This objective encourages the network to learn to interpret phylogenetic and covariation relationships without hardcoding a particular correlation statistic into the features. The BERT objective is trained jointly with the normal PDB structure loss on the same training examples and is not pre-trained, in contrast to recent independent work [38]. cite:jumperHighlyAccurateProtein2021
- rempellabs.com [coming soon]
- HBSc Computer Science
- BSc Mathematics
- Research Associate, AISC
- seeking opportunities in the field
Although all of the ideas in the model are doubtlessly clever, the main secret behind AlphaFold 2’s success is the superb deep learning engineering. A close look at the model reveals an architecture with a large amount of small details that seem fundamental for the performance of the network. As we admire the end product, we should not turn a blind eye to the enormous budget, and the large team of full-time, handsomely paid engineers that made it possible. cite:rubieraAlphaFoldHereWhat
This, and many other tricks, are described in exhaustive detail in the Supplementary Information. A reduced subset has been analysed in a brief ablation study, but ultimately, how important are each of the minor details is anybody’s guess. cite:rubieraAlphaFoldHereWhat
\flushright{(above blog post is recommended reading)}
- only certain metadata fields are kept (more is available from mmCIF than from other formats)
- MSE (selenomethionine) residues are changed to MET
For MSAs
- JackHMMER
- MGnify: MSA depth 5,000
- UniRef90: MSA depth 10,000
- HHBlits
- Uniclust30 + BFD: MSA depth unlimited
- MSAs deduplicated and stacked
flags:
JackHMMER: -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005
HHBlits: -n 3 -e 0.001 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -maxseq 1000000
- UniRef90 MSA from prior search used for PDB70 search using HHSearch.
- Filter out:
- released after the input sequence's structure
- or identical to the input sequence
- too small
- At inference use top 4 templates
- 75:25 self-distillation : known structure (PDB)
- stochastic filters (next)
- stochastic filters:
- Input mmCIFs are restricted to have resolution less than 9 Å. This is not a very restrictive filter and only removes around 0.2% of structures.
- Longer protein chains are selected with higher probability.
- Also favour protein chains from smaller clusters. They use 40% sequence-identity clusters of the Protein Data Bank, clustered with MMseqs2.
- Sequences are filtered out when any single amino acid accounts for more than 80% of the input primary sequence. This filter removes about 0.8% of sequences.
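A sketch of the weighted sampling implied by the two bullets above (the saturating length term is an illustrative choice here, not a claim about the paper's exact formula):

#+BEGIN_SRC python
import numpy as np

def chain_weight(n_res, cluster_size):
    """Unnormalised sampling weight for one training chain.

    Longer chains and chains from smaller clusters are favoured;
    the length term saturates so very long chains are not over-sampled.
    """
    length_term = max(min(n_res, 512), 256) / 512.0
    return length_term / cluster_size

def sample_chain(chains, rng=np.random):
    """chains: list of (chain_id, n_res, cluster_size) tuples."""
    w = np.array([chain_weight(n, c) for _, n, c in chains])
    return chains[rng.choice(len(chains), p=w / w.sum())]
#+END_SRC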
- Block deletion tends to remove whole phylogenetic branches of similar sequences and promotes diversity
- Similar sequences are likely to be adjacent in the MSA
- Contiguous blocks of sequences are deleted (sketched below)
- First MSAs are grouped by tool
- Then sorted according to tool defaults (usually e-value)
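A rough numpy sketch of block deletion (the 0.3 block fraction and 5 blocks follow the supplement's MSA block deletion algorithm; keeping the query row is my assumption):

#+BEGIN_SRC python
import numpy as np

def block_delete(msa, block_fraction=0.3, n_blocks=5, rng=np.random):
    """Delete n_blocks random contiguous blocks of rows from an MSA.

    msa: (N_seq, N_res) array, row 0 is the query and is always kept.
    """
    n_seq = msa.shape[0]
    block = int(block_fraction * n_seq)
    keep = np.ones(n_seq, dtype=bool)
    for _ in range(n_blocks):
        start = rng.randint(0, n_seq)       # uniform block start index
        keep[start:start + block] = False   # drop a contiguous block
    keep[0] = True                          # never drop the query sequence
    return msa[keep]
#+END_SRC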
- Similarity clusters used to randomly select subset of MSA sequences
- to reduce computational cost from the attention modules, reduce $N_{seq}$
- A modified k-means is used, with the input sequence as the first cluster center
Clustering Algorithm:
- $N_{clust}$ centers are selected from the MSA
- Generate a mask where each position is selected with probability $p = 0.15$
- Each masked position of each center is modified according to:
- p=0.1 replaced with a uniformly sampled random amino acid
- p=0.1 replaced with an amino acid sampled from the MSA profile
- p=0.1 no replacement
- p=0.7 replaced with a special token (masked_msa_token)
- remaining sequences are assigned to the nearest center by Hamming distance
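A numpy sketch of the centre selection, masking, and Hamming-distance assignment described above (the profile-sampling shortcut and function name are simplifications of mine):

#+BEGIN_SRC python
import numpy as np

def cluster_and_mask(msa, n_clust, mask_token, rng=np.random):
    """Select cluster centres and apply BERT-style masking to them.

    msa: (N_seq, N_res) integer-encoded alignment; row 0 is the query
    and always becomes the first cluster centre.
    """
    n_seq, n_res = msa.shape
    order = np.concatenate([[0], rng.permutation(np.arange(1, n_seq))])
    centres = msa[order[:n_clust]].copy()
    extra = msa[order[n_clust:]]

    # Each position is selected for modification with p = 0.15
    mask = rng.rand(*centres.shape) < 0.15
    r = rng.rand(*centres.shape)
    random_aa = rng.randint(0, 20, size=centres.shape)
    # Crude stand-in for sampling from the per-column MSA profile
    profile_aa = msa[rng.randint(0, n_seq, size=centres.shape),
                     np.arange(n_res)[None, :]]
    replacement = np.where(r < 0.1, random_aa,          # p = 0.1: uniform
                  np.where(r < 0.2, profile_aa,         # p = 0.1: profile
                  np.where(r < 0.3, centres,            # p = 0.1: unchanged
                           mask_token)))                # p = 0.7: mask token
    masked = np.where(mask, replacement, centres)

    # Assign remaining sequences to the nearest centre by Hamming distance
    hamming = (extra[:, None, :] != centres[None, :, :]).sum(-1)
    return masked, hamming.argmin(axis=1)
#+END_SRC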
During training:
- unclamped & clamped - sampling start index from uniform distributions
- Cropped with fixed size
$N_res$
- target_feat
This is a feature of size [$N_{res}$, 21] consisting of the “aatype” feature.
- residue_index
Positional encoding constant tensor. This is a feature of size [$N_{res}$] consisting of the “residue_index” feature.
- msa_feat
This is a feature of size [$N_{clust}$, $N_{res}$, 49] constructed by concatenating “cluster_msa”, “cluster_has_deletion”, “cluster_deletion_value”, “cluster_deletion_mean”, and “cluster_profile”. $N_{cycle} \times N_{ensemble}$ random samples are drawn from this feature to provide each recycling/ensembling iteration of the network with a different sample (see subsubsection 1.11.2).
- extra_msa_feat
This is a feature of size [$N_{extra\_seq}$, $N_{res}$, 25] constructed by concatenating “extra_msa”, “extra_msa_has_deletion”, and “extra_msa_deletion_value”. Together with “msa_feat” above, $N_{cycle} \times N_{ensemble}$ random samples are also drawn from this feature (see subsubsection 1.11.2).
- template_pair_feat
This is a feature of size [$N_{templ}$, $N_{res}$, $N_{res}$, 88] consisting of the concatenation of the pair residue features “template_distogram” and “template_unit_vector”, as well as several residue features transformed into pair features. The “template_aatype” feature is included via tiling and stacking (done twice, once in each residue direction). The mask features “template_pseudo_beta_mask” and “template_backbone_frame_mask” are also included, where the pair feature is $f_{ij} = \text{mask}_i \cdot \text{mask}_j$.
- template_angle_feat
This is a feature of size [$N_{templ}$, $N_{res}$, 51] constructed by concatenating the following features: “template_aatype”, “template_torsion_angles”, “template_alt_torsion_angles”, and “template_torsion_angles_mask”.
- Build dataset (on unlabeled sequences):
- Make MSA for every cluster in Uniclust30
- Remove sequences that appear in another sequence's MSA
- Keep sequences with 200 < length < 1024
- Remove sequences whose MSA has fewer than 200 alignments
- For predicted structures:
- train an ‘undistilled’ model on just the PDB dataset
- use this model to predict structures for the above set
- for every residue pair, compute a confidence metric using the KL-divergence between the predicted distance distribution and a reference distribution
- self-distillation training took roughly 2 weeks
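A minimal sketch of the per-pair confidence metric described above (assuming binned distance distributions; names are illustrative):

#+BEGIN_SRC python
import numpy as np

def pair_confidence(pred, ref, eps=1e-9):
    """KL-divergence between predicted distance distributions and a
    reference distribution, computed per residue pair.

    pred: (N_res, N_res, N_bins) predicted distograms (rows sum to 1),
    ref:  (N_bins,) reference distribution.
    """
    kl = np.sum(pred * (np.log(pred + eps) - np.log(ref + eps)), axis=-1)
    return kl  # (N_res, N_res); larger = further from the reference
#+END_SRC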
- AlphaFold receives input features derived from:
- the amino-acid sequence
- MSA
- templates (see subsubsection 1.2.9)
- output features:
- atom coordinates
- the distogram
- per-residue confidence scores.
- Recycling x3
- initial recycled inputs are zero
Algorithm 2 outlines the main steps (see also Fig. 1e and the corresponding description in the main article).
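A minimal sketch of the recycling loop (the 256/128 channel widths match the paper's MSA/pair representations, but `model` and its signature are stand-ins, not the real network):

#+BEGIN_SRC python
import numpy as np

def predict_with_recycling(model, features, n_cycles=4):
    """Run the network repeatedly, feeding each iteration's outputs
    (first-row MSA embedding, pair embedding, positions) back in.

    `model` is any callable with the assumed signature
    model(features, prev) -> dict of outputs.
    """
    n_res = features['target_feat'].shape[0]
    prev = {                                  # initial recycled inputs are zero
        'msa_first_row': np.zeros((n_res, 256)),
        'pair': np.zeros((n_res, n_res, 128)),
        'positions': np.zeros((n_res, 3)),
    }
    for _ in range(n_cycles):                 # 1 initial pass + 3 recycles
        outputs = model(features, prev)
        prev = {k: outputs[k] for k in prev}  # no gradients between cycles
    return outputs
#+END_SRC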
- cast as a graph inference problem
- cross-optimization and information flow between MSA representation and pair-wise representation
- layer normalization
- rotation + translation transforms $T_i := (R_i, t_i)$
- no reflection, scaling, or shear
- they construct ground-truth frames from the positions of three atoms in the ground-truth PDB structures via a Gram–Schmidt process (Algorithm 21)
Algorithm 21 Frame construction from ground truth atom positions cite:jumperHighlyAccurateProtein2021
Algorithm 25 Make a transformation that rotates around the x-axis cite:jumperHighlyAccurateProtein2021
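A numpy sketch of both algorithms, following the pseudocode in the supplement:

#+BEGIN_SRC python
import numpy as np

def rigid_from_3_points(x1, x2, x3):
    """Gram-Schmidt frame construction (cf. Suppl. Algorithm 21).

    For a backbone frame: x1 = N, x2 = C-alpha (the origin), x3 = C.
    Returns a rotation matrix R (columns e1, e2, e3) and translation t.
    """
    v1, v2 = x3 - x2, x1 - x2
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - e1 * np.dot(e1, v2)     # remove the e1 component
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)             # right-handed cross product fixes chirality
    return np.stack([e1, e2, e3], axis=-1), x2

def make_rot_x(alpha):
    """Rotation about the x-axis (cf. Suppl. Algorithm 25).

    alpha = (sin a, cos a), e.g. a normalised torsion-angle prediction.
    """
    sin_a, cos_a = alpha
    return np.array([[1.0, 0.0,    0.0],
                     [0.0, cos_a, -sin_a],
                     [0.0, sin_a,  cos_a]])
#+END_SRC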
- predicts backbone frames $T_i$ and torsion angles $\alpha^f_i$
- then computes atom coordinates by applying the torsion angles to the corresponding amino-acid structure with idealized bond angles and bond lengths
- We attach a local frame to each rigid group (see Table 2), such that the torsion axis is the x-axis, and store the ideal literature atom coordinates [97] for each amino acid relative to these frames in a table $\tilde{x}^{\text{lit}}_{r,f,a}$, where $r \in \{\text{ALA}, \text{ARG}, \text{ASN}, \dots\}$ denotes the residue type, $f \in S_{\text{torsion names}}$ denotes the frame, and $a$ the atom name. We further pre-compute rigid transformations that transform atom coordinates from each frame to the frame that is higher up in the hierarchy, e.g. $T^{\text{lit}}_{r,(\chi_2 \to \chi_1)}$ maps atoms in amino-acid type $r$ from the $\chi_2$-frame to the $\chi_1$-frame. As we are only predicting heavy atoms, the extra backbone rigid groups $\omega$ and $\phi$ do not contain atoms, but the corresponding frames contribute to the FAPE loss for alignment to the ground truth (like all other frames). cite:jumperHighlyAccurateProtein2021
- Iterative restrained energy minimization procedure:
- minimization of the AMBER99SB force field
- additional harmonic restraints (keep system near its input structure)
- restraints applied independently to heavy atoms, with a spring constant of 10 kcal/mol·Å²
- Once minimizer has converged, determine which residues still contain violations
- remove restraints from all atoms within these residues and perform restrained minimization once again
- process repeated until all violations are resolved.
- Note: In the CASP14 assessment only one iteration was used
- Full energy minimization and hydrogen placement were performed using the OpenMM simulation package
- tolerance of 2.39 kcal/mol and unlimited number of steps (OpenMM default values)
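A sketch of one restrained-minimisation round in OpenMM, following the recipe above (this is my reconstruction, not DeepMind's released relax code; the violation check and outer iteration loop are omitted):

#+BEGIN_SRC python
from openmm import CustomExternalForce, LangevinIntegrator, unit
from openmm.app import PDBFile, ForceField, Simulation, NoCutoff

def restrained_minimize(pdb_path):
    """One round of restrained AMBER99SB minimisation."""
    pdb = PDBFile(pdb_path)
    ff = ForceField('amber99sb.xml')
    system = ff.createSystem(pdb.topology, nonbondedMethod=NoCutoff)

    # Harmonic restraints keeping heavy atoms near their input positions,
    # spring constant 10 kcal/mol/A^2
    restraint = CustomExternalForce('0.5*k*((x-x0)^2 + (y-y0)^2 + (z-z0)^2)')
    restraint.addGlobalParameter(
        'k', 10 * unit.kilocalories_per_mole / unit.angstroms**2)
    for p in ('x0', 'y0', 'z0'):
        restraint.addPerParticleParameter(p)
    for atom, pos in zip(pdb.topology.atoms(), pdb.positions):
        if atom.element.symbol != 'H':          # heavy atoms only
            restraint.addParticle(atom.index, pos)
    system.addForce(restraint)

    sim = Simulation(pdb.topology, system,
                     LangevinIntegrator(300 * unit.kelvin,
                                        1 / unit.picosecond,
                                        0.002 * unit.picoseconds))
    sim.context.setPositions(pdb.positions)
    # OpenMM defaults: tolerance 10 kJ/mol (= 2.39 kcal/mol), unlimited steps
    sim.minimizeEnergy()
    return sim.context.getState(getPositions=True).getPositions()
#+END_SRC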
- weighted sum
- weighted to reduce importance of short sequences
- Side chain and backbone torsion angle loss (sec. 1.9.1)
- Frame aligned point error (FAPE) (sec. 1.9.2)
- Configurations with FAPE(X,Y) = 0 (sec. 1.9.4)
- Metric properties of FAPE (sec. 1.9.5)
- Chiral properties of AlphaFold and its loss (sec. 1.9.3)
- transforms $T_i$ are variant under reflection (see eqs. 11 to 17)
- atom positions are computed via backbone frames and $\chi$ angles
- Model confidence prediction (pLDDT) (sec. 1.9.6)
- TM-score prediction (sec. 1.9.7)
- Distogram prediction (sec. 1.9.8)
- Masked MSA prediction (sec. 1.9.9)
- “Experimentally resolved” prediction (sec. 1.9.10)
- Structural violations (sec. 1.9.11)
- A variant of the commonly used root-mean-squared deviation (RMSD) of atomic positions
- not invariant to reflections, preventing predictions of the wrong chirality. cite:rubieraAlphaFoldHereWhat, cite:jumperHighlyAccurateProtein2021
- a global superposition metric of $C_\alpha$ atoms (eqs. 31-33), approximated here (eqs. 34-36)
- probabilistic lower-bound maximum-of-expectation score (eqs. 37-38)
- approximated TM-score using a pairwise $C_\alpha$-based computation (the $e_{ij}$ matrix) and the above (see eq. 39 and adjacent text)
- the TM-score of any residue subset $D$ can be computed (eq. 40)
- the $e_{ij}$ matrix could also be used to estimate GDT, FAPE, and RMSD (not done)
- used for confident domain packing visualizations
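A numpy sketch of computing pTM from the binned pairwise error prediction (the $d_0$ expression is the standard TM-score normalisation; names and shapes are illustrative):

#+BEGIN_SRC python
import numpy as np

def predicted_tm(pair_probs, bin_centers):
    """pTM from the predicted pairwise error distribution (cf. eqs. 37-39).

    pair_probs: (N_res, N_res, N_bins) softmax over error bins e_ij,
    bin_centers: (N_bins,) bin centres in Angstrom.
    """
    n_res = pair_probs.shape[0]
    d0 = 1.24 * (max(n_res, 19) - 15) ** (1 / 3) - 1.8  # TM-score d0
    # Expected per-pair TM term: E[1 / (1 + (e_ij / d0)^2)]
    tm_per_bin = 1.0 / (1.0 + (bin_centers / d0) ** 2)
    expected = pair_probs @ tm_per_bin                   # (N_res, N_res)
    # Maximum-of-expectation: best residue as the alignment anchor
    return expected.mean(axis=1).max()
#+END_SRC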
- linearly project the pair representation to bins; sum directed edges, i.e. $z_{ij} + z_{ji}$
- 64 bins (2 Å to 22 Å)
- prediction targets $y^b_{ij}$: one-hot encoding over bins
- computed from ground-truth $C_\beta$ atoms of all residues (except glycine, which uses $C_\alpha$)
Cross-entropy distogram loss, averaged over all pairs (eq. 41).
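A numpy sketch of this loss (the binning convention is one reasonable reading of the 64-bin, 2-22 Å description; logits are assumed already symmetrised as $z_{ij} + z_{ji}$):

#+BEGIN_SRC python
import numpy as np

def distogram_loss(logits, cb_coords, n_bins=64, d_min=2.0, d_max=22.0):
    """Cross-entropy distogram loss averaged over all residue pairs.

    logits: (N_res, N_res, n_bins); cb_coords: (N_res, 3) ground-truth
    C-beta positions (C-alpha for glycine).
    """
    d = np.linalg.norm(cb_coords[:, None] - cb_coords[None, :], axis=-1)
    # 63 thresholds between 2 A and 22 A; the last bin catches d > 22 A
    edges = np.linspace(d_min, d_max, n_bins - 1)
    target = (d[..., None] > edges).sum(-1)          # bin index in [0, 63]
    # Log-softmax over bins, then pick out the target bin
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(logp, target[..., None], axis=-1)
    return nll.mean()
#+END_SRC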
Algorithm 32 Embedding of evoformer and structure module outputs for recycling cite:jumperHighlyAccurateProtein2021
- Adam: $lr = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$
- lr warm-up over the first $0.128 \cdot 10^6$ samples, then decayed by a factor of $0.95$ after $6.4 \cdot 10^6$ samples
- batch: 128
- gradient clipping by per-sample global norm, clipping value 0.1
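A sketch of the resulting schedule as a function of the optimiser step (assuming batch size 128 as above; a sketch of my reading, not extracted code):

#+BEGIN_SRC python
def learning_rate(step, batch_size=128, base_lr=1e-3,
                  warmup_samples=0.128e6, decay_samples=6.4e6, decay=0.95):
    """Linear warm-up followed by a single step decay."""
    samples = step * batch_size
    if samples < warmup_samples:
        return base_lr * samples / warmup_samples   # linear warm-up
    if samples > decay_samples:
        return base_lr * decay                      # decay by 0.95
    return base_lr
#+END_SRC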
They did well
Baseline for all ablation models: full model without noisy-student self-distillation. Ablations:
- With noisy-student self-distillation training
- No templates
- No raw MSA (use MSA pairwise frequencies)
- No triangles, biasing, or gating (use axial attention)
- No recycling
- No IPA (use direct projection)
- No IPA & no recycling
- No end-to-end structure gradients (keep auxiliary heads)
- No auxiliary distogram head
- No auxiliary masked MSA head
todo
They did well
todo
todo
\printbibliography