Highly accurate protein structure prediction with AlphaFold



START [0/2]


[.] include slide with repo, colab links

[.] train todo

Predictions of side chain chi angles as well as the final, per-residue accuracy of the structure (pLDDT) are computed with small per-residue networks on the final activations at the end of the network. The estimate of the TM-score (pTM) is obtained from a pairwise error prediction that is computed as a linear projection from the final pair representation. The final loss (that we term the frame-aligned point error (FAPE) (Fig. 3f)) compares the predicted atom positions to the true positions under many different alignments. For each alignment, defined by aligning the predicted frame $(R_k, t_k)$ to the corresponding true frame, we compute the distance of all predicted atom positions $x_i$ from the true atom positions. The resulting $N_{frames} × N_{atoms}$ distances are penalized with a clamped L1-loss. This creates a strong bias for atoms to be correct relative to the local frame of each residue and hence correct with respect to its side chain interactions, as well as providing the main source of chirality for AlphaFold (Suppl. Methods 1.9.3 and Suppl. Fig. 9).
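The FAPE description above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: frames are assumed to be `(R, t)` pairs with `R` a 3x3 nested list whose columns are the frame axes, and the defaults `d_clamp=10.0`, `length_scale=10.0` and `eps=1e-4` are assumptions chosen for the example rather than values taken from this document.

```python
import math

def to_local(frame, point):
    """Express a global point in the local coordinates of frame (R, t): R^T (x - t)."""
    rot, trans = frame
    diff = [point[k] - trans[k] for k in range(3)]
    # rot holds the frame axes as columns, so R^T diff dots each column with diff.
    return [sum(rot[r][c] * diff[r] for r in range(3)) for c in range(3)]

def fape(pred_frames, pred_atoms, true_frames, true_atoms,
         d_clamp=10.0, length_scale=10.0, eps=1e-4):
    """Clamped L1 error averaged over all N_frames x N_atoms pairs.

    d_clamp, length_scale and eps are illustrative assumptions.
    """
    total = 0.0
    for pred_frame, true_frame in zip(pred_frames, true_frames):
        for pred_atom, true_atom in zip(pred_atoms, true_atoms):
            local_pred = to_local(pred_frame, pred_atom)
            local_true = to_local(true_frame, true_atom)
            sq = sum((a - b) ** 2 for a, b in zip(local_pred, local_true))
            # Clamp each distance so that gross errors do not dominate the loss.
            total += min(math.sqrt(sq + eps), d_clamp)
    n_pairs = len(pred_frames) * len(pred_atoms)
    return total / (length_scale * n_pairs)
```

Because every atom is compared inside a local frame, moving the whole prediction by one global rotation and translation leaves the value unchanged, while a mirror image does not (a reflection cannot be absorbed into a proper rotation).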

Self-distillation makes effective use of the unlabelled sequence data and significantly improves the accuracy of the resulting network. Additionally, we randomly mask out or mutate individual residues within the MSA and have a Bidirectional Encoder Representations from Transformers (BERT)-style [37] objective to predict the masked elements of the MSA sequences. This objective encourages the network to learn to interpret phylogenetic and covariation relationships without hardcoding a particular correlation statistic into the features. The BERT objective is trained jointly with the normal PDB structure loss on the same training examples and is not pre-trained, in contrast to recent independent work [38].



CODE [0/0]


Supplemental figures


Model training and evaluation




Willy Rempel
  • HBSc Computer Science
  • BSc Mathematics
  • Research Associate, AISC
  • seeking opportunities in the field


Although all of the ideas in the model are doubtlessly clever, the main secret behind AlphaFold 2’s success is the superb deep learning engineering. A close look at the model reveals an architecture with a large amount of small details that seem fundamental for the performance of the network. As we admire the end product, we should not turn a blind eye to the enormous budget, and the large team of full-time, handsomely paid engineers that made it possible. cite:rubieraAlphaFoldHereWhat

This, and many other tricks, are described in exhaustive detail in the Supplementary Information. A reduced subset has been analysed in a brief ablation study, but ultimately, how important are each of the minor details is anybody’s guess. cite:rubieraAlphaFoldHereWhat

\flushright{(above blog post is recommended reading)}

Model Overview cite:jumperHighlyAccurateProtein2021


Data Pipeline

Initial Input: mmCIF or FASTA files



Parsing cite:jumperHighlyAccurateProtein2021

  • only certain metadata (more from mmCIF)
  • change MSE residues into MET

[.] Genetic Search cite:jumperHighlyAccurateProtein2021

For MSAs

  • JackHMMER
    • MGnify: MSA depth 5,000
    • UniRef90: MSA depth 10,000
  • HHBlits
    • Uniclust30 + BFD: MSA depth unlimited
  • MSAs are de-duplicated and stacked

JackHMMER: -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005

HHBlits: -n 3 -e 0.001 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -maxseq 1000000

Template Search cite:jumperHighlyAccurateProtein2021

  • UniRef90 MSA from prior search used for PDB70 search using HHSearch.
  • Filter out:
    • released after the input sequence
    • or identical to the input sequence
    • too small
  • At inference use top 4 templates
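The template filters listed above can be sketched as a simple pass over ranked hits. This is only an illustration: the `TemplateHit` record, the `filter_templates` function and the `min_len` threshold are hypothetical names introduced here, and the hits are assumed to arrive already sorted by HHSearch rank. The slide does not give a concrete "too small" cutoff, so the caller has to supply it.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TemplateHit:
    """Hypothetical record for one HHSearch hit against PDB70."""
    pdb_id: str
    sequence: str
    release_date: date

def filter_templates(hits, query_seq, query_date, min_len, max_templates=4):
    """Drop hits that are too new, identical to the query, or too small.

    hits are assumed to be sorted best-first; min_len is a caller-supplied
    assumption because no cutoff is stated in the slide.
    """
    kept = []
    for hit in hits:
        if hit.release_date > query_date:
            continue  # released after the input sequence
        if hit.sequence == query_seq:
            continue  # identical to the input sequence
        if len(hit.sequence) < min_len:
            continue  # too small
        kept.append(hit)
    # Keep only the top-ranked survivors (4 at inference).
    return kept[:max_templates]
```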

Training Data cite:jumperHighlyAccurateProtein2021

  • 75:25 self-distillation : known structure (PDB)
  • stochastic filters (next)

Filtering cite:jumperHighlyAccurateProtein2021

  • stochastic filters:
    • Input mmCIFs are restricted to have resolution less than 9 Å. This is not a very restrictive filter and only removes around 0.2% of structures.
    • Longer protein chains are selected with higher probability.
    • Also favour protein chains from smaller clusters. They use 40% sequence identity clusters of the Protein Data Bank clustered with MMSeqs2.
    • Sequences are filtered out when any single amino acid accounts for more than 80% of the input primary sequence. This filter removes about 0.8% of sequences.
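The two hard cut-offs in the list above (resolution and single-residue dominance) are easy to express directly. The sketch below covers only those two checks; the length-weighted and cluster-weighted sampling is not modelled because the slide does not state the sampling probabilities. The function name and argument names are hypothetical.

```python
from collections import Counter

def passes_filters(sequence, resolution,
                   max_resolution=9.0, max_single_aa_fraction=0.8):
    """Return True if a chain survives the resolution and composition filters."""
    if not sequence:
        return False
    # Resolution must be below 9 Å.
    if resolution >= max_resolution:
        return False
    # Reject chains where one amino acid makes up more than 80% of the sequence.
    most_common_count = Counter(sequence).most_common(1)[0][1]
    return most_common_count / len(sequence) <= max_single_aa_fraction
```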

MSA block deletion cite:jumperHighlyAccurateProtein2021

  • Block deletion tends to remove similarities (i.e. whole phylogenetic branches) and promote diversity
    • Similar sequences are likely to be adjacent
    • Contiguous blocks in MSAs are deleted.
    • First MSAs are grouped by tool
    • Then sorted according to tool defaults (usually e-value)
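A minimal sketch of contiguous block deletion on an already grouped and sorted MSA is shown below. The defaults `num_blocks=5` and `block_fraction=0.3` are assumptions for illustration, as is the convention that row 0 is the query sequence and must never be deleted.

```python
import random

def msa_block_deletion(msa, num_blocks=5, block_fraction=0.3, rng=random):
    """Delete a few contiguous blocks of rows from a sorted MSA.

    msa is a list of aligned sequences, row 0 is assumed to be the query.
    num_blocks and block_fraction are illustrative assumptions.
    """
    n_seq = len(msa)
    if n_seq == 0:
        return []
    block_size = int(block_fraction * n_seq)
    to_delete = set()
    for _ in range(num_blocks):
        start = rng.randrange(n_seq)
        # Blocks running past the end of the MSA are simply truncated.
        to_delete.update(range(start, min(start + block_size, n_seq)))
    to_delete.discard(0)  # always keep the query sequence
    return [row for i, row in enumerate(msa) if i not in to_delete]
```

Since similar sequences tend to sit next to each other after sorting, removing contiguous rows drops whole groups of near-duplicates rather than thinning every group evenly.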

Algorithm 1 MSA Block deletion cite:jumperHighlyAccurateProtein2021


MSA clustering cite:jumperHighlyAccurateProtein2021

  • Similarity clusters used to randomly select subset of MSA sequences
    • to reduce computational cost from attention modules, reduce $N_{seq}$
  • Modified K-means is used, with the input sequence used as first cluster center

Clustering Algorithm:

  1. $N_{clust}$ centers selected from MSA
  2. Generate a mask where p=0.15 that any position is selected by the mask
  3. Each center is modified for each mask selected residue according to:
    1. p=0.1 replaced with a uniformly sampled random amino acid
    2. p=0.1 replaced with an amino acid sampled from the MSA profile
    3. p=0.1 no replacement
    4. p=0.7 replaced with a special token (masked_msa_token)
  4. Remaining sequences are assigned to their closest center by Hamming distance
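The masking and assignment steps above can be sketched as follows. This is a simplified illustration: the alphabet is reduced to the 20 standard amino acids, `MASK_TOKEN` is a stand-in character for `masked_msa_token`, and `profile` is assumed to be a list of per-position dictionaries mapping residue letters to frequencies. All function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # simplified alphabet (assumption)
MASK_TOKEN = "#"                      # stand-in for masked_msa_token

def mask_sequence(seq, profile, rng, mask_prob=0.15):
    """Apply BERT-style corruption to one cluster center."""
    out, mask = [], []
    for pos, aa in enumerate(seq):
        selected = rng.random() < mask_prob
        mask.append(selected)
        if not selected:
            out.append(aa)
            continue
        u = rng.random()
        if u < 0.1:    # p=0.1: uniformly sampled random amino acid
            out.append(rng.choice(AMINO_ACIDS))
        elif u < 0.2:  # p=0.1: amino acid sampled from the MSA profile
            letters, weights = zip(*profile[pos].items())
            out.append(rng.choices(letters, weights=weights)[0])
        elif u < 0.3:  # p=0.1: no replacement
            out.append(aa)
        else:          # p=0.7: special mask token
            out.append(MASK_TOKEN)
    return "".join(out), mask

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_to_centers(centers, others):
    """Index of the closest center (by Hamming distance) for each remaining sequence."""
    return [min(range(len(centers)), key=lambda c: hamming(centers[c], s))
            for s in others]
```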

Residue cropping cite:jumperHighlyAccurateProtein2021

During training:

  1. Crop start index is sampled from a uniform distribution; the sampling range differs between the unclamped and clamped loss modes
  2. All residue-indexed features are cropped to a fixed size $N_{res}$
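As a rough sketch of the cropping step, the snippet below samples a start index uniformly and takes a contiguous window. It deliberately ignores the difference between the clamped and unclamped modes, whose exact sampling ranges are not given here; the function names are hypothetical.

```python
import random

def crop_start(n_full, n_res, rng):
    """Sample a start index so that a window of n_res residues fits in the chain."""
    if n_full <= n_res:
        return 0  # short chains are kept whole
    return rng.randint(0, n_full - n_res)  # inclusive on both ends

def crop_features(per_residue, n_res, rng):
    """Cut a contiguous window of at most n_res entries from a per-residue list."""
    start = crop_start(len(per_residue), n_res, rng)
    return per_residue[start:start + n_res]
```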

[.] Featurization and model inputs cite:jumperHighlyAccurateProtein2021

  • target_feat
    This is a feature of size [Nres, 21] consisting of the “aatype” feature.
  • residue_index
    Positional encoding constant tensor. This is a feature of size [Nres] consisting of the “residue_index” feature.
  • msa_feat
    This is a feature of size [Nclust, Nres, 49] constructed by concatenating “cluster_msa”, “cluster_has_deletion”, “cluster_deletion_value”, “cluster_deletion_mean”, “cluster_profile”. We draw Ncycle×Nensemble random samples from this feature to provide each recycling/ensembling iteration of the network with a different sample (see subsubsection 1.11.2).
  • extra_msa_feat
    This is a feature of size [Nextra_seq, Nres, 25] constructed by concatenating “extra_msa”, “extra_msa_has_deletion”, “extra_msa_deletion_value”. Together with “msa_feat” above we also draw Ncycle × Nensemble random samples from this feature (see subsubsection 1.11.2).

[.] Featurization and model inputs cite:jumperHighlyAccurateProtein2021

  • template_pair_feat
    This is a feature of size [Ntempl, Nres, Nres, 88] and consists of concatenation of the pair residue features “template_distogram”, “template_unit_vector”, and also several residue features, which are transformed into pair features. The “template_aatype” feature is included via tiling and stacking (this is done twice, in both residue directions). Also the mask features “template_pseudo_beta_mask” and “template_backbone_frame_mask” are included, where the feature $f_{ij} = mask_i \cdot mask_j$.
  • template_angle_feat
    This is a feature of size [Ntempl, Nres, 51] constructed by concatenating the following features: “template_aatype”, “template_torsion_angles”, “template_alt_torsion_angles”, and “template_torsion_angles_mask”.

Table 1 Input Features (1/2) cite:jumperHighlyAccurateProtein2021


Table 1 Input Features (2/2) cite:jumperHighlyAccurateProtein2021


Self-distillation dataset cite:jumperHighlyAccurateProtein2021

  • Build dataset (on unlabeled sequences):
    1. Make MSA for every cluster in Uniclust30
    2. Remove sequences that appear in another sequence's MSA
    3. Keep sequences of 200 < length < 1024
    4. Remove sequences where MSA < 200 alignments
  • For predicted structures:
    • train an ‘undistilled’ model on just the PDB dataset
    • use this model to predict structures for the above set
    • for every residue pair, compute a confidence metric using the KL-divergence between the predicted distance distribution and a reference distribution
    • reference distribution
  • self-distillation training took roughly 2 weeks
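The dataset filters and the KL-based confidence metric above can be sketched as below. The reference distribution is not specified on this slide, so it is simply passed in as an argument; the function names and the small `eps` used for numerical safety are assumptions.

```python
import math

def keep_for_distillation(sequence, msa_depth,
                          min_len=200, max_len=1024, min_msa_depth=200):
    """Length and MSA-depth filters for the self-distillation set."""
    return min_len < len(sequence) < max_len and msa_depth >= min_msa_depth

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same distance bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def pair_confidence(predicted_bins, reference_bins):
    """Confidence of one residue pair as the divergence from the reference."""
    return kl_divergence(predicted_bins, reference_bins)
```

Under this reading, a sharply peaked predicted distribution diverges strongly from a broad reference and scores as confident, while a prediction that looks like the reference scores near zero.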

AlphaFold Inference

AlphaFold Inference cite:jumperHighlyAccurateProtein2021

  • AlphaFold receives input features derived from:
    • the amino-acid sequence
    • MSA
    • templates (see subsubsection 1.2.9)
  • outputs:
    • atom coordinates
    • the distogram
    • per-residue confidence scores.
  • Recycling x3
    • initial recycled inputs are zero

Algorithm 2 outlines the main steps (see also Fig 1e and the corresponding description in the main article).
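The recycling loop can be pictured with a small generic driver. This is a sketch under assumptions: `model_step` is a hypothetical callable that takes the input features plus the recycled state and returns `(outputs, new_recycled_state)`, and `n_passes` is left as a parameter because the slide does not say whether "x3" counts the first pass.

```python
def run_with_recycling(model_step, features, zero_state, n_passes=3):
    """Run the same network several times, feeding its outputs back in.

    model_step(features, recycled) -> (outputs, new_recycled) is hypothetical.
    The first pass sees an all-zero recycled state.
    """
    recycled = zero_state
    outputs = None
    for _ in range(n_passes):
        outputs, recycled = model_step(features, recycled)
    return outputs

# Toy usage: the "model" just adds the recycled value to its input.
toy_step = lambda feats, recycled: (feats + recycled, feats + recycled)
result = run_with_recycling(toy_step, 1, 0, n_passes=3)
```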

Algorithm 2 Model Inference cite:jumperHighlyAccurateProtein2021



AlphaFold Training cite:jumperHighlyAccurateProtein2021


Model Architecture

Input embeddings cite:jumperHighlyAccurateProtein2021


Algorithm 3 Input embeddings cite:jumperHighlyAccurateProtein2021


Algorithm 4 Relative positional encoding cite:jumperHighlyAccurateProtein2021


Algorithm 5 One-hot encoding with nearest bin cite:jumperHighlyAccurateProtein2021




EvoFormer: Overview cite:jumperHighlyAccurateProtein2021

  • cast as a graph inference problem
  • cross-optimization and information flow between MSA representation and pair-wise representation
  • layer normalization

Algorithm 6 EvoFormer stack cite:jumperHighlyAccurateProtein2021


EvoFormer: Row wise Gated Attention cite:jumperHighlyAccurateProtein2021


Algorithm 7 Row wise Gated Attention cite:jumperHighlyAccurateProtein2021


EvoFormer: Column wise Gated Attention cite:jumperHighlyAccurateProtein2021


Algorithm 8 Column wise Gated Attention cite:jumperHighlyAccurateProtein2021


EvoFormer: MSA Translation Layer cite:jumperHighlyAccurateProtein2021


Algorithm 9 MSA Translation Layer cite:jumperHighlyAccurateProtein2021


EvoFormer: Outer-Product Mean cite:jumperHighlyAccurateProtein2021


Algorithm 10 Outer-Product Mean cite:jumperHighlyAccurateProtein2021


EvoFormer: Residue Pairs cite:jumperHighlyAccurateProtein2021



EvoFormer: Triangular Multiplicative Update cite:jumperHighlyAccurateProtein2021


Algorithm 11 Triangular Multiplicative Update: outward cite:jumperHighlyAccurateProtein2021


Algorithm 12 Triangular Multiplicative Update: inward cite:jumperHighlyAccurateProtein2021


EvoFormer: Triangular Self-Attention cite:jumperHighlyAccurateProtein2021


Algorithm 13 Triangular Self-Attention: start cite:jumperHighlyAccurateProtein2021


Algorithm 14 Triangular Self-Attention: end cite:jumperHighlyAccurateProtein2021


Algorithm 15 Transition layer in the pair stack cite:jumperHighlyAccurateProtein2021


Algorithm 16 Template pair stack cite:jumperHighlyAccurateProtein2021


Algorithm 17 Template pointwise attention cite:jumperHighlyAccurateProtein2021


Algorithm 18 Extra MSA stack cite:jumperHighlyAccurateProtein2021


Algorithm 19 MSA global column-wise gated self-attention cite:jumperHighlyAccurateProtein2021


Structure Module

Structure Module: Overview cite:jumperHighlyAccurateProtein2021


Structure Module: Frame Representation


  • rotation + translation transforms $T_i := (R_i,t_i)$
  • no reflection, scaling, or shear
  • they construct ground truth frames using the position of three atoms from the ground truth PDB structures using a Gram–Schmidt process (Algorithm 21)

Structure Module: Invariant point attention (IPA) cite:jumperHighlyAccurateProtein2021


Structure Module: Algorithm Part 1 cite:jumperHighlyAccurateProtein2021


Structure Module: Algorithm Part 2 cite:jumperHighlyAccurateProtein2021


Structure Module: Algorithm Part 3 cite:jumperHighlyAccurateProtein2021


Table 2 Rigid atomic groups from torsion angles cite:jumperHighlyAccurateProtein2021


Algorithm 21 Frame construction from ground truth atom positions cite:jumperHighlyAccurateProtein2021


Algorithm 22 Invariant point attention (IPA) cite:jumperHighlyAccurateProtein2021


Algorithm 23 Backbone update cite:jumperHighlyAccurateProtein2021


Algorithm 24 Compute all atom coordinates cite:jumperHighlyAccurateProtein2021


Algorithm 25 Make a transformation that rotates around the x-axis cite:jumperHighlyAccurateProtein2021


Table 3 Ambiguous atom renaming swaps cite:jumperHighlyAccurateProtein2021


Algorithm 26 Rename symmetric ground truth atoms cite:jumperHighlyAccurateProtein2021


Distograms cite:jumperHighlyAccurateProtein2021


Structure Module: Output

  • predicts backbone frames $T_i$ and torsion angles $α^f_i$
  • then computes atom coordinates by applying the torsion angles to the corresponding amino acid structure with idealized bond angles and bond lengths.
  • We attach a local frame to each rigid group (see Table 2), such that the torsion axis is the x-axis, and store the ideal literature atom coordinates [97] for each amino acid relative to these frames in a table $\vec{x}^{lit}_{r,f,a}$, where $r ∈ \{ALA, ARG, ASN, …\}$ denotes the residue type, $f ∈ S_{torsion names}$ denotes the frame and $a$ the atom name.

We further pre-compute rigid transformations that transform atom coordinates from each frame to the frame that is higher up in the hierarchy. E.g. $T^{lit}_{r,(χ_2 → χ_1)}$ maps atoms in amino-acid type $r$ from the $χ_2$-frame to the $χ_1$-frame. As we are only predicting heavy atoms, the extra backbone rigid groups ω and φ do not contain atoms, but the corresponding frames contribute to the FAPE loss for alignment to the ground truth (like all other frames). cite:jumperHighlyAccurateProtein2021

Amber Relaxation cite:jumperHighlyAccurateProtein2021

  • Iterative restrained energy minimization procedure:
    • minimization of the AMBER99SB force field
      • additional harmonic restraints (keep system near its input structure)
      • restraints applied independently to heavy atoms, with a spring constant of 10 kcal mol⁻¹ Å⁻²
  • Once minimizer has converged, determine which residues still contain violations
    • remove restraints from all atoms within these residues and perform restrained minimization once again
    • process repeated until all violations are resolved.
  • Note: In the CASP14 assessment only one iteration was used
  • Full energy minimization and hydrogen placement were performed using the OpenMM simulation package
    • tolerance of 2.39 kcal/mol and unlimited number of steps (OpenMM default values)

Loss Functions

Loss Functions cite:jumperHighlyAccurateProtein2021


  • weighted sum
  • weighted to reduce importance of short sequences

Loss Functions & Auxiliary Heads cite:jumperHighlyAccurateProtein2021

  1. Side chain and backbone torsion angle loss (sec. 1.9.1)
  2. Frame aligned point error (FAPE) (sec. 1.9.2)
    • Configurations with FAPE(X,Y) = 0 (sec. 1.9.4)
    • Metric properties of FAPE (sec. 1.9.5)
  3. Chiral properties of AlphaFold and its loss (sec. 1.9.3)
    • transforms $T_i$ are not invariant under reflection (see eq. 11 to 17)
    • atom positions via backbone frames and $χ$ angles
  4. Model confidence prediction (pLDDT) (sec. 1.9.6)
  5. TM-score prediction (sec. 1.9.7)
  6. Distogram prediction (sec. 1.9.8)
  7. Masked MSA prediction (sec. 1.9.9)
  8. “Experimentally resolved” prediction (sec. 1.9.10)
  9. Structural violations (sec. 1.9.11)

Algorithm 27 Side chain and backbone torsion angle loss cite:jumperHighlyAccurateProtein2021


Loss Functions: FAPE


  • Variation of commonly used root-mean-squared deviation (RMSD) of atomic positions
  • not invariant to reflections, preventing proteins of the wrong chirality. cite:rubieraAlphaFoldHereWhat, cite:jumperHighlyAccurateProtein2021

Algorithm 29 Predict model confidence pLDDT cite:jumperHighlyAccurateProtein2021


TM-score prediction cite:jumperHighlyAccurateProtein2021

  • Global superposition metric of $C_α$ atoms (eqs. 31 - 33)
    1. approximated (eqs. 34-36)
    2. Probabilistic lower-bound maximum-of-expectation score (eqs. 37-38)
    3. approximated TM-score using pairwise $C_α$ based computation ($e_{ij}$ matrix) and above (see eq. 39 and adjacent text)
    4. TM-score of any residue subset $D$ can be computed (eq. 40)
      • the $e_{ij}$ matrix can also be used to estimate GDT, FAPE, RMSD (not done)
      • used for confident domain packing visualizations
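The maximum-of-expectation idea behind the predicted TM-score can be sketched as follows. The inputs are assumptions for illustration: `error_probs[i][j]` is taken to be a probability distribution over error bins for residue `j` when aligned on the frame of residue `i`, and `bin_centers` gives a representative error (in Å) for each bin. The `max(n_res, 19)` clamp inside `tm_d0` is likewise an assumption to keep the normalisation distance positive for short chains.

```python
def tm_d0(n_res):
    """TM-score normalisation distance; the clamp at 19 residues is an assumption."""
    return 1.24 * (max(n_res, 19) - 15) ** (1.0 / 3.0) - 1.8

def tm_term(error, d0):
    """Per-residue TM-score contribution for a given alignment error."""
    return 1.0 / (1.0 + (error / d0) ** 2)

def predicted_tm(error_probs, bin_centers):
    """Maximum over alignment frames of the expected mean TM-score term."""
    n_res = len(error_probs)
    d0 = tm_d0(n_res)
    best = 0.0
    for i in range(n_res):
        expected = sum(
            sum(p * tm_term(c, d0) for p, c in zip(error_probs[i][j], bin_centers))
            for j in range(n_res)
        )
        best = max(best, expected / n_res)
    return best
```

Restricting both loops to a residue subset $D$ (and using its size in `tm_d0`) gives the per-domain variant mentioned in item 4.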

Distogram prediction cite:jumperHighlyAccurateProtein2021

  • linearly project the pair representation to bin logits and symmetrize by summing directed edges, i.e. $z_{ij} + z_{ji}$
  • 64 bins (2 Å to 22 Å)
  • prediction targets: $y^b_{ij}$, one-hot encoding over bins
    • from ground-truth $C_β$ atom positions of all residues (for glycine, $C_α$ is used)
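A small sketch of the binning and the averaged cross-entropy is given below. The exact bin edges are an assumption: 64 equal-width bins over 2 Å to 22 Å, with distances outside that range falling into the first or last bin. The function names are hypothetical, and `pred_probs[i][j]` is assumed to be an already normalised probability vector over bins.

```python
import math

def distance_bin(distance, n_bins=64, d_min=2.0, d_max=22.0):
    """Index of the distance bin; out-of-range values go to the edge bins."""
    width = (d_max - d_min) / n_bins
    idx = int((distance - d_min) // width)
    return max(0, min(n_bins - 1, idx))

def one_hot(index, n_bins=64):
    """One-hot target vector y^b_ij for a single residue pair."""
    return [1.0 if b == index else 0.0 for b in range(n_bins)]

def distogram_loss(distances, pred_probs, eps=1e-12):
    """Cross-entropy between one-hot targets and predictions, averaged over all pairs."""
    n_res = len(distances)
    total = 0.0
    for i in range(n_res):
        for j in range(n_res):
            target = distance_bin(distances[i][j])
            # With a one-hot target, cross-entropy reduces to -log p[target].
            total += -math.log(pred_probs[i][j][target] + eps)
    return total / (n_res * n_res)
```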

Figure: cross-entropy distogram loss averaged over all pairs (eq. 41).

[.] Masked MSA prediction cite:jumperHighlyAccurateProtein2021

[.] “Experimentally resolved” prediction (fine tuning) cite:jumperHighlyAccurateProtein2021

[.] Structural violations (fine tuning) cite:jumperHighlyAccurateProtein2021

Training & Inference Details

Recycling iterations

Algorithm 30 Generic recycling inference procedure cite:jumperHighlyAccurateProtein2021


Algorithm 31 Generic recycling training procedure cite:jumperHighlyAccurateProtein2021


Algorithm 32 Embedding of evoformer and structure module outputs for recycling cite:jumperHighlyAccurateProtein2021


[.] Training stages cite:jumperHighlyAccurateProtein2021

[.] MSA resampling and ensembling cite:jumperHighlyAccurateProtein2021

Optimization details cite:jumperHighlyAccurateProtein2021

  • Adam: $lr = 10^{-3}, β_1 = 0.9, β_2 = 0.999, ε = 10^{-6}$
    • linear lr warm-up over the first $0.128 \cdot 10^6$ samples, then lr decreased by a factor of 0.95 after $6.4 \cdot 10^6$ samples
  • batch size: 128
  • gradient clipping by global norm, applied to the parameter gradients of each training sample independently, with a clipping value of 0.1
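Per-sample clipping by global norm can be sketched as below. Gradients are represented as a list of flat lists (one per parameter tensor), which is a simplification for illustration; the function names are hypothetical.

```python
import math

def clip_by_global_norm(grads, clip_value=0.1):
    """Rescale all gradient tensors of one sample so their joint norm is at most clip_value."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, clip_value / global_norm) if global_norm > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads]

def clipped_batch_gradient(per_sample_grads, clip_value=0.1):
    """Clip each sample's gradients independently, then average over the batch."""
    clipped = [clip_by_global_norm(g, clip_value) for g in per_sample_grads]
    n = len(clipped)
    return [
        [sum(sample[t][k] for sample in clipped) / n
         for k in range(len(clipped[0][t]))]
        for t in range(len(clipped[0]))
    ]
```

Clipping before averaging bounds the influence of any single training example on the update, which is different from clipping the already averaged batch gradient.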

[.] Parameters initialization cite:jumperHighlyAccurateProtein2021

[.] Loss clamping details cite:jumperHighlyAccurateProtein2021

[.] Dropout details cite:jumperHighlyAccurateProtein2021

[.] Evaluator setup cite:jumperHighlyAccurateProtein2021

[.] Reducing the memory consumption cite:jumperHighlyAccurateProtein2021


[.] CASP14 Assessment cite:jumperHighlyAccurateProtein2021

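They did well: in CASP14 AlphaFold reached a median backbone accuracy of 0.96 Å r.m.s.d.95, against 2.8 Å for the next-best method.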

Ablation Studies cite:jumperHighlyAccurateProtein2021

Baseline for all ablation models: full model without noisy-student self-distillation.

Ablations:

  1. With noisy-student self-distillation training
  2. No templates
  3. No raw MSA (use MSA pairwise frequencies)
  4. No triangles, biasing, or gating (use axial attention)
  5. No recycling
  6. No IPA (use direct projection)
  7. No IPA & no recycling
  8. No end-to-end structure gradients (keep auxiliary heads)
  9. No auxiliary distogram head
  10. No auxiliary masked MSA head

Ablation Results (in main paper) cite:jumperHighlyAccurateProtein2021


Ablation Results cite:jumperHighlyAccurateProtein2021


[.] Network Probing cite:jumperHighlyAccurateProtein2021


[.] Novel Folds

They did well

[.] Attention Visualization cite:jumperHighlyAccurateProtein2021


[.] Additional Results cite:jumperHighlyAccurateProtein2021



