\setbeamerfont{large}{size=\large}
Predictions of side chain chi angles as well as the final, per-residue accuracy of the structure (pLDDT) are computed with small per-residue networks on the final activations at the end of the network. The estimate of the TM-score (pTM) is obtained from a pairwise error prediction that is computed as a linear projection from the final pair representation. The final loss (which we term the frame-aligned point error (FAPE) (Fig. 3f)) compares the predicted atom positions to the true positions under many different alignments. For each alignment, defined by aligning the predicted frame $(R_k, t_k)$ to the corresponding true frame, we compute the distance of all predicted atom positions $x_i$ from the true atom positions. The resulting $N_{\text{frames}} \times N_{\text{atoms}}$ distances are penalized with a clamped L1-loss. This creates a strong bias for atoms to be correct relative to the local frame of each residue and hence correct with respect to its side chain interactions, as well as providing the main source of chirality for AlphaFold (Suppl. Methods 1.9.3 and Suppl. Fig. 9). cite:jumperHighlyAccurateProtein2021
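As a concrete illustration, a minimal numpy sketch of the clamped FAPE computation (the 10 Å clamp and 10 Å normalisation follow the supplement; array shapes and the function name are mine, not AlphaFold's code):

#+BEGIN_SRC python
import numpy as np

def fape(R_pred, t_pred, x_pred, R_true, t_true, x_true,
         d_clamp=10.0, eps=1e-4, Z=10.0):
    """Clamped frame-aligned point error.

    R_*: (N_frames, 3, 3) rotations; t_*: (N_frames, 3) translations;
    x_*: (N_atoms, 3) atom positions.
    """
    # Express every atom in the local coordinates of every frame:
    # x_local[k, a] = R_k^T (x_a - t_k)
    local_pred = np.einsum('kji,kaj->kai', R_pred,
                           x_pred[None, :, :] - t_pred[:, None, :])
    local_true = np.einsum('kji,kaj->kai', R_true,
                           x_true[None, :, :] - t_true[:, None, :])
    # Robust distance between corresponding local positions
    d = np.sqrt(np.sum((local_pred - local_true) ** 2, axis=-1) + eps)
    # Clamped L1 penalty over all N_frames x N_atoms distances
    return np.mean(np.minimum(d, d_clamp)) / Z
#+END_SRC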
This self-distillation procedure makes effective use of the unlabelled sequence data and significantly improves the accuracy of the resulting network. Additionally, we randomly mask out or mutate individual residues within the MSA and have a Bidirectional Encoder Representations from Transformers (BERT)-style [37] objective to predict the masked elements of the MSA sequences. This objective encourages the network to learn to interpret phylogenetic and covariation relationships without hardcoding a particular correlation statistic into the features. The BERT objective is trained jointly with the normal PDB structure loss on the same training examples and is not pre-trained, in contrast to recent independent work [38]. cite:jumperHighlyAccurateProtein2021
- rempellabs.com [coming soon]
- HBSc Computer Science
- BSc Mathematics
- Research Associate, AISC
- seeking opportunities in the field
Although all of the ideas in the model are doubtlessly clever, the main secret behind AlphaFold 2’s success is the superb deep learning engineering. A close look at the model reveals an architecture with a large amount of small details that seem fundamental for the performance of the network. As we admire the end product, we should not turn a blind eye to the enormous budget, and the large team of full-time, handsomely paid engineers that made it possible. cite:rubieraAlphaFoldHereWhat
This, and many other tricks, are described in exhaustive detail in the Supplementary Information. A reduced subset has been analysed in a brief ablation study, but ultimately, how important are each of the minor details is anybody’s guess. cite:rubieraAlphaFoldHereWhat
\flushright{(above blog post is recommended reading)}
- only certain metadata fields are kept (more is available from mmCIF than from other formats)
- MSE (selenomethionine) residues are changed to MET
For MSAs
- JackHMMER
- MGnify: MSA depth 5,000
- UniRef90: MSA depth 10,000
- HHBlits
- Uniclust30 + BFD: MSA depth unlimited
- MSAs deduplicated and stacked
flags:
JackHMMER: -N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005
HHBlits: -n 3 -e 0.001 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -maxseq 1000000
- UniRef90 MSA from prior search used for PDB70 search using HHSearch.
- Filter out:
- released after the input sequence's structure
- or identical to the input sequence
- too small
- At inference use top 4 templates
- 75:25 self-distillation : known structure (PDB)
- stochastic filters (next)
- stochastic filters:
- Input mmCIFs are restricted to have resolution less than 9 Å. This is not a very restrictive filter and only removes around 0.2% of structures.
- Longer protein chains are selected with higher probability.
- Also favour protein chains from smaller clusters. They use 40% sequence-identity clusters of the Protein Data Bank, clustered with MMseqs2.
- Sequences are filtered out when any single amino acid accounts for more than 80% of the input primary sequence. This filter removes about 0.8% of sequences.
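A sketch of the weighted sampling implied by the two bullets above (the saturating length term is an illustrative choice here, not a claim about the paper's exact formula):

#+BEGIN_SRC python
import numpy as np

def chain_weight(n_res, cluster_size):
    """Unnormalised sampling weight for one training chain.

    Longer chains and chains from smaller clusters are favoured;
    the length term saturates so very long chains are not over-sampled.
    """
    length_term = max(min(n_res, 512), 256) / 512.0
    return length_term / cluster_size

def sample_chain(chains, rng=np.random):
    """chains: list of (chain_id, n_res, cluster_size) tuples."""
    w = np.array([chain_weight(n, c) for _, n, c in chains])
    return chains[rng.choice(len(chains), p=w / w.sum())]
#+END_SRC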
- Block deletion tends to remove whole phylogenetic branches of similar sequences and promotes diversity
- Similar sequences are likely to be adjacent in the MSA
- Contiguous blocks of sequences are deleted (sketched below)
- First MSAs are grouped by tool
- Then sorted according to tool defaults (usually e-value)
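A rough numpy sketch of block deletion (the 0.3 block fraction and 5 blocks follow the supplement's MSA block deletion algorithm; keeping the query row is my assumption):

#+BEGIN_SRC python
import numpy as np

def block_delete(msa, block_fraction=0.3, n_blocks=5, rng=np.random):
    """Delete n_blocks random contiguous blocks of rows from an MSA.

    msa: (N_seq, N_res) array, row 0 is the query and is always kept.
    """
    n_seq = msa.shape[0]
    block = int(block_fraction * n_seq)
    keep = np.ones(n_seq, dtype=bool)
    for _ in range(n_blocks):
        start = rng.randint(0, n_seq)       # uniform block start index
        keep[start:start + block] = False   # drop a contiguous block
    keep[0] = True                          # never drop the query sequence
    return msa[keep]
#+END_SRC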
- Similarity clusters used to randomly select subset of MSA sequences
- to reduce computational cost from the attention modules, reduce $N_{seq}$
- A modified k-means is used, with the input sequence as the first cluster center
Clustering Algorithm:
- $N_{clust}$ centers are selected from the MSA
- Generate a mask where each position is selected with probability $p = 0.15$
- Each masked position of each center is modified according to:
- p=0.1 replaced with a uniformly sampled random amino acid
- p=0.1 replaced with an amino acid sampled from the MSA profile
- p=0.1 no replacement
- p=0.7 replaced with a special token (masked_msa_token)
- remaining sequences are assigned to the nearest center by Hamming distance
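A numpy sketch of the centre selection, masking, and Hamming-distance assignment described above (the profile-sampling shortcut and function name are simplifications of mine):

#+BEGIN_SRC python
import numpy as np

def cluster_and_mask(msa, n_clust, mask_token, rng=np.random):
    """Select cluster centres and apply BERT-style masking to them.

    msa: (N_seq, N_res) integer-encoded alignment; row 0 is the query
    and always becomes the first cluster centre.
    """
    n_seq, n_res = msa.shape
    order = np.concatenate([[0], rng.permutation(np.arange(1, n_seq))])
    centres = msa[order[:n_clust]].copy()
    extra = msa[order[n_clust:]]

    # Each position is selected for modification with p = 0.15
    mask = rng.rand(*centres.shape) < 0.15
    r = rng.rand(*centres.shape)
    random_aa = rng.randint(0, 20, size=centres.shape)
    # Crude stand-in for sampling from the per-column MSA profile
    profile_aa = msa[rng.randint(0, n_seq, size=centres.shape),
                     np.arange(n_res)[None, :]]
    replacement = np.where(r < 0.1, random_aa,          # p = 0.1: uniform
                  np.where(r < 0.2, profile_aa,         # p = 0.1: profile
                  np.where(r < 0.3, centres,            # p = 0.1: unchanged
                           mask_token)))                # p = 0.7: mask token
    masked = np.where(mask, replacement, centres)

    # Assign remaining sequences to the nearest centre by Hamming distance
    hamming = (extra[:, None, :] != centres[None, :, :]).sum(-1)
    return masked, hamming.argmin(axis=1)
#+END_SRC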
During training:
- unclamped & clamped - sampling start index from uniform distributions
- Cropped with fixed size
$N_res$
- target_feat
This is a feature of size [$N_{res}$, 21] consisting of the “aatype” feature.
- residue_index
Positional encoding constant tensor. This is a feature of size [$N_{res}$] consisting of the “residue_index” feature.
- msa_feat
This is a feature of size [$N_{clust}$, $N_{res}$, 49] constructed by concatenating “cluster_msa”, “cluster_has_deletion”, “cluster_deletion_value”, “cluster_deletion_mean”, and “cluster_profile”. $N_{cycle} \times N_{ensemble}$ random samples are drawn from this feature to provide each recycling/ensembling iteration of the network with a different sample (see subsubsection 1.11.2).
- extra_msa_feat
This is a feature of size [$N_{extra\_seq}$, $N_{res}$, 25] constructed by concatenating “extra_msa”, “extra_msa_has_deletion”, and “extra_msa_deletion_value”. Together with “msa_feat” above, $N_{cycle} \times N_{ensemble}$ random samples are also drawn from this feature (see subsubsection 1.11.2).
- template_pair_feat
This is a feature of size [$N_{templ}$, $N_{res}$, $N_{res}$, 88] consisting of the concatenation of the pair residue features “template_distogram” and “template_unit_vector”, as well as several residue features transformed into pair features. The “template_aatype” feature is included via tiling and stacking (done twice, once in each residue direction). The mask features “template_pseudo_beta_mask” and “template_backbone_frame_mask” are also included, where the pair feature is $f_{ij} = \text{mask}_i \cdot \text{mask}_j$.
- template_angle_feat
This is a feature of size [$N_{templ}$, $N_{res}$, 51] constructed by concatenating the following features: “template_aatype”, “template_torsion_angles”, “template_alt_torsion_angles”, and “template_torsion_angles_mask”.
- Build dataset (on unlabeled sequences):
- Make MSA for every cluster in Uniclust30
- Remove sequences that appear in another sequence's MSA
- Keep sequences with 200 < length < 1024
- Remove sequences whose MSA has fewer than 200 alignments
- For predicted structures:
- train an ‘undistilled’ model on just the PDB dataset
- use this model to predict structures for the above set
- for every residue pair, compute a confidence metric using the KL-divergence between the predicted distance distribution and a reference distribution
- self-distillation training took roughly 2 weeks
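A minimal sketch of the per-pair confidence metric described above (assuming binned distance distributions; names are illustrative):

#+BEGIN_SRC python
import numpy as np

def pair_confidence(pred, ref, eps=1e-9):
    """KL-divergence between predicted distance distributions and a
    reference distribution, computed per residue pair.

    pred: (N_res, N_res, N_bins) predicted distograms (rows sum to 1),
    ref:  (N_bins,) reference distribution.
    """
    kl = np.sum(pred * (np.log(pred + eps) - np.log(ref + eps)), axis=-1)
    return kl  # (N_res, N_res); larger = further from the reference
#+END_SRC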
- AlphaFold receives input features derived from:
- the amino-acid sequence
- MSA
- templates (see subsubsection 1.2.9)
- output features:
- atom coordinates
- the distogram
- per-residue confidence scores.
- Recycling x3
- initial recycled inputs are zero
Algorithm 2 outlines the main steps (see also Fig. 1e and the corresponding description in the main article).
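A minimal sketch of the recycling loop (the 256/128 channel widths match the paper's MSA/pair representations, but `model` and its signature are stand-ins, not the real network):

#+BEGIN_SRC python
import numpy as np

def predict_with_recycling(model, features, n_cycles=4):
    """Run the network repeatedly, feeding each iteration's outputs
    (first-row MSA embedding, pair embedding, positions) back in.

    `model` is any callable with the assumed signature
    model(features, prev) -> dict of outputs.
    """
    n_res = features['target_feat'].shape[0]
    prev = {                                  # initial recycled inputs are zero
        'msa_first_row': np.zeros((n_res, 256)),
        'pair': np.zeros((n_res, n_res, 128)),
        'positions': np.zeros((n_res, 3)),
    }
    for _ in range(n_cycles):                 # 1 initial pass + 3 recycles
        outputs = model(features, prev)
        prev = {k: outputs[k] for k in prev}  # no gradients between cycles
    return outputs
#+END_SRC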
- cast as a graph inference problem
- cross-optimization and information flow between MSA representation and pair-wise representation
- layer normalization
- rotation + translation transforms $T_i := (R_i, t_i)$
- no reflection, scaling, or shear
- they construct ground-truth frames from the positions of three atoms in the ground-truth PDB structures via a Gram–Schmidt process (Algorithm 21)
Algorithm 21 Frame construction from ground truth atom positions cite:jumperHighlyAccurateProtein2021
Algorithm 25 Make a transformation that rotates around the x-axis cite:jumperHighlyAccurateProtein2021
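A numpy sketch of both algorithms, following the pseudocode in the supplement:

#+BEGIN_SRC python
import numpy as np

def rigid_from_3_points(x1, x2, x3):
    """Gram-Schmidt frame construction (cf. Suppl. Algorithm 21).

    For a backbone frame: x1 = N, x2 = C-alpha (the origin), x3 = C.
    Returns a rotation matrix R (columns e1, e2, e3) and translation t.
    """
    v1, v2 = x3 - x2, x1 - x2
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - e1 * np.dot(e1, v2)     # remove the e1 component
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)             # right-handed cross product fixes chirality
    return np.stack([e1, e2, e3], axis=-1), x2

def make_rot_x(alpha):
    """Rotation about the x-axis (cf. Suppl. Algorithm 25).

    alpha = (sin a, cos a), e.g. a normalised torsion-angle prediction.
    """
    sin_a, cos_a = alpha
    return np.array([[1.0, 0.0,    0.0],
                     [0.0, cos_a, -sin_a],
                     [0.0, sin_a,  cos_a]])
#+END_SRC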
- predicts backbone frames $T_i$ and torsion angles $\alpha^f_i$
- then computes atom coordinates by applying the torsion angles to the corresponding amino-acid structure with idealized bond angles and bond lengths
- We attach a local frame to each rigid group (see Table 2), such that the torsion axis is the x-axis, and store the ideal literature atom coordinates [97] for each amino acid relative to these frames in a table $\tilde{x}^{\text{lit}}_{r,f,a}$, where $r \in \{\text{ALA}, \text{ARG}, \text{ASN}, \dots\}$ denotes the residue type, $f \in S_{\text{torsion names}}$ denotes the frame, and $a$ the atom name. We further pre-compute rigid transformations that transform atom coordinates from each frame to the frame that is higher up in the hierarchy, e.g. $T^{\text{lit}}_{r,(\chi_2 \to \chi_1)}$ maps atoms in amino-acid type $r$ from the $\chi_2$-frame to the $\chi_1$-frame. As we are only predicting heavy atoms, the extra backbone rigid groups $\omega$ and $\phi$ do not contain atoms, but the corresponding frames contribute to the FAPE loss for alignment to the ground truth (like all other frames). cite:jumperHighlyAccurateProtein2021
- Iterative restrained energy minimization procedure:
- minimization of the AMBER99SB force field
- additional harmonic restraints (keep system near its input structure)
- restraints applied independently to heavy atoms, with a spring constant of 10 kcal/mol·Å²
- Once minimizer has converged, determine which residues still contain violations
- remove restraints from all atoms within these residues and perform restrained minimization once again
- process repeated until all violations are resolved.
- Note: In the CASP14 assessment only one iteration was used
- Full energy minimization and hydrogen placement were performed using the OpenMM simulation package
- tolerance of 2.39 kcal/mol and unlimited number of steps (OpenMM default values)
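A sketch of one restrained-minimisation round in OpenMM, following the recipe above (this is my reconstruction, not DeepMind's released relax code; the violation check and outer iteration loop are omitted):

#+BEGIN_SRC python
from openmm import CustomExternalForce, LangevinIntegrator, unit
from openmm.app import PDBFile, ForceField, Simulation, NoCutoff

def restrained_minimize(pdb_path):
    """One round of restrained AMBER99SB minimisation."""
    pdb = PDBFile(pdb_path)
    ff = ForceField('amber99sb.xml')
    system = ff.createSystem(pdb.topology, nonbondedMethod=NoCutoff)

    # Harmonic restraints keeping heavy atoms near their input positions,
    # spring constant 10 kcal/mol/A^2
    restraint = CustomExternalForce('0.5*k*((x-x0)^2 + (y-y0)^2 + (z-z0)^2)')
    restraint.addGlobalParameter(
        'k', 10 * unit.kilocalories_per_mole / unit.angstroms**2)
    for p in ('x0', 'y0', 'z0'):
        restraint.addPerParticleParameter(p)
    for atom, pos in zip(pdb.topology.atoms(), pdb.positions):
        if atom.element.symbol != 'H':          # heavy atoms only
            restraint.addParticle(atom.index, pos)
    system.addForce(restraint)

    sim = Simulation(pdb.topology, system,
                     LangevinIntegrator(300 * unit.kelvin,
                                        1 / unit.picosecond,
                                        0.002 * unit.picoseconds))
    sim.context.setPositions(pdb.positions)
    # OpenMM defaults: tolerance 10 kJ/mol (= 2.39 kcal/mol), unlimited steps
    sim.minimizeEnergy()
    return sim.context.getState(getPositions=True).getPositions()
#+END_SRC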
- weighted sum
- weighted to reduce importance of short sequences
- Side chain and backbone torsion angle loss (sec. 1.9.1)
- Frame aligned point error (FAPE) (sec. 1.9.2)
- Configurations with FAPE(X,Y) = 0 (sec. 1.9.4)
- Metric properties of FAPE (sec. 1.9.5)
- Chiral properties of AlphaFold and its loss (sec. 1.9.3)
- transforms $T_i$ are variant under reflection (see eqs. 11 to 17)
- atom positions are computed via backbone frames and $\chi$ angles
- Model confidence prediction (pLDDT) (sec. 1.9.6)
- TM-score prediction (sec. 1.9.7)
- Distogram prediction (sec. 1.9.8)
- Masked MSA prediction (sec. 1.9.9)
- “Experimentally resolved” prediction (sec. 1.9.10)
- Structural violations (sec. 1.9.11)
- A variant of the commonly used root-mean-squared deviation (RMSD) of atomic positions
- not invariant to reflections, preventing predictions of the wrong chirality. cite:rubieraAlphaFoldHereWhat, cite:jumperHighlyAccurateProtein2021
- a global superposition metric of $C_\alpha$ atoms (eqs. 31-33), approximated here (eqs. 34-36)
- probabilistic lower-bound maximum-of-expectation score (eqs. 37-38)
- approximated TM-score using a pairwise $C_\alpha$-based computation (the $e_{ij}$ matrix) and the above (see eq. 39 and adjacent text)
- the TM-score of any residue subset $D$ can be computed (eq. 40)
- the $e_{ij}$ matrix could also be used to estimate GDT, FAPE, and RMSD (not done)
- used for confident domain packing visualizations
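A numpy sketch of computing pTM from the binned pairwise error prediction (the $d_0$ expression is the standard TM-score normalisation; names and shapes are illustrative):

#+BEGIN_SRC python
import numpy as np

def predicted_tm(pair_probs, bin_centers):
    """pTM from the predicted pairwise error distribution (cf. eqs. 37-39).

    pair_probs: (N_res, N_res, N_bins) softmax over error bins e_ij,
    bin_centers: (N_bins,) bin centres in Angstrom.
    """
    n_res = pair_probs.shape[0]
    d0 = 1.24 * (max(n_res, 19) - 15) ** (1 / 3) - 1.8  # TM-score d0
    # Expected per-pair TM term: E[1 / (1 + (e_ij / d0)^2)]
    tm_per_bin = 1.0 / (1.0 + (bin_centers / d0) ** 2)
    expected = pair_probs @ tm_per_bin                   # (N_res, N_res)
    # Maximum-of-expectation: best residue as the alignment anchor
    return expected.mean(axis=1).max()
#+END_SRC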
- linearly project the pair representation to bins; sum directed edges, i.e. $z_{ij} + z_{ji}$
- 64 bins (2 Å to 22 Å)
- prediction targets $y^b_{ij}$: one-hot encoding over bins
- computed from ground-truth $C_\beta$ atoms of all residues (except glycine, which uses $C_\alpha$)
Cross-entropy distogram loss, averaged over all pairs (eq. 41).
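A numpy sketch of this loss (the binning convention is one reasonable reading of the 64-bin, 2-22 Å description; logits are assumed already symmetrised as $z_{ij} + z_{ji}$):

#+BEGIN_SRC python
import numpy as np

def distogram_loss(logits, cb_coords, n_bins=64, d_min=2.0, d_max=22.0):
    """Cross-entropy distogram loss averaged over all residue pairs.

    logits: (N_res, N_res, n_bins); cb_coords: (N_res, 3) ground-truth
    C-beta positions (C-alpha for glycine).
    """
    d = np.linalg.norm(cb_coords[:, None] - cb_coords[None, :], axis=-1)
    # 63 thresholds between 2 A and 22 A; the last bin catches d > 22 A
    edges = np.linspace(d_min, d_max, n_bins - 1)
    target = (d[..., None] > edges).sum(-1)          # bin index in [0, 63]
    # Log-softmax over bins, then pick out the target bin
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(logp, target[..., None], axis=-1)
    return nll.mean()
#+END_SRC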
Algorithm 32 Embedding of evoformer and structure module outputs for recycling cite:jumperHighlyAccurateProtein2021
- Adam: $lr = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$
- lr warm-up over the first $0.128 \cdot 10^6$ samples, then decayed by a factor of $0.95$ after $6.4 \cdot 10^6$ samples
- batch: 128
- gradient clipping by per-sample global norm, clipping value 0.1
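A sketch of the resulting schedule as a function of the optimiser step (assuming batch size 128 as above; a sketch of my reading, not extracted code):

#+BEGIN_SRC python
def learning_rate(step, batch_size=128, base_lr=1e-3,
                  warmup_samples=0.128e6, decay_samples=6.4e6, decay=0.95):
    """Linear warm-up followed by a single step decay."""
    samples = step * batch_size
    if samples < warmup_samples:
        return base_lr * samples / warmup_samples   # linear warm-up
    if samples > decay_samples:
        return base_lr * decay                      # decay by 0.95
    return base_lr
#+END_SRC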
They did well
Baseline for all ablation models: full model without noisy-student self-distillation. Ablations:
- With noisy-student self-distillation training
- No templates
- No raw MSA (use MSA pairwise frequencies)
- No triangles, biasing, or gating (use axial attention)
- No recycling
- No IPA (use direct projection)
- No IPA & no recycling
- No end-to-end structure gradients (keep auxiliary heads)
- No auxiliary distogram head
- No auxiliary masked MSA head
todo
They did well
todo
todo
\printbibliography