# AlphaFold 3 Walkthrough

## Introduction
AlphaFold 3 (AF3) is a multimodal structure prediction model that extends AlphaFold 2 to handle not only proteins, but also ligands, DNA/RNA, antibodies, ions, and more. It predicts all-atom 3D structures from diverse input data, including sequences, structural templates, and small molecules.

This walkthrough is intended as a high-level guide for students, developers, and researchers seeking to understand AF3’s architectural flow and practical usage. It focuses on the structure and logic of the model, rather than low-level implementation details, and is well-suited for tutorials, onboarding, or early-stage exploration.

![panel_a](https://github.com/richcmwang/AlphaFold-3-Tutorial-Series/blob/main/images/architecture.png?raw=1)

<span style="font-size:90%;">
<b>Figure 1. End-to-end architecture of AlphaFold 3.</b>
This diagram summarizes the full prediction pipeline, from input processing and embedding to structure refinement and confidence estimation. Input features include sequences, optional MSAs, template structures, and ligand conformers. Representations are passed through a series of embedding modules and the 48-block Pairformer. Recycling and the diffusion module enable progressive refinement. Final atomic coordinates and confidence scores are output after structure prediction.
Adapted from Figure 1d of the AlphaFold 3 paper (Evans et al., 2024).
</span>

## Setup

Please run the cell below to download the tutorial files and set the correct working directory. This ensures that all images and utility files are accessible during the walkthrough.

```python
# Clone the tutorial repository
!git clone https://github.com/richcmwang/AlphaFold-3-Tutorial-Series.git

# Change working directory to the 'tutorials' folder
import os
os.chdir('AlphaFold-3-Tutorial-Series/tutorials')
```

## 1. Input and Featurization Construction

AlphaFold 3 begins structure prediction by converting molecular inputs into internal features that encode shape, chemistry, and spatial relationships. These are constructed from data such as protein sequences, ligand CCD codes or SMILES, and optional structural templates like .cif files.

If templates are provided, for example, a protein backbone or a CCD-defined small molecule, they are used to compute geometric features including pairwise distances, bond angles, and torsions. These are expressed relative to local coordinate frames, ensuring that featurization depends only on internal geometry, not on absolute 3D placement.
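The claim that internal geometry does not depend on absolute placement can be checked numerically. The following is a minimal NumPy sketch, not AlphaFold 3 code: the toy coordinates, the 40-degree rotation, and the helper names `pairwise_distances` and `rotation_about_z` are all assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(coords):
    """Return the (n, n) matrix of Euclidean distances between n points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rotation_about_z(angle_deg):
    """Build a 3x3 rotation matrix for a rotation about the z-axis."""
    t = np.radians(angle_deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

rng = np.random.default_rng(0)
coords = rng.normal(size=(5, 3))  # five toy atoms (hypothetical data)

# Apply an arbitrary rigid-body motion: rotate, then translate.
moved = coords @ rotation_about_z(40.0).T + np.array([10.0, -3.0, 2.5])

# The raw coordinates change, but the distance features do not.
features_match = np.allclose(pairwise_distances(coords), pairwise_distances(moved))
coordinates_match = np.allclose(coords, moved)
print(features_match, coordinates_match)
```

Distances are only one such feature; angles and torsions computed from the same atoms are unchanged by the rigid motion for the same reason.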

Unlike AlphaFold 2, AlphaFold 3 treats multiple sequence alignments (MSAs) as optional input and is trained to make accurate predictions even from a single protein sequence.

The model also computes atom-level features such as atom types, valency, and bond connectivity. These features form the input representation used by AlphaFold’s neural network architecture.

## 2. Model Representations

Internally, AlphaFold 3 represents molecules using two interlinked feature types: single and pairwise representations. The single representation is a vector for each token, representing either a residue or an atom, and capturing its local chemical identity and context. These are stored in a tensor of shape n × c, where n is the number of tokens and c is the number of channels.

The pairwise representation captures the spatial relationship between every pair of tokens (residues or atoms) using a tensor of shape n × n × c. For each token pair, the model encodes not just the distance between them, but also how they are oriented with respect to one another and any torsion-like angles that describe their relative rotation. All of these geometric features are computed in local coordinate frames, which ensures that the model's reasoning is invariant to how the system is positioned or rotated in space.

These two representations evolve throughout the network via the Pairformer module, which alternates between updating single-token features based on pairwise context and refining pairwise features using updated token states. This joint refinement enables the model to reason over both local identity and global structure in a consistent geometric framework.
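To make the two tensor shapes concrete, here is a small sketch with placeholder arrays. The sizes are toy values chosen for readability and are assumptions, not the channel widths used by the real model.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
n_tokens, c_single, c_pair = 6, 8, 4
rng = np.random.default_rng(0)

# Single representation: one feature vector per token.
single = rng.normal(size=(n_tokens, c_single))

# Pairwise representation: one feature vector per ordered token pair.
pair = rng.normal(size=(n_tokens, n_tokens, c_pair))

# pair[i, j] holds the features describing how token i relates to token j.
relation_2_to_5 = pair[2, 5]

print(single.shape, pair.shape, relation_2_to_5.shape)
```

Because the pairwise tensor has one entry per token pair, its size grows quadratically with the number of tokens, while the single representation grows linearly.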

## 3. Model Architecture

### 3.1 Overview of AlphaFold 3 Architecture

AlphaFold 3 uses a transformer-based architecture designed to reason jointly over proteins, RNAs, ligands, and their complexes. Rather than building in strict SE(3)-equivariance as seen in some geometric deep learning models, AlphaFold 3 relies on the strength of large-scale, diverse training data to learn spatial reasoning directly. This reflects a broader trend in modern architectures: replacing inductive biases with general-purpose models when data is sufficient.

Although AlphaFold 3 is not formally SE(3)-equivariant, it achieves rotation and translation invariance by basing its reasoning entirely on local, relative geometry. Because the model operates on internal features like distances, directions, and angles defined in local frames, its predictions are unaffected by how the input is globally oriented or positioned.

AlphaFold 3 encodes modality-specific behavior through learned input embeddings. The same model architecture is used across all supported modalities, including proteins, ligands, RNAs, ions, and small molecules. Rather than relying on separate model pipelines, this unified design enables the model to flexibly handle a wide range of molecular systems—such as protein–ligand complexes or protein–RNA assemblies—without retraining or architectural changes.

Unlike AlphaFold 2, which used local residue frames to transform torsion angle predictions into global 3D atom positions, AlphaFold 3 predicts global atomic coordinates directly. This architectural shift reflects both design and training changes, including how spatial features are integrated through attention mechanisms and how structure prediction is supervised. The model learns to place atoms directly into a shared 3D frame, without relying on intermediate local-to-global transformations.

Additionally, AlphaFold 3’s attention mechanisms naturally integrate information across different molecule types. Inter-molecular attention allows the model to learn from interactions between protein residues, ligand atoms, and other molecular components without relying on predefined binding pockets or rigid docking constraints. Ligand atoms, including their flexible torsions, are treated as first-class tokens and participate in the same representation updates as proteins and RNAs. This design enables AlphaFold 3 to handle complex systems like protein–ligand or protein–RNA assemblies with a unified model.

Key architectural components include the Pairformer module, which refines single-token and pairwise representations through geometry-aware attention, and the diffusion module, which iteratively denoises predicted structures for accurate 3D atomic coordinates. These components are described in detail in the following sections.

When provided, MSA features are processed by a dedicated module at the start of the model to extract co-evolutionary patterns, similar to AlphaFold 2, but are no longer essential for accurate prediction, as AlphaFold 3 is trained to operate robustly even in their absence.



### 3.2 Pairformer Module

![panel_a](https://github.com/richcmwang/AlphaFold-3-Tutorial-Series/blob/main/images/pairformer.png?raw=1)
<span style="font-size:90%;">
<b>Figure 2. Pairformer Module for Updating Pairwise and Single-Token Representations.</b> Pairwise geometric context and token-level embeddings are integrated through 48 blocks of triangle update, triangle self-attention, transition layers, and single attention with pair bias. Adapted from the AlphaFold 3 paper.
</span>

The Pairformer module is the core of AlphaFold 3’s architecture, adapted from the Evoformer in AlphaFold 2. It consists of 48 sequential layers that iteratively refine the single-token and pairwise representations.

The triangle update and triangle self-attention blocks refine each pairwise entry P[i, j] by considering every third node k that forms a triangle with i and j. Information carried by the edges linking k to i and k to j is aggregated over all such k and used to update P[i, j].
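The aggregation over the third node k can be written as a single tensor contraction. The sketch below is a simplified "outgoing edges" triangle update in NumPy: it keeps only the core sum over k and leaves out the gating, layer normalization, and output projection that a full block would include. The weight matrices `w_a` and `w_b` and the function name are hypothetical.

```python
import numpy as np

def triangle_update_outgoing(pair, w_a, w_b):
    """Toy 'outgoing edges' update: combine edges (i, k) and (j, k) over all k."""
    a = pair @ w_a  # (n, n, c) projection used for edge i -> k
    b = pair @ w_b  # (n, n, c) projection used for edge j -> k
    # For each pair (i, j) and channel c, sum a[i, k, c] * b[j, k, c] over k.
    update = np.einsum("ikc,jkc->ijc", a, b)
    return pair + update  # residual connection

rng = np.random.default_rng(0)
n, c = 4, 3
pair = rng.normal(size=(n, n, c))
w_a = rng.normal(size=(c, c))
w_b = rng.normal(size=(c, c))

refined = triangle_update_outgoing(pair, w_a, w_b)
print(refined.shape)
```

An "incoming edges" variant would contract over the first index instead, for example `np.einsum("kic,kjc->ijc", a, b)`.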

The following diagrams (Jumper et al., 2021) illustrate how these blocks aggregate information from neighboring triangles. Outgoing and incoming edges capture directional relationships, with outgoing edges representing (i to k) and (j to k) and incoming edges representing (k to i) and (k to j), while starting and ending node self-attention mechanisms integrate geometric context from different parts of the molecular graph. Together, these modules help the model understand complex 3D arrangements, which is critical for accurate structure prediction.

<div align="center">
<img src="https://github.com/richcmwang/AlphaFold-3-Tutorial-Series/blob/main/images/triangle.png?raw=1" width="700">
</div>



Transition layers include lightweight feed-forward networks that refine the representations, stabilize learning, and enable deeper architectures.

The single attention with pair bias module updates the single-token embeddings by incorporating geometric context from the pairwise features. In AlphaFold 3, the pair representation biases the attention logits of the single attention module. This design allows the model to integrate pairwise geometric information into single-token embeddings while maintaining architectural simplicity.
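The idea of biasing attention logits with pairwise features can be sketched as follows. This is a single-head toy version in NumPy, assuming hypothetical weight matrices (`w_q`, `w_k`, `w_v`, `w_bias`) and toy sizes; the real module is multi-headed and includes normalization and gating that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_pair_bias(single, pair, w_q, w_k, w_v, w_bias):
    """Single-head attention over tokens, with logits shifted by a pair-derived bias."""
    q, k, v = single @ w_q, single @ w_k, single @ w_v  # each (n, d)
    logits = q @ k.T / np.sqrt(q.shape[-1])             # (n, n) token-token scores
    bias = pair @ w_bias                                # (n, n, c) @ (c,) -> (n, n)
    weights = softmax(logits + bias, axis=-1)           # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, c_s, c_z, d = 5, 8, 4, 6  # toy sizes (assumptions)
single = rng.normal(size=(n, c_s))
pair = rng.normal(size=(n, n, c_z))
w_q, w_k, w_v = (rng.normal(size=(c_s, d)) for _ in range(3))
w_bias = rng.normal(size=c_z)

updated, weights = attention_with_pair_bias(single, pair, w_q, w_k, w_v, w_bias)
print(updated.shape, weights.shape)
```

Note that information flows one way in this step: the pair features shape how tokens attend to each other, but the pair tensor itself is not modified.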

Together, these operations enable the Pairformer to build a geometry-aware representation of the molecular system that is passed downstream to the diffusion module for iterative denoising and final structure prediction.

### 3.3 Diffusion Module

#### Overview

AlphaFold 3 uses a diffusion module to perturb input geometries during training, teaching the model to recover accurate atomic structures through iterative denoising. This conditional generative training procedure allows the model to learn protein structure at multiple length scales: at low noise levels, it captures local stereochemistry, while at high noise levels, it captures the overall global arrangement of the system. At inference time, AlphaFold 3 samples random noise conditioned on the input features (such as the protein sequence and ligand graph) and denoises it recurrently to generate a final structure. This approach enables the model to produce a distribution of answers, each with sharply defined local geometry, even in regions of global uncertainty. By using this conditional generative process, AlphaFold 3 avoids the need for torsion-based parametrizations and violation losses while effectively handling complex multimolecular systems.

#### Input Preparation

<div align="center">
<img src="https://github.com/richcmwang/AlphaFold-3-Tutorial-Series/blob/main/images/attention.png?raw=1" width="500">
</div>
<span style="font-size:90%;">
<b>Figure 3. Input Processing for the Diffusion Module.</b> This figure summarizes input preparation steps, including per-token (single-token and pairwise) and per-atom representations, as well as random rotations and translations before feeding them into the diffusion module. The per-token conditioning module receives processed outputs from earlier modules (Pairformer, template, and MSA), ensuring that the diffusion module (shown as the bottom attention blocks) receives refined features for iterative denoising and structure prediction.
</span>


The input features pass through three conditioning steps before entering the diffusion module: per-token conditioning, per-atom conditioning, and random rigid-body augmentation. The per-token conditioning module integrates raw input features (green bars, e.g. sequence, atom type) with geometric context from pairwise features (blue squares) and outputs refined single-token embeddings (red bars). These pairwise and single representations have already been processed by the Pairformer stack and, optionally, the template and MSA modules.

The per-atom conditioning module uses atom-level features, including chemical identity, bonding, and local geometric context, to create enriched atom embeddings that are robust to rotation and translation. Rather than relying solely on raw 3D coordinates, it integrates the refined single-token embeddings (from the per-token conditioning module) with relative geometric features such as distances, angles, and local frames, so that the resulting atom embedding does not depend on the global pose. This design ensures that each atom representation captures both local chemical context and spatial relationships within the molecule.

Additionally, random rotation and translation are applied to the input structures before diffusion noise is added, ensuring the model learns to denoise consistently regardless of orientation.
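The augmentation order described above (random rigid motion first, then noise) can be sketched as follows. This is an illustrative NumPy version, assuming hypothetical function names and a QR-based way of drawing a rotation; it is not taken from the AlphaFold 3 code.

```python
import numpy as np

def random_rigid_augment(coords, rng, translation_scale=1.0):
    """Center the structure, then apply a random rotation and translation."""
    centered = coords - coords.mean(axis=0)
    # Draw an orthogonal matrix via QR and fix signs so the result is a proper rotation.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    shift = translation_scale * rng.normal(size=3)
    return centered @ q.T + shift

def add_noise(coords, sigma, rng):
    """Perturb every atom with isotropic Gaussian noise of standard deviation sigma."""
    return coords + sigma * rng.normal(size=coords.shape)

rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 3))  # toy structure (hypothetical data)

augmented = random_rigid_augment(coords, rng)       # same shape, new pose
noisy = add_noise(augmented, sigma=2.0, rng=rng)    # input the denoiser would see
print(augmented.shape, noisy.shape)
```

The rigid motion leaves all interatomic distances intact, so only the added noise changes the internal geometry that the denoiser has to repair.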

The three attention blocks at the bottom of the diagram collectively form the diffusion module, where iterative denoising refines atomic coordinates. Readers can refer back to the main AlphaFold 3 architecture diagram (Figure 1) for the complete view of how these inputs connect to the diffusion module.

#### Training and Inference

<div align="center">
<img src="https://github.com/richcmwang/AlphaFold-3-Tutorial-Series/blob/main/images/training.png?raw=1" width="800">
</div>

<span style="font-size:90%;">
<b>Figure 4. Diffusion module input and output in AlphaFold 3.</b> This figure shows how the model’s activations, the outputs of the Pairformer (network trunk), feed into the diffusion module. Single-token embeddings (red), pairwise features (blue), and original input features (green) are processed by the diffusion module, where iterative denoising refines atomic coordinates. A STOP sign indicates where gradient flow is halted during training. Adapted from Figure 2 in the AlphaFold 3 paper (Evans et al., 2024).
</span>

AlphaFold 3 trains its diffusion module using both a training diffusion block and an inference diffusion block. During training, the model perturbs input structures with noise and teaches the diffusion module to denoise them by comparing the denoised outputs to the ground truth structure.
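The basic denoising objective can be illustrated with a toy example. This sketch assumes a plain mean-squared error between the denoised and true coordinates; the loss used in the AlphaFold 3 paper includes additional terms and weighting that are not reproduced here. The `denoiser` callables are hypothetical stand-ins for the diffusion module.

```python
import numpy as np

def denoising_loss(denoiser, x_true, sigma, rng):
    """Corrupt the true coordinates with noise, denoise, and score against the truth."""
    x_noisy = x_true + sigma * rng.normal(size=x_true.shape)
    x_pred = denoiser(x_noisy, sigma)
    # Mean over atoms of the squared coordinate error.
    return float(np.mean(np.sum((x_pred - x_true) ** 2, axis=-1)))

x_true = np.random.default_rng(0).normal(size=(10, 3))  # toy ground truth

# Two hypothetical denoisers: a perfect one and one that does nothing.
oracle = lambda x_noisy, sigma: x_true
identity = lambda x_noisy, sigma: x_noisy

loss_oracle = denoising_loss(oracle, x_true, 1.0, np.random.default_rng(1))
loss_small = denoising_loss(identity, x_true, 0.1, np.random.default_rng(1))
loss_large = denoising_loss(identity, x_true, 10.0, np.random.default_rng(1))
print(loss_oracle, loss_small, loss_large)
```

Sampling `sigma` over a wide range during training is what exposes the denoiser to both small local corrections and large global rearrangements.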

The inference diffusion block performs a mini-rollout of iterative denoising steps (e.g. 20 iterations) during training to produce a full predicted structure. In the AlphaFold 3 paper (Evans et al., 2024), this predicted structure is used to train the confidence heads and to select the ground-truth permutation. The STOP sign marks where gradients are halted, so these losses do not propagate back through the rollout. The inference and training diffusion blocks share the same weights, so the rollout always reflects the current state of the denoiser.

The “Permute ground truth” step handles symmetry: when chains or ligand atoms are chemically interchangeable, the ground truth is relabeled to the equivalent ordering that best matches the prediction, so the model is not penalized for choosing a different but equally valid labeling.
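One way to picture the permutation step is to try each equivalent labeling of the ground truth and keep the one closest to the prediction before computing the loss. The sketch below is a brute-force toy version, assuming the prediction and ground truth are already in the same frame and that the set of interchangeable atoms is known; the function name and data are hypothetical.

```python
from itertools import permutations

import numpy as np

def best_permutation(pred, truth, symmetric_idx):
    """Try every relabeling of the interchangeable atoms; keep the one closest to pred."""
    idx = list(symmetric_idx)
    best_err, best_truth = np.inf, truth
    for perm in permutations(idx):
        candidate = truth.copy()
        candidate[idx] = truth[list(perm)]  # reassign coordinates among equivalent atoms
        err = float(np.mean(np.sum((pred - candidate) ** 2, axis=-1)))
        if err < best_err:
            best_err, best_truth = err, candidate
    return best_truth, best_err

# Toy molecule: atoms 1 and 2 are chemically equivalent.
truth = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
pred = truth[[0, 2, 1]]  # a correct structure with the equivalent atoms labeled the other way

naive_err = float(np.mean(np.sum((pred - truth) ** 2, axis=-1)))
permuted_truth, permuted_err = best_permutation(pred, truth, symmetric_idx=(1, 2))
print(naive_err, permuted_err)
```

Without the relabeling, a geometrically perfect prediction would still be penalized; with it, the error drops to zero. Brute force is only practical for small symmetry groups.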

The combination of the training and inference diffusion blocks, the shared weights, and the STOP gradients ensures that the model learns to produce accurate structures consistently throughout the denoising process.

## 4. Ligand Modeling

In AlphaFold 3, ligands are represented as molecular graphs with atom- and bond-level features, specified through Chemical Component Dictionary (CCD) codes or SMILES strings and processed using RDKit. These features include atomic identity, bond order, and torsion flexibility. RDKit provides an initial 3D conformer as input, but AlphaFold 3 predicts the ligand’s final pose in the context of the binding site, including both its spatial placement and its internal conformation. This allows the model to adapt the ligand’s conformation to the local environment.

Rather than relying on predefined rules or scoring functions, AlphaFold 3 learns spatial interaction patterns such as the avoidance of atomic clashes (steric fit), hydrogen bonding, and shape complementarity directly from training data. Because the model outputs atomic coordinates directly (Section 3.1), the torsion angle of each rotatable bond is implied by the predicted atom positions rather than produced as a separate output.

## 5. Pose Prediction

AlphaFold 3 outputs all-atom 3D coordinates for every molecule in the input system, including proteins, ligands, nucleic acids, and ions. These coordinates form a unified structural prediction, suitable for downstream analysis or visualization.

For ligands, the model predicts detailed binding poses, resolving not just rigid-body placement but also flexible torsion angles around rotatable bonds. This enables AlphaFold 3 to capture the internal conformation of small molecules as they adapt to their binding environment.
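Given predicted coordinates, the torsion angle around any rotatable bond can be measured after the fact from four consecutive atoms. The following is a standard dihedral-angle calculation in NumPy; the four toy points are hypothetical and simply place the central bond along the z-axis.

```python
import numpy as np

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle (in degrees) around the bond p1-p2, defined by four atoms."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)       # unit vector along the central bond
    v = b0 - np.dot(b0, b1) * b1       # component of b0 perpendicular to the bond
    w = b2 - np.dot(b2, b1) * b1       # component of b2 perpendicular to the bond
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])

cis = dihedral_degrees(p0, p1, p2, np.array([1.0, 0.0, 1.0]))     # eclipsed arrangement
trans = dihedral_degrees(p0, p1, p2, np.array([-1.0, 0.0, 1.0]))  # opposite sides
gauche = dihedral_degrees(p0, p1, p2, np.array([0.0, 1.0, 1.0]))  # quarter turn
print(cis, trans, gauche)
```

Applying this to each rotatable bond of a predicted ligand pose gives a compact description of its internal conformation.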

For proteins, the model resolves both backbone geometry and side-chain conformations at atomic resolution, recovering detailed structural features such as rotamer states and hydrogen bonding patterns.

All coordinates are provided in a consistent global reference frame that is arbitrary with respect to the input orientation. Because AlphaFold 3 uses only internal geometry for featurization, the predicted structure can be translated or rotated without loss of fidelity. The global coordinate system used in AlphaFold 3’s output does not carry biological meaning on its own. However, it is applied consistently across all molecules in the system, so relative positions and interactions such as ligand binding, side-chain packing, or ion coordination can still be interpreted accurately.
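Since the output frame is arbitrary, two structures are usually compared after optimal superposition. The sketch below uses the Kabsch algorithm (an SVD-based least-squares fit) to compute RMSD after alignment; the toy structure and the 70-degree rotation are assumptions for illustration.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between point sets p and q after optimal rigid superposition of p onto q."""
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p_c.T @ q_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))         # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T      # optimal rotation for column vectors
    aligned = p_c @ rot.T
    return float(np.sqrt(np.mean(np.sum((aligned - q_c) ** 2, axis=-1))))

rng = np.random.default_rng(0)
structure = rng.normal(size=(12, 3))  # toy prediction (hypothetical data)

# The same structure expressed in a different global frame.
t = np.radians(70.0)
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t),  np.cos(t)]])
moved = structure @ rot_x.T + np.array([5.0, 0.0, -8.0])

naive_rmsd = float(np.sqrt(np.mean(np.sum((structure - moved) ** 2, axis=-1))))
aligned_rmsd = kabsch_rmsd(structure, moved)
print(naive_rmsd, aligned_rmsd)
```

The naive RMSD is large because the frames differ, while the aligned RMSD is numerically zero, confirming that the two coordinate sets describe the same structure.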

Because AlphaFold 3 relies on relative geometry, it is invariant to input orientation. Unlike models such as DiffDock that require globally aligned inputs, AlphaFold 3 accepts unaligned structures and produces consistent results. While it reasons over internal coordinates, its outputs are absolute: all-atom 3D coordinates supervised on ground-truth structures.


## 6. Recycling
AlphaFold 3 performs iterative recycling, a process in which the network trunk is run several times and the outputs of one pass are fed back as inputs to the next. The sequence, ligand information, and any template or MSA features remain fixed, while the single-token and pairwise representations from the previous pass are fed back into the trunk and refined again. This allows the system to progressively improve representational quality, and with it the geometric accuracy of the final structure, especially in complex systems involving flexible ligands, multi-molecular interfaces, or disordered regions.

## 7. Confidence Estimation
The model outputs several structural confidence and quality metrics. One key output is the Predicted Aligned Error (PAE), a matrix whose entry (i, j) estimates the expected positional error of token j when the predicted and true structures are aligned on token i. Low PAE between two regions indicates that their relative placement is predicted confidently, which makes the matrix particularly useful for evaluating the reliability of domain orientations and inter-molecular contacts.
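A common way to use a PAE matrix is to average the blocks that connect two chains, which summarizes how confidently their relative placement is predicted. The sketch below assumes the PAE matrix and per-token chain labels are already loaded as arrays; the values, labels, and function name are made up for illustration and do not reflect any specific output file format.

```python
import numpy as np

def mean_interchain_pae(pae, chain_ids, chain_a, chain_b):
    """Average PAE over both off-diagonal blocks linking chain_a and chain_b."""
    chain_ids = np.asarray(chain_ids)
    in_a = chain_ids == chain_a
    in_b = chain_ids == chain_b
    block_ab = pae[np.ix_(in_a, in_b)]  # rows in chain A, columns in chain B
    block_ba = pae[np.ix_(in_b, in_a)]  # rows in chain B, columns in chain A
    return float(0.5 * (block_ab.mean() + block_ba.mean()))

# Hypothetical 5-token system: three tokens in chain A, two in chain B.
chain_ids = ["A", "A", "A", "B", "B"]
pae = np.full((5, 5), 20.0)  # uncertain relative placement between chains
pae[:3, :3] = 2.0            # chain A is internally confident
pae[3:, 3:] = 2.0            # chain B is internally confident

interface_pae = mean_interchain_pae(pae, chain_ids, "A", "B")
print(interface_pae)
```

In this toy case each chain is confidently folded on its own, but the high inter-chain value signals that the predicted interface should be treated with caution.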

Another important metric is the Predicted Local Distance Difference Test (pLDDT), which provides a per-atom or per-residue score indicating how confidently the model believes it has positioned each part of the structure. Scores range from 0 to 100, with higher values corresponding to greater local reliability and values above 70 often indicating regions that are well-resolved and trustworthy. This metric is particularly useful for assessing flexible regions of a protein or complex, such as loops and side chains, as well as binding pockets and small molecules that can adopt multiple conformations depending on their environment. In these challenging regions, a low pLDDT flags areas where the model’s predictions are less certain, helping researchers identify parts of the structure that may require additional validation or experimental support.
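pLDDT is the model's own estimate of the lDDT score, which can be computed directly when a reference structure is available. The sketch below is a simplified per-atom lDDT in NumPy, assuming the commonly used 15 Å inclusion radius and 0.5, 1, 2, and 4 Å tolerance thresholds, and ignoring refinements such as excluding atoms from the same residue. The toy coordinates are hypothetical.

```python
import numpy as np

def lddt_per_atom(pred, true, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified per-atom lDDT (0-100): fraction of local distances that are preserved."""
    d_true = np.linalg.norm(true[:, None, :] - true[None, :, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)
    n = len(true)
    # Consider only neighbors within the cutoff in the reference, excluding self-pairs.
    mask = (d_true < cutoff) & ~np.eye(n, dtype=bool)
    diff = np.abs(d_true - d_pred)
    # For each pair, the fraction of tolerance thresholds that the distance error satisfies.
    preserved = np.mean([diff < t for t in thresholds], axis=0)
    counts = np.maximum(mask.sum(axis=1), 1)
    return 100.0 * (preserved * mask).sum(axis=1) / counts

# Toy reference: six atoms on a line, 1.5 Å apart.
true = np.array([[1.5 * i, 0.0, 0.0] for i in range(6)])

perfect = lddt_per_atom(true.copy(), true)

# Misplace the first atom by 50 Å and rescore.
bad_pred = true.copy()
bad_pred[0, 1] += 50.0
scores_bad = lddt_per_atom(bad_pred, true)
print(perfect, scores_bad)
```

The misplaced atom scores zero, and its neighbors lose the share of their score that depended on it, which mirrors how low pLDDT localizes unreliable regions without penalizing the rest of the structure.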

Together, these confidence metrics help users evaluate the reliability of the prediction, focus on high-confidence regions, and guide follow-up modeling or experimental efforts.