## Google Colab Setup

**GPU Required:** Before running, enable a GPU runtime:
1. Go to **Runtime → Change runtime type**
2. Select **T4 GPU** (or better)
3. Click **Save**

In [None]:
# Install dependencies (skip if already done)
import os

# Set environment variables
os.environ['CCD_MIRROR_PATH'] = ''
os.environ['PDB_MIRROR_PATH'] = ''

if not os.path.isfile("FOUNDRY_READY"):
    print("Installing rc-foundry...")
    
    # Uninstall torchvision first to avoid operator conflicts
    os.system("pip uninstall -y torchvision")
    
    # Install rc-foundry
    os.system("pip install -q 'rc-foundry[all]'")
    
    # Mark as ready
    os.system("touch FOUNDRY_READY")
    
    print("Done!")
else:
    print("rc-foundry already installed.")

In [None]:
# Download model weights (skips already-downloaded models automatically)
# In total, ~6GB (3GB for RFD3, 3GB for RF3, <100MB for MPNN); may take a few minutes depending on your connection speed
os.system("foundry install rfd3 ligandmpnn rf3")

# Example: End-To-End *De Novo* Protein Design Pipeline

## Overview

This notebook demonstrates an end-to-end protein design workflow using three deep learning networks from the Institute for Protein Design:

| Step | Model | Purpose |
|------|-------|---------|
| 1. **Generation** | RFD3 | Generate novel proteins via diffusion |
| 2. **Sequence Design** | MPNN | Design amino acid sequences for the generated backbone |
| 3. **Structure Validation via Refolding** | RF3 | Predict the structure from the designed sequence to validate designability |

All models are unified through [AtomWorks](https://github.com/RosettaCommons/atomworks) (for both inference and training), relying on Biotite `AtomArray` objects.

### Pipeline Flow
```
RFD3 (backbone) → MPNN (sequence) → RF3 (validation) → RMSD comparison
```

---

In [None]:
import warnings
warnings.filterwarnings('ignore', module='atomworks')

# Shared utilities for visualization (from AtomWorks)
from atomworks.io.utils.visualize import view

## Section 1: All-Atom Generation with RFD3

RFdiffusion3 (RFD3) generates *de novo* all-atom proteins that meet specific conditioning requirements.

**Parameters Used** *(many more are available for more complex protein design tasks)*:
- `length`: Target protein length in residues
- `diffusion_batch_size`: Number of structures to generate per batch
- `n_batches`: Number of batches to run

**Outputs:** Dictionary mapping each example ID to a list of `RFD3Output` objects, each of which exposes the generated structure as `atom_array`.

In [None]:
from lightning.fabric import seed_everything
from rfd3.engine import RFD3InferenceConfig, RFD3InferenceEngine

# Set seed for reproducibility
seed_everything(0)

# Configure RFD3 inference
config = RFD3InferenceConfig(
    specification={
        'length': 80,  # Generate 80-residue proteins
    },
    diffusion_batch_size=2,  # Generate 2 structures per batch
)

# Initialize engine and run generation
model = RFD3InferenceEngine(**config)
outputs = model.run(
    inputs=None,      # None for unconditional generation
    out_dir=None,     # None to return in memory (no file output)
    n_batches=1,      # Generate 1 batch
)

In [None]:
# View generated example IDs (one key per generated structure)
outputs.keys()

In [None]:
# Inspect RFD3 outputs and extract the generated structures
for idx, data in outputs.items():
    print(f"Batch {idx}: {len(data)} structure(s)")
    print(f"  Output type: {type(data[0]).__name__}")
    print(f"  AtomArray: {data[0].atom_array}")

# Extract the first generated structure for downstream use
first_key = next(iter(outputs.keys()))
atom_array = outputs[first_key][0].atom_array

# Visualize the generated structure
view(atom_array)

---

## Section 2: Sequence Design with MPNN

ProteinMPNN and LigandMPNN are message passing neural networks (MPNNs) that design amino acid sequences intended to fold into a target backbone structure.

**Model Options:**
- `protein_mpnn`: Original ProteinMPNN for protein-only design
- `ligand_mpnn`: Extended model supporting ligand-aware design

**Key Parameters:**
- `batch_size`: Number of sequences to generate per structure
- `remove_waters`: Whether to exclude water molecules from context
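
Because `batch_size` controls how many sequences are sampled for a backbone, it can be useful to check how different those samples are from one another. The sketch below is a minimal illustration, assuming the designs are available as equal-length one-letter strings. The helper names `sequence_identity` and `pairwise_identities` and the example sequences are made up for illustration and are not part of the MPNN API.

```python
from itertools import combinations


def sequence_identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def pairwise_identities(sequences):
    """Map each index pair (i, j) with i < j to the identity of that pair."""
    return {
        (i, j): sequence_identity(sequences[i], sequences[j])
        for i, j in combinations(range(len(sequences)), 2)
    }


# Made-up five-residue sequences, for illustration only
designs = ["MKVLA", "MKVIA", "MRVLG"]
identities = pairwise_identities(designs)

for (i, j), identity in identities.items():
    print(f"Design {i + 1} vs design {j + 1}: {identity:.0%} identical")
```

Position-wise comparison is sufficient here because all sequences designed for one backbone have the same length, so no alignment step is needed.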

In [None]:
from mpnn.inference_engines.mpnn import MPNNInferenceEngine

# Configure MPNN inference engine
# See mpnn.utils.inference.MPNN_GLOBAL_INFERENCE_DEFAULTS for all options
engine_config = {
    "model_type": "ligand_mpnn",  # or "protein_mpnn" for vanilla ProteinMPNN
    "is_legacy_weights": True,    # Required for now for ligand_mpnn and protein_mpnn
    "out_directory": None,        # Return results in memory
    "write_structures": False,
    "write_fasta": False,
}

# Configure per-input inference options
# See mpnn.utils.inference.MPNN_PER_INPUT_INFERENCE_DEFAULTS for all options
input_configs = [
    {
        "batch_size": 10,         # Generate 10 sequences per structure
        "remove_waters": True,
    }
]

# Run sequence design on the RFD3-generated backbone
model = MPNNInferenceEngine(**engine_config)
mpnn_outputs = model.run(input_dicts=input_configs, atom_arrays=[atom_array])

In [None]:
from biotite.structure import get_residue_starts
from biotite.sequence import ProteinSequence

# Extract and display the designed sequences
print(f"Generated {len(mpnn_outputs)} designed sequences:\n")

for i, item in enumerate(mpnn_outputs):
    res_starts = get_residue_starts(item.atom_array)
    # Convert 3-letter codes to 1-letter using Biotite
    seq_1letter = ''.join(
        ProteinSequence.convert_letter_3to1(res_name)
        for res_name in item.atom_array.res_name[res_starts]
    )
    print(f"Sequence {i+1}: {seq_1letter}")

---

## Section 3: Structure Prediction with RF3

RF3 (RoseTTAFold 3) predicts protein structures from sequences. By re-folding the MPNN-designed sequence, we can validate whether the design is likely to adopt the intended backbone structure.

**Outputs:** `RF3Output` objects containing:
- `atom_array`: Predicted structure as Biotite AtomArray
- `summary_confidences`: Overall confidence metrics (pLDDT, PAE, pTM, etc.)
- `confidences`: Per-atom/residue confidence scores

**Confidence Metrics:**

| Metric | Description |
|--------|-------------|
| pLDDT | Per-residue confidence (0-1, higher is better) |
| PAE | Predicted Aligned Error (lower is better) |
| pTM | Predicted TM-score |
| ranking_score | Overall model quality score |
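
Per-atom pLDDT values are often easier to read once averaged per residue. The sketch below is one way to do that, assuming a single chain and a list of residue IDs with one entry per atom. The helper `mean_plddt_per_residue`, the `atom_res_ids` list, the numeric values, and the 0.5 cutoff are all illustrative assumptions rather than part of the RF3 output format.

```python
from collections import defaultdict


def mean_plddt_per_residue(atom_plddts, atom_res_ids):
    """Average per-atom pLDDT values within each residue.

    atom_res_ids is a hypothetical list holding the residue ID of each atom.
    """
    if len(atom_plddts) != len(atom_res_ids):
        raise ValueError("need exactly one residue ID per atom")
    grouped = defaultdict(list)
    for plddt, res_id in zip(atom_plddts, atom_res_ids):
        grouped[res_id].append(plddt)
    return {res_id: sum(vals) / len(vals) for res_id, vals in grouped.items()}


# Made-up values: three atoms in residue 1, two atoms in residue 2
atom_plddts = [0.9, 0.8, 0.7, 0.5, 0.3]
atom_res_ids = [1, 1, 1, 2, 2]

per_residue = mean_plddt_per_residue(atom_plddts, atom_res_ids)

# Flag residues below an arbitrary, illustrative cutoff on the 0-1 scale
low_confidence = [res_id for res_id, value in per_residue.items() if value < 0.5]
print(per_residue, low_confidence)
```

For multi-chain predictions, residue IDs can repeat across chains, so the grouping key would need to include the chain ID as well.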

In [None]:
from rf3.inference_engines.rf3 import RF3InferenceEngine
from rf3.utils.inference import InferenceInput


# Initialize RF3 inference engine
inference_engine = RF3InferenceEngine(ckpt_path='rf3', verbose=False)

# Create input from the first MPNN-designed structure
# This re-folds the designed sequence to check that it adopts the intended structure
input_structure = InferenceInput.from_atom_array(mpnn_outputs[0].atom_array, example_id="example_protein")
rf3_outputs = inference_engine.run(inputs=input_structure)

# Outputs: dict mapping example_id -> list[RF3Output] (multiple models per input)
print(f"Output keys: {rf3_outputs.keys()}")
print(f"Number of models for 'example_protein': {len(rf3_outputs['example_protein'])}")

In [None]:
# Extract the top-ranked prediction
rf3_output = rf3_outputs["example_protein"][0]

# Inspect RF3Output structure
print("RF3Output contains:")
print(f"  - atom_array: {len(rf3_output.atom_array)} atoms")
print(f"  - summary_confidences: {list(rf3_output.summary_confidences.keys())}")
print(f"  - confidences: {list(rf3_output.confidences.keys()) if rf3_output.confidences else None}")

# Visualize the predicted structure
view(rf3_output.atom_array)

In [None]:
# Summary confidences: overall model quality metrics
summary = rf3_output.summary_confidences

print("=== Summary Confidences ===")
print(f"  Overall pLDDT:    {summary['overall_plddt']:.3f}")
print(f"  Overall PAE:      {summary['overall_pae']:.2f} A")
print(f"  Overall PDE:      {summary['overall_pde']:.3f}")
print(f"  pTM:              {summary['ptm']:.3f}")
print(f"  ipTM:             {summary.get('iptm', 'N/A (single chain)')}")
print(f"  Ranking score:    {summary['ranking_score']:.3f}")
print(f"  Has clash:        {summary['has_clash']}")

In [None]:
# Detailed per-atom/residue confidences
conf = rf3_output.confidences

print("=== Per-Atom/Residue Confidences ===")
print(f"  atom_plddts:      {len(conf['atom_plddts'])} values (one per atom)")
print(f"  atom_chain_ids:   {len(conf['atom_chain_ids'])} values")
print(f"  token_chain_ids:  {len(conf['token_chain_ids'])} values (one per residue)")
print(f"  token_res_ids:    {len(conf['token_res_ids'])} values")
print(f"  PAE matrix:       {len(conf['pae'])}x{len(conf['pae'][0])}")

# Preview first 10 atom pLDDT scores
import numpy as np
print(f"\nFirst 10 atom pLDDTs: {np.round(conf['atom_plddts'][:10], 2).tolist()}")

---

## Section 4: Validation and Export

The final step compares the RF3-predicted structure against the original RFD3-generated backbone. A low backbone RMSD indicates the designed sequence is likely to fold into the intended structure (high designability).
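
To make the comparison concrete, the sketch below computes RMSD after optimal rigid superposition using the Kabsch algorithm in plain NumPy. It illustrates the general technique and is not a copy of Biotite's implementation. The function name `kabsch_rmsd` and the random test coordinates are hypothetical, and the two coordinate sets are assumed to list the same atoms in the same order.

```python
import numpy as np


def kabsch_rmsd(reference, mobile):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    reference = np.asarray(reference, dtype=float)
    mobile = np.asarray(mobile, dtype=float)
    if reference.shape != mobile.shape:
        raise ValueError("coordinate sets must have the same shape")

    # Remove translation by centering both sets on their centroids
    ref_centered = reference - reference.mean(axis=0)
    mob_centered = mobile - mobile.mean(axis=0)

    # Covariance matrix between the two sets, then its SVD
    covariance = mob_centered.T @ ref_centered
    u, _, vt = np.linalg.svd(covariance)

    # Flip one axis if needed so the result is a proper rotation, not a reflection
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the mobile set onto the reference and measure the deviation
    fitted = mob_centered @ rotation.T
    return float(np.sqrt(np.mean(np.sum((fitted - ref_centered) ** 2, axis=1))))


# Sanity check on made-up coordinates: a rotated and shifted copy should fit exactly
rng = np.random.default_rng(0)
coords = rng.normal(size=(20, 3))
theta = np.pi / 6
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = coords @ rot_z.T + np.array([1.0, -2.0, 3.0])

rmsd_value_demo = kabsch_rmsd(coords, moved)  # approximately 0, up to floating-point error
print(f"RMSD after superposition: {rmsd_value_demo:.2e}")
```

Superimposing first matters because the generated and refolded structures generally sit in different coordinate frames. Without it, the RMSD would mostly reflect that arbitrary offset instead of any difference in fold.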

In [None]:
from biotite.structure import rmsd, superimpose
from atomworks.constants import PROTEIN_BACKBONE_ATOM_NAMES
import numpy as np

# Get structures for comparison
aa_generated = atom_array              # Original RFD3 backbone (from Section 1)
aa_refolded = rf3_output.atom_array    # RF3-predicted structure

# Filter to backbone atoms (N, CA, C, O)
bb_generated = aa_generated[np.isin(aa_generated.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]
bb_refolded = aa_refolded[np.isin(aa_refolded.atom_name, PROTEIN_BACKBONE_ATOM_NAMES)]

# Superimpose structures and calculate RMSD
bb_refolded_fitted, _ = superimpose(bb_generated, bb_refolded)
rmsd_value = rmsd(bb_generated, bb_refolded_fitted)

print(f"Backbone RMSD: {rmsd_value:.2f} A")
print(f"\nInterpretation: {'Excellent' if rmsd_value < 1.0 else 'Good' if rmsd_value < 2.0 else 'Moderate'} designability")

In [None]:
from atomworks.io.utils.io_utils import to_cif_file

# Export structures to CIF format for visualization in PyMOL/ChimeraX
to_cif_file(aa_generated, "generated.cif")
to_cif_file(aa_refolded, "refolded.cif")

print("Exported structures:")
print("  - generated.cif: Original RFD3 backbone")
print("  - refolded.cif:  RF3-predicted structure")