# 01 — PDB I/O, Visualization, and Structural Comparison (Colab)

**Goals**
- Download and read PDB files
- Inspect basic metadata (chains, residues, ligands)
- Visualize structures in Colab
- Align two structures and compute RMSD

> If you opened this notebook from GitHub: `File → Save a copy in Drive` to keep your edits.


## 0) Setup (install packages)

In [2]:
# If you're using a course repo later, you can replace this with a git-clone + requirements install.
!pip -q install biopython mdtraj py3Dmol requests

In [4]:
import os, sys
import numpy as np
import requests
from pathlib import Path

print("Python:", sys.version.split()[0])
ROOT = Path("/content/structbio")
DATA = ROOT / "data"
OUT  = ROOT / "outputs"
for d in [DATA, OUT]:
    d.mkdir(parents=True, exist_ok=True)
print("DATA:", DATA)
print("OUT :", OUT)

Python: 3.12.12
DATA: /content/structbio/data
OUT : /content/structbio/outputs


## 1) Download a PDB file (and cache it)

In [5]:
def fetch_pdb(pdb_id: str, out_dir: Path = DATA) -> Path:
    """Download a PDB from RCSB and save it locally. Returns the local path."""
    pdb_id = pdb_id.upper()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{pdb_id}.pdb"
    if out_path.exists() and out_path.stat().st_size > 0:
        print(f"Using cached: {out_path}")
        return out_path

    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    out_path.write_text(r.text)
    print(f"Downloaded: {out_path}")
    return out_path

pdb1_path = fetch_pdb("1CRN")  # small protein: crambin
pdb2_path = fetch_pdb("1EJG")  # crambin again (same sequence as 1CRN), a later ultra-high-resolution structure
pdb1_path, pdb2_path

Downloaded: /content/structbio/data/1CRN.pdb
Downloaded: /content/structbio/data/1EJG.pdb


(PosixPath('/content/structbio/data/1CRN.pdb'),
 PosixPath('/content/structbio/data/1EJG.pdb'))

## 2) Read and inspect PDB content with Biopython

In [6]:
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure1 = parser.get_structure("pdb1", str(pdb1_path))
structure2 = parser.get_structure("pdb2", str(pdb2_path))

def summarize_structure(structure):
    models = list(structure.get_models())
    chains = list(structure.get_chains())
    residues = [r for r in structure.get_residues()]
    atoms = list(structure.get_atoms())

    # Basic counts
    print(f"Models:   {len(models)}")
    print(f"Chains:   {len(chains)} -> {[c.id for c in chains]}")
    print(f"Residues: {len(residues)}")
    print(f"Atoms:    {len(atoms)}")

    # Identify hetero residues (ligands, ions, waters)
    hetero = []
    waters = 0
    for r in residues:
        hetflag, resseq, icode = r.get_id()
        if str(hetflag).startswith("W"):
            waters += 1
        elif str(hetflag).strip() != "":
            hetero.append(r)
    if hetero:
        names = sorted({r.get_resname() for r in hetero})
        print(f"Hetero residues (non-water): {len(hetero)} -> {names}")
    print(f"Waters: {waters}")

print("=== PDB1:", pdb1_path.name, "===")
summarize_structure(structure1)
print("\n=== PDB2:", pdb2_path.name, "===")
summarize_structure(structure2)

=== PDB1: 1CRN.pdb ===
Models:   1
Chains:   1 -> ['A']
Residues: 46
Atoms:    327
Waters: 0

=== PDB2: 1EJG.pdb ===
Models:   1
Chains:   1 -> ['A']
Residues: 46
Atoms:    641
Waters: 0
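
The classification in `summarize_structure` relies on the first field of Biopython's residue id tuple `(hetflag, resseq, icode)`: a blank string for standard residues, `"W"` for water, and `"H_"` followed by the residue name for other hetero residues. The sketch below restates that convention on plain tuples so it runs without Biopython. `classify_residue_id` is a hypothetical helper introduced here for illustration, not a library function, and the example ids are made up.

```python
def classify_residue_id(res_id):
    """Classify a Biopython-style residue id tuple (hetflag, resseq, icode).

    Assumes the documented Bio.PDB convention for the hetero flag:
    " " = standard residue, "W" = water, "H_<name>" = other hetero residue.
    """
    hetflag, resseq, icode = res_id
    if hetflag.strip() == "":
        return "standard"
    if hetflag == "W":
        return "water"
    if hetflag.startswith("H_"):
        # Keep the residue name so ligands/ions can be listed by name
        return "hetero:" + hetflag[2:]
    return "unknown"


# Illustrative ids (not taken from 1CRN or 1EJG)
examples = [(" ", 10, " "), ("W", 201, " "), ("H_GLC", 100, " ")]
for res_id in examples:
    print(res_id, "->", classify_residue_id(res_id))
```

Both crambin entries above report no hetero residues and no waters, so every residue in them falls in the `standard` category.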


In [11]:
ls structbio/data

1CRN.pdb  1EJG.pdb


In [12]:
cat structbio/data/1CRN.pdb

HEADER    PLANT PROTEIN                           30-APR-81   1CRN              
TITLE     WATER STRUCTURE OF A HYDROPHOBIC PROTEIN AT ATOMIC RESOLUTION.        
TITLE    2 PENTAGON RINGS OF WATER MOLECULES IN CRYSTALS OF CRAMBIN             
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: CRAMBIN;                                                   
COMPND   3 CHAIN: A;                                                            
COMPND   4 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: CRAMBE HISPANICA SUBSP. ABYSSINICA;             
SOURCE   3 ORGANISM_TAXID: 3721;                                                
SOURCE   4 STRAIN: SUBSP. ABYSSINICA                                            
KEYWDS    PLANT SEED PROTEIN, PLANT PROTEIN                                     
EXPDTA    X-RAY DIFFRACTION 
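
The header records shown above are followed, further down the file, by fixed-width `ATOM`/`HETATM` coordinate records. As a rough illustration of what the parsers in this notebook do with those lines, the sketch below slices one record by column position, following the standard PDB column layout. `parse_atom_line` is a hypothetical helper, and the sample record is assembled field by field for illustration rather than read from the downloaded file.

```python
# One ATOM record, written as one string literal per fixed-width field
# (adjacent literals are concatenated by Python). Values are illustrative.
SAMPLE = (
    "ATOM  " "    1" " " " N  " " " "THR" " " "A" "   1" " " "   "
    "  17.047" "  14.099" "   3.625" "  1.00" " 13.79" "          " " N"
)


def parse_atom_line(line):
    """Parse a single ATOM/HETATM line using PDB fixed columns (0-based slices)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "altloc": line[16].strip(),       # column 17
        "resname": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "resseq": int(line[22:26]),       # columns 23-26
        "icode": line[26].strip(),        # column 27
        "xyz": (
            float(line[30:38]),           # columns 31-38
            float(line[38:46]),           # columns 39-46
            float(line[46:54]),           # columns 47-54
        ),
        "occupancy": float(line[54:60]),  # columns 55-60
        "bfactor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }


atom = parse_atom_line(SAMPLE)
print(atom["name"], atom["resname"], atom["chain"], atom["resseq"], atom["xyz"])
```

In practice Biopython and MDTraj handle this parsing (plus alternate locations, multiple models, and malformed lines), so a hand-rolled parser like this is only useful for understanding the format.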

### Extract a chain sequence (roughly)

In [14]:
from Bio.PDB.Polypeptide import PPBuilder

ppb = PPBuilder()

def get_chain_sequences(structure):
    seqs = {}
    for model in structure:
        for chain in model:
            peptides = ppb.build_peptides(chain)
            if not peptides:
                continue
            # Many PDBs have multiple peptide segments; concatenate for simplicity
            seq = "".join(str(p.get_sequence()) for p in peptides)
            seqs[chain.id] = seq
        break  # first model
    return seqs

seqs1 = get_chain_sequences(structure1)
seqs2 = get_chain_sequences(structure2)

print("PDB1 sequences:")
for ch, seq in seqs1.items():
    print(ch, seq[:80] + ("..." if len(seq) > 80 else ""), f"(len={len(seq)})")
print("\nPDB2 sequences:")
for ch, seq in seqs2.items():
    print(ch, seq[:80] + ("..." if len(seq) > 80 else ""), f"(len={len(seq)})")

PDB1 sequences:
A TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN (len=46)

PDB2 sequences:
A TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN (len=46)
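
The two chains print the same 46-residue sequence, which is what makes a residue-by-residue comparison reasonable later on. A minimal sketch of checking this programmatically is shown below. `sequence_identity` is a hypothetical helper that only handles equal-length, ungapped sequences; sequences of different length would need a real alignment first.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError(
            f"Lengths differ ({len(seq_a)} vs {len(seq_b)}); align the sequences first."
        )
    if not seq_a:
        raise ValueError("Empty sequences have no defined identity.")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Chain A sequence as printed above for both entries
crambin = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(f"identity = {sequence_identity(crambin, crambin):.3f}")
```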


## 3) Visualize in Colab using `py3Dmol`

In [15]:
import py3Dmol

def show_pdb(pdb_path: Path, style="cartoon", color="spectrum", width=650, height=450):
    pdb_txt = pdb_path.read_text()
    view = py3Dmol.view(width=width, height=height)
    view.addModel(pdb_txt, "pdb")
    if style == "cartoon":
        view.setStyle({"cartoon": {"color": color}})
    elif style == "stick":
        view.setStyle({"stick": {}})
    else:
        view.setStyle({style: {}})
    view.zoomTo()
    return view

v1 = show_pdb(pdb1_path, style="cartoon")
v1.show()

In [16]:
v2 = show_pdb(pdb2_path, style="cartoon")
v2.show()

### Optional: highlight ligands/hetero atoms

In [41]:
def show_with_hetero(pdb_path: Path, width=650, height=450):
    pdb_txt = pdb_path.read_text()
    view = py3Dmol.view(width=width, height=height)
    view.addModel(pdb_txt, "pdb")
    # Protein (non-HETATM atoms) as a cartoon
    view.setStyle({"hetflag": False}, {"cartoon": {"color": "spectrum"}})
    # HETATM records (ligands, ions, waters) as sticks; 1CRN has none, so only the cartoon appears
    view.setStyle({"hetflag": True}, {"stick": {"radius": 0.3}})
    view.zoomTo()
    return view

my_view = show_with_hetero(pdb1_path)
my_view.show()

In [27]:
help(my_view)

Help on view in module py3Dmol object:

class view(builtins.object)
 |  view(query='', width=640, height=480, viewergrid=None, data=None, style=None, linked=True, options={}, format=None, js='https://cdn.jsdelivr.net/npm/3dmol@2.5.3/build/3Dmol-min.js')
 |
 |  A class for constructing embedded 3Dmol.js views in ipython notebooks.
 |  The results are completely static which means there is no need for there
 |  to be an active kernel but also that there is no communication between
 |  the javascript viewer and ipython.
 |
 |  Optionally, a viewergrid tuple (rows,columns) can be passed to create
 |  a grid of viewers in a single canvas object.  Successive commands than need to
 |  specify which viewer they apply to (with viewer=(r,c)) or will apply to all
 |  viewers in the grid.
 |
 |  The API for the created object is exactly that for $3Dmol.GLViewer, with
 |  the exception that the functions all return None.
 |  http://3dmol.org/doc/GLViewer.html
 |
 |  Methods defined here:
 |
 |  __g

## 4) Compare two structures: alignment + RMSD (with MDTraj)

We’ll do a **Cα RMSD** after superposition.

Notes:
- RMSD requires the same number of atoms, in the same order, in both structures.
- For unrelated proteins, RMSD is not meaningful. It works best for the **same protein** in different conformations or conditions.
- If your structures differ (missing residues, different chains), you may need to **select a common subset** of atoms first.
- MDTraj reports distances in nanometers, so the code below multiplies by 10 to print Å.
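
For reference, the quantity behind these notes is `RMSD = sqrt(mean_i |x_i - y_i|^2)`, evaluated after the rigid-body rotation and translation that minimizes it. The sketch below computes it with NumPy using the Kabsch algorithm (an SVD of the covariance matrix). It is an illustration on synthetic coordinates, not MDTraj's implementation; `kabsch_rmsd` is a hypothetical helper name.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimally superposing P onto Q.

    Row i of P must correspond to row i of Q (same atoms, same order).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3).")

    # Remove translation by centering both sets on their centroids
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Optimal rotation from the SVD of the 3x3 covariance matrix
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against an improper rotation (reflection)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


# Synthetic check: a rotated and shifted copy should superpose almost exactly
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))
theta = 0.7
Rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
Q = P @ Rz.T + np.array([1.0, -2.0, 3.0])
print(f"RMSD of rotated copy: {kabsch_rmsd(P, Q):.2e}")
```

The row-by-row pairing is why the first note matters: if the two arrays list atoms in different orders, the result is a number with no structural meaning.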


In [None]:
import mdtraj as md

t1 = md.load(str(pdb1_path))
t2 = md.load(str(pdb2_path))

print("t1 atoms:", t1.n_atoms, "residues:", t1.n_residues)
print("t2 atoms:", t2.n_atoms, "residues:", t2.n_residues)

# Select alpha carbons (CA) in each
sel1 = t1.topology.select("name CA and protein")
sel2 = t2.topology.select("name CA and protein")
print("CA counts:", len(sel1), len(sel2))

### If CA counts match, align and compute RMSD

In [None]:
def ca_rmsd_after_alignment(t_ref, t_mobile):
    sel_ref = t_ref.topology.select("name CA and protein")
    sel_mob = t_mobile.topology.select("name CA and protein")
    if len(sel_ref) != len(sel_mob):
        raise ValueError(
            f"CA atom counts differ ({len(sel_ref)} vs {len(sel_mob)}). "
            "Choose structures with matching residue sets or define a common selection."
        )

    # Make copies so we don't modify originals
    ref = t_ref.slice(np.arange(t_ref.n_frames))
    mob = t_mobile.slice(np.arange(t_mobile.n_frames))

    mob.superpose(ref, atom_indices=sel_mob, ref_atom_indices=sel_ref)
    rmsd_nm = md.rmsd(mob, ref, atom_indices=sel_mob, ref_atom_indices=sel_ref)
    return rmsd_nm

try:
    rmsd_nm = ca_rmsd_after_alignment(t1, t2)
    print(f"Cα RMSD after alignment: {rmsd_nm[0]*10:.3f} Å")
except Exception as e:
    print("RMSD comparison not possible with these two PDBs as-is.")
    print("Reason:", e)
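
If the Cα counts differ, one way to "define a common selection" is to pair atoms by a residue key that both structures share. The sketch below works on plain lists of keys so it runs without MDTraj; `common_residue_indices` is a hypothetical helper. With MDTraj, the keys could plausibly be built from each selected atom as `(atom.residue.chain.index, atom.residue.resSeq)`, and the returned positions used to index `sel_ref` and `sel_mob`. It assumes both entries use the same residue numbering and that each key occurs at most once per structure.

```python
def common_residue_indices(keys_a, keys_b):
    """Return paired positions of residue keys present in both lists.

    keys_a, keys_b: one hashable key per selected atom, e.g. (chain, resseq),
    in the same order as the atom selections they describe.
    """
    position_in_b = {key: j for j, key in enumerate(keys_b)}
    idx_a, idx_b = [], []
    for i, key in enumerate(keys_a):
        j = position_in_b.get(key)
        if j is not None:
            idx_a.append(i)
            idx_b.append(j)
    return idx_a, idx_b


# Illustrative keys: structure A lacks residue 4, structure B lacks residue 1
keys_a = [("A", 1), ("A", 2), ("A", 3), ("A", 5)]
keys_b = [("A", 2), ("A", 3), ("A", 4), ("A", 5)]
idx_a, idx_b = common_residue_indices(keys_a, keys_b)
print(idx_a, idx_b)
```

Matching by residue number is a shortcut. When numbering schemes differ between entries, a sequence alignment is the safer way to establish which residues correspond.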

## 5) Compare two conformations of the *same* protein (recommended demo)

Below we fetch two PDB entries for the **same protein** (hen egg-white lysozyme), so the RMSD is meaningful.

Tip: If you already have two structures of the same protein (apo vs holo, WT vs mutant), substitute their PDB IDs.


In [None]:
# Example pair: lysozyme structures (commonly used for RMSD demos)
pdbA = fetch_pdb("1AKI")  # hen egg-white lysozyme
pdbB = fetch_pdb("1LYZ")  # hen egg-white lysozyme (classic structure)

ta = md.load(str(pdbA))
tb = md.load(str(pdbB))

rmsd_nm = ca_rmsd_after_alignment(ta, tb)
print(f"{pdbA.stem} vs {pdbB.stem}: Cα RMSD = {rmsd_nm[0]*10:.3f} Å")

# Visualize the two structures (separately)
show_pdb(pdbA).show()
show_pdb(pdbB).show()

## 6) Save outputs (figures, tables)

In [None]:
import pandas as pd

# Example: write a small summary table
rows = [
    {"pdb": "1AKI", "n_atoms": ta.n_atoms, "n_residues": ta.n_residues},
    {"pdb": "1LYZ", "n_atoms": tb.n_atoms, "n_residues": tb.n_residues},
    {"pair": "1AKI vs 1LYZ", "CA_RMSD_Ang": float(rmsd_nm[0]*10)},
]
df = pd.DataFrame(rows)
out_csv = OUT / "pdb_summary.csv"
df.to_csv(out_csv, index=False)
print("Wrote:", out_csv)
df

## 7) Exercises (turn in your notebook)

1. Pick a PDB of interest (protein or complex) and summarize:
   - number of chains
   - number of residues
   - any hetero residues (ligands/ions)
2. Make a visualization that clearly shows:
   - secondary structure (cartoon)
   - and ligand/hetero atoms (sticks), if present
3. Find **two structures of the same protein** (e.g., apo vs holo, WT vs mutant) and compute:
   - Cα RMSD after alignment
   - (optional) RMSD using all heavy atoms of the protein

**Submission:** upload your `.ipynb` (and any saved outputs if requested).
