# Protein Binding Pocket Finder

This notebook implements a geometry-based algorithm for predicting **ligand binding pockets** in protein structures.

## Principle
Instead of computing expensive energy functions, the protein surface is sampled geometrically:
1. A 3D grid of points is placed over the entire protein
2. Points that are too close to the protein (collision) or too far away (empty space) are removed
3. The remaining points — located inside cavities and pockets — are grouped into clusters
4. Each cluster corresponds to one potential binding pocket

## Input
A `.pdb` file (here: `1H8D.pdb`, an entry from the RCSB Protein Data Bank)

---

## 0. Imports

All required libraries are imported here in one central place:
- **`numpy`** – fast array operations and vector math
- **`Bio.PDB`** – Biopython module for reading and analyzing PDB files
- **`scipy.spatial.KDTree`** – efficient spatial lookups (much faster than naive distance calculations)
- **`sklearn.cluster.DBSCAN`** – density-based clustering algorithm that requires no fixed cluster count
- **`time`** – runtime measurement

In [11]:
import numpy as np
import time

from Bio import PDB
from scipy.spatial import KDTree
from sklearn.cluster import DBSCAN

---
## 1. Reading the Protein Structure

A PDB file contains the amino acids of the protein as well as water molecules, ions, and ligands.
We only want the **protein atoms** — everything else (HETATM records) is filtered out.

### Why filter?
- **Water** would distort the grid: crystallographic waters are often loosely bound and can be displaced by a ligand, but the collision filter would treat them as solid atoms
- **Ligands / co-factors** can obscure the very binding pocket we are trying to find

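Biopython stores the record type in the first element of the residue ID (`residue.id[0]`, the "hetero flag"):
- `' '` (space) → standard residue read from `ATOM` records (amino acid)
- `'H_...'` → hetero residue read from `HETATM` records (ligand, ion, etc.); the flag is `'H_'` followed by the residue name
- `'W'` → water molecule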
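
As a minimal sketch of this convention, the helper below maps a hetero flag to a coarse category. `classify_hetero_flag` is a hypothetical name used only for illustration (it is not part of Biopython or of this notebook), and it assumes exactly the flag values listed above.

```python
def classify_hetero_flag(hetero_flag: str) -> str:
    """Map a Biopython hetero flag (residue.id[0]) to a coarse category.

    Hypothetical helper for illustration; assumes the flag values
    ' ', 'W' and 'H_<resname>' described above.
    """
    if hetero_flag == ' ':
        return "protein"      # standard residue
    if hetero_flag == 'W':
        return "water"
    if hetero_flag.startswith('H_'):
        return "hetero"       # ligand, ion, modified residue, ...
    return "unknown"


# Residue IDs are (hetero_flag, sequence_number, insertion_code) tuples
example_ids = [(' ', 16, ' '), ('W', 401, ' '), ('H_NAG', 300, ' ')]
categories = [classify_hetero_flag(res_id[0]) for res_id in example_ids]
print(categories)
```

The filter in the next cell corresponds to keeping only the `"protein"` category.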

In [12]:
def get_protein_structure(pdb_file: str):
    """
    Reads a PDB file and returns the structure object along with a list
    of all protein atoms (excluding water and ligands).

    Parameters:
        pdb_file: Path to the .pdb file

    Returns:
        structure:     Bio.PDB Structure object (full hierarchy)
        protein_atoms: List of all Atom objects belonging to amino acids
    """
    # QUIET=True suppresses warnings for minor format inconsistencies
    # PERMISSIVE=True allows non-standard PDB entries
    parser = PDB.PDBParser(QUIET=True, PERMISSIVE=True)

    # Load the structure — 'protein_obj' is an internal label for the object
    structure = parser.get_structure('protein_obj', pdb_file)

    protein_atoms = []

    # Hierarchical iteration: Model → Chain → Residue → Atom
    for model in structure:
        for chain in model:
            for residue in chain:
                # Keep only standard amino acids (no water, no ligands)
                if residue.id[0] == ' ':
                    for atom in residue:
                        protein_atoms.append(atom)

    print(f"Loaded: {len(protein_atoms)} protein atoms found.")
    return structure, protein_atoms


# --- Execution ---
FILE_PATH = "1H8D.pdb"

try:
    structure, atoms = get_protein_structure(FILE_PATH)

    # Spot check: print the first 5 atoms to verify correct parsing
    print("\nCoordinate check (first 5 atoms):")
    for atom in atoms[:5]:
        residue = atom.get_parent()
        print(f"  Residue: {residue.get_resname():>3} | Atom: {atom.get_name():<4} | XYZ: {atom.get_coord()}")

except FileNotFoundError:
    print(f"Error: File '{FILE_PATH}' not found.")

Loaded: 2333 protein atoms found.

Coordinate check (first 5 atoms):
  Residue: ILE | Atom: N    | XYZ: [ 5.169 -8.919 17.688]
  Residue: ILE | Atom: CA   | XYZ: [ 4.5   -8.595 18.976]
  Residue: ILE | Atom: C    | XYZ: [ 3.397 -9.622 19.265]
  Residue: ILE | Atom: O    | XYZ: [ 2.507 -9.808 18.423]
  Residue: ILE | Atom: CB   | XYZ: [ 3.896 -7.17  18.971]


---
## 2. Creating the Search Grid

We place a regular 3D grid over the entire protein — like a chessboard pattern in space, but in three dimensions.

### Parameters:
| Parameter | Meaning | Recommendation |
|-----------|---------|----------------|
| `spacing` | Distance between grid points in Ångström | 2.0 Å (testing), 1.0 Å (final) |
| Buffer (`±5 Å`) | Extra space around the protein | fixed at 5 Å |

### Why `numpy.meshgrid`?
`meshgrid` generates all combinations of x, y, z coordinates at once — much more efficient than three nested `for` loops.

> **Warning:** With `spacing=1.0` the point count explodes (~8× more points than at 2.0 Å). For large proteins this can mean several million points.
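
To make the scaling concrete, here is a small sketch that predicts the number of grid points from a bounding box and checks on a tiny example that `meshgrid` yields exactly the Cartesian product of the axis values. The bounding box is made up for illustration, and `expected_grid_size` is a hypothetical helper, not part of the notebook.

```python
import itertools
import math

import numpy as np


def expected_grid_size(min_coords, max_coords, spacing):
    """Number of points produced by np.arange-based axes over a bounding box."""
    return math.prod(
        math.ceil((hi - lo) / spacing) for lo, hi in zip(min_coords, max_coords)
    )


# Hypothetical bounding box (buffer already included), in Å
box_min = (0.0, 0.0, 0.0)
box_max = (50.0, 60.0, 55.0)

coarse = expected_grid_size(box_min, box_max, spacing=2.0)
fine = expected_grid_size(box_min, box_max, spacing=1.0)
print(coarse, fine, round(fine / coarse, 2))  # 21000 165000 7.86

# Cross-check on a tiny box: meshgrid yields exactly the Cartesian product
x = np.arange(0.0, 4.0, 2.0)   # 2 values
y = np.arange(0.0, 6.0, 2.0)   # 3 values
z = np.arange(0.0, 8.0, 2.0)   # 4 values
grid = np.array(np.meshgrid(x, y, z)).T.reshape(-1, 3)
loop_points = set(itertools.product(x.tolist(), y.tolist(), z.tolist()))
grid_points = set(map(tuple, grid.tolist()))
print(grid.shape, grid_points == loop_points)
```

The ratio lands slightly below 8 here because each axis length is rounded up to a whole number of steps.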

In [13]:
def create_search_grid(protein_atoms: list, spacing: float = 2.0) -> np.ndarray:
    """
    Creates a uniform 3D grid around the protein.

    Parameters:
        protein_atoms: List of Biopython Atom objects
        spacing:       Distance between grid points in Ångström

    Returns:
        grid: numpy array of shape (N, 3) containing N grid points
    """
    BUFFER = 5.0  # Å — padding region extending beyond the protein

    # Extract all atom coordinates into a single (N, 3) matrix
    coords = np.array([atom.get_coord() for atom in protein_atoms])

    # Bounding box: smallest and largest coordinates along x, y, z
    min_coords = coords.min(axis=0) - BUFFER
    max_coords = coords.max(axis=0) + BUFFER

    # Generate axis arrays
    x = np.arange(min_coords[0], max_coords[0], spacing)
    y = np.arange(min_coords[1], max_coords[1], spacing)
    z = np.arange(min_coords[2], max_coords[2], spacing)

    # Generate all (x, y, z) combinations and reshape into an (N, 3) matrix
    grid = np.array(np.meshgrid(x, y, z)).T.reshape(-1, 3)

    print(f" Grid created: {grid.shape[0]:,} points at {spacing} Å spacing.")
    return grid


# --- Execution ---
if 'atoms' in locals():
    grid = create_search_grid(atoms, spacing=2.0)

 Grid created: 26,796 points at 2.0 Å spacing.


---
## 3. Filtering Pocket Candidates

Not every grid point is meaningful. We apply two geometric filters:

### Filter 1 — Collision check (`min_dist`)
Points **inside** the protein clash with the Van der Waals radii of the atoms.
→ All points with any atom closer than `min_dist` (2.5 Å) are **removed**.

### Filter 2 — Surface proximity (`max_dist`)
Points far **outside** the protein lie in the bulk solvent — no pocket there.
→ Only points with at least one atom within `max_dist` (5.0 Å) are **kept**.

### Why `KDTree`?
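A KDTree is a data structure built for spatial lookups. A naive approach would compare each of the ~27,000 grid points with all 2,333 atoms (~63 million distance calculations). With a KDTree, building the tree costs roughly **O(M log M)** for M atoms, and each of the N grid-point queries then takes about **O(log M)** on average (plus the number of neighbors returned) instead of O(M).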
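
The following sketch compares a brute-force distance matrix with a KDTree lookup on synthetic random coordinates (not the 1H8D data). It also illustrates an observation that is an assumption of this example rather than the notebook's method: both filters depend only on the distance to the nearest atom, so a single nearest-neighbor query via `KDTree.query` could stand in for the two `query_ball_point` calls used below.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
atom_coords = rng.uniform(0.0, 20.0, size=(200, 3))   # synthetic "atoms"
grid_points = rng.uniform(-5.0, 25.0, size=(500, 3))  # synthetic grid

min_dist, max_dist = 2.5, 5.0

# Brute force: full (500 x 200) distance matrix, then the nearest atom per point
diff = grid_points[:, None, :] - atom_coords[None, :, :]
nearest_brute = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

# KDTree: one nearest-neighbor query per grid point
nearest_tree, _ = KDTree(atom_coords).query(grid_points, k=1)

# Both filters expressed through the nearest-atom distance d:
# no clash (d > min_dist) and near the surface (d <= max_dist)
mask_brute = (nearest_brute > min_dist) & (nearest_brute <= max_dist)
mask_tree = (nearest_tree > min_dist) & (nearest_tree <= max_dist)
print(mask_tree.sum(), "of", len(grid_points), "points kept")
```

The brute-force matrix grows with N × M, which is why it becomes impractical for fine grids, while the tree query scales far more gently.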

In [14]:
def find_pocket_points(
    grid_points: np.ndarray,
    protein_atoms: list,
    min_dist: float = 2.5,
    max_dist: float = 5.0
) -> np.ndarray:
    """
    Filters grid points down to potential binding pocket candidates.

    A point qualifies as a pocket candidate if it:
        - does NOT lie within min_dist of any atom (no clash)
        - has at least one atom within max_dist (close to the surface)

    Parameters:
        grid_points:   numpy array (N, 3) of grid points
        protein_atoms: list of Biopython Atom objects
        min_dist:      minimum allowed distance to the protein in Å (Van der Waals buffer)
        max_dist:      maximum allowed distance to the surface in Å

    Returns:
        pocket_points: numpy array (M, 3) of remaining candidate points
    """
    # Extract coordinates from Biopython objects
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])

    # Build the KDTree once — reused for both filters
    tree = KDTree(atom_coords)

    # --- Filter 1: Collision check ---
    # query_ball_point returns for each grid point all atom indices
    # within min_dist. If the list is empty → no clash → point is kept.
    clash_neighbors = tree.query_ball_point(grid_points, min_dist)
    no_clash_mask = np.array([len(neighbors) == 0 for neighbors in clash_neighbors])
    candidates = grid_points[no_clash_mask]

    # --- Filter 2: Surface proximity ---
    # From the candidate set: keep only points that have at least one atom
    # within max_dist (i.e. close to the protein surface)
    surface_neighbors = tree.query_ball_point(candidates, max_dist)
    near_surface_mask = np.array([len(neighbors) > 0 for neighbors in surface_neighbors])
    pocket_points = candidates[near_surface_mask]

    print(f"Filtered: {len(grid_points):,} → {len(pocket_points):,} pocket candidates "
          f"({len(pocket_points)/len(grid_points)*100:.1f}% remaining).")
    return pocket_points


# --- Execution with runtime measurement ---
if 'atoms' in locals() and 'grid' in locals():
    start = time.time()
    pocket_candidates = find_pocket_points(grid, atoms)
    elapsed = time.time() - start

Filtered: 26,796 → 4,087 pocket candidates (15.3% remaining).


---
## 4. Cluster Analysis: Identifying Individual Pockets

The ~4,000 pocket points form connected regions in space. We use **DBSCAN** to group them into discrete pockets.

### Why DBSCAN instead of k-Means?
- DBSCAN requires **no prior specification** of the number of clusters
- DBSCAN detects clusters of arbitrary shape (pockets are irregular)
- DBSCAN marks isolated outlier points as **noise** (label `-1`)
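
A toy example of these three properties, using made-up lattice points and parameters chosen for this lattice (not the notebook defaults): two separated blocks are found without specifying a cluster count, and a stray point is labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two 2x2x2 blocks of lattice points (1 Å spacing), 10 Å apart, plus one stray point
block = np.array(
    [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], dtype=float
)
points = np.vstack([block, block + [10.0, 0.0, 0.0], [[5.0, 5.0, 5.0]]])

# Toy parameters: each block corner has 3 neighbors within 1.1 Å plus itself
labels = DBSCAN(eps=1.1, min_samples=4).fit(points).labels_

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise point")
```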

### Parameters:
| Parameter | Meaning | Typical value |
|-----------|---------|---------------|
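| `eps` | Neighborhood radius: two points count as neighbors if they are at most this far apart (not a limit on the overall cluster diameter) | 2.5 Å |
| `min_samples` | Number of points within `eps` (including the point itself) needed for a point to count as a core point | 5 |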
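
Because the candidate points sit on a regular lattice, `eps` decides how many grid neighbors a point can have at all. The sketch below counts them for a few combinations; `lattice_neighbors_within` is a hypothetical helper written for this illustration. With 2.0 Å spacing and `eps=2.5`, only the six face neighbors are reachable (the diagonal is about 2.83 Å away), so with scikit-learn counting the point itself, `min_samples=5` means a core point needs at least four of its six face neighbors.

```python
import itertools


def lattice_neighbors_within(spacing: float, eps: float) -> int:
    """Count lattice points (excluding the center) within eps of a grid point."""
    reach = int(eps // spacing) + 1
    count = 0
    for i, j, k in itertools.product(range(-reach, reach + 1), repeat=3):
        if (i, j, k) == (0, 0, 0):
            continue
        if (i * i + j * j + k * k) * spacing ** 2 <= eps ** 2:
            count += 1
    return count


for spacing, eps in [(2.0, 2.5), (2.0, 3.0), (1.0, 2.5)]:
    print(spacing, eps, lattice_neighbors_within(spacing, eps))
```

Keeping `eps=2.5` while refining the grid to 1.0 Å raises the possible neighbor count from 6 to 80, so `eps` and `min_samples` are best re-tuned together with `spacing`.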

In [15]:
def cluster_pocket_points(
    pocket_points: np.ndarray,
    eps: float = 2.5,
    min_samples: int = 5
) -> dict:
    """
    Groups pocket points into discrete binding pockets using DBSCAN.

    Parameters:
        pocket_points: numpy array (N, 3) of filtered candidate points
        eps:           neighborhood radius in Å; points at most this far apart are neighbors
        min_samples:   number of points within eps (including the point itself) needed
                       for a core point; resulting clusters can still be smaller

    Returns:
        pockets: dictionary {cluster_id: np.ndarray of points}
                 sorted by cluster size (largest pocket first)
    """
    if len(pocket_points) == 0:
        print(" No points available for clustering.")
        return {}

    # Run DBSCAN clustering
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(pocket_points)
    labels = db.labels_  # Array of length N with cluster IDs (-1 = noise)

    # Group points by cluster ID (exclude noise label -1)
    unique_labels = set(labels) - {-1}  # Discard noise points
    pockets = {
        label: pocket_points[labels == label]
        for label in unique_labels
    }

    # Sort by cluster size (largest pocket = most likely binding site)
    pockets = dict(sorted(pockets.items(), key=lambda item: len(item[1]), reverse=True))

    # Print summary
    noise_count = np.sum(labels == -1)
    print(f"{len(pockets)} pockets found ({noise_count} noise points discarded):")
    for rank, (p_id, p_points) in enumerate(pockets.items()):
        center = np.mean(p_points, axis=0)
        print(f"   Pocket #{rank+1}: {len(p_points):>4} points | Center: "
              f"({center[0]:.1f}, {center[1]:.1f}, {center[2]:.1f})")

    return pockets


# --- Execution ---
if 'pocket_candidates' in locals():
    pockets_dict = cluster_pocket_points(pocket_candidates)

22 pockets found (126 noise points discarded):
   Pocket #1: 3925 points | Center: (13.0, -1.0, 18.9)
   Pocket #2:    5 points | Center: (2.2, 3.9, 12.6)
   Pocket #3:    4 points | Center: (0.3, 11.8, 29.1)
   Pocket #4:    3 points | Center: (35.1, 12.3, 21.3)
   Pocket #5:    3 points | Center: (12.5, -9.7, 23.3)
   Pocket #6:    2 points | Center: (-2.2, 13.3, 4.6)
   Pocket #7:    2 points | Center: (28.8, 14.3, 4.6)
   Pocket #8:    2 points | Center: (29.8, 14.3, 7.6)
   Pocket #9:    2 points | Center: (34.8, -19.7, 22.6)
   Pocket #10:    1 points | Center: (1.8, 10.3, -7.4)
   Pocket #11:    1 points | Center: (3.8, -13.7, -1.4)
   Pocket #12:    1 points | Center: (27.8, -7.7, 0.6)
   Pocket #13:    1 points | Center: (29.8, 12.3, 4.6)
   Pocket #14:    1 points | Center: (-8.2, 12.3, 12.6)
   Pocket #15:    1 points | Center: (-4.2, 10.3, 12.6)
   Pocket #16:    1 points | Center: (-12.2, 14.3, 18.6)
   Pocket #17:    1 points | Center: (33.8, 12.3, 22.6)
   Pocket #18:   

---
## 5. Export for Visualization in UCSF Chimera / ChimeraX

Each pocket is saved as a separate `.pdb` file, with every point encoded as a dummy `HETATM` atom.
All files can then be loaded simultaneously in Chimera and color-coded individually.

**Workflow in Chimera:**
1. Load the protein: `File → Open → 1H8D.pdb`
2. Load pockets: `File → Open → step3_pocket_0.pdb`, etc.
3. Set representation: `Actions → Atoms/Bonds → sphere`
4. Set colors: `Actions → Color → ...`
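
PDB records are fixed-width, so a shifted column can make a viewer misread coordinates. The sketch below formats one dummy point and slices the fields back out to check the layout. `format_hetatm` is a hypothetical helper that mirrors the record layout used in this notebook; it is not a Biopython or Chimera function.

```python
def format_hetatm(serial: int, x: float, y: float, z: float) -> str:
    """Format one dummy point as a fixed-width PDB HETATM record (no newline).

    Hypothetical helper mirroring the layout used in this notebook:
    atom name 'P', residue 'PTS', chain 'A', residue number 1, element 'N'.
    """
    return (
        f"HETATM{serial:5d}  P   PTS A   1    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00"
        f"{'N':>12}"  # element symbol right-justified so it ends in column 78
    )


line = format_hetatm(1, 13.0, -1.0, 18.9)

# PDB columns are 1-based; Python slices are 0-based and end-exclusive
fields = {
    "record": line[0:6],      # cols 1-6
    "serial": line[6:11],     # cols 7-11
    "name": line[12:16],      # cols 13-16
    "resname": line[17:20],   # cols 18-20
    "chain": line[21],        # col 22
    "x": line[30:38],         # cols 31-38
    "y": line[38:46],         # cols 39-46
    "z": line[46:54],         # cols 47-54
    "element": line[76:78],   # cols 77-78
}
print(fields)
```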

In [16]:
def save_points_to_pdb(points: np.ndarray, output_file: str) -> None:
    """
    Saves a numpy array of 3D coordinates as a PDB file.
    Each point is written as a HETATM entry (dummy atom named 'P', element column 'N').

    Parameters:
        points:      numpy array (N, 3) with XYZ coordinates
        output_file: target file path
    """
    with open(output_file, 'w') as f:
        for i, (x, y, z) in enumerate(points):
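            # PDB HETATM format (column-accurate per PDB specification):
            # Cols 1-6: Record type | 7-11: Atom serial | 13-16: Atom name
            # 18-20: Residue name | 22: Chain | 23-26: Residue seq | 31-54: X, Y, Z
            # The serial field is only 5 columns wide, so wrap after 99,999 to keep
            # the columns aligned for grids with more points (e.g. spacing=1.0).
            serial = i % 99999 + 1
            f.write(f"HETATM{serial:5d}  P   PTS A   1    {x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00           N\n")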
        f.write("END\n")
    print(f" Saved: {output_file} ({len(points)} points)")


def export_all_steps(atoms: list, output_dir: str = ".") -> None:
    """
    Runs the full pipeline and exports each stage as a PDB file.

    Output files:
        step1_full_grid.pdb          — the complete grid around the protein
        step2_pocket_candidates.pdb  — filtered pocket candidate points
        step3_pocket_N.pdb           — individual clustered pockets (sorted by size)
    """
    import os
    os.makedirs(output_dir, exist_ok=True)

    # Step 1: Full grid
    print(" Step 1: Generating grid ...")
    full_grid = create_search_grid(atoms, spacing=2.0)
    save_points_to_pdb(full_grid, f"{output_dir}/step1_full_grid.pdb")

    # Step 2: Filtered candidates
    print("\nStep 2: Filtering candidates ...")
    candidates = find_pocket_points(full_grid, atoms)
    save_points_to_pdb(candidates, f"{output_dir}/step2_pocket_candidates.pdb")

    # Step 3: Clustering
    print("\n Step 3: Clustering pockets ...")
    pockets = cluster_pocket_points(candidates)

    print("\n Exporting pockets ...")
    for rank, (p_id, p_points) in enumerate(pockets.items()):
        save_points_to_pdb(p_points, f"{output_dir}/step3_pocket_{rank}.pdb")

    print(f"\nExport complete. {len(pockets)} pocket files saved.")


# --- Execution ---
if 'atoms' in locals():
    export_all_steps(atoms, output_dir="pocket_output")

 Step 1: Generating grid ...
 Grid created: 26,796 points at 2.0 Å spacing.
 Saved: pocket_output/step1_full_grid.pdb (26796 points)

Step 2: Filtering candidates ...
Filtered: 26,796 → 4,087 pocket candidates (15.3% remaining).
 Saved: pocket_output/step2_pocket_candidates.pdb (4087 points)

 Step 3: Clustering pockets ...
22 pockets found (126 noise points discarded):
   Pocket #1: 3925 points | Center: (13.0, -1.0, 18.9)
   Pocket #2:    5 points | Center: (2.2, 3.9, 12.6)
   Pocket #3:    4 points | Center: (0.3, 11.8, 29.1)
   Pocket #4:    3 points | Center: (35.1, 12.3, 21.3)
   Pocket #5:    3 points | Center: (12.5, -9.7, 23.3)
   Pocket #6:    2 points | Center: (-2.2, 13.3, 4.6)
   Pocket #7:    2 points | Center: (28.8, 14.3, 4.6)
   Pocket #8:    2 points | Center: (29.8, 14.3, 7.6)
   Pocket #9:    2 points | Center: (34.8, -19.7, 22.6)
   Pocket #10:    1 points | Center: (1.8, 10.3, -7.4)
   Pocket #11:    1 points | Center: (3.8, -13.7, -1.4)
   Pocket #12:    1 points

---
## 6. Residue Extraction — Who Lines the Pocket?

Once we have the pocket points, the next biological question is: **which amino acids form the walls of each pocket?**
These residues are the ones a ligand would actually make contact with.

### How it works
We use the same KDTree over the protein atoms as before, but read the result differently:
instead of asking only *whether* a grid point has atoms nearby, we collect **which protein atoms lie close to each pocket point**.
Any atom within `threshold` (4.5 Å) of any pocket point belongs to a **lining residue**.

### Why 4.5 Å?
This distance comfortably captures:
- Direct Van der Waals contacts (~3.5 Å)
- Hydrogen bond donor/acceptor pairs (~2.5–3.5 Å)
- Weak electrostatic interactions (~4.5 Å)

Going much larger would pull in residues that are structurally far from the pocket.
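
A toy sketch of the lookup, with made-up coordinates and a simple `namedtuple` standing in for Biopython atoms. One detail here is an assumption of this example and differs from the `NAME-NUMBER` format used in the cell below: the residue key also includes the chain ID, so that residues with the same name and number on different chains stay distinct.

```python
from collections import namedtuple

import numpy as np
from scipy.spatial import KDTree

# Stand-in for Biopython atoms: chain ID, residue name, residue number, coordinates
ToyAtom = namedtuple("ToyAtom", "chain resname resnum coord")

toy_atoms = [
    ToyAtom("A", "HIS", 57, (0.0, 0.0, 0.0)),
    ToyAtom("A", "HIS", 57, (1.5, 0.0, 0.0)),    # second atom of the same residue
    ToyAtom("B", "HIS", 57, (0.0, 4.0, 0.0)),    # same name and number, other chain
    ToyAtom("A", "LEU", 99, (20.0, 0.0, 0.0)),   # far away from the pocket
]
pocket_points = np.array([[0.0, 2.0, 0.0], [1.0, 2.0, 0.0]])

tree = KDTree(np.array([a.coord for a in toy_atoms]))
neighbor_lists = tree.query_ball_point(pocket_points, 4.5)

# Flatten to unique atom indices, then to unique (chain, name, number) keys
atom_indices = {idx for sublist in neighbor_lists for idx in sublist}
lining = sorted(
    {(toy_atoms[i].chain, toy_atoms[i].resname, toy_atoms[i].resnum)
     for i in atom_indices}
)
print(lining)
```

Without the chain ID, the two histidines in this toy example would collapse into a single `HIS-57` entry.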

In [17]:
def extract_and_save_residues(
    pockets_dict: dict,
    protein_atoms: list,
    output_file: str = "pocket_residues.txt",
    threshold: float = 4.5
) -> None:
    """
    Identifies all protein residues lining each pocket and saves them to a text file.

    A residue is considered a pocket-lining residue if any of its atoms
    lies within `threshold` Å of any pocket point.

    Parameters:
        pockets_dict:  dictionary {cluster_id: np.ndarray of pocket points}
        protein_atoms: list of Biopython Atom objects
        output_file:   path to save the residue report
        threshold:     maximum distance from pocket point to atom in Å
    """
    # Build a KDTree over all protein atom coordinates for fast lookup
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])
    tree = KDTree(atom_coords)

    with open(output_file, 'w') as f:
        f.write("LIGAND BINDING SITE PREDICTIONS — RESIDUE LIST\n")
        f.write("=" * 50 + "\n\n")

        for p_id, points in pockets_dict.items():
            # For every pocket point, find all atom indices within threshold
            neighbor_indices = tree.query_ball_point(points, threshold)

            # Flatten into a set of unique atom indices (avoid counting the same atom twice)
            flat_indices = set(idx for sublist in neighbor_indices for idx in sublist)

            # Map atom indices → residue names and numbers
            found_residues = set()
            for idx in flat_indices:
                res = protein_atoms[idx].get_parent()
                # Format: NAME-NUMBER, e.g. HIS-195
                res_info = f"{res.get_resname()}-{res.id[1]}"
                found_residues.add(res_info)

            # Sort alphabetically for readable output
            sorted_res = sorted(found_residues)

            f.write(f"Pocket {p_id} ({len(points)} points):\n")
            f.write(f"Lining Residues: {', '.join(sorted_res)}\n")
            f.write("-" * 40 + "\n")

            print(f"   Pocket {p_id}: {len(sorted_res)} lining residues identified.")

    print(f"\nResidue report saved to: {output_file}")


# --- Execution ---
# Re-run the pipeline to ensure pockets_dict is available
if 'atoms' in locals():
    full_grid = create_search_grid(atoms, spacing=2.0)
    pocket_candidates = find_pocket_points(full_grid, atoms)
    pockets_dict = cluster_pocket_points(pocket_candidates)
    extract_and_save_residues(pockets_dict, atoms)

 Grid created: 26,796 points at 2.0 Å spacing.
Filtered: 26,796 → 4,087 pocket candidates (15.3% remaining).
22 pockets found (126 noise points discarded):
   Pocket #1: 3925 points | Center: (13.0, -1.0, 18.9)
   Pocket #2:    5 points | Center: (2.2, 3.9, 12.6)
   Pocket #3:    4 points | Center: (0.3, 11.8, 29.1)
   Pocket #4:    3 points | Center: (35.1, 12.3, 21.3)
   Pocket #5:    3 points | Center: (12.5, -9.7, 23.3)
   Pocket #6:    2 points | Center: (-2.2, 13.3, 4.6)
   Pocket #7:    2 points | Center: (28.8, 14.3, 4.6)
   Pocket #8:    2 points | Center: (29.8, 14.3, 7.6)
   Pocket #9:    2 points | Center: (34.8, -19.7, 22.6)
   Pocket #10:    1 points | Center: (1.8, 10.3, -7.4)
   Pocket #11:    1 points | Center: (3.8, -13.7, -1.4)
   Pocket #12:    1 points | Center: (27.8, -7.7, 0.6)
   Pocket #13:    1 points | Center: (29.8, 12.3, 4.6)
   Pocket #14:    1 points | Center: (-8.2, 12.3, 12.6)
   Pocket #15:    1 points | Center: (-4.2, 10.3, 12.6)
   Pocket #16:    1 p

---
## 7. Chemical Analysis & Pocket Ranking

Not all pockets are equal. We now ask: **what kind of ligand would fit here?**
The answer depends almost entirely on the chemical character of the lining residues.

### The Chemical Basis: Polarity

Amino acids are divided into two broad groups based on their side-chain chemistry:

| Group | Residues | Properties |
|-------|----------|------------|
| **Non-polar / Hydrophobic** | ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO, GLY | Repel water; prefer lipophilic ligands |
| **Polar / Charged** | SER, THR, CYS, TYR, ASN, GLN, ASP, GLU, LYS, ARG, HIS | Form H-bonds; prefer hydrophilic ligands |

### The Polarity Ratio
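We calculate what fraction of the pocket lining is polar. As implemented below, every lining atom contributes the class of its parent residue, so residues with many contacting atoms weigh more: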
```
polarity_ratio = count_polar / (count_polar + count_nonpolar)
```
- **ratio ≥ 0.5** → pocket prefers **polar ligands** (e.g. ATP, sugars, charged drugs); a tie counts as polar in the code below
- **ratio < 0.5** → pocket prefers **non-polar ligands** (e.g. lipophilic drugs, fatty acids)
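
The ratio can be computed per lining atom or per lining residue, and the two can disagree. The sketch below uses a made-up lining (not data from 1H8D) to show the difference; `polarity_ratio` is a hypothetical helper, and the per-residue variant is shown as one possible alternative, not as the notebook's method.

```python
POLAR = {'SER', 'THR', 'CYS', 'TYR', 'ASN', 'GLN', 'ASP', 'GLU', 'LYS', 'ARG', 'HIS'}
NON_POLAR = {'ALA', 'VAL', 'LEU', 'ILE', 'MET', 'PHE', 'TRP', 'PRO', 'GLY'}


def polarity_ratio(resnames) -> float:
    """Fraction of polar entries among all classified entries (0.0 if none)."""
    names = list(resnames)  # materialize so the input can be scanned twice
    polar = sum(1 for name in names if name in POLAR)
    nonpolar = sum(1 for name in names if name in NON_POLAR)
    total = polar + nonpolar
    return polar / total if total else 0.0


# Hypothetical lining atoms as (chain, residue number, residue name):
# five atoms of one arginine, one atom each of a leucine and a valine
lining_atoms = [("A", 10, "ARG")] * 5 + [("A", 11, "LEU"), ("A", 12, "VAL")]

per_atom = polarity_ratio(name for _, _, name in lining_atoms)
per_residue = polarity_ratio(name for _, _, name in set(lining_atoms))
print(f"per atom: {per_atom:.2f}, per residue: {per_residue:.2f}")
```

In this toy case the atom-weighted ratio (0.71) suggests a polar pocket, while the residue-weighted ratio (0.33) suggests a non-polar one.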

### Ranking Score
Pockets are ranked primarily by **size** (number of points), which correlates with:
- Available volume for ligand binding
- Number of atom contacts a ligand can make
- General druggability (larger pockets → more opportunities for selectivity)

> **Note:** This is a simplified score. Docking programs such as AutoDock additionally evaluate energy terms such as
> van der Waals interactions, hydrogen bonding, electrostatics, desolvation, and torsional entropy.
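
Since the score is a raw point count, it depends on the grid spacing. As a rough sketch, each grid point can be taken to represent one cubic cell of side `spacing`, which turns the count into an approximate volume that is comparable across spacings. `approximate_volume` is a hypothetical helper and the cubic-cell assumption is a simplification.

```python
def approximate_volume(n_points: int, spacing: float) -> float:
    """Approximate cavity volume in Å^3: each grid point stands for one cubic cell."""
    return n_points * spacing ** 3


# Largest cluster reported at 2.0 Å spacing
print(approximate_volume(3925, 2.0))   # 31400.0
# The same volume sampled at 1.0 Å would contain about 8x as many points
print(approximate_volume(31400, 1.0))  # 31400.0
```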

In [18]:
def analyze_and_rank_pockets(
    pockets_dict: dict,
    protein_atoms: list,
    threshold: float = 4.5
) -> list:
    """
    Analyzes the chemical environment of each pocket and ranks them by size.

    For each pocket:
        - Identifies all lining residues within `threshold` Å
        - Counts polar vs. non-polar residues
        - Assigns a ligand preference (polar vs. non-polar)
        - Scores the pocket by its point count (proxy for volume)

    Parameters:
        pockets_dict:  dictionary {cluster_id: np.ndarray of pocket points}
        protein_atoms: list of Biopython Atom objects
        threshold:     contact distance to determine lining residues (Å)

    Returns:
        ranked_pockets: list of result dicts, sorted by score (best first)
    """
    # Standard amino acid classification by polarity
    NON_POLAR = {'ALA', 'VAL', 'LEU', 'ILE', 'MET', 'PHE', 'TRP', 'PRO', 'GLY'}
    POLAR     = {'SER', 'THR', 'CYS', 'TYR', 'ASN', 'GLN', 'ASP', 'GLU', 'LYS', 'ARG', 'HIS'}

    # Build KDTree once for all pockets
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])
    tree = KDTree(atom_coords)

    results = []

    for p_id, points in pockets_dict.items():
        # --- Find all lining residues ---
        neighbor_indices = tree.query_ball_point(points, threshold)
        flat_indices = set(idx for sublist in neighbor_indices for idx in sublist)

        # Collect residue names for all lining atoms
        res_names = [protein_atoms[idx].get_parent().get_resname() for idx in flat_indices]

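        # --- Count polar vs. non-polar lining atoms ---
        # res_names holds one entry per lining atom, so a residue that contributes
        # several atoms is counted several times (atom-weighted ratio).
        count_polar    = sum(1 for r in res_names if r in POLAR)
        count_nonpolar = sum(1 for r in res_names if r in NON_POLAR)
        total = max(count_polar + count_nonpolar, 1)  # avoid division by zero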

        # --- Determine ligand preference ---
        preference = "Polar Ligands" if count_polar >= count_nonpolar else "Non-Polar Ligands"

        # --- Scoring: pocket size (point count) as proxy for volume ---
        # A larger pocket has more room for a ligand and more potential contacts.
        score = len(points)

        results.append({
            'id':           p_id,
            'size':         len(points),
            'score':        score,
            'preference':   preference,
            'polar_ratio':  count_polar / total,
            'residues':     sorted(set(res_names))  # Unique residue names, sorted
        })

    # Sort by score descending — highest-scoring pocket is Rank 1
    ranked_pockets = sorted(results, key=lambda x: x['score'], reverse=True)
    return ranked_pockets


# --- Execution & Report ---
if 'pockets_dict' in locals():
    ranked_list = analyze_and_rank_pockets(pockets_dict, atoms)

    print("--- POCKET RANKING (most likely binding site first) ---\n")

    with open("pocket_ranking.txt", "w") as f:
        f.write("RANKING OF PREDICTED BINDING SITES\n")
        f.write("=" * 40 + "\n\n")

        for i, p in enumerate(ranked_list):
            rank_info = (f"Rank {i+1}: Pocket {p['id']} | "
                         f"Score: {p['score']} | Best for: {p['preference']}")
            print(rank_info)
            f.write(rank_info + "\n")
            f.write(f"  Residues involved: {', '.join(p['residues'])}\n")
            f.write(f"  Polarity ratio:    {p['polar_ratio']:.2f}\n\n")

    # Separate into polar and non-polar recommendations
    polar_sites    = [p['id'] for p in ranked_list if p['preference'] == "Polar Ligands"]
    nonpolar_sites = [p['id'] for p in ranked_list if p['preference'] == "Non-Polar Ligands"]

    print(f"\n Recommended for Polar Ligands:     Pocket IDs {polar_sites}")
    print(f"Recommended for Non-Polar Ligands:  Pocket IDs {nonpolar_sites}")
    print("\n Full ranking saved to: pocket_ranking.txt")

--- POCKET RANKING (most likely binding site first) ---

Rank 1: Pocket 0 | Score: 3925 | Best for: Polar Ligands
Rank 2: Pocket 10 | Score: 5 | Best for: Non-Polar Ligands
Rank 3: Pocket 19 | Score: 4 | Best for: Polar Ligands
Rank 4: Pocket 12 | Score: 3 | Best for: Polar Ligands
Rank 5: Pocket 13 | Score: 3 | Best for: Non-Polar Ligands
Rank 6: Pocket 4 | Score: 2 | Best for: Polar Ligands
Rank 7: Pocket 5 | Score: 2 | Best for: Polar Ligands
Rank 8: Pocket 7 | Score: 2 | Best for: Polar Ligands
Rank 9: Pocket 14 | Score: 2 | Best for: Polar Ligands
Rank 10: Pocket 1 | Score: 1 | Best for: Polar Ligands
Rank 11: Pocket 2 | Score: 1 | Best for: Polar Ligands
Rank 12: Pocket 3 | Score: 1 | Best for: Polar Ligands
Rank 13: Pocket 6 | Score: 1 | Best for: Polar Ligands
Rank 14: Pocket 8 | Score: 1 | Best for: Polar Ligands
Rank 15: Pocket 9 | Score: 1 | Best for: Non-Polar Ligands
Rank 16: Pocket 11 | Score: 1 | Best for: Non-Polar Ligands
Rank 17: Pocket 15 | Score: 1 | Best for: Polar

---
## 8. Validation Test — Checking the Analysis

Before trusting the results, we run a structured sanity check:

1. **Was a top pocket found?** → Confirms the pipeline produced at least one result and prints its summary
2. **Are polar and non-polar pockets both represented?** → Both counts are printed for inspection; a one-sided split is worth a second look
3. **Is the polarity ratio of the top pocket sensible?** → Extreme values (0.0 or 1.0) may indicate a data issue

This test cell is useful to re-run after changing parameters like `spacing`, `min_dist`, or `eps`.
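
For scripted parameter sweeps, the same checks could return warnings instead of printing them. The following is a minimal sketch under that assumption; `collect_warnings` and the toy entries are hypothetical, and only the dictionary keys are taken from `analyze_and_rank_pockets`.

```python
def collect_warnings(ranked_pockets: list) -> list:
    """Return human-readable warnings for a ranked pocket list (empty = all checks pass)."""
    if not ranked_pockets:
        return ["no pockets found"]

    warnings = []
    top = ranked_pockets[0]
    if top['polar_ratio'] in (0.0, 1.0):
        warnings.append("top pocket has an extreme polarity ratio")

    preferences = {p['preference'] for p in ranked_pockets}
    if len(preferences) < 2:
        warnings.append("only one ligand preference class is represented")
    return warnings


# Hypothetical ranking entries using the same keys as analyze_and_rank_pockets
toy_ranking = [
    {'id': 0, 'size': 120, 'polar_ratio': 0.65, 'preference': "Polar Ligands"},
    {'id': 1, 'size': 40, 'polar_ratio': 0.30, 'preference': "Non-Polar Ligands"},
]
print(collect_warnings(toy_ranking))
print(collect_warnings(toy_ranking[:1]))
```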

In [19]:
def test_final_analysis(pockets_dict: dict, protein_atoms: list) -> None:
    """
    Runs a structured validation of the chemical analysis and ranking.
    Prints a summary of the top pocket and overall statistics.

    Parameters:
        pockets_dict:  dictionary {cluster_id: np.ndarray of pocket points}
        protein_atoms: list of Biopython Atom objects
    """
    print("Step 8: Validating Chemical Analysis & Ranking...")

    ranked_results = analyze_and_rank_pockets(pockets_dict, protein_atoms)

    if not ranked_results:
        print(" FAILED: No pockets available for analysis.")
        return

    # --- Check 1: Top pocket summary ---
    top = ranked_results[0]
    print(f"\n TOP PREDICTED BINDING SITE: Pocket {top['id']}")
    print(f"   Size (grid points): {top['size']}")
    print(f"   Best suited for:    {top['preference']}")
    print(f"   Polarity ratio:     {top['polar_ratio']:.1%} polar residues")
    print(f"   Key residues:       {', '.join(top['residues'][:8])}{'...' if len(top['residues']) > 8 else ''}")

    # --- Check 2: Overall statistics ---
    polar_pockets    = [p['id'] for p in ranked_results if p['preference'] == "Polar Ligands"]
    nonpolar_pockets = [p['id'] for p in ranked_results if p['preference'] == "Non-Polar Ligands"]

    print(f"\n--- Summary Statistics ---")
    print(f"   Total pockets found:            {len(ranked_results)}")
    print(f"   Pockets for polar ligands:      {len(polar_pockets)}")
    print(f"   Pockets for non-polar ligands:  {len(nonpolar_pockets)}")

    # --- Check 3: Sanity check on top pocket ---
    if top['size'] > 0 and 0.0 < top['polar_ratio'] < 1.0:
        print("\nSUCCESS: Analysis looks biologically reasonable.")
    elif top['polar_ratio'] in (0.0, 1.0):
        print("\n WARNING: Extreme polarity ratio — check residue classification.")
    else:
        print("\nWARNING: Unexpected result — check input data.")


# --- Execution ---
if 'pockets_dict' in locals():
    test_final_analysis(pockets_dict, atoms)

Step 8: Validating Chemical Analysis & Ranking...

 TOP PREDICTED BINDING SITE: Pocket 0
   Size (grid points): 3925
   Best suited for:    Polar Ligands
   Polarity ratio:     64.6% polar residues
   Key residues:       ALA, ARG, ASN, ASP, CYS, GLN, GLU, GLY...

--- Summary Statistics ---
   Total pockets found:            22
   Pockets for polar ligands:      18
   Pockets for non-polar ligands:  4

SUCCESS: Analysis looks biologically reasonable.


---
## 9. Full Pipeline — One Function to Run Everything

This is the **master function** that ties all previous steps together into a single call.
It runs geometry → filtering → clustering → chemical analysis → report in sequence
and prints a clean top-5 summary at the end.

### Reading the Final Output

| Field | What it means |
|-------|---------------|
| `RANK 1` | Most likely binding site (largest pocket) |
| `Confidence Score` | Number of grid points inside the pocket (higher = larger cavity) |
| `Chemical Nature` | Polar or Non-Polar — guides ligand selection |
| `% polar` | Polarity ratio of the pocket lining, as computed by `analyze_and_rank_pockets` |
| `Key Residues` | Up to 8 residue types found in the lining, in alphabetical order |

> **Drug discovery context:** A large polar pocket lined with HIS, ASP, or GLU residues
> can hint at a catalytic site, which is often a good target for competitive inhibitors.

In [20]:
def run_full_prediction(protein_atoms: list, spacing: float = 2.0) -> list:
    """
    Runs the complete binding pocket prediction pipeline end-to-end.

    Steps performed:
        1. Build 3D search grid
        2. Filter grid to pocket candidate points
        3. Cluster candidates into discrete pockets (DBSCAN)
        4. Analyze chemical environment & rank pockets
        5. Print top-5 ranking report

    Parameters:
        protein_atoms: list of Biopython Atom objects
        spacing:       grid point spacing in Å (2.0 for testing, 1.0 for final)

    Returns:
        ranked_results: list of result dicts, sorted by score (best first)
    """
    print("Starting Binding Site Prediction Pipeline...\n")

    # Step 1 & 2: Geometry — Grid → Filter
    full_grid         = create_search_grid(protein_atoms, spacing=spacing)
    pocket_candidates = find_pocket_points(full_grid, protein_atoms)

    # Step 3: Clustering — group candidates into discrete pockets
    pockets_dict = cluster_pocket_points(pocket_candidates)

    # Step 4: Chemical Analysis & Ranking
    ranked_results = analyze_and_rank_pockets(pockets_dict, protein_atoms)

    # Step 5: Print final report (Top 5)
    print("\n" + "=" * 55)
    print("  FINAL RESULTS — TOP 5 PREDICTED BINDING SITES")
    print("=" * 55)

    for i, p in enumerate(ranked_results[:5]):
        key_res = ', '.join(p['residues'][:8])
        suffix  = '...' if len(p['residues']) > 8 else ''
        print(f"\nRANK {i+1}: Pocket {p['id']}")
        print(f"   Confidence Score (size):  {p['score']} points")
        print(f"   Chemical Nature:          {p['preference']} ({p['polar_ratio']:.1%} polar)")
        print(f"   Key Residues:             {key_res}{suffix}")
        print("   " + "-" * 45)

    return ranked_results


# --- Final Execution ---
if 'atoms' in locals():
    final_rankings = run_full_prediction(atoms, spacing=2.0)

Starting Binding Site Prediction Pipeline...

 Grid created: 26,796 points at 2.0 Å spacing.
Filtered: 26,796 → 4,087 pocket candidates (15.3% remaining).
22 pockets found (126 noise points discarded):
   Pocket #1: 3925 points | Center: (13.0, -1.0, 18.9)
   Pocket #2:    5 points | Center: (2.2, 3.9, 12.6)
   Pocket #3:    4 points | Center: (0.3, 11.8, 29.1)
   Pocket #4:    3 points | Center: (35.1, 12.3, 21.3)
   Pocket #5:    3 points | Center: (12.5, -9.7, 23.3)
   Pocket #6:    2 points | Center: (-2.2, 13.3, 4.6)
   Pocket #7:    2 points | Center: (28.8, 14.3, 4.6)
   Pocket #8:    2 points | Center: (29.8, 14.3, 7.6)
   Pocket #9:    2 points | Center: (34.8, -19.7, 22.6)
   Pocket #10:    1 points | Center: (1.8, 10.3, -7.4)
   Pocket #11:    1 points | Center: (3.8, -13.7, -1.4)
   Pocket #12:    1 points | Center: (27.8, -7.7, 0.6)
   Pocket #13:    1 points | Center: (29.8, 12.3, 4.6)
   Pocket #14:    1 points | Center: (-8.2, 12.3, 12.6)
   Pocket #15:    1 points | Ce