# Protein Binding Pocket Finder

This notebook implements a geometry-based algorithm for predicting **ligand binding pockets** in protein structures.

## Principle
Instead of computing expensive energy functions, the protein surface is sampled geometrically:
1. A 3D grid of points is placed over the entire protein
2. Points that are too close to the protein (collision) or too far away (empty space) are removed
3. The remaining points, which form a thin shell along the protein surface and fill its cavities and pockets, are grouped into clusters
4. Each cluster corresponds to one potential binding pocket

## Input
A `.pdb` file (here: `1H8D.pdb` — a kinase from the RCSB Protein Data Bank)

---

## 0. Imports

All required libraries are imported here in one central place:
- **`numpy`** – fast array operations and vector math
- **`Bio.PDB`** – Biopython module for reading and analyzing PDB files
- **`scipy.spatial.KDTree`** – efficient spatial lookups (much faster than naive distance calculations)
- **`sklearn.cluster.DBSCAN`** – density-based clustering algorithm that requires no fixed cluster count
- **`time`** – runtime measurement

In [1]:
import numpy as np
import time

from Bio import PDB
from Bio.PDB import is_aa
from scipy.spatial import KDTree
from sklearn.cluster import DBSCAN

---
## 1. Reading the Protein Structure
PDB files contain not only the protein atoms, but also water molecules, ligands, ions, and other hetero-atoms.

For binding site prediction, we only keep atoms belonging to the protein itself.

### Why filter?

* **Water molecules** introduce noise because their positions are often not biologically stable.
* **Ligands and co-factors** may already occupy binding pockets and would bias the prediction.

### Filtering in BioPython

The structure is parsed using BioPython and iterated hierarchically:

Structure → Model → Chain → Residue → Atom

We use:

```python
is_aa(residue, standard=True)
```

to select only the 20 standard amino acids.

This automatically excludes:

* water (HOH)
* ligands
* ions
* co-factors
* modified or non-standard residues (e.g., selenomethionine, `MSE`)

### Example PDB records

```
ATOM      1  N   ALA A   1 ...
HETATM 1234  O   HOH A 201 ...
HETATM 2345 FE   HEM A 500 ...
```

Only `ATOM` records corresponding to amino acids are kept.

The resulting atom list is used for grid generation in the next step.
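
To make the filter concrete, here is a minimal sketch that applies the same idea to raw PDB lines without Biopython. It is an illustration only, not how `is_aa` is implemented: the helper `is_standard_protein_line`, the sample lines, and their coordinates are made up for this example.

```python
# The 20 standard amino acids by three-letter code
STANDARD_AA = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def is_standard_protein_line(line: str) -> bool:
    """Hypothetical helper: True for ATOM records of standard amino acids."""
    record = line[0:6].strip()     # columns 1-6: record type
    resname = line[17:20].strip()  # columns 18-20: residue name
    return record == "ATOM" and resname in STANDARD_AA


# Made-up sample records in fixed-column PDB layout
sample_lines = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "HETATM 1234  O   HOH A 201       2.000   3.000   4.000  1.00  0.00           O",
    "HETATM 2345 FE   HEM A 500       5.000   6.000   7.000  1.00  0.00          FE",
]

kept = [line for line in sample_lines if is_standard_protein_line(line)]
print(f"{len(kept)} of {len(sample_lines)} records kept")
```

The water and heme records are dropped because of their record type and residue name, which mirrors what the Biopython-based filter achieves at the residue level.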


In [2]:
def get_protein_structure(pdb_file: str):
    """
    Reads a PDB file and returns the structure object along with a list
    of all protein atoms (excluding water and ligands).

    Parameters:
        pdb_file: Path to the .pdb file

    Returns:
        structure:     Bio.PDB Structure object (full hierarchy)
        protein_atoms: List of all Atom objects belonging to amino acids
    """
    # QUIET=True suppresses warnings for minor format inconsistencies
    # PERMISSIVE=True allows non-standard PDB entries
    parser = PDB.PDBParser(QUIET=True, PERMISSIVE=True)

    # Load the structure — 'protein_obj' is an internal label for the object
    structure = parser.get_structure('protein_obj', pdb_file)

    protein_atoms = []

    # Hierarchical iteration: Model → Chain → Residue → Atom
    for model in structure:
        for chain in model:
            for residue in chain:
                # robust protein check
                if not is_aa(residue, standard=True): 
                    continue

                for atom in residue:
                    protein_atoms.append(atom)
                

    print(f"Loaded: {len(protein_atoms)} protein atoms found.")
    return structure, protein_atoms


# --- Execution ---
FILE_PATH = "1H8D.pdb"

try:
    structure, atoms = get_protein_structure(FILE_PATH)

    # Spot check: print the first 5 atoms to verify correct parsing
    print("\nCoordinate check (first 5 atoms):")
    for atom in atoms[:5]:
        residue = atom.get_parent()
        print(f"  Residue: {residue.get_resname():>3} | Atom: {atom.get_name():<4} | XYZ: {atom.get_coord()}")

except FileNotFoundError:
    print(f"Error: File '{FILE_PATH}' not found.")

Loaded: 2333 protein atoms found.

Coordinate check (first 5 atoms):
  Residue: ILE | Atom: N    | XYZ: [ 5.169 -8.919 17.688]
  Residue: ILE | Atom: CA   | XYZ: [ 4.5   -8.595 18.976]
  Residue: ILE | Atom: C    | XYZ: [ 3.397 -9.622 19.265]
  Residue: ILE | Atom: O    | XYZ: [ 2.507 -9.808 18.423]
  Residue: ILE | Atom: CB   | XYZ: [ 3.896 -7.17  18.971]


## 2. Creating the Search Grid

To search for potential binding sites, we place a regular 3D grid around the protein.

The grid covers the full protein and extends beyond its surface by a small margin (buffer).

### Parameters

| Parameter | Meaning                      | Typical Value                  |
| --------- | ---------------------------- | ------------------------------ |
| `spacing` | Distance between grid points | 2.0 Å (testing), 1.0 Å (final) |
| `buffer`  | Extra space around protein   | 5.0 Å                          |

### Method

First, the bounding box of the protein is computed from the atom coordinates.

Then, evenly spaced points are generated along each axis using:

```python
np.arange()
```

and combined into 3D coordinates using:

```python
np.meshgrid()
```

The result is a NumPy array of shape:

```python
(N, 3)
```

where each row is one grid point.

### Note on Performance

Smaller spacing increases resolution but dramatically increases the number of grid points (~8× more when going from 2.0 Å to 1.0 Å).
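
The cubic scaling can be checked on a toy bounding box. The sketch below assumes a hypothetical 10 Å cube and a helper named `grid_points`; it uses the same `np.arange` and `np.meshgrid` calls described above, with `indexing="ij"` chosen here only to make the axis order explicit.

```python
import numpy as np


def grid_points(min_xyz, max_xyz, spacing):
    """Hypothetical helper: regular 3D grid as an (N, 3) array."""
    axes = [np.arange(lo, hi, spacing) for lo, hi in zip(min_xyz, max_xyz)]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack(mesh, axis=-1).reshape(-1, 3)


coarse = grid_points((0.0, 0.0, 0.0), (10.0, 10.0, 10.0), spacing=2.0)
fine = grid_points((0.0, 0.0, 0.0), (10.0, 10.0, 10.0), spacing=1.0)

# 5 points per axis vs. 10 points per axis: 125 vs. 1,000 grid points
print(coarse.shape, fine.shape, fine.shape[0] / coarse.shape[0])
```

Halving the spacing doubles the number of points along each of the three axes, hence the factor of 2³ = 8.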


In [3]:
def create_search_grid(protein_atoms: list, spacing: float = 1.0) -> np.ndarray:
    """
    Creates a uniform 3D grid around the protein.

    Parameters:
        protein_atoms: List of Biopython Atom objects
        spacing:       Distance between grid points in Ångström

    Returns:
        grid: numpy array of shape (N, 3) containing N grid points
    """
    BUFFER = 5.0  # Å — padding region extending beyond the protein

    # Extract all atom coordinates into a single (N, 3) matrix
    coords = np.array([atom.get_coord() for atom in protein_atoms])

    # Bounding box: smallest and largest coordinates along x, y, z
    min_coords = coords.min(axis=0) - BUFFER
    max_coords = coords.max(axis=0) + BUFFER

    # Generate axis arrays
    x = np.arange(min_coords[0], max_coords[0], spacing)
    y = np.arange(min_coords[1], max_coords[1], spacing)
    z = np.arange(min_coords[2], max_coords[2], spacing)

    # Generate all (x, y, z) combinations and reshape into an (N, 3) matrix
    grid = np.stack(np.meshgrid(x, y, z), axis=-1).reshape(-1, 3)

    print(f" Grid created: {grid.shape[0]:,} points at {spacing} Å spacing.")
    return grid


# --- Execution ---
if 'atoms' in locals():
    grid = create_search_grid(atoms, spacing=1.0)

 Grid created: 210,672 points at 1.0 Å spacing.


## 3. Filtering Pocket Candidates

The grid contains many irrelevant points. We keep only points near the protein surface using two distance filters.

### Filter 1 — Collision check (`min_dist = 2.5 Å`)

Points too close to any atom are inside the protein and are removed.

---

### Filter 2 — Surface proximity (`max_dist = 5.0 Å`)

Points too far from the protein are in bulk solvent and are removed.

Only points within 5.0 Å of at least one atom are kept.
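
Taken together, the two filters keep a point when the distance to its nearest atom lies in a band: greater than `min_dist` and at most `max_dist`. The following sketch expresses that with a single nearest-neighbor query; it is an equivalent reformulation under that assumption, not the code used in this notebook, and the single atom and four test points are invented for illustration.

```python
import numpy as np
from scipy.spatial import KDTree

MIN_DIST = 2.5  # Å, collision threshold
MAX_DIST = 5.0  # Å, surface proximity threshold

# One hypothetical atom at the origin and four points along the x axis
atoms = np.array([[0.0, 0.0, 0.0]])
points = np.array([
    [1.0, 0.0, 0.0],  # too close (clash)
    [3.0, 0.0, 0.0],  # inside the band
    [5.0, 0.0, 0.0],  # exactly at max_dist, still kept
    [6.0, 0.0, 0.0],  # too far (bulk solvent)
])

tree = KDTree(atoms)
nearest, _ = tree.query(points, k=1)  # distance to the closest atom

keep = (nearest > MIN_DIST) & (nearest <= MAX_DIST)
print(points[keep])
```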

---

### Efficient search using KDTree

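Distance queries are accelerated with a KDTree. A brute-force search would compare every grid point with every atom (roughly 210,000 × 2,300 ≈ 490 million distance calculations at 1.0 Å spacing), whereas tree-based lookups scale roughly as **O(N log M)** for N grid points and M atoms.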
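
As a rough check that the tree returns the same neighbors as an exhaustive search, the sketch below compares neighbor counts on random data. The coordinates are synthetic, and the `return_length=True` option assumes a reasonably recent SciPy release; it returns the number of neighbors directly instead of index lists.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
atoms = rng.uniform(0.0, 20.0, size=(200, 3))   # synthetic "atoms"
points = rng.uniform(0.0, 20.0, size=(500, 3))  # synthetic grid points
radius = 2.5

# Brute force: full (500, 200) distance matrix
diff = points[:, None, :] - atoms[None, :, :]
brute_counts = (np.linalg.norm(diff, axis=-1) <= radius).sum(axis=1)

# KDTree: neighbor counts per point without building Python lists
tree = KDTree(atoms)
tree_counts = tree.query_ball_point(points, radius, return_length=True)

print("Counts agree:", np.array_equal(brute_counts, tree_counts))
```

A clash mask then becomes `tree_counts == 0`, which avoids the per-point list comprehension for large grids.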

---

### Result

The output is a reduced set of grid points:

```python
(M, 3)
```

representing candidate binding pocket locations.

In [4]:
def find_pocket_points(
    grid_points: np.ndarray,
    protein_atoms: list,
    min_dist: float = 2.5,
    max_dist: float = 5.0
) -> np.ndarray:
    """
    Filters grid points down to potential binding pocket candidates.

    A point qualifies if:
        - it is not closer than min_dist to any atom (no clash)
        - it has at least one atom within max_dist (near surface)
    """

    # Safety check
    if len(grid_points) == 0:
        return np.empty((0, 3))
    
    # Extract coordinates from Biopython objects
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])

    # Build the KDTree once — reused for both filters
    tree = KDTree(atom_coords)

    # --- Filter 1: Collision check ---
    # query_ball_point returns for each grid point all atom indices
    # within min_dist. If the list is empty → no clash → point is kept.
    clash_neighbors = tree.query_ball_point(grid_points, min_dist)
    no_clash_mask = np.array([len(neighbors) == 0 for neighbors in clash_neighbors])
    candidates = grid_points[no_clash_mask]

    # Early exit if empty
    if len(candidates) == 0:
        return np.empty((0, 3))
    
    # --- Filter 2: Surface proximity ---
    # From the candidate set: keep only points that have at least one atom
    # within max_dist (i.e. close to the protein surface)
    surface_neighbors = tree.query_ball_point(candidates, max_dist)
    near_surface_mask = np.array([len(neighbors) > 0 for neighbors in surface_neighbors])
    pocket_points = candidates[near_surface_mask]

    print(f"Filtered: {len(grid_points):,} → {len(pocket_points):,} pocket candidates "
          f"({len(pocket_points)/len(grid_points)*100:.1f}% remaining).")
    return pocket_points


# --- Execution with runtime measurement ---
if 'atoms' in locals() and 'grid' in locals():
    start = time.time()
    pocket_candidates = find_pocket_points(grid, atoms)
    elapsed = time.time() - start

Filtered: 210,672 → 32,673 pocket candidates (15.5% remaining).


---
## 4. Cluster Analysis: Identifying Individual Pockets

In this step, the filtered grid points are grouped into clusters to identify potential ligand binding regions.

Points that passed the previous filtering step are clustered spatially by proximity. The underlying idea is that a genuine binding pocket shows up as a dense group of neighboring grid points rather than as isolated points.

Clustering reduces the search space to a small number of candidate regions and helps locate the most relevant binding pockets for subsequent docking calculations.

### Why DBSCAN instead of k-Means?
- DBSCAN requires **no prior specification** of the number of clusters
- DBSCAN detects clusters of arbitrary shape (pockets are irregular)
- DBSCAN marks isolated outlier points as **noise** (label `-1`)
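
These properties can be seen on a tiny synthetic data set. The sketch below assumes two hypothetical 2×2×2 lattice cubes and one far-away point; the `eps` and `min_samples` values are chosen for this toy lattice and are not the notebook defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A 2x2x2 cube of lattice points with 1 Å spacing (8 points)
cube = np.array(
    [[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
    dtype=float,
)

blob_a = cube
blob_b = cube + np.array([10.0, 0.0, 0.0])  # same cube, shifted 10 Å
outlier = np.array([[50.0, 50.0, 50.0]])    # isolated point

points = np.vstack([blob_a, blob_b, outlier])

# No cluster count is specified; DBSCAN derives it from the point density
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(points)

n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise point(s)")
```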

### Parameters:
| Parameter | Meaning | Typical value |
|-----------|---------|---------------|
| `eps` | Maximum distance between two points in the same cluster | 2.5 Å |
| `min_samples` | Minimum points required to form a cluster | 5 |
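
Once DBSCAN has assigned labels, the points are grouped by label, noise is dropped, and the groups are ordered by size. The sketch below illustrates that post-processing on a hand-written label array (hypothetical values, standing in for the output of `fit_predict`). Note that a cluster's ID and its rank by size are two different numbers.

```python
import numpy as np

# Hypothetical labels: three clusters (0, 1, 2) and one noise point (-1)
labels = np.array([0, 0, 0, 1, 1, -1, 2, 2, 2, 2])
points = np.arange(len(labels) * 3, dtype=float).reshape(-1, 3)

# Group points by cluster label, skipping noise
clusters = {
    int(label): points[labels == label]
    for label in np.unique(labels)
    if label != -1
}

# Order clusters by size, largest first
ranked = sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)

for rank, (cluster_id, pts) in enumerate(ranked, start=1):
    print(f"Rank {rank}: cluster {cluster_id} with {len(pts)} points")
```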

In [5]:
def cluster_pocket_points(
    pocket_points: np.ndarray,
    eps: float = 2.5,
    min_samples: int = 5
) -> dict:
    """
    Groups pocket points into discrete binding pockets using DBSCAN.

    Parameters:
        pocket_points: numpy array (N, 3) of filtered candidate points
        eps:           maximum distance between two points in the same cluster (Å)
        min_samples:   minimum number of points required to form a valid cluster

    Returns:
        pockets: dictionary {cluster_id: np.ndarray of points}
                 sorted by cluster size (largest pocket first)
    """
    if len(pocket_points) == 0:
        print(" No points available for clustering.")
        return {}

    # Run DBSCAN clustering
    db = DBSCAN(eps=eps, min_samples=min_samples)
    labels = db.fit_predict(pocket_points)

    # Group points by cluster ID (exclude noise label -1).
    # Sorting makes the ID assignment below deterministic.
    unique_labels = sorted(label for label in set(labels) if label != -1)

    # Re-index clusters sequentially (1, 2, 3, ...)
    pockets = {}
    for new_id, label in enumerate(unique_labels, start=1):
        pockets[new_id] = pocket_points[labels == label]

    # Sort by cluster size (largest pocket = most likely binding site)
    pockets = dict(sorted(pockets.items(), key=lambda item: len(item[1]), reverse=True))

    # Discard clusters that are too small or too large. A very large cluster
    # is most likely the connected shell around the whole surface, not a pocket.
    MAX_POINTS = 2000
    MIN_POINTS = 5
    pockets = {
        label: pts
        for label, pts in pockets.items()
        if MIN_POINTS <= len(pts) <= MAX_POINTS
    }

    # Print summary
    noise_count = np.sum(labels == -1)
    print(f"{len(pockets)} pockets found ({noise_count} noise points discarded):")

    # Note: the summary numbers pockets by rank (largest first). The dictionary
    # keys keep the cluster IDs assigned above, which later steps report.
    for rank, (p_id, p_points) in enumerate(pockets.items()):
        center = np.mean(p_points, axis=0)
        print(f"   Pocket #{rank+1}: {len(p_points):>4} points | Center: "
              f"({center[0]:.1f}, {center[1]:.1f}, {center[2]:.1f})")

    return pockets


# --- Execution ---
if 'pocket_candidates' in locals():
    pockets_dict = cluster_pocket_points(pocket_candidates)

11 pockets found (82 noise points discarded):
   Pocket #1:   38 points | Center: (21.3, -6.0, 27.7)
   Pocket #2:   25 points | Center: (10.3, -3.7, 14.1)
   Pocket #3:   14 points | Center: (20.8, 3.2, 24.1)
   Pocket #4:    8 points | Center: (16.1, -4.0, 0.1)
   Pocket #5:    7 points | Center: (16.4, 4.6, 29.9)
   Pocket #6:    7 points | Center: (3.1, 11.6, 13.9)
   Pocket #7:    6 points | Center: (9.6, -4.6, 20.0)
   Pocket #8:    6 points | Center: (7.0, -3.2, 27.0)
   Pocket #9:    6 points | Center: (16.0, 1.4, 19.1)
   Pocket #10:    5 points | Center: (7.8, -3.7, 0.0)
   Pocket #11:    5 points | Center: (6.8, 7.9, 31.8)


---
## 5. Export for Visualization in UCSF Chimera

To inspect the results, all grid points and predicted pockets are exported as **PDB files**.

Each grid point is written as a dummy **HETATM** record, so the files can be opened in molecular viewers such as UCSF Chimera or PyMOL.
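
PDB is a fixed-column format, so each field must land in the right character positions. The sketch below formats one dummy record and reads the fields back by column; `format_hetatm` is a hypothetical helper that reuses the record layout of this notebook's export function, and the coordinates are taken from the first pocket center reported above.

```python
def format_hetatm(serial: int, x: float, y: float, z: float) -> str:
    """Hypothetical helper: one dummy-atom HETATM record (78 characters)."""
    return (
        f"HETATM{serial:5d}  P   PTS A   1    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00           N"
    )


line = format_hetatm(1, 21.3, -6.0, 27.7)

# Read the fields back using 0-based slices of the 1-based PDB columns
print("record :", line[0:6])                  # columns 1-6
print("serial :", int(line[6:11]))            # columns 7-11
print("name   :", repr(line[12:16]))          # columns 13-16
print("resname:", line[17:20])                # columns 18-20
print("x, y, z:", float(line[30:38]), float(line[38:46]), float(line[46:54]))
print("element:", line[76:78].strip())        # columns 77-78
```

Because the serial field is only five columns wide, files with more than 99,999 points need the serial number wrapped or capped to keep the coordinate columns in place.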

### Exported files:

| File                          | Description                          |
| ----------------------------- | ------------------------------------ |
| `step1_full_grid.pdb`         | Complete 3D search grid              |
| `step2_pocket_candidates.pdb` | Filtered surface candidate points    |
| `step3_pocket_N.pdb`          | Individual clustered binding pockets |

### Why export?

* Enables **visual validation** of the algorithm
* Confirms pockets are located on the **protein surface**
* Provides output for **further analysis or docking**

**Workflow in Chimera:**
1. Load the protein: `File → Open → 1H8D.pdb`
2. Load pockets: `File → Open → step3_pocket_0.pdb`, etc.
3. Set representation: `Actions → Atoms/Bonds → sphere`
4. Set colors: `Actions → Color → ...`

In [6]:
def save_points_to_pdb(points: np.ndarray, output_file: str) -> None:
    """
    Saves a numpy array of 3D coordinates as a PDB file.
    Each point is written as a HETATM entry (dummy atom named 'P', element 'N').

    Parameters:
        points:      numpy array (N, 3) with XYZ coordinates
        output_file: target file path
    """
    with open(output_file, 'w') as f:
        for i, (x, y, z) in enumerate(points):
            # The serial field is only 5 columns wide: wrap it at 99,999 so
            # that large grids (> 99,999 points) keep the columns aligned.
            serial = i % 99999 + 1
            # PDB HETATM format (column-accurate per PDB specification):
            # Cols 1-6: Record type | 7-11: Atom serial | 13-16: Atom name
            # 18-20: Residue name | 22: Chain | 23-26: Residue seq | 31-54: X, Y, Z
            f.write(f"HETATM{serial:5d}  P   PTS A   1    {x:8.3f}{y:8.3f}{z:8.3f}  1.00  0.00           N\n")
        f.write("END\n")
    print(f" Saved: {output_file} ({len(points)} points)")


def export_all_steps(atoms: list, output_dir: str = ".") -> None:
    """
    Runs the full pipeline and exports each stage as a PDB file.

    Output files:
        step1_full_grid.pdb          — the complete grid around the protein
        step2_pocket_candidates.pdb  — filtered pocket candidate points
        step3_pocket_N.pdb           — individual clustered pockets (sorted by size)
    """
    import os
    os.makedirs(output_dir, exist_ok=True)

    # Step 1: Full grid
    print(" Step 1: Generating grid ...")
    full_grid = create_search_grid(atoms, spacing=1.0)
    save_points_to_pdb(full_grid, f"{output_dir}/step1_full_grid.pdb")

    # Step 2: Filtered candidates
    print("\nStep 2: Filtering candidates ...")
    candidates = find_pocket_points(full_grid, atoms)
    save_points_to_pdb(candidates, f"{output_dir}/step2_pocket_candidates.pdb")

    # Step 3: Clustering
    print("\n Step 3: Clustering pockets ...")
    pockets = cluster_pocket_points(candidates)


    print("\n Exporting pockets ...")
    for rank, (p_id, p_points) in enumerate(pockets.items()):
        save_points_to_pdb(p_points, f"{output_dir}/step3_pocket_{rank}.pdb")

    print(f"\nExport complete. {len(pockets)} pocket files saved.")


# --- Execution ---
if 'atoms' in locals():
    export_all_steps(atoms, output_dir="pocket_output")

 Step 1: Generating grid ...
 Grid created: 210,672 points at 1.0 Å spacing.
 Saved: pocket_output/step1_full_grid.pdb (210672 points)

Step 2: Filtering candidates ...
Filtered: 210,672 → 32,673 pocket candidates (15.5% remaining).
 Saved: pocket_output/step2_pocket_candidates.pdb (32673 points)

 Step 3: Clustering pockets ...
11 pockets found (82 noise points discarded):
   Pocket #1:   38 points | Center: (21.3, -6.0, 27.7)
   Pocket #2:   25 points | Center: (10.3, -3.7, 14.1)
   Pocket #3:   14 points | Center: (20.8, 3.2, 24.1)
   Pocket #4:    8 points | Center: (16.1, -4.0, 0.1)
   Pocket #5:    7 points | Center: (16.4, 4.6, 29.9)
   Pocket #6:    7 points | Center: (3.1, 11.6, 13.9)
   Pocket #7:    6 points | Center: (9.6, -4.6, 20.0)
   Pocket #8:    6 points | Center: (7.0, -3.2, 27.0)
   Pocket #9:    6 points | Center: (16.0, 1.4, 19.1)
   Pocket #10:    5 points | Center: (7.8, -3.7, 0.0)
   Pocket #11:    5 points | Center: (6.8, 7.9, 31.8)

 Exporting pockets ...
 Sa

---
## 6. Residue Extraction — Identifying Pocket-Lining Residues

After identifying the pocket clusters, we determine **which amino acids form the pocket walls**.

For each pocket point, we search for nearby protein atoms using a **KDTree**.
If any atom of a residue lies within a distance threshold, the entire residue is considered part of the pocket.

### How it works
We use the same KDTree approach as before, but with the roles reversed:
instead of asking which grid points lie close to protein atoms, we ask which **protein atoms lie close to each pocket point**.
Any atom within `threshold` (4.5 Å) of any pocket point belongs to a **lining residue**.
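
A minimal sketch of this lookup on invented data: four hypothetical atoms, each tagged with a residue label, and two pocket points. The labels and coordinates are placeholders, not values from `1H8D.pdb`.

```python
import numpy as np
from scipy.spatial import KDTree

THRESHOLD = 4.5  # Å

# Hypothetical atoms and the residue each one belongs to
atom_coords = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [10.0, 0.0, 0.0],
    [20.0, 0.0, 0.0],
])
atom_residue = ["HIS-A57", "HIS-A57", "ASP-A102", "SER-A195"]

# Hypothetical pocket points
pocket_points = np.array([[2.0, 0.0, 0.0], [8.0, 0.0, 0.0]])

tree = KDTree(atom_coords)
hits = tree.query_ball_point(pocket_points, THRESHOLD)  # one index list per point

# Any atom hit by any pocket point marks its whole residue as lining
lining = sorted({atom_residue[i] for sublist in hits for i in sublist})
print(lining)
```

Both atoms of the first residue are hit, but the set keeps it only once; the atom 20 Å away is never reached, so its residue is not reported.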

### Why 4.5 Å?

This distance captures typical ligand–protein interactions, including:

* Van der Waals contacts (~3.5–4.0 Å)
* Hydrogen-bond donor-acceptor pairs (~2.5–3.5 Å)
* Longer-range electrostatic interactions (up to ~4.5 Å)

### Output

For each pocket, a text file lists:

* Pocket ID
* Number of pocket points
* All lining residues, labeled by residue name, chain, and residue number (e.g., HIS-A195, GLU-A198)

This provides the biological context of each predicted binding site.
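
Biopython stores a residue's `id` as a tuple of hetero flag, sequence number, and insertion code. The sketch below shows one way to turn such a tuple into a label that keeps the insertion code, so that residues sharing a sequence number stay distinct. `format_residue_label` is a hypothetical helper and the example values are invented.

```python
def format_residue_label(resname: str, chain_id: str, res_id: tuple) -> str:
    """Hypothetical helper: build a label such as 'HIS-A57' or 'TYR-A60A'."""
    _hetflag, resseq, icode = res_id
    # A blank insertion code is stored as a single space
    return f"{resname}-{chain_id}{resseq}{icode.strip()}"


# Invented examples in the (hetero flag, sequence number, insertion code) layout
print(format_residue_label("HIS", "A", (" ", 57, " ")))
print(format_residue_label("TYR", "A", (" ", 60, "A")))
```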

In [7]:
def extract_and_save_residues(
    pockets_dict: dict,
    protein_atoms: list,
    output_file: str = "pocket_residues.txt",
    threshold: float = 4.5
) -> None:
    """
    Identifies all protein residues lining each pocket and saves them to a text file.

    A residue is considered a pocket-lining residue if any of its atoms
    lies within `threshold` Å of any pocket point.

    Parameters:
        pockets_dict:  dictionary {cluster_id: np.ndarray of pocket points}
        protein_atoms: list of Biopython Atom objects
        output_file:   path to save the residue report
        threshold:     maximum distance from pocket point to atom in Å
    """
    # Build a KDTree over all protein atom coordinates for fast lookup
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])
    tree = KDTree(atom_coords)

    with open(output_file, 'w') as f:
        f.write("LIGAND BINDING SITE PREDICTIONS — RESIDUE LIST\n")

        for p_id, points in pockets_dict.items():
            # For every pocket point, find all atom indices within threshold
            neighbor_indices = tree.query_ball_point(points, threshold)

            # Flatten into a set of unique atom indices (avoid counting the same atom twice)
            flat_indices = set(idx for sublist in neighbor_indices for idx in sublist)

            # Map atom indices → residue names and numbers
            found_residues = set()
            for idx in flat_indices:
                res = protein_atoms[idx].get_parent()
                resname = res.get_resname()
                chain_id = res.get_parent().id

                # Biopython residue id: (hetero flag, sequence number, insertion code)
                _, resseq, icode = res.id

                # Keep the insertion code (e.g. 60A) so that residues sharing
                # a sequence number are not merged into one entry
                resnum = f"{resseq}{icode.strip()}"

                # Format: NAME-CHAIN+NUMBER, e.g. HIS-A195
                res_info = f"{resname}-{chain_id}{resnum}"
                found_residues.add(res_info)

            # Sort alphabetically for readable output
            sorted_res = sorted(found_residues)

            f.write(f"Pocket {p_id} ({len(points)} points):\n")
            f.write(f"Lining Residues: {', '.join(sorted_res)}\n")

            print(f"   Pocket {p_id}: {len(sorted_res)} lining residues identified.")

    print(f"\nResidue report saved to: {output_file}")


# --- Execution ---
# Re-run the pipeline to ensure pockets_dict is available
if 'atoms' in locals():
    full_grid = create_search_grid(atoms, spacing=1.0)
    pocket_candidates = find_pocket_points(full_grid, atoms)
    pockets_dict = cluster_pocket_points(pocket_candidates)
    extract_and_save_residues(pockets_dict, atoms)

 Grid created: 210,672 points at 1.0 Å spacing.
Filtered: 210,672 → 32,673 pocket candidates (15.5% remaining).
11 pockets found (82 noise points discarded):
   Pocket #1:   38 points | Center: (21.3, -6.0, 27.7)
   Pocket #2:   25 points | Center: (10.3, -3.7, 14.1)
   Pocket #3:   14 points | Center: (20.8, 3.2, 24.1)
   Pocket #4:    8 points | Center: (16.1, -4.0, 0.1)
   Pocket #5:    7 points | Center: (16.4, 4.6, 29.9)
   Pocket #6:    7 points | Center: (3.1, 11.6, 13.9)
   Pocket #7:    6 points | Center: (9.6, -4.6, 20.0)
   Pocket #8:    6 points | Center: (7.0, -3.2, 27.0)
   Pocket #9:    6 points | Center: (16.0, 1.4, 19.1)
   Pocket #10:    5 points | Center: (7.8, -3.7, 0.0)
   Pocket #11:    5 points | Center: (6.8, 7.9, 31.8)
   Pocket 2: 19 lining residues identified.
   Pocket 4: 18 lining residues identified.
   Pocket 9: 13 lining residues identified.
   Pocket 5: 8 lining residues identified.
   Pocket 10: 10 lining residues identified.
   Pocket 13: 11 lining re

---
## 7. Chemical Analysis & Pocket Ranking

Not all pockets are equally suitable for ligand binding. We now analyze the **chemical environment of each pocket** to determine what type of ligand is most likely to bind there.

This depends on the amino acids that line the pocket walls.

### Chemical Classification of Amino Acids
Residues are divided into **five chemical groups**:

| Group                  | Residues                          | Properties                                                          |
| ---------------------- | --------------------------------- | ------------------------------------------------------------------- |
| **Hydrophobic**        | ALA, VAL, LEU, ILE, MET, PHE, TRP | Favor non-polar, lipophilic ligands                                 |
| **Polar (uncharged)**  | SER, THR, ASN, GLN, TYR           | Form hydrogen bonds                                                 |
| **Positively charged** | LYS, ARG, HIS                     | Attract negatively charged ligands                                  |
| **Negatively charged** | ASP, GLU                          | Attract positively charged ligands                                  |
| **Special cases**      | GLY, PRO, CYS                     | Backbone flexibility (GLY), rigidity (PRO), disulfide bonds (CYS)   |

This provides a coarse but practical description of the pocket chemistry.
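
As a small illustration, the sketch below counts the groups for one list of residue names, using the grouping defined in the analysis code of this section (GLY, PRO, and CYS as special cases). The input list is the one reported for Pocket 4 in the ranking output; the helper `count_groups` is hypothetical.

```python
GROUPS = {
    "hydrophobic": {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP"},
    "polar":       {"SER", "THR", "ASN", "GLN", "TYR"},
    "positive":    {"LYS", "ARG", "HIS"},
    "negative":    {"ASP", "GLU"},
    "special":     {"GLY", "PRO", "CYS"},
}


def count_groups(residue_names) -> dict:
    """Hypothetical helper: count distinct residue names per chemical group."""
    names = set(residue_names)
    return {group: len(names & members) for group, members in GROUPS.items()}


pocket_residues = ["ALA", "ASP", "CYS", "GLN", "GLY", "LEU",
                   "MET", "PRO", "SER", "THR", "TRP", "VAL"]

counts = count_groups(pocket_residues)
print(counts)
```

Because the input is a set of names, each amino acid type is counted once, no matter how many residues of that type line the pocket.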

### Ranking Score

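Pockets are ranked by a combined score. It starts from the pocket **size (number of grid points)**, which serves as a proxy for pocket volume, and adds a chemistry bonus: 2 points for each hydrophobic or polar residue type and 3 points for each charged residue type found in the pocket lining.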

Larger pockets generally:

* accommodate larger ligands
* allow more interaction sites
* and are more likely to represent functional binding sites
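
The scoring and the ligand-preference rule can be reproduced by hand. The sketch below re-implements both as small standalone functions (`pocket_score` and `ligand_preference` are hypothetical names; the weights and thresholds are the ones used in this section's analysis code) and applies them to the size and composition reported for the top-ranked pocket.

```python
def pocket_score(size: int, counts: dict) -> int:
    """Size plus chemistry bonus (weights as in the analysis code)."""
    bonus = (
        counts["hydrophobic"] * 2
        + counts["polar"] * 2
        + counts["positive"] * 3
        + counts["negative"] * 3
    )
    return size + bonus


def ligand_preference(counts: dict) -> str:
    """Threshold rule on the fraction of residue types per group."""
    total = max(sum(counts.values()), 1)
    charge_ratio = (counts["positive"] + counts["negative"]) / total
    if charge_ratio > 0.25:
        return "Charged Ligands"
    if counts["hydrophobic"] / total > 0.5:
        return "Hydrophobic Ligands"
    if counts["polar"] / total > 0.3:
        return "Polar Ligands"
    return "Mixed Ligands"


# Composition of the top-ranked pocket from the ranking output
top_counts = {"hydrophobic": 5, "polar": 4, "positive": 2, "negative": 1, "special": 0}

score = pocket_score(38, top_counts)          # 38 + 10 + 8 + 6 + 3
preference = ligand_preference(top_counts)
print(score, preference)
```

The charged fraction is exactly 3/12 = 0.25, which does not exceed the 0.25 threshold, so the rule falls through to the polar test (4/12 ≈ 0.33 > 0.3).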

### Output

The final report includes for each pocket:

* Pocket rank and score
* Preferred ligand type
* Detailed chemical composition
* List of the amino acid types lining the pocket

This step provides the final functional interpretation of the predicted binding sites.

In [8]:
def analyze_and_rank_pockets(
    pockets_dict: dict,
    protein_atoms: list,
    threshold: float = 4.5
) -> list:
    """
    Chemical analysis and ranking of binding pockets.

    Each pocket is evaluated based on:

    - Pocket size (volume proxy)
    - Chemical composition of lining residues
    - Hydrophobic vs polar balance
    - Charge environment

    Residues are classified into 5 chemical groups:

        Hydrophobic
        Polar (uncharged)
        Positive
        Negative
        Special

    Returns:
        ranked_pockets: sorted list of pocket analysis results
    """

    # --- 5-group amino acid classification ---

    HYDROPHOBIC = {'ALA','VAL','LEU','ILE','MET','PHE','TRP'}
    POLAR       = {'SER','THR','ASN','GLN','TYR'}
    POSITIVE    = {'LYS','ARG','HIS'}
    NEGATIVE    = {'ASP','GLU'}
    SPECIAL     = {'GLY','PRO','CYS'}

    # --- Build KDTree once ---
    atom_coords = np.array([atom.get_coord() for atom in protein_atoms])
    tree = KDTree(atom_coords)

    results = []

    for p_id, points in pockets_dict.items():

        neighbor_indices = tree.query_ball_point(points, threshold)
        flat_indices = set(idx for sublist in neighbor_indices for idx in sublist)

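        # Unique residue names: each amino acid type is counted once per
        # pocket, however many residues of that type line it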
        residues = set()
        for idx in flat_indices:
            resname = protein_atoms[idx].get_parent().get_resname()
            residues.add(resname)

        # --- Count each chemical group ---

        counts = {

            'hydrophobic': sum(r in HYDROPHOBIC for r in residues),
            'polar':       sum(r in POLAR for r in residues),
            'positive':    sum(r in POSITIVE for r in residues),
            'negative':    sum(r in NEGATIVE for r in residues),
            'special':     sum(r in SPECIAL for r in residues)

        }

        total = sum(counts.values()) if sum(counts.values()) > 0 else 1

        # --- Calculate ratios ---

        hydrophobic_ratio = counts['hydrophobic'] / total
        polar_ratio       = counts['polar']       / total
        charge_ratio      = (counts['positive'] + counts['negative']) / total

        # --- Determine ligand preference ---

        if charge_ratio > 0.25:
            preference = "Charged Ligands"

        elif hydrophobic_ratio > 0.5:
            preference = "Hydrophobic Ligands"

        elif polar_ratio > 0.3:
            preference = "Polar Ligands"

        else:
            preference = "Mixed Ligands"

        # --- Improved scoring model ---

        size_score = len(points)

        chemistry_bonus = (
            counts['hydrophobic'] * 2 +
            counts['polar'] * 2 +
            counts['positive'] * 3 +
            counts['negative'] * 3
        )

        score = size_score + chemistry_bonus

        results.append({

            'id': p_id,

            'size': len(points),

            'score': score,

            'preference': preference,

            'composition': counts,

            'hydrophobic_ratio': hydrophobic_ratio,

            'polar_ratio': polar_ratio,

            'charge_ratio': charge_ratio,

            'residues': sorted(residues)

        })

    ranked_pockets = sorted(results, key=lambda x: x['score'], reverse=True)

    return ranked_pockets


if 'pockets_dict' in locals():

    ranked_list = analyze_and_rank_pockets(pockets_dict, atoms)

    print("\n--- POCKET RANKING ---\n")

    with open("pocket_ranking.txt", "w") as f:

        f.write("PROFESSIONAL BINDING SITE ANALYSIS\n")

        for i, p in enumerate(ranked_list):

            text = (
                f"Rank {i+1} | Pocket {p['id']} | Score: {p['score']}\n"
                f"Size: {p['size']} grid points\n"
                f"Preferred Ligand Type: {p['preference']}\n"
                f"Composition:\n"
                f"  Hydrophobic: {p['composition']['hydrophobic']}\n"
                f"  Polar:       {p['composition']['polar']}\n"
                f"  Positive:    {p['composition']['positive']}\n"
                f"  Negative:    {p['composition']['negative']}\n"
                f"  Special:     {p['composition']['special']}\n"
                f"Residues:\n"
                f"  {', '.join(p['residues'])}\n"
            )


            print(text)
            f.write(text)

    print("\nSaved to pocket_ranking.txt")


--- POCKET RANKING ---

Rank 1 | Pocket 2 | Score: 65
Size: 38 grid points
Preferred Ligand Type: Polar Ligands
Composition:
  Hydrophobic: 5
  Polar:       4
  Positive:    2
  Negative:    1
  Special:     0
Residues:
  ARG, ASN, ASP, HIS, ILE, LEU, MET, PHE, SER, THR, TRP, TYR

Rank 2 | Pocket 4 | Score: 44
Size: 25 grid points
Preferred Ligand Type: Mixed Ligands
Composition:
  Hydrophobic: 5
  Polar:       3
  Positive:    0
  Negative:    1
  Special:     3
Residues:
  ALA, ASP, CYS, GLN, GLY, LEU, MET, PRO, SER, THR, TRP, VAL

Rank 3 | Pocket 9 | Score: 32
Size: 14 grid points
Preferred Ligand Type: Hydrophobic Ligands
Composition:
  Hydrophobic: 5
  Polar:       1
  Positive:    2
  Negative:    0
  Special:     0
Residues:
  HIS, ILE, LEU, LYS, PHE, THR, TRP, VAL

Rank 4 | Pocket 13 | Score: 26
Size: 7 grid points
Preferred Ligand Type: Charged Ligands
Composition:
  Hydrophobic: 3
  Polar:       2
  Positive:    2
  Negative:    1
  Special:     2
Residues:
  ARG, ASP, GLY, 

---
## 8. Validation Test — Checking the Analysis

Before trusting the results, we run a structured sanity check:

1. **Was a top pocket found?** → Confirms the pipeline produced at least one meaningful result
2. **How are the pockets distributed across ligand-preference categories?** → Printed for manual inspection; a healthy result usually contains more than one category
3. **Is the polarity ratio of the top pocket sensible?** → Extreme values (0.0 or 1.0) may indicate a data issue

This test cell is useful to re-run after changing parameters like `spacing`, `min_dist`, or `eps`.
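
Checks of this kind can also be written as a small function that returns a list of problems, which is easier to reuse than printed messages. The sketch below is one possible shape, not part of the notebook: `check_ranking` is a hypothetical helper, and the input dictionaries only mimic the `score`, `size`, and `polar_ratio` fields produced by the ranking step.

```python
def check_ranking(ranked: list) -> list:
    """Hypothetical helper: return a list of problems (empty list = OK)."""
    if not ranked:
        return ["no pockets available"]

    problems = []

    # The list should be ordered by descending score
    scores = [p["score"] for p in ranked]
    if scores != sorted(scores, reverse=True):
        problems.append("pockets are not sorted by score")

    for p in ranked:
        if p["size"] <= 0:
            problems.append(f"pocket {p['id']}: empty pocket")
        if not 0.0 <= p["polar_ratio"] <= 1.0:
            problems.append(f"pocket {p['id']}: polar ratio out of range")

    return problems


# Two entries shaped like the ranking results shown above
good = [
    {"id": 2, "score": 65, "size": 38, "polar_ratio": 4 / 12},
    {"id": 4, "score": 44, "size": 25, "polar_ratio": 3 / 12},
]
print(check_ranking(good))
print(check_ranking(list(reversed(good))))
```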

In [9]:
def test_final_analysis(pockets_dict: dict, protein_atoms: list) -> None:
    """
    Runs a structured validation of the chemical analysis and ranking.
    Prints a summary of the top pocket and overall statistics.

    Parameters:
        pockets_dict:  dictionary {cluster_id: np.ndarray of pocket points}
        protein_atoms: list of Biopython Atom objects
    """
    print("Step 8: Validating Chemical Analysis & Ranking...")

    ranked_results = analyze_and_rank_pockets(pockets_dict, protein_atoms)

    if not ranked_results:
        print(" FAILED: No pockets available for analysis.")
        return

    # --- Check 1: Top pocket summary ---
    top = ranked_results[0]
    print(f"\n TOP PREDICTED BINDING SITE: Pocket {top['id']}")
    print(f"   Size (grid points): {top['size']}")
    print(f"   Best suited for:    {top['preference']}")
    print(f"   Polarity ratio:     {top['polar_ratio']:.1%} polar residues")
    print(f"   Key residues:       {', '.join(top['residues'][:8])}{'...' if len(top['residues']) > 8 else ''}")

    # --- Check 2: Overall statistics ---
    hydrophobic_pockets = [int(p['id']) for p in ranked_results if p['preference'] == "Hydrophobic Ligands"]
    polar_pockets       = [int(p['id']) for p in ranked_results if p['preference'] == "Polar Ligands"]
    charged_pockets     = [int(p['id']) for p in ranked_results if p['preference'] == "Charged Ligands"]
    mixed_pockets       = [int(p['id']) for p in ranked_results if p['preference'] == "Mixed Ligands"]
    

    print(f"\n--- Summary Statistics ---")

    print(f"   Total pockets found:          {len(ranked_results)}")

    print(f"   Hydrophobic pockets:         {len(hydrophobic_pockets)}")
    print(f"   Polar pockets:               {len(polar_pockets)}")
    print(f"   Charged pockets:             {len(charged_pockets)}")
    print(f"   Mixed chemistry pockets:     {len(mixed_pockets)}") 

    print("\nPocket IDs by category:")

    print(f"   Hydrophobic: {hydrophobic_pockets}")
    print(f"   Polar:       {polar_pockets}")
    print(f"   Charged:     {charged_pockets}")
    print(f"   Mixed:       {mixed_pockets}")

    # --- Check 3: Sanity check on top pocket ---
    if top['size'] > 0 and 0.0 < top['polar_ratio'] < 1.0:
        print("\nSUCCESS: Analysis looks biologically reasonable.")
    elif top['polar_ratio'] in (0.0, 1.0):
        print("\nWARNING: Extreme polarity ratio. Check residue classification.")
    else:
        print("\nWARNING: Unexpected result. Check input data.")


# --- Execution ---
# Both objects come from earlier cells, so check for both before running.
if 'pockets_dict' in locals() and 'atoms' in locals():
    test_final_analysis(pockets_dict, atoms)

Step 8: Validating Chemical Analysis & Ranking...

 TOP PREDICTED BINDING SITE: Pocket 2
   Size (grid points): 38
   Best suited for:    Polar Ligands
   Polarity ratio:     33.3% polar residues
   Key residues:       ARG, ASN, ASP, HIS, ILE, LEU, MET, PHE...

--- Summary Statistics ---
   Total pockets found:          11
   Hydrophobic pockets:         2
   Polar pockets:               3
   Charged pockets:             3
   Mixed chemistry pockets:     3

Pocket IDs by category:
   Hydrophobic: [9, 10]
   Polar:       [2, 8, 11]
   Charged:     [13, 5, 7]
   Mixed:       [4, 6, 3]

SUCCESS: Analysis looks biologically reasonable.


---
## 9. Full Pipeline — One Function to Run Everything

This is the **master function** that ties all previous steps together into a single call.
It runs geometry → filtering → clustering → chemical analysis → report in sequence
and prints a clean top-5 summary at the end.

### Reading the Final Output

| Field | What it means |
|-------|---------------|
| `RANK 1` | Most likely binding site (highest-scoring pocket) |
| `Confidence Score` | Number of grid points inside the pocket (higher = larger cavity) |
| `Chemical Nature` | Predicted ligand preference (Hydrophobic, Polar, Charged, or Mixed Ligands); guides ligand selection |
| `% polar` | Fraction of lining residues that are polar |
| `Key Residues` | First 8 unique amino acids lining the pocket |
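
To make the `% polar` field concrete, the sketch below computes a polar fraction for a short list of lining residues. The residue set `POLAR_RESIDUES` and the helper `polar_fraction` are illustrative assumptions; the classification actually used by `analyze_and_rank_pockets` may differ.

```python
# Illustrative assumption: one common choice of uncharged polar residues.
# The notebook's own residue classification may be different.
POLAR_RESIDUES = {"SER", "THR", "ASN", "GLN", "TYR", "CYS"}


def polar_fraction(residue_names):
    """Return the fraction of residue names that fall in POLAR_RESIDUES."""
    if not residue_names:
        return 0.0  # avoid division by zero for an empty pocket
    n_polar = sum(1 for name in residue_names if name in POLAR_RESIDUES)
    return n_polar / len(residue_names)


# Made-up lining residues: one polar residue (SER) out of four
lining = ["SER", "LEU", "PHE", "ASP"]
ratio = polar_fraction(lining)
print(f"{ratio:.1%} polar")  # prints "25.0% polar"
```

The `:.1%` format specifier is the same one the report uses for `polar_ratio`, so a stored value of `0.25` is displayed as `25.0%`.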

> **Drug discovery context:** A high-scoring polar pocket lined with HIS, ASP, or GLU residues
> may indicate a catalytic site, which is a natural target for competitive inhibitors.
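
As a rough illustration of the note above, ranked results could be screened for polar pockets lined by HIS, ASP, or GLU. The sample dictionaries below are made-up data that only mimic the keys used in this notebook (`id`, `score`, `preference`, `polar_ratio`, `residues`). The helper `flag_catalytic_candidates` and its `min_score` threshold are hypothetical and not part of the pipeline.

```python
CATALYTIC_HINTS = {"HIS", "ASP", "GLU"}  # residues named in the note above


def flag_catalytic_candidates(ranked_results, min_score=20):
    """Return IDs of polar pockets that contain at least one hint residue.

    min_score is an arbitrary illustrative threshold, not a validated cutoff.
    """
    flagged = []
    for pocket in ranked_results:
        has_hint = bool(CATALYTIC_HINTS & set(pocket["residues"]))
        is_polar = pocket["preference"] == "Polar Ligands"
        if has_hint and is_polar and pocket["score"] >= min_score:
            flagged.append(pocket["id"])
    return flagged


# Made-up example data, shaped like the result dicts used in this notebook
sample_results = [
    {"id": 1, "score": 40, "preference": "Polar Ligands",
     "polar_ratio": 0.40, "residues": ["HIS", "SER", "LEU"]},
    {"id": 2, "score": 30, "preference": "Hydrophobic Ligands",
     "polar_ratio": 0.10, "residues": ["LEU", "PHE", "ILE"]},
    {"id": 3, "score": 5, "preference": "Polar Ligands",
     "polar_ratio": 0.50, "residues": ["ASP", "THR"]},
]

print(flag_catalytic_candidates(sample_results))  # prints [1]
```

Pocket 3 contains ASP but falls below the size threshold, which shows why a residue match alone should be treated as a hint rather than a prediction.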

In [10]:
def run_full_prediction(protein_atoms: list, spacing: float = 1.0) -> list:
    """
    Runs the complete binding pocket prediction pipeline end-to-end.

    Steps performed:
        1. Build 3D search grid
        2. Filter grid to pocket candidate points
        3. Cluster candidates into discrete pockets (DBSCAN)
        4. Analyze chemical environment & rank pockets
        5. Print top-5 ranking report

    Parameters:
        protein_atoms: list of Biopython Atom objects
        spacing:       grid point spacing in Å (2.0 for testing, 1.0 for final)

    Returns:
        ranked_results: list of result dicts, sorted by score (best first)
    """
    print("Starting Binding Site Prediction Pipeline...\n")

    # Step 1 & 2: Geometry — Grid → Filter
    full_grid         = create_search_grid(protein_atoms, spacing=spacing)
    pocket_candidates = find_pocket_points(full_grid, protein_atoms)

    # Step 3: Clustering — group candidates into discrete pockets
    pockets_dict = cluster_pocket_points(pocket_candidates)

    # Step 4: Chemical Analysis & Ranking
    ranked_results = analyze_and_rank_pockets(pockets_dict, protein_atoms)

    # Step 5: Print final report (Top 5)
    
    print("  FINAL RESULTS — TOP 5 PREDICTED BINDING SITES")
    
    for i, p in enumerate(ranked_results[:5]):
        key_res = ', '.join(p['residues'][:8])
        suffix  = '...' if len(p['residues']) > 8 else ''
        print(f"\nRANK {i+1}: Pocket {p['id']}")
        print(f"   Confidence Score (size):  {p['score']} points")
        print(f"   Chemical Nature:          {p['preference']} ({p['polar_ratio']:.1%} polar)")
        print(f"   Key Residues:             {key_res}{suffix}")
       

    return ranked_results


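# --- Final Execution ---
if 'atoms' in locals():
    final_rankings = run_full_prediction(atoms, spacing=1.0)

    # Debug: re-cluster the candidates left over from the earlier filtering cell.
    # `pocket_candidates` is local to run_full_prediction, so this block only runs
    # if a previous cell defined it globally (possibly with different settings).
    if 'pocket_candidates' in locals():
        pockets_dict = cluster_pocket_points(pocket_candidates)

        print("DEBUG: pockets passed to ranking:", len(pockets_dict))
        for k, v in pockets_dict.items():
            print("Pocket", k, "size:", len(v))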

Starting Binding Site Prediction Pipeline...

 Grid created: 210,672 points at 1.0 Å spacing.
Filtered: 210,672 → 32,673 pocket candidates (15.5% remaining).
11 pockets found (82 noise points discarded):
   Pocket #1:   38 points | Center: (21.3, -6.0, 27.7)
   Pocket #2:   25 points | Center: (10.3, -3.7, 14.1)
   Pocket #3:   14 points | Center: (20.8, 3.2, 24.1)
   Pocket #4:    8 points | Center: (16.1, -4.0, 0.1)
   Pocket #5:    7 points | Center: (16.4, 4.6, 29.9)
   Pocket #6:    7 points | Center: (3.1, 11.6, 13.9)
   Pocket #7:    6 points | Center: (9.6, -4.6, 20.0)
   Pocket #8:    6 points | Center: (7.0, -3.2, 27.0)
   Pocket #9:    6 points | Center: (16.0, 1.4, 19.1)
   Pocket #10:    5 points | Center: (7.8, -3.7, 0.0)
   Pocket #11:    5 points | Center: (6.8, 7.9, 31.8)
  FINAL RESULTS — TOP 5 PREDICTED BINDING SITES

RANK 1: Pocket 2
   Confidence Score (size):  65 points
   Chemical Nature:          Polar Ligands (33.3% polar)
   Key Residues:             ARG, ASN,