<a href="https://colab.research.google.com/github/win-eva/EGFR-TKI-Docking-Analysis/blob/main/01_fetchdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fetching Data for Analysis**
## 1. EGFR Protein Structures


| Mutation Class       | PDB ID | Ligand       | Generation | Notes                                                                 |
|---------------------|--------|-------------|------------|-----------------------------------------------------------------------|
| Wildtype            | 4WKQ   | Gefitinib   | 1st        |                                                                       |
| Wildtype            | 4G5J   | Afatinib    | 2nd        |                                                                       |
| Wildtype            | 8F1X   | Mobocertinib| 3rd        |                                                                       |
| L858R               | 2ITZ   | Gefitinib   | 1st        |                                                                       |
| L858R               | 6JWL   | Osimertinib | 3rd        |                                                                       |
| L858R               | 2ITT   | AEE788      | 2nd        | co-crystallised with AEE788, a second-generation EGFR TKI analogous to Afatinib |
| T790M               | 6JX0   | Osimertinib | 3rd        |                                                                       |
| T790M               | 4G5P   | Afatinib    | 2nd        |                                                                       |
| T790M               | 5GMP   | XTF-262     | 1st/2nd*   | surrogate for T790M-specific hinge binding due to limited co-crystal structures|
| Exon20 Insertion    | 4LRM   | PD168393    | 1st        | used as reference for first-generation inhibitors                                 |
| Exon20 Insertion    | 9GC6   | A1IZ9       | 2nd/3rd    | represents second- and third-generation inhibitors                            |
| Exon20 Insertion    | 9GL8   | STX-721     | 3rd        | shows mutant-induced pocket geometry                               |

\*XTF-262 functions as a surrogate ligand to model hinge binding for early-generation TKIs in the context of the T790M mutation.




In [None]:
#upload requirements.txt to the Colab session first, then install the dependencies
!pip install -r requirements.txt

### Downloading PDB Structures

The cell below downloads the selected EGFR receptor structures that RCSB provides in legacy PDB format and saves them to the `proteins/` folder.

In [None]:
import urllib.request
import urllib.error
import os

#create folder to store PDBs
os.makedirs("proteins", exist_ok=True)

#list of PDBs to fetch
pdb_ids = {
    "EGFR_wt_4WKQ": "4WKQ",
    "EGFR_wt_4G5J": "4G5J",
    "EGFR_wt_8F1X": "8F1X",
    "EGFR_L858R_2ITZ": "2ITZ",
    "EGFR_L858R_6JWL": "6JWL",
    "EGFR_L858R_2ITT": "2ITT",
    "EGFR_T790M_6JX0": "6JX0",
    "EGFR_T790M_4G5P": "4G5P",
    "EGFR_T790M_5GMP": "5GMP",
    "EGFR_exon20_4LRM": "4LRM",
#     Note: the other two Exon20 structures (9GC6, 9GL8) are only available
#     in PDBx/mmCIF format, so they have to be fetched manually.
}

#downloading PDBs
for name, pdb_id in pdb_ids.items():
    pdb_id = pdb_id.upper()
    pdb_file = f"proteins/{name}.pdb"
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

    try:
        urllib.request.urlretrieve(url, pdb_file)
        print(f"Downloaded {pdb_file}")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(f"PDB ID {pdb_id} not found on RCSB. Skipping {name}.")
        else:
            print(f"Failed to download {pdb_id}: {e}")
    except urllib.error.URLError as e:
        #network-level failure (DNS, timeout, refused connection)
        print(f"Could not reach RCSB for {pdb_id}: {e.reason}")

print("Protein download process completed")
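
A failed or redirected download can leave behind a file that is not a real structure, so a quick check of the saved files may be worthwhile. The sketch below counts `ATOM` and `HETATM` records in each `.pdb` file. The helper names `count_pdb_records` and `check_structures` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def count_pdb_records(path):
    """Count ATOM and HETATM records in a PDB-format file."""
    counts = {"ATOM": 0, "HETATM": 0}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # The record name occupies the first six columns of a PDB line.
            record = line[:6].strip()
            if record in counts:
                counts[record] += 1
    return counts


def check_structures(folder="proteins"):
    """Map each .pdb file name in `folder` to its record counts."""
    return {
        path.name: count_pdb_records(path)
        for path in sorted(Path(folder).glob("*.pdb"))
    }
```

Calling `check_structures()` after the download cell returns one entry per file. A structure with zero `ATOM` records is a likely sign that the download did not succeed, and a co-crystal structure would be expected to contain `HETATM` records for its ligand.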

Two Exon20 Insertion structures (9GC6, 9GL8) were available only in PDBx/mmCIF format. Due to limitations in automated conversion on Google Colab, these files were manually converted from `.cif` to `.pdb` using PyMOL to ensure consistency and avoid file corruption.
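
One way to obtain the two mmCIF files before the manual PyMOL conversion is to reuse the RCSB download endpoint with a `.cif` extension. This is a minimal sketch rather than the method used in this notebook: the function names and the `EGFR_exon20_*` keys are assumptions that follow the naming pattern of the download cell, and the conversion to `.pdb` is still left to PyMOL.

```python
import os
import urllib.request

RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def mmcif_url(pdb_id):
    """Build the RCSB download URL for an entry in PDBx/mmCIF format."""
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.cif"


def mmcif_targets(names_to_ids, out_dir="proteins"):
    """Return (url, local_path) pairs for every entry to fetch."""
    return [
        (mmcif_url(pdb_id), os.path.join(out_dir, f"{name}.cif"))
        for name, pdb_id in names_to_ids.items()
    ]


def download_mmcif(names_to_ids, out_dir="proteins"):
    """Download each entry as .cif; conversion to .pdb is a separate step."""
    os.makedirs(out_dir, exist_ok=True)
    for url, path in mmcif_targets(names_to_ids, out_dir):
        urllib.request.urlretrieve(url, path)
        print(f"Downloaded {path}")


# Hypothetical keys, chosen to match the naming used for the other structures.
exon20_cif = {"EGFR_exon20_9GC6": "9GC6", "EGFR_exon20_9GL8": "9GL8"}
# download_mmcif(exon20_cif)  # uncomment to fetch; requires network access
```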

## 2. Ligand Structures

TKI ligands were retrieved from ChEMBL with the code below. SMILES strings for the controls were entered manually, since they could not be fetched through this code.

**TKI ligands:** Gefitinib, Erlotinib, Afatinib, Mobocertinib, Osimertinib

**Controls:** Aspirin, Caffeine, Ibuprofen

The cells below generate `ligands_smiles.csv`, which contains the canonical SMILES string and the associated activity data for each ligand:

In [None]:
from chembl_webresource_client.new_client import new_client
import pandas as pd

#ChEMBL endpoints
target = new_client.target       #for querying targets
activity = new_client.activity   #for activity data
molecule = new_client.molecule   #for molecular info (SMILES, etc.)

In [None]:
#TKIs: fetch all binding assay activity data against EGFR WT
#Controls: negative controls, only SMILES needed (no activity)

#TKIs with ChEMBL IDs
selected_ligands = [
    {"name": "Gefitinib", "chembl_id": "CHEMBL939", "type": "TKI"},
    {"name": "Erlotinib", "chembl_id": "CHEMBL553", "type": "TKI"},
    {"name": "Afatinib", "chembl_id": "CHEMBL1173655", "type": "TKI"},
    {"name": "Mobocertinib", "chembl_id": "CHEMBL4650319", "type": "TKI"},
    {"name": "Osimertinib", "chembl_id": "CHEMBL3353410", "type": "TKI"}
]

#control ligands (manual SMILES)
control_ligands = [
    {"name": "Aspirin", "chembl_id": "CHEMBL25", "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O", "type": "control"},
    {"name": "Caffeine", "chembl_id": "CHEMBL68", "canonical_smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "type": "control"},
    {"name": "Ibuprofen", "chembl_id": "CHEMBL521", "canonical_smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "type": "control"}
]

all_ligands = selected_ligands + control_ligands
ligand_data = []

def make_row(ligand, record=None):
    #one CSV row; activity fields stay None when no ChEMBL record is given
    record = record or {}
    return {
        "chembl_id": ligand["chembl_id"],
        "name": ligand["name"],
        "canonical_smiles": ligand.get("canonical_smiles"),
        "standard_type": record.get("standard_type"),
        "standard_value": record.get("standard_value"),
        "standard_units": record.get("standard_units")
    }

for ligand in all_ligands:
    chembl_id = ligand["chembl_id"]

    #controls: just keep canonical SMILES, no activity
    if ligand["type"] != "TKI":
        ligand_data.append(make_row(ligand))
        continue

    #get canonical SMILES if missing (only for TKIs)
    if "canonical_smiles" not in ligand:
        try:
            mol_info = molecule.get(chembl_id)
            #molecule_structures may be None, so guard before reading SMILES
            structures = mol_info.get("molecule_structures") or {}
            ligand["canonical_smiles"] = structures.get("canonical_smiles")
        except Exception as e:
            print(f"[SMILES] Failed for {ligand['name']} ({chembl_id}): {e}")
            ligand["canonical_smiles"] = None

    #fetch activity data for TKIs only
    try:
        results = activity.filter(
            molecule_chembl_id=chembl_id,
            target_chembl_id="CHEMBL203",   #EGFR WT
            assay_type="B",                 #binding assays
            standard_relation="="           #exact values
        )
        if results:
            #one row per activity record
            for r in results:
                ligand_data.append(make_row(ligand, r))
        else:
            #no activity found
            ligand_data.append(make_row(ligand))
    except Exception as e:
        print(f"[Activity] Failed for {ligand['name']} ({chembl_id}): {e}")
        ligand_data.append(make_row(ligand))

In [None]:
df = pd.DataFrame(ligand_data)
df = df.sort_values(["name", "standard_type"], ignore_index=True)
df.to_csv("ligands_smiles.csv", index=False)
print("Saved ligands_smiles.csv")

In [None]:
print(df)
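
Because the table holds one row per ChEMBL activity record, each TKI can appear many times, while the controls appear once with empty activity fields. A per-ligand summary may be easier to read than the raw printout. The sketch below is one possible way to do that. `summarise_activity` is a hypothetical helper, the example values are made up for illustration, and `standard_value` is converted with `pd.to_numeric` on the assumption that it may arrive as a string.

```python
import pandas as pd


def summarise_activity(df):
    """Median standard_value per ligand, activity type and unit, with counts."""
    # Keep only rows that carry a measurement (this drops the controls).
    measured = df.dropna(subset=["standard_type", "standard_value"]).copy()
    measured["standard_value"] = pd.to_numeric(
        measured["standard_value"], errors="coerce"
    )
    measured = measured.dropna(subset=["standard_value"])
    return (
        measured.groupby(
            ["name", "standard_type", "standard_units"], dropna=False
        )["standard_value"]
        .agg(n="count", median="median")
        .reset_index()
    )


# Made-up example values, shaped like the rows written to ligands_smiles.csv.
example = pd.DataFrame({
    "name": ["Gefitinib", "Gefitinib", "Gefitinib", "Aspirin"],
    "standard_type": ["IC50", "IC50", "IC50", None],
    "standard_value": ["1.0", "3.0", "5.0", None],
    "standard_units": ["nM", "nM", "nM", None],
})
summary = summarise_activity(example)
print(summary)
```

Grouping by `standard_units` as well as `standard_type` keeps values reported in different units from being pooled into one median.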