<a href="https://colab.research.google.com/github/win-eva/EGFR-TKI-Docking-Analysis/blob/main/scripts/01_fetchdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fetching EGFR Proteins**

## Wildtype
- PDB 4WKQ (Gefitinib)
- PDB 4G5J (Afatinib)
- PDB 8F1X (Mobocertinib)

## L858R
- PDB 2ITZ (Gefitinib)
- PDB 6JWL (Osimertinib)
- PDB 2ITT (AEE788)  
  *Note:* This structure was co-crystallised with AEE788, a second-generation EGFR TKI similar to Afatinib, providing a representative pocket conformation.

## T790M
- PDB 6JX0 (Osimertinib)
- PDB 4G5P (Afatinib)
- PDB 5GMP (XTF-262)  
*Note:* XTF-262 serves as a surrogate for T790M-specific hinge binding due to lack of co-crystals with first-gen TKIs.

## Exon20 Insertion
- PDB 4LRM (PD168393)
- PDB 9GC6 (A1IZ9)
- PDB 9GL8 (STX-721)  
*Notes:*  
  - 4LRM was used as a first-generation reference.  
  - A1IZ9 represents second/third-generation inhibitors.  
  - STX-721 captures mutant-induced pocket geometry.



## Downloading PDB Structures

This cell automatically downloads all selected EGFR receptor structures into the `proteins/` folder.

In [1]:
import urllib.request
import urllib.error
import os

# Folder to store the PDBs in
os.makedirs("proteins", exist_ok=True)

# List of PDBs to fetch
pdb_ids = {
    "EGFR_wt_4WKQ": "4WKQ",
    "EGFR_wt_4G5J": "4G5J",
    "EGFR_wt_8F1X": "8F1X",
    "EGFR_L858R_2ITZ": "2ITZ",
    "EGFR_L858R_6JWL": "6JWL",
    "EGFR_L858R_2ITT": "2ITT",
    "EGFR_T790M_6JX0": "6JX0",
    "EGFR_T790M_4G5P": "4G5P",
    "EGFR_T790M_5GMP": "5GMP",
    "EGFR_exon20_4LRM": "4LRM"
}

# Downloading PDBs
for name, pdb_id in pdb_ids.items():
    pdb_id = pdb_id.upper()
    pdb_file = f"proteins/{name}.pdb"
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"

    try:
        urllib.request.urlretrieve(url, pdb_file)
        print(f"Downloaded {pdb_file}")
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print(f"PDB ID {pdb_id} not found on RCSB. Skipping {name}.")
        else:
            print(f"Failed to download {pdb_id}: {e}")

print("Protein download process completed!")

Downloaded proteins/EGFR_wt_4WKQ.pdb
Downloaded proteins/EGFR_wt_4G5J.pdb
Downloaded proteins/EGFR_wt_8F1X.pdb
Downloaded proteins/EGFR_L858R_2ITZ.pdb
Downloaded proteins/EGFR_L858R_6JWL.pdb
Downloaded proteins/EGFR_L858R_2ITT.pdb
Downloaded proteins/EGFR_T790M_6JX0.pdb
Downloaded proteins/EGFR_T790M_4G5P.pdb
Downloaded proteins/EGFR_T790M_5GMP.pdb
Downloaded proteins/EGFR_exon20_4LRM.pdb
Protein download process completed!


Two Exon20 Insertion structures (PDB 9GL8 + 9GC6) were only available in PDBx/mmCIF format. These were downloaded as .cif files and converted to .pdb using PyMOL to maintain consistency with the automated workflow for older structures.

## **Fetching Ligand Structures**

The following ligands are retrieved from ChEMBL or manually sourced (for controls).  
**TKI ligands**: Gefitinib, Erlotinib, Afatinib, Mobocertinib, Osimertinib.  
**Controls**: Aspirin, Caffeine, Ibuprofen.

The following cells create a CSV file `selected_egfr_ligands_smiles.csv` containing canonical SMILES and activity data.


In [2]:
!pip install chembl_webresource_client

from chembl_webresource_client.new_client import new_client

# Connect to target and activity endpoints
target = new_client.target
activity = new_client.activity



In [6]:
# ChEMBL clients
activity = new_client.activity
molecule = new_client.molecule

# TKI ligands with ChEMBL IDs
selected_ligands = [
    {"name": "Gefitinib", "molecule_chembl_id": "CHEMBL939"},
    {"name": "Erlotinib", "molecule_chembl_id": "CHEMBL553"},
    {"name": "Afatinib", "molecule_chembl_id": "CHEMBL1173655"},
    {"name": "Mobocertinib", "molecule_chembl_id": "CHEMBL4650319"},
    {"name": "Osimertinib", "molecule_chembl_id": "CHEMBL3353410"}
]

# Control ligands with manually retrieved SMILES and ChEMBL IDs
control_ligands = [
    {"name": "Aspirin", "molecule_chembl_id": "CHEMBL25", "canonical_smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    {"name": "Caffeine", "molecule_chembl_id": "CHEMBL68", "canonical_smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
    {"name": "Ibuprofen", "molecule_chembl_id": "CHEMBL521", "canonical_smiles": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"}
]

all_ligands = selected_ligands + control_ligands
ligand_data = []

for ligand in all_ligands:
    chembl_id = ligand['molecule_chembl_id']

    # Fetch canonical SMILES for TKI ligands if missing
    if 'canonical_smiles' not in ligand or ligand['canonical_smiles'] is None:
        try:
            mol_info = molecule.get(chembl_id)
            ligand['canonical_smiles'] = mol_info.get('molecule_structures', {}).get('canonical_smiles')
        except Exception as e:
            print(f"Failed to fetch SMILES for {ligand['name']} ({chembl_id}): {e}")
            ligand['canonical_smiles'] = None

    try:
        # Fetch activity only for TKIs
        if chembl_id not in ["CHEMBL25", "CHEMBL68", "CHEMBL521"]:
            res = activity.filter(
                molecule_chembl_id=chembl_id,
                target_chembl_id='CHEMBL203',  # EGFR wild-type
                molecule_type="Small molecule"
            )
            if res:
                for r in res:
                    ligand_data.append({
                        'molecule_chembl_id': chembl_id,
                        'name': ligand['name'],
                        'canonical_smiles': ligand.get('canonical_smiles'),
                        'standard_type': r.get('standard_type'),
                        'standard_value': r.get('standard_value'),
                        'standard_units': r.get('standard_units')
                    })
            else:
                # TKI ligand with no activity data found
                ligand_data.append({
                    'molecule_chembl_id': chembl_id,
                    'name': ligand['name'],
                    'canonical_smiles': ligand.get('canonical_smiles'),
                    'standard_type': None,
                    'standard_value': None,
                    'standard_units': None
                })
        else:
            # Controls: just use manually set SMILES
            ligand_data.append({
                'molecule_chembl_id': chembl_id,
                'name': ligand['name'],
                'canonical_smiles': ligand['canonical_smiles'],
                'standard_type': None,
                'standard_value': None,
                'standard_units': None
            })

    except Exception as e:
        print(f"Failed to fetch activity for {ligand['name']} ({chembl_id}): {e}")
        ligand_data.append({
            'molecule_chembl_id': chembl_id,
            'name': ligand['name'],
            'canonical_smiles': ligand.get('canonical_smiles'),
            'standard_type': None,
            'standard_value': None,
            'standard_units': None
        })

In [7]:
import pandas as pd

df = pd.DataFrame(ligand_data)
df.to_csv("selected_egfr_ligands_smiles.csv", index=False)
print("Saved selected_egfr_ligands_smiles.csv")

Saved selected_egfr_ligands_smiles.csv


In [8]:
print(df)

     molecule_chembl_id         name  \
0             CHEMBL939    Gefitinib   
1             CHEMBL939    Gefitinib   
2             CHEMBL939    Gefitinib   
3             CHEMBL939    Gefitinib   
4             CHEMBL939    Gefitinib   
...                 ...          ...   
1141      CHEMBL3353410  Osimertinib   
1142      CHEMBL3353410  Osimertinib   
1143           CHEMBL25      Aspirin   
1144           CHEMBL68     Caffeine   
1145          CHEMBL521    Ibuprofen   

                                       canonical_smiles standard_type  \
0        COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1          IC50   
1        COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1          IC50   
2        COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1    Inhibition   
3        COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1          IC50   
4        COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OCCCN1CCOCC1            Kd   
...                                                 ...           ...   
1141  C=