<a href="https://colab.research.google.com/github/win-eva/EGFR-TKI-Docking-Analysis/blob/main/02_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Preprocessing**
## 1. Ligand Preparation
Ligand preprocessing was performed locally on macOS, as certain RDKit and Meeko functions were either incompatible with or unstable in Google Colab.
### 1.1 Environment Setup
A dedicated conda environment (`chem`) was created with RDKit, pandas, numpy and Meeko installed:

In [None]:
#create and activate environment with ligprep_env.yml file
conda env create -f ligprep_env.yml

conda activate chem

### 1.2 Processing
Ligand SMILES were read into Python, duplicates were removed, and each ligand was converted into a 3D `.pdbqt` file. During conversion, hydrogens were added, and torsions and charges were prepared using RDKit and Meeko.

In [None]:
import os
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from meeko import MoleculePreparation

In [None]:
csv_file = "ligands_smiles.csv"

In [None]:
df = pd.read_csv(csv_file)
#drop repeated SMILES, then renumber the remaining rows from 0
df_unique = df.drop_duplicates(subset=['canonical_smiles']).reset_index(drop=True)

In [None]:
#create directory for ligands
os.makedirs("ligand_pdbqt", exist_ok=True)

In [None]:
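#converts SMILES to 3D structures; adds hydrogens, prepares torsions and charges,
#and saves each ligand as a .pdbqt file ready for AutoDock Vina
for idx, row in df_unique.iterrows():
    name = row['name']
    smiles = row['canonical_smiles']
    mol = Chem.MolFromSmiles(smiles)
    #MolFromSmiles returns None when the SMILES string cannot be parsed
    if mol is None:
        print(f"Skipped {name}: could not parse SMILES")
        continue
    mol = Chem.AddHs(mol)
    #EmbedMolecule returns -1 when no 3D conformer could be generated
    if AllChem.EmbedMolecule(mol, AllChem.ETKDG()) == -1:
        print(f"Skipped {name}: 3D embedding failed")
        continue
    prep = MoleculePreparation()
    prep.prepare(mol)
    pdbqt_path = f"ligand_pdbqt/{name}.pdbqt"
    prep.write_pdbqt_file(pdbqt_path)
    print(f"Saved {pdbqt_path}")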
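The ligand name is used directly as the output filename, so a name containing a slash, space or bracket could produce an invalid or awkward path. The sketch below shows one possible way to sanitise names before building the path. The helper `safe_ligand_filename` and the example names are illustrative assumptions, not part of the original workflow or dataset.

```python
import re

def safe_ligand_filename(name: str) -> str:
    """Return a filesystem-friendly stem for a ligand name (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, dot,
    # dash or underscore with a single underscore
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", str(name).strip())
    # Trim separators left at the ends; fall back to a placeholder if empty
    return stem.strip("._-") or "ligand"

# Illustrative names only; the real names come from ligands_smiles.csv
for raw in ["Gefitinib", "AZD-9291 (osimertinib)", "CO-1686/rociletinib"]:
    print(f"ligand_pdbqt/{safe_ligand_filename(raw)}.pdbqt")
```

If two different names collapse to the same stem, the later file would overwrite the earlier one, so a uniqueness check may be worth adding when the ligand list is large.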

## 2. Protein Preparation
### 2.1 PyMOL Preprocessing
The downloaded EGFR PDB structures were preprocessed in PyMOL to remove ions, bound ligands and water molecules. Cleaned structures were saved with the suffix `_nowater.pdb`.
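As an optional sanity check on the cleaned files, the remaining `HETATM` records can be counted per residue name. This is a standard-library sketch, not the PyMOL procedure itself; the function name `count_hetero_residues` and the three example records are assumptions made for illustration. A non-empty result is not necessarily an error, since modified residues are also stored as `HETATM` records.

```python
from collections import Counter

def count_hetero_residues(pdb_text: str) -> Counter:
    """Count HETATM atoms per residue name in PDB-format text."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # The residue name occupies columns 18-20 of a PDB record
            counts[line[17:20].strip()] += 1
    return counts

# Made-up records: one protein atom, one water and one magnesium ion
example = "\n".join([
    "ATOM      1  N   GLY A 696      10.000  10.000  10.000  1.00 20.00           N",
    "HETATM 2001  O   HOH A 901      12.000  12.000  12.000  1.00 30.00           O",
    "HETATM 2002 MG    MG A 902      14.000  14.000  14.000  1.00 25.00          MG",
])
print(count_hetero_residues(example))
```

For a real file, the text could be read with `Path(...).read_text()` and passed to the same function; an empty counter would indicate that no waters, ions or ligands remain.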

### 2.2 Environment Setup for MGLTools
MGLTools (version 1.5.7 Patch 1; https://ccsb.scripps.edu/mgltools/downloads) was used to convert the cleaned `.pdb` files into `.pdbqt` format for docking.

Receptor preparation was performed in a Google Cloud Linux environment, as this provided a stable platform for running MGLTools after other setups (Google Colab and macOS) proved incompatible.

In [None]:
#upload requirements.txt and install environment
!pip install -r requirements.txt

In [None]:
#create directories for protein files
mkdir -p ~/receptors/pdb_files
mkdir -p ~/receptors/receptors_pdbqt

#upload .pdb files first
#then move preprocessed .pdb files into working directory
mv ~/EGFR_*_nowater.pdb ~/receptors/pdb_files/

#upload and extract MGLTools
tar -xvzf mgltools_x86_64Linux2_1.5.7p1.tar.gz
cd mgltools_x86_64Linux2_1.5.7
source ./initMGLtools.sh

### 2.3 Conversion to `.pdbqt`
Each receptor was converted to `.pdbqt` using the `prepare_receptor4.py` script from MGLTools. The command was repeated for every PDB file across all four receptor types:

In [None]:
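#in each command, replace * with the identifier of a single structure;
#prepare_receptor4.py takes one receptor file per call

#wildtype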
~/mgltools_x86_64Linux2_1.5.7/bin/pythonsh \
~/mgltools_x86_64Linux2_1.5.7/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py \
-r ~/receptors/pdb_files/EGFR_wt_*_nowater.pdb \
-o ~/receptors/receptors_pdbqt/EGFR_wt_*_nowater.pdbqt

#L858R mutant
~/mgltools_x86_64Linux2_1.5.7/bin/pythonsh \
~/mgltools_x86_64Linux2_1.5.7/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py \
-r ~/receptors/pdb_files/EGFR_L858R_*_nowater.pdb \
-o ~/receptors/receptors_pdbqt/EGFR_L858R_*_nowater.pdbqt

#T790M mutant
~/mgltools_x86_64Linux2_1.5.7/bin/pythonsh \
~/mgltools_x86_64Linux2_1.5.7/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py \
-r ~/receptors/pdb_files/EGFR_T790M_*_nowater.pdb \
-o ~/receptors/receptors_pdbqt/EGFR_T790M_*_nowater.pdbqt

#Exon20 insertion mutant
~/mgltools_x86_64Linux2_1.5.7/bin/pythonsh \
~/mgltools_x86_64Linux2_1.5.7/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py \
-r ~/receptors/pdb_files/EGFR_exon20_*_nowater.pdb \
-o ~/receptors/receptors_pdbqt/EGFR_exon20_*_nowater.pdbqt
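Because `prepare_receptor4.py` takes one receptor file per call, the repetition across files could be scripted rather than typed by hand. The sketch below builds one command per cleaned PDB file, assuming the directory layout and MGLTools location used above. The function name `build_commands` is hypothetical, and the call that would actually run each command is left commented out.

```python
from pathlib import Path

# Paths assumed from the setup above; adjust if MGLTools lives elsewhere
MGL_ROOT = Path.home() / "mgltools_x86_64Linux2_1.5.7"
PYTHONSH = MGL_ROOT / "bin" / "pythonsh"
PREPARE = (MGL_ROOT / "MGLToolsPckgs" / "AutoDockTools"
           / "Utilities24" / "prepare_receptor4.py")

def build_commands(pdb_dir: Path, out_dir: Path) -> list[list[str]]:
    """Return one prepare_receptor4.py command per cleaned PDB file."""
    if not pdb_dir.is_dir():
        return []
    commands = []
    for pdb in sorted(pdb_dir.glob("EGFR_*_nowater.pdb")):
        # Same file stem, with the .pdbqt extension, in the output folder
        out = out_dir / (pdb.stem + ".pdbqt")
        commands.append([str(PYTHONSH), str(PREPARE),
                         "-r", str(pdb), "-o", str(out)])
    return commands

if __name__ == "__main__":
    receptors = Path.home() / "receptors"
    for cmd in build_commands(receptors / "pdb_files",
                              receptors / "receptors_pdbqt"):
        print(" ".join(cmd))
        # To run the conversion, import subprocess and uncomment:
        # subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to confirm that every input file maps to the intended output name before any conversion is started.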

The resulting `.pdbqt` files were saved in the `receptors_pdbqt` folder, downloaded locally, and later uploaded to Google Colab for docking.