# 🧬 Drug Side Effect Prediction with Transformers

**Complete pipeline for training and evaluating a drug side effect prediction model**

- 🚀 PyTorch 2.x optimizations (AMP, torch.compile, Flash Attention)
- 📊 10-fold cross-validation
- 📈 Comprehensive metrics (RMSE, AUC-ROC, Pearson, etc.)
- ⚡ GPU accelerated training

---

## 1️⃣ Setup Environment

Install dependencies and clone repository

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install dependencies
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q numpy pandas scikit-learn scipy tqdm tensorboard matplotlib seaborn
!pip install -q rdkit subword-nmt

print("✓ Dependencies installed!")

In [None]:
# Clone repository (replace with your GitHub URL)
import os

REPO_URL = "https://github.com/YOUR_USERNAME/drug-side-effect-prediction.git"
REPO_NAME = "drug-side-effect-prediction"

# Remove existing directory if any
if os.path.exists(REPO_NAME):
    !rm -rf {REPO_NAME}

# Clone
!git clone {REPO_URL}

# Change to repo directory
%cd {REPO_NAME}

print("✓ Repository cloned!")
!ls -lh

### Alternative: Upload files directly

If you don't have a GitHub repo, upload files manually:

In [None]:
# Uncomment to upload files manually
# from google.colab import files
# uploaded = files.upload()
# 
# # Create project structure
# !mkdir -p data/raw data/processed
# !mkdir -p outputs/checkpoints outputs/logs outputs/results

## 2️⃣ Verify Installation

In [None]:
import torch
import numpy as np
import pandas as pd

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
print("\n✓ All imports successful!")

In [None]:
# Test project imports
!python test_imports.py

## 3️⃣ Upload Data

Upload your data files to `data/raw/`

In [None]:
# Create directories
!mkdir -p data/raw data/processed

# Upload data files
from google.colab import files

print("Please upload the following files:")
print("  1. drug_SMILES_750.csv")
print("  2. drug_codes_chembl_freq_1500.txt")
print("  3. subword_units_map_chembl_freq_1500.csv")
print("  4. drug_side.pkl")
print("\nUploading...")

uploaded = files.upload()

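# Move files to data/raw (os.replace copes with spaces in filenames,
# which the shell-based `!mv {filename}` does not)
import os

for filename in uploaded.keys():
    os.replace(filename, os.path.join("data/raw", filename))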

print("\n✓ Data files uploaded!")
!ls -lh data/raw/

### Alternative: Download from Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Copy data from Drive (adjust paths)
# !cp /content/drive/MyDrive/drug_data/* data/raw/

print("✓ Google Drive mounted!")

## 4️⃣ Preprocess Data

Extract features and create cross-validation splits

In [None]:
# Run preprocessing
!python preprocess_data.py \
    --data_dir data/raw \
    --output_dir data/processed \
    --top_k 50 \
    --n_folds 10 \
    --random_state 42

print("\n✓ Preprocessing complete!")
print("\nProcessed files:")
!ls -lh data/processed/
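
How `preprocess_data.py` builds its folds is not shown in this notebook. As a rough illustration of what a seeded 10-fold split looks like, the sketch below shuffles sample indices with NumPy and cuts them into `n_folds` validation blocks. The function name `make_folds` and the sample count of 1000 are hypothetical, and the real script may split differently (for example by drug rather than by sample).

```python
import numpy as np


def make_folds(num_samples, n_folds=10, random_state=42):
    """Return a list of (train_idx, val_idx) index arrays, one pair per fold."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(num_samples)
    # Each block serves once as the validation set.
    val_blocks = np.array_split(shuffled, n_folds)
    folds = []
    for k, val_idx in enumerate(val_blocks):
        train_idx = np.concatenate(
            [block for j, block in enumerate(val_blocks) if j != k]
        )
        folds.append((train_idx, val_idx))
    return folds


folds = make_folds(num_samples=1000)  # 1000 is a placeholder sample count
train_idx, val_idx = folds[0]
print(len(folds), len(train_idx), len(val_idx))
```

Fixing `random_state` makes the split reproducible, which matters when folds are trained in separate sessions.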

In [None]:
# Check dataset statistics
import json

with open('data/processed/dataset_statistics.json', 'r') as f:
    stats = json.load(f)

print("="*60)
print("Dataset Statistics")
print("="*60)
print(f"Total samples:      {stats['dataset']['num_samples']:,}")
print(f"Drugs:              {stats['dataset']['num_drugs']:,}")
print(f"Side effects:       {stats['dataset']['num_side_effects']:,}")
print(f"Positive samples:   {stats['dataset']['num_positive']:,}")
print(f"Negative samples:   {stats['dataset']['num_negative']:,}")
print(f"Positive ratio:     {stats['dataset']['positive_ratio']:.2%}")
print("="*60)

## 5️⃣ Training

Train model with all optimizations

### Option A: Quick Training (1 fold, few epochs)

In [None]:
# Quick training for testing (1 fold, 10 epochs)
!python train.py \
    --config fast \
    --epochs 10 \
    --batch_size 128 \
    --lr 1e-4 \
    --device cuda \
    --use_amp \
    --start_fold 0 \
    --end_fold 1 \
    --seed 42

print("\n✓ Quick training complete!")
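
What `--use_amp` does inside `train.py` is not shown here. As a minimal sketch of the usual PyTorch mixed-precision pattern, the step below wraps the forward pass in `torch.autocast` and scales the loss with a `GradScaler`. The tiny linear model, the MSE loss, and the `train_step` name are stand-ins chosen for illustration; on a machine without a GPU both autocast and scaling are disabled, so the step runs in full precision.

```python
import torch
from torch import nn


def train_step(model, batch, optimizer, scaler, device_type):
    """Run one optimization step, using mixed precision when on CUDA."""
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    # Autocast runs eligible ops in reduced precision; disabled on CPU here.
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        loss = nn.functional.mse_loss(model(inputs), targets)
    # The scaler guards against fp16 gradient underflow; it is a no-op when disabled.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 1).to(device)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
batch = (torch.randn(16, 8, device=device), torch.randn(16, 1, device=device))
loss_value = train_step(model, batch, optimizer, scaler, device)
print(f"loss: {loss_value:.4f}")
```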

### Option B: Full Training (10-fold CV, 200 epochs)

In [None]:
# Full training (WARNING: This will take 80-150 hours!)
# Uncomment to run

# !python train.py \
#     --config fast \
#     --epochs 200 \
#     --batch_size 128 \
#     --lr 1e-4 \
#     --device cuda \
#     --use_amp \
#     --compile_model \
#     --start_fold 0 \
#     --end_fold 10 \
#     --seed 42

print("Full training configured (uncomment to run)")

### Option C: Train Specific Folds

In [None]:
# Train specific fold (e.g., fold 0)
FOLD = 0
EPOCHS = 50

!python train.py \
    --config fast \
    --epochs {EPOCHS} \
    --batch_size 128 \
    --device cuda \
    --use_amp \
    --start_fold {FOLD} \
    --end_fold {FOLD + 1}

print(f"\n✓ Fold {FOLD} training complete!")

## 6️⃣ Monitor Training

View training progress with TensorBoard

In [None]:
# Load TensorBoard extension
%load_ext tensorboard

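# Start TensorBoard
# Adjust --logdir if train.py writes its logs elsewhere
# (the project layout above also creates outputs/logs)
%tensorboard --logdir logs/tensorboard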

In [None]:
# Check training results
import json

# Load results for fold 0
with open('outputs/results/fold_0_results.json', 'r') as f:
    results = json.load(f)

print("="*60)
print("Fold 0 Validation Results")
print("="*60)
print(f"RMSE:       {results['rmse']:.4f}")
print(f"MAE:        {results['mae']:.4f}")
print(f"Pearson:    {results['pearson']:.4f}")
print(f"Spearman:   {results['spearman']:.4f}")
print(f"AUC-ROC:    {results['auc_roc']:.4f}")
print(f"AUC-PR:     {results['auc_pr']:.4f}")
print(f"Accuracy:   {results['accuracy']:.4f}")
print(f"F1-Score:   {results['f1']:.4f}")
print("="*60)
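
The results file only stores the final numbers. For readers who want to see how the regression metrics relate to raw predictions, here is a small NumPy sketch of RMSE, MAE, and Pearson correlation. The function name `regression_metrics` and the toy values are hypothetical, and the project's own metric code may differ. Spearman, AUC-ROC, and AUC-PR are left out; `scipy.stats.spearmanr`, `sklearn.metrics.roc_auc_score`, and `sklearn.metrics.average_precision_score` cover them, but the classification metrics need binary labels and this notebook does not say how the labels are binarized.

```python
import numpy as np


def regression_metrics(labels, predictions):
    """Compute RMSE, MAE, and Pearson correlation for paired values."""
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    errors = predictions - labels
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
        "pearson": float(np.corrcoef(labels, predictions)[0, 1]),
    }


# Toy values for illustration only
toy = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 3.5])
print(toy)
```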

## 7️⃣ Evaluation

Evaluate trained models on test set

In [None]:
# Evaluate all trained folds
!python evaluate.py \
    --checkpoint_dir outputs/checkpoints \
    --output_dir outputs \
    --device cuda \
    --save_predictions

print("\n✓ Evaluation complete!")

In [None]:
# Show test results
import json

# Load aggregated results
with open('outputs/results/test_aggregated_results.json', 'r') as f:
    agg_results = json.load(f)

print("="*60)
print("Test Results (Mean ± Std)")
print("="*60)

metrics_to_show = [
    'rmse', 'mae', 'pearson', 'spearman',
    'auc_roc', 'auc_pr', 'accuracy', 'f1'
]

for metric in metrics_to_show:
    mean_key = f"{metric}_mean"
    std_key = f"{metric}_std"
    
    if mean_key in agg_results:
        mean_val = agg_results[mean_key]
        std_val = agg_results.get(std_key, 0)
        print(f"{metric:12s}: {mean_val:.4f} ± {std_val:.4f}")

print("="*60)
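
How `evaluate.py` produces the aggregated file is not shown in this notebook. The sketch below illustrates one plausible way to get `<metric>_mean` and `<metric>_std` values from per-fold dictionaries. The helper names are hypothetical, the `fold_<k>_results.json` pattern is borrowed from the validation results in section 6, and the choice of population standard deviation is an assumption.

```python
import json
import statistics
from pathlib import Path


def aggregate_folds(fold_results):
    """Compute <metric>_mean and <metric>_std across a list of per-fold dicts."""
    aggregated = {}
    for metric in fold_results[0]:
        values = [fold[metric] for fold in fold_results]
        aggregated[f"{metric}_mean"] = statistics.fmean(values)
        # Population std; statistics.stdev (sample std) is an equally plausible choice.
        aggregated[f"{metric}_std"] = statistics.pstdev(values)
    return aggregated


def load_fold_results(results_dir="outputs/results"):
    """Load every fold_<k>_results.json found in results_dir (assumed naming)."""
    paths = sorted(Path(results_dir).glob("fold_*_results.json"))
    return [json.loads(path.read_text()) for path in paths]


# Toy per-fold values for illustration only
example = aggregate_folds([{"rmse": 0.24}, {"rmse": 0.26}])
print(example)
```

With real runs, `aggregate_folds(load_fold_results())` would summarize whichever folds have finished so far.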

## 8️⃣ Visualization

Visualize predictions and metrics

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load predictions for fold 0
predictions_df = pd.read_csv('outputs/results/fold_0/predictions.csv')

# Plot predictions vs actual
plt.figure(figsize=(10, 8))
plt.scatter(predictions_df['label'], predictions_df['prediction'], 
            alpha=0.5, s=10)

# Diagonal line
min_val = min(predictions_df['label'].min(), predictions_df['prediction'].min())
max_val = max(predictions_df['label'].max(), predictions_df['prediction'].max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)

plt.xlabel('Actual Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title('Predictions vs Actual (Fold 0)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

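# Plot absolute error distribution (recomputed from the predictions so the
# axis label holds even if the 'error' column stores signed errors)
abs_error = (predictions_df['prediction'] - predictions_df['label']).abs()
plt.figure(figsize=(10, 6))
plt.hist(abs_error, bins=50, edgecolor='black', alpha=0.7)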
plt.xlabel('Absolute Error', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Error Distribution (Fold 0)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✓ Visualizations complete!")

## 9️⃣ Download Results

Download trained models and results

In [None]:
# Zip results
!zip -r results.zip outputs/results/
!zip -r checkpoints.zip outputs/checkpoints/

print("✓ Results zipped!")
!ls -lh *.zip

In [None]:
# Download results
from google.colab import files

files.download('results.zip')
files.download('checkpoints.zip')

print("✓ Files downloaded!")

### Alternative: Save to Google Drive

In [None]:
# Copy to Google Drive (requires the Drive mount from section 3)
!cp -r outputs/results /content/drive/MyDrive/drug_prediction_results/
!cp -r outputs/checkpoints /content/drive/MyDrive/drug_prediction_checkpoints/

print("✓ Results saved to Google Drive!")

## 🔟 Inference (Optional)

Make predictions on new drug-side effect pairs

In [None]:
# Load trained model for inference
import torch
from config import get_default_config
from model import create_model
from smiles_encoder import create_smiles_encoder

# Load config
config = get_default_config()
config.device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Create model
model = create_model(config.model, device=config.device)

# Load checkpoint
checkpoint = torch.load(
    'outputs/checkpoints/fold_0/best_model.pth',
    map_location=config.device
)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Create SMILES encoder
smiles_encoder = create_smiles_encoder(
    vocab_path='data/raw/drug_codes_chembl_freq_1500.txt',
    subword_map_path='data/raw/subword_units_map_chembl_freq_1500.csv',
    max_len=50
)

print("✓ Model loaded for inference!")

In [None]:
# Make prediction on a single drug-SE pair
import numpy as np

def predict_side_effect(drug_smiles, se_id, se_index, se_mask):
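    """
    Predict the severity score for one drug-side effect pair.
    
    Args:
        drug_smiles: SMILES string of the drug
        se_id: Integer row index of the side effect in se_index and se_mask
        se_index: Array of SE substructure indices, one row per side effect
        se_mask: Array of SE substructure masks, one row per side effect
    
    Returns:
        Predicted severity score as a Python float
    """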
    # Encode drug SMILES
    drug_encoded, drug_mask = smiles_encoder.encode(drug_smiles)
    
    # Prepare inputs
    drug_tensor = torch.from_numpy(drug_encoded).unsqueeze(0).to(config.device)
    drug_mask_tensor = torch.from_numpy(drug_mask).unsqueeze(0).to(config.device)
    
    se_tensor = torch.from_numpy(se_index[se_id]).unsqueeze(0).to(config.device)
    se_mask_tensor = torch.from_numpy(se_mask[se_id]).unsqueeze(0).to(config.device)
    
    # Predict
    with torch.no_grad():
        output, _, _ = model(drug_tensor, se_tensor, drug_mask_tensor, se_mask_tensor)
    
    return output.item()

# Example prediction
# Load SE data
se_index = np.load('data/processed/SE_sub_index_50_0.npy')
se_mask = np.load('data/processed/SE_sub_mask_50_0.npy')

# Example drug SMILES
example_smiles = "CC(C)Cc1ccc(cc1)[C@@H](C)C(=O)O"  # Ibuprofen
example_se_id = 0  # Side effect ID

prediction = predict_side_effect(example_smiles, example_se_id, se_index, se_mask)

print(f"Drug: {example_smiles}")
print(f"Side Effect ID: {example_se_id}")
print(f"Predicted Severity: {prediction:.4f}")

## 💡 Tips & Tricks

### For Faster Training:
1. **Use the largest batch size that fits in GPU memory**; drop to `--batch_size 64` only if you run out of memory
2. **Enable AMP**: `--use_amp` (already set in the training commands above)
3. **Use fast config**: `--config fast`
4. **Reduce epochs** for testing: `--epochs 10`

### For Better Results:
1. **Train all 10 folds** for robust evaluation
2. **Use 200 epochs** for full training
3. **Try different learning rates**: `--lr 1e-4` or `--lr 5e-5`
4. **Enable torch.compile** (PyTorch 2.0+): `--compile_model`

### For Memory Issues:
1. **Use memory efficient config**: `--config memory_efficient`
2. **Reduce batch size**: `--batch_size 32`
3. **Enable gradient checkpointing** in config

### To Resume Training:
1. Load checkpoint in `trainer.py`
2. Continue from last epoch
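
The resume logic in `trainer.py` is not shown in this notebook. As a hedged sketch of the usual PyTorch pattern, the helpers below save and restore model and optimizer state together with the epoch counter. Only the `model_state_dict` key appears in the inference cell above; the `optimizer_state_dict` and `epoch` keys, the helper names, and the `last_model.pth` filename are assumptions made for illustration.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    """Write model and optimizer state plus the finished epoch to disk."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer, device="cpu"):
    """Restore model and optimizer state; return the epoch to resume from."""
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1


model = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_model.pth")  # hypothetical filename
    save_checkpoint(path, model, optimizer, epoch=7)
    start_epoch = load_checkpoint(path, model, optimizer)
print(start_epoch)
```

Restoring the optimizer state along with the weights keeps AdamW's moment estimates, so training continues smoothly instead of restarting its adaptive step sizes.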

---

## 🔧 Troubleshooting

### Common Issues:

**1. CUDA Out of Memory:**
```bash
# Reduce batch size
python train.py --batch_size 32 --config memory_efficient
```

**2. Import Errors:**
```bash
# Verify all imports
python test_imports.py
```

**3. Unknown Token Warnings:**
- These warnings are expected and are capped at 10 per run
- Unknown tokens are handled gracefully

**4. Slow Training:**
- Make sure GPU is enabled: Runtime → Change runtime type → GPU
- Enable AMP: `--use_amp`
- Use fast config: `--config fast`

**5. Session Timeout:**
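- Colab free-tier sessions are time-limited (roughly 12 hours at most), so the 80-150 hour full run cannot finish in one session; train a few folds at a time with `--start_fold` and `--end_fold`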
- Save checkpoints frequently
- Use Google Drive to persist data
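
When training is spread over several sessions, it helps to know which folds still need to run. The sketch below lists folds that have no results file yet and prints a matching `train.py` command. It assumes that a finished fold leaves a `fold_<k>_results.json` file in `outputs/results` (the pattern used in section 6); the helper name `pending_folds` is hypothetical.

```python
from pathlib import Path


def pending_folds(results_dir="outputs/results", n_folds=10):
    """Return fold indices that do not yet have a results file."""
    results = Path(results_dir)
    return [
        k for k in range(n_folds)
        if not (results / f"fold_{k}_results.json").exists()
    ]


todo = pending_folds()
if todo:
    k = todo[0]
    print(f"Next: python train.py --start_fold {k} --end_fold {k + 1}")
else:
    print("All folds have results.")
```

Pointing `results_dir` at a Google Drive copy of the results lets the check survive a runtime reset.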

---

## 📝 Summary

### What We Did:
1. ✅ Set up environment and installed dependencies
2. ✅ Cloned/uploaded project code
3. ✅ Uploaded and preprocessed data
4. ✅ Trained drug side effect prediction model
5. ✅ Evaluated model performance
6. ✅ Visualized results
7. ✅ Downloaded/saved results

### Expected Results:
- **RMSE:** ~0.25
- **MAE:** ~0.18
- **Pearson:** ~0.85
- **AUC-ROC:** ~0.95
- **AUC-PR:** ~0.92

### Next Steps:
1. Train all 10 folds for complete evaluation
2. Experiment with hyperparameters
3. Try different model architectures
4. Deploy model for inference

---

**Happy Training! 🚀**