# Drug Side Effect Prediction with HSTrans

This notebook clones the Drug-Side-Effect repository and trains the HSTrans model.

## Repository: https://github.com/tuanha1305/Drug-Side-Effect.git

## 1. Setup Environment

In [None]:
# Check if GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("No GPU available, using CPU")

In [None]:
# Clone the repository
!git clone https://github.com/tuanha1305/Drug-Side-Effect.git

# Change to the repository directory
%cd Drug-Side-Effect

# List the contents
!ls -la

## 2. Install Dependencies

In [None]:
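# Install required packages
# (version specifiers are quoted so the shell does not treat ">" as an output redirect)
!pip install "torch>=1.9.0" "numpy>=1.19.0" "pandas>=1.2.0" "scipy>=1.6.0" "scikit-learn>=0.24.0" "matplotlib>=3.3.0"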

# Install RDKit (cheminformatics toolkit); `rdkit` is the current PyPI name, replacing the older `rdkit-pypi`
!pip install rdkit

# Install subword-nmt
!pip install subword-nmt

# Install networkx (quoted so ">" is not interpreted by the shell)
!pip install "networkx>=2.5"
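
After the install cell finishes, it can help to confirm which versions were actually installed. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is illustrative and not part of the repository, and RDKit is left out because its distribution name depends on how it was installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands in this notebook
packages = [
    "torch", "numpy", "pandas", "scipy",
    "scikit-learn", "matplotlib", "networkx", "subword-nmt",
]

for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the corresponding `pip install` line failed or ran in a different environment than the kernel.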

## 3. Check Data Files

In [None]:
# Check if data files exist
import os

data_files = [
    'data/drug_side.pkl',
    'data/drug_SMILES_750.csv',
    'data/raw_frequency_750.mat',
    'data/side_effect_label_750.mat',
    'data/subword_units_map_chembl_freq_1500.csv'
]

for file in data_files:
    if os.path.exists(file):
        print(f"✓ {file} exists")
    else:
        print(f"✗ {file} missing")

# Check the main Python files
python_files = ['main.py', 'Net.py', 'Encoder.py', 'utils.py', 'smiles2vector.py']
for file in python_files:
    if os.path.exists(file):
        print(f"✓ {file} exists")
    else:
        print(f"✗ {file} missing")
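
The check above only prints a status line per file, so a missing input would surface later as an error inside the training script. A minimal sketch of a fail-fast variant is shown here; the helper names `find_missing` and `require_files` are hypothetical, and the file list is copied from the cell above.

```python
from pathlib import Path

# File names copied from the check cell in this notebook
REQUIRED_FILES = [
    "data/drug_side.pkl",
    "data/drug_SMILES_750.csv",
    "data/raw_frequency_750.mat",
    "data/side_effect_label_750.mat",
    "data/subword_units_map_chembl_freq_1500.csv",
    "main.py",
    "Net.py",
    "Encoder.py",
    "utils.py",
    "smiles2vector.py",
]


def find_missing(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


def require_files(paths, root="."):
    """Raise FileNotFoundError naming every missing path; do nothing otherwise."""
    missing = find_missing(paths, root)
    if missing:
        raise FileNotFoundError("Missing required files: " + ", ".join(missing))
```

Calling `require_files(REQUIRED_FILES)` from the repository root before the training cell stops the notebook early with a single message listing everything that is absent.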

## 4. Training Configuration

In [None]:
# Training parameters
import os
import torch  # re-imported so this cell also runs after a kernel restart

# Set device (use GPU if available, otherwise CPU)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

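# Training hyperparameters
learning_rate = 1e-4
weight_decay = 0.01
num_epochs = 200
# batch_size is shown for reference only; the training command in section 5 does not pass it to main.py
batch_size = 128
log_interval = 40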

print(f"Learning rate: {learning_rate}")
print(f"Weight decay: {weight_decay}")
print(f"Number of epochs: {num_epochs}")
print(f"Batch size: {batch_size}")

# Create necessary directories
os.makedirs('checkpoints', exist_ok=True)
os.makedirs('results', exist_ok=True)
os.makedirs('predictResult', exist_ok=True)
os.makedirs('data/sub', exist_ok=True)

print("Directories created successfully!")

## 5. Run Training

In [None]:
# Run the training script with specified parameters
!python main.py \
    --model 0 \
    --lr {learning_rate} \
    --wd {weight_decay} \
    --epoch {num_epochs} \
    --log_interval {log_interval} \
    --cuda_name {device}
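
The `{learning_rate}`-style placeholders in the command above are filled in by IPython from the Python variables defined in section 4 before the shell runs the line. As a rough sketch of what the expanded command looks like, the same argument list can be assembled in plain Python. The function name `build_train_command` is hypothetical; the flags are only those already used in this notebook.

```python
import shlex


def build_train_command(learning_rate, weight_decay, num_epochs,
                        log_interval, device, model=0):
    """Assemble the argument list equivalent to the `!python main.py ...` cell."""
    return [
        "python", "main.py",
        "--model", str(model),
        "--lr", str(learning_rate),
        "--wd", str(weight_decay),
        "--epoch", str(num_epochs),
        "--log_interval", str(log_interval),
        "--cuda_name", str(device),
    ]


# Values from the configuration cell, with the CPU device as an example
cmd = build_train_command(1e-4, 0.01, 200, 40, "cpu")
print(shlex.join(cmd))
```

Printing the command first is a cheap way to catch a typo in a variable before starting a long run. The same list can also be passed to `subprocess.run(cmd, check=True)` when running outside a notebook.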

## 6. Monitor Training Progress

In [None]:
# Check training progress by looking at the metrics file
import json
import matplotlib.pyplot as plt
import pandas as pd

try:
    with open('results/train_metrics_per_epoch.json', 'r') as f:
        metrics = json.load(f)
    
    df_metrics = pd.DataFrame(metrics)
    print("Training Metrics:")
    print(df_metrics.tail())  # Show last few epochs
    
    # Plot training loss
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 3, 1)
    plt.plot(df_metrics['epoch'], df_metrics['train_loss'])
    plt.title('Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    
    plt.subplot(1, 3, 2)
    plt.plot(df_metrics['epoch'], df_metrics['MSE'])
    plt.title('MSE')
    plt.xlabel('Epoch')
    plt.ylabel('MSE')
    
    plt.subplot(1, 3, 3)
    plt.plot(df_metrics['epoch'], df_metrics['RMSE'])
    plt.title('RMSE')
    plt.xlabel('Epoch')
    plt.ylabel('RMSE')
    
    plt.tight_layout()
    plt.show()
    
except FileNotFoundError:
    print("Training metrics file not found yet. Training may still be in progress.")
except Exception as e:
    print(f"Error reading metrics: {e}")
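
Besides plotting the curves, it is often useful to know which epoch scored best. The sketch below assumes the metrics file holds a list of per-epoch records with the same keys plotted above (`epoch`, `train_loss`, `MSE`, `RMSE`); the actual layout written by `main.py` may differ, and the numbers here are made up for illustration.

```python
def best_epoch(metrics, key="RMSE"):
    """Return the per-epoch record with the smallest value for `key`."""
    return min(metrics, key=lambda row: row[key])


# Made-up records standing in for results/train_metrics_per_epoch.json
toy_metrics = [
    {"epoch": 1, "train_loss": 0.90, "MSE": 0.80, "RMSE": 0.89},
    {"epoch": 2, "train_loss": 0.60, "MSE": 0.49, "RMSE": 0.70},
    {"epoch": 3, "train_loss": 0.55, "MSE": 0.56, "RMSE": 0.75},
]

best = best_epoch(toy_metrics)
print(f"Lowest RMSE {best['RMSE']:.2f} at epoch {best['epoch']}")
```

If the file turns out to be a dictionary of columns instead, `df_metrics.loc[df_metrics['RMSE'].idxmin()]` on the DataFrame built above gives the same answer.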

## 7. View Results

In [None]:
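# Display final results
import json

try:
    with open('results/metrics_original.json', 'r') as f:
        final_metrics = json.load(f)
    
    print("Final Training Results:")
    print("=" * 40)
    for metric, value in final_metrics.items():
        # Format numbers to five decimals; print any other value unchanged
        if isinstance(value, (int, float)):
            print(f"{metric}: {value:.5f}")
        else:
            print(f"{metric}: {value}")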
    
except FileNotFoundError:
    print("Results file not found yet. Training may still be running.")
except Exception as e:
    print(f"Error reading results: {e}")

## 8. Check Model Checkpoints

In [None]:
# List saved model checkpoints
import os

checkpoint_dir = 'checkpoints'
if os.path.exists(checkpoint_dir):
    checkpoints = [f for f in os.listdir(checkpoint_dir) if f.endswith('.pth')]
    checkpoints.sort()
    print(f"Found {len(checkpoints)} checkpoints:")
    for ckpt in checkpoints:
        ckpt_path = os.path.join(checkpoint_dir, ckpt)
        size_mb = os.path.getsize(ckpt_path) / (1024 * 1024)
        print(f"  {ckpt} ({size_mb:.2f} MB)")
else:
    print("No checkpoints directory found.")

## 9. Download Results (Optional)

In [None]:
# Zip and download results
import zipfile
import os

# google.colab exists only on Colab; elsewhere the download helper is skipped
try:
    from google.colab import files as colab_files
except ImportError:
    colab_files = None

# Create a zip file with results, checkpoints, and predictions
with zipfile.ZipFile('training_results.zip', 'w') as zipf:
    for folder in ['results', 'checkpoints', 'predictResult']:
        if not os.path.exists(folder):
            continue
        # Loop variable names are chosen so they do not shadow the Colab module
        for root, _dirs, filenames in os.walk(folder):
            for name in filenames:
                path = os.path.join(root, name)
                zipf.write(path, os.path.relpath(path, '.'))

print("Results zipped successfully!")
print(f"File size: {os.path.getsize('training_results.zip') / (1024 * 1024):.2f} MB")

# Download the zip file (uncomment the lines below to download on Colab)
# if colab_files is not None:
#     colab_files.download('training_results.zip')

## 10. Training Summary

This notebook has:
1. ✅ Cloned the Drug-Side-Effect repository
2. ✅ Installed all required dependencies
3. ✅ Set up the training environment
4. ✅ Configured training parameters
5. ✅ Ran the HSTrans model training
6. ✅ Monitored training progress
7. ✅ Displayed final results
8. ✅ Listed saved checkpoints
9. ✅ Prepared results for download

### Model Information:
- **Architecture**: HSTrans (Hierarchical Transformer)
- **Task**: Drug side effect prediction
- **Input**: Drug SMILES sequences and side effect information
- **Output**: Side effect probability predictions

### Key Metrics:
- **MSE**: Mean Squared Error (lower is better)
- **RMSE**: Root Mean Squared Error (lower is better)
- **SCC**: Spearman Correlation Coefficient (higher is better)
- **AUC**: Area Under Curve (higher is better)
- **AUPR**: Area Under Precision-Recall Curve (higher is better)
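
To make the metric definitions above concrete, here is a minimal pure-Python sketch that computes each one on made-up data. It is not the repository's evaluation code; the function names are illustrative, the binary labels are hypothetical, and the AUPR estimate uses average precision under the assumption that no two scores are tied.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the targets."""
    return math.sqrt(mse(y_true, y_pred))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(y_true, y_pred):
    """SCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


def roc_auc(labels, scores):
    """AUC: probability that a positive outranks a negative (rank-sum form)."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """AUPR estimate: mean precision at each positive, sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Made-up targets, predictions, and binary labels for illustration only
y_true = [0, 1, 2, 3, 4]
y_pred = [0.5, 1.0, 1.5, 3.5, 3.0]
labels = [0, 1, 0, 1, 1]

print(f"MSE:  {mse(y_true, y_pred):.4f}")
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"SCC:  {spearman(y_true, y_pred):.4f}")
print(f"AUC:  {roc_auc(labels, y_pred):.4f}")
print(f"AUPR: {average_precision(labels, y_pred):.4f}")
```

MSE, RMSE, and SCC compare predicted scores against numeric targets, while AUC and AUPR only need a binary label per pair, which is why the sketch uses two kinds of ground truth.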

### Files Generated:
- `checkpoints/`: Model checkpoints (.pth files)
- `results/`: Training metrics and final results
- `predictResult/`: Model predictions
- `data/sub/`: Substructure analysis files

To continue training from a checkpoint, you can use the `--resume` parameter:
```bash
python main.py --resume checkpoints/latest_0.pth --lr 1e-4 --epoch 200
```
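
When several checkpoints have been saved, the path to resume from can be picked programmatically. The sketch below selects the most recently modified `.pth` file; the helper name `latest_checkpoint` is hypothetical, and it assumes modification time is a good proxy for training progress.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir="checkpoints", pattern="*.pth"):
    """Return the most recently modified checkpoint path, or None if none exist."""
    candidates = sorted(
        Path(checkpoint_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


ckpt = latest_checkpoint()
if ckpt is None:
    print("No checkpoint found; start training from scratch.")
else:
    print(f"python main.py --resume {ckpt} --lr 1e-4 --epoch 200")
```

The printed line mirrors the resume command above with the selected file substituted in.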