[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/r-kowalczyk/graph-link-prediction/blob/main/notebooks/colab_runner.ipynb)

# Graph Link Prediction - Google Colab Runner

This notebook provides a ready-to-use environment for running the hybrid graph link prediction pipeline on Google Colab with GPU acceleration.

## What This Notebook Does

This notebook trains a link prediction model that combines:
- **Structural embeddings**: Node2Vec captures graph topology and connectivity patterns
- **Semantic embeddings**: Transformer models capture textual descriptions of nodes

The pipeline predicts biomedical links (edges) between nodes based on both structural and semantic information.
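
As a rough illustration of what combining the two embedding types can mean, the sketch below concatenates a node's structural and semantic vectors, then builds features for a candidate edge from the element-wise (Hadamard) product of its two endpoint vectors. This is an assumption for illustration only: the function names are hypothetical, and the package may combine embeddings or build pair features differently.

```python
from typing import List, Sequence


def hybrid_node_vector(structural: Sequence[float], semantic: Sequence[float]) -> List[float]:
    """Concatenate structural and semantic embeddings (assumed combination rule)."""
    return list(structural) + list(semantic)


def pair_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Element-wise product of two node vectors, one common edge-feature operator."""
    if len(u) != len(v):
        raise ValueError("node vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


# Toy vectors for illustration, not real model output
node_a = hybrid_node_vector([1.0, 2.0], [3.0])
node_b = hybrid_node_vector([2.0, 0.5], [1.0])
edge_features = pair_features(node_a, node_b)
```

A classifier such as the logistic regression or MLP mentioned later would then take vectors like `edge_features` as input.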

Execute the cells below in order:
1. **Install Dependencies**: Installs PyTorch Geometric and required packages
2. **Install Package**: Installs the graph-link-prediction package from GitHub and downloads the config files
3. **Run Training**: Executes the full training pipeline


After training completes, check the run's output directory (`artifacts/<timestamp>/`) for:
- `metrics.json`: Performance metrics (ROC-AUC, PR-AUC, F1, Precision@k, etc.)
- `curves/roc.png`: ROC curve visualisation
- `curves/pr.png`: Precision-Recall curve visualisation
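
Of the metrics listed, Precision@k is probably the least familiar. The sketch below uses the standard definition (the fraction of the k highest-scored candidate edges that are true links). The function name is hypothetical, and the package's own implementation may handle ties or the choice of k differently.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the k highest-scored candidates whose label is 1."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of candidates")
    # Rank candidates by descending score; ties keep their input order
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k


# Toy example: four candidate edges, two of which are true links
example = precision_at_k([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], k=2)
```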

## Configuration

The default configuration uses:
- **Variant**: `hybrid` (combines structural + semantic embeddings)
- **Config file**: `configs/full.yaml`
- **Model**: `bioformers/bioformer-16L` (biomedical transformer)

To use different settings, modify the training command in the last cell or create a custom config file.
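
One way to derive a custom config is to load the default one, override a few values, and save the result under a new name. The sketch below shows only the merge step on plain dictionaries. The helper name is hypothetical, and apart from `device` (mentioned under Step 3, and assumed here to be a top-level key) the keys are placeholders rather than real options from `configs/full.yaml`.

```python
import copy
from typing import Any, Dict


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder config: apart from `device`, these keys are hypothetical
base_config = {"device": "cuda", "example_section": {"example_option": 1, "other_option": 2}}
custom_config = apply_overrides(
    base_config, {"device": "cpu", "example_section": {"example_option": 5}}
)
```

In the notebook, `base_config` would typically come from `yaml.safe_load` on `configs/full.yaml`, and the merged result could be written with `yaml.safe_dump` to a new file (for example a hypothetical `configs/custom.yaml`) that is then passed to `--config`.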


## Step 1: Install PyTorch Geometric Dependencies

This cell installs PyTorch Geometric and its required dependencies. These are needed for the Node2Vec structural embedding algorithm, which leverages GPU acceleration when available.

The installation detects your PyTorch and CUDA versions automatically and installs matching wheels. The cell then mounts Google Drive, which is only needed if you want to save results there.
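
The wheel selection can be sketched as a small pure function that mirrors the logic of the install cell: strip any local suffix from the PyTorch version, turn the CUDA version into a `cuXYZ` tag (or `cpu`), and build the wheel index URL. The function name is hypothetical and the version strings are examples only.

```python
from typing import Optional

PYG_WHEEL_BASE = "https://data.pyg.org/whl"


def pyg_wheel_index(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the PyG wheel index URL for a given PyTorch/CUDA combination."""
    # "2.3.0+cu121" -> "2.3.0"
    base_version = torch_version.split("+")[0]
    # "12.1" -> "cu121"; no CUDA -> "cpu"
    suffix = "cu" + cuda_version.replace(".", "") if cuda_version else "cpu"
    return f"{PYG_WHEEL_BASE}/torch-{base_version}+{suffix}.html"


# Example version strings, as torch.__version__ and torch.version.cuda might report them
gpu_index = pyg_wheel_index("2.3.0+cu121", "12.1")
cpu_index = pyg_wheel_index("2.3.0", None)
```

If no wheel index exists for the detected combination, pip may fall back to building the extension packages from source, which can be slow on Colab.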


In [None]:
# Install PyTorch Geometric dependencies compatible with the current PyTorch version
import torch
from google.colab import drive


def install_pytorch_geometric():
    """Install PyTorch Geometric and dependencies matching the current PyTorch/CUDA setup."""
    TORCH_VERSION = torch.__version__.split("+")[0]
    CUDA_VERSION = torch.version.cuda.replace(".", "") if torch.version.cuda else None
    _BASE_URL = "https://data.pyg.org/whl"

    if CUDA_VERSION:
        # GPU-enabled installation
        print(
            f"Installing PyTorch Geometric for PyTorch {TORCH_VERSION} with CUDA {CUDA_VERSION}"
        )
        %pip install -q torch-scatter -f {_BASE_URL}/torch-{TORCH_VERSION}+cu{CUDA_VERSION}.html
        %pip install -q torch-sparse -f {_BASE_URL}/torch-{TORCH_VERSION}+cu{CUDA_VERSION}.html
        %pip install -q torch-cluster -f {_BASE_URL}/torch-{TORCH_VERSION}+cu{CUDA_VERSION}.html
        %pip install -q torch-spline-conv -f {_BASE_URL}/torch-{TORCH_VERSION}+cu{CUDA_VERSION}.html
        %pip install -q pyg-lib -f {_BASE_URL}/torch-{TORCH_VERSION}+cu{CUDA_VERSION}.html
    else:
        # CPU-only installation
        print(f"Installing PyTorch Geometric for PyTorch {TORCH_VERSION} (CPU only)")
        %pip install -q torch-scatter -f {_BASE_URL}/torch-{TORCH_VERSION}+cpu.html
        %pip install -q torch-sparse -f {_BASE_URL}/torch-{TORCH_VERSION}+cpu.html
        %pip install -q torch-cluster -f {_BASE_URL}/torch-{TORCH_VERSION}+cpu.html
        %pip install -q torch-spline-conv -f {_BASE_URL}/torch-{TORCH_VERSION}+cpu.html
        %pip install -q pyg-lib -f {_BASE_URL}/torch-{TORCH_VERSION}+cpu.html
    
    %pip install -q torch-geometric
    print("PyTorch Geometric installation complete!")


install_pytorch_geometric()

# Mount Google Drive (optional: comment out the next line unless you want to save results to Drive)
drive.mount("/content/drive")

## Step 2: Install the Graph Link Prediction Package

This cell installs the `graph-link-prediction` package from GitHub. The package includes all the training, evaluation, and embedding code needed to run the pipeline.

In [None]:
%pip install -q git+https://github.com/r-kowalczyk/graph-link-prediction.git

In [None]:
# Prepare configs locally (full.yaml, test.yaml)
!mkdir -p configs
!curl -sSLo configs/full.yaml https://raw.githubusercontent.com/r-kowalczyk/graph-link-prediction/main/configs/full.yaml
!curl -sSLo configs/test.yaml https://raw.githubusercontent.com/r-kowalczyk/graph-link-prediction/main/configs/test.yaml


## Step 3: Run the Training Pipeline

This cell executes the full training pipeline. It will:

1. **Load data** from the configured data directory (`graph-link-prediction/data/` by default)
2. **Generate embeddings**:
   - Structural embeddings using Node2Vec (runs on GPU if available)
   - Semantic embeddings using the transformer model (downloads model on first run)
3. **Train models**:
   - Logistic Regression baseline
   - MLP with hyperparameter search
4. **Evaluate** on validation and test sets
5. **Save results** to `artifacts/<timestamp>/`:
   - `metrics.json`: All performance metrics
   - `curves/roc.png`: ROC curve
   - `curves/pr.png`: Precision-Recall curve
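
Because each run writes to its own subdirectory, it can be handy to locate the most recent one programmatically. The sketch below picks the most recently modified subdirectory of `artifacts/` and loads its `metrics.json`. The helper names are hypothetical, and the internal structure of `metrics.json` is not assumed beyond it being valid JSON.

```python
import json
from pathlib import Path
from typing import Any


def latest_run_dir(artifacts_root: str = "artifacts") -> Path:
    """Return the most recently modified run directory under `artifacts_root`."""
    runs = [path for path in Path(artifacts_root).iterdir() if path.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no run directories found under {artifacts_root!r}")
    return max(runs, key=lambda path: path.stat().st_mtime)


def load_metrics(run_dir: Path) -> Any:
    """Parse the metrics.json file written by a training run."""
    with open(run_dir / "metrics.json", encoding="utf-8") as handle:
        return json.load(handle)
```

After training, `load_metrics(latest_run_dir())` returns the parsed metrics, which can be printed with `json.dumps(metrics, indent=2)`.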

**Note**: The first run will download the transformer model (`bioformers/bioformer-16L`), which may take a few minutes. Subsequent runs will use the cached model.

### Configuration Options

You can modify the training command below to:
- Use a different variant: `--variant structural` (structural only) or `--variant semantic` (semantic only)
- Use a different config file: for example `--config configs/test.yaml`
- Run on CPU explicitly: Set `device: cpu` in the config file
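
If you prefer to launch runs from Python rather than with a `!` shell line, the options above can be assembled into an argument list. This is a minimal sketch: the helper name is hypothetical, while the module path, flags, and variant names are taken from the training command and the list above.

```python
import sys
from typing import List

VARIANTS = ("hybrid", "structural", "semantic")


def build_train_command(config: str, variant: str = "hybrid") -> List[str]:
    """Assemble the training command as an argument list for subprocess."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [
        sys.executable, "-m", "graph_lp.train",
        "--config", config,
        "--variant", variant,
    ]
```

For example, `subprocess.run(build_train_command("configs/test.yaml", "structural"), check=True)` would run the structural-only variant on the small test config and raise an error if training exits with a non-zero status.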


In [None]:
# Run the training pipeline with hybrid embeddings (structural + semantic)
!python -m graph_lp.train --config configs/full.yaml --variant hybrid

## Optional: Run with the small test config

Use the lightweight test configuration to verify the pipeline quickly:

In [None]:
# Quick test run (small/fast)
!python -m graph_lp.train --config configs/test.yaml --variant hybrid