[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/r-kowalczyk/graph-link-prediction/blob/main/notebooks/colab_runner.ipynb)

# Graph Link Prediction - Google Colab Runner

This notebook provides a ready-to-use environment for running the hybrid graph link prediction pipeline on Google Colab with GPU acceleration.

## What This Notebook Does

This notebook trains a link prediction model that combines:
- **Structural embeddings**: Node2Vec captures graph topology and connectivity patterns
- **Semantic embeddings**: Transformer models capture textual descriptions of nodes

The pipeline predicts biomedical links (edges) between nodes based on both structural and semantic information.
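
As a rough illustration of what "combining" the two embedding types can look like, the sketch below concatenates each node's structural and semantic vectors and builds features for a candidate edge from the element-wise (Hadamard) product of its two endpoints. This is an assumed scheme chosen for illustration only; the package may combine embeddings differently, and the function names, node names, and toy vectors are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def hybrid_node_vector(structural: Vector, semantic: Vector) -> Vector:
    """Concatenate a node's structural and semantic embeddings (assumed scheme)."""
    return structural + semantic


def edge_features(u: Vector, v: Vector) -> Vector:
    """Element-wise (Hadamard) product of two node vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("node vectors must have the same length")
    return [a * b for a, b in zip(u, v)]


# Toy embeddings for two hypothetical nodes; real embeddings are much longer.
structural: Dict[str, Vector] = {"drug_a": [1.0, 0.0], "gene_b": [0.5, 2.0]}
semantic: Dict[str, Vector] = {"drug_a": [0.2, 0.4, 0.6], "gene_b": [1.0, 0.5, 0.0]}

node_vectors = {
    name: hybrid_node_vector(structural[name], semantic[name]) for name in structural
}
features = edge_features(node_vectors["drug_a"], node_vectors["gene_b"])
print(len(features))  # 2 structural + 3 semantic dimensions
```

A feature vector of this kind is what a downstream classifier would consume to score whether the edge exists.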

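After completing the prerequisites (data download and Google Drive mount), execute the cells below in order:
1. **Install Dependencies**: Installs PyTorch Geometric and required packages
2. **Install Package**: Installs the graph-link-prediction package from GitHub and downloads the config files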
3. **Run Training**: Executes the full training pipeline


After training completes, check the run's `artifacts/<timestamp>/` directory for:
- `metrics.json`: Performance metrics (ROC-AUC, PR-AUC, F1, Precision@k, etc.)
- `curves/roc.png`: ROC curve visualisation
- `curves/pr.png`: Precision-Recall curve visualisation

## Configuration

The default configuration uses:
- **Variant**: `hybrid` (combines structural + semantic embeddings)
- **Config file**: `configs/full.yaml`
- **Model**: `bioformers/bioformer-16L` (biomedical transformer)

To use different settings, modify the training command in Step 3 or create a custom config file.


## Prerequisites - Get the data and save it to your Google Drive

1) Download the data folder from [here](https://drive.google.com/drive/folders/1XbGVQiNid29Mxt1tjR7gitfxaVwJvc0t?usp=sharing).

2) Save the folder in the root of your Google Drive (see configs for expected paths).

3) Run the cell below to mount your Google Drive and make it available to the notebook.

In [None]:
from google.colab import drive

drive.mount("/content/drive")
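
After mounting, it can help to confirm that the data folder is where the pipeline expects it before starting a long run. The sketch below is a minimal check under assumptions: Colab exposes "My Drive" at `/content/drive/MyDrive` when mounted as above, and the folder name is a guess based on the default data directory (`graph-link-prediction/data/`). Adjust the path to match your configs; `missing_paths` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[Path]) -> List[Path]:
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not path.exists()]


# Assumed location of the downloaded folder inside the mounted Drive.
data_dir = Path("/content/drive/MyDrive/graph-link-prediction/data")

missing = missing_paths([data_dir])
if missing:
    print("Not found:", ", ".join(str(path) for path in missing))
else:
    print("Data folder found:", data_dir)
```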

## Step 1: Install PyTorch Geometric Dependencies

This cell installs PyTorch Geometric and its required dependencies. These are needed for the Node2Vec structural embedding algorithm, which leverages GPU acceleration when available.

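The cell first pins PyTorch to 2.4.0 with CUDA 12.1, then reads the installed PyTorch and CUDA versions to select the matching wheel index (for the pinned versions, `https://data.pyg.org/whl/torch-2.4.0+cu121.html`).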


In [None]:
# Pin PyTorch to 2.4.0+cu121 because PyTorch Geometric publishes wheels for this pair, which prevents slow source builds in Colab.
%pip install -q torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

import torch

def install_pytorch_geometric():
    """Install PyTorch Geometric wheels for the pinned torch 2.4.0 CUDA 12.1 stack.

    Parameters: None
    Uses the existing torch installation to derive the wheel feed.
    Expects torch to be preinstalled at 2.4.0 with CUDA 12.1.
    Installs binary wheels so the GPU runtime stays ready without long builds.
    """
    torch_version = torch.__version__.split("+")[0]
    # torch.version.cuda is None on CPU-only builds; otherwise "12.1" becomes the tag "cu121".
    wheel_tag = f"cu{torch.version.cuda.replace('.', '')}" if torch.version.cuda else "cpu"
    wheel_index = f"https://data.pyg.org/whl/torch-{torch_version}+{wheel_tag}.html"

    # Use the wheel feed that matches the pinned torch and CUDA versions so pip downloads binaries instead of compiling from source.
    print(f"Installing PyTorch Geometric for PyTorch {torch_version} using wheel tag {wheel_tag}")
    %pip install -q torch-scatter torch-sparse torch-cluster torch-spline-conv pyg-lib -f {wheel_index}

    %pip install -q torch-geometric
    print("PyTorch Geometric installation complete!")


install_pytorch_geometric()
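
Because the installs above run quietly (`-q`), a failed wheel download is easy to miss. The following is a small sketch for checking that the expected modules can be located afterwards; `missing_modules` is a hypothetical helper, and the list uses the import names of the packages installed above (pip names use hyphens, import names use underscores).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


expected = [
    "torch",
    "torch_geometric",
    "torch_scatter",
    "torch_sparse",
    "torch_cluster",
    "torch_spline_conv",
    "pyg_lib",
]
print("Missing modules:", missing_modules(expected) or "none")
```

`find_spec` only locates a module without importing it, so this check is fast but does not prove that a compiled extension loads correctly on the current GPU runtime.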

## Step 2: Install the Graph Link Prediction Package

This cell installs the `graph-link-prediction` package from GitHub. The package includes all the training, evaluation, and embedding code needed to run the pipeline.

In [None]:
%pip install -q git+https://github.com/r-kowalczyk/graph-link-prediction.git

In [None]:
# Prepare configs locally (full.yaml, test.yaml).
# -f makes curl exit with an error on HTTP failures instead of saving the error page as a config file.
!mkdir -p configs
!curl -fsSLo configs/full.yaml https://raw.githubusercontent.com/r-kowalczyk/graph-link-prediction/main/configs/full.yaml
!curl -fsSLo configs/test.yaml https://raw.githubusercontent.com/r-kowalczyk/graph-link-prediction/main/configs/test.yaml


## Step 3: Run the Training Pipeline

This cell executes the full training pipeline. It will:

1. **Load data** from the configured data directory (`graph-link-prediction/data/` by default)
2. **Generate embeddings**:
   - Structural embeddings using Node2Vec (runs on GPU if available)
   - Semantic embeddings using the transformer model (downloads model on first run)
3. **Train models**:
   - Logistic Regression baseline
   - MLP with hyperparameter search
4. **Evaluate** on validation and test sets
5. **Save results** to `artifacts/<timestamp>/`:
   - `metrics.json`: All performance metrics
   - `curves/roc.png`: ROC curve
   - `curves/pr.png`: Precision-Recall curve
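
Once a run has finished, the saved metrics can be inspected directly in the notebook. The sketch below assumes only what is stated above: each run writes a `metrics.json` into its own folder under `artifacts/`. The helper names are hypothetical, and since the exact keys in the file are not documented here, the code prints whatever it finds rather than assuming specific metric names.

```python
import json
from pathlib import Path
from typing import Any, Optional


def latest_run_dir(artifacts_root: Path) -> Optional[Path]:
    """Return the most recently modified run directory that contains metrics.json."""
    if not artifacts_root.is_dir():
        return None
    runs = [p for p in artifacts_root.iterdir() if (p / "metrics.json").is_file()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def load_metrics(run_dir: Path) -> Any:
    """Parse metrics.json from a run directory."""
    with open(run_dir / "metrics.json", encoding="utf-8") as handle:
        return json.load(handle)


run_dir = latest_run_dir(Path("artifacts"))
if run_dir is None:
    print("No completed run found under artifacts/")
else:
    # The keys depend on the pipeline; print the file contents as-is.
    print(json.dumps(load_metrics(run_dir), indent=2))
```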

**Note**: The first run will download the transformer model (`bioformers/bioformer-16L`), which may take a few minutes. Subsequent runs will use the cached model.

### Configuration Options

You can modify the training command below to:
- Use a different variant: `--variant structural` (structural only) or `--variant semantic` (semantic only)
- Use a different config file: for example, `--config configs/test.yaml`
- Run on CPU explicitly: set `device: cpu` in the config file
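
For the CPU option, one way to avoid editing the downloaded file by hand is to derive a modified copy. The sketch below works on the config as plain text and assumes `device` is a top-level (unindented) key, which is not confirmed here; if the key is nested in the real config, edit the file manually instead. `set_top_level_key` is a hypothetical helper, and the example content (`seed: 42`) is made up for illustration.

```python
import re


def set_top_level_key(config_text: str, key: str, value: str) -> str:
    """Set `key: value` at the top level of a YAML document given as text.

    Replaces the first unindented `key:` line, or appends one if none exists.
    """
    pattern = re.compile(rf"^{re.escape(key)}[ \t]*:.*$", flags=re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(config_text):
        # A callable replacement avoids backslash interpretation in the value.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{separator}{new_line}\n"


# Made-up config content, used only to show the behaviour.
example = "seed: 42\ndevice: cuda\n"
print(set_top_level_key(example, "device", "cpu"))
```

To apply it, read `configs/full.yaml`, write the result to a new file of your choosing, and pass that file to `--config`.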


In [None]:
# Run the training pipeline with hybrid embeddings (structural + semantic)
!python -m graph_lp.train --config configs/full.yaml --variant hybrid

## Optional: Run with the small test config

Use the lightweight test configuration to verify the pipeline quickly:

In [None]:
# Quick test run (small/fast)
!python -m graph_lp.train --config configs/test.yaml --variant hybrid
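
To compare the three variants on the small config, the same command can be assembled in Python. This is a sketch that uses only the `--config` and `--variant` flags and the variant names shown in this notebook; `build_train_command` is a hypothetical helper, and the loop only prints the commands rather than running them.

```python
import shlex
import sys
from typing import List

VARIANTS = ("hybrid", "structural", "semantic")


def build_train_command(config: str, variant: str) -> List[str]:
    """Build the argument list for one training run."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return [sys.executable, "-m", "graph_lp.train", "--config", config, "--variant", variant]


for variant in VARIANTS:
    command = build_train_command("configs/test.yaml", variant)
    # Print only; pass `command` to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```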