[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/r-kowalczyk/graph-link-prediction/blob/main/notebooks/colab_runner.ipynb)

# Graph Link Prediction - Google Colab Runner

This notebook runs the project with the same CLI as the local quickstart (`graph-lp`), but with a Colab-oriented configuration that expects the full dataset on Google Drive and can use a GPU.

## Two ways to run

- **Quickstart sanity run (CPU, bundled data)**: uses `configs/quickstart.yaml` and finishes quickly (the same run as the README quickstart).
- **Full run (Drive data, optional GPU)**: uses `configs/full.yaml` and trains the full hybrid pipeline.

Both modes use the same CLI commands. The only thing that changes is the config file, which controls the data paths, model choice, and output directory.

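To see exactly what differs between the two modes, you can diff the two config files. The snippet below is a minimal sketch using only the standard library; `diff_configs` is a helper name introduced here for illustration and is not part of the `graph-lp` CLI. It assumes the repository has already been cloned and that the working directory is the repository root.

```python
import difflib
from pathlib import Path


def diff_configs(first_path, second_path):
    """Return a unified diff of two config files as a single string."""
    first_lines = Path(first_path).read_text().splitlines(keepends=True)
    second_lines = Path(second_path).read_text().splitlines(keepends=True)
    diff = difflib.unified_diff(
        first_lines,
        second_lines,
        fromfile=str(first_path),
        tofile=str(second_path),
    )
    return "".join(diff)


# Only runs when both config files are present (i.e. after the clone in Step 2).
quickstart_config = Path("configs/quickstart.yaml")
full_config = Path("configs/full.yaml")
if quickstart_config.exists() and full_config.exists():
    print(diff_configs(quickstart_config, full_config))
```

An empty result means the two files are identical; any line prefixed with `-` or `+` is a setting that differs between the quickstart and full runs.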
## What you get

After training completes, check the output directory for:

- `metrics.json`: performance metrics (ROC-AUC, PR-AUC, F1, Precision@k, and more)
- `config_used.yaml`: the exact config text used for the run
- `curves/roc.png`: ROC curve visualisation
- `curves/pr.png`: Precision-Recall curve visualisation

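As a quick way to inspect these files from the notebook, the sketch below finds the most recently modified run directory and prints its `metrics.json`. The helper names `latest_run_directory` and `load_metrics` are hypothetical, and the code assumes each run writes into its own subdirectory of the output directory (for example `artifacts_quickstart/<timestamp>/`). It makes no assumption about which keys the metrics file contains.

```python
import json
from pathlib import Path


def latest_run_directory(base_directory):
    """Return the most recently modified run subdirectory, or None if there is none."""
    base = Path(base_directory)
    if not base.is_dir():
        return None
    run_directories = [path for path in base.iterdir() if path.is_dir()]
    if not run_directories:
        return None
    # Modification time is used because the timestamp format of the
    # directory names is not assumed here.
    return max(run_directories, key=lambda path: path.stat().st_mtime)


def load_metrics(run_directory):
    """Read metrics.json from a run directory and return the parsed JSON."""
    with open(Path(run_directory) / "metrics.json", encoding="utf-8") as handle:
        return json.load(handle)


run_directory = latest_run_directory("artifacts_quickstart")
if run_directory is not None and (run_directory / "metrics.json").exists():
    print(json.dumps(load_metrics(run_directory), indent=2))
```

For a full run, pass the Drive output directory instead of `artifacts_quickstart`.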

## Prerequisites

If you want to run the full pipeline on the full dataset, you need the data on Google Drive.

1) Download the data folder from [here](https://drive.google.com/drive/folders/1XbGVQiNid29Mxt1tjR7gitfxaVwJvc0t?usp=sharing).

2) Save the folder in the root of your Google Drive (see `configs/full.yaml` for the expected paths).

3) Run the next cell to mount your Google Drive so the notebook can read the CSV files.

If you only want to run the quickstart sanity run, you can skip the data download, because `configs/quickstart.yaml` uses a tiny bundled dataset in the repository.

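Once Drive is mounted, a short check like the one below can confirm that the data folder landed where the full run expects it. This is a sketch: `list_csv_files` is a helper introduced here for illustration, the path is the one given in the Step 4 description, and the code assumes the CSV files sit directly inside that directory rather than in nested folders.

```python
from pathlib import Path


def list_csv_files(data_directory):
    """Return sorted CSV file names in data_directory, or [] if it is missing."""
    directory = Path(data_directory)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.glob("*.csv"))


# Path taken from the Step 4 description; adjust it if your Drive layout differs.
drive_data_directory = "/content/drive/MyDrive/graph-link-prediction-files/data/full"
csv_files = list_csv_files(drive_data_directory)
if csv_files:
    print("Found CSV files:", csv_files)
else:
    print("No CSV files found under", drive_data_directory)
```

If no files are found, compare the folder location on Drive with the paths in `configs/full.yaml` before starting the full run.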

In [None]:
from google.colab import drive

drive.mount("/content/drive")

## Step 1: Install PyTorch Geometric Dependencies

This cell installs PyTorch Geometric and its required dependencies. These are needed for the Node2Vec structural embedding algorithm, which is used by the full hybrid run.

If you only plan to run the quickstart sanity run (semantic-only), you can skip this step.

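The installation pins PyTorch to 2.8.0 and checks for an NVIDIA runtime (via `nvidia-smi`) to choose between the CUDA 12.6 and CPU wheels, so the PyTorch Geometric extension packages install from prebuilt wheels instead of being built from source.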

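The PyTorch Geometric wheel index URL encodes the PyTorch version and the build tag, with the `+` percent-encoded as `%2B`. The sketch below shows how the two URLs used in the install cell are formed; `pyg_wheel_index_url` is a hypothetical helper, and only the `2.8.0` with `cu126` or `cpu` combinations come from this notebook. Other version and tag combinations may not have published wheels.

```python
from urllib.parse import quote


def pyg_wheel_index_url(torch_version, build_tag):
    """Build the PyG wheel index URL for a torch version and build tag."""
    # The "+" separating version and build tag must be percent-encoded as %2B.
    local_version = quote(f"{torch_version}+{build_tag}", safe="")
    return f"https://data.pyg.org/whl/torch-{local_version}.html"


print(pyg_wheel_index_url("2.8.0", "cu126"))
# https://data.pyg.org/whl/torch-2.8.0%2Bcu126.html
print(pyg_wheel_index_url("2.8.0", "cpu"))
# https://data.pyg.org/whl/torch-2.8.0%2Bcpu.html
```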

In [None]:
# PyTorch Geometric publishes prebuilt wheels for this PyTorch version and build
# (CUDA 12.6 or CPU), so pinning torch to 2.8.0 avoids slow source builds in Colab.
import subprocess

def _has_nvidia_runtime() -> bool:
    try:
        subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        return True
    except Exception:
        return False

# Pin PyTorch to 2.8.0. Use CUDA wheels when available; otherwise use CPU wheels.
if _has_nvidia_runtime():
    torch_index_url = "https://download.pytorch.org/whl/cu126"
    pyg_wheel_url = "https://data.pyg.org/whl/torch-2.8.0%2Bcu126.html"
    print("Detected NVIDIA runtime. Installing PyTorch 2.8.0 (CUDA 12.6 wheels).")
else:
    torch_index_url = "https://download.pytorch.org/whl/cpu"
    pyg_wheel_url = "https://data.pyg.org/whl/torch-2.8.0%2Bcpu.html"
    print("No NVIDIA runtime detected. Installing PyTorch 2.8.0 (CPU wheels).")

# Note: build tags (+cu126 / +cpu) are present in wheel filenames, but pip can match them with torch==2.8.0.
%pip install -q torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url {torch_index_url}

# Install PyTorch Geometric binary wheels that match the pinned torch build (avoids source builds).
%pip install -q pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f {pyg_wheel_url}
%pip install -q torch-geometric

import torch
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)

## Step 2: Install the repository (editable)

This cell clones the repository and installs it in editable mode.

Using an editable install keeps the notebook consistent with the local README workflow and makes the config files (`configs/*.yaml`) available without downloading them separately.


In [None]:
!git clone https://github.com/r-kowalczyk/graph-link-prediction.git
%cd graph-link-prediction

# Install in editable mode so the notebook and the repository code stay in sync.
%pip install -q -e .


In [None]:
# Confirm the CLI is available and the repository configs are present.
!graph-lp --help
!ls -lah configs


## Step 3: Quickstart sanity run (recommended)

This cell runs a small end-to-end training and evaluation.

- **Config**: `configs/quickstart.yaml`
- **Data**: bundled CSV files in the repository
- **Device**: CPU
- **Output**: `artifacts_quickstart/<timestamp>/`

Use it to check that the runtime and dependencies are working before starting the full run.


In [None]:
# Quickstart sanity run (small CPU run on bundled data)
!graph-lp train --config configs/quickstart.yaml --device cpu --seed 42
!graph-lp evaluate --config configs/quickstart.yaml --device cpu


## Step 4: Full run (Drive data, optional GPU)

If you are in Google Colab and have credits to spend on hardware acceleration, you can run the full pipeline using the cell below.

This cell runs the full hybrid pipeline using the dataset you placed on Google Drive (see Prerequisites for access to the full data).

- **Config**: `configs/full.yaml`
- **Data**: expected under `/content/drive/MyDrive/graph-link-prediction-files/data/full`
- **Variant**: `hybrid`
- **Device**: set to `auto` so CUDA is used when available

The output directory is set explicitly so artefacts are stored on Drive rather than only in the Colab runtime.

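The behaviour described for `auto` can be pictured with the sketch below. It illustrates the selection rule stated above and is not the project's actual implementation; `resolve_device` is a name introduced here. Running it before the full run shows which device `auto` would be expected to pick in the current runtime.

```python
def resolve_device(requested, cuda_available):
    """Map a requested device string to a concrete one (illustrative only)."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested


# torch is installed in Step 1; fall back to "no CUDA" if it is missing.
try:
    import torch

    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

print("auto resolves to:", resolve_device("auto", cuda_available))
```

If this prints `cpu` in a runtime where you expected a GPU, check the Colab runtime type before starting the full run.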

In [None]:
# Full training run (hybrid embeddings, Drive dataset)
full_output_directory = "/content/drive/MyDrive/graph-link-prediction-files/artifacts_full"

!graph-lp train --config configs/full.yaml --variant hybrid --device auto --seed 42 --output-dir {full_output_directory}
!graph-lp evaluate --config configs/full.yaml --device auto --output-dir {full_output_directory}
