# End-to-end PyTorch Geometric training, validation, and testing tutorial for a TopologicPy Dataset

## Expected Folder Contents

- `graphs.csv`  
- `nodes.csv`  
- `edges.csv`  
- `meta.yaml`  
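
A quick pre-flight check can confirm that a dataset folder has this layout before anything is loaded. The sketch below is illustrative and not part of TopologicPy; `missing_dataset_files` is a hypothetical helper name. It treats the three CSV files as required and `meta.yaml` as optional, because the loading cell later in this tutorial accepts a missing `meta.yaml`.

```python
from pathlib import Path

# File names taken from the list above; the required/optional distinction is an assumption.
REQUIRED_FILES = ("graphs.csv", "nodes.csv", "edges.csv")
OPTIONAL_FILES = ("meta.yaml",)

def missing_dataset_files(folder):
    """Return (missing_required, missing_optional) file names for `folder`."""
    folder = Path(folder)
    missing_required = [n for n in REQUIRED_FILES if not (folder / n).is_file()]
    missing_optional = [n for n in OPTIONAL_FILES if not (folder / n).is_file()]
    return missing_required, missing_optional

# Example usage: replace "." with the path to your dataset folder.
required, optional = missing_dataset_files(".")
if required:
    print("Missing required files:", ", ".join(required))
if optional:
    print("Missing optional files:", ", ".join(optional))
```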

---

## This Script

1. Loads the dataset with the PyG helper class  
2. Builds a graph-level GNN model (graph classification by default)  
3. Trains with a train/val split  
4. Evaluates on validation and test sets  
5. Visualizes learning curves, confusion matrix, and prints evaluation metrics  

---

## Notes

- **Requires:** `topologicpy >= 0.9.4`, `torch`, `torch_geometric`, `pandas`, `pyyaml`, `numpy`, `plotly`, `scikit-learn`
- **Example datasets can be found at** `https://github.com/wassimj/topologicpy/tree/main/assets/MachineLearning`

### Installation Example

```bash
pip install torch pandas pyyaml numpy plotly scikit-learn
# then install torch-geometric following their official instructions for your OS/CUDA
```


In [1]:
# This cell is not needed if you have pip installed topologicpy
import sys
sys.path.append("C:/Users/sarwj/OneDrive - Cardiff University/Documents/GitHub/topologicpy/src")

### Import the needed libraries and add a utility function

In [2]:
from __future__ import annotations
import os
from pathlib import Path
import json
import yaml
from topologicpy.PyG import PyG
from topologicpy.Helper import Helper

def pretty_print_metrics(title: str, metrics: dict) -> None:
    print("\n" + "=" * 80)
    print(title)
    print("=" * 80)
    for k in sorted(metrics.keys()):
        v = metrics[k]
        if isinstance(v, float):
            print(f"{k:30s}: {v:.6f}")
        else:
            print(f"{k:30s}: {v}")
    print("=" * 80 + "\n")

### Check TopologicPy Version

In [3]:
print("The script is compatible with TopologicPy v0.9.4 or newer.")
print(Helper.Version())

The script is compatible with TopologicPy v0.9.4 or newer.
The version that you are using (0.9.5) is EQUAL TO the latest version available on PyPI.
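
The same minimum-version requirement can be checked programmatically. This is a minimal sketch that assumes the package is installed under the distribution name `topologicpy`; `parse_version` and `meets_minimum` are hypothetical helpers, not TopologicPy functions, and the parsing only handles simple dotted versions.

```python
import re
from importlib.metadata import PackageNotFoundError, version

def parse_version(text):
    """Turn a dotted version such as '0.9.5' into a tuple of ints, e.g. (0, 9, 5)."""
    numbers = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)  # ignore suffixes such as 'rc1'
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)

def meets_minimum(installed, minimum="0.9.4"):
    """True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)

try:
    installed = version("topologicpy")
    status = "OK" if meets_minimum(installed) else "older than 0.9.4"
    print(f"topologicpy {installed}: {status}")
except PackageNotFoundError:
    print("topologicpy is not installed in this environment")
```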


### Specify the Location of the Training Dataset

In [4]:
dataset_dir = Path(r"C:\Users\sarwj\OneDrive - Cardiff University\Documents\GitHub\topologicpy\assets\MachineLearning\training_dataset").resolve()

### Load the CSV Dataset (The Example Has Categorical Labels, Task is Graph-level Classification)

In [5]:
# Optional: read meta.yaml (purely informational)
meta_path = dataset_dir / "meta.yaml"
if meta_path.exists():
    meta = yaml.safe_load(meta_path.read_text(encoding="utf-8"))
    print("Loaded meta.yaml:")
    print(json.dumps(meta, indent=2))
else:
    print("meta.yaml not found (this is fine).")

# This dataset has graphs.csv with a categorical 'label' column -> graph classification.
pyg = PyG.ByCSVPath(
    path=str(dataset_dir),
    level="graph",              # "graph" | "node" | "edge" | "link"
    task="classification",      # "classification" | "regression" | "link_prediction"
    graphLabelType="categorical",
    nodeLabelType="categorical",
    edgeLabelType="categorical",
    # If your headers differ, override here (the example CSVs match the defaults):
    # graphIDHeader="graph_id", graphLabelHeader="label",
    # nodeIDHeader="node_id", nodeLabelHeader="label",
    # edgeSRCHeader="src_id", edgeDSTHeader="dst_id", edgeLabelHeader="label",
)

Loaded meta.yaml:
{
  "dataset_name": "topologic_training_dataset",
  "edge_data": [
    {
      "file_name": "edges.csv"
    }
  ],
  "node_data": [
    {
      "file_name": "nodes.csv"
    }
  ],
  "graph_data": {
    "file_name": "graphs.csv"
  }
}
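
Before training a classifier, it can be useful to look at how many graphs fall into each class. The sketch below reads `graphs.csv` with the standard library only. It assumes one row per graph and a `label` column (the default header named in the commented overrides above); `label_counts` and the folder name in the usage lines are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path

def label_counts(graphs_csv, label_header="label"):
    """Count how many rows of graphs.csv carry each label value (values are read as strings)."""
    with open(graphs_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Raises KeyError if the label column has a different header.
        return Counter(row[label_header] for row in reader)

# Example usage: point this at your own dataset folder (hypothetical path).
csv_path = Path("training_dataset") / "graphs.csv"
if csv_path.is_file():
    for label, count in sorted(label_counts(csv_path).items()):
        print(f"class {label}: {count} graphs")
```

A strongly unbalanced count is worth keeping in mind when reading accuracy figures later on.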


### Set Hyperparameters

In [6]:
pyg.SetHyperparameters(
    # splitting / determinism
    cv="holdout",
    split=(0.80, 0.10, 0.10),   # train/val/test
    random_state=42,
    shuffle=True,

    # training
    epochs=50,                 # note: the outputs recorded below come from a 5-epoch run
    batch_size=64,
    lr=1e-3,
    weight_decay=1e-4,
    optimizer="adamw",
    gradient_clip_norm=1.0,
    early_stopping=True,
    early_stopping_patience=12,
    use_gpu=True,              # will use CUDA if available

    # model
    conv="sage",               # "sage" | "gcn" | "gatv2"
    hidden_dims=(128, 128),    # depth = len(hidden_dims)
    activation="relu",         # "relu" | "gelu" | "elu"
    dropout=0.20,
    batch_norm=True,
    residual=True,
    pooling="mean",            # "mean" | "max" | "add" (graph-level only)
)
# Print a compact summary of the current config and inferred dims/classes
print("PyG config summary:")
print(pyg.Summary())

PyG config summary:
{'level': 'graph', 'task': 'classification', 'graph_label_type': 'categorical', 'node_label_type': 'categorical', 'edge_label_type': 'categorical', 'cv': 'holdout', 'split': (0.8, 0.1, 0.1), 'k_folds': 5, 'conv': 'sage', 'hidden_dims': (128, 128), 'activation': 'relu', 'dropout': 0.2, 'batch_norm': True, 'residual': True, 'pooling': 'mean', 'epochs': 5, 'batch_size': 64, 'lr': 0.001, 'weight_decay': 0.0001, 'optimizer': 'adamw', 'gradient_clip_norm': 1.0, 'early_stopping': True, 'early_stopping_patience': 12, 'device': 'cuda:0', 'num_graphs': 1496, 'num_outputs': 5}
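
To make the `split`, `shuffle`, and `random_state` settings concrete, here is a generic seeded holdout split over graph indices. It is a sketch of the general technique and not TopologicPy's internal code: `holdout_indices` is a hypothetical function, and rounding the train and validation sizes down is an assumption. For the 1496 graphs reported above it gives 1196 training graphs, which agrees with the number of graphs in the training confusion matrix printed later in this tutorial.

```python
import random

def holdout_indices(num_items, split=(0.80, 0.10, 0.10), seed=42, shuffle=True):
    """Partition range(num_items) into train/val/test index lists."""
    indices = list(range(num_items))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible for a given seed.
        random.Random(seed).shuffle(indices)
    n_train = int(split[0] * num_items)  # rounded down (assumption)
    n_val = int(split[1] * num_items)    # rounded down (assumption)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]     # the test set takes the remainder
    return train, val, test

train, val, test = holdout_indices(1496)
print(len(train), len(val), len(test))  # 1196 149 151
```

Passing `split=(0.0, 0.0, 1.0)` to the same function places every index in the test list, which mirrors the configuration used for the unseen dataset in Phase 2.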


### Train the Model

In [7]:
history = pyg.Train()  # returns dict of per-epoch curves (loss + metrics when available)
print(history)

{'train_loss': [1.0008644587115239, 0.4690109585460864, 0.27457169639436824, 0.20895084032886907, 0.16015207924340902], 'val_loss': [0.9058072765668234, 0.35710372527440387, 0.20246005554993948, 0.1602249046166738, 0.12145740414659183]}
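
The `early_stopping_patience` setting can be understood with a generic patience rule applied to the validation loss curve. The sketch below shows the common technique and is not necessarily how TopologicPy implements it; `early_stopping_epoch` and `min_delta` are hypothetical names. The loss values are the `val_loss` entries above, rounded to four decimals.

```python
def early_stopping_epoch(val_losses, patience=12, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None when the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch

val_loss = [0.9058, 0.3571, 0.2025, 0.1602, 0.1215]
print(early_stopping_epoch(val_loss, patience=12))  # (None, 4)
```

Because the validation loss improves in every recorded epoch, the rule never triggers here and the last epoch is the best one.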


### Validate the Model

In [8]:
val_metrics = pyg.Validate()
pretty_print_metrics("Validation metrics", val_metrics)



Validation metrics
val_accuracy                  : 0.953020
val_f1                        : 0.931666
val_precision                 : 0.913870
val_recall                    : 0.953020



### Test the Model

In [9]:
test_metrics = pyg.Test()
pretty_print_metrics("Test metrics", test_metrics)


Test metrics
test_accuracy                 : 0.973510
test_f1                       : 0.960911
test_precision                : 0.949484
test_recall                   : 0.973510



### Plot the Training and Validation Loss Curves

In [10]:
fig_hist = pyg.PlotHistory()
fig_hist.show()

### Plot the Confusion Matrix (For Categorical Labels)

In [11]:
fig_cm = pyg.PlotConfusionMatrix(split="train")
fig_cm.show()

[432, 0, 0, 0, 1]
[2, 428, 0, 0, 0]
[0, 0, 0, 0, 61]
[0, 0, 0, 46, 0]
[0, 0, 0, 0, 226]
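
The matrix above can be turned into summary metrics by hand. In the validation and test cells, recall equals accuracy, which is what support-weighted averaging produces, so the sketch below uses weighted averages. That choice is an assumption about how the metrics are computed, as is the convention that rows are true classes and columns are predicted classes; `weighted_metrics` is a hypothetical helper.

```python
def weighted_metrics(cm):
    """Accuracy and support-weighted precision/recall/F1 from a square confusion matrix.

    Assumes cm[i][j] counts items of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                         # items whose true class is i
        predicted = sum(cm[r][i] for r in range(n))  # items predicted as class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}

# Training confusion matrix copied from the output above.
train_cm = [
    [432, 0, 0, 0, 1],
    [2, 428, 0, 0, 0],
    [0, 0, 0, 0, 61],
    [0, 0, 0, 46, 0],
    [0, 0, 0, 0, 226],
]
for name, value in weighted_metrics(train_cm).items():
    print(f"{name:10s}: {value:.6f}")
```

Under the row convention assumed here, the third row shows that all 61 graphs of class 2 are predicted as class 4, so that class contributes zero recall even though overall accuracy stays high.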


### Save the Model

In [12]:
pyg.SaveModel(r"C:\Users\sarwj\OneDrive - Cardiff University\Desktop\pyg_model.pt")

## Phase 2: Prediction on an Unseen Dataset

### Load Testing Dataset

In [13]:
dataset_dir = Path(r"C:\Users\sarwj\OneDrive - Cardiff University\Documents\GitHub\topologicpy\assets\MachineLearning\testing_dataset_2").resolve()

pyg_2 = PyG.ByCSVPath(
    path=str(dataset_dir),
    level="graph",              # "graph" | "node" | "edge" | "link"
    task="classification",      # "classification" | "regression" | "link_prediction"
    graphLabelType="categorical",
    nodeLabelType="categorical",
    edgeLabelType="categorical",
    # If your headers differ, override here (the example CSVs match the defaults):
    # graphIDHeader="graph_id", graphLabelHeader="label",
    # nodeIDHeader="node_id", nodeLabelHeader="label",
    # edgeSRCHeader="src_id", edgeDSTHeader="dst_id", edgeLabelHeader="label",
)

### Load the Pre-trained Model

In [14]:
pyg_2.LoadModel(r"C:\Users\sarwj\OneDrive - Cardiff University\Desktop\pyg_model.pt")

### Make the Whole Dataset a Testing Dataset

In [15]:
pyg_2.SetHyperparameters(split=(0.0, 0.0, 1.0), shuffle=False)  # all graphs become test
print("PyG config summary:")
print(pyg_2.Summary())  # summarize the prediction object (pyg_2), not the training object (pyg)


### Predict the Dataset

In [16]:
pred_results = pyg_2.Predict()
indices = pred_results['index'].tolist()
predictions = pred_results['pred'].tolist()
probabilities = pred_results['prob'].tolist()
# Print one row per graph: index, predicted class, highest class probability
for idx, pred, probs in zip(indices, predictions, probabilities):
    print(idx, ",", pred, ",", round(max(probs), 2))

0 , 0 , 0.99
1 , 0 , 0.99
2 , 0 , 0.99
3 , 0 , 0.99
4 , 4 , 0.91
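
The printed rows can also be collected into records and screened for low-confidence predictions. This sketch works on plain Python lists shaped like the `indices`, `predictions`, and `probabilities` lists built in the cell above. `summarize_predictions` and the `0.95` threshold are hypothetical choices, and the probability rows in the usage lines are placeholders rather than real model output.

```python
def summarize_predictions(indices, predictions, probabilities, threshold=0.95):
    """Build one record per graph and flag predictions below the confidence threshold."""
    rows = []
    for idx, pred, probs in zip(indices, predictions, probabilities):
        confidence = max(probs)  # probability of the most likely class
        rows.append({
            "index": idx,
            "pred": pred,
            "confidence": round(confidence, 2),
            "review": confidence < threshold,
        })
    return rows

# Placeholder inputs for illustration only.
example_rows = summarize_predictions(
    indices=[0, 1],
    predictions=[0, 4],
    probabilities=[[0.99, 0.01], [0.09, 0.91]],
)
for row in example_rows:
    print(row)
```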


### Plot the Confusion Matrix (For Categorical Labels Only)

In [17]:
fig_cm = pyg_2.PlotConfusionMatrix(split="all")
fig_cm.show()

[4, 0]
[0, 1]
