# GraphMER-SE: TPU Training on Google Colab

**Dataset:** 30,826 triples (99.10% quality)  
**Hardware:** TPU v2-8 (8 cores)  
**Status:** Production-ready

---

## Setup Instructions

1. **Change Runtime to TPU:**
   - Runtime → Change runtime type → TPU → Save

2. **Run cells in order** (1 → 2 → 3 → ...)

3. **Wait for each cell to complete** before moving to the next one

---

## Cell 1: Verify TPU Access

In [None]:
# Verify that a TPU is available
try:
    import torch_xla
    import torch_xla.core.xla_model as xm
except ImportError:
    print("❌ torch-xla not found!")
    print("\nPlease change runtime to TPU:")
    print("Runtime → Change runtime type → TPU → Save")
    raise

device = xm.xla_device()
# Count the XLA devices visible to this process. xm.xrt_world_size() reports the
# replication world size, which is 1 in a notebook that has not spawned workers.
cores = len(xm.get_xla_supported_devices() or [])
print(f"✅ TPU device available: {device}")
print(f"✅ TPU cores: {cores}")
if cores != 8:
    print("⚠️  Warning: Expected 8 TPU cores, got", cores)

## Cell 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted")

## Cell 3: Extract Project from Drive

In [None]:
import os

# Extract package
print("📦 Extracting project...")
!tar -xzf /content/drive/MyDrive/GraphMER/graphmer_colab.tar.gz -C /content/

# Change to project directory
os.chdir('/content/colab_deploy')
print("✅ Project extracted")
print(f"📁 Working directory: {os.getcwd()}")

# List contents
!ls -lh
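
If extraction fails, or `os.chdir` raises `FileNotFoundError`, it can help to inspect the archive before unpacking it. The sketch below lists the top-level entries of a gzip-compressed tarball using only the standard `tarfile` module. The helper name `top_level_entries` is hypothetical and not part of the project; the expectation that the archive unpacks to `colab_deploy` is inferred from the `os.chdir` call above.

```python
import tarfile


def top_level_entries(archive_path):
    """Return the sorted top-level names inside a .tar.gz archive.

    Hypothetical helper for illustration; not part of the GraphMER-SE project.
    """
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member_name in tar.getnames():
            # Drop empty and "." components so "./colab_deploy/x" maps to "colab_deploy".
            parts = [p for p in member_name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)


# Example use in Colab (path taken from the cell above):
# print(top_level_entries("/content/drive/MyDrive/GraphMER/graphmer_colab.tar.gz"))
```

If the result does not include `colab_deploy`, adjust the `os.chdir` target to match what the archive actually contains.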

## Cell 4: Verify Data Files

In [None]:
# Check data directory
print("📊 Data files:")
!ls -lh data/

print("\n📈 Triple count:")
!wc -l data/enhanced_multilang.jsonl

# Verify count
import subprocess
result = subprocess.run(['wc', '-l', 'data/enhanced_multilang.jsonl'], 
                       capture_output=True, text=True)
count = int(result.stdout.split()[0])

if count >= 30000:
    print(f"✅ Data verified: {count:,} triples (meets the ≥30,000 requirement)")
else:
    print(f"❌ Insufficient data: {count:,} triples (need ≥30,000)")
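
`wc -l` counts newline characters, so blank lines inflate the count and a malformed record goes unnoticed. A stricter check can be sketched as follows, assuming each non-blank line of the file holds one JSON-encoded triple. The helper name `count_jsonl_records` is hypothetical.

```python
import json


def count_jsonl_records(path):
    """Return (valid_count, invalid_line_numbers) for a JSON Lines file.

    Blank lines are skipped; line numbers are 1-based.
    Hypothetical helper for illustration; not part of the GraphMER-SE project.
    """
    valid = 0
    invalid = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines instead of counting them
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid.append(lineno)
    return valid, invalid


# Example use in Colab (path taken from the cell above):
# valid, invalid = count_jsonl_records("data/enhanced_multilang.jsonl")
# print(f"{valid:,} valid records, {len(invalid)} malformed lines")
```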

## Cell 5: Install Dependencies

In [None]:
print("📦 Installing dependencies...")
!pip install -q transformers datasets pyyaml networkx
print("✅ Dependencies installed")
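
Because `pip install -q` suppresses most output, a failed install can be easy to miss. The sketch below checks that the packages can be found by the import system without actually importing them. Note that the `pyyaml` package is imported as `yaml`; the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that the import system cannot locate.

    Hypothetical helper for illustration; pass import names, not pip package names.
    """
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Example use after the install cell:
# print(missing_modules(["transformers", "datasets", "yaml", "networkx"]))
```

An empty list means all four modules are importable in the current runtime.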

## Cell 6: Validate Knowledge Graph Quality

In [None]:
print("🔍 Running knowledge graph validation...\n")

!python src/ontology/kg_validator.py \
  data/enhanced_multilang.jsonl \
  data/enhanced_multilang.entities.jsonl \
  docs/specs/ontology_spec.yaml

print("\n✅ Data validation complete")
print("Expected: domain_range_ratio ≥ 0.99, inherits_acyclic: True")
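
The `inherits_acyclic` flag presumably means that no class reaches itself through a chain of `inherits` edges. The sketch below shows one way such a check can be written, using an iterative depth-first search over `(child, parent)` pairs. It illustrates the idea under that assumption and is not the code in `kg_validator.py`; the function name and input format are hypothetical.

```python
def inherits_acyclic(edges):
    """Return True if the directed graph given by (child, parent) pairs has no cycle.

    Hypothetical illustration of an acyclicity check; not the project's validator.
    """
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, []).append(parent)
        graph.setdefault(parent, [])

    # Standard three-color DFS: GRAY nodes are on the current search path.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            descended = False
            for nxt in neighbors:
                if color[nxt] == GRAY:
                    return False  # back edge: nxt is an ancestor of node
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    descended = True
                    break
            if not descended:
                color[node] = BLACK  # all outgoing edges explored
                stack.pop()
    return True
```

For example, `inherits_acyclic([("Dog", "Animal"), ("Animal", "Dog")])` returns `False`, while a simple chain or a diamond-shaped hierarchy returns `True`.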

## Cell 7: Verify TPU Configuration

In [None]:
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print("🔧 TPU Configuration:")
print(f"✅ TPU device: {device}")
print(f"✅ TPU cores: {len(xm.get_xla_supported_devices() or [])}")
print(f"✅ torch-xla version: {torch_xla.__version__}")

# Test TPU computation. XLA tensors are lazy, so copy the result back to the
# host to force the matmul to actually run on the device.
x = torch.randn(3, 3).to(device)
y = torch.randn(3, 3).to(device)
z = (x @ y).cpu()
assert z.shape == (3, 3)
print("✅ TPU computation test passed")

## Cell 8: Create Output Directories

In [None]:
# Create directories for outputs and checkpoints
!mkdir -p /content/drive/MyDrive/GraphMER/outputs
!mkdir -p /content/drive/MyDrive/GraphMER/checkpoints

print("✅ Output directories created:")
print("   📁 /content/drive/MyDrive/GraphMER/outputs")
print("   📁 /content/drive/MyDrive/GraphMER/checkpoints")
print("\n💾 Checkpoints will be saved to Drive every 500 steps")

## Cell 9: Run Training Test (100 steps, ~5 minutes)

**This is a smoke test to verify everything works before full training.**

In [None]:
print("🚀 Starting training test (100 steps, ~5 minutes)...\n")

!python scripts/train.py \
  --config configs/train_colab.yaml \
  --steps 100 \
  --output_dir /content/drive/MyDrive/GraphMER/outputs

print("\n✅ Training test complete!")
print("📊 Check metrics in next cell")

## Cell 10: View Training Metrics

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load metrics
df = pd.read_csv('/content/drive/MyDrive/GraphMER/outputs/train_metrics.csv')

print("📊 Last 10 training steps:")
print(df.tail(10).to_string(index=False))

# Plot metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
ax1.plot(df['step'], df['train_loss'], 'b-', linewidth=2)
ax1.set_title('Training Loss', fontsize=14, fontweight='bold')
ax1.set_xlabel('Step', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.grid(True, alpha=0.3)

# Accuracy plot
ax2.plot(df['step'], df['val_acc'], 'g-', linewidth=2)
ax2.set_title('Validation Accuracy', fontsize=14, fontweight='bold')
ax2.set_xlabel('Step', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Metrics visualized")
print(f"📈 Final validation accuracy: {df['val_acc'].iloc[-1]:.4f}")
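
The plotting code assumes that `train_metrics.csv` has `step`, `train_loss`, and `val_acc` columns; a missing column surfaces as a `KeyError` partway through the cell. A small pre-check can be sketched with the standard `csv` module. The helper name `missing_columns` is hypothetical.

```python
import csv


def missing_columns(csv_path, required):
    """Return the names in `required` that are absent from the CSV header row.

    Hypothetical helper for illustration; not part of the GraphMER-SE project.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Example use before plotting (path and column names taken from the cell above):
# print(missing_columns("/content/drive/MyDrive/GraphMER/outputs/train_metrics.csv",
#                       ["step", "train_loss", "val_acc"]))
```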

## Cell 11: Full Training Run (1000 steps, ~15-20 minutes)

**Run this after the smoke test passes. Adjust `--steps` as needed.**

In [None]:
print("🚀 Starting full training (1000 steps, ~15-20 minutes)...\n")

!python scripts/train.py \
  --config configs/train_colab.yaml \
  --steps 1000 \
  --output_dir /content/drive/MyDrive/GraphMER/outputs \
  --checkpoint_dir /content/drive/MyDrive/GraphMER/checkpoints

print("\n✅ Full training complete!")
print("📊 Checkpoints saved to Drive")
print("📈 Re-run Cell 10 to view updated metrics")

## Cell 12: Resume Training from Checkpoint

**Use this if the session expires (Colab Free sessions last up to 12 hours).**

In [None]:
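# Find latest checkpoint
import glob
import re

def checkpoint_step(path):
    # Numeric step from a name like checkpoint_1000.pt (-1 if it does not match)
    match = re.search(r'checkpoint_(\d+)\.pt$', path)
    return int(match.group(1)) if match else -1

checkpoints = glob.glob('/content/drive/MyDrive/GraphMER/checkpoints/checkpoint_*.pt')
if checkpoints:
    # Pick the highest step number; a plain string sort would rank
    # checkpoint_500.pt after checkpoint_1000.pt.
    latest = max(checkpoints, key=checkpoint_step)
    print(f"📁 Latest checkpoint: {latest}")
    
    step = checkpoint_step(latest)
    if step >= 0:
        print(f"📊 Resuming from step: {step}")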
        
        # Resume training
        !python scripts/train.py \
          --config configs/train_colab.yaml \
          --resume_from {latest} \
          --steps 2000 \
          --output_dir /content/drive/MyDrive/GraphMER/outputs \
          --checkpoint_dir /content/drive/MyDrive/GraphMER/checkpoints
        
        print("\n✅ Training resumed and continued!")
else:
    print("❌ No checkpoints found. Run Cell 11 first.")

## Cell 13: Check Files in Drive

In [None]:
print("📁 Files in GraphMER folder:\n")
!ls -lh /content/drive/MyDrive/GraphMER/

print("\n📊 Outputs:")
!ls -lh /content/drive/MyDrive/GraphMER/outputs/

print("\n💾 Checkpoints:")
!ls -lh /content/drive/MyDrive/GraphMER/checkpoints/

---

## 📚 Documentation

- **Setup Guide:** See `COLAB_TPU_SETUP.md` in the project
- **Validation Results:** 30,826 triples, 99.10% quality
- **Expected Performance:** 1500-2000 tokens/sec after warmup

## 💡 Tips

- **Session Limit:** Colab Free sessions run for up to 12 hours, Pro for up to 24 hours
- **Checkpoints:** Auto-saved every 500 steps to Drive
- **Resume:** Use Cell 12 if session expires
- **Monitor:** Re-run Cell 10 anytime to see updated plots

## 🚀 Status

**Production-ready for TPU training!**

---

*Generated: 2025-10-20*  
*System: GraphMER-SE v1.0*