# 🚀 Sportlight HRNet Training - Google Colab

**NovaVista Atlas v2 - Egyptian League Analytics**

This notebook trains the Sportlight HRNet model for soccer field keypoint detection.

**Original Paper:** SoccerNet Camera Calibration Challenge 2023 (1st Place)

**Performance:** 73.22% Accuracy, 75.59% Completeness

**Dataset:** SoccerNet Camera Calibration (downloaded automatically)

**GPU Required:** T4 (16GB) - Available on Free Colab

---

**Training Optimizations Built-in:**
- ✅ Mixed Precision (AMP) - Faster + less memory
- ✅ Early Stopping - 32 epochs patience
- ✅ Auto Checkpoints - Every 2 epochs
- ✅ Best Model Saving - Val loss + metrics
- ✅ Memory Optimized - Fits T4 16GB GPU
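
The early-stopping rule in the list above can be pictured with a minimal sketch. This is not the callback the repository actually uses: `EarlyStopper` and its fields are hypothetical names, and the sketch assumes a lower-is-better validation loss. The toy run uses a patience of 3 instead of 32 to keep it short.

```python
class EarlyStopper:
    """Patience-based early stopping on a lower-is-better metric (illustrative only)."""

    def __init__(self, patience: int = 32):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0  # epochs since the last improvement

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with made-up losses and a short patience
stopper = EarlyStopper(patience=3)
losses = [0.9, 0.7, 0.71, 0.72, 0.70, 0.69]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"Stopped at epoch index {stopped_at}, best loss {stopper.best}")
```

Note that a loss equal to the current best does not count as an improvement in this sketch; a real callback may compare differently.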

## 📋 Setup Checklist

Before running:
- [ ] Enable GPU: Runtime → Change runtime type → GPU (T4)
- [ ] Run all cells in order
- [ ] Dataset will be downloaded automatically (~5-10 GB)
- [ ] Training takes 6-10 hours

## 1️⃣ Check GPU

In [None]:
import torch

print("🔍 GPU Information:\n")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"\n✅ GPU ready for training!")
else:
    print("\n❌ No GPU found!")
    print("Please enable GPU: Runtime → Change runtime type → T4 GPU")

## 2️⃣ Clone Sportlight Repository

In [None]:
print("📥 Cloning Sportlight repository...\n")

!git clone https://github.com/NikolasEnt/soccernet-calibration-sportlight.git
%cd soccernet-calibration-sportlight

print("\n✅ Repository cloned!")
!pwd

## 3️⃣ Install Dependencies

In [None]:
print("⚙️ Installing dependencies...\n")

# Core ML libraries
!pip install -q torch torchvision
!pip install -q opencv-python
!pip install -q hydra-core
!pip install -q omegaconf
!pip install -q albumentations
!pip install -q scipy
!pip install -q lsq-ellipse
!pip install -q scikit-image

# Argus framework (install from GitHub)
!pip install -q git+https://github.com/lRomul/argus.git

# SoccerNet API for dataset download
!pip install -q SoccerNet

print("\n✅ Dependencies installed!")

## 4️⃣ Download SoccerNet Dataset

**This will download the official SoccerNet Camera Calibration dataset:**
- Training set
- Validation set
- Test set

**Size:** ~5-10 GB

**Time:** 10-20 minutes depending on connection
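
Before downloading, it can help to confirm the runtime has enough free disk. The sketch below uses only the standard library; `has_free_space` is a hypothetical helper, and the 20 GB threshold is an assumption (room for the archives plus their extracted copies), not a figure from SoccerNet.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Assumed headroom: ~10 GB of archives plus roughly the same again once extracted
print("Enough disk space:", has_free_space(".", 20))
```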

In [None]:
from SoccerNet.Downloader import SoccerNetDownloader
import os

print("📦 Downloading SoccerNet Camera Calibration dataset...")
print("⏰ This may take 10-20 minutes\n")

# Create dataset directory
dataset_dir = "/content/soccernet_data"
os.makedirs(dataset_dir, exist_ok=True)

# Initialize downloader
downloader = SoccerNetDownloader(LocalDirectory=dataset_dir)

# Download camera calibration data (correct task name)
print("Downloading calibration-2023 dataset...")
downloader.downloadDataTask(
    task="calibration-2023",
    split=["train", "valid", "test"]
)

print("\n✅ Dataset downloaded!")

# Verify download
print("\n📊 Dataset structure:")
!ls -la {dataset_dir}

## 4.5️⃣ Extract Dataset Archives

In [None]:
import zipfile
from pathlib import Path

DATASET_BASE = "/content/soccernet_data/calibration-2023"
dataset_path = Path(DATASET_BASE)

print("📦 Extracting dataset archives...\n")

# Find all zip files
zip_files = list(dataset_path.glob("*.zip"))

if not zip_files:
    print("⚠️ No zip files found. Dataset may already be extracted.")
else:
    print(f"Found {len(zip_files)} archives to extract\n")
    
    for zip_path in zip_files:
        print(f"📂 Extracting {zip_path.name}...")
        
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(dataset_path)
        
        print(f"   ✅ Done")
    
    print("\n✅ All archives extracted!")

# Show extracted structure
print("\n📊 Extracted structure:")
!ls -la {DATASET_BASE}

## 5️⃣ Verify Dataset Structure

In [None]:
import os
import json
from pathlib import Path

DATASET_BASE = "/content/soccernet_data/calibration-2023"

print("🔍 Verifying dataset...\n")

for split in ['train', 'valid']:
    split_path = Path(DATASET_BASE) / split
    
    if not split_path.exists():
        print(f"⚠️ {split} folder not found at {split_path}")
        continue
    
    jpg_files = list(split_path.glob('**/*.jpg'))
    json_files = list(split_path.glob('**/*.json'))
    
    print(f"📂 {split}:")
    print(f"   Images: {len(jpg_files)}")
    print(f"   JSONs: {len(json_files)}")
    
    # Check first JSON structure
    if json_files:
        sample_json = json_files[0]
        try:
            with open(sample_json) as f:
                data = json.load(f)
            print(f"   Sample JSON keys: {list(data.keys())}")
            print(f"   Sample file: {sample_json.name}")
        except Exception as e:
            print(f"   Error reading JSON: {e}")
    print()

print("✅ Dataset verification complete!")
print(f"\n📍 Dataset location: {DATASET_BASE}")

## 6️⃣ Setup Data Directories for Sportlight

In [None]:
import os

print("📂 Setting up Sportlight data directories...\n")

# Create workdir structure expected by Sportlight
os.makedirs("/workdir/data/dataset", exist_ok=True)
os.makedirs("/workdir/data/experiments", exist_ok=True)

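# Create symbolic links to dataset
# DATASET_BASE comes from the previous cell; update it there if needed
# -n replaces an existing link instead of nesting a new link inside its target on re-runs
!ln -sfn {DATASET_BASE}/train /workdir/data/dataset/train
!ln -sfn {DATASET_BASE}/valid /workdir/data/dataset/valid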

# Verify links
print("Created symbolic links:")
!ls -la /workdir/data/dataset/

# Test access
train_files = !ls /workdir/data/dataset/train | head -5
print(f"\nSample train files:")
for f in train_files:
    print(f"  {f}")

print("\n✅ Data directories ready!")

## 7️⃣ Modify Training Config for Colab

**Adjustments for 16GB T4 GPU:**
- Batch size: 8 → 4 (fits in 16GB)
- Input size: 960×540 → 720×405 (reduce memory)
- Workers: 8 → 2 (Colab CPU limitation)
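
The size arithmetic behind these adjustments can be sketched as follows. `scaled_sizes` is a hypothetical helper: it keeps the 16:9 aspect ratio and derives a half-resolution prediction size in (height, width) order, as the config cell below does. Because 405 is odd, half of it is 202.5; rounding up to 203 is an assumption chosen to match the value used in this notebook, and whether the network's real output height equals 203 is not verified here.

```python
import math


def scaled_sizes(width: int, aspect=(16, 9)):
    """Return (input_size as [W, H], pred_size as [H/2, W/2]) for a given input width."""
    if (width * aspect[1]) % aspect[0]:
        raise ValueError(f"width {width} does not give an integer height at {aspect}")
    height = width * aspect[1] // aspect[0]
    input_size = [width, height]
    # Half resolution, rounded up when a dimension is odd (assumption)
    pred_size = [math.ceil(height / 2), math.ceil(width / 2)]
    return input_size, pred_size


for w in (960, 720, 640):
    print(w, scaled_sizes(w))
```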

In [None]:
import yaml
from pathlib import Path

print("🔧 Modifying training config for Colab...\n")

config_path = "src/models/hrnet/train_config.yaml"

# Read config
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Show original settings
print("📋 Original settings:")
print(f"   Batch size: {config['data_params']['batch_size']}")
print(f"   Input size: {config['data_params']['input_size']}")
print(f"   Workers: {config['data_params']['num_workers']}")

# Modify for Colab (16GB T4)
config['data_params']['batch_size'] = 4  # Reduced from 8
config['data_params']['input_size'] = [720, 405]  # Reduced from [960, 540]
config['data_params']['num_workers'] = 2  # Reduced from 8

# Adjust prediction sizes accordingly
config['model']['params']['loss']['pred_size'] = [203, 360]  # (H, W) at half resolution: 405/2 = 202.5 rounded up, 720/2
config['model']['params']['prediction_transform']['size'] = [405, 720]

# Performance optimizations
config['model']['params']['amp'] = True  # AMP for faster training + less memory

# Early stopping already configured (32 epochs patience)
# Checkpoints save every 2 epochs automatically
# Best models saved based on validation metrics

# Save modified config
with open(config_path, 'w') as f:
    yaml.dump(config, f)

print("\n✅ Modified settings:")
print(f"   Batch size: {config['data_params']['batch_size']}")
print(f"   Input size: {config['data_params']['input_size']}")
print(f"   Workers: {config['data_params']['num_workers']}")
print(f"   AMP enabled: {config['model']['params']['amp']}")
print("\n✅ Config optimized for Colab T4 GPU!")

## 7.5️⃣ (Optional) Quick Test Run

**Test the pipeline first with 5 epochs (~30 minutes)**

Skip this if you want to start full training immediately.
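
Since the test run overwrites `train_config.yaml` in place, a snapshot makes it easy to return to the full-training settings afterwards. The sketch below is one way to do that with the standard library; `backup_config`, `restore_config`, and the `.orig` suffix are hypothetical names, not part of the Sportlight repository.

```python
import shutil
from pathlib import Path


def backup_config(config_path: str, suffix: str = ".orig") -> Path:
    """Copy the config next to itself once; later calls keep the first snapshot."""
    src = Path(config_path)
    dst = src.with_name(src.name + suffix)
    if not dst.exists():
        shutil.copy2(src, dst)
    return dst


def restore_config(config_path: str, suffix: str = ".orig") -> None:
    """Overwrite the config with its snapshot (raises FileNotFoundError if none exists)."""
    src = Path(config_path)
    shutil.copy2(src.with_name(src.name + suffix), src)


# Usage in this notebook (paths as used in the cells here):
#   backup_config("src/models/hrnet/train_config.yaml")   # before the quick test
#   restore_config("src/models/hrnet/train_config.yaml")  # before full training
```

Taking the snapshot after the Colab adjustments in step 7 but before the quick test means a restore brings back the Colab settings with the original `max_epochs`.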

In [None]:
# Quick 5-epoch test to validate setup
import yaml

config_path = "src/models/hrnet/train_config.yaml"
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Test mode: 5 epochs
config['train_params']['max_epochs'] = 5

with open(config_path, 'w') as f:
    yaml.dump(config, f)

print("⚡ Quick test mode: 5 epochs (~30 min)")
print("This validates your setup before full training")
print("\nAfter test, reset max_epochs to 200 for full training")

## 8️⃣ Start Training

**⏰ Expected Duration:** 6-10 hours on T4 GPU

**Training will:**
- Run for max 200 epochs
- Early stopping after 32 epochs without improvement
- Save checkpoints every 2 epochs to `/workdir/data/experiments/`
- Save best model based on validation metrics

**⚠️ Important:** 
- Keep this tab open (or use Colab Pro for background execution)
- Free Colab sessions are limited to roughly 12 hours and may disconnect sooner, so a full run may not finish in one session
- Training can be resumed from checkpoints if interrupted
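
Because the Colab filesystem is wiped when a session ends, checkpoints under `/workdir/data/experiments/` are only useful for resuming if they are copied somewhere persistent. The sketch below is a minimal standard-library approach; `backup_checkpoints` is a hypothetical helper, and the destination folder is an assumption (for example a directory under a mounted Google Drive).

```python
import shutil
from pathlib import Path


def backup_checkpoints(src_dir: str, dst_dir: str) -> list:
    """Copy new or updated .pth files from src_dir (recursively) into dst_dir.

    Subfolders are preserved; files whose copy is already up to date are skipped.
    Returns the list of destination paths written on this call.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    copied = []
    for ckpt in sorted(src.rglob("*.pth")):
        target = dst / ckpt.relative_to(src)
        if target.exists() and target.stat().st_mtime_ns >= ckpt.stat().st_mtime_ns:
            continue  # copy2 preserves mtime, so an unchanged file is skipped
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(ckpt, target)
        copied.append(target)
    return copied


# Example call (destination path is hypothetical):
#   backup_checkpoints("/workdir/data/experiments", "/content/drive/MyDrive/sportlight_ckpts")
```

The training cell blocks the notebook while it runs, so a copy like this has to happen from a separate process or after an interruption; how to schedule it is left open here.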

In [None]:
import os
import sys

# Set working directory and Python paths
repo_path = "/content/soccernet-calibration-sportlight"
os.chdir(repo_path)
sys.path.insert(0, repo_path)

# Set Hydra config path
config_dir = f"{repo_path}/src/models/hrnet"
os.environ['HYDRA_CONFIG_DIR'] = config_dir

print("🚀 Starting Sportlight HRNet training...")
print("⏰ Expected time: 6-10 hours")
print("📊 Logs will appear below")
print("💾 Models saved to /workdir/data/experiments/\n")
print("-" * 70)

!python -m src.models.hrnet.train --config-dir={config_dir} --config-name=train_config

## 1️⃣1️⃣ Find Best Model

In [None]:
import glob
from pathlib import Path
import os

print("🔍 Looking for trained models...\n")

# Find experiment folder
exp_folders = glob.glob("/workdir/data/experiments/HRNet_57_*")

if not exp_folders:
    print("❌ No experiment folders found")
    print("Training may not have started or completed yet")
else:
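    # Use the most recently modified experiment folder in case several runs exist
    exp_folder = max(exp_folders, key=os.path.getmtime)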
    print(f"📂 Experiment folder: {Path(exp_folder).name}\n")
    
    # Find all models
    all_models = glob.glob(f"{exp_folder}/*.pth")
    
    if all_models:
        print(f"📦 Found {len(all_models)} model checkpoint(s)\n")
        
        # Find best EvalAI model (best performance on validation)
        evalai_models = [m for m in all_models if 'evalai' in m]
        
        # Find best PCKs model (best keypoint accuracy)
        pcks_models = [m for m in all_models if 'pcks' in m]
        
        if evalai_models:
            best_evalai = sorted(evalai_models)[-1]
            model_size = os.path.getsize(best_evalai) / (1024**2)
            print(f"✅ Best EvalAI model: {Path(best_evalai).name}")
            print(f"   Size: {model_size:.1f} MB")
            print(f"   Path: {best_evalai}\n")
            BEST_MODEL_PATH = best_evalai
            
        if pcks_models:
            best_pcks = sorted(pcks_models)[-1]
            print(f"✅ Best PCKs model: {Path(best_pcks).name}")
            print(f"   Path: {best_pcks}\n")
            
        # Show all models
        print("📋 All saved models:")
        for model in sorted(all_models):
            model_size = os.path.getsize(model) / (1024**2)
            print(f"   - {Path(model).name} ({model_size:.1f} MB)")
            
    else:
        print("❌ No models found in experiment folder")
        print("Training may still be in progress")

print("\n" + "=" * 70)

## 1️⃣2️⃣ Download Best Model

In [None]:
from google.colab import files
import os  # needed for os.path.getsize below
from pathlib import Path

print("💾 Downloading trained model...\n")

try:
    # Check if model exists
    if 'BEST_MODEL_PATH' not in globals():
        raise NameError("No model found")
    
    model_name = Path(BEST_MODEL_PATH).name
    model_size = os.path.getsize(BEST_MODEL_PATH) / (1024**2)
    
    print(f"📦 Model: {model_name}")
    print(f"📏 Size: {model_size:.1f} MB")
    print(f"\n📥 Starting download...\n")
    
    # Download to local machine
    files.download(BEST_MODEL_PATH)
    
    print("\n✅ Model downloaded successfully!")
    print("\n📋 Model info:")
    print(f"   Filename: {model_name}")
    print(f"   Size: {model_size:.1f} MB")
    print(f"   Location: Your Downloads folder")
    
    print("\n📝 Next steps:")
    print("   1. Move model to your project: atlas/v2/models/")
    print("   2. Test on Egyptian League frames")
    print("   3. Integrate into Atlas pipeline")
    
except NameError:
    print("❌ No model found to download")
    print("Training may not have completed yet")
    print("Run the 'Find Best Model' cell first")
except Exception as e:
    print(f"❌ Error: {e}")

## 1️⃣3️⃣ Training Summary

In [None]:
import glob
import os

print("📈 Training Summary")
print("=" * 70)

try:
    # Read the most recently modified log file
    log_files = glob.glob("/workdir/data/experiments/*/log.txt")
    if log_files:
        log_path = max(log_files, key=os.path.getmtime)
        with open(log_path) as f:
            log_lines = f.readlines()
        
        # Extract key metrics from last 20 lines
        print("\n📊 Final training metrics:\n")
        for line in log_lines[-20:]:
            if 'Epoch' in line or 'loss' in line or 'evalai' in line:
                print(line.strip())
        
        print("\n" + "=" * 70)
        print("\n✅ Training complete!")
        
    else:
        print("No training logs found")
        
    print("\n📋 Next steps:")
    print("   1. ✅ Model trained and downloaded")
    print("   2. 📊 Test on Egyptian League frames (Phase 2)")
    print("   3. 🔧 Integrate into Atlas v2 pipeline (Phase 3)")
    print("   4. ✅ Production validation (Phase 5)")
    
except Exception as e:
    print(f"Error reading logs: {e}")

## 🆘 Troubleshooting

### Out of Memory Error

If you get `CUDA out of memory`, run this to reduce memory usage:

In [None]:
# Emergency memory reduction
import yaml

config_path = "src/models/hrnet/train_config.yaml"
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("⚠️ Applying emergency memory reduction...\n")

# Further reduce settings
config['data_params']['batch_size'] = 2  # From 4 to 2
config['data_params']['input_size'] = [640, 360]  # From [720, 405]
config['model']['params']['loss']['pred_size'] = [180, 320]
config['model']['params']['prediction_transform']['size'] = [360, 640]

with open(config_path, 'w') as f:
    yaml.dump(config, f)

print("✅ Config updated:")
print(f"   Batch size: 2 (was 4)")
print(f"   Input size: 640×360 (was 720×405)")
print("\n⚠️ Training will be slower but more stable")
print("Restart the training cell now")

### Resume from Checkpoint

If training was interrupted, resume from last checkpoint:

In [None]:
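import glob
import os
import yaml
from pathlib import Path

print("🔍 Looking for checkpoints...\n")

checkpoints = glob.glob("/workdir/data/experiments/*/save-*.pth")

if checkpoints:
    # Pick the newest file by modification time rather than by name
    latest_checkpoint = max(checkpoints, key=os.path.getmtime)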
    print(f"✅ Found checkpoint: {Path(latest_checkpoint).name}\n")
    
    # Update config to resume from checkpoint
    config_path = "src/models/hrnet/train_config.yaml"
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    config['model']['params']['pretrain'] = latest_checkpoint
    
    with open(config_path, 'w') as f:
        yaml.dump(config, f)
    
    print("✅ Config updated to resume from checkpoint")
    print(f"   Checkpoint: {latest_checkpoint}")
    print("\n📝 Run the training cell again to continue training")
else:
    print("❌ No checkpoints found")
    print("Training must be started from scratch")

---

## ✅ Training Complete!

**What you have now:**
- ✅ Trained HRNet keypoint detection model
- ✅ Model downloaded to your computer
- ✅ Ready for Phase 2: Testing on Egyptian League

**Performance expectations:**
- Keypoint accuracy: 70-75%
- Completeness: 75-80%
- Model size: ~200-300 MB

**Next steps:**
1. Move model to: `atlas/v2/models/sportlight_hrnet.pth`
2. Test on 50+ Egyptian League frames
3. Measure completeness and accuracy
4. Integrate into Atlas v2 pipeline

---

**Training Method:** SoccerNet dataset (official)

**Training Time:** 6-10 hours on T4

**Solution:** Sportlight (SoccerNet 2023 Winner)

**Date:** October 2025