# 🚀 Sportlight HRNet Training - Google Colab

**NovaVista Atlas v2 - Egyptian League Analytics**

This notebook trains the Sportlight HRNet model for soccer field keypoint detection.

**Source:** 1st-place solution, SoccerNet Camera Calibration Challenge 2023

**Performance:** 73.22% Accuracy, 75.59% Completeness

**GPU Required:** T4 (16GB) - Available on Free Colab

## 📋 Setup Checklist

Before running:
- [ ] Upload SoccerNet dataset to Google Drive
- [ ] Organize as: `MyDrive/soccernet_dataset/train`, `valid`, `test`
- [ ] Each folder should have paired `.jpg` and `.json` files
- [ ] Enable GPU: Runtime → Change runtime type → GPU (T4)

## 1️⃣ Mount Google Drive

In [None]:
from google.colab import drive
import os

print("📦 Mounting Google Drive...")
drive.mount('/content/drive')
print("✅ Drive mounted!")

## 2️⃣ Verify Dataset

In [None]:
import os
import json
from pathlib import Path

# Update this path to your dataset location
DATASET_BASE = "/content/drive/MyDrive/soccernet_dataset"

print("🔍 Checking dataset...\n")

for split in ['train', 'valid']:
    split_path = Path(DATASET_BASE) / split
    
    if not split_path.exists():
        print(f"❌ {split} folder not found at {split_path}")
        continue
    
    jpg_files = list(split_path.glob('*.jpg'))
    json_files = list(split_path.glob('*.json'))
    
    print(f"📂 {split}:")
    print(f"   Images: {len(jpg_files)}")
    print(f"   JSONs: {len(json_files)}")
    
    # Check first JSON structure
    if json_files:
        sample_json = json_files[0]
        with open(sample_json) as f:
            data = json.load(f)
        print(f"   Sample JSON keys: {list(data.keys())}")
    print()

print("✅ Dataset verification complete!")
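
The counts printed above do not confirm that every image actually has an annotation. The sketch below compares file stems to list unpaired files. It assumes that a pair shares the same stem (for example `00001.jpg` and `00001.json`), which the checklist implies but does not state. `find_unpaired` is a hypothetical helper name, and any metadata JSON without a matching image will be reported as unpaired too.

```python
from pathlib import Path


def find_unpaired(split_dir):
    """Return (image stems without a JSON, JSON stems without an image)."""
    split_dir = Path(split_dir)
    image_stems = {p.stem for p in split_dir.glob("*.jpg")}
    json_stems = {p.stem for p in split_dir.glob("*.json")}
    # Set differences give the files that have no partner in the other set.
    return sorted(image_stems - json_stems), sorted(json_stems - image_stems)
```

For example, `find_unpaired(Path(DATASET_BASE) / "train")` returns two lists, both of which should be empty (or contain only metadata files) before training starts.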

## 3️⃣ Clone Sportlight Repository

In [None]:
!git clone https://github.com/NikolasEnt/soccernet-calibration-sportlight.git
%cd soccernet-calibration-sportlight
!pwd

## 4️⃣ Install Dependencies

In [None]:
print("⚙️ Installing dependencies...\n")

!pip install -q torch torchvision
!pip install -q opencv-python
!pip install -q hydra-core
!pip install -q argus-learn
!pip install -q omegaconf
!pip install -q albumentations
!pip install -q scipy
!pip install -q scikit-image

print("\n✅ Dependencies installed!")
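
Because `pip install -q` hides most output, a failed install can go unnoticed until training crashes. A minimal sketch of an import check follows. The list of import names is an assumption derived from the packages above (the import name can differ from the pip name), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names for the packages installed above. Pip names and import names
# differ for some of them: opencv-python -> cv2, hydra-core -> hydra,
# argus-learn -> argus, scikit-image -> skimage.
REQUIRED_MODULES = [
    "torch", "torchvision", "cv2", "hydra", "argus",
    "omegaconf", "albumentations", "scipy", "skimage",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

The check only locates the modules, it does not import them, so it is quick to run before moving on.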

## 5️⃣ Setup Data Directories

In [None]:
import os

print("📂 Setting up data directories...\n")

# Create workdir structure
os.makedirs("/workdir/data/dataset", exist_ok=True)
os.makedirs("/workdir/data/experiments", exist_ok=True)

# Create symbolic links to dataset
# (-n replaces an existing link instead of creating a nested link inside it on re-runs)
!ln -sfn {DATASET_BASE}/train /workdir/data/dataset/train
!ln -sfn {DATASET_BASE}/valid /workdir/data/dataset/valid

# Verify links
!ls -la /workdir/data/dataset/

print("\n✅ Data directories ready!")
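
For readers who prefer to stay in Python, the linking step can be sketched as below. `link_split` is a hypothetical helper, not part of the Sportlight repository. It replaces an existing symlink so that the cell can be re-run safely, and it refuses to overwrite a real file or directory.

```python
import os
from pathlib import Path


def link_split(source_dir, link_path):
    """Point link_path at source_dir, replacing any existing symlink."""
    link_path = Path(link_path)
    if link_path.is_symlink():
        link_path.unlink()  # drop the stale link so links never nest
    elif link_path.exists():
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    link_path.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(source_dir, link_path, target_is_directory=True)
    return link_path.resolve()
```

A call such as `link_split(f"{DATASET_BASE}/train", "/workdir/data/dataset/train")` would stand in for the first `ln` command.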

## 6️⃣ Modify Training Config for Colab

**Adjustments:**
- Batch size: 8 → 4 (fits in 16GB GPU)
- Input size: 960×540 → 720×405 (reduce memory)
- Workers: 8 → 2 (Colab CPU limitation)

In [None]:
import yaml
from pathlib import Path

print("🔧 Modifying training config for Colab...\n")

config_path = "src/models/hrnet/train_config.yaml"

# Read config
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Show original settings
print("📋 Original settings:")
print(f"   Batch size: {config['data_params']['batch_size']}")
print(f"   Input size: {config['data_params']['input_size']}")
print(f"   Workers: {config['data_params']['num_workers']}")

# Modify for Colab
config['data_params']['batch_size'] = 4  # Reduced from 8
config['data_params']['input_size'] = [720, 405]  # Reduced from [960, 540]
config['data_params']['num_workers'] = 2  # Reduced from 8

# Adjust prediction sizes accordingly (both are [height, width])
config['model']['params']['loss']['pred_size'] = [203, 360]  # ceil(405/2), 720/2
config['model']['params']['prediction_transform']['size'] = [405, 720]

# Enable AMP (Automatic Mixed Precision) for memory efficiency
config['model']['params']['amp'] = True

# Save modified config (sort_keys=False keeps the original key order)
with open(config_path, 'w') as f:
    yaml.dump(config, f, sort_keys=False)

print("\n✅ Modified settings:")
print(f"   Batch size: {config['data_params']['batch_size']}")
print(f"   Input size: {config['data_params']['input_size']}")
print(f"   Workers: {config['data_params']['num_workers']}")
print("\n✅ Config updated for Colab!")
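
The three size settings have to stay consistent with each other, and they use different axis orders. The sketch below derives them from a single width and height. It assumes, as the values in the cell above suggest, that the prediction map is half the input resolution and that `pred_size` and the transform size are `[height, width]` while `input_size` is `[width, height]`. Rounding up reproduces the 203 used above; whether the model really outputs that size for an odd height should be checked against the repository. `derive_sizes` is a hypothetical helper.

```python
import math


def derive_sizes(width, height, pred_scale=2):
    """Return mutually consistent size settings for a width x height input."""
    return {
        "input_size": [width, height],
        # Prediction map at 1/pred_scale resolution, rounded up, as [h, w].
        "pred_size": [math.ceil(height / pred_scale),
                      math.ceil(width / pred_scale)],
        "prediction_transform_size": [height, width],
    }


print(derive_sizes(720, 405))
print(derive_sizes(640, 360))
```

The two calls reproduce the values used in this notebook: `[203, 360]` for the 720×405 input and `[180, 320]` for the 640×360 fallback.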

## 7️⃣ Check GPU

In [None]:
import torch

print("🔍 GPU Information:\n")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"\n✅ Ready to train!")
else:
    print("\n❌ No GPU found!")
    print("Please enable GPU: Runtime → Change runtime type → T4 GPU")

## 8️⃣ Start Training

**Expected Duration:** 6-10 hours on T4 GPU

**Training will:**
- Run for max 200 epochs
- Early stopping after 32 epochs without improvement
- Save checkpoints every 2 epochs
- Save best model based on validation loss
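
The interaction between the 200-epoch cap and the 32-epoch patience can be illustrated with a small simulation. This is a sketch of the general early-stopping rule described in the list above, not the training framework's own implementation; `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience=32, max_epochs=200):
    """Return (last epoch run, best epoch), both 1-based.

    Training stops once `patience` epochs have passed without a new lowest
    validation loss, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch


# Loss improves until epoch 2, then plateaus: training stops at epoch 34.
print(stopping_epoch([1.0, 0.5] + [0.6] * 50))
```

In this example the best model comes from epoch 2, and the remaining 32 epochs are spent waiting for an improvement that never arrives.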

**Note:** Keep this tab open, or use a paid Colab plan that supports background execution

In [None]:
print("🚀 Starting training...")
print("⏰ Expected time: 6-10 hours")
print("📊 Monitor progress below\n")
print("-" * 60)

!python src/models/hrnet/train.py

## 9️⃣ View Training Progress (Optional)

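Colab runs one cell at a time, so this cell executes only once the training cell has finished or been interrupted. Use it to inspect the most recent log.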

In [None]:
# View the most recently modified log file
import glob
import os

log_files = glob.glob("/workdir/data/experiments/*/log.txt")
if log_files:
    latest_log = max(log_files, key=os.path.getmtime)
    print(f"📊 Reading: {latest_log}\n")
    !tail -n 50 {latest_log}
else:
    print("No log files found yet")

## 🔟 Find Best Model

In [None]:
import glob
import os
from pathlib import Path

print("🔍 Looking for trained models...\n")

# Find experiment folders
exp_folders = glob.glob("/workdir/data/experiments/HRNet_57_*")

if not exp_folders:
    print("❌ No experiment folders found")
else:
    # Use the most recently modified experiment folder
    exp_folder = max(exp_folders, key=os.path.getmtime)
    print(f"📂 Experiment: {exp_folder}\n")
    
    # Find all models
    all_models = glob.glob(f"{exp_folder}/*.pth")
    
    if all_models:
        print(f"📦 Found {len(all_models)} model(s):\n")
        
        # Find best EvalAI model
        # (sorted by file name, so this assumes the names order by score)
        evalai_models = [m for m in all_models if 'evalai' in m]
        if evalai_models:
            best_model = sorted(evalai_models)[-1]
            print(f"✅ Best model: {Path(best_model).name}")
            print(f"   Full path: {best_model}")
            
            # Store for next cell
            BEST_MODEL_PATH = best_model
        else:
            print("⚠️ No 'evalai' models found, showing all:")
            for model in all_models:
                print(f"   - {Path(model).name}")
    else:
        print("❌ No models found in experiment folder")

## 1️⃣1️⃣ Download Model

In [None]:
from google.colab import files
import os
import shutil
from pathlib import Path

print("💾 Saving trained model...\n")

try:
    # Copy to Google Drive
    drive_save_path = "/content/drive/MyDrive/sportlight_models"
    os.makedirs(drive_save_path, exist_ok=True)
    
    model_name = Path(BEST_MODEL_PATH).name
    drive_model_path = f"{drive_save_path}/{model_name}"
    
    shutil.copy(BEST_MODEL_PATH, drive_model_path)
    print(f"✅ Saved to Google Drive: {drive_model_path}")
    
    # Also download to local computer
    print("\n📥 Downloading to your computer...")
    files.download(BEST_MODEL_PATH)
    
    print("\n✅ Model saved successfully!")
    print(f"\n📊 Model info:")
    print(f"   Name: {model_name}")
    print(f"   Size: {Path(BEST_MODEL_PATH).stat().st_size / 1e6:.1f} MB")
    
except NameError:
    print("❌ No model found to save. Training may not have completed.")
except Exception as e:
    print(f"❌ Error: {e}")

## 📊 Training Summary

In [None]:
import glob
import os

print("📈 Training Summary\n")
print("=" * 60)

try:
    # Read the most recently modified log file
    log_files = glob.glob("/workdir/data/experiments/*/log.txt")
    log_path = max(log_files, key=os.path.getmtime)
    with open(log_path) as f:
        log_lines = f.readlines()
    
    # Extract key metrics from last few lines
    print("Last 10 log entries:\n")
    for line in log_lines[-10:]:
        print(line.strip())
    
    print("\n" + "=" * 60)
    print("\n✅ Training complete!")
    print("\n📋 Next steps:")
    print("1. Test model on Egyptian League frames")
    print("2. Integrate into Atlas v2 pipeline")
    print("3. Validate performance metrics")
    
except (IndexError, ValueError, OSError):
    print("No training logs found")

## 🆘 Troubleshooting

### Out of Memory Error

If you get `CUDA out of memory`, run this cell to further reduce the batch size and input size, then rerun the training cell:

In [None]:
# Emergency memory reduction
import yaml

config_path = "src/models/hrnet/train_config.yaml"
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Further reduce
config['data_params']['batch_size'] = 2  # From 4 to 2
config['data_params']['input_size'] = [640, 360]  # From [720, 405]
config['model']['params']['loss']['pred_size'] = [180, 320]
config['model']['params']['prediction_transform']['size'] = [360, 640]

with open(config_path, 'w') as f:
    yaml.dump(config, f, sort_keys=False)

print("✅ Config reduced for lower memory usage")
print("⚠️ Training will be slower but more stable")

### Session Timeout

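**Free Colab timeout:** up to 12 hours per session; idle sessions can disconnect sooner

**Solutions:**
1. Use a paid Colab plan for longer sessions (Colab Pro costs about $10/month; up to 24 hours, depending on plan)
2. Resume from a checkpoint if interrupted
3. Train in multiple sessions

Checkpoints under `/workdir` live on the Colab VM's disk and disappear when the runtime is recycled, so copy them to Google Drive if you plan to resume in a new session.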

**To resume training:**

In [None]:
# Resume from the most recent checkpoint
import glob
import os
import yaml

checkpoints = glob.glob("/workdir/data/experiments/*/save-*.pth")
if checkpoints:
    latest_checkpoint = max(checkpoints, key=os.path.getmtime)
    print(f"Found checkpoint: {latest_checkpoint}")
    
    # Update config to use checkpoint
    config_path = "src/models/hrnet/train_config.yaml"
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    config['model']['params']['pretrain'] = latest_checkpoint
    
    with open(config_path, 'w') as f:
        yaml.dump(config, f, sort_keys=False)
    
    print("✅ Config updated to resume from checkpoint")
    print("Run training cell again to continue")
else:
    print("No checkpoints found")
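
Resuming only works if the checkpoints outlive the session, and files on the Colab VM are lost when the runtime is recycled. The sketch below copies checkpoint files to another folder, such as one on the mounted Drive, and skips files that are already backed up. `backup_checkpoints` and the Drive folder in the usage note are hypothetical; the `*.pth` pattern follows the file extension used elsewhere in this notebook.

```python
import shutil
from pathlib import Path


def backup_checkpoints(experiments_dir, backup_dir, pattern="*.pth"):
    """Copy new or updated checkpoints, preserving experiment subfolders."""
    experiments_dir, backup_dir = Path(experiments_dir), Path(backup_dir)
    copied = []
    for src in sorted(experiments_dir.rglob(pattern)):
        dst = backup_dir / src.relative_to(experiments_dir)
        # Skip files whose backup is already at least as recent.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 preserves the modification time
        copied.append(dst)
    return copied
```

A call such as `backup_checkpoints("/workdir/data/experiments", "/content/drive/MyDrive/sportlight_models/experiments")` has to run before the runtime is recycled, for example right after interrupting the training cell. In a new session, the same function with the arguments swapped restores the files so that the resume cell above can find them.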

---

## ✅ Training Complete!

**What you have now:**
- Trained HRNet keypoint detection model (~200 MB)
- Model saved in Google Drive and downloaded locally
- Ready for Phase 2: Testing on Egyptian League

**Next steps:**
1. Test on Egyptian League frames (see `PHASE_2_TESTING.md`)
2. Integrate into Atlas v2 pipeline
3. Production validation

---

**Notebook:** `sportlight_colab_training.ipynb`

**Author:** NovaVista Atlas Team

**Date:** October 2025