# NASA Cloud-ML Training on Google Colab

**Paper-Quality Baseline & Ablation Studies**

This notebook is optimized for producing publication-ready results on Google Colab's T4 GPU (16GB VRAM, roughly 15GB usable).

## 🚨 IMPORTANT: CUDA Graph Fix Applied

**If you previously got "RuntimeError: static input data pointer changed":**
- ✅ **FIXED** - Configs updated to use `torch_compile_mode: "default"` (compatible with gradient checkpointing)
- ✅ **NEW** - Added `colab_full_stable.yaml` config (no torch.compile, maximum stability)
- See troubleshooting section below for details

## Config Quick Reference

| Config | Model | torch.compile | Batch | Memory | Speed | Stability | TIER 1 |
|--------|-------|---------------|-------|--------|-------|-----------|--------|
| **colab_optimized_full_tuned.yaml** | 64/128/256 | ✅ (default) | 20 | 10-12GB | ⚡ Fast | ✅ Good | ✅ YES |
| **colab_optimized_full.yaml** | 64/128/256 | ✅ (default) | 20 | 9-10GB | ⚡ Fast | ✅ Good | ❌ No |
| **colab_full_stable.yaml** | 64/128/256 | ❌ | 16 | 8-9GB | Normal | ✅✅ Best | ❌ No |
| **colab_optimized.yaml** | 32/64/128 | ❌ | 16 | 7-8GB | Normal | ✅ Good | ❌ No |

**🎯 TIER 1 READY:** Use **colab_optimized_full_tuned.yaml** (Option A-Tuned) for literature-backed improvements (+15-25% R² expected)!

**Recommendation:** Start with **colab_optimized_full_tuned.yaml** (Option A-Tuned). If you get OOM errors, reduce `batch_size` to 16 or use **colab_optimized.yaml**.

## What This Notebook Does

1. **TIER 1 Training** (NEW!) - Literature-backed improvements: multi-scale attention + self-supervised pre-training
2. **Strong Baseline Training** - Train a high-quality model with optimal hyperparameters
3. **Comprehensive Ablation Studies** - Systematic evaluation of each component
4. **GPU-Optimized** - Maximizes T4 GPU utilization (~10-12GB usage vs default 3.7GB)
5. **Reproducible Results** - Fixed seeds, detailed logging, automatic checkpointing
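
The "fixed seeds" item can be illustrated with a minimal sketch. Which generators the repository actually seeds is not shown here, so `set_seed` below is a hypothetical helper that seeds Python's `random` and, when they are installed, NumPy and PyTorch:

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed every random number generator available in this environment."""
    # PYTHONHASHSEED only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding reproduces the same sequence of draws.
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding alone does not make GPU training bit-for-bit deterministic, since some CUDA kernels are non-deterministic, but it removes the largest source of run-to-run variation.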

## Training Pipeline

**TIER 1 Training (NEW - colab_optimized_full_tuned.yaml):**
- **Self-Supervised Phase** (20 epochs): Encoder learns spatial features via image reconstruction
- **Supervised Pre-training**: Model learns on primary flight (30Oct24)
- **Final Training**: Fine-tunes on all flights with overweighting
- **Validation**: Held-out flight (12Feb25) for unbiased evaluation

**Baseline Training (colab_optimized_full.yaml):**
- **Pre-training Phase**: Model learns on primary flight (30Oct24) - establishes feature representations
- **Final Training Phase**: Fine-tunes on all flights with overweighting to retain learned features
- **Validation**: Held-out flight (12Feb25) for unbiased evaluation
- **LOO Cross-Validation** (Optional): Each flight held out once for robust generalization assessment
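
"Overweighting" is not defined in this notebook. One common reading is that samples from the primary flight receive a larger sampling weight during final training so that features learned in pre-training are not washed out. The sketch below assumes that reading; `sample_weights` is a hypothetical helper, not the repository's implementation.

```python
PRIMARY_FLIGHT = "30Oct24"   # primary flight named in the pipeline above
OVERWEIGHT_FACTOR = 2.0      # factor quoted in this notebook for the tuned config


def sample_weights(flight_ids, primary=PRIMARY_FLIGHT, factor=OVERWEIGHT_FACTOR):
    """Return one normalized sampling weight per sample.

    Samples from the primary flight count `factor` times as much as the others.
    """
    raw = [factor if fid == primary else 1.0 for fid in flight_ids]
    total = sum(raw)
    return [w / total for w in raw]


# Four illustrative samples: two from the primary flight, two from others.
flight_ids = ["30Oct24", "30Oct24", "10Feb25", "04Nov24"]
weights = sample_weights(flight_ids)
print([round(w, 3) for w in weights])
```

Weights like these could feed a weighted sampler or scale a per-sample loss; either way, a larger factor pulls the model harder toward the primary flight.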

## Expected Runtime

- **TIER 1 Training**: ~3-4 hours (20 epochs pre-training + 50 epochs supervised)
- **Baseline Training**: ~2-3 hours (50 epochs with early stopping)
- **Full Ablation Suite**: ~6-8 hours (8 experiments × ~45 min each)
- **With LOO CV**: ~8-12 hours (depending on flights)

## Prerequisites

- Google Drive with data uploaded to `/MyDrive/CloudML/data/`
- Each flight folder must contain: `.h5` (IRAI), `.hdf5` (CPL), `.hdf` (navigation)
- ~2GB Drive space for models and results

---

## Setup (Run Once Per Session)

In [None]:
# ============================================================================
# STEP 1: Mount Google Drive
# ============================================================================
from google.colab import drive
import os

drive.mount('/content/drive')

# Create project directories
!mkdir -p /content/drive/MyDrive/CloudML/data
!mkdir -p /content/drive/MyDrive/CloudML/models
!mkdir -p /content/drive/MyDrive/CloudML/plots
!mkdir -p /content/drive/MyDrive/CloudML/logs

print("✓ Google Drive mounted successfully")
print("✓ Project directories created")

In [None]:
# ============================================================================
# STEP 2: Clone/Update Repository
# ============================================================================
%cd /content

if not os.path.exists('/content/repo'):
    print('Cloning repository...')
    !git clone https://github.com/rylanmalarchick/cloudMLPublic.git repo
else:
    print('Repository exists. Pulling latest changes...')
    %cd /content/repo
    !git pull origin main

%cd /content/repo
print("✓ Repository ready")

In [None]:
# ============================================================================
# STEP 3: Install Dependencies
# ============================================================================
print("Installing dependencies (this may take 5-10 minutes)...\n")

# Install PyTorch with CUDA 12.1 support
!pip install --quiet torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# Clean up potential conflicts
!pip uninstall -y -q mamba-ssm causal-conv1d 2>/dev/null

# Install core dependencies
!pip install --quiet h5py==3.14.0 netCDF4==1.7.2 pyhdf==0.11.6 scikit-learn matplotlib plotly pyyaml

# Install advanced components
!pip install --quiet torch_geometric==2.5.3
!pip install --quiet causal-conv1d==1.4.0
!pip install --quiet mamba-ssm==2.2.2

print("\n✓ All dependencies installed successfully")
print("\nVerifying installation...")

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# ============================================================================
# STEP 4: Verify Data
# ============================================================================
import os

data_dir = '/content/drive/MyDrive/CloudML/data/'

# Expected flights
flights = ['10Feb25', '30Oct24', '04Nov24', '23Oct24', '18Feb25', '12Feb25']

print("Checking data availability...\n")
missing_data = []

for flight in flights:
    flight_path = os.path.join(data_dir, flight)
    if os.path.exists(flight_path):
        files = os.listdir(flight_path)
        has_h5 = any(f.endswith('.h5') for f in files)
        has_hdf5 = any(f.endswith('.hdf5') for f in files)
        has_hdf = any(f.endswith('.hdf') for f in files)
        
        if has_h5 and has_hdf5 and has_hdf:
            print(f"✓ {flight}: All files present")
        else:
            print(f"⚠ {flight}: Missing files (h5={has_h5}, hdf5={has_hdf5}, hdf={has_hdf})")
            missing_data.append(flight)
    else:
        print(f"✗ {flight}: Folder not found")
        missing_data.append(flight)

if missing_data:
    print(f"\n⚠ WARNING: {len(missing_data)} flight(s) missing data")
    print("Training will proceed with available flights only.")
else:
    print("\n✓ All data verified successfully!")

---
## Baseline Training - Choose Your Configuration

**🎯 NEW: TIER 1 Implementation Ready!**

**TIER 1 improvements** (based on validated literature):
- ✅ Multi-scale temporal attention (captures cross-view relationships at different scales)
- ✅ Self-supervised pre-training (encoder learns features before supervised training)
- ✅ Increased temporal frames (7 instead of 5 for better spatial coverage)
- ✅ Expected improvement: R² from -0.09 to 0.15-0.25 (an absolute gain of roughly 0.24-0.34, quoted as "+15-25%" elsewhere in this notebook)

See `TIER1_READY.md` for full details.

---

**Five options available** (A-Tuned, A, B, and C below; Option D, a custom configuration, appears later in the notebook):

### Option A-Tuned: Full Model with TUNED Hyperparameters (RECOMMENDED) ⭐⭐
- **NEW**: Tuned based on first run analysis (negative R², erratic val loss)
- Model: **64/128/256 channels** (full capacity)
- **Improvements:** Lower LR (0.0005), reduced warmup (500), gentler overweighting (2.0x), tighter early stopping
- GPU Memory: ~10-12GB (see Config Quick Reference)
- **Use this if:** First run gave poor results (R² < 0)
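
"Tighter early stopping" refers to the patience setting, which the training cell reports as reduced from 15 to 10. As a generic illustration of patience-based early stopping (the repository's own logic may differ, and `EarlyStopping` is a hypothetical class):

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative loss values, not real results; patience=3 keeps the demo short.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best loss {stopper.best}")
```

A smaller patience ends unproductive runs sooner, which matters when validation loss is erratic, but it can also stop a run just before a late improvement (the 0.7 in the demo is never reached).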

### Option A: Full Model with Optimizations
- Model: **64/128/256 channels** (full capacity)
- Optimizations: Gradient checkpointing + torch.compile (default mode)
- GPU Memory: ~9-10GB
- Batch size: 20
- Speed: **15-25% faster** than no compile
- **Use this for:** First attempt, original hyperparameters

### Option B: Full Model - Maximum Stability
- Model: **64/128/256 channels** (full capacity)
- Optimizations: Gradient checkpointing only (NO torch.compile)
- GPU Memory: ~8-9GB
- Batch size: 16
- **Use this if:** Option A gives "static input data pointer changed" errors

### Option C: Memory-Optimized Model (SAFE FALLBACK)
- Model: 32/64/128 channels (half the channel width of the full model)
- GPU Memory: ~7-8GB
- Batch size: 16
- **Use this if:** Options A or B give OOM errors

**Recommended:** Start with **Option A-Tuned** (improved based on first run analysis).

In [None]:
# ============================================================================
# OPTION A-TUNED: TIER 1 TRAINING (RECOMMENDED) ⭐⭐
# ============================================================================

# STEP 1: Pull latest Tier 1 code
print("Pulling latest Tier 1 code...")
%cd /content/repo
!git pull origin main
print("✓ Code updated\n")

# STEP 2: Verify Tier 1 modules exist
import os
if os.path.exists('/content/repo/src/pretraining.py'):
    print("✓ Tier 1 self-supervised pretraining module found")
else:
    print("⚠ WARNING: Tier 1 module not found - running baseline only")

# STEP 3: Start training
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
baseline_name = f"tier1_tuned_{timestamp}"

print("\n" + "="*80)
print("TIER 1 TRAINING: TUNED HYPERPARAMETERS + LITERATURE IMPROVEMENTS")
print("="*80)
print(f"Experiment ID: {baseline_name}")
print(f"Config: colab_optimized_full_tuned.yaml")
print(f"Model: 64/128/256 channels (FULL)")
print(f"\nTIER 1 FEATURES ENABLED:")
print(f"  ✅ Self-supervised pre-training (20 epochs reconstruction)")
print(f"  ✅ Multi-scale temporal attention (4 heads)")
print(f"  ✅ Increased temporal frames (7 frames)")
print(f"  ✅ Expected R² improvement: +15-25%")
print(f"\nTuned Parameters:")
print(f"  - Learning Rate: 0.0005 (reduced from 0.001)")
print(f"  - Warmup Steps: 500 (reduced from 2000)")
print(f"  - Overweight Factor: 2.0 (reduced from 3.5)")
print(f"  - Early Stopping Patience: 10 (reduced from 15)")
print(f"Expected Runtime: 3-4 hours (includes Tier 1 pre-training)")
print(f"Expected GPU Usage: ~11-13GB (batch_size=20, 7 frames)")
print(f"Target: R² > 0.15, MAE < 0.30 km")
print("="*80)
print("\nTraining started... Monitor GPU with: !nvidia-smi\n")
print("Watch for: 'TIER 1: SELF-SUPERVISED PRE-TRAINING ENABLED' banner\n")

%cd /content/repo
!python main.py \
    --config configs/colab_optimized_full_tuned.yaml \
    --save_name {baseline_name} \
    --epochs 50

print("\n" + "="*80)
print("TIER 1 TRAINING COMPLETE!")
print("="*80)
print(f"Model saved to: /content/drive/MyDrive/CloudML/models/trained/{baseline_name}.pth")
print(f"Pre-trained encoder: /content/drive/MyDrive/CloudML/models/pretrained/")
print(f"Results saved to: /content/drive/MyDrive/CloudML/plots/")
print(f"Logs saved to: /content/drive/MyDrive/CloudML/logs/")
print(f"\nCompare with baseline to see Tier 1 improvements!")
print("\nCheck TensorBoard: %load_ext tensorboard")
print("                   %tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/")

In [None]:
# ============================================================================
# OPTION A: FULL MODEL + OPTIMIZATIONS (ORIGINAL)
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
baseline_name = f"baseline_full_{timestamp}"

print("="*80)
print("TRAINING FULL MODEL WITH OPTIMIZATIONS")
print("="*80)
print(f"Experiment ID: {baseline_name}")
print(f"Config: colab_optimized_full.yaml")
print(f"Model: 64/128/256 channels (FULL)")
print(f"Optimizations: Gradient Checkpointing + torch.compile('default' mode)")
print(f"Expected Runtime: 2-2.5 hours (faster with compile)")
print(f"Expected GPU Usage: ~9-10GB (batch_size=20)")
print(f"CUDA Graph Fix: Using 'default' compile mode (compatible with checkpointing)")
print("="*80)
print("\nTraining started... Monitor GPU with: !nvidia-smi\n")

!python main.py \
    --config configs/colab_optimized_full.yaml \
    --save_name {baseline_name} \
    --epochs 50

print("\n" + "="*80)
print("FULL MODEL TRAINING COMPLETE!")
print("="*80)
print(f"Model saved to: /content/drive/MyDrive/CloudML/models/trained/{baseline_name}.pth")
print(f"Results saved to: /content/drive/MyDrive/CloudML/plots/")
print(f"Logs saved to: /content/drive/MyDrive/CloudML/logs/")
print("\nCheck TensorBoard: %load_ext tensorboard")
print("                   %tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/")

In [None]:
# ============================================================================
# OPTION B: FULL MODEL - MAXIMUM STABILITY (NEW - NO torch.compile)
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
baseline_name = f"baseline_full_stable_{timestamp}"

print("="*80)
print("TRAINING FULL MODEL - STABLE MODE")
print("="*80)
print(f"Experiment ID: {baseline_name}")
print(f"Config: colab_full_stable.yaml")
print(f"Model: 64/128/256 channels (FULL)")
print(f"Optimizations: Gradient Checkpointing only (NO torch.compile)")
print(f"Expected Runtime: ~3 hours (no compile speedup)")
print(f"Expected GPU Usage: ~8-9GB (batch_size=16)")
print(f"Stability: MAXIMUM (no CUDA graph issues)")
print("="*80)
print("\nTraining started... Monitor GPU with: !nvidia-smi\n")

!python main.py \
    --config configs/colab_full_stable.yaml \
    --save_name {baseline_name} \
    --epochs 50

print("\n" + "="*80)
print("FULL MODEL STABLE TRAINING COMPLETE!")
print("="*80)
print(f"Model saved to: /content/drive/MyDrive/CloudML/models/trained/{baseline_name}.pth")
print(f"Results saved to: /content/drive/MyDrive/CloudML/plots/")
print(f"Logs saved to: /content/drive/MyDrive/CloudML/logs/")
print("\nCheck TensorBoard: %load_ext tensorboard")
print("                   %tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/")

In [None]:
# ============================================================================
# OPTION C: MEMORY-OPTIMIZED MODEL (FALLBACK IF OOM)
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
baseline_name = f"baseline_memopt_{timestamp}"

print("="*80)
print("TRAINING MEMORY-OPTIMIZED MODEL")
print("="*80)
print(f"Experiment ID: {baseline_name}")
print(f"Config: colab_optimized.yaml")
print(f"Model: 32/64/128 channels (memory-optimized)")
print(f"Expected Runtime: 2.5-3 hours")
print(f"Expected GPU Usage: ~7-8GB (batch_size=16)")
print("="*80)
print("\nTraining started... Monitor GPU with: !nvidia-smi\n")

!python main.py \
    --config configs/colab_optimized.yaml \
    --save_name {baseline_name} \
    --epochs 50

print("\n" + "="*80)
print("MEMORY-OPTIMIZED TRAINING COMPLETE!")
print("="*80)
print(f"Model saved to: /content/drive/MyDrive/CloudML/models/trained/{baseline_name}.pth")
print(f"Results saved to: /content/drive/MyDrive/CloudML/plots/")
print(f"Logs saved to: /content/drive/MyDrive/CloudML/logs/")

---
## Troubleshooting Training Issues

### Common Error: "RuntimeError: static input data pointer changed"

**What it means:** CUDA graph incompatibility between torch.compile and gradient checkpointing.

**When it happens:** Usually during epoch 2+ after compilation completes.

**Solution:** Both full-model configs have been updated to avoid this error. If it still occurs with Option A, switch to Option B:
- `colab_optimized_full.yaml` (Option A) now uses `torch_compile_mode: "default"` (fixed)
- `colab_full_stable.yaml` (Option B) disables torch.compile entirely (most stable)

**Quick fix if error occurs:**
```python
# Stop current run and use stable config:
!python main.py --config configs/colab_full_stable.yaml --save_name baseline_stable --epochs 50
```

### Config Selection Decision Tree:

1. **Start here:** Option A-Tuned (`colab_optimized_full_tuned.yaml`) or Option A (`colab_optimized_full.yaml`)
   - ✅ If it works: Fastest training, best results
   - ❌ If "static input data pointer changed": Go to step 2
   - ❌ If OOM error: Go to step 3

2. **CUDA graph error:** Use Option B (`colab_full_stable.yaml`)
   - ✅ Same model capacity, no compile issues
   - ❌ If OOM error: Go to step 3

3. **Out of Memory:** Use Option C (`colab_optimized.yaml`)
   - ✅ Smallest memory footprint (~7-8GB), so it should fit on a T4
   - ⚠️ Smaller model = may need more training
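
The fallback chain from Option A can be written as a small lookup. This is only a sketch: `classify_error` matches on message keywords, which is a heuristic, and the sample messages are illustrative rather than copied from a real log.

```python
# Fallback chain from the decision tree: config -> error kind -> next config.
FALLBACKS = {
    "colab_optimized_full.yaml": {
        "cuda_graph": "colab_full_stable.yaml",
        "oom": "colab_optimized.yaml",
    },
    "colab_full_stable.yaml": {
        "oom": "colab_optimized.yaml",
    },
}


def classify_error(message: str):
    """Map an error message to 'cuda_graph', 'oom', or None (keyword heuristic)."""
    text = message.lower()
    if "static input data pointer changed" in text:
        return "cuda_graph"
    if "out of memory" in text:
        return "oom"
    return None


def next_config(current: str, message: str):
    """Return the config to try next, or None if the tree has no further fallback."""
    return FALLBACKS.get(current, {}).get(classify_error(message))


print(next_config("colab_optimized_full.yaml",
                  "RuntimeError: static input data pointer changed"))
print(next_config("colab_full_stable.yaml", "CUDA out of memory"))
print(next_config("colab_optimized.yaml", "CUDA out of memory"))
```

The last call returns `None`: once the smallest config runs out of memory, the remaining lever is a smaller `batch_size`.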

### Monitoring Tips:

```python
# Check GPU memory usage:
!nvidia-smi

# Show the most recent training log lines
# (tail -f would block the cell until interrupted):
!tail -n 50 /content/drive/MyDrive/CloudML/logs/training.log

# Kill training if needed:
# Runtime → Interrupt execution (or press stop button)
```
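
To record memory use as numbers rather than read the `nvidia-smi` table by eye, the query interface can be parsed directly. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options; `parse_gpu_line` and `read_gpu` are hypothetical helpers, and the sample row is illustrative, not a measurement.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one 'used, total, util' CSV row (MiB, MiB, percent)."""
    used, total, util = (float(x) for x in line.split(","))
    return {
        "used_mib": used,
        "total_mib": total,
        "util_pct": util,
        "used_frac": used / total,
    }


def read_gpu() -> dict:
    """Query the first GPU; raises if nvidia-smi is missing or fails."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_line(out.strip().splitlines()[0])


# Illustrative row, not a real measurement.
sample = parse_gpu_line("10240, 15360, 93")
print(f"{sample['used_frac']:.0%} of GPU memory in use")
```

On a GPU runtime, `read_gpu()` returns the same dictionary for the live device.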

**See full documentation:** `docs/CUDA_GRAPH_FIX.md`

---
## Ablation Studies (Systematic Component Evaluation)

These experiments isolate the contribution of each component by removing or modifying one aspect at a time.

### Ablation Suite:

1. **Angles Mode - Zenith Only**: Tests if solar azimuth angle adds value
2. **No Spatial Attention**: Evaluates spatial attention contribution
3. **No Temporal Attention**: Evaluates temporal attention contribution
4. **No Attention (Both)**: Tests full attention mechanism value
5. **No Augmentation**: Measures data augmentation impact
6. **Simple MAE Loss**: Compares Huber vs MAE loss
7. **Fewer Temporal Frames**: Tests temporal context importance (3 vs 5 frames)
8. **CNN Architecture**: Compares Transformer vs simple CNN baseline

**Total Runtime**: ~6-8 hours for all 8 ablations
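
The ablation cell passes shared flags (`--batch_size 16 --temporal_frames 5`) before each ablation's own arguments, so an override such as `--temporal_frames 3` only takes effect if the last occurrence of a flag wins. That is how `argparse` treats plain options. Whether `main.py` uses `argparse` is an assumption here, and the parser below is a two-flag stand-in, not the real one.

```python
import argparse
import shlex

# Stand-in parser: assumes main.py uses argparse with plain "store" options.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--temporal_frames", type=int, default=5)

shared = "--batch_size 16 --temporal_frames 5"   # flags every ablation receives
override = "--temporal_frames 3"                 # ablation-specific arguments

# The repeated flag keeps its last value, so the override applies.
args = parser.parse_args(shlex.split(f"{shared} {override}"))
print(args.batch_size, args.temporal_frames)
```

If `main.py` parsed arguments differently (for example with `action="append"`), the shared flags would need to be removed from the command rather than overridden.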

**Run this to execute all ablations sequentially:**

In [None]:
# ============================================================================
# ABLATION STUDIES - SYSTEMATIC EVALUATION
# ============================================================================
import datetime
import json

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# Define ablation experiments
ablations = [
    {
        'name': 'ablation_angles_sza_only',
        'description': 'Use only solar zenith angle (no azimuth)',
        'args': '--angles_mode sza_only',
        'expected_impact': 'Slight performance drop if SAA provides useful info'
    },
    {
        'name': 'ablation_no_spatial_attention',
        'description': 'Disable spatial attention mechanism',
        'args': '--no-use_spatial_attention',
        'expected_impact': 'Moderate drop - spatial attention focuses on clouds'
    },
    {
        'name': 'ablation_no_temporal_attention',
        'description': 'Disable temporal attention mechanism',
        'args': '--no-use_temporal_attention',
        'expected_impact': 'Moderate drop - temporal attention weighs informative frames'
    },
    {
        'name': 'ablation_no_attention',
        'description': 'Disable both attention mechanisms',
        'args': '--no-use_spatial_attention --no-use_temporal_attention',
        'expected_impact': 'Significant drop - demonstrates attention value'
    },
    {
        'name': 'ablation_no_augmentation',
        'description': 'Disable data augmentation',
        'args': '--no-augment',
        'expected_impact': 'Small drop - augmentation helps generalization'
    },
    {
        'name': 'ablation_mae_loss',
        'description': 'Use simple MAE loss instead of Huber',
        'args': '--loss_type mae',
        'expected_impact': 'Slight drop - Huber is robust to outliers'
    },
    {
        'name': 'ablation_fewer_temporal',
        'description': 'Reduce temporal frames from 5 to 3',
        'args': '--temporal_frames 3',
        'expected_impact': 'Moderate drop - less temporal context'
    },
    {
        'name': 'ablation_cnn_baseline',
        'description': 'Simple CNN without transformer',
        'args': '--architecture_name cnn --batch_size 24',  # CNN can handle larger batches
        'expected_impact': 'Significant drop - demonstrates transformer superiority'
    },
]

# Save ablation plan
ablation_log_path = '/content/drive/MyDrive/CloudML/ablation_plan.json'
with open(ablation_log_path, 'w') as f:
    json.dump(ablations, f, indent=2)

print("="*80)
print("SYSTEMATIC ABLATION STUDIES")
print("="*80)
print(f"Total experiments: {len(ablations)}")
print(f"Estimated total time: {len(ablations) * 45} minutes (~{len(ablations) * 45 / 60:.1f} hours)")
print(f"Results will be saved to: /content/drive/MyDrive/CloudML/")
print("="*80)
print()

# Execute each ablation
results_summary = []

for i, ablation in enumerate(ablations, 1):
    print(f"\n{'='*80}")
    print(f"ABLATION {i}/{len(ablations)}: {ablation['name']}")
    print(f"{'='*80}")
    print(f"Description: {ablation['description']}")
    print(f"Expected: {ablation['expected_impact']}")
    print(f"{'='*80}\n")
    
    save_name = f"{ablation['name']}_{timestamp}"
    
    # Run experiment
    !python main.py \
        --config configs/colab_optimized.yaml \
        --save_name {save_name} \
        --epochs 40 \
        --batch_size 16 \
        --temporal_frames 5 \
        {ablation['args']}
    
    print(f"\n✓ Completed: {ablation['name']}\n")
    
    results_summary.append({
        'ablation': ablation['name'],
        'description': ablation['description'],
        'save_name': save_name
    })

# Save summary
summary_path = '/content/drive/MyDrive/CloudML/ablation_summary.json'
with open(summary_path, 'w') as f:
    json.dump(results_summary, f, indent=2)

print("\n" + "="*80)
print("ALL ABLATIONS COMPLETE!")
print("="*80)
print(f"Summary saved to: {summary_path}")
print(f"\nNext steps:")
print("1. Run the aggregation cell below to compile results")
print("2. Check plots in: /content/drive/MyDrive/CloudML/plots/")
print("3. Review metrics for paper Table 1")

---
## Optional: Leave-One-Out Cross-Validation

For the most rigorous evaluation, run LOO CV where each flight is held out once.

**Warning**: This takes 8-12 hours for 6 flights!

Only run this if:
- You have Colab Pro (longer sessions)
- You need LOO results for the paper
- You can monitor the session
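
How `--loo` builds its folds internally is not shown in this notebook. The general scheme, sketched with the six flight names from the data verification step, is one fold per flight, with that flight used only for validation:

```python
FLIGHTS = ["10Feb25", "30Oct24", "04Nov24", "23Oct24", "18Feb25", "12Feb25"]


def leave_one_out(flights):
    """Yield (train_flights, held_out_flight) pairs, one per flight."""
    for i, held_out in enumerate(flights):
        train = flights[:i] + flights[i + 1:]
        yield train, held_out


folds = list(leave_one_out(FLIGHTS))
for train, held_out in folds:
    print(f"validate on {held_out}, train on {len(train)} flights")
```

Six folds means six full training runs, which is where the 8-12 hour estimate comes from.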

In [None]:
# ============================================================================
# LEAVE-ONE-OUT CROSS-VALIDATION (OPTIONAL)
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
loo_name = f"loo_cv_{timestamp}"

print("="*80)
print("LEAVE-ONE-OUT CROSS-VALIDATION")
print("="*80)
print("This will train 6 models (one per flight held out)")
print("Expected Runtime: 8-12 hours")
print("⚠ WARNING: Long training session - ensure you have Colab Pro or can monitor")
print("="*80)
print()

# Confirm before running
confirm = input("Type 'yes' to proceed with LOO CV: ")

if confirm.lower() == 'yes':
    !python main.py \
        --config configs/colab_optimized.yaml \
        --save_name {loo_name} \
        --epochs 40 \
        --batch_size 16 \
        --temporal_frames 5 \
        --loo \
        --loo_epochs 40
    
    print("\n✓ LOO Cross-Validation Complete!")
    print(f"Results saved to: /content/drive/MyDrive/CloudML/")
else:
    print("LOO CV skipped.")

---
## Results Aggregation & Analysis

Compile all results into a summary table for your paper.

In [None]:
# ============================================================================
# AGGREGATE RESULTS FOR PAPER
# ============================================================================
import pandas as pd
import glob
import os

print("Aggregating results...\n")

# Find all CSV result files
results_dir = '/content/drive/MyDrive/CloudML/logs/csv/'
csv_files = glob.glob(os.path.join(results_dir, '*.csv'))

if csv_files:
    all_results = []
    
    for csv_file in csv_files:
        df = pd.read_csv(csv_file)
        exp_name = os.path.basename(csv_file).replace('.csv', '')
        df['experiment'] = exp_name
        all_results.append(df)
    
    # Combine all results
    combined = pd.concat(all_results, ignore_index=True)
    
    # Save combined results
    output_path = '/content/drive/MyDrive/CloudML/all_results_combined.csv'
    combined.to_csv(output_path, index=False)
    
    print(f"✓ Combined {len(csv_files)} result files")
    print(f"✓ Saved to: {output_path}")
    print("\nSummary Statistics by Experiment:")
    print("="*80)
    
    # Group by experiment and show key metrics
    summary = combined.groupby('experiment').agg({
        'mae': 'mean',
        'rmse': 'mean',
        'r2': 'mean'
    }).round(4)
    
    print(summary)
    print("\n✓ Use this table for your paper!")
    
else:
    print("No result files found. Make sure experiments have completed.")

In [None]:
# ============================================================================
# CREATE PAPER-READY COMPARISON TABLE
# ============================================================================
import json
import os

# Load ablation summary
summary_path = '/content/drive/MyDrive/CloudML/ablation_summary.json'

if os.path.exists(summary_path):
    with open(summary_path, 'r') as f:
        ablations = json.load(f)
    
    print("Paper-Ready Ablation Table")
    print("="*100)
    print(f"{'Experiment':<40} | {'Description':<50}")
    print("="*100)
    
    for abl in ablations:
        print(f"{abl['ablation']:<40} | {abl['description']:<50}")
    
    print("="*100)
    print("\nTo get metrics for each experiment, check the CSV files in:")
    print("/content/drive/MyDrive/CloudML/logs/csv/")
    print("\nOr run the aggregation script:")
    print("!python scripts/aggregate_results.py")
else:
    print("Ablation summary not found. Run ablations first.")

In [None]:
# ============================================================================
# OPTION D: Custom Configuration
# ============================================================================
# Edit config file manually and run with custom settings

%cd /content/repo
# !python main.py --config configs/YOUR_CUSTOM_CONFIG.yaml --save_name custom_run --epochs 50

---
## Monitor Training Progress

### 🎯 TIER 1 Training Stages

If you ran **Option A-Tuned**, you'll see two training phases:

**Phase 1: Self-Supervised Pre-training (20 epochs, ~30-45 min)**
- Watch reconstruction loss decrease (target: < 0.01)
- Encoder learns spatial features from images
- Should converge in first 10-15 epochs

**Phase 2: Supervised Training (50 epochs, ~2-3 hours)**
- Normal supervised training with pre-trained encoder
- Watch R² improve (target: > 0.15)
- Should see faster initial convergence vs baseline

---

In [None]:
# ============================================================================
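# GPU MONITORING
# Note: Colab executes one cell at a time, so this cell will not run while a
# training cell is executing. Run it between experiments.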
# ============================================================================
import time
from IPython.display import clear_output

def monitor_gpu(duration=300, interval=5):
    """Monitor GPU usage for specified duration"""
    for i in range(duration // interval):
        clear_output(wait=True)
        print(f"GPU Monitoring (updating every {interval}s, {i*interval}/{duration}s elapsed)\n")
        !nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu,temperature.gpu --format=csv
        time.sleep(interval)

# Run for 5 minutes
monitor_gpu(duration=300, interval=10)

---
## TensorBoard - View Training Curves

**Before running TensorBoard, verify that log files exist:**

In [None]:
# ============================================================================
# VERIFY TENSORBOARD LOGS EXIST
# ============================================================================
import os

tb_dir = "/content/drive/MyDrive/CloudML/logs/tensorboard/"

if os.path.exists(tb_dir):
    runs = [d for d in os.listdir(tb_dir) if os.path.isdir(os.path.join(tb_dir, d))]
    if runs:
        print(f"✓ Found {len(runs)} TensorBoard run(s):")
        for run in sorted(runs):
            run_path = os.path.join(tb_dir, run)
            files = os.listdir(run_path)
            event_files = [f for f in files if 'events.out.tfevents' in f]
            print(f"  - {run}: {len(event_files)} event file(s)")
        print(f"\n✓ Ready to launch TensorBoard!")
    else:
        print(f"✗ TensorBoard directory exists but is empty: {tb_dir}")
        print("  Run a training session first.")
else:
    print(f"✗ TensorBoard directory not found: {tb_dir}")
    print("  Make sure:")
    print("  1. You've run a training session")
    print("  2. Files are saving to Google Drive (check config paths)")
    print("  3. Google Drive is mounted")

In [None]:
# ============================================================================
# TENSORBOARD (View training curves)
# ============================================================================
# Run this cell to launch TensorBoard in the notebook
%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/

# If you see "No dashboards are active", it means:
# 1. No training runs have been completed yet, OR
# 2. Files are not saving to Drive (check verification cell above)

In [None]:
# ============================================================================
# LIST ALL TRAINED MODELS
# ============================================================================
import os
# Import the module (not the class) so earlier cells that call
# datetime.datetime.now() still work if they are re-run after this one.
import datetime

models_dir = '/content/drive/MyDrive/CloudML/models/trained/'

if os.path.exists(models_dir):
    models = sorted(os.listdir(models_dir))
    
    print(f"\nTrained Models ({len(models)} total)")
    print("="*100)
    print(f"{'Model Name':<60} {'Size (MB)':<15} {'Modified':<20}")
    print("="*100)
    
    for model in models:
        path = os.path.join(models_dir, model)
        size_mb = os.path.getsize(path) / (1024 * 1024)
        mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        print(f"{model:<60} {size_mb:>10.1f} MB   {mtime.strftime('%Y-%m-%d %H:%M')}")
    
    print("="*100)
else:
    print("Models directory not found.")

In [None]:
# ============================================================================
# DOWNLOAD RESULTS (Optional - already in Drive)
# ============================================================================
from google.colab import files

# Zip and download results
print("Zipping results for download...")

!cd /content/drive/MyDrive/CloudML && \
    zip -r results_export.zip plots/ logs/csv/ models/trained/ *.json *.csv 2>/dev/null

print("\n✓ Results zipped")
print("Downloading... (this may take a few minutes)")

files.download('/content/drive/MyDrive/CloudML/results_export.zip')

print("\n✓ Download complete!")

---
## Quick Reference

### File Locations
- **Models**: `/content/drive/MyDrive/CloudML/models/trained/`
- **Plots**: `/content/drive/MyDrive/CloudML/plots/`
- **Logs**: `/content/drive/MyDrive/CloudML/logs/`
- **Metrics (CSV)**: `/content/drive/MyDrive/CloudML/logs/csv/`
- **TensorBoard**: `/content/drive/MyDrive/CloudML/logs/tensorboard/`

### Key Metrics for Paper
- **MAE** (Mean Absolute Error): Primary metric in km
- **RMSE** (Root Mean Squared Error): Penalizes large errors
- **R²** (Coefficient of Determination): Model fit quality
- **MAPE** (Mean Absolute Percentage Error): Relative error
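
The four metrics can be computed from their standard definitions. The sketch below uses plain Python and illustrative values (not real results), and shows why R² can be negative, as reported for the first run: a model that predicts worse than the mean of the targets scores below zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R² and MAPE (%) for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # MAPE is undefined when any true value is zero.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


# Illustrative heights in km, not real results.
y_true = [1.0, 2.0, 3.0, 4.0]
constant_guess = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean
worse_guess = [4.0, 3.0, 2.0, 1.0]      # worse than predicting the mean

mean_metrics = regression_metrics(y_true, constant_guess)
worse_metrics = regression_metrics(y_true, worse_guess)
print(mean_metrics["r2"], worse_metrics["r2"])
```

Predicting the mean gives R² = 0, so any positive R² means the model explains some variance the mean does not.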

### Recommended Paper Structure
1. **Baseline Results**: Report MAE, RMSE, R² from baseline training
2. **Ablation Table**: Show Δ metrics for each ablation vs baseline
3. **LOO Results**: If available, show per-flight generalization
4. **Qualitative**: Include attention maps, error distributions
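
For the ablation table, the Δ columns are each ablation's metric minus the baseline's. A minimal sketch, with placeholder numbers that should be replaced by values from the CSV files (`delta_table` is a hypothetical helper):

```python
def delta_table(baseline: dict, ablations: dict) -> list:
    """Return one row per ablation with metric differences relative to the baseline."""
    rows = []
    for name, metrics in ablations.items():
        row = {"experiment": name}
        for key, base_value in baseline.items():
            row[f"d_{key}"] = round(metrics[key] - base_value, 4)
        rows.append(row)
    return rows


# Placeholder numbers for illustration only; substitute values from logs/csv/.
baseline = {"mae": 0.30, "rmse": 0.40, "r2": 0.20}
ablations = {
    "ablation_no_attention": {"mae": 0.35, "rmse": 0.47, "r2": 0.10},
}

rows = delta_table(baseline, ablations)
for row in rows:
    print(row)
```

Positive `d_mae` and `d_rmse` with a negative `d_r2` indicate that removing the component hurt performance.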

### Troubleshooting
- **OOM Error**: Reduce `batch_size` to 16 or lower, or switch to `colab_optimized.yaml`
- **Slow Training**: Check `!nvidia-smi` - GPU should be >80% utilized
- **Session Timeout**: Enable Colab Pro or run shorter experiments
- **Missing Data**: Check `/content/drive/MyDrive/CloudML/data/` structure

### Support
- **Documentation**: See `README.md`, `COLAB_SETUP.md`, `GPU_OPTIMIZATION.md`
- **Issues**: https://github.com/rylanmalarchick/cloudMLPublic/issues
- **Email**: rylan1012@gmail.com