# NASA Cloud-ML Training on Google Colab

**Paper-Quality Baseline & Ablation Studies**

This notebook is optimized for producing publication-ready results on Google Colab's T4 GPU (15GB VRAM).

## What This Notebook Does

1. **Strong Baseline Training** - Train a high-quality model with optimal hyperparameters
2. **Comprehensive Ablation Studies** - Systematic evaluation of each component
3. **GPU-Optimized** - Raises T4 memory usage to ~10-12GB (vs. 3.7GB with default settings)
4. **Reproducible Results** - Fixed seeds, detailed logging, automatic checkpointing
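
"Fixed seeds" in item 4 usually means seeding every random number generator before training starts. The following is a minimal sketch, assuming a hypothetical helper named `set_seed` (it is not a function from the repository); NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the RNGs that commonly affect a training run (hypothetical helper)."""
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import torch
        torch.manual_seed(seed)           # CPU RNG
        torch.cuda.manual_seed_all(seed)  # all GPUs; safe to call without CUDA
    except ImportError:
        pass  # PyTorch not installed


# Re-seeding reproduces the same random draw.
set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```

Seeding alone may not make GPU runs bit-identical; some CUDA kernels are nondeterministic unless settings such as `torch.backends.cudnn.deterministic` are also enabled.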

## Training Pipeline

- **Pre-training Phase**: Model learns on primary flight (30Oct24) - establishes feature representations
- **Final Training Phase**: Fine-tunes on all flights, overweighting the primary flight so the features learned in pre-training are retained
- **Validation**: Held-out flight (12Feb25) for unbiased evaluation
- **LOO Cross-Validation** (Optional): Each flight held out once for robust generalization assessment
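
The list above does not say how overweighting is implemented. One common approach is to give samples from the favored flight a larger sampling weight. The sketch below assumes the primary flight is the one being overweighted; the helper name `flight_sample_weights` and the factor `2.0` are illustrative assumptions, not values taken from the repository.

```python
def flight_sample_weights(sample_flights, primary_flight, overweight=2.0):
    """Return one normalized sampling weight per sample.

    Samples from `primary_flight` get `overweight` times the weight of the rest.
    """
    raw = [overweight if f == primary_flight else 1.0 for f in sample_flights]
    total = sum(raw)
    return [w / total for w in raw]


# Flight label of each training sample (toy example, not real data).
sample_flights = ["30Oct24", "30Oct24", "10Feb25", "04Nov24"]
weights = flight_sample_weights(sample_flights, primary_flight="30Oct24")
print(weights)
```

Weights like these could be passed to a weighted sampler such as `torch.utils.data.WeightedRandomSampler`, so primary-flight samples are drawn more often during fine-tuning.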

## Expected Runtime

- **Baseline Training**: ~2-3 hours (50 epochs with early stopping)
- **Full Ablation Suite**: ~6-8 hours (8 experiments × ~45 min each)
- **With LOO CV**: ~8-12 hours (depending on flights)

## Prerequisites

- Google Drive with data uploaded to `/MyDrive/CloudML/data/`
- Each flight folder must contain: `.h5` (IRAI), `.hdf5` (CPL), `.hdf` (navigation)
- ~2GB Drive space for models and results

---

## Setup (Run Once Per Session)

In [None]:
# ============================================================================
# STEP 1: Mount Google Drive
# ============================================================================
from google.colab import drive
import os

drive.mount('/content/drive')

# Create project directories
!mkdir -p /content/drive/MyDrive/CloudML/data
!mkdir -p /content/drive/MyDrive/CloudML/models
!mkdir -p /content/drive/MyDrive/CloudML/plots
!mkdir -p /content/drive/MyDrive/CloudML/logs

print("✓ Google Drive mounted successfully")
print("✓ Project directories created")

In [None]:
# ============================================================================
# STEP 2: Clone/Update Repository
# ============================================================================
%cd /content

if not os.path.exists('/content/repo'):
    print('Cloning repository...')
    !git clone https://github.com/rylanmalarchick/cloudMLPublic.git repo
else:
    print('Repository exists. Pulling latest changes...')
    %cd /content/repo
    !git pull origin main

%cd /content/repo
print("✓ Repository ready")

In [None]:
# ============================================================================
# STEP 3: Install Dependencies
# ============================================================================
print("Installing dependencies (this may take 5-10 minutes)...\n")

# Install PyTorch with CUDA 12.1 support
!pip install --quiet torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# Clean up potential conflicts
!pip uninstall -y -q mamba-ssm causal-conv1d 2>/dev/null

# Install core dependencies
!pip install --quiet h5py==3.14.0 netCDF4==1.7.2 pyhdf==0.11.6 scikit-learn matplotlib plotly pyyaml

# Install advanced components
!pip install --quiet torch_geometric==2.5.3
!pip install --quiet causal-conv1d==1.4.0
!pip install --quiet mamba-ssm==2.2.2

print("\n✓ All dependencies installed successfully")
print("\nVerifying installation...")

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# ============================================================================
# STEP 4: Verify Data
# ============================================================================
import os

data_dir = '/content/drive/MyDrive/CloudML/data/'

# Expected flights
flights = ['10Feb25', '30Oct24', '04Nov24', '23Oct24', '18Feb25', '12Feb25']

print("Checking data availability...\n")
missing_data = []

for flight in flights:
    flight_path = os.path.join(data_dir, flight)
    if os.path.exists(flight_path):
        files = os.listdir(flight_path)
        has_h5 = any(f.endswith('.h5') for f in files)
        has_hdf5 = any(f.endswith('.hdf5') for f in files)
        has_hdf = any(f.endswith('.hdf') for f in files)
        
        if has_h5 and has_hdf5 and has_hdf:
            print(f"✓ {flight}: All files present")
        else:
            print(f"⚠ {flight}: Missing files (h5={has_h5}, hdf5={has_hdf5}, hdf={has_hdf})")
            missing_data.append(flight)
    else:
        print(f"✗ {flight}: Folder not found")
        missing_data.append(flight)

if missing_data:
    print(f"\n⚠ WARNING: {len(missing_data)} flight(s) missing data")
    print("Training will proceed with available flights only.")
else:
    print("\n✓ All data verified successfully!")

---
## Baseline Training (High-Quality Model)

This section trains the **strongest possible baseline** for your paper.

### Configuration:
- **Batch Size**: 24 (optimized for T4 GPU)
- **Temporal Frames**: 5 (strong temporal context)
- **Epochs**: 50 (with early stopping)
- **Architecture**: Transformer with spatial & temporal attention
- **Data Augmentation**: Enabled
- **All Features**: Both SZA and SAA angles, full preprocessing pipeline
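
Early stopping ends training once the validation loss stops improving for a set number of epochs (the training cell passes `--early_stopping_patience 15`). The class below is a generic sketch of the idea under that assumption, not the repository's implementation; `EarlyStopping` and `min_delta` are hypothetical names.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 15, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation losses with a short patience so the effect is visible.
stopper = EarlyStopping(patience=2)
losses = [0.9, 0.8, 0.85, 0.81, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

With a patience of 2, the loop stops at epoch 4, after two consecutive epochs without improvement over the best loss of 0.8.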

### Expected Performance:
- GPU Memory Usage: ~10-12GB (about 65-80% of the T4's 15GB)
- Training Time: ~2-3 hours
- Results: State-of-the-art CBH prediction

**Run this cell and come back in 2-3 hours!**

In [None]:
# ============================================================================
# BASELINE TRAINING - THE STRONGEST MODEL
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
baseline_name = f"baseline_paper_{timestamp}"

print("="*80)
print("TRAINING PAPER BASELINE MODEL")
print("="*80)
print(f"Experiment ID: {baseline_name}")
print(f"Config: colab_optimized.yaml")
print(f"Expected Runtime: 2-3 hours")
print(f"Expected GPU Usage: ~10-11GB (batch_size=24)")
print("="*80)
print("\nTraining started... Monitor GPU with: !nvidia-smi\n")

!python main.py \
    --config configs/colab_optimized.yaml \
    --save_name {baseline_name} \
    --epochs 50 \
    --batch_size 24 \
    --temporal_frames 5 \
    --learning_rate 0.001 \
    --weight_decay 0.04 \
    --early_stopping_patience 15 \
    --use_spatial_attention \
    --use_temporal_attention \
    --augment \
    --angles_mode both \
    --loss_type huber \
    --architecture_name transformer

print("\n" + "="*80)
print("BASELINE TRAINING COMPLETE!")
print("="*80)
print(f"Model saved to: /content/drive/MyDrive/CloudML/models/trained/{baseline_name}.pth")
print(f"Results saved to: /content/drive/MyDrive/CloudML/plots/")
print(f"Logs saved to: /content/drive/MyDrive/CloudML/logs/")
print("\nCheck TensorBoard: %load_ext tensorboard")
print("                   %tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/")

---
## Ablation Studies (Systematic Component Evaluation)

These experiments isolate the contribution of each component by removing or modifying one aspect at a time.
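
The one-change-at-a-time idea can be sketched as a baseline configuration plus single-key overrides. The keys below mirror the command-line flags used in this notebook; the helpers `make_ablation` and `changed_keys` are hypothetical and only illustrate the bookkeeping.

```python
baseline = {
    "angles_mode": "both",
    "use_spatial_attention": True,
    "use_temporal_attention": True,
    "augment": True,
    "loss_type": "huber",
    "temporal_frames": 5,
    "architecture_name": "transformer",
}


def make_ablation(base: dict, **overrides) -> dict:
    """Copy the baseline config and change only the given keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return {**base, **overrides}


def changed_keys(base: dict, variant: dict) -> list:
    """List the keys whose values differ from the baseline."""
    return [k for k in base if base[k] != variant[k]]


no_spatial = make_ablation(baseline, use_spatial_attention=False)
mae_variant = make_ablation(baseline, loss_type="mae")
print(changed_keys(baseline, no_spatial))
print(changed_keys(baseline, mae_variant))
```

Checking `changed_keys` for each variant is a cheap way to confirm that an ablation differs from the baseline in exactly the intended setting.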

### Ablation Suite:

1. **Angles Mode - Zenith Only**: Tests if solar azimuth angle adds value
2. **No Spatial Attention**: Evaluates spatial attention contribution
3. **No Temporal Attention**: Evaluates temporal attention contribution
4. **No Attention (Both)**: Tests full attention mechanism value
5. **No Augmentation**: Measures data augmentation impact
6. **Simple MAE Loss**: Compares Huber vs MAE loss
7. **Fewer Temporal Frames**: Tests temporal context importance (3 vs 5 frames)
8. **CNN Architecture**: Compares Transformer vs simple CNN baseline
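
For item 6, the difference between the two losses can be shown in a few lines. Huber loss is quadratic for small errors and linear for large ones, so a single outlier contributes less than it would under a squared loss. This is a plain-Python sketch assuming a threshold `delta=1.0` (the default of `torch.nn.HuberLoss`); the value used by the repository is not stated here.

```python
def mae_loss(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def huber_loss(errors, delta=1.0):
    """Mean Huber loss: 0.5*e^2 if |e| <= delta, else delta*(|e| - 0.5*delta)."""
    total = 0.0
    for e in errors:
        a = abs(e)
        if a <= delta:
            total += 0.5 * a * a
        else:
            total += delta * (a - 0.5 * delta)
    return total / len(errors)


errors = [0.1, -0.2, 3.0]  # toy residuals; the last one is an outlier
print(f"MAE:   {mae_loss(errors):.4f}")
print(f"Huber: {huber_loss(errors):.4f}")
```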

**Total Runtime**: ~6-8 hours for all 8 ablations

**Run this to execute all ablations sequentially:**

In [None]:
# ============================================================================
# ABLATION STUDIES - SYSTEMATIC EVALUATION
# ============================================================================
import datetime
import json

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

# Define ablation experiments
ablations = [
    {
        'name': 'ablation_angles_sza_only',
        'description': 'Use only solar zenith angle (no azimuth)',
        'args': '--angles_mode sza_only',
        'expected_impact': 'Slight performance drop if SAA provides useful info'
    },
    {
        'name': 'ablation_no_spatial_attention',
        'description': 'Disable spatial attention mechanism',
        'args': '--no-use_spatial_attention',
        'expected_impact': 'Moderate drop - spatial attention focuses on clouds'
    },
    {
        'name': 'ablation_no_temporal_attention',
        'description': 'Disable temporal attention mechanism',
        'args': '--no-use_temporal_attention',
        'expected_impact': 'Moderate drop - temporal attention weighs informative frames'
    },
    {
        'name': 'ablation_no_attention',
        'description': 'Disable both attention mechanisms',
        'args': '--no-use_spatial_attention --no-use_temporal_attention',
        'expected_impact': 'Significant drop - demonstrates attention value'
    },
    {
        'name': 'ablation_no_augmentation',
        'description': 'Disable data augmentation',
        'args': '--no-augment',
        'expected_impact': 'Small drop - augmentation helps generalization'
    },
    {
        'name': 'ablation_mae_loss',
        'description': 'Use simple MAE loss instead of Huber',
        'args': '--loss_type mae',
        'expected_impact': 'Slight drop - Huber is robust to outliers'
    },
    {
        'name': 'ablation_fewer_temporal',
        'description': 'Reduce temporal frames from 5 to 3',
        'args': '--temporal_frames 3',
        'expected_impact': 'Moderate drop - less temporal context'
    },
    {
        'name': 'ablation_cnn_baseline',
        'description': 'Simple CNN without transformer',
        'args': '--architecture_name cnn --batch_size 48',  # CNN can handle larger batches
        'expected_impact': 'Significant drop - demonstrates transformer superiority'
    },
]

# Save ablation plan
ablation_log_path = '/content/drive/MyDrive/CloudML/ablation_plan.json'
with open(ablation_log_path, 'w') as f:
    json.dump(ablations, f, indent=2)

print("="*80)
print("SYSTEMATIC ABLATION STUDIES")
print("="*80)
print(f"Total experiments: {len(ablations)}")
print(f"Estimated total time: {len(ablations) * 45} minutes (~{len(ablations) * 45 / 60:.1f} hours)")
print(f"Results will be saved to: /content/drive/MyDrive/CloudML/")
print("="*80)
print()

# Execute each ablation
results_summary = []

for i, ablation in enumerate(ablations, 1):
    print(f"\n{'='*80}")
    print(f"ABLATION {i}/{len(ablations)}: {ablation['name']}")
    print(f"{'='*80}")
    print(f"Description: {ablation['description']}")
    print(f"Expected: {ablation['expected_impact']}")
    print(f"{'='*80}\n")
    
    save_name = f"{ablation['name']}_{timestamp}"
    
    # Run experiment
    !python main.py \
        --config configs/colab_optimized.yaml \
        --save_name {save_name} \
        --epochs 40 \
        --batch_size 24 \
        --temporal_frames 5 \
        {ablation['args']}
    
    print(f"\n✓ Completed: {ablation['name']}\n")
    
    results_summary.append({
        'ablation': ablation['name'],
        'description': ablation['description'],
        'save_name': save_name
    })

# Save summary
summary_path = '/content/drive/MyDrive/CloudML/ablation_summary.json'
with open(summary_path, 'w') as f:
    json.dump(results_summary, f, indent=2)

print("\n" + "="*80)
print("ALL ABLATIONS COMPLETE!")
print("="*80)
print(f"Summary saved to: {summary_path}")
print(f"\nNext steps:")
print("1. Run the aggregation cell below to compile results")
print("2. Check plots in: /content/drive/MyDrive/CloudML/plots/")
print("3. Review metrics for paper Table 1")

---
## Optional: Leave-One-Out Cross-Validation

For the most rigorous evaluation, run LOO CV where each flight is held out once.

**Warning**: This takes 8-12 hours for 6 flights!

Only run this if:
- You have Colab Pro (longer sessions)
- You need LOO results for the paper
- You can monitor the session
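
Leave-one-out over flights means one training run per flight, with that flight used only for validation. A minimal sketch of the fold construction, using the flight list from the data verification step; `leave_one_out_folds` is a hypothetical helper, and `main.py --loo` may organize its folds differently.

```python
flights = ["10Feb25", "30Oct24", "04Nov24", "23Oct24", "18Feb25", "12Feb25"]


def leave_one_out_folds(items):
    """Yield (train_items, held_out_item), holding each item out exactly once."""
    for i, held_out in enumerate(items):
        train = items[:i] + items[i + 1:]
        yield train, held_out


folds = list(leave_one_out_folds(flights))
for train, held_out in folds:
    print(f"validate on {held_out}, train on {train}")
```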

In [None]:
# ============================================================================
# LEAVE-ONE-OUT CROSS-VALIDATION (OPTIONAL)
# ============================================================================
import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
loo_name = f"loo_cv_{timestamp}"

print("="*80)
print("LEAVE-ONE-OUT CROSS-VALIDATION")
print("="*80)
print("This will train 6 models (one per flight held out)")
print("Expected Runtime: 8-12 hours")
print("⚠ WARNING: Long training session - ensure you have Colab Pro or can monitor")
print("="*80)
print()

# Confirm before running
confirm = input("Type 'yes' to proceed with LOO CV: ")

if confirm.lower() == 'yes':
    !python main.py \
        --config configs/colab_optimized.yaml \
        --save_name {loo_name} \
        --epochs 40 \
        --batch_size 24 \
        --temporal_frames 5 \
        --loo \
        --loo_epochs 40
    
    print("\n✓ LOO Cross-Validation Complete!")
    print(f"Results saved to: /content/drive/MyDrive/CloudML/")
else:
    print("LOO CV skipped.")

---
## Results Aggregation & Analysis

Compile all results into a summary table for your paper.

In [None]:
# ============================================================================
# AGGREGATE RESULTS FOR PAPER
# ============================================================================
import pandas as pd
import glob
import os

print("Aggregating results...\n")

# Find all CSV result files
results_dir = '/content/drive/MyDrive/CloudML/logs/csv/'
csv_files = glob.glob(os.path.join(results_dir, '*.csv'))

if csv_files:
    all_results = []
    
    for csv_file in csv_files:
        df = pd.read_csv(csv_file)
        exp_name = os.path.basename(csv_file).replace('.csv', '')
        df['experiment'] = exp_name
        all_results.append(df)
    
    # Combine all results
    combined = pd.concat(all_results, ignore_index=True)
    
    # Save combined results
    output_path = '/content/drive/MyDrive/CloudML/all_results_combined.csv'
    combined.to_csv(output_path, index=False)
    
    print(f"✓ Combined {len(csv_files)} result files")
    print(f"✓ Saved to: {output_path}")
    print("\nSummary Statistics by Experiment:")
    print("="*80)
    
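    # Group by experiment and show key metrics (only columns present in the CSVs)
    metric_cols = [c for c in ('mae', 'rmse', 'r2') if c in combined.columns]
    if metric_cols:
        summary = combined.groupby('experiment')[metric_cols].mean().round(4)
        print(summary)
        print("\n✓ Use this table for your paper!")
    else:
        print("No 'mae', 'rmse', or 'r2' columns found in the result files.")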
    
else:
    print("No result files found. Make sure experiments have completed.")

In [None]:
# ============================================================================
# CREATE PAPER-READY COMPARISON TABLE
# ============================================================================
import json
import os

# Load ablation summary
summary_path = '/content/drive/MyDrive/CloudML/ablation_summary.json'

if os.path.exists(summary_path):
    with open(summary_path, 'r') as f:
        ablations = json.load(f)
    
    print("Paper-Ready Ablation Table")
    print("="*100)
    print(f"{'Experiment':<40} | {'Description':<50}")
    print("="*100)
    
    for abl in ablations:
        print(f"{abl['ablation']:<40} | {abl['description']:<50}")
    
    print("="*100)
    print("\nTo get metrics for each experiment, check the CSV files in:")
    print("/content/drive/MyDrive/CloudML/logs/csv/")
    print("\nOr run the aggregation script:")
    print("!python scripts/aggregate_results.py")
else:
    print("Ablation summary not found. Run ablations first.")

---
## Utilities & Monitoring

In [None]:
# ============================================================================
# GPU MONITORING (Colab runs one cell at a time: use before/after training)
# ============================================================================
import time
from IPython.display import clear_output

def monitor_gpu(duration=300, interval=5):
    """Monitor GPU usage for specified duration"""
    for i in range(duration // interval):
        clear_output(wait=True)
        print(f"GPU Monitoring (updating every {interval}s, {i*interval}/{duration}s elapsed)\n")
        !nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu,temperature.gpu --format=csv
        time.sleep(interval)

# Run for 5 minutes
monitor_gpu(duration=300, interval=10)

In [None]:
# ============================================================================
# TENSORBOARD (View training curves)
# ============================================================================
%load_ext tensorboard
%tensorboard --logdir /content/drive/MyDrive/CloudML/logs/tensorboard/

In [None]:
# ============================================================================
# LIST ALL TRAINED MODELS
# ============================================================================
import os
from datetime import datetime

models_dir = '/content/drive/MyDrive/CloudML/models/trained/'

if os.path.exists(models_dir):
    models = sorted(os.listdir(models_dir))
    
    print(f"\nTrained Models ({len(models)} total)")
    print("="*100)
    print(f"{'Model Name':<60} {'Size (MB)':<15} {'Modified':<20}")
    print("="*100)
    
    for model in models:
        path = os.path.join(models_dir, model)
        size_mb = os.path.getsize(path) / (1024 * 1024)
        mtime = datetime.fromtimestamp(os.path.getmtime(path))
        print(f"{model:<60} {size_mb:>10.1f} MB   {mtime.strftime('%Y-%m-%d %H:%M')}")
    
    print("="*100)
else:
    print("Models directory not found.")

In [None]:
# ============================================================================
# DOWNLOAD RESULTS (Optional - already in Drive)
# ============================================================================
from google.colab import files

# Zip and download results
print("Zipping results for download...")

!cd /content/drive/MyDrive/CloudML && \
    zip -r results_export.zip plots/ logs/csv/ models/trained/ *.json *.csv 2>/dev/null

print("\n✓ Results zipped")
print("Downloading... (this may take a few minutes)")

files.download('/content/drive/MyDrive/CloudML/results_export.zip')

print("\n✓ Download started (check your browser's downloads)")

---
## Quick Reference

### File Locations
- **Models**: `/content/drive/MyDrive/CloudML/models/trained/`
- **Plots**: `/content/drive/MyDrive/CloudML/plots/`
- **Logs**: `/content/drive/MyDrive/CloudML/logs/`
- **Metrics (CSV)**: `/content/drive/MyDrive/CloudML/logs/csv/`
- **TensorBoard**: `/content/drive/MyDrive/CloudML/logs/tensorboard/`

### Key Metrics for Paper
- **MAE** (Mean Absolute Error): Primary metric in km
- **RMSE** (Root Mean Squared Error): Penalizes large errors
- **R²** (Coefficient of Determination): Model fit quality
- **MAPE** (Mean Absolute Percentage Error): Relative error
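
The four metrics above follow their standard definitions, sketched here in plain Python. The function name `regression_metrics` is hypothetical, and the inputs are toy values, not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R^2, and MAPE (in percent) from paired lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all targets are equal
    # MAPE is undefined when any true value is zero.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Because MAPE divides by the true value, it becomes unstable for targets close to zero, which is one reason to treat MAE as the primary metric.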

### Recommended Paper Structure
1. **Baseline Results**: Report MAE, RMSE, R² from baseline training
2. **Ablation Table**: Show Δ metrics for each ablation vs baseline
3. **LOO Results**: If available, show per-flight generalization
4. **Qualitative**: Include attention maps, error distributions
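
The Δ columns in item 2 are the ablation metric minus the baseline metric. A small sketch, assuming a hypothetical helper `delta_table`; the numbers are placeholders for illustration only and are not results from this project.

```python
def delta_table(baseline, ablations):
    """Return rows of (name, {metric: ablation_value - baseline_value})."""
    rows = []
    for name, metrics in ablations.items():
        deltas = {m: round(metrics[m] - baseline[m], 4) for m in baseline}
        rows.append((name, deltas))
    return rows


# Placeholder numbers, NOT measured results.
dummy_baseline = {"mae": 1.0, "rmse": 2.0}
dummy_ablations = {"ablation_a": {"mae": 1.5, "rmse": 2.25}}

rows = delta_table(dummy_baseline, dummy_ablations)
for name, deltas in rows:
    print(name, deltas)
```

For error metrics such as MAE and RMSE, a positive Δ means the ablated model is worse than the baseline; for R² the sign convention is reversed.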

### Troubleshooting
- **OOM Error**: Reduce `batch_size` from 24 to 16 or lower
- **Slow Training**: Check `!nvidia-smi` - GPU should be >80% utilized
- **Session Timeout**: Enable Colab Pro or run shorter experiments
- **Missing Data**: Check `/content/drive/MyDrive/CloudML/data/` structure
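
For session timeouts, one option is to skip experiments that already have a saved model when the notebook is re-run. The sketch below assumes checkpoints are named `<experiment>_<timestamp>.pth`, as the save names in this notebook suggest; `pending_experiments` is a hypothetical helper, and the demo uses a temporary directory in place of `models/trained/`.

```python
import os
import tempfile


def pending_experiments(names, models_dir):
    """Return the names that have no saved '<name>_<timestamp>.pth' file yet."""
    existing = os.listdir(models_dir) if os.path.isdir(models_dir) else []
    pending = []
    for name in names:
        done = any(
            f.startswith(name + "_") and f.endswith(".pth") for f in existing
        )
        if not done:
            pending.append(name)
    return pending


# Demo: one experiment already has a checkpoint, the other does not.
with tempfile.TemporaryDirectory() as models_dir:
    done_file = os.path.join(models_dir, "ablation_mae_loss_20250101_000000.pth")
    open(done_file, "w").close()
    todo = pending_experiments(
        ["ablation_mae_loss", "ablation_no_augmentation"], models_dir
    )
print(todo)
```

A checkpoint file may exist for a run that was interrupted before finishing, so this check is only a first filter; confirming against the logs is safer.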

### Support
- **Documentation**: See `README.md`, `COLAB_SETUP.md`, `GPU_OPTIMIZATION.md`
- **Issues**: https://github.com/rylanmalarchick/cloudMLPublic/issues
- **Email**: rylan1012@gmail.com