# MAPPO Traffic Light Training - Automated Checkpoint Management

This notebook automates the entire training workflow on Kaggle:
- **Background training** with `nohup` (doesn't block notebook)
- **Automatic checkpoint packaging** (monitors and zips new checkpoints)
- **Dataset folder creation** (ready for Kaggle download)
- **Progress monitoring** (check logs anytime)

## Benefits
✅ Train for 9 hours uninterrupted  
✅ No manual zip creation needed  
✅ Checkpoints auto-packaged every hour  
✅ Download anytime from Output tab  

---

## 1️⃣ Install Dependencies & Setup

Install required packages and verify GPU availability.

In [None]:
# Install SUMO
!apt-get update -qq
!apt-get install -y sumo sumo-tools sumo-doc > /dev/null 2>&1

# Set SUMO_HOME
import os
os.environ['SUMO_HOME'] = '/usr/share/sumo'

# Install Python packages
!pip install -q traci torch tensorboard

# Verify GPU
import torch
print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
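
Because the `apt-get install` line above discards its output, a failed SUMO install would go unnoticed until training starts. The following is a minimal sketch of a sanity check, assuming SUMO is expected under `SUMO_HOME` with a `tools` subdirectory and a `sumo` binary on `PATH`. The helper name `check_sumo_install` is hypothetical and not part of the training code.

```python
import os
import shutil


def check_sumo_install(sumo_home=None):
    """Report whether SUMO looks usable (hypothetical helper, not from the training code)."""
    sumo_home = sumo_home or os.environ.get("SUMO_HOME", "")
    tools_dir = os.path.join(sumo_home, "tools") if sumo_home else ""
    return {
        "sumo_home": sumo_home,
        "sumo_home_exists": bool(sumo_home) and os.path.isdir(sumo_home),
        "tools_dir_exists": bool(tools_dir) and os.path.isdir(tools_dir),
        "sumo_binary": shutil.which("sumo"),  # None if `sumo` is not on PATH
    }


# Print one line per check so problems are visible before training starts
for key, value in check_sumo_install().items():
    print(f"{key}: {value}")
```

If `sumo_binary` prints `None`, rerunning the install without `> /dev/null 2>&1` should show the underlying error.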

## 2️⃣ Upload Code Files

Upload your training files to Kaggle. You need:
- `s1/mappo_k1_implementation.py` (Phase 1 improved version)
- `s1/k1.net.xml`
- `s1/k1_routes_3h_varying.rou.xml`
- `s1/k1_3h_varying.sumocfg`
- `kaggle_auto_checkpoint.py` (checkpoint manager)

**Option A:** Upload as Kaggle Dataset  
**Option B:** Use file upload widget below

In [None]:
# Create s1 directory structure
!mkdir -p s1 mappo_models

# If files are in a Kaggle dataset, copy them
# Example: !cp -r /kaggle/input/traffic-light-code/* .

# Or use this to verify files exist
import os
required_files = [
    's1/mappo_k1_implementation.py',
    's1/k1.net.xml',
    's1/k1_routes_3h_varying.rou.xml',
    's1/k1_3h_varying.sumocfg',
    'kaggle_auto_checkpoint.py'
]

print("📁 Checking required files:")
for f in required_files:
    exists = os.path.exists(f)
    symbol = "✅" if exists else "❌"
    print(f"{symbol} {f}")

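# If using a file upload widget instead: the snippet below relies on google.colab,
# which is specific to Google Colab and may not be importable on Kaggle.
# from google.colab import files
# uploaded = files.upload()  # Upload kaggle_auto_checkpoint.py and mappo files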
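
The check above only prints a status per file. For a multi-hour run it can be safer to stop immediately when something is missing. A minimal sketch, assuming the same relative paths as above (`find_missing` is a hypothetical helper name):

```python
import os


def find_missing(paths, root="."):
    """Return the subset of `paths` that are not regular files under `root`."""
    return [p for p in paths if not os.path.isfile(os.path.join(root, p))]


# Usage (uncomment in the notebook, after `required_files` is defined):
# missing = find_missing(required_files)
# if missing:
#     raise FileNotFoundError(f"Missing required files: {missing}")
```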

## 3️⃣ Start Training in Background

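Use `nohup` with a trailing `&` to run training in the background. This means:
- Training keeps running after the launching cell returns, and `nohup` shields it from hangup signals (the Kaggle session itself still has to stay alive)
- You can run other cells while training
- Logs are saved to `training.log`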

In [None]:
# Start training in background (50 episodes for Phase 1 testing)
!nohup python s1/mappo_k1_implementation.py --num-episodes 50 --device cuda > training.log 2>&1 &

# Wait a moment for process to start
import time
time.sleep(3)

# Check if training started (no output means the process is not running)
!ps aux | grep mappo_k1 | grep -v grep

print("\n✅ Training launched in background (check the process list above)")
print("📊 Monitor progress: !tail -50 training.log")
print("🛑 Stop training: !pkill -f mappo_k1_implementation")
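
As an alternative to `!nohup ... &`, the same launch can be sketched with `subprocess.Popen`, which returns a handle that can be polled or stopped later without `grep`. This is a sketch under assumptions, not the notebook's own method. `start_background` is a hypothetical helper, and `start_new_session=True` is POSIX-only.

```python
import subprocess
import sys


def start_background(cmd, log_path):
    """Start `cmd` in its own session, appending stdout and stderr to `log_path`."""
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdout=log_file,
            stderr=subprocess.STDOUT,   # merge stderr into the same log
            stdin=subprocess.DEVNULL,
            start_new_session=True,     # detach from the notebook's process group
        )
    finally:
        log_file.close()  # the child keeps its own copy of the file descriptor
    return proc


# Usage (uncomment in the notebook):
# train_proc = start_background(
#     [sys.executable, "s1/mappo_k1_implementation.py",
#      "--num-episodes", "50", "--device", "cuda"],
#     "training.log",
# )
# print("running" if train_proc.poll() is None else f"exited: {train_proc.returncode}")
```

`proc.poll()` returns `None` while the process is alive, which gives a direct status check.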

## 4️⃣ Start Automatic Checkpoint Manager

This runs in a separate process and:
- Checks for new checkpoints every hour
- Automatically zips them into dataset folder
- Keeps only the last 3 checkpoints (saves space)
- Runs independently from training

In [None]:
# Start checkpoint monitor in background
# Checks every hour (3600s) and keeps last 3 checkpoints
!nohup python kaggle_auto_checkpoint.py --monitor --interval 3600 --keep-last 3 > checkpoint_monitor.log 2>&1 &

import time
time.sleep(2)

# Check if monitor started (no output means the process is not running)
!ps aux | grep kaggle_auto_checkpoint | grep -v grep

print("\n✅ Checkpoint monitor launched (check the process list above)")
print("📦 Monitors: mappo_models/checkpoint_*")
print("📂 Zips to: /kaggle/working/mappo-checkpoint-dataset/")
print("📊 Check status: !tail -20 checkpoint_monitor.log")
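
The source of `kaggle_auto_checkpoint.py` is not shown in this notebook, so the following is only an illustrative sketch of the zip-and-prune idea described above, not the script's actual implementation. It assumes checkpoints are directories named `checkpoint_*` whose names sort chronologically, and it does not check whether a checkpoint is still being written. `package_new_checkpoints` is a hypothetical name.

```python
import glob
import os
import shutil


def package_new_checkpoints(models_dir, dataset_dir, keep_last=3):
    """Zip the newest `keep_last` checkpoint dirs and delete older zips.

    Returns the list of zip files created by this call.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    os.makedirs(dataset_dir, exist_ok=True)

    # Assumption: names embed a timestamp, so name order is chronological
    ckpts = sorted(
        p for p in glob.glob(os.path.join(models_dir, "checkpoint_*"))
        if os.path.isdir(p)
    )

    created = []
    for ckpt in ckpts[-keep_last:]:
        base = os.path.join(dataset_dir, os.path.basename(ckpt))
        if not os.path.exists(base + ".zip"):
            # make_archive appends ".zip" and returns the archive path
            created.append(shutil.make_archive(base, "zip", root_dir=ckpt))

    # Prune by name order so older archives are removed first
    zips = sorted(glob.glob(os.path.join(dataset_dir, "*.zip")))
    for old in zips[:-keep_last]:
        os.remove(old)
    return created
```

A monitor loop would call this once per interval, for example followed by `time.sleep(3600)`.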

## 5️⃣ Monitor Training Progress

Run these cells anytime to check status. Training runs independently in the background.

In [None]:
# Check latest training logs (last 50 lines)
!tail -50 training.log

In [None]:
# Check checkpoint monitor logs
!tail -20 checkpoint_monitor.log

In [None]:
# Check running processes
print("🔍 Training process:")
!ps aux | grep mappo_k1_implementation | grep -v grep

print("\n🔍 Checkpoint monitor process:")
!ps aux | grep kaggle_auto_checkpoint | grep -v grep

In [None]:
# List all checkpoints and their sizes
import os
import glob
from datetime import datetime

print("📦 Checkpoint Status:")
print("="*70)

# Original checkpoints (directories only, so os.listdir below cannot fail on a stray file)
checkpoints = sorted(p for p in glob.glob("mappo_models/checkpoint_*") if os.path.isdir(p))
if checkpoints:
    print(f"\n🗂️  Raw Checkpoints (mappo_models/):")
    for ckpt in checkpoints[-5:]:  # Show last 5
        size = sum(os.path.getsize(os.path.join(ckpt, f))
                   for f in os.listdir(ckpt) if os.path.isfile(os.path.join(ckpt, f)))
        size_mb = size / (1024*1024)
        mod_time = datetime.fromtimestamp(os.path.getmtime(ckpt))
        print(f"  • {os.path.basename(ckpt)}: {size_mb:.1f} MB - {mod_time.strftime('%H:%M:%S')}")

# Zipped checkpoints
zips = sorted(glob.glob("/kaggle/working/mappo-checkpoint-dataset/*.zip"))
if zips:
    print(f"\n📦 Packaged Checkpoints (dataset folder):")
    total_size = 0
    for zip_path in zips:
        size_mb = os.path.getsize(zip_path) / (1024*1024)
        total_size += size_mb
        mod_time = datetime.fromtimestamp(os.path.getmtime(zip_path))
        print(f"  • {os.path.basename(zip_path)}: {size_mb:.1f} MB - {mod_time.strftime('%H:%M:%S')}")
    print(f"\n  Total size: {total_size:.1f} MB")
else:
    print("\n⏳ No checkpoints packaged yet (monitor runs every hour)")

print("="*70)
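
The size calculation above only counts files directly inside each checkpoint directory. If a checkpoint contains subdirectories (an assumption, since the checkpoint layout is not shown here), a recursive total is more accurate. A minimal sketch with a hypothetical helper name:

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# Usage: size_mb = dir_size_bytes("mappo_models/checkpoint_...") / (1024 * 1024)
```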

## 6️⃣ Manual Checkpoint Packaging (Optional)

If you want to package a specific checkpoint immediately instead of waiting for the hourly check:

In [None]:
# Package latest checkpoint immediately
!python kaggle_auto_checkpoint.py

# Or package a specific checkpoint:
# !python kaggle_auto_checkpoint.py --checkpoint-dir mappo_models/checkpoint_time_20251124_120000
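
To fill in `--checkpoint-dir` without typing the timestamp by hand, the newest checkpoint directory can be located by modification time. A minimal sketch, assuming checkpoints are directories matching `mappo_models/checkpoint_*` (`latest_checkpoint` is a hypothetical helper):

```python
import glob
import os


def latest_checkpoint(models_dir="mappo_models"):
    """Return the most recently modified checkpoint directory, or None if there is none."""
    candidates = [
        p for p in glob.glob(os.path.join(models_dir, "checkpoint_*"))
        if os.path.isdir(p)
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None


print("Latest checkpoint:", latest_checkpoint())
```

A checkpoint that training is still writing may also be the newest one, so it is worth confirming in `training.log` that the save has finished before packaging it.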

## 7️⃣ Download Checkpoints

Get download links for all packaged checkpoints. Click a link to download the file, or fetch the same files from the Kaggle Output tab.

In [None]:
from IPython.display import FileLink, display
import glob
import os

print("📥 Download Links for Packaged Checkpoints:")
print("="*70)

zips = sorted(glob.glob("/kaggle/working/mappo-checkpoint-dataset/*.zip"))

if zips:
    for zip_path in zips:
        size_mb = os.path.getsize(zip_path) / (1024*1024)
        print(f"\n📦 {os.path.basename(zip_path)} ({size_mb:.1f} MB)")
        # FileLink usually needs a path relative to the working directory to produce a usable link
        display(FileLink(os.path.relpath(zip_path)))
else:
    print("\n⏳ No checkpoint zips available yet")
    print("   The monitor packages checkpoints every hour")
    print("   Or run the manual packaging cell above")

print("\n" + "="*70)
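
Since the monitor may be writing an archive at the moment the links are generated, it can help to verify each zip before downloading it. A minimal sketch using the standard `zipfile` module (`zip_is_intact` is a hypothetical helper name):

```python
import zipfile


def zip_is_intact(zip_path):
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() returns the name of the first corrupt member, or None
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # Truncated or partially written archives typically fail to open
        return False


# Usage: skip any archive for which zip_is_intact(zip_path) is False
```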

## 8️⃣ Stop Training (When Done)

Stop training and checkpoint monitor when you're finished.

In [None]:
# Stop training process
!pkill -f mappo_k1_implementation

# Stop checkpoint monitor
!pkill -f kaggle_auto_checkpoint

# Wait a moment
import time
time.sleep(2)

# Verify stopped (no output from ps means both processes are gone)
print("🛑 Stop signals sent. Remaining processes (should be empty):")
!ps aux | grep -E "mappo_k1|kaggle_auto" | grep -v grep

print("\n📊 Final training log:")
!tail -20 training.log
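
`pkill` sends `SIGTERM` by default, which a process can ignore or handle slowly. If the process was started from Python with `subprocess.Popen` rather than `!nohup ... &` (an assumption, not how the cells above launch it), a terminate-then-kill sequence can be sketched as follows. `stop_process` is a hypothetical helper.

```python
import subprocess


def stop_process(proc, timeout=10.0):
    """Ask `proc` to terminate, then force-kill it if it has not exited after `timeout` seconds."""
    if proc.poll() is not None:
        return proc.returncode  # already finished
    proc.terminate()            # polite request (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()             # forceful stop (SIGKILL on POSIX)
        return proc.wait()


# Usage: exit_code = stop_process(train_proc), where train_proc is a Popen handle
```

Giving the training script time to exit on its own reduces the chance of leaving a half-written checkpoint behind.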

---

## 📋 Workflow Summary

### What Happens Automatically:

1. **Training runs in background** via `nohup`
   - Doesn't block notebook
   - Continues even if you close tab
   - Saves to `training.log`

2. **Checkpoint monitor runs independently**
   - Checks every hour for new checkpoints
   - Automatically zips complete checkpoints
   - Saves to `/kaggle/working/mappo-checkpoint-dataset/`
   - Keeps only last 3 to save space

3. **You can monitor anytime**
   - Check logs: `!tail training.log`
   - List checkpoints: Run the checkpoint status cell in section 5
   - Download: Run the cell in section 7

### No Manual Work Required! 🎉

**Before:** Save → Wait → Find checkpoint → Create zip → Move to dataset folder → Download

**Now:** Just monitor progress and download when ready!

---

## 🚀 Pro Tips

- **Resume training:** Add `--resume-checkpoint path/to/checkpoint` to the training command in section 3
- **Change check interval:** Modify `--interval 3600` (the value is in seconds, so 3600 is 1 hour)
- **Keep more checkpoints:** Change `--keep-last 3` to a higher number
- **Quick package:** Run the cell in section 6 for immediate packaging (no need to wait for the hourly check)
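
The resume tip can be sketched as a small command builder. It uses only the flags that appear in this notebook (`--num-episodes`, `--device`, `--resume-checkpoint`). The helper `build_train_cmd` is hypothetical, and how the training script interprets the resume path is not shown here.

```python
import shlex


def build_train_cmd(num_episodes=50, device="cuda", resume_checkpoint=None):
    """Build the training command as an argument list, optionally resuming from a checkpoint."""
    cmd = [
        "python", "s1/mappo_k1_implementation.py",
        "--num-episodes", str(num_episodes),
        "--device", device,
    ]
    if resume_checkpoint:
        cmd += ["--resume-checkpoint", resume_checkpoint]
    return cmd


# Print the shell form, ready to paste after `!nohup` in section 3
print(shlex.join(build_train_cmd(
    resume_checkpoint="mappo_models/checkpoint_time_20251124_120000"
)))
```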

---