# Reinforcement Learning Training on Kaggle (GPU-Optimized)

This notebook trains **PPO** and **Rainbow DQN** agents on MiniGrid maze environments.

## Key Optimizations for GPU Utilization

| Change | Before | After | Impact |
|--------|--------|-------|--------|
| DQN batch_size | 32 | 256 | 8x more GPU work per step |
| DQN train_freq | 4 | 1 | Train every step |
| PPO batch_size | 64 | 512 | 8x more GPU work per step |
| num_envs | 4 | 16 | More parallel data |
| Dataset prefetch | No | Yes | Overlaps I/O with compute |

## Setup Instructions

1. **Enable GPU**: Settings > Accelerator > **GPU P100** (recommended)
2. **Enable Internet**: Settings > Internet > On (for cloning repo)
3. **Run all cells** in order

## Training Time Estimates (P100 GPU, Optimized)

| Algorithm | Timesteps | Estimated Time |
|-----------|-----------|----------------|
| PPO       | 500K      | ~30-45 min     |
| PPO       | 1M        | ~1-1.5 hours   |
| DQN       | 500K      | ~45-60 min     |
| DQN       | 1M        | ~1.5-2 hours   |
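
The estimates above imply a rough throughput range, which can help when budgeting a different timestep count. The sketch below only rearranges the numbers in the table; the helper names (`implied_throughput`, `estimate_minutes`) are illustrative and not part of the project.

```python
# Minutes (fast, slow) copied from the estimate table above.
ESTIMATES = {
    ("PPO", 500_000): (30, 45),
    ("PPO", 1_000_000): (60, 90),
    ("DQN", 500_000): (45, 60),
    ("DQN", 1_000_000): (90, 120),
}


def implied_throughput(timesteps, minutes):
    """Return (slowest, fastest) environment steps per second."""
    fast_min, slow_min = minutes
    return timesteps / (slow_min * 60), timesteps / (fast_min * 60)


def estimate_minutes(timesteps, steps_per_second):
    """Wall-clock minutes for a timestep budget at a given throughput."""
    return timesteps / steps_per_second / 60


for (algo, steps), minutes in ESTIMATES.items():
    slow, fast = implied_throughput(steps, minutes)
    print(f"{algo} {steps:>9,} steps: {slow:.0f}-{fast:.0f} steps/s")
```

These are wall-clock averages that include environment stepping and logging, so treat them as planning figures rather than benchmarks.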

---
## 1. Environment Setup

In [None]:
# Verify GPU availability
!nvidia-smi

In [None]:
# Install dependencies (TensorFlow is pre-installed on Kaggle)
# Using specific versions compatible with Kaggle's Python 3.10 environment
# Specifiers are quoted so the shell does not treat ">=" as an output redirect
!pip install -q \
    "minigrid==3.0.0" \
    "gymnasium>=1.1.1" \
    tensorflow-probability \
    "pydantic-settings>=2.7.0" \
    "loguru>=0.7.3" \
    "imageio>=2.37.0" \
    "rich>=13.3.3" \
    "pyyaml>=6.0.2" \
    "click>=8.1.0"

In [None]:
# Verify TensorFlow GPU support
import tensorflow as tf

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")

# Enable memory growth to avoid OOM
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    print(f"Memory growth enabled for {len(gpus)} GPU(s)")

---
## 2. Clone Repository

**Option A**: Clone from GitHub (if public repo)

**Option B**: Upload as Kaggle Dataset (for private repos)

In [None]:
# === OPTION A: Clone from GitHub ===
# Uncomment and modify the URL if your repo is public

# !git clone https://github.com/YOUR_USERNAME/reinforce_minigrid.git
# %cd reinforce_minigrid

# === OPTION B: Upload as Kaggle Dataset ===
# 1. Create a new Kaggle Dataset with your project files
# 2. Add the dataset to this notebook via "Add Data"
# 3. Uncomment below:

# !cp -r /kaggle/input/reinforce-minigrid/* /kaggle/working/
# %cd /kaggle/working

In [None]:
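# === OPTION C: Upload directly (for testing) ===
# Upload your project as a zip file and extract it into the working directory.
# This cell only detects the runtime and reports where that directory is.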

import os
from pathlib import Path

# Check if running on Kaggle
IS_KAGGLE = Path('/kaggle').exists()

if IS_KAGGLE:
    WORK_DIR = Path('/kaggle/working')
    # List available input datasets
    print("Available datasets:")
    !ls -la /kaggle/input/
else:
    # Local development
    WORK_DIR = Path('.').resolve()
    
print(f"\nWorking directory: {WORK_DIR}")

---
## 3. Project Setup (Manual Upload Alternative)

If you can't clone the repo or attach it as a dataset, the cell below creates the minimal directory structure (empty packages only); you still need to add the project source files.

**Skip this section if you cloned the repo successfully.**

In [None]:
# Create directory structure (only if needed)
import os

dirs_to_create = [
    'configs',
    'models',
    'maze/envs',
    'reinforce/core',
    'reinforce/ppo',
    'reinforce/dqn',
    'reinforce/config',
]

for dir_path in dirs_to_create:
    os.makedirs(dir_path, exist_ok=True)
    # Create __init__.py files
    init_path = os.path.join(dir_path, '__init__.py')
    if not os.path.exists(init_path):
        open(init_path, 'w').close()

print("Directory structure created!")
!find . -type d -name '__pycache__' -prune -o -type f -name '*.py' -print | head -20

---
## 4. Configuration

Kaggle-optimized training settings: larger batches and more parallel environments to raise GPU utilization, sized to fit Kaggle's GPU memory and session time limits.
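
The configuration in this section contains two annealed quantities: the exploration epsilon (`epsilon_start`, `epsilon_end`, `epsilon_decay_steps`) and the prioritized-replay beta (`priority_beta_start`, `priority_beta_frames`). Below is a minimal sketch of how such schedules are commonly computed, assuming linear interpolation and a final beta of 1.0. The project's own schedules may differ, and `linear_schedule` is a hypothetical helper.

```python
def linear_schedule(start, end, duration, step):
    """Linearly interpolate from start to end over `duration` steps, then hold."""
    fraction = min(max(step / duration, 0.0), 1.0)
    return start + fraction * (end - start)


# Start/end values and durations are copied from the config in this section.
for step in (0, 50_000, 150_000, 300_000):
    epsilon = linear_schedule(0.3, 0.01, 300_000, step)
    # Assumption: beta anneals to 1.0 (not stated in the config itself).
    beta = linear_schedule(0.4, 1.0, 100_000, step)
    print(f"step {step:>7,}: epsilon={epsilon:.3f}  beta={beta:.2f}")
```

Under these assumptions, beta reaches its final value after 100,000 frames, well before epsilon finishes decaying at 300,000 steps.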

In [None]:
# GPU-optimized training configuration for Kaggle (P100 recommended; also usable on T4 x2)
# Key changes: larger batch sizes, more envs, train_freq=1

KAGGLE_CONFIG = """
# Kaggle-optimized configuration (P100 recommended; also usable on T4 x2)
# Maximizes GPU utilization with larger batches and vectorized operations

algorithm: ppo

environment:
  seed: 42

training:
  total_timesteps: 1000000
  steps_per_update: 256
  num_envs: 16  # Increased for more parallel data collection

# PPO hyperparameters - optimized for GPU throughput
ppo:
  learning_rate: 2.5e-4
  gamma: 0.99
  lambda: 0.95
  clip_param: 0.2
  entropy_coef: 0.01
  vf_coef: 0.5
  epochs: 4
  batch_size: 512  # Increased for better GPU utilization
  max_grad_norm: 0.5
  use_lr_annealing: true
  use_value_clipping: false

# Rainbow DQN - GPU-optimized settings
dqn:
  learning_rate: 6.25e-5
  gamma: 0.99
  n_step: 3
  num_atoms: 51
  v_min: -10.0
  v_max: 10.0
  buffer_size: 100000
  batch_size: 256  # CRITICAL: 8x larger for GPU utilization
  target_update_freq: 2000  # More frequent with larger batches
  learning_starts: 5000  # Start learning sooner
  train_freq: 1  # Train every step (was 4)
  priority_alpha: 0.6
  priority_beta_start: 0.4
  priority_beta_frames: 100000
  use_noisy: true
  use_dueling: true
  use_double: true
  use_per: true

# RND for intrinsic motivation (PPO only)
rnd:
  enabled: true
  feature_dim: 512
  learning_rate: 1e-4
  intrinsic_reward_scale: 1.0
  update_proportion: 0.25
  intrinsic_reward_coef: 0.5

# Exploration strategies (PPO only)
exploration:
  use_epsilon_greedy: true
  epsilon_start: 0.3
  epsilon_end: 0.01
  epsilon_decay_steps: 300000
  use_ucb: true
  ucb_coefficient: 0.5
  use_adaptive_entropy: true
  target_entropy_ratio: 0.5
  entropy_lr: 0.01
  min_entropy_coef: 0.001
  max_entropy_coef: 0.1

logging:
  log_interval: 1
  save_interval: 10
  save_path: "models/kaggle_model"
  load_path: null
"""

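# Write config file (create configs/ first in case Section 3 was skipped)
import os

os.makedirs('configs', exist_ok=True)
with open('configs/kaggle_training.yaml', 'w') as f:
    f.write(KAGGLE_CONFIG)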

print("GPU-optimized configuration saved!")
print("\nKey optimizations:")
print("  - PPO batch_size: 512 (was 64)")
print("  - DQN batch_size: 256 (was 32)")
print("  - DQN train_freq: 1 (was 4)")
print("  - num_envs: 16 (was 4)")
print("  - Multi-GPU: Auto-enabled via MirroredStrategy")

---
## 5. Verify Installation

In [None]:
# Verify all imports work
try:
    import gymnasium
    import minigrid
    import tensorflow as tf
    import tensorflow_probability as tfp
    import keras
    from pydantic import BaseModel
    import yaml
    from loguru import logger
    
    print("All imports successful!")
    print(f"  - gymnasium: {gymnasium.__version__}")
    print(f"  - tensorflow: {tf.__version__}")
    print(f"  - keras: {keras.__version__}")
except ImportError as e:
    print(f"Import error: {e}")
    print("Please run the dependency installation cell again.")

In [None]:
# Verify project modules are importable
import sys
sys.path.insert(0, '.')

try:
    from maze.envs import BaseMaze, EasyMaze, MediumMaze, HardMaze
    from reinforce.factory import create_agent
    from reinforce.config.config_loader import load_config
    
    print("Project modules imported successfully!")
except ImportError as e:
    print(f"Project import error: {e}")
    print("Make sure the project files are in the working directory.")

In [None]:
# Test environment creation
from minigrid.wrappers import RGBImgPartialObsWrapper, ImgObsWrapper
from maze.envs import BaseMaze

env = BaseMaze(render_mode="rgb_array")
env = RGBImgPartialObsWrapper(env)
env = ImgObsWrapper(env)

obs, _ = env.reset(seed=42)
print(f"Observation shape: {obs.shape}")
print(f"Action space: {env.action_space}")
env.close()

print("\nEnvironment test passed!")

---
## 6. Train PPO Agent (GPU-Optimized)

**Estimated time**: ~30-45 min for 500K timesteps on P100 GPU

Key optimizations applied:
- **batch_size=512** (8x larger than default 64)
- **steps_per_update=256** (more data per update)
- **num_envs=16** (more parallel environments)
- **Dataset prefetching**: Overlaps data loading with GPU compute
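
To see what these settings mean per update, the arithmetic can be worked through directly. This is a sketch under one assumption that the source does not confirm: `steps_per_update` counts steps per environment, so one rollout holds `num_envs * steps_per_update` transitions. `ppo_update_budget` is an illustrative helper, and `epochs=4` comes from the config.

```python
import math


def ppo_update_budget(total_timesteps, num_envs, steps_per_update, batch_size, epochs):
    """Count rollouts and gradient steps implied by the PPO settings."""
    rollout_size = num_envs * steps_per_update            # transitions per rollout
    minibatches = math.ceil(rollout_size / batch_size)    # minibatches per epoch
    gradient_steps = minibatches * epochs                 # per policy update
    policy_updates = total_timesteps // rollout_size      # full rollouts in the budget
    return {
        "rollout_size": rollout_size,
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_update": gradient_steps,
        "policy_updates": policy_updates,
        "total_gradient_steps": policy_updates * gradient_steps,
    }


budget = ppo_update_budget(500_000, num_envs=16, steps_per_update=256,
                           batch_size=512, epochs=4)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

Under that assumption, each rollout of 4,096 transitions is split into 8 minibatches of 512, giving 32 gradient steps per policy update and 122 policy updates over 500K timesteps.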

In [None]:
# PPO Training with GPU-optimized settings
# Using 16 envs and batch_size=512 for maximum GPU utilization
!python -m reinforce.train \
    --config configs/kaggle_training.yaml \
    --algorithm ppo \
    --total-timesteps 500000 \
    --num-envs 16 \
    --batch-size 512 \
    --save-path models/kaggle_ppo

In [None]:
# Check saved PPO models
!ls -la models/kaggle_ppo* 2>/dev/null || echo "No PPO models found yet"

---
## 7. Train Rainbow DQN Agent (GPU-Optimized)

**Estimated time**: ~45-60 min for 500K timesteps on P100 GPU

Key optimizations applied:
- **batch_size=256** (8x larger than default 32)
- **train_freq=1** (train every step, not every 4)
- **num_envs=16** (more parallel environments)
- **Vectorized buffer**: Batch transition storage

In [None]:
# Rainbow DQN Training with GPU-optimized settings
# batch_size=256, train_freq=1, 16 envs for maximum GPU utilization
!python -m reinforce.train \
    --config configs/kaggle_training.yaml \
    --algorithm dqn \
    --total-timesteps 500000 \
    --num-envs 16 \
    --buffer-size 100000 \
    --learning-starts 5000 \
    --save-path models/kaggle_dqn

In [None]:
# Check saved DQN models
!ls -la models/kaggle_dqn* 2>/dev/null || echo "No DQN models found yet"

---
## 8. Extended Training (Optional)

If you have time remaining, continue training with more timesteps.

In [None]:
# Extended PPO training (1M timesteps)
# Uncomment to run:

# !python -m reinforce.train \
#     --config configs/kaggle_training.yaml \
#     --algorithm ppo \
#     --total-timesteps 1000000 \
#     --num-envs 16 \
#     --save-path models/kaggle_ppo_extended

In [None]:
# Extended DQN training (1M timesteps)
# Uncomment to run:

# !python -m reinforce.train \
#     --config configs/kaggle_training.yaml \
#     --algorithm dqn \
#     --total-timesteps 1000000 \
#     --num-envs 16 \
#     --save-path models/kaggle_dqn_extended

---
## 9. Visualize Trained Agents

In [None]:
# Find the latest saved models
import glob
from pathlib import Path

def find_latest_model(prefix: str, algorithm: str) -> str | None:
    """Find the most recent model checkpoint."""
    if algorithm == 'ppo':
        pattern = f"{prefix}*_policy.keras"
    else:
        # ##>: DQN saves _online.keras and _target.keras, not _q_network.keras.
        pattern = f"{prefix}*_online.keras"
    
    files = glob.glob(pattern)
    if not files:
        return None
    
    # ##>: Get the latest by modification time.
    latest = max(files, key=lambda x: Path(x).stat().st_mtime)
    
    # ##>: Strip the correct suffix based on what was found.
    if latest.endswith('_policy.keras'):
        return latest.rsplit('_policy.keras', 1)[0]
    elif latest.endswith('_online.keras'):
        return latest.rsplit('_online.keras', 1)[0]
    elif latest.endswith('_target.keras'):
        return latest.rsplit('_target.keras', 1)[0]
    return None

ppo_model = find_latest_model('models/kaggle_ppo', 'ppo')
dqn_model = find_latest_model('models/kaggle_dqn', 'dqn')

print(f"Latest PPO model: {ppo_model}")
print(f"Latest DQN model: {dqn_model}")

In [None]:
# Visualize PPO agent (creates GIF)
if ppo_model:
    !python -m reinforce.visualize \
        --model-prefix {ppo_model} \
        --algorithm ppo \
        --level easy \
        --episodes 3 \
        --output models/ppo_demo.gif
else:
    print("No PPO model found. Train first!")

In [None]:
# Visualize DQN agent (creates GIF)
if dqn_model:
    !python -m reinforce.visualize \
        --model-prefix {dqn_model} \
        --algorithm dqn \
        --level easy \
        --episodes 3 \
        --output models/dqn_demo.gif
else:
    print("No DQN model found. Train first!")

In [None]:
# Display GIFs in notebook
from IPython.display import Image, display
from pathlib import Path

for gif_path in ['models/ppo_demo.gif', 'models/dqn_demo.gif']:
    if Path(gif_path).exists():
        print(f"\n{gif_path}:")
        display(Image(filename=gif_path))

---
## 10. Download Models

Save your trained models before the Kaggle session expires!

In [None]:
# List all saved models
!echo "=== All Saved Models ==="
!ls -lah models/*.keras 2>/dev/null || echo "No .keras models found"
!echo ""
!echo "=== Total Size ==="
!du -sh models/ 2>/dev/null || echo "models/ directory not found"

In [None]:
# Create a zip archive for easy download
import shutil
from datetime import datetime

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
archive_name = f'trained_models_{timestamp}'

# Create zip of models directory
shutil.make_archive(archive_name, 'zip', '.', 'models')

print(f"Archive created: {archive_name}.zip")
!ls -lh {archive_name}.zip

In [None]:
# For Kaggle: Models in /kaggle/working/ are automatically available for download
# Click "Save Version" -> "Save & Run All" to persist outputs

print("To download models from Kaggle:")
print("1. Click 'Save Version' at top right")
print("2. Select 'Save & Run All (Commit)'")
print("3. After completion, go to the Output tab")
print("4. Download individual files or the zip archive")
print("")
print("Files available for download:")
!ls -la *.zip models/*.keras models/*.gif 2>/dev/null

---
## 11. Resume Training (Next Session)

To continue training in a new Kaggle session:

1. Upload your saved models as a Kaggle Dataset
2. Add the dataset to your notebook
3. Run the cells below
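
When several checkpoints have been copied over, the one with the most training can be picked by the timestep count embedded in its name. This is a sketch that assumes checkpoints follow the `<prefix>_ts<timesteps>_stage<n>_policy.keras` pattern suggested by the example checkpoint name in this section; `latest_checkpoint_prefix` is a hypothetical helper, not a project function.

```python
import re

# Assumed naming scheme: "<prefix>_ts<timesteps>_stage<n>_policy.keras"
TIMESTEP_PATTERN = re.compile(r"_ts(\d+)")


def latest_checkpoint_prefix(filenames, suffix="_policy.keras"):
    """Return the checkpoint prefix with the highest timestep count, or None."""
    best_prefix, best_timesteps = None, -1
    for name in filenames:
        match = TIMESTEP_PATTERN.search(name)
        if not name.endswith(suffix) or match is None:
            continue  # Skip other files and names without a timestep tag
        timesteps = int(match.group(1))
        if timesteps > best_timesteps:
            best_prefix, best_timesteps = name[: -len(suffix)], timesteps
    return best_prefix


# Illustrative file names only; real ones would come from glob.glob("models/*.keras").
example_files = [
    "models/kaggle_ppo_ts250000_stage1_policy.keras",
    "models/kaggle_ppo_ts500000_stage1_policy.keras",
]
print(latest_checkpoint_prefix(example_files))
```

The returned prefix is in the form expected by `--load-path`. For DQN checkpoints, pass `suffix="_online.keras"`.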

In [None]:
# Copy models from dataset to working directory (if resuming)
# Uncomment and modify path as needed:

# !cp /kaggle/input/YOUR_DATASET_NAME/models/*.keras models/
# !ls -la models/

In [None]:
# Resume PPO training from checkpoint
# Uncomment to run:

# CHECKPOINT = "models/kaggle_ppo_ts500000_stage1"  # Adjust to your checkpoint

# !python -m reinforce.train \
#     --config configs/kaggle_training.yaml \
#     --algorithm ppo \
#     --total-timesteps 1000000 \
#     --num-envs 16 \
#     --load-path {CHECKPOINT} \
#     --save-path models/kaggle_ppo_resumed

---
## Troubleshooting

### Common Issues

**1. Out of Memory (OOM)**
```bash
# Reduce batch_size first
--batch-size 256  # or 128

# Then reduce num_envs if still OOM
--num-envs 8
```

**2. Low GPU Utilization (< 30%)**
```bash
# Increase batch_size (most impactful)
--batch-size 512  # or 1024

# For DQN, ensure train_freq=1
# (already set in kaggle_training.yaml)
```

**3. Training Too Slow**
```bash
# Check GPU utilization first
!nvidia-smi

# If GPU utilization is high but still slow,
# increase batch_size to reduce overhead
--batch-size 512

# Reduce logging frequency
--log-interval 10
```

**4. Import Errors**
```bash
# Reinstall dependencies
!pip install --upgrade minigrid gymnasium tensorflow
```

**5. Session Timeout**
- Save checkpoints frequently (`--save-interval 5`)
- Use background execution: click **Save Version** and choose **Save & Run All (Commit)**; the run continues after you close the tab
- Resume from the last checkpoint in a new session

### GPU Optimization Reference

| Setting | Default | Optimized | Impact |
|---------|---------|-----------|--------|
| PPO batch_size | 64 | 512 | 8x GPU work |
| DQN batch_size | 32 | 256 | 8x GPU work |
| DQN train_freq | 4 | 1 | 4x training ops |
| num_envs | 4 | 16 | 4x parallel data |
| Prefetch | No | Yes | ~10-20% speed |
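
To check the utilization target programmatically instead of reading it by eye, the CSV output of the `nvidia-smi` query used in this notebook can be parsed with the standard library. This is a sketch: `parse_gpu_csv` and `below_target` are hypothetical helpers, and `SAMPLE_OUTPUT` is made-up text that only illustrates the output shape, not a measurement.

```python
import csv
import io

# Made-up sample in the shape printed by:
#   nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
SAMPLE_OUTPUT = """\
index, name, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
0, Tesla P100-PCIE-16GB, 35 %, 2048 MiB, 16384 MiB
"""


def parse_gpu_csv(text):
    """Parse the CSV report into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # Skip the header row
    gpus = []
    for row in reader:
        if not row:
            continue  # Ignore blank lines
        gpus.append({
            "index": int(row[0]),
            "name": row[1],
            "utilization_pct": int(row[2].split()[0]),   # "35 %" -> 35
            "memory_used_mib": int(row[3].split()[0]),   # "2048 MiB" -> 2048
            "memory_total_mib": int(row[4].split()[0]),
        })
    return gpus


def below_target(gpus, target_pct=30):
    """Return the indices of GPUs whose utilization is under the target."""
    return [gpu["index"] for gpu in gpus if gpu["utilization_pct"] < target_pct]


gpus = parse_gpu_csv(SAMPLE_OUTPUT)
print(gpus[0]["name"], f'{gpus[0]["utilization_pct"]}%')
print("GPUs below target:", below_target(gpus))
```

In a live session, the text can be captured with `subprocess.run([...], capture_output=True, text=True).stdout` using the same `nvidia-smi` arguments. Sample it while training is running, since an idle GPU always reports low utilization.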

In [None]:
# Monitor GPU usage during training
# Run this in a separate cell while training to verify optimization worked

print("=== GPU Utilization Check ===")
print("Target: >30% GPU utilization on P100\n")

!nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv

print("\n=== Expected Results After Optimization ===")
print("GPU: 30-50% utilization (was 5%)")
print("\nIf still low, try increasing batch_size further (512, 1024)")

---
## Summary

This notebook provides GPU-optimized training for Kaggle P100:

### Optimizations Applied

**Both PPO and DQN:**
- ✅ Larger batch sizes (8x default)
- ✅ More parallel environments (16 vs 4)
- ✅ Dataset prefetching (overlaps I/O with compute)

**DQN-specific:**
- ✅ `train_freq=1` (train every step vs every 4)
- ✅ Vectorized batch storage

### Expected GPU Utilization

| Metric | Before | After |
|--------|--------|-------|
| GPU utilization | 5% | 30-50% |
| Training time (DQN, 500K timesteps) | 2-3 hours | 45-60 min |

**Next Steps:**
1. Run training with optimized settings
2. Monitor the GPU with the `nvidia-smi` cell
3. Download the zip archive
4. Use the models locally with `make visualize`