# Offline Goal-Conditioned RL Experiment Setup

This notebook demonstrates a complete setup for offline goal-conditioned RL experiments using a simple 2D navigation environment.

## Components:
1. **Toy Environment**: Simple 2D navigation with goal conditioning
2. **Trajectory Generator**: Collects offline trajectories with expert and random policies
3. **Data Loader**: Composable PyTorch data loader for offline RL
4. **Goal Conditioning**: Goal relabeling with `uniform`, `future`, and `final` sampling strategies


In [2]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import torch
from typing import List, Dict, Any

# Import our custom modules
from toy_env import make_toy_env, Toy2DNavigationEnv, GoalConditionedWrapper
from trajectory_generator import TrajectoryGenerator, create_expert_dataset, ExpertPolicy
from data_loader import OfflineRLDataset, GoalConditionedDataset, OfflineRLDataLoader, create_data_loaders

print("All imports successful!")


All imports successful!


## 1. Environment Setup

Let's create and test our toy 2D navigation environment.


In [3]:
# Create the environment
env = make_toy_env(
    grid_size=10,
    max_steps=50,
    goal_conditioned=True,
    action_noise=0.1
)

print("Environment created!")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# Test the environment
obs, info = env.reset()
print(f"\nInitial observation: {obs}")
print(f"Goal: {info['goal']}")
print(f"Distance to goal: {info['distance_to_goal']:.2f}")

# Take a few random steps
for step in range(5):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Step {step}: action={action}, obs={obs}, reward={reward}, done={terminated or truncated}")
    
    if terminated or truncated:
        print(f"Episode ended. Success: {info['success']}")
        break


Environment created!
Observation space: Box(0.0, 10.0, (4,), float32)
Action space: Discrete(5)

Initial observation: [0.781773  7.1862245 2.93081   2.539869 ]
Goal: [2.93081  2.539869]
Distance to goal: 5.12
Step 0: action=1, obs=[0.93446875 6.4198904  2.93081    2.539869  ], reward=0.0, done=False
Step 1: action=4, obs=[0.9492548 6.4899616 2.93081   2.539869 ], reward=0.0, done=False
Step 2: action=1, obs=[0.89271706 5.504059   2.93081    2.539869  ], reward=0.0, done=False
Step 3: action=0, obs=[0.91625524 6.35176    2.93081    2.539869  ], reward=0.0, done=False
Step 4: action=2, obs=[0.       6.357038 2.93081  2.539869], reward=0.0, done=False
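
The printed observations suggest a 4-dimensional layout in which the first two entries are the agent's (x, y) position and the last two are the goal (they match `info['goal']`). A minimal sketch of how the distance and a sparse success signal could be derived from such an observation follows. The helper names and the `0.5` success radius are hypothetical and are not taken from `toy_env`.

```python
import math

def split_observation(obs):
    """Split a flat [x, y, goal_x, goal_y] observation into (position, goal)."""
    return tuple(obs[:2]), tuple(obs[2:4])

def distance_to_goal(obs):
    """Euclidean distance between the agent position and the goal."""
    (x, y), (gx, gy) = split_observation(obs)
    return math.hypot(gx - x, gy - y)

def sparse_reward(obs, success_radius=0.5):
    """Return 1.0 inside the (assumed) success radius, else 0.0."""
    return 1.0 if distance_to_goal(obs) <= success_radius else 0.0

# Initial observation copied from the output above
initial_obs = [0.781773, 7.1862245, 2.93081, 2.539869]
print(f"{distance_to_goal(initial_obs):.2f}")  # 5.12, matching the logged distance
print(sparse_reward(initial_obs))
```

A reward of this shape would be consistent with the `reward=0.0` values logged for the random steps, but the actual reward function may be defined differently.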


## 2. Data Collection

Next, generate two offline datasets of 500 trajectories each: one from a noisy expert policy and one from a random policy.
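
To make the difference between the two kinds of policy concrete, here is a rough sketch of a greedy expert with random-action noise next to a uniform random policy. The action encoding (0=up, 1=down, 2=left, 3=right, 4=stay) is loosely inferred from the steps logged in Section 1, and treating `noise` as the probability of a random action is an assumption made for illustration. The real `ExpertPolicy` may work differently.

```python
import random

# Assumed action encoding (hypothetical): index -> (dx, dy)
ACTION_DELTAS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}

def greedy_action(position, goal):
    """Pick the action that moves along the axis with the larger remaining gap."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    if max(abs(dx), abs(dy)) < 0.5:  # close enough on both axes: stay put
        return 4
    if abs(dx) >= abs(dy):
        return 3 if dx > 0 else 2
    return 0 if dy > 0 else 1

def noisy_expert_action(position, goal, noise, rng):
    """With probability `noise` act randomly, otherwise act greedily."""
    if rng.random() < noise:
        return rng.randrange(len(ACTION_DELTAS))
    return greedy_action(position, goal)

def random_action(rng):
    """Uniform random policy over the five discrete actions."""
    return rng.randrange(len(ACTION_DELTAS))

rng = random.Random(0)
position, goal = (0.78, 7.19), (2.93, 2.54)
print(greedy_action(position, goal))  # 1 (move down: the y gap is larger)
print([noisy_expert_action(position, goal, 0.2, rng) for _ in range(5)])
print([random_action(rng) for _ in range(5)])
```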


In [4]:
# Generate expert dataset
print("Generating expert dataset...")
expert_trajectories = create_expert_dataset(
    env=env,
    num_trajectories=500,
    save_path="expert_dataset.pkl",
    noise=0.2  # Add some noise to make it more realistic
)

# Generate random dataset for comparison
print("\nGenerating random dataset...")
random_generator = TrajectoryGenerator(env)
random_trajectories = random_generator.generate_dataset(
    num_trajectories=500,
    max_steps=50,
    save_path="random_dataset.pkl"
)

# Print summary statistics for a dataset
def print_dataset_stats(trajectories, name):
    """Print the success rate, mean length, and size of a non-empty trajectory list."""
    success_rate = sum(traj.success for traj in trajectories) / len(trajectories)
    avg_length = np.mean([traj.length for traj in trajectories])
    print(f"{name}:")
    print(f"  Success rate: {success_rate:.2%}")
    print(f"  Average length: {avg_length:.1f}")
    print(f"  Total trajectories: {len(trajectories)}")

print_dataset_stats(expert_trajectories, "Expert Dataset")
print_dataset_stats(random_trajectories, "Random Dataset")


Generating expert dataset...
Generating 500 trajectories...
Generated 0/500 trajectories
Generated 100/500 trajectories
Generated 200/500 trajectories
Generated 300/500 trajectories
Generated 400/500 trajectories
Generated 500 trajectories
Dataset saved to expert_dataset.pkl
Expert dataset created:
  Success rate: 50.80%
  Average length: 25.9
  Total trajectories: 500

Generating random dataset...
Generating 500 trajectories...
Generated 0/500 trajectories
Generated 100/500 trajectories
Generated 200/500 trajectories
Generated 300/500 trajectories
Generated 400/500 trajectories
Generated 500 trajectories
Dataset saved to random_dataset.pkl
Expert Dataset:
  Success rate: 50.80%
  Average length: 25.9
  Total trajectories: 500
Random Dataset:
  Success rate: 35.20%
  Average length: 35.1
  Total trajectories: 500
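
The statistics above rely only on each trajectory exposing `success` and `length`. A minimal stand-in for such a container is sketched below. The `observations`, `goals`, `success`, and `length` names mirror the attributes used in this notebook, while the class itself and the `actions` and `rewards` fields are assumptions about what `trajectory_generator` returns.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Tuple

@dataclass
class ToyTrajectory:
    """Hypothetical trajectory record; not the class from trajectory_generator."""
    observations: List[Tuple[float, ...]] = field(default_factory=list)
    actions: List[int] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    goals: List[Tuple[float, float]] = field(default_factory=list)
    success: bool = False

    @property
    def length(self) -> int:
        """Number of transitions stored in the trajectory."""
        return len(self.actions)

def summarize(trajectories):
    """Return (success_rate, average_length) for a non-empty list."""
    success_rate = sum(t.success for t in trajectories) / len(trajectories)
    return success_rate, mean(t.length for t in trajectories)

demo = [
    ToyTrajectory(actions=[1, 1, 3], success=True),
    ToyTrajectory(actions=[4] * 5, success=False),
]
rate, avg_len = summarize(demo)
print(f"Success rate: {rate:.2%}, average length: {avg_len:.1f}")
# Success rate: 50.00%, average length: 4.0
```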


## 3. Data Loading and Preprocessing

Build PyTorch train and validation loaders from the expert trajectories, using an 80/20 split, a batch size of 64, and HER-style `future` goal relabeling.


In [5]:
# Create data loaders for expert dataset
print("Creating data loaders...")

# Standard goal-conditioned dataset
train_loader, val_loader = create_data_loaders(
    trajectories=expert_trajectories,
    train_ratio=0.8,
    batch_size=64,
    goal_conditioned=True,
    goal_relabeling=True,
    goal_sampling_strategy='future'  # HER-style goal relabeling
)

print(f"Train loader: {len(train_loader)} batches")
print(f"Val loader: {len(val_loader)} batches")

# Test a batch
batch = next(iter(train_loader))
print("\nBatch shapes:")
print(f"  Observations: {batch.observations.shape}")
print(f"  Actions: {batch.actions.shape}")
print(f"  Next observations: {batch.next_observations.shape}")
print(f"  Rewards: {batch.rewards.shape}")
print(f"  Terminals: {batch.terminals.shape}")
print(f"  Goals: {batch.goals.shape}")

# Show some example transitions
print("\nExample transitions:")
for i in range(3):
    print(f"  Transition {i}:")
    print(f"    State: {batch.observations[i][:2]}")
    print(f"    Goal: {batch.goals[i]}")
    print(f"    Action: {batch.actions[i]}")
    print(f"    Next state: {batch.next_observations[i][:2]}")
    print(f"    Reward: {batch.rewards[i]}")
    print(f"    Terminal: {batch.terminals[i]}")
    print()


Creating data loaders...
Train loader: 164 batches
Val loader: 39 batches

Batch shapes:
  Observations: torch.Size([64, 4])
  Actions: torch.Size([64])
  Next observations: torch.Size([64, 4])
  Rewards: torch.Size([64])
  Terminals: torch.Size([64])
  Goals: torch.Size([64, 2])

Example transitions:
  Transition 0:
    State: tensor([0.1382, 0.0000])
    Goal: tensor([ 0.0952, -0.8507])
    Action: 1
    Next state: tensor([0.3081, 0.0000])
    Reward: 0.0
    Terminal: False

  Transition 1:
    State: tensor([0.1063, 0.0000])
    Goal: tensor([-0.3411, -0.4085])
    Action: 1
    Next state: tensor([0., 0.])
    Reward: 0.0
    Terminal: False

  Transition 2:
    State: tensor([0.0071, 0.0000])
    Goal: tensor([ 0.0198, -0.6909])
    Action: 1
    Next state: tensor([0.0911, 0.0000])
    Reward: 0.0
    Terminal: False





## 4. Different Goal Sampling Strategies

Compare the `uniform`, `future`, and `final` goal sampling strategies on a 100-trajectory subset of the expert dataset.


In [6]:
# Compare different goal sampling strategies
strategies = ['uniform', 'future', 'final']

for strategy in strategies:
    print(f"\n=== {strategy.upper()} Goal Sampling ===")
    
    # Create dataset with specific strategy
    dataset = GoalConditionedDataset(
        trajectories=expert_trajectories[:100],  # Use subset for faster processing
        goal_relabeling=True,
        goal_sampling_strategy=strategy
    )
    
    # Create data loader
    loader = OfflineRLDataLoader(dataset, batch_size=32, shuffle=False)
    
    # Analyze goal distribution
    all_goals = []
    for batch in loader:
        all_goals.extend(batch.goals.numpy())
    
    all_goals = np.array(all_goals)
    
    print(f"  Goal range: [{all_goals.min():.2f}, {all_goals.max():.2f}]")
    print(f"  Goal std: {all_goals.std():.2f}")
    print(f"  Unique goals: {len(np.unique(all_goals, axis=0))}")
    
    # Show some example goals
    print(f"  Sample goals: {all_goals[:5]}")



=== UNIFORM Goal Sampling ===
  Goal range: [-1.00, 0.98]
  Goal std: 0.55
  Unique goals: 96
  Sample goals: [[-0.56691396  0.8602849 ]
 [-0.47928172 -0.16742727]
 [-0.3488859  -0.36081022]
 [ 0.7733705  -0.86309946]
 [ 0.97817135 -0.5976935 ]]

=== FUTURE Goal Sampling ===
  Goal range: [-1.00, 0.98]
  Goal std: 0.54
  Unique goals: 100
  Sample goals: [[ 0.77988     0.3095437 ]
 [ 0.77988     0.3095437 ]
 [ 0.77988     0.3095437 ]
 [ 0.53413177  0.70619184]
 [-0.26943734 -0.09547954]]

=== FINAL Goal Sampling ===
  Goal range: [-1.00, 0.98]
  Goal std: 0.54
  Unique goals: 100
  Sample goals: [[ 0.77988     0.3095437 ]
 [ 0.77988     0.3095437 ]
 [ 0.77988     0.3095437 ]
 [ 0.53413177  0.70619184]
 [-0.26943734 -0.09547954]]
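
For reference, in hindsight-style relabeling `future` usually means a state reached later in the same trajectory and `final` means the trajectory's last state. `uniform` is taken here to mean any state drawn from the whole dataset. The sketch below implements these readings on plain Python lists. It is an assumption about what `GoalConditionedDataset` does, not a copy of its implementation.

```python
import random

def sample_goal(trajectories, traj_idx, t, strategy, rng):
    """Return a relabeled goal for step `t` of trajectory `traj_idx`.

    Each trajectory is a list of (x, y) positions.
    """
    states = trajectories[traj_idx]
    if strategy == "future":
        # Any state from the next step onward in the same trajectory
        future_start = min(t + 1, len(states) - 1)
        return states[rng.randrange(future_start, len(states))]
    if strategy == "final":
        # The last state the trajectory actually reached
        return states[-1]
    if strategy == "uniform":
        # Any state from any trajectory in the dataset
        other = trajectories[rng.randrange(len(trajectories))]
        return other[rng.randrange(len(other))]
    raise ValueError(f"Unknown strategy: {strategy}")

rng = random.Random(0)
trajectories = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)],
    [(5.0, 5.0), (5.0, 4.0), (4.0, 4.0)],
]
for strategy in ("uniform", "future", "final"):
    print(strategy, sample_goal(trajectories, traj_idx=0, t=1, strategy=strategy, rng=rng))
```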


## 5. Visualization

Plot three expert trajectories (top row) and three random trajectories (bottom row), marking the start, end, and goal of each.


In [None]:
# Visualize some example trajectories
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Plot three expert trajectories (top row) and three random ones (bottom row)
for row, (name, trajectories) in enumerate(
    [("Expert", expert_trajectories), ("Random", random_trajectories)]
):
    for i in range(3):
        traj = trajectories[i]
        ax = axes[row * 3 + i]

        # Extract (x, y) positions from the first two observation entries
        positions = np.array(traj.observations)[:, :2]
        goal = traj.goals[0]

        # Plot the path with start, end, and goal markers
        ax.plot(positions[:, 0], positions[:, 1], 'b-', alpha=0.7, linewidth=2)
        ax.scatter(positions[0, 0], positions[0, 1], c='green', s=100, marker='o', label='Start', zorder=5)
        ax.scatter(positions[-1, 0], positions[-1, 1], c='red', s=100, marker='s', label='End', zorder=5)
        ax.scatter(goal[0], goal[1], c='orange', s=100, marker='*', label='Goal', zorder=5)

        ax.set_xlim(0, 10)
        ax.set_ylim(0, 10)
        ax.set_title(f'{name} Trajectory {i+1} (Success: {traj.success})')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print trajectory statistics
print("Trajectory Statistics:")
print(f"Expert success rate: {sum(traj.success for traj in expert_trajectories) / len(expert_trajectories):.2%}")
print(f"Random success rate: {sum(traj.success for traj in random_trajectories) / len(random_trajectories):.2%}")
print(f"Expert avg length: {np.mean([traj.length for traj in expert_trajectories]):.1f}")
print(f"Random avg length: {np.mean([traj.length for traj in random_trajectories]):.1f}")


## 6. Usage Example for Offline RL Algorithms

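The loop below is a skeleton: it iterates over both loaders but leaves the algorithm-specific update commented out, so the reported losses stay at 0.0 until you plug in a method such as CQL or IQL.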


In [None]:
# Example usage for training an offline RL algorithm
def train_offline_rl_algorithm(train_loader, val_loader, num_epochs=10):
    """
    Skeleton training loop for an offline RL algorithm.
    Fill in the commented-out steps with your actual algorithm (e.g., CQL or IQL).
    """
    print("Starting training...")
    
    for epoch in range(num_epochs):
        # Training
        train_loss = 0.0
        train_batches = 0
        
        for batch in train_loader:
            # Move to device if using GPU
            # batch = batch.to(device)
            
            # Your algorithm's training step here
            # loss = your_algorithm.train_step(batch)
            # train_loss += loss.item()
            train_batches += 1
        
        # Validation
        val_loss = 0.0
        val_batches = 0
        
        for batch in val_loader:
            # Move to device if using GPU
            # batch = batch.to(device)
            
            # Your algorithm's validation step here
            # loss = your_algorithm.val_step(batch)
            # val_loss += loss.item()
            val_batches += 1
        
        # Report average losses (these stay at 0.0 until the steps above are implemented)
        avg_train_loss = train_loss / max(train_batches, 1)
        avg_val_loss = val_loss / max(val_batches, 1)
        
        print(f"Epoch {epoch+1}/{num_epochs}: "
              f"Train Loss: {avg_train_loss:.4f}, "
              f"Val Loss: {avg_val_loss:.4f}")

# Run example training
train_offline_rl_algorithm(train_loader, val_loader, num_epochs=5)

print("\nSkeleton loop finished. Replace the commented-out steps with your actual offline RL algorithm.")
print("The data loader provides:")
print("- Batched transitions with goal conditioning")
print("- Support for different goal sampling strategies")
print("- Easy integration with PyTorch training loops")
print("- Composable design for different algorithms")
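
As one example of what a `train_step` might compute, many value-based offline methods regress onto a bootstrapped target of the form `r + gamma * (1 - terminal) * V(s', g)`. The sketch below evaluates that target on plain Python lists shaped like the `rewards` and `terminals` fields of a batch. `next_values` stands in for whatever value estimate your algorithm produces, and the function name and discount values are illustrative assumptions.

```python
def td_targets(rewards, terminals, next_values, gamma=0.99):
    """Compute r + gamma * (1 - terminal) * next_value for each transition."""
    if not (len(rewards) == len(terminals) == len(next_values)):
        raise ValueError("All inputs must have the same length")
    return [
        r + gamma * (0.0 if done else 1.0) * v
        for r, done, v in zip(rewards, terminals, next_values)
    ]

# Three toy transitions: the last one reaches the goal and terminates
rewards = [0.0, 0.0, 1.0]
terminals = [False, False, True]
next_values = [0.5, 0.8, 0.3]  # placeholder value estimates
print(td_targets(rewards, terminals, next_values, gamma=0.9))
```

Masking the bootstrap term on terminal transitions keeps the value of a reached goal from leaking in estimates for states that are never actually visited afterwards.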
