# CM (Collaborative Moving) Environment Tutorial

This notebook is a tutorial for the Collaborative Moving (CM) environment, in which multiple agents must work together to push a box to a target location.

## Table of Contents
1. Environment Overview
2. Task Description
3. Action Space
4. Observation Space
5. CTDE Global State Space
6. Reward System
7. Usage Examples
8. Integration with MARL Algorithms

## 1. Environment Overview

The CM environment is a multi-agent collaborative task where agents must cooperate to push a box from its initial position to a goal location. This environment is designed to test cooperation, coordination, and planning capabilities of multi-agent reinforcement learning algorithms.

**Key Features:**
- 2-4 agents can collaborate
- Box pushing requires cooperation (higher success probability with more agents)
- Configurable grid sizes and difficulty levels
- Support for both single-agent and multi-agent scenarios
- Rich observation space including agent positions and relative information

In [4]:
# Import necessary libraries
import sys
import os
sys.path.append(os.path.abspath('..'))

import numpy as np
import matplotlib.pyplot as plt
from Env.CM.env_cm import create_cm_env
from Env.CM.env_cm_ctde import create_cm_ctde_env

## 2. Task Description

### Objective
Agents must cooperate to move a box from its starting position to a designated goal area.

### Cooperation Mechanism
- The box can only be moved successfully when agents push from different sides
- Success probability increases with the number of cooperating agents:
  - 1 agent: 50% success rate
  - 2 agents: 75% success rate
  - 3 agents: 90% success rate
  - 4 agents: 100% success rate
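
The success rates listed above can be read as a lookup table that is sampled once per push attempt. The following is a minimal sketch under that assumption; `push_succeeds` and `PUSH_SUCCESS_PROB` are hypothetical names for illustration and are not taken from `env_cm`.

```python
import random

# Success probabilities from the list above, keyed by number of pushing agents.
PUSH_SUCCESS_PROB = {1: 0.5, 2: 0.75, 3: 0.9, 4: 1.0}


def push_succeeds(n_pushing, rng):
    """Sample whether a push attempt moves the box (hypothetical helper)."""
    if n_pushing <= 0:
        return False
    # Assumption: more than 4 pushing agents behaves like 4.
    prob = PUSH_SUCCESS_PROB[min(n_pushing, 4)]
    # rng.random() is in [0, 1), so prob == 1.0 always succeeds.
    return rng.random() < prob


rng = random.Random(0)
trials = 10_000
for n in (1, 2, 3, 4):
    rate = sum(push_succeeds(n, rng) for _ in range(trials)) / trials
    print(f"{n} agent(s): empirical success rate {rate:.2f}")
```

Running the loop should give empirical rates close to the listed probabilities, which is a quick way to sanity-check any reimplementation of the cooperation rule.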

### Episode Termination
- Box reaches the goal area
- Maximum number of steps is exceeded

In [5]:
# Create environment and demonstrate basic functionality
env = create_cm_env(difficulty="normal", render_mode="")

print("Environment Information:")
env_info = env.get_env_info()
for key, value in env_info.items():
    print(f"  {key}: {value}")

print("\nDifficulty Configurations:")
print("  Available difficulties:", ['debug', 'easy', 'normal', 'hard'])
print("  CTDE difficulties:", ['debug_ctde', 'easy_ctde', 'normal_ctde', 'hard_ctde'])

Environment Information:
  n_agents: 2
  agent_ids: ['agent_0', 'agent_1']
  n_actions: 5
  obs_dims: {'agent_0': 8, 'agent_1': 8}
  act_dims: {'agent_0': 5, 'agent_1': 5}
  episode_limit: 100
  grid_size: 7
  box_size: 2

Difficulty Configurations:
  Available difficulties: ['debug', 'easy', 'normal', 'hard']
  CTDE difficulties: ['debug_ctde', 'easy_ctde', 'normal_ctde', 'hard_ctde']


## 3. Action Space

Each agent has 5 discrete actions:

| Action ID | Action Name | Description |
|-----------|-------------|-------------|
| 0 | STAY | Agent remains in current position |
| 1 | UP | Agent moves one grid cell up |
| 2 | DOWN | Agent moves one grid cell down |
| 3 | LEFT | Agent moves one grid cell left |
| 4 | RIGHT | Agent moves one grid cell right |

### Action Constraints
- Agents cannot move outside the grid boundaries
- Agents cannot move into the box (they push it instead)
- Multiple agents cannot occupy the same position
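
To make the action table and constraints concrete, here is a minimal sketch of a move check. It assumes `(x, y)` grid coordinates with `y` growing downward (so UP is `dy = -1`), which this tutorial does not state, and it treats box cells simply as blocked rather than modelling the push. `try_move` is a hypothetical helper, not part of the environment API.

```python
# Action IDs from the table above.
STAY, UP, DOWN, LEFT, RIGHT = range(5)

# Assumption: y grows downward, so UP decreases y.
ACTION_DELTAS = {
    STAY: (0, 0),
    UP: (0, -1),
    DOWN: (0, 1),
    LEFT: (-1, 0),
    RIGHT: (1, 0),
}


def try_move(pos, action, grid_size, blocked):
    """Return the new position, or the old one if the move is not allowed.

    `blocked` is a set of cells occupied by other agents or by the box.
    """
    dx, dy = ACTION_DELTAS[action]
    new_pos = (pos[0] + dx, pos[1] + dy)
    inside = 0 <= new_pos[0] < grid_size and 0 <= new_pos[1] < grid_size
    if not inside or new_pos in blocked:
        return pos
    return new_pos


print(try_move((0, 0), LEFT, 7, set()))       # stays put at the left boundary
print(try_move((3, 3), RIGHT, 7, {(4, 3)}))   # stays put, target cell is occupied
print(try_move((3, 3), UP, 7, set()))         # moves one cell up
```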

In [6]:
# Demonstrate action space
env.reset()

print("Action Space:")
print(f"  Number of actions: {env.n_actions}")
print(f"  Action space: {env.action_space}")

# Show available actions for each agent
for agent_id in env.agent_ids:
    avail_actions = env.get_avail_actions(agent_id)
    print(f"  {agent_id} available actions: {avail_actions}")

Action Space:
  Number of actions: 5
  Action space: Discrete(5)
  agent_0 available actions: [0, 1, 2, 3, 4]
  agent_1 available actions: [0, 1, 2, 3, 4]


## 4. Observation Space

Each agent receives a local observation containing:

**Vector format (length = 6 + 2*(n_agents-1)):**
- `self_x, self_y`: Agent's own position (2 values)
- `box_center_x, box_center_y`: Box center position (2 values)
- `goal_center_x, goal_center_y`: Goal center position (2 values)
- `relative_positions`: Relative positions of other agents (2*(n_agents-1) values)

For 2-agent environment, observation length = 8:
- [self_x, self_y, box_x, box_y, goal_x, goal_y, other_rel_x, other_rel_y]

**Normalization:**
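- Observations are divided by the grid size when `normalize_observations=True`
- Absolute positions (self, box, goal) then fall in [0.0, 1.0]; relative positions of other agents are signed and fall in [-1.0, 1.0] (see the -0.286 entry in the sample output for `agent_0`)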
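
The layout described above can be sketched as a small function. The grid coordinates used here are inferred by multiplying the `agent_0` sample output in this section by the grid size of 7, and the relative position is assumed to be `other - self`, which matches the signs in that output. `build_observation` is a hypothetical helper, not the environment's own code.

```python
import numpy as np


def build_observation(self_pos, box_center, goal_center, other_positions,
                      grid_size, normalize=True):
    """Assemble [self, box, goal, relative positions of others] as float32."""
    self_pos = np.asarray(self_pos, dtype=np.float32)
    # Assumption: relative position is (other - self), one pair per other agent.
    rel = [np.asarray(p, dtype=np.float32) - self_pos for p in other_positions]
    obs = np.concatenate([
        self_pos,
        np.asarray(box_center, dtype=np.float32),
        np.asarray(goal_center, dtype=np.float32),
        *rel,
    ])
    if normalize:
        obs = obs / np.float32(grid_size)
    return obs.astype(np.float32)


# Inferred state behind the agent_0 sample output (grid_size = 7).
obs = build_observation(
    self_pos=(4, 2),
    box_center=(2.5, 2.5),
    goal_center=(1.5, 4.5),
    other_positions=[(2, 1)],
    grid_size=7,
)
print(obs.shape)  # (8,) for 2 agents: 6 + 2 * (n_agents - 1)
print(obs)
```

With these inputs the last two entries are negative, which shows why normalized observations are not confined to [0, 1].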

In [7]:
# Demonstrate observation space
obs = env.reset()

print("Observation Space:")
print(f"  Observation space: {env.observation_space}")
print(f"  Observation dimension: {env.observation_space.shape[0]}")

print("\nInitial observations:")
for agent_id, observation in obs.items():
    print(f"  {agent_id}: {observation}")
    print(f"    Shape: {observation.shape}")
    print(f"    Min: {observation.min():.3f}, Max: {observation.max():.3f}")

Observation Space:
  Observation space: Box(0.0, 7.0, (8,), float32)
  Observation dimension: 8

Initial observations:
  agent_0: [ 0.5714286   0.2857143   0.35714287  0.35714287  0.21428572  0.64285713
 -0.2857143  -0.14285715]
    Shape: (8,)
    Min: -0.286, Max: 0.643
  agent_1: [0.2857143  0.14285715 0.35714287 0.35714287 0.21428572 0.64285713
 0.2857143  0.14285715]
    Shape: (8,)
    Min: 0.143, Max: 0.643


## 5. CTDE Global State Space

For Centralized Training with Decentralized Execution (CTDE), the environment provides a global state that includes information from all agents.

**Global State Components:**
- All agent positions (2 * n_agents values)
- Box position and size (3 values: x, y, size)
- Goal position and size (3 values: x, y, size)
- Relative positions between agents (2 * n_agents * (n_agents-1) values)

**Global State Types:**
- `concat`: Concatenation of all information (default)
- `mean`: Mean pooling of agent observations
- `max`: Max pooling of agent observations
- `attention`: Attention-based aggregation
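
As an illustration of the `mean` and `max` options, pooling reduces the per-agent observation vectors element-wise, so the result has the same length as a single observation regardless of the number of agents. This is a minimal sketch of that idea; the actual aggregation in `env_cm_ctde` may differ, and `pool_observations` is a hypothetical helper. The `concat` and `attention` variants are not covered here.

```python
import numpy as np


def pool_observations(obs_by_agent, mode="mean"):
    """Element-wise pooling over per-agent observation vectors."""
    # Sort agent IDs so the stacking order is deterministic.
    stacked = np.stack([obs_by_agent[a] for a in sorted(obs_by_agent)])
    if mode == "mean":
        return stacked.mean(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unsupported pooling mode: {mode}")


example = {
    "agent_0": np.array([0.0, 2.0]),
    "agent_1": np.array([4.0, -2.0]),
}
print(pool_observations(example, "mean"))  # element-wise mean
print(pool_observations(example, "max"))   # element-wise max
```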

In [8]:
# Demonstrate CTDE environment
ctde_env = create_cm_ctde_env(difficulty="normal_ctde", global_state_type="concat")

obs = ctde_env.reset()
global_state = ctde_env.get_global_state()

print("CTDE Environment:")
print(f"  Global state dimension: {len(global_state)}")
print(f"  Global state sample: {global_state[:10]}...")  # Show first 10 values

ctde_env.close()

CTDE Environment:
  Global state dimension: 16
  Global state sample: [ 0.2857143   0.14285715  0.35714287  0.35714287  0.07142857  0.21428572
 -0.14285715  0.42857143  0.14285715  0.5714286 ]...


## 6. Reward System

The reward system is designed to encourage cooperation and goal completion while discouraging random exploration.

**Reward Components:**
1. **Time Penalty**: -0.3 per step (encourages efficiency)
2. **Distance Improvement**: 0.3 × distance reduction (only for significant progress > 0.2)
3. **Box Movement**: 1.0 (only when box moves toward goal)
4. **Cooperation Bonus**: 1.5 × (n_pushing_agents - 1)
5. **Goal Completion**: 50.0 + efficiency bonus (up to +15.0)

**Reward Span**: ~80 units (from random exploration to goal completion)

**Key Design Principles:**
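- Random exploration is meant to yield only low rewards (design target ~5-10); in the sample runs in this notebook, random episodes that miss the goal end with negative totals as the per-step time penalty accumulates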
- Meaningful progress provides moderate rewards (10-30)
- Goal completion provides the highest reward (50-65)
- Cooperation is explicitly rewarded
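
The components listed above can be combined into a per-step reward roughly as follows. This is a sketch under stated assumptions, not the environment's implementation: the efficiency bonus is assumed to scale linearly with the fraction of steps remaining, the cooperation bonus is assumed to apply on any step with at least two pushing agents, and `step_reward` with all of its parameters is a hypothetical name.

```python
def step_reward(dist_before, dist_after, box_moved_toward_goal,
                n_pushing, goal_reached, steps_used, max_steps):
    """Combine the five reward components for one step (illustrative only)."""
    reward = -0.3  # time penalty per step

    # Distance improvement counts only for significant progress (> 0.2).
    progress = dist_before - dist_after
    if progress > 0.2:
        reward += 0.3 * progress

    # Box movement bonus, only when the box moves toward the goal.
    if box_moved_toward_goal:
        reward += 1.0

    # Cooperation bonus: 1.5 per pushing agent beyond the first.
    reward += 1.5 * max(0, n_pushing - 1)

    # Goal completion: 50.0 plus up to 15.0 for finishing early
    # (assumption: linear in the fraction of steps left).
    if goal_reached:
        reward += 50.0 + 15.0 * (1.0 - steps_used / max_steps)

    return reward


print(step_reward(5.0, 5.0, False, 0, False, 1, 100))  # idle step: time penalty only
print(step_reward(3.0, 2.0, True, 2, False, 10, 100))  # cooperative push toward the goal
```

Under these assumptions an idle step costs 0.3, while a two-agent push that moves the box one cell closer earns a positive reward, which is the incentive the design principles describe.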

In [9]:
# Demonstrate reward system
print("Reward System Configuration:")
print(f"  Time penalty: {env.config.time_penalty}")
print(f"  Goal reward: {env.config.goal_reached_reward}")
print(f"  Cooperation reward: {env.config.cooperation_reward}")
print(f"  Box move reward: {env.config.box_move_reward_scale}")
print(f"  Distance reward scale: {env.config.distance_reward_scale}")

# Run a few episodes to show reward distribution
print("\nSample Episode Rewards:")
for episode in range(3):
    obs = env.reset()
    total_reward = 0
    step_count = 0
    
    for step in range(50):
        actions = {}
        for agent_id in env.agent_ids:
            avail_actions = env.get_avail_actions(agent_id)
            actions[agent_id] = np.random.choice(avail_actions)
        
        obs, rewards, dones, info = env.step(actions)
        step_reward = list(rewards.values())[0]
        total_reward += step_reward
        step_count += 1
        
        if any(dones.values()):
            print(f"  Episode {episode+1}: Reward={total_reward:.1f}, Steps={step_count}, Goal Reached=True")
            break
    else:
        print(f"  Episode {episode+1}: Reward={total_reward:.1f}, Steps={step_count}, Goal Reached=False")

Reward System Configuration:
  Time penalty: -0.3
  Goal reward: 50.0
  Cooperation reward: 1.5
  Box move reward: 1.0
  Distance reward scale: 0.3

Sample Episode Rewards:
  Episode 1: Reward=-8.3, Steps=50, Goal Reached=False
  Episode 2: Reward=66.1, Steps=18, Goal Reached=True
  Episode 3: Reward=-5.5, Steps=50, Goal Reached=False


## 7. Usage Examples

### Basic Environment Usage

In [10]:
# Basic usage example
def run_basic_episode(env, max_steps=100):
    """Run a single episode with random actions."""
    obs = env.reset()
    episode_reward = 0
    
    for step in range(max_steps):
        # Get random actions for all agents
        actions = {}
        for agent_id in env.agent_ids:
            avail_actions = env.get_avail_actions(agent_id)
            actions[agent_id] = np.random.choice(avail_actions)
        
        # Execute step
        obs, rewards, dones, info = env.step(actions)
        
        # Accumulate reward (all agents get same reward)
        step_reward = list(rewards.values())[0]
        episode_reward += step_reward
        
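        # Check if episode is done (dones is also set when the step limit is reached,
        # so use info['agents_complete'] to tell success from timeout)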
        if any(dones.values()):
            print(f"Episode completed in {step+1} steps!")
            print(f"Total reward: {episode_reward:.2f}")
            print(f"Goal reached: {info['agents_complete']}")
            break
    else:
        print(f"Episode timed out after {max_steps} steps")
        print(f"Total reward: {episode_reward:.2f}")
    
    return episode_reward, info

# Run a basic episode
print("Basic Environment Usage:")
reward, info = run_basic_episode(env)

Basic Environment Usage:
Episode completed in 100 steps!
Total reward: -19.19
Goal reached: False


### Visualization

In [17]:
import matplotlib.pyplot as plt  # pyplot, not the top-level matplotlib package

# Visualization example (if matplotlib is available)
try:
    # Create environment with rendering
    vis_env = create_cm_env(difficulty="easy", render_mode="rgb_array")
    
    obs = vis_env.reset()
    
    # Get initial rendering
    frame = vis_env.render()
    
    if frame is not None:
        plt.figure(figsize=(8, 8))
        plt.imshow(frame)
        plt.title("CM Environment - Initial State")
        plt.axis('off')
        plt.show()
    else:
        print("Rendering not available in this environment")
    
    vis_env.close()
except Exception as e:
    print(f"Visualization failed: {e}")
    print("This is normal in headless environments.")


### Different Difficulty Levels

In [16]:
# Compare different difficulty levels
difficulties = ['debug', 'easy', 'normal', 'hard']

print("Difficulty Level Comparison:")
for diff in difficulties:
    test_env = create_cm_env(difficulty=diff, render_mode="")
    info = test_env.get_env_info()
    
    print(f"\n{diff.upper()} difficulty:")
    print(f"  Grid size: {info['grid_size']}")
    print(f"  Agents: {info['n_agents']}")
    print(f"  Max steps: {info['episode_limit']}")
    print(f"  Box size: {info['box_size']}")
    
    test_env.close()

Difficulty Level Comparison:

DEBUG difficulty:
  Grid size: 5
  Agents: 2
  Max steps: 50
  Box size: 2

EASY difficulty:
  Grid size: 7
  Agents: 2
  Max steps: 100
  Box size: 2

NORMAL difficulty:
  Grid size: 7
  Agents: 2
  Max steps: 100
  Box size: 2

HARD difficulty:
  Grid size: 9
  Agents: 3
  Max steps: 150
  Box size: 2


## 8. Integration with MARL Algorithms

The CM environment is designed to plug into common MARL algorithms such as QMIX, VDN, and MADDPG through an environment wrapper.
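
Value-decomposition learners such as QMIX and VDN are usually implemented over fixed-order arrays and a single team reward, whereas the CM environment returns per-agent dictionaries. A minimal sketch of that conversion is shown below; `stack_agent_dict` is a hypothetical helper and the dummy values stand in for a real `env.step` result.

```python
import numpy as np


def stack_agent_dict(values_by_agent, agent_ids):
    """Stack per-agent values into one array, ordered by agent_ids."""
    return np.stack([np.asarray(values_by_agent[a]) for a in agent_ids])


agent_ids = ["agent_0", "agent_1"]

# Dummy stand-ins for the dictionaries returned by the environment.
obs = {a: np.zeros(8, dtype=np.float32) for a in agent_ids}
rewards = {"agent_0": -0.3, "agent_1": -0.3}

obs_array = stack_agent_dict(obs, agent_ids)  # shape (n_agents, obs_dim)
# All agents share the same reward in CM, so the mean equals the team reward.
team_reward = float(np.mean(stack_agent_dict(rewards, agent_ids)))

print(obs_array.shape)
print(team_reward)
```

Keeping `agent_ids` as the single source of ordering avoids silently mixing up agents when dictionaries are iterated in different orders.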

In [18]:
# Example: Integration with MARL framework
try:
    from marl.src.envs import create_env_wrapper
    
    # Create MARL environment wrapper
    config = {
        'env': {
            'name': 'CM',
            'difficulty': 'normal',
            'global_state_type': 'concat'
        }
    }
    
    marl_env = create_env_wrapper(config)
    
    print("MARL Integration:")
    print(f"  Environment created successfully")
    print(f"  Agent IDs: {marl_env.agent_ids}")
    print(f"  N agents: {marl_env.n_agents}")
    
    # Test MARL environment
    obs, _ = marl_env.reset()
    global_state = marl_env.get_global_state()
    
    print(f"  Observation shape: {list(obs.values())[0].shape}")
    print(f"  Global state shape: {global_state.shape}")
    
    # Run a few steps
    for step in range(5):
        actions = {agent_id: np.random.randint(0, 5) for agent_id in marl_env.agent_ids}
        obs, rewards, dones, infos = marl_env.step(actions)
        
        if any(dones.values()):
            break
    
    marl_env.close()
    print("  MARL integration test passed!")
    
except ImportError:
    print("MARL framework not available - this is normal if not installed")
except Exception as e:
    print(f"MARL integration test failed: {e}")

MARL Integration:
  Environment created successfully
  Agent IDs: ['agent_0', 'agent_1']
  N agents: 2
  Observation shape: (8,)
  Global state shape: (16,)
  MARL integration test passed!


## Summary

The CM environment provides a rich testbed for multi-agent cooperation with the following key characteristics:

### Strengths:
- **Clear cooperation requirements**: Multiple agents needed for efficient box movement
- **Scalable difficulty**: From 2-agent presets on small grids up to 4-agent scenarios (the built-in `hard` preset uses 3 agents with `grid_size` 9)
- **Rich observations**: Local and global information for CTDE algorithms
- **Well-designed rewards**: Balances exploration, cooperation, and goal completion
- **MARL-ready**: Seamless integration with popular MARL frameworks

### Use Cases:
- Testing cooperation mechanisms in MARL
- Benchmarking CTDE algorithms
- Studying multi-agent coordination
- Teaching multi-agent reinforcement learning

### Configuration Tips:
- Start with `debug` or `easy` difficulty for testing
- Use `normal` difficulty for standard benchmarks
- Try `hard` difficulty for challenging scenarios
- Use CTDE versions (`*_ctde`) for centralized training
- Adjust reward parameters in config files for custom requirements

In [19]:
# Clean up
env.close()
print("\nCM Environment Tutorial completed!")


CM Environment Tutorial completed!
