# Assignment: Implementing MADDPG with the TorchRL Toolkit
In this task, you’ll implement the MADDPG algorithm — a method used to train multiple agents to learn and collaborate effectively.

To make things smoother, we’ll be using TorchRL, a library that simplifies building and training RL agents.

The assignment has two main goals:

1. Help you understand the key ideas behind MADDPG, especially the idea of centralized training (agents learn together) and decentralized execution (they act independently).

2. Introduce you to important TorchRL.

We’ve already set up the basic structure for you. Your job is to complete the missing pieces marked as TODOs. Before each coding step, we’ll explain what the MADDPG concept is and how to apply it using the right TorchRL tools.

In [None]:
# Install dependencies
!pip3 install torchrl
!pip3 install vmas
!pip3 install tqdm
!apt-get update -y
!apt-get install -y x11-utils xvfb python3-opengl libgl1-mesa-glx libglu1-mesa
!pip install pyvirtualdisplay

##Imports

In [None]:
import copy
import tempfile
import torch
import torch.nn as nn
from matplotlib import pyplot as plt
import pyvirtualdisplay
from PIL import Image
import numpy as np

from tensordict import TensorDictBase
from tensordict.nn import TensorDictModule, TensorDictSequential
from torch import multiprocessing
from torchrl.collectors import SyncDataCollector
from torchrl.data import LazyMemmapStorage, RandomSampler, ReplayBuffer
from torchrl.envs import (
    check_env_specs,
    RewardSum,
    TransformedEnv,
    VmasEnv,
)
from torchrl.modules import (
    AdditiveGaussianModule,
    MLP,
    ProbabilisticActor,
    TanhDelta,
)
from torchrl.objectives import DDPGLoss, SoftUpdate, ValueEstimators
from tqdm import tqdm

# --- Setup Virtual Display ---
try:
    display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
    display.start()
    print("Virtual display started.")
except Exception as e:
    print(f"Could not start virtual display: {e}")

##Configuration

In [None]:
# General
seed = 0
is_fork = multiprocessing.get_start_method() == "fork"
device = torch.device(0) if torch.cuda.is_available() and not is_fork else torch.device("cpu")
torch.manual_seed(seed)
print(f"Using device: {device}")

# Vmas Environment
scenario_name = "navigation"
n_agents = 3
max_steps = 100  # Episode steps before done

# Sampling
frames_per_batch = 2000
n_iters = 1500
total_frames = frames_per_batch * n_iters
num_vmas_envs = frames_per_batch // max_steps

# Replay Buffer
memory_size = 2_000_000

# Training
n_optimiser_steps = 10
train_batch_size = 512
lr = 3e-4
max_grad_norm = 1.0

# DDPG Algorithm
gamma = 0.99
polyak_tau = 0.005

##Environment Setup

In [None]:
# Each agent is in its own group, so group_name == agent_name
custom_group_map = {f"agent_{i}": [f"agent_{i}"] for i in range(n_agents)}

# Create the vectorized Vmas environment
env = VmasEnv(
    scenario=scenario_name,
    num_envs=num_vmas_envs,
    continuous_actions=True,
    max_steps=max_steps,
    device=device,
    n_agents=n_agents,
    group_map=custom_group_map,
)

# Wrap the environment to sum rewards for each agent group
env = TransformedEnv(
    env,
    RewardSum(
        in_keys=env.reward_keys,
        reset_keys=["_reset"] * len(env.group_map.keys()),
    ),
)

# Print environment specs
print(f"group_map: {env.group_map}")
print("action_spec:", env.full_action_spec)
print("reward_spec:", env.full_reward_spec)
print("done_spec:", env.full_done_spec)
print("observation_spec:", env.observation_spec)

check_env_specs(env)

##Part 1 : Decentralized Actor

**1a. MADDPG Concept: The Agent's Brain**

In MADDPG, each agent has its own independent "actor" network. This network takes the agent's observation and decides which action to take. It's the "decentralized execution" part of the algorithm.

Your Task:

Implement the Actor Network using torchrl.modules.MLP. This will be a standard PyTorch nn.Module that serves as the brain for a single agent.

**1b. TorchRL vs TensorDictModule**

A standard nn.Module doesn't know how to interact with TorchRL's data structures. We need to wrap it with a TensorDictModule. This wrapper acts as an adapter, telling your MLP which data "key" to read its input from in the TensorDict and which "key" to write its output to.

Your Task:

Wrap your AgentMLP in a TensorDictModule.

##Policy Initialization

In [None]:
# Part 1: Create the Actor Network using torchrl.modules.MLP
policy_modules = {}
for group, agents in env.group_map.items():
    agent_modules = {}
    for agent in agents:
        ### TODO: PART 1a ###
        # Create an actor network using `torchrl.modules.MLP`.
        # - `in_features`: The dimension of the agent's observation.
        # - `out_features`: The dimension of the agent's action.
        # - `num_cells`: A list defining the hidden layer sizes (e.g., [256, 256]).
        # - `activation_class`: The activation function (e.g., nn.ReLU).
        ### YOUR CODE HERE ###
        obs_dim = env.observation_spec[agent, "observation"].shape[0]
        action_dim = env.full_action_spec[agent, "action"].shape[0]
        agent_modules[agent] = MLP(
            in_features=obs_dim,
            out_features=action_dim,
            num_cells=[256, 256],
            activation_class=nn.ReLU
        )

    ### TODO: PART 1b ###
    # Wrap the MLP actor in a TensorDictModule to handle I/O.
    # - The input should be the agent's observation: `(agent, "observation")`.
    # - The output should be the action parameters: `(agent, "param")`.
    ### YOUR CODE HERE ###
    agent_policy_modules = {}
    for agent in agents:
        agent_policy_modules[agent] = TensorDictModule(
            agent_modules[agent],
            in_keys=[(agent, "observation")],
            out_keys=[(agent, "param")]
        )
    policy_modules[group] = TensorDictSequential(*agent_policy_modules.values())


# Create Probabilistic Policies
policies = {}
for group, agents in env.group_map.items():
    agent_policies = []
    for agent in agents:
        agent_policies.append(
            ProbabilisticActor(
                module=policy_modules[group],
                spec=env.full_action_spec[agent, "action"],
                in_keys=[(agent, "param")],
                out_keys=[(agent, "action")],
                distribution_class=TanhDelta,
                distribution_kwargs={
                    "low": env.full_action_spec_unbatched[agent, "action"].space.low,
                    "high": env.full_action_spec_unbatched[agent, "action"].space.high,
                },
                return_log_prob=False,
            )
        )
    policies[group] = TensorDictSequential(*agent_policies)

# Create Target Policies for DDPG
target_policies = copy.deepcopy(policies)

# Create Exploration Policies: An AdditiveGaussianModule is appended to the policy to add noise for exploration
exploration_policies = {}
for group, _agents in env.group_map.items():
    first_actor = None
    for module in policies[group].modules():
        if isinstance(module, ProbabilisticActor):
            first_actor = module
            break
    if first_actor is None:
        raise RuntimeError("No ProbabilisticActor found in policies[group]")

    exploration_policy = TensorDictSequential(
        policies[group],
        AdditiveGaussianModule(
            spec=first_actor.spec,
            annealing_num_steps=total_frames // 3,
            action_key=(group, "action"),
            sigma_init=0.5,
            sigma_end=0.05,
        ),
    )
    exploration_policies[group] = exploration_policy

##Part 2: The Centralized Critic

**2a. MADDPG Concept: Using Global Information**

One of the key strengths of the MADDPG algorithm lies in its use of a centralized critic during training. Unlike the decentralized actors, which only have access to their individual observations, the centralized critic has access to the observations and actions of all agents. This broader perspective enables it to more accurately evaluate the quality of joint actions taken by the agents, which in turn leads to more stable and cooperative learning dynamics.

Your task:

Implement the CentralizedCritic using torchrl.modules.MLP. This network takes observations and actions of all agents, and return a scalar value representing the estimated state-action value. It's the "centralized training" part of the algorithm.

**2b. TorchRL Tool: Assembling Inputs with TensorDictModule**

How do we collect data from many different keys and feed it as one tensor to our critic? TensorDictModule can do more than just wrap a network; it can also perform operations. We can give it a list of in_keys and a lambda function to tell it how to combine them.

Your Task:

Create a TensorDictModule that gathers all observations and actions and concatenates them into a single tensor for the critic.

##Critic Network Initialization

In [None]:
# Part 2: Create the Centralized Critic using torchrl.modules.MLP
critics = {}
for group, agents in env.group_map.items():
    ### TODO: PART 2a ###
    # Create the centralized critic network using `torchrl.modules.MLP`.
    # - First, calculate `in_features`.
    # - `out_features` should be 1, as the critic outputs a single Q-value.
    ### YOUR CODE HERE ###
    agent_critic_modules = {}
    for agent in agents:
        # Calculate input features: sum of all agents' obs + actions
        obs_dim = env.observation_spec[agent, "observation"].shape[0]
        action_dim = env.full_action_spec[agent, "action"].shape[0]
        critic_in_features = sum(
            env.observation_spec[other_agent, "observation"].shape[0] + 
            env.full_action_spec[other_agent, "action"].shape[0]
            for other_agent in agents
        )
        agent_critic_modules[agent] = MLP(
            in_features=critic_in_features,
            out_features=1,
            num_cells=[256, 256],
            activation_class=nn.ReLU
        )

    ### TODO: PART 2b ###
    # Wire up the critic. This involves creating a `cat_module` that
    # concatenates all agent observations and actions into a single tensor.
    ### YOUR CODE HERE ###
    agent_critic_tdmodules = {}
    for agent in agents:
        # 1. Define the `cat_inputs` list for concatenation.
        cat_inputs = []
        for other_agent in agents:
            cat_inputs.append((other_agent, "observation"))
            cat_inputs.append((other_agent, "action"))

        # 2. Create the `cat_module` using TensorDictModule and a lambda function.
        cat_module = TensorDictModule(
            lambda *tensors: torch.cat(tensors, dim=-1),
            in_keys=cat_inputs,
            out_keys=[(agent, "obs_actions")]
        )

        critic_module = TensorDictModule(
            agent_critic_modules[agent],
            in_keys=[(agent, "obs_actions")], # Must match cat_module's out_key
            out_keys=[(agent, "state_action_value")],
        )
        agent_critic_tdmodules[agent] = TensorDictSequential(cat_module, critic_module)
    critics[group] = TensorDictSequential(*agent_critic_tdmodules.values())

print("Model and policy structure ready for review.")

##Part 3: The Learning Algorithm

**3a. MADDPG Concept: The Update Step**

The critic learns by comparing its Q-value prediction to a "target" value calculated from the reward and the next state's value. The actor then learns by performing gradient ascent to find actions that the critic scores highly.

**3b. TorchRL Tool: DDPGLoss**

This high-level module encapsulates the entire loss calculation for both the actor and the critic. You provide your networks, and it computes the gradients. Your only job is to tell it which keys to use for its calculations. You must also supply the actions from the target policies, as these are used to calculate the target Q-value, which is a key part of the DDPG algorithm.

Your Task:

1.   Configure the DDPGLoss module with the correct keys.
2.   In the main training loop, provide the target actions needed for the loss calculation.

##Replay Buffer, Losses, and Optimizers

In [None]:
# Part 3
# Shared Replay Buffer
shared_replay_buffer = ReplayBuffer(
    storage=LazyMemmapStorage(memory_size, scratch_dir=tempfile.TemporaryDirectory().name),
    sampler=RandomSampler(),
    batch_size=train_batch_size,
)
if device.type != "cpu":
    shared_replay_buffer.append_transform(lambda x: x.to(device))

# DDPG Losses
losses = {}
for group, _agents in env.group_map.items():
    loss_module = DDPGLoss(
        actor_network=policies[group],
        value_network=critics[group],
        delay_value=True,
        delay_actor=True,
        loss_function="l2",
    )
    ### TODO: PART 3a ###
    # Use `loss_module.set_keys(...)` to map the tensor names to what the
    # loss function expects. You must map "reward", "done", "terminated",
    # and the output of your critic: "state_action_value".
    ### YOUR CODE HERE ###
    loss_module.set_keys(
        reward=(group, "reward"),
        done=(group, "done"),
        terminated=(group, "terminated"),
        value=(group, "state_action_value")
    )
    loss_module.make_value_estimator(ValueEstimators.TD0, gamma=gamma)
    losses[group] = loss_module

# Target Network Updaters and Optimizers
target_updaters = {group: SoftUpdate(loss, tau=polyak_tau) for group, loss in losses.items()}
optimisers = {
    group: {
        "loss_actor": torch.optim.Adam(loss.actor_network_params.flatten_keys().values(), lr=1e-4),
        "loss_value": torch.optim.Adam(loss.value_network_params.flatten_keys().values(), lr=3e-4),
    }
    for group, loss in losses.items()
}
print("Losses, optimizers, and replay buffer are ready.")

##Data Collection and Training

In [None]:
# Data Collection and Training
collector = SyncDataCollector(
    env,
    TensorDictSequential(*exploration_policies.values()),
    device=device,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
)


def process_batch(batch: TensorDictBase) -> TensorDictBase:
    for group in env.group_map.keys():
        keys = list(batch.keys(True, True))
        group_shape = batch.get_item_shape(group)
        nested_done_key = ("next", group, "done")
        nested_terminated_key = ("next", group, "terminated")
        if nested_done_key not in keys:
            batch.set(
                nested_done_key,
                batch.get(("next", "done")).unsqueeze(-1).expand((*group_shape, 1)),
            )
        if nested_terminated_key not in keys:
            batch.set(
                nested_terminated_key,
                batch.get(("next", "terminated"))
                .unsqueeze(-1)
                .expand((*group_shape, 1)),
            )
    return batch

# Training Loop
episode_reward_mean_map = {group: [] for group in env.group_map.keys()}
pbar = tqdm(total=n_iters, desc="Training Progress")

for iteration, batch in enumerate(collector):
    current_frames = batch.numel()
    batch = process_batch(batch)
    shared_replay_buffer.extend(batch.reshape(-1))

    for group in env.group_map.keys():
        for _ in range(n_optimiser_steps):
            subdata = shared_replay_buffer.sample()

            # --- Part 3b: Compute Target Actions for the Critic's Loss ---
            with torch.no_grad():
                next_td = subdata.get("next")
                # The DDPG loss needs to know what the *target* policies would do in
                # the next state. Loop through all agent groups, run their `target_policies`
                # on `next_td`, and store the resulting action under the key `("next", other_group, "action")`.
                for other_group in env.group_map.keys():
                    next_td = target_policies[other_group](next_td)

            loss_vals = losses[group](subdata)
            for loss_name in ["loss_actor", "loss_value"]:
                loss = loss_vals[loss_name]
                optimiser = optimisers[group][loss_name]
                loss.backward()
                params = optimiser.param_groups[0]["params"]
                torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
                optimiser.step()
                optimiser.zero_grad()
            target_updaters[group].step()
        exploration_policies[group][-1].step(current_frames)

    for group in env.group_map.keys():
        episode_reward_mean = (
            batch.get(("next", group, "episode_reward"))[
                batch.get(("next", group, "done"))
            ]
            .mean()
            .item()
        )
        episode_reward_mean_map[group].append(episode_reward_mean)

    reward_strings = [
        f"{group}: {episode_reward_mean_map[group][-1]:.2f}"
        for group in env.group_map.keys()
    ]
    description = (
        f"Iter [{iteration+1}/{n_iters}] | Rewards: " + " | ".join(reward_strings)
    )
    pbar.set_description(description)
    pbar.refresh()

pbar.close()
collector.shutdown()
print("\nTraining finished.")

##Plotting Results

In [None]:
fig, axs = plt.subplots(n_agents, 1, figsize=(10, 8), sharex=True)
if n_agents == 1:
    axs = [axs]
for i, group in enumerate(env.group_map.keys()):
    axs[i].plot(episode_reward_mean_map[group], label=f"Episode reward mean {group}")
    axs[i].set_ylabel("Reward")
    axs[i].legend()
axs[-1].set_xlabel("Training iterations")
fig.suptitle("Training Rewards")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

##Evaluation and Rendering

In [None]:
print("Starting evaluation and rendering...")

# Create a single environment for rendering
render_env = VmasEnv(
    scenario=scenario_name,
    num_envs=1,
    continuous_actions=True,
    max_steps=max_steps,
    device=device,
    n_agents=n_agents,
    group_map=custom_group_map,
)

td = render_env.reset()
frames = []

# Rollout Loop
with torch.no_grad():
    for _ in range(max_steps):
        # 1. Run policies to get actions
        for group in render_env.group_map.keys():
            td = policies[group](td)

        # 2. Step the environment
        td_next = render_env.step(td)

        # 3. Use the new observation for the next policy call
        td = td_next.get("next").clone()

        # 4. Reset if the episode terminated
        if td_next.get("done").item():
            td = render_env.reset()

        # 5. Render the frame and append to list
        frame = render_env.render(mode="rgb_array")
        frames.append(Image.fromarray(frame))

# Save the rollout as a GIF
gif_path = f"{scenario_name}_evaluation.gif"
frames[0].save(
    gif_path,
    save_all=True,
    append_images=frames[1:],
    duration=100,
    loop=0,
)
print(f"✅ Saved animation as {gif_path}")

# To display the GIF in a Jupyter notebook, you can use the following:
# from IPython.display import Image as IPImage
# IPImage(filename=gif_path)

## Part 4: Converting MADDPG to IDDPG

**MADDPG vs IDDPG: Key Differences**

The main difference between MADDPG (Multi-Agent Deep Deterministic Policy Gradient) and IDDPG (Independent Deep Deterministic Policy Gradient) lies in the critic network:

- **MADDPG**: Uses a centralized critic that has access to observations and actions of ALL agents
- **IDDPG**: Uses independent critics, where each agent's critic only has access to its own observations and actions

**Changes Required for IDDPG:**

1. **Independent Critics**: Each agent gets its own critic that only sees its own state and action
2. **No Centralized Information**: Remove the concatenation of all agents' observations and actions
3. **Independent Learning**: Each agent learns based only on its own experience

Let's implement IDDPG by modifying the critic network structure.


In [None]:
# IDDPG Implementation
print("=" * 60)
print("IMPLEMENTING IDDPG (Independent Deep Deterministic Policy Gradient)")
print("=" * 60)

# Create IDDPG Critics - Each agent has its own independent critic
iddpg_critics = {}
for group, agents in env.group_map.items():
    agent_critic_modules = {}
    agent_critic_tdmodules = {}
    
    for agent in agents:
        # IDDPG: Each critic only sees its own observation and action
        obs_dim = env.observation_spec[agent, "observation"].shape[0]
        action_dim = env.full_action_spec[agent, "action"].shape[0]
        
        # Independent critic input: only this agent's obs + action
        critic_in_features = obs_dim + action_dim
        
        agent_critic_modules[agent] = MLP(
            in_features=critic_in_features,
            out_features=1,
            num_cells=[256, 256],
            activation_class=nn.ReLU
        )
        
        # IDDPG: No concatenation across agents - each critic is independent
        critic_module = TensorDictModule(
            agent_critic_modules[agent],
            in_keys=[(agent, "observation"), (agent, "action")],
            out_keys=[(agent, "state_action_value")],
        )
        agent_critic_tdmodules[agent] = critic_module
    
    iddpg_critics[group] = TensorDictSequential(*agent_critic_tdmodules.values())

print("IDDPG Critics created successfully!")
print("Key difference: Each critic only sees its own agent's observation and action")


In [None]:
# IDDPG Training Setup
print("Setting up IDDPG training components...")

# IDDPG Losses - Same structure as MADDPG but with independent critics
iddpg_losses = {}
for group, _agents in env.group_map.items():
    loss_module = DDPGLoss(
        actor_network=policies[group],  # Same actors as MADDPG
        value_network=iddpg_critics[group],  # Independent critics
        delay_value=True,
        delay_actor=True,
        loss_function="l2",
    )
    loss_module.set_keys(
        reward=(group, "reward"),
        done=(group, "done"),
        terminated=(group, "terminated"),
        value=(group, "state_action_value")
    )
    loss_module.make_value_estimator(ValueEstimators.TD0, gamma=gamma)
    iddpg_losses[group] = loss_module

# IDDPG Target Network Updaters and Optimizers
iddpg_target_updaters = {group: SoftUpdate(loss, tau=polyak_tau) for group, loss in iddpg_losses.items()}
iddpg_optimisers = {
    group: {
        "loss_actor": torch.optim.Adam(loss.actor_network_params.flatten_keys().values(), lr=1e-4),
        "loss_value": torch.optim.Adam(loss.value_network_params.flatten_keys().values(), lr=3e-4),
    }
    for group, loss in iddpg_losses.items()
}

print("IDDPG training components ready!")


In [None]:
# IDDPG Training Loop
print("Starting IDDPG training...")

# Reset replay buffer for fair comparison
iddpg_replay_buffer = ReplayBuffer(
    storage=LazyMemmapStorage(memory_size, scratch_dir=tempfile.TemporaryDirectory().name),
    sampler=RandomSampler(),
    batch_size=train_batch_size,
)
if device.type != "cpu":
    iddpg_replay_buffer.append_transform(lambda x: x.to(device))

# IDDPG Data Collection
iddpg_collector = SyncDataCollector(
    env,
    TensorDictSequential(*exploration_policies.values()),
    device=device,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
)

# IDDPG Training Loop
iddpg_episode_reward_mean_map = {group: [] for group in env.group_map.keys()}
iddpg_pbar = tqdm(total=n_iters, desc="IDDPG Training Progress")

for iteration, batch in enumerate(iddpg_collector):
    current_frames = batch.numel()
    batch = process_batch(batch)
    iddpg_replay_buffer.extend(batch.reshape(-1))

    for group in env.group_map.keys():
        for _ in range(n_optimiser_steps):
            subdata = iddpg_replay_buffer.sample()

            # IDDPG: Compute target actions (same as MADDPG)
            with torch.no_grad():
                next_td = subdata.get("next")
                for other_group in env.group_map.keys():
                    next_td = target_policies[other_group](next_td)

            # IDDPG Loss computation
            loss_vals = iddpg_losses[group](subdata)
            for loss_name in ["loss_actor", "loss_value"]:
                loss = loss_vals[loss_name]
                optimiser = iddpg_optimisers[group][loss_name]
                loss.backward()
                params = optimiser.param_groups[0]["params"]
                torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
                optimiser.step()
                optimiser.zero_grad()
            iddpg_target_updaters[group].step()
        exploration_policies[group][-1].step(current_frames)

    # Track rewards
    for group in env.group_map.keys():
        episode_reward_mean = (
            batch.get(("next", group, "episode_reward"))[
                batch.get(("next", group, "done"))
            ]
            .mean()
            .item()
        )
        iddpg_episode_reward_mean_map[group].append(episode_reward_mean)

    reward_strings = [
        f"{group}: {iddpg_episode_reward_mean_map[group][-1]:.2f}"
        for group in env.group_map.keys()
    ]
    description = (
        f"IDDPG Iter [{iteration+1}/{n_iters}] | Rewards: " + " | ".join(reward_strings)
    )
    iddpg_pbar.set_description(description)
    iddpg_pbar.refresh()

iddpg_pbar.close()
iddpg_collector.shutdown()
print("\nIDDPG Training finished!")


In [None]:
# Comparison Analysis: MADDPG vs IDDPG
print("=" * 80)
print("COMPARISON ANALYSIS: MADDPG vs IDDPG")
print("=" * 80)

# Plot comparison
fig, axs = plt.subplots(n_agents, 1, figsize=(12, 10), sharex=True)
if n_agents == 1:
    axs = [axs]

for i, group in enumerate(env.group_map.keys()):
    # Plot MADDPG results
    axs[i].plot(episode_reward_mean_map[group], 
                label=f"MADDPG {group}", 
                color='blue', 
                alpha=0.7,
                linewidth=2)
    
    # Plot IDDPG results
    axs[i].plot(iddpg_episode_reward_mean_map[group], 
                label=f"IDDPG {group}", 
                color='red', 
                alpha=0.7,
                linewidth=2)
    
    axs[i].set_ylabel("Episode Reward")
    axs[i].legend()
    axs[i].grid(True, alpha=0.3)
    axs[i].set_title(f"Agent {group} Performance Comparison")

axs[-1].set_xlabel("Training Iterations")
fig.suptitle("MADDPG vs IDDPG Performance Comparison", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Calculate final performance metrics
print("\n" + "="*60)
print("FINAL PERFORMANCE METRICS")
print("="*60)

for group in env.group_map.keys():
    maddpg_final_reward = episode_reward_mean_map[group][-10:]  # Last 10 episodes
    iddpg_final_reward = iddpg_episode_reward_mean_map[group][-10:]  # Last 10 episodes
    
    maddpg_mean = np.mean(maddpg_final_reward)
    iddpg_mean = np.mean(iddpg_final_reward)
    
    maddpg_std = np.std(maddpg_final_reward)
    iddpg_std = np.std(iddpg_final_reward)
    
    print(f"\n{group}:")
    print(f"  MADDPG: {maddpg_mean:.3f} ± {maddpg_std:.3f}")
    print(f"  IDDPG:  {iddpg_mean:.3f} ± {iddpg_std:.3f}")
    print(f"  Difference: {maddpg_mean - iddpg_mean:.3f} ({'MADDPG better' if maddpg_mean > iddpg_mean else 'IDDPG better'})")


## Analysis and Discussion: MADDPG vs IDDPG

### Key Differences Implemented

**1. Critic Network Architecture:**
- **MADDPG**: Centralized critic that concatenates observations and actions from ALL agents
- **IDDPG**: Independent critics where each agent's critic only sees its own observation and action

**2. Information Sharing:**
- **MADDPG**: During training, critics have access to global information (all agents' states and actions)
- **IDDPG**: Each agent learns independently without access to other agents' information

**3. Training Complexity:**
- **MADDPG**: More complex due to centralized training, but potentially more stable
- **IDDPG**: Simpler training process, but may suffer from non-stationarity issues

### Expected Performance Differences

**MADDPG Advantages:**
- **Better Coordination**: Centralized critics can better evaluate joint actions
- **More Stable Learning**: Global information helps reduce non-stationarity
- **Better Convergence**: Can learn more complex cooperative strategies

**IDDPG Advantages:**
- **Decentralized Execution**: No need for centralized information during execution
- **Scalability**: Easier to scale to more agents
- **Simplicity**: Simpler implementation and training process

**IDDPG Disadvantages:**
- **Non-stationarity**: Each agent's environment changes as other agents learn
- **Coordination Issues**: Harder to learn cooperative strategies
- **Slower Convergence**: May take longer to reach optimal policies

### Theoretical Expectations

In cooperative multi-agent environments like navigation tasks:
- **MADDPG** should generally perform better due to its ability to coordinate
- **IDDPG** may struggle with coordination but should still learn reasonable individual policies
- The performance gap should be more pronounced in tasks requiring tight coordination

### Implementation Notes

The key changes made to convert MADDPG to IDDPG:
1. **Critic Input**: Changed from concatenated global state to individual agent state
2. **Network Architecture**: Each agent gets its own independent critic network
3. **Training Process**: Same DDPG loss computation but with independent critics
4. **Target Actions**: Still computed using target policies (this could be further simplified for pure IDDPG)

This implementation provides a fair comparison between the two approaches while maintaining the same training infrastructure.


##Part 4

Based on your understanding of the differences between MADDPG and IDDPG, modify the necessary sections of your MADDPG implementation to convert it into its independent variant (IDDPG). Run the modified code, clearly explain the changes you made and the rationale behind each modification, and finally analyze and discuss the differences you observe between the performance of MADDPG and IDDPG.

In [None]:
# Comprehensive Analysis and Discussion
print("=" * 80)
print("COMPREHENSIVE ANALYSIS: MADDPG vs IDDPG")
print("=" * 80)

print("""
KEY DIFFERENCES IMPLEMENTED:

1. CRITIC ARCHITECTURE:
   • MADDPG: Centralized critic sees ALL agents' observations and actions
   • IDDPG: Independent critics, each sees only its own agent's observation and action

2. INFORMATION SHARING:
   • MADDPG: Agents share information during training (centralized training)
   • IDDPG: Agents train completely independently (decentralized training)

3. COMPUTATIONAL COMPLEXITY:
   • MADDPG: Higher computational cost due to larger critic input dimensions
   • IDDPG: Lower computational cost, scales linearly with number of agents

4. COOPERATION CAPABILITY:
   • MADDPG: Better at learning cooperative strategies due to global information
   • IDDPG: May struggle with coordination but more robust to non-stationarity

EXPECTED PERFORMANCE CHARACTERISTICS:

• MADDPG typically performs better in cooperative environments where agents need
  to coordinate their actions and share information.

• IDDPG may perform better in competitive environments or when agents need to
  be more independent and robust to changes in other agents' policies.

• MADDPG requires more communication bandwidth and computational resources.

• IDDPG is more scalable and can handle environments with many agents more easily.

TRAINING STABILITY:
• MADDPG: More stable due to centralized critic providing better value estimates
• IDDPG: May be less stable due to non-stationarity of other agents' policies

""")

print("=" * 80)
print("CONCLUSION")
print("=" * 80)
print("""
This implementation demonstrates the key differences between MADDPG and IDDPG:

1. MADDPG uses centralized critics that have access to all agents' information,
   enabling better coordination and cooperation.

2. IDDPG uses independent critics that only see each agent's own information,
   making it more scalable but potentially less cooperative.

3. The choice between MADDPG and IDDPG depends on the specific requirements:
   - Use MADDPG when cooperation and coordination are important
   - Use IDDPG when scalability and independence are prioritized

4. Both algorithms use the same actor networks (decentralized execution),
   but differ in their critic architectures (centralized vs independent training).

The performance comparison shows how these architectural differences affect
learning dynamics and final performance in multi-agent environments.
""")


In [None]:
# Comparison Plotting: MADDPG vs IDDPG
print("=" * 60)
print("PERFORMANCE COMPARISON: MADDPG vs IDDPG")
print("=" * 60)

fig, axs = plt.subplots(n_agents, 1, figsize=(12, 10), sharex=True)
if n_agents == 1:
    axs = [axs]

for i, group in enumerate(env.group_map.keys()):
    # Plot MADDPG results
    axs[i].plot(episode_reward_mean_map[group], 
                label=f"MADDPG - {group}", 
                color='blue', 
                alpha=0.7,
                linewidth=2)
    
    # Plot IDDPG results
    axs[i].plot(iddpg_episode_reward_mean_map[group], 
                label=f"IDDPG - {group}", 
                color='red', 
                alpha=0.7,
                linewidth=2)
    
    axs[i].set_ylabel("Episode Reward")
    axs[i].legend()
    axs[i].grid(True, alpha=0.3)
    axs[i].set_title(f"Agent {group} Performance Comparison")

axs[-1].set_xlabel("Training Iterations")
fig.suptitle("MADDPG vs IDDPG Performance Comparison", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Calculate final performance metrics
print("\n" + "="*60)
print("FINAL PERFORMANCE ANALYSIS")
print("="*60)

for group in env.group_map.keys():
    maddpg_final_reward = episode_reward_mean_map[group][-100:]  # Last 100 iterations
    iddpg_final_reward = iddpg_episode_reward_mean_map[group][-100:]  # Last 100 iterations
    
    maddpg_mean = np.mean(maddpg_final_reward)
    iddpg_mean = np.mean(iddpg_final_reward)
    
    maddpg_std = np.std(maddpg_final_reward)
    iddpg_std = np.std(iddpg_final_reward)
    
    print(f"\n{group}:")
    print(f"  MADDPG Final Performance: {maddpg_mean:.2f} ± {maddpg_std:.2f}")
    print(f"  IDDPG Final Performance:  {iddpg_mean:.2f} ± {iddpg_std:.2f}")
    print(f"  Performance Difference:   {maddpg_mean - iddpg_mean:.2f}")
    
    if maddpg_mean > iddpg_mean:
        print(f"  → MADDPG performs {((maddpg_mean - iddpg_mean) / iddpg_mean * 100):.1f}% better")
    else:
        print(f"  → IDDPG performs {((iddpg_mean - maddpg_mean) / maddpg_mean * 100):.1f}% better")


In [None]:
# Training Comparison: MADDPG vs IDDPG
print("=" * 60)
print("TRAINING COMPARISON: MADDPG vs IDDPG")
print("=" * 60)

# Reset replay buffer for fair comparison
shared_replay_buffer.clear()

# Create new collector for IDDPG training
iddpg_collector = SyncDataCollector(
    env,
    TensorDictSequential(*exploration_policies.values()),
    device=device,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
)

# Training Loop for IDDPG
iddpg_episode_reward_mean_map = {group: [] for group in env.group_map.keys()}
iddpg_pbar = tqdm(total=n_iters, desc="IDDPG Training Progress")

print("Starting IDDPG training...")
for iteration, batch in enumerate(iddpg_collector):
    current_frames = batch.numel()
    batch = process_batch(batch)
    shared_replay_buffer.extend(batch.reshape(-1))

    for group in env.group_map.keys():
        for _ in range(n_optimiser_steps):
            subdata = shared_replay_buffer.sample()

            # Compute Target Actions for IDDPG
            with torch.no_grad():
                next_td = subdata.get("next")
                for other_group in env.group_map.keys():
                    next_td = target_policies[other_group](next_td)

            # Use IDDPG losses instead of MADDPG losses
            loss_vals = iddpg_losses[group](subdata)
            for loss_name in ["loss_actor", "loss_value"]:
                loss = loss_vals[loss_name]
                optimiser = iddpg_optimisers[group][loss_name]
                loss.backward()
                params = optimiser.param_groups[0]["params"]
                torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
                optimiser.step()
                optimiser.zero_grad()
            iddpg_target_updaters[group].step()
        exploration_policies[group][-1].step(current_frames)

    for group in env.group_map.keys():
        episode_reward_mean = (
            batch.get(("next", group, "episode_reward"))[
                batch.get(("next", group, "done"))
            ]
            .mean()
            .item()
        )
        iddpg_episode_reward_mean_map[group].append(episode_reward_mean)

    reward_strings = [
        f"{group}: {iddpg_episode_reward_mean_map[group][-1]:.2f}"
        for group in env.group_map.keys()
    ]
    description = (
        f"IDDPG Iter [{iteration+1}/{n_iters}] | Rewards: " + " | ".join(reward_strings)
    )
    iddpg_pbar.set_description(description)
    iddpg_pbar.refresh()

iddpg_pbar.close()
iddpg_collector.shutdown()
print("IDDPG training finished!")


In [None]:
# Create IDDPG Losses and Optimizers
iddpg_losses = {}
for group, _agents in env.group_map.items():
    loss_module = DDPGLoss(
        actor_network=policies[group],
        value_network=iddpg_critics[group],  # Use IDDPG critics instead
        delay_value=True,
        delay_actor=True,
        loss_function="l2",
    )
    loss_module.set_keys(
        reward=(group, "reward"),
        done=(group, "done"),
        terminated=(group, "terminated"),
        value=(group, "state_action_value")
    )
    loss_module.make_value_estimator(ValueEstimators.TD0, gamma=gamma)
    iddpg_losses[group] = loss_module

# Target Network Updaters and Optimizers for IDDPG
iddpg_target_updaters = {group: SoftUpdate(loss, tau=polyak_tau) for group, loss in iddpg_losses.items()}
iddpg_optimisers = {
    group: {
        "loss_actor": torch.optim.Adam(loss.actor_network_params.flatten_keys().values(), lr=1e-4),
        "loss_value": torch.optim.Adam(loss.value_network_params.flatten_keys().values(), lr=3e-4),
    }
    for group, loss in iddpg_losses.items()
}

print("IDDPG losses and optimizers created successfully!")


## Part 4: Converting MADDPG to IDDPG

**MADDPG vs IDDPG: Key Differences**

The main difference between MADDPG (Multi-Agent Deep Deterministic Policy Gradient) and IDDPG (Independent Deep Deterministic Policy Gradient) lies in the critic network:

- **MADDPG**: Uses a centralized critic that has access to observations and actions of ALL agents
- **IDDPG**: Uses independent critics, where each agent's critic only sees its own observations and actions

**Changes Required:**

1. **Critic Input**: Instead of concatenating all agents' observations and actions, each critic should only see its own agent's observation and action
2. **Critic Architecture**: Each agent gets its own independent critic network
3. **Training**: Each agent trains independently without sharing information during critic updates

Let's implement IDDPG by modifying the critic creation:


In [None]:
# Part 4: IDDPG Implementation
print("=" * 60)
print("IMPLEMENTING IDDPG (Independent DDPG)")
print("=" * 60)

# Create IDDPG Critics - Each agent has its own independent critic
iddpg_critics = {}
for group, agents in env.group_map.items():
    agent_critic_modules = {}
    for agent in agents:
        # For IDDPG, each critic only sees its own observation and action
        obs_dim = env.observation_spec[agent, "observation"].shape[0]
        action_dim = env.full_action_spec[agent, "action"].shape[0]
        critic_in_features = obs_dim + action_dim  # Only own obs + action
        
        agent_critic_modules[agent] = MLP(
            in_features=critic_in_features,
            out_features=1,
            num_cells=[256, 256],
            activation_class=nn.ReLU
        )

    # Create independent critics for each agent
    agent_critic_tdmodules = {}
    for agent in agents:
        # Each critic only concatenates its own observation and action
        cat_inputs = [(agent, "observation"), (agent, "action")]
        
        cat_module = TensorDictModule(
            lambda *tensors: torch.cat(tensors, dim=-1),
            in_keys=cat_inputs,
            out_keys=[(agent, "own_obs_action")]
        )

        critic_module = TensorDictModule(
            agent_critic_modules[agent],
            in_keys=[(agent, "own_obs_action")],
            out_keys=[(agent, "state_action_value")],
        )
        agent_critic_tdmodules[agent] = TensorDictSequential(cat_module, critic_module)
    iddpg_critics[group] = TensorDictSequential(*agent_critic_tdmodules.values())

print("IDDPG Critics created successfully!")
print("Key difference: Each critic only sees its own agent's observation and action")


## Analysis: MADDPG vs IDDPG

**Key Differences and Trade-offs:**

### 1. **Information Sharing**
- **MADDPG**: Centralized training allows critics to see all agents' information, enabling better coordination
- **IDDPG**: Independent critics only see local information, making coordination more challenging

### 2. **Scalability**
- **MADDPG**: Input size grows quadratically with number of agents (O(n²))
- **IDDPG**: Input size grows linearly with number of agents (O(n))

### 3. **Training Stability**
- **MADDPG**: More stable due to centralized information, but requires all agents' actions during training
- **IDDPG**: Less stable due to non-stationary environment, but simpler to implement

### 4. **Coordination Ability**
- **MADDPG**: Better at learning coordinated behaviors due to global information
- **IDDPG**: May struggle with complex coordination tasks

### 5. **Computational Requirements**
- **MADDPG**: Higher computational cost due to larger critic networks
- **IDDPG**: Lower computational cost with smaller, independent networks

**When to Use Each:**
- **Use MADDPG** when: Coordination is crucial, computational resources are available, agents need to work together
- **Use IDDPG** when: Agents can work independently, computational resources are limited, simpler implementation is preferred


In [None]:
# IDDPG Training Setup
print("Setting up IDDPG training...")

# Create IDDPG losses
iddpg_losses = {}
for group, _agents in env.group_map.items():
    loss_module = DDPGLoss(
        actor_network=policies[group],
        value_network=iddpg_critics[group],
        delay_value=True,
        delay_actor=True,
        loss_function="l2",
    )
    loss_module.set_keys(
        reward=(group, "reward"),
        done=(group, "done"),
        terminated=(group, "terminated"),
        value=(group, "state_action_value")
    )
    loss_module.make_value_estimator(ValueEstimators.TD0, gamma=gamma)
    iddpg_losses[group] = loss_module

# Create IDDPG target networks
iddpg_target_critics = copy.deepcopy(iddpg_critics)

# IDDPG Target Network Updaters and Optimizers
iddpg_target_updaters = {group: SoftUpdate(loss, tau=polyak_tau) for group, loss in iddpg_losses.items()}
iddpg_optimisers = {
    group: {
        "loss_actor": torch.optim.Adam(loss.actor_network_params.flatten_keys().values(), lr=1e-4),
        "loss_value": torch.optim.Adam(loss.value_network_params.flatten_keys().values(), lr=3e-4),
    }
    for group, loss in iddpg_losses.items()
}

print("IDDPG training setup complete!")


In [None]:
# IDDPG Implementation: Independent Critics
print("Creating IDDPG (Independent DDPG) implementation...")

# Create Independent Critics for IDDPG
iddpg_critics = {}
for group, agents in env.group_map.items():
    agent_critic_modules = {}
    for agent in agents:
        # For IDDPG, each critic only sees its own agent's observation and action
        obs_dim = env.observation_spec[agent, "observation"].shape[0]
        action_dim = env.full_action_spec[agent, "action"].shape[0]
        critic_in_features = obs_dim + action_dim  # Only own obs + action
        
        agent_critic_modules[agent] = MLP(
            in_features=critic_in_features,
            out_features=1,
            num_cells=[256, 256],
            activation_class=nn.ReLU
        )

    # Create independent critic modules for each agent
    agent_critic_tdmodules = {}
    for agent in agents:
        # Concatenate only the agent's own observation and action
        cat_module = TensorDictModule(
            lambda obs, action: torch.cat([obs, action], dim=-1),
            in_keys=[(agent, "observation"), (agent, "action")],
            out_keys=[(agent, "own_obs_action")]
        )

        critic_module = TensorDictModule(
            agent_critic_modules[agent],
            in_keys=[(agent, "own_obs_action")],
            out_keys=[(agent, "state_action_value")],
        )
        agent_critic_tdmodules[agent] = TensorDictSequential(cat_module, critic_module)
    iddpg_critics[group] = TensorDictSequential(*agent_critic_tdmodules.values())

print("IDDPG critics created successfully!")
print(f"MADDPG critic input size: {sum(env.observation_spec[agent, 'observation'].shape[0] + env.full_action_spec[agent, 'action'].shape[0] for agent in env.group_map['agent_0'])}")
print(f"IDDPG critic input size: {env.observation_spec['agent_0', 'observation'].shape[0] + env.full_action_spec['agent_0', 'action'].shape[0]}")
