## RLDMUU 2025
#### MARL - introduction
jakub.tluczek@unine.ch

In the last lab of the RLDMUU course, we explore the basics of multi-agent reinforcement learning (MARL). We go through the PettingZoo workflow (PettingZoo is a Gymnasium-based framework for multi-agent settings) and then through an example of end-to-end training of multiple agents.

In [None]:
import pettingzoo
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np
from matplotlib import pyplot as plt
from tqdm import tqdm
from collections import deque
import random
torch.manual_seed(123)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
np.random.seed(123)

PettingZoo offers the same environment-based workflow as Gymnasium and provides two APIs for agent interaction:

- Agent Environment Cycle (AEC) API - turn based
- Parallel API - moves are made simultaneously

PettingZoo has a `step` function similar to the one we know from Gymnasium. In the Parallel API it operates on dictionaries: each agent's observation, reward, or action is an entry in a `dict`, with the agent's identifier serving as the key. In the AEC API, `step` only acts on the environment, taking the action of the agent whose turn it is, while that agent's observation and reward are retrieved with the `last` method. Let's take a look at an example, a classic tic-tac-toe game with AEC:

In [None]:
from pettingzoo.classic import tictactoe_v3

env = tictactoe_v3.env()
env.reset()

# agent_iter yields the ID of the agent whose turn it is
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        action = None
    else:
        # the action mask stored in the observation tells us which actions are allowed
        mask = observation["action_mask"]

        action = env.action_space(agent).sample(mask)

    env.step(action)
env.close()
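
The masked sampling used above can also be done by hand, which is handy once the action comes from a policy instead of `sample`. The sketch below assumes the mask is a length-9 array of 0/1 flags, as in tic-tac-toe. `sample_legal_action` is a hypothetical helper name, not part of PettingZoo.

```python
import numpy as np

def sample_legal_action(action_mask, rng):
    """Pick one legal action uniformly at random (hypothetical helper)."""
    legal = np.flatnonzero(action_mask)  # indices where the mask is non-zero
    if legal.size == 0:
        raise ValueError("no legal actions available")
    return int(rng.choice(legal))

rng = np.random.default_rng(123)
# cells 0, 4 and 8 are already taken, the other six are free
mask = np.array([0, 1, 1, 1, 0, 1, 1, 1, 0], dtype=np.int8)
action = sample_legal_action(mask, rng)
print(action in {1, 2, 3, 5, 6, 7})  # True
```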

Another way to interact with a multi-agent environment is to act in parallel. Agents act at the same time, without observing the current actions of the other agents. Observations are retrieved in a similar way as in Gymnasium:

In [None]:
from pettingzoo.butterfly import knights_archers_zombies_v10

env = knights_archers_zombies_v10.parallel_env()
observations, infos = env.reset()

# instead of iterating over agent IDs, loop until there are no more active agents
while env.agents:
    # sampling actions for all agents
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}

    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()

Let's stick to tic-tac-toe for the purpose of this class. The nice thing about classical games is that they have a clear, predefined goal (winning), so we don't have to think about solution concepts. In tic-tac-toe we also assume that it's not in the agents' interest to cooperate, as it is ultimately a zero-sum game. Though simple, this environment still suffers from the usual MARL problems: implementing simple Q-learning is tricky because the environment is non-stationary from each agent's point of view, and the rewards are very sparse (they are only awarded at the end of the game).

Let's take a look at the first method, called **self-play**. Here we play the game against a copy of our own policy, which we update periodically. As our policy $\pi$ improves, each update of the "opponent" makes it a stronger player as well. First, let's reuse the DQN we coded before; it will form the basis of our policy.

In [None]:
class QNetwork(nn.Module):
    def __init__(self, n_states, n_actions, hidden_dim):
        super(QNetwork, self).__init__()
        self.linear1 = nn.Linear(n_states, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, n_actions)

    def forward(self, state):
        x = F.relu(self.linear1(state))
        x = F.relu(self.linear2(x))
        return self.linear3(x)

At first, the agent doesn't know the game at all, so you can initially fill the replay buffer with random games: just sample the action spaces of both players. Remember to use the action mask!

In [None]:
class ReplayBuffer:
    def __init__(self, size):
        self.memory = deque([], maxlen=size)

    # this method samples transitions and returns tensors of each type registered in the environment step
    def sample(self, sample_size):
        sample = random.sample(self.memory, sample_size)
        states = []
        actions = []
        rewards = []
        next_states = []
        dones = []
        for x in sample:
            states.append(x[0])
            actions.append(x[1])
            rewards.append(x[2])
            next_states.append(x[3])
            dones.append(x[4])
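        # stack observations into one array first: building a tensor from a list of arrays is slow
        states = torch.tensor(np.array(states), dtype=torch.float32).to(device)
        actions = torch.tensor(actions, dtype=torch.long).to(device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(device)
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32).to(device)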
        dones = torch.tensor(dones, dtype=torch.int).to(device)
        return states, actions, rewards, next_states, dones
    
    # add transition to the buffer
    def append(self, item):
        self.memory.append(item)

    def __len__(self):
        return len(self.memory)
    
def parameter_update(source_model, target_model, tau):
    for target_param, source_param in zip(target_model.parameters(), source_model.parameters()):
        target_param.data.copy_(tau * source_param.data + (1.0 - tau)*target_param.data)
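
To make the soft update in `parameter_update` concrete: each target parameter moves a fraction `tau` of the way toward the corresponding source parameter, $\theta_{\text{target}} \leftarrow \tau\,\theta_{\text{source}} + (1-\tau)\,\theta_{\text{target}}$. Below is a minimal sketch on plain floats. The helper `soft_update` and the numbers are illustrative only and are not part of the lab code.

```python
def soft_update(source, target, tau):
    """Return target values moved a fraction tau toward source (illustrative helper)."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

source = [1.0, -2.0]
target = [0.0, 0.0]
# three consecutive updates halve the remaining gap each time
for _ in range(3):
    target = soft_update(source, target, tau=0.5)
print(target)  # [0.875, -1.75]
```

With a small value such as `tau = 0.01`, the target network trails the online network slowly, which keeps the bootstrapped targets from shifting abruptly.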

Now your task is to define the policy of the agent we want to train, as well as the "opponent" policy. You can reuse the code from Lab 8. You can reuse the target network as the "opponent", but remember to turn off gradient tracking when picking the opponent's move.
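
One way to pick a greedy move among legal actions only is to overwrite the Q-values of illegal actions with $-\infty$ before taking the argmax. The sketch below uses NumPy and made-up Q-values, and `masked_argmax` is a hypothetical helper name. With PyTorch, the network call for the opponent would additionally be wrapped in `torch.no_grad()` so that no gradient graph is built.

```python
import numpy as np

def masked_argmax(q_values, action_mask):
    """Greedy action restricted to legal moves (hypothetical helper)."""
    legal = np.asarray(action_mask, dtype=bool)
    # illegal actions get -inf so they can never win the argmax
    masked = np.where(legal, q_values, -np.inf)
    return int(np.argmax(masked))

# made-up Q-values for the 9 cells; cells 0 and 3 are occupied
q_values = np.array([0.9, 0.1, 0.4, 0.8, 0.2, 0.0, 0.3, 0.7, 0.5])
mask = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1])
print(masked_argmax(q_values, mask))  # 7
```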

In [None]:
# Hyperparameters
hidden_dim = 64
replay_buffer_size = 10000
batch_size = 64
gamma = 0.99
lr = 1e-3
epsilon = 1.0
epsilon_end = 0.1
epsilon_decay = 0.995
tau = 0.01

# Initialize the environment and the networks
env = tictactoe_v3.env()
env.reset()
first_agent = env.possible_agents[0]
# the board observation has shape (3, 3, 2), so the network input is its flattened size
obs_dim = int(np.prod(env.observation_space(first_agent)["observation"].shape))
n_actions = env.action_space(first_agent).n

def flatten_obs(obs):
    # turn the board planes into a flat float32 vector for the network
    return np.asarray(obs["observation"], dtype=np.float32).reshape(-1)

# Q-network shared by both players
q_net = QNetwork(obs_dim, n_actions, hidden_dim).to(device)
target_q_net = QNetwork(obs_dim, n_actions, hidden_dim).to(device)
target_q_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
replay_buffer = ReplayBuffer(replay_buffer_size)

# Fill the replay buffer using a random policy
print("Filling replay buffer with random transitions...")
while len(replay_buffer) < 1000:
    env.reset()
    for agent in env.agent_iter():
        obs, reward, term, trunc, info = env.last()
        if term or trunc:
            env.step(None)
            continue
        action = int(np.random.choice(np.flatnonzero(obs["action_mask"])))
        env.step(action)
        # after the step, last() describes the next agent, i.e. the opponent
        next_obs, _, next_term, next_trunc, _ = env.last()
        done = next_term or next_trunc
        # env.rewards[agent] is the reward the acting agent received for this move
        replay_buffer.append((flatten_obs(obs), action, env.rewards[agent], flatten_obs(next_obs), done))

print("Replay buffer filled.")

# DQN training loop
NUM_TRAJECTORIES = 1000
for episode in range(NUM_TRAJECTORIES):
    env.reset()
    for agent in env.agent_iter():
        obs, reward, term, trunc, info = env.last()

        if term or trunc:
            env.step(None)
            continue

        # epsilon-greedy action selection, restricted to legal moves
        if random.random() < epsilon:
            action = int(np.random.choice(np.flatnonzero(obs["action_mask"])))
        else:
            obs_tensor = torch.tensor(flatten_obs(obs)).unsqueeze(0).to(device)
            mask = torch.tensor(obs["action_mask"], dtype=torch.bool).to(device)
            with torch.no_grad():
                q_values = q_net(obs_tensor).squeeze(0)
            q_values[~mask] = -float('inf')
            action = torch.argmax(q_values).item()

        # Execute the action
        env.step(action)
        next_obs, _, next_term, next_trunc, _ = env.last()
        done = next_term or next_trunc
        # store every transition: the terminal ones carry the only non-zero rewards
        replay_buffer.append((flatten_obs(obs), action, env.rewards[agent], flatten_obs(next_obs), done))

        # Learning step
        if len(replay_buffer) >= batch_size:
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

            q_values = q_net(states.float()).gather(1, actions.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                next_q_values = target_q_net(next_states.float()).max(1)[0]
            # next_states are seen from the opponent's side, so in this zero-sum game
            # the opponent's best value counts against the acting player (sign flip)
            targets = rewards.float() - gamma * next_q_values * (1 - dones)

            loss = F.mse_loss(q_values, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Soft update of the target network
            parameter_update(q_net, target_q_net, tau)

    # Decay epsilon
    epsilon = max(epsilon_end, epsilon * epsilon_decay)

    if episode % 100 == 0:
        print(f"Episode {episode}, Epsilon: {epsilon:.3f}")

env.close()
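
It can help to check how long exploration lasts under the multiplicative schedule used above. The short calculation below assumes the same values as the hyperparameter block (`epsilon = 1.0`, `epsilon_end = 0.1`, `epsilon_decay = 0.995`). The variable names are illustrative.

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.1, 0.995

# smallest n with epsilon_start * epsilon_decay**n <= epsilon_end
episodes_to_floor = math.ceil(
    math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay)
)
print(episodes_to_floor)  # 460
```

With 1000 training episodes, a bit more than half of them therefore run at the minimum exploration rate of 0.1.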


Finally, your task is to create a training loop for DQN with self-play. Plot the results when you are finished.

In [None]:
NUM_TRAJECTORIES = 1000

env = tictactoe_v3.env()
env.reset()

# TODO: fill the buffer with random transitions

# TODO: create an AEC training loop
# Hint: you can reuse the DQN training code from Lab 8
# Remember to apply action masking and to follow the specific way a multi-agent environment is stepped

for episode in range(NUM_TRAJECTORIES):
    pass


env.close()

*BONUS*: Since the rewards are very sparse, you can augment the reward function and use another technique called **curriculum learning**. Using your expert knowledge, you can steer the policy toward regions that are known to work well in the particular game. For example, try rewarding the agent for placing two X's or O's in a row with no opponent piece in that line.
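
A possible starting point for the bonus is sketched below. It assumes a board representation of our own choosing: a flat list of 9 cells indexed row by row, each holding `"X"`, `"O"`, or `None`. Converting the PettingZoo observation into this form is left out. `open_pairs`, `shaped_reward`, and the bonus weight `0.1` are hypothetical choices, not part of the lab code.

```python
# the 8 winning lines of a 3x3 board, cells indexed 0..8 row by row
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def open_pairs(board, player):
    """Count lines with exactly two of `player`'s marks and no opponent mark."""
    count = 0
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(player) == 2 and cells.count(None) == 1:
            count += 1
    return count

def shaped_reward(env_reward, board, player, bonus=0.1):
    """Environment reward plus a small bonus per open pair (assumed weighting)."""
    return env_reward + bonus * open_pairs(board, player)

board = ["X", "X", None,
         None, "O", None,
         None, None, None]
print(open_pairs(board, "X"))  # 1
print(open_pairs(board, "O"))  # 0
```

Keeping the bonus small relative to the win/loss reward of ±1 reduces the risk that the agent learns to collect pairs instead of winning.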