# Multi-Agent Reinforcement Learning (MARL)

This tutorial demonstrates how to use Tianshou for multi-agent reinforcement learning scenarios. We'll explore different MARL paradigms and implement a practical example using the Tic-Tac-Toe game.

## MARL Paradigms

Tianshou supports three fundamental types of multi-agent reinforcement learning paradigms:

1. **Simultaneous move**: All agents take their actions at each timestep simultaneously (e.g., MOBA games)
2. **Cyclic move**: Agents take actions sequentially in turns (e.g., Go)
3. **Conditional move**: The environment conditionally selects which agent acts at each timestep (e.g., [Pig Game](https://en.wikipedia.org/wiki/Pig_(dice_game)))

Our approach addresses these multi-agent RL problems by converting them into traditional single-agent RL formulations.

## Converting MARL to Single-Agent RL

### Simultaneous Move

For simultaneous-move scenarios, the solution is straightforward: we add an extra `num_agents` dimension to the state, action, and reward tensors. No other modifications are necessary.
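
As a rough illustration of the extra `num_agents` dimension, the sketch below uses a hypothetical two-player rock-paper-scissors environment. It is not part of Tianshou or PettingZoo, and the names `NUM_AGENTS` and `step` are chosen only for this example. The point is that `step` consumes one action per agent and returns one observation and one reward per agent.

```python
# Hypothetical toy environment: simultaneous-move rock-paper-scissors.
# Every per-step quantity carries a leading dimension of size NUM_AGENTS.
NUM_AGENTS = 2
ROCK, PAPER, SCISSORS = 0, 1, 2


def step(actions: list[int]) -> tuple[list[int], list[float]]:
    """Consume one action per agent; return per-agent observations and rewards."""
    assert len(actions) == NUM_AGENTS
    a, b = actions
    # Each agent observes the action its opponent just played.
    observations = [b, a]
    if a == b:
        rewards = [0.0, 0.0]
    elif (a - b) % 3 == 1:  # paper beats rock, scissors beats paper, rock beats scissors
        rewards = [1.0, -1.0]
    else:
        rewards = [-1.0, 1.0]
    return observations, rewards


observations, rewards = step([PAPER, ROCK])
print(observations, rewards)  # [0, 1] [1.0, -1.0]
```

A single-agent algorithm can treat the action list, observation list, and reward list as ordinary batched quantities with one extra axis.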

### Cyclic and Conditional Move

Both cyclic and conditional move scenarios can be unified into a single framework. At each timestep, the environment selects an agent identified by `agent_id` to act. Since multiple agents are typically wrapped into a single object (the "abstract agent"), we pass the `agent_id` to this abstract agent, which then delegates the action to the appropriate specific agent.

Additionally, in multi-agent RL, the set of legal actions often varies across timesteps (as in Go). Therefore, the environment must also provide a legal action mask to the abstract agent. This mask is a boolean array where `True` indicates available actions and `False` indicates illegal actions at the current timestep.

<div style="text-align: center; padding: 1rem;">
<img src="../_static/images/marl.png" style="height: 300px; padding-bottom: 1rem;"><br>
The abstract agent framework for multi-agent RL
</div>
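
The delegation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Tianshou's implementation: `RandomLegalAgent` and `AbstractAgent` are hypothetical names, and the per-agent policy simply samples uniformly among the actions that the mask marks as legal.

```python
import random


class RandomLegalAgent:
    """Picks uniformly among the actions the mask marks as legal."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)

    def act(self, state, mask: list[bool]) -> int:
        legal_actions = [i for i, is_legal in enumerate(mask) if is_legal]
        return self.rng.choice(legal_actions)


class AbstractAgent:
    """Wraps several agents and forwards each request to the one named by agent_id."""

    def __init__(self, agents: dict[str, RandomLegalAgent]) -> None:
        self.agents = agents

    def act(self, state, agent_id: str, mask: list[bool]) -> int:
        return self.agents[agent_id].act(state, mask)


abstract_agent = AbstractAgent(
    {"player_1": RandomLegalAgent(seed=0), "player_2": RandomLegalAgent(seed=1)}
)
# Only actions 2 and 5 are legal here, so the chosen action is one of them.
mask = [False, False, True, False, False, True]
action = abstract_agent.act(state=None, agent_id="player_2", mask=mask)
print(action in (2, 5))  # True
```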

## Unified Formulation

This architecture leads to the following formulation of multi-agent RL:

```python
act = policy(state, agent_id, mask)
(next_state, next_agent_id, next_mask), reward = env.step(act)
```

By constructing an augmented state `state_ = (state, agent_id, mask)`, we can reduce this to the standard single-agent RL formulation:

```python
act = policy(state_)
next_state_, reward = env.step(act)
```
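
To make the reduction concrete, here is a small self-contained sketch. The `TakeAwayEnv` game and the `policy` function are hypothetical and exist only for this illustration. The environment packs `state`, `agent_id`, and `mask` into one dictionary, and the policy sees nothing but that augmented state. Unlike the two-value `step` in the formulation above, this sketch also returns a `done` flag so that the loop can terminate.

```python
import random

AGENTS = ["player_1", "player_2"]


class TakeAwayEnv:
    """Hypothetical cyclic-move game: players alternately take 1 or 2 stones;
    whoever takes the last stone wins."""

    def __init__(self, stones: int = 7) -> None:
        self.initial_stones = stones

    def _augmented_state(self) -> dict:
        # state_ = (state, agent_id, mask) packed into one observation
        return {
            "obs": self.stones,
            "agent_id": AGENTS[self.turn],
            "mask": [self.stones >= 1, self.stones >= 2],  # take 1, take 2
        }

    def reset(self) -> dict:
        self.stones = self.initial_stones
        self.turn = 0
        return self._augmented_state()

    def step(self, act: int) -> tuple[dict, list[int], bool]:
        self.stones -= act + 1  # action 0 takes one stone, action 1 takes two
        done = self.stones == 0
        reward = [0, 0]
        if done:
            reward[self.turn] = 1
            reward[1 - self.turn] = -1
        self.turn = 1 - self.turn
        return self._augmented_state(), reward, done


def policy(state_: dict, rng: random.Random) -> int:
    """A single-agent-style policy: it only ever sees the augmented state."""
    legal = [i for i, ok in enumerate(state_["mask"]) if ok]
    return rng.choice(legal)


rng = random.Random(0)
env = TakeAwayEnv()
state_, done = env.reset(), False
while not done:
    act = policy(state_, rng)
    state_, reward, done = env.step(act)
print(reward)  # one entry per player: the winner gets 1, the loser -1
```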

Following this principle, we'll train a Deep Q-Network (DQN) agent to play [Tic-Tac-Toe](https://en.wikipedia.org/wiki/Tic-tac-toe) against a random opponent.

## PettingZoo Integration

Tianshou is fully compatible with [PettingZoo](https://pettingzoo.farama.org/) environments for multi-agent RL. Tianshou does not provide specialized MARL algorithms, but its flexible framework, together with the helper classes introduced below, can be adapted to various MARL scenarios.

For comprehensive tutorials on using Tianshou with PettingZoo, refer to:

* [Beginner Tutorial](https://pettingzoo.farama.org/tutorials/tianshou/beginner/)
* [Intermediate Tutorial](https://pettingzoo.farama.org/tutorials/tianshou/intermediate/)
* [Advanced Tutorial](https://pettingzoo.farama.org/tutorials/tianshou/advanced/)

In this tutorial, we'll demonstrate how to use Tianshou in a multi-agent setting where only one agent is trained while the other uses a fixed random policy. You can then use this as a blueprint to replace the random policy with another trainable agent.

Specifically, we'll train an agent to play Tic-Tac-Toe against a random opponent:

<div style="text-align: center; padding: 1rem;">
<img src="../_static/images/tic-tac-toe.png" style="padding-bottom: 1rem;"><br>
Tic-Tac-Toe game board
</div>

## Exploring the Tic-Tac-Toe Environment

The complete scripts are located in `test/pettingzoo/`. Tianshou provides the `PettingZooEnv` wrapper class that can wrap any PettingZoo environment. Let's explore the 3×3 Tic-Tac-Toe environment provided by PettingZoo.

In [None]:
from pettingzoo.classic import tictactoe_v3  # the Tic-Tac-Toe environment

from tianshou.env import PettingZooEnv  # wrapper for PettingZoo environments

# Initialize the environment
# The board has 3 rows and 3 columns (9 positions total)
# Players place 'X' and 'O' alternately on the board
# The first player to get 3 consecutive marks wins
env = PettingZooEnv(tictactoe_v3.env(render_mode="human"))
obs, info = env.reset()  # reset follows the Gymnasium API and returns (obs, info)
env.render()  # render the empty board

The output shows an empty 3×3 board:

```
board (step 0):
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

In [None]:
# Examine the observation structure
print(obs)

### Understanding the Observation Space

The observation returned by the environment is a dictionary with three keys:

- **`agent_id`**: The identifier of the currently acting agent (e.g., `'player_1'` or `'player_2'`)

- **`obs`**: The actual environment observation. For Tic-Tac-Toe, this is a numpy array with shape `(3, 3, 2)`:
  - For `player_1`: The first 3×3 plane represents X placements, the second plane represents O placements
  - For `player_2`: The planes are swapped (O in first plane, X in second)
  - Each cell contains either 0 (empty/not placed) or 1 (mark placed)

- **`mask`**: A boolean array indicating legal actions at the current timestep. For Tic-Tac-Toe, the nine actions index the board cells column by column: index `i` corresponds to row `i % 3` and column `i // 3`, so actions 0, 1, and 2 fill the left column from top to bottom. If `mask[i] == True`, the player can place their mark at that position. Initially, all positions are available, so all mask values are `True`.

> **Note**: The mask representation is flexible and works for both discrete and continuous action spaces. While we use a boolean array here, you could also use action spaces like `gymnasium.spaces.Discrete` or `gymnasium.spaces.Box` to represent available actions.

### Playing a Few Steps

Let's play a couple of moves to understand the environment dynamics better.

In [None]:
import numpy as np

# Take an action (place mark at position 0 - top-left corner)
action = 0  # action can be an integer or a numpy array with one element
obs, reward, done, truncated, info = env.step(action)  # follows the Gymnasium API

print("Observation after first move:")
print(obs)

# Examine the reward structure
# Reward has two items (one for each player): 1 for win, -1 for loss, 0 otherwise
print(f"\nReward: {reward}")

# Check if the game is over
print(f"Done: {done}")

# Info is typically an empty dict in Tic-Tac-Toe but may contain useful information in other environments
print(f"Info: {info}")

Notice that after the first move:
- The `agent_id` switches to `'player_2'`
- The observation array now records the X at the first position; because the observation is given from `player_2`'s perspective, that mark appears in the second plane
- The mask now has `False` at index 0 (that position is occupied)
- The reward is `[0, 0]` (no winner yet)
- The game continues (`done = False`)

### Game Termination

An interesting detail: the game terminates when only one empty position remains, rather than when the board is completely full. This is because a player with only one available position has no meaningful choice.

In [None]:
# Continue playing (positions: 3, 1, 4, then 2)
# ... (intermediate moves omitted for brevity)

# Final move where player_1 wins
obs, reward, done, truncated, info = env.step(2)
print(f"Final state - Reward: {reward}, Done: {done}")
env.render()

The final board shows player_1 (X) winning with three consecutive marks:

```
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  X  |  -  |  -
     |     |
```

The reward is `[1, -1]` indicating player_1 wins (+1) and player_2 loses (-1).
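
Because each player's reward is +1, 0, or -1, averaging one player's reward over many episodes gives that player's win fraction minus its loss fraction. The sketch below checks this arithmetic on a handful of made-up reward vectors. The data are illustrative only, not results from a real run, and `learner_index` is a name chosen for this example.

```python
# Illustrative (made-up) per-episode reward vectors: [player_1, player_2].
episode_rewards = [[1, -1], [-1, 1], [0, 0], [-1, 1], [-1, 1]]
learner_index = 1  # the learner plays as player_2

learner_rewards = [rewards[learner_index] for rewards in episode_rewards]
wins = sum(r == 1 for r in learner_rewards)
losses = sum(r == -1 for r in learner_rewards)
n = len(learner_rewards)

mean_reward = sum(learner_rewards) / n
print(mean_reward)          # 0.4
print((wins - losses) / n)  # 0.4
```

A threshold on the mean reward, such as the `--win-rate` target used later in this tutorial, is therefore a threshold on wins minus losses rather than on wins alone.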

## Random Agents

Now that we understand the environment, let's start by watching two random agents play against each other.

Tianshou provides built-in classes for multi-agent learning. The key components are:

- **`RandomPolicy`**: A policy that randomly selects actions
- **`MultiAgentPolicyManager`**: Manages multiple agent policies and delegates actions to the appropriate agent based on `agent_id`

<div style="text-align: center; padding: 1rem;">
<img src="../_static/images/marl.png" style="height: 300px; padding-bottom: 1rem;"><br>
The relationship between MultiAgentPolicyManager and individual agent policies
</div>

In [None]:
from tianshou.algorithm.algorithm_base import RandomActionPolicy
from tianshou.algorithm.multiagent.marl import MultiAgentPolicy
from tianshou.data import Collector
from tianshou.env import DummyVectorEnv

# Create a multi-agent policy with two random agents, keyed by the
# environment's agent IDs so that each agent_id maps to its own policy
policy = MultiAgentPolicy(
    {
        agent_id: RandomActionPolicy(action_space=env.action_space)
        for agent_id in env.agents
    }
)

# Vectorize the environment for the collector
env = DummyVectorEnv([lambda: env])

# Create a collector to gather trajectories
collector = Collector(policy, env)

# Collect and visualize one episode
result = collector.collect(n_episode=1, render=0.1, reset_before_collect=True)

You'll see the game progress step by step. Here's an example of the final moves:

```
     |     |
  X  |  X  |  -
_____|_____|_____
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  O  |  -  |  -
     |     |
```

```
     |     |
  X  |  X  |  -
_____|_____|_____
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  O  |  -  |  O
     |     |
```

```
     |     |
  X  |  X  |  X
_____|_____|_____
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  O  |  -  |  O
     |     |
```

Random agents perform poorly. In the game above, agent 1 (X) eventually wins by completing the top row, but a smart agent 2 would have won one move earlier by placing an O in the top-right corner, which completes the diagonal through the center and also blocks X.
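
A one-move lookahead of the kind a smarter agent would use can be sketched as follows. This is a hand-written heuristic for illustration, not something Tianshou provides. `winning_moves` and `LINES` are hypothetical names, and the board is written as three strings in the same row-by-row layout as the rendered output, applied here to the first board shown above.

```python
LINES = [
    [(0, 0), (0, 1), (0, 2)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)],  # rows
    [(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)], [(0, 2), (1, 2), (2, 2)],  # columns
    [(0, 0), (1, 1), (2, 2)], [(0, 2), (1, 1), (2, 0)],  # diagonals
]


def winning_moves(board: list[str], mark: str) -> list[tuple[int, int]]:
    """Return the empty cells (row, col) that would complete a line for `mark`."""
    moves = set()
    for line in LINES:
        marks = [board[r][c] for r, c in line]
        if marks.count(mark) == 2 and marks.count("-") == 1:
            moves.add(line[marks.index("-")])
    return sorted(moves)


board = ["XX-", "XO-", "O--"]
print(winning_moves(board, "O"))  # [(0, 2)]
print(winning_moves(board, "X"))  # [(0, 2)]
```

On this board the top-right cell is decisive for both players, so whoever moves next should take it.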

## Training an Agent Against a Random Opponent

Now let's train an intelligent agent! We'll use Deep Q-Network (DQN) to learn optimal play against a random opponent.

### Imports and Setup

First, let's import all necessary modules:

In [None]:
import argparse
import os
from copy import deepcopy

import gymnasium as gym
import numpy as np
import torch
from pettingzoo.classic import tictactoe_v3
from torch.utils.tensorboard import SummaryWriter

from tianshou.algorithm import (
    BasePolicy,
    DQNPolicy,
    MultiAgentPolicyManager,
    RandomPolicy,
)
from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.env.pettingzoo_env import PettingZooEnv
from tianshou.trainer import OffpolicyTrainer
from tianshou.utils import TensorboardLogger
from tianshou.utils.net.common import MLPActor

### Hyperparameters

Let's define the hyperparameters for our training experiment:

In [None]:
def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=1626)
    parser.add_argument("--eps-test", type=float, default=0.05)
    parser.add_argument("--eps-train", type=float, default=0.1)
    parser.add_argument("--buffer-size", type=int, default=20000)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument(
        "--gamma",
        type=float,
        default=0.9,
        help="Discount factor (smaller values favor earlier wins)",
    )
    parser.add_argument("--n-step", type=int, default=3)
    parser.add_argument("--target-update-freq", type=int, default=320)
    parser.add_argument("--epoch", type=int, default=50)
    parser.add_argument("--epoch_num_steps", type=int, default=1000)
    parser.add_argument("--collection_step_num_env_steps", type=int, default=10)
    parser.add_argument("--update-per-step", type=float, default=0.1)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--hidden-sizes", type=int, nargs="*", default=[128, 128, 128, 128])
    parser.add_argument("--num_train_envs", type=int, default=10)
    parser.add_argument("--num_test_envs", type=int, default=10)
    parser.add_argument("--logdir", type=str, default="log")
    parser.add_argument("--render", type=float, default=0.1)
    parser.add_argument(
        "--win-rate",
        type=float,
        default=0.6,
        help="Target winning rate (optimal policy achieves ~0.7)",
    )
    parser.add_argument(
        "--watch",
        default=False,
        action="store_true",
        help="Skip training and watch pre-trained models play",
    )
    parser.add_argument(
        "--agent-id", type=int, default=2, help="The learned agent plays as agent {1 or 2}"
    )
    parser.add_argument(
        "--resume-path", type=str, default="", help="Path to pre-trained agent .pth file"
    )
    parser.add_argument(
        "--opponent-path", type=str, default="", help="Path to pre-trained opponent .pth file"
    )
    parser.add_argument(
        "--device", type=str, default="cuda" if torch.cuda.is_available() else "cpu"
    )
    return parser


def get_args() -> argparse.Namespace:
    parser = get_parser()
    return parser.parse_known_args()[0]
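
The help text for `--gamma` notes that smaller values favor earlier wins, and `--n-step` sets how many rewards are summed before bootstrapping. The sketch below works through both ideas using the standard textbook definitions. It is not Tianshou's internal code: `discounted_return` and `n_step_target` are hypothetical helper names, and `bootstrap_value` stands in for the value estimate that a target network would supply.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the remaining steps of an episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def n_step_target(rewards: list[float], bootstrap_value: float, gamma: float) -> float:
    """n-step TD target: n discounted rewards plus a discounted bootstrap value."""
    n = len(rewards)
    return discounted_return(rewards, gamma) + gamma**n * bootstrap_value


gamma = 0.9
quick_win = discounted_return([0, 1], gamma)       # win arrives after one intermediate step
slow_win = discounted_return([0, 0, 0, 1], gamma)  # win arrives after three intermediate steps
print(quick_win > slow_win)  # True

# With n = 3 and no intermediate reward, the target is gamma**3 times the bootstrap value.
print(n_step_target([0, 0, 0], bootstrap_value=1.0, gamma=gamma))
```

With `gamma = 0.9`, the quick win is worth 0.9 and the slow win 0.9 cubed, so the agent is pushed toward finishing games sooner.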

### Agent Setup

The `get_agents` function creates and configures our agents:

- **Action Model**: We use `MLPActor`, a multi-layer perceptron with ReLU activations
- **Learning Agent**: A `DQNPolicy` that selects actions based on both the action mask and Q-values
- **Opponent**: Either a `RandomPolicy` that randomly chooses legal actions, or a pre-trained `DQNPolicy` for self-play

Both agents are managed by `MultiAgentPolicyManager`, which:
- Calls the correct agent based on `agent_id` in the observation
- Dispatches data to each agent according to their `agent_id`
- Makes each agent perceive the environment as a single-agent problem
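
Selecting an action from both the mask and the Q-values can be sketched in plain Python as shown here. This is a minimal illustration of masked epsilon-greedy selection, not Tianshou's implementation, and `masked_epsilon_greedy` is a hypothetical name: illegal actions are excluded first, and the agent then either explores among the legal actions or takes the legal action with the highest Q-value.

```python
import random


def masked_epsilon_greedy(
    q_values: list[float], mask: list[bool], eps: float, rng: random.Random
) -> int:
    """Pick a legal action: random with probability eps, else the best legal Q-value."""
    legal_actions = [i for i, is_legal in enumerate(mask) if is_legal]
    if rng.random() < eps:
        return rng.choice(legal_actions)
    return max(legal_actions, key=lambda i: q_values[i])


q_values = [0.9, 0.1, 0.5, 0.7]
mask = [False, True, True, True]  # action 0 has the highest Q-value but is illegal
print(masked_epsilon_greedy(q_values, mask, eps=0.0, rng=random.Random(0)))  # 3
```

The `--eps-train` and `--eps-test` hyperparameters play the role of `eps` during training and evaluation, respectively.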

<div style="text-align: center; padding: 1rem;">
<img src="../_static/images/marl.png" style="height: 300px; padding-bottom: 1rem;"><br>
How MultiAgentPolicyManager coordinates agent policies
</div>

In [None]:
def get_env(render_mode=None):
    """Create a Tic-Tac-Toe environment."""
    return PettingZooEnv(tictactoe_v3.env(render_mode=render_mode))


def get_agents(
    args: argparse.Namespace = get_args(),
    agent_learn: BasePolicy | None = None,
    agent_opponent: BasePolicy | None = None,
    optim: torch.optim.Optimizer | None = None,
) -> tuple[BasePolicy, torch.optim.Optimizer, list]:
    """Create or load agents for training."""
    env = get_env()
    observation_space = (
        env.observation_space["observation"]
        if isinstance(env.observation_space, gym.spaces.Dict)
        else env.observation_space
    )
    args.state_shape = observation_space.shape or observation_space.n
    args.action_shape = env.action_space.shape or env.action_space.n

    if agent_learn is None:
        # Create the neural network model
        net = MLPActor(
            args.state_shape, args.action_shape, hidden_sizes=args.hidden_sizes, device=args.device
        ).to(args.device)

        if optim is None:
            optim = torch.optim.Adam(net.parameters(), lr=args.lr)

        # Create DQN policy for the learning agent
        agent_learn = DQNPolicy(
            model=net,
            optim=optim,
            gamma=args.gamma,
            action_space=env.action_space,
            estimate_space=args.n_step,
            target_update_freq=args.target_update_freq,
        )

        if args.resume_path:
            agent_learn.load_state_dict(torch.load(args.resume_path))

    if agent_opponent is None:
        if args.opponent_path:
            # Load a pre-trained opponent for self-play
            agent_opponent = deepcopy(agent_learn)
            agent_opponent.load_state_dict(torch.load(args.opponent_path))
        else:
            # Use a random opponent
            agent_opponent = RandomPolicy(action_space=env.action_space)

    # Arrange agents based on which player position the learning agent takes
    if args.agent_id == 1:
        agents = [agent_learn, agent_opponent]
    else:
        agents = [agent_opponent, agent_learn]

    policy = MultiAgentPolicyManager(agents, env)
    return policy, optim, env.agents

### Training Loop

The training procedure follows the standard Tianshou workflow, similar to single-agent DQN training:

In [None]:
def train_agent(
    args: argparse.Namespace = get_args(),
    agent_learn: BasePolicy | None = None,
    agent_opponent: BasePolicy | None = None,
    optim: torch.optim.Optimizer | None = None,
) -> tuple[dict, BasePolicy]:
    """Train the agent using DQN."""
    # ======== Environment Setup =========
    train_envs = DummyVectorEnv([get_env for _ in range(args.num_train_envs)])
    test_envs = DummyVectorEnv([get_env for _ in range(args.num_test_envs)])

    # Set random seeds for reproducibility
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)

    # ======== Agent Setup =========
    policy, optim, agents = get_agents(
        args, agent_learn=agent_learn, agent_opponent=agent_opponent, optim=optim
    )

    # ======== Collector Setup =========
    train_collector = Collector(
        policy,
        train_envs,
        VectorReplayBuffer(args.buffer_size, len(train_envs)),
        exploration_noise=True,
    )
    test_collector = Collector(policy, test_envs, exploration_noise=True)

    # Collect initial random samples
    train_collector.collect(n_step=args.batch_size * args.num_train_envs)

    # ======== Logging Setup =========
    log_path = os.path.join(args.logdir, "tic_tac_toe", "dqn")
    writer = SummaryWriter(log_path)
    writer.add_text("args", str(args))
    logger = TensorboardLogger(writer)

    # ======== Callback Functions =========
    def save_best_fn(policy):
        """Save the best performing policy."""
        model_save_path = getattr(
            args, "model_save_path", os.path.join(args.logdir, "tic_tac_toe", "dqn", "policy.pth")
        )
        torch.save(policy.policies[agents[args.agent_id - 1]].state_dict(), model_save_path)

    def stop_fn(mean_rewards):
        """Stop training when target win rate is achieved."""
        return mean_rewards >= args.win_rate

    def train_fn(epoch, env_step):
        """Set exploration rate for training."""
        policy.policies[agents[args.agent_id - 1]].set_eps(args.eps_train)

    def test_fn(epoch, env_step):
        """Set exploration rate for testing."""
        policy.policies[agents[args.agent_id - 1]].set_eps(args.eps_test)

    def reward_metric(rews):
        """Extract the reward for our learning agent."""
        return rews[:, args.agent_id - 1]

    # ======== Trainer =========
    result = OffpolicyTrainer(
        policy,
        train_collector,
        test_collector,
        args.epoch,
        args.epoch_num_steps,
        args.collection_step_num_env_steps,
        args.num_test_envs,
        args.batch_size,
        train_fn=train_fn,
        test_fn=test_fn,
        stop_fn=stop_fn,
        save_best_fn=save_best_fn,
        update_per_step=args.update_per_step,
        logger=logger,
        test_in_train=False,
        reward_metric=reward_metric,
    ).run()

    return result, policy.policies[agents[args.agent_id - 1]]

### Evaluation Function

This function allows us to watch a trained agent play:

In [None]:
def watch(
    args: argparse.Namespace = get_args(),
    agent_learn: BasePolicy | None = None,
    agent_opponent: BasePolicy | None = None,
) -> None:
    """Watch a pre-trained agent play."""
    env = get_env(render_mode="human")
    env = DummyVectorEnv([lambda: env])
    policy, optim, agents = get_agents(args, agent_learn=agent_learn, agent_opponent=agent_opponent)
    policy.eval()
    policy.policies[agents[args.agent_id - 1]].set_eps(args.eps_test)
    collector = Collector(policy, env, exploration_noise=True)
    result = collector.collect(n_episode=1, render=args.render)
    rews, lens = result["rews"], result["lens"]
    print(f"Final reward: {rews[:, args.agent_id - 1].mean()}, Episode length: {lens.mean()}")

### Running the Training

Now let's train the agent and watch it play!

In [None]:
# Train the agent
args = get_args()
result, agent = train_agent(args)

# Watch the trained agent play
watch(args, agent)

## Training Results

After training for less than a minute, you'll see the agent play against the random opponent. Here's an example game:

<details>
<summary>Example: Trained Agent vs Random Opponent</summary>

```
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  -  |  -  |  X
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  -  |  O  |  X
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  -  |  O  |  -
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  -  |  O  |  -
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  X  |  -
     |     |
```

```
     |     |
  O  |  O  |  -
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  X  |  -
     |     |
```

```
     |     |
  O  |  O  |  X
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  X  |  -
     |     |
```

```
     |     |
  O  |  O  |  X
_____|_____|_____
     |     |
  X  |  O  |  X
_____|_____|_____
     |     |
  -  |  X  |  O
     |     |
```

Final reward: 1.0, length: 8.0

</details>

Notice that our trained agent plays as player 2 (O) and wins! The agent has learned the game rules through trial and error, understanding that three consecutive O marks lead to victory.

## Command-Line Usage

You can also save the above code as a script (available at `test/pettingzoo/test_tic_tac_toe.py`) and run it from the command line:

```bash
# Train an agent
python test_tic_tac_toe.py
```

By default, the trained agent is saved to `log/tic_tac_toe/dqn/policy.pth`.

## Self-Play

You can make the trained agent play against itself:

```bash
python test_tic_tac_toe.py --watch \
    --resume-path log/tic_tac_toe/dqn/policy.pth \
    --opponent-path log/tic_tac_toe/dqn/policy.pth
```

Here's an example of self-play:

<details>
<summary>Example: Agent Playing Against Itself</summary>

```
     |     |
  -  |  -  |  -
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  -  |  O  |  -
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  -  |  -  |  -
     |     |
```

```
     |     |
  X  |  O  |  -
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  -  |  -  |  O
     |     |
```

```
     |     |
  X  |  O  |  O
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  -  |  -  |  O
     |     |
```

```
     |     |
  X  |  O  |  O
_____|_____|_____
     |     |
  -  |  X  |  -
_____|_____|_____
     |     |
  X  |  -  |  O
     |     |
```

```
     |     |
  X  |  O  |  O
_____|_____|_____
     |     |
  -  |  X  |  O
_____|_____|_____
     |     |
  X  |  -  |  O
     |     |
```

```
     |     |
  X  |  O  |  O
_____|_____|_____
     |     |
  -  |  X  |  O
_____|_____|_____
     |     |
  X  |  X  |  O
     |     |
```

```
     |     |
  X  |  O  |  O
_____|_____|_____
     |     |
  O  |  X  |  O
_____|_____|_____
     |     |
  X  |  X  |  O
     |     |
```

Final reward: 1.0, length: 8.0

</details>

While the trained agent plays well against a random opponent, it's still far from perfect play. The next step would be to implement self-play training, similar to AlphaZero, where the agent continuously improves by playing against increasingly stronger versions of itself.
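
The outer loop of such a self-play scheme might look roughly like the sketch below. It is a schematic under stated assumptions, not a tested training recipe: `self_play` is a hypothetical helper, the agent is represented by a plain dictionary, and `toy_train_against` is a stand-in for a real training run such as `train_agent` with the snapshot passed as the opponent.

```python
from copy import deepcopy
from typing import Callable


def self_play(
    initial_agent: dict,
    train_against: Callable[[dict, dict], dict],
    generations: int,
) -> tuple[dict, list[dict]]:
    """Repeatedly train the learner against a frozen snapshot of itself."""
    learner = initial_agent
    snapshots = []
    for _ in range(generations):
        opponent = deepcopy(learner)  # freeze the current learner as the opponent
        learner = train_against(learner, opponent)
        snapshots.append(opponent)
    return learner, snapshots


def toy_train_against(learner: dict, opponent: dict) -> dict:
    """Stand-in for a real training run; it only counts completed generations."""
    learner["generation"] = opponent["generation"] + 1
    return learner


final_agent, snapshots = self_play({"generation": 0}, toy_train_against, generations=3)
print(final_agent["generation"])             # 3
print([s["generation"] for s in snapshots])  # [0, 1, 2]
```

Copying the learner before each round keeps the opponent fixed while the learner's parameters change, which mirrors how `get_agents` uses `deepcopy` when loading a pre-trained opponent.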

## Summary

In this tutorial, we demonstrated how to use Tianshou for training a single agent in a multi-agent reinforcement learning setting. Key takeaways:

1. **MARL Paradigms**: Tianshou supports simultaneous, cyclic, and conditional move scenarios
2. **Abstraction**: Multi-agent problems can be converted to single-agent RL through clever state augmentation
3. **PettingZoo Integration**: Seamless compatibility with PettingZoo environments via `PettingZooEnv`
4. **Policy Management**: `MultiAgentPolicyManager` handles agent coordination and data distribution
5. **Flexible Framework**: Easy to extend from single-agent training to more complex multi-agent scenarios

Tianshou provides a flexible and intuitive framework for reinforcement learning. Experiment with different architectures, training regimes, and opponent strategies to build even more capable agents!