<!-- Centered layout with a university logo -->
<div align="center">

  <!-- University Logo -->
  <img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width="180" height="180" style="margin-bottom: 10px;">
  
  <!-- Assignment Title -->
  <h1 style="color:#0F5298; font-size: 40px; font-weight: bold; margin-bottom: 5px;">Deep Reinforcement Learning</h1>
  <h2 style="color:#0F5298; font-size: 32px; font-weight: normal; margin-top: 0px;">Assignment 10 - Multi-Agent Reinforcement Learning</h2>

  <!-- Department and University -->
  <h3 style="color:#696880; font-size: 24px; margin-top: 20px;">Computer Engineering Department</h3>
  <h3 style="color:#696880; font-size: 22px; margin-top: -5px;">Sharif University of Technology</h3>

  <!-- Semester -->
  <h3 style="color:#696880; font-size: 22px; margin-top: 20px;">Spring 2025</h3>

  <!-- Authors -->
  <h3 style="color:green; font-size: 22px; margin-top: 20px;">Full name: [FULL_NAME]</h3>
  <h3 style="color:green; font-size: 22px; margin-top: 20px;">Student ID: [STUDENT_ID]</h3>

  <!-- Horizontal Line for Separation -->
  <hr style="border: 1px solid #0F5298; width: 80%; margin-top: 30px;">

</div>


## Overview
This notebook explores Multi-Agent Reinforcement Learning (MARL) across several algorithms and environments.  
We implement and compare the following approaches:
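- **Independent Q-Learning** (IQL): Each agent learns its own Q-function and treats the other agents as part of the environment
- **QMIX**: Value decomposition for cooperative settings, mixing per-agent values into a joint value
- **MADDPG**: Multi-Agent Deep Deterministic Policy Gradient, an actor-critic method for mixed cooperative-competitive environments
- **Communication Protocols**: CommNet and TarMAC
- **Self-Play**: Training against past versions of the agent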

We start with classic game-theory environments such as the Prisoner's Dilemma and coordination games, then move on to more complex multi-agent scenarios.

Follow the instructions carefully and complete the sections marked with **TODO**.


## Setup and Environment

In the upcoming cells, we import necessary libraries, set up utility functions for reproducibility and plotting, and define the basic components of our multi-agent experiments.


In [None]:
# %% [code]
import numpy as np
import random
import time
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
from collections import deque, defaultdict
import itertools
from dataclasses import dataclass
from copy import deepcopy
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.dpi'] = 100
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)


In [None]:
# %% [code]
def plot_logs(df, x_key, y_key, legend_key, **kwargs):
    """Plot learning curves for multi-agent experiments"""
    num = len(df[legend_key].unique())
    pal = sns.color_palette("hls", num)
    if 'palette' not in kwargs:
        kwargs['palette'] = pal
    ax = sns.lineplot(x=x_key, y=y_key, data=df, hue=legend_key, **kwargs)
    return ax

def plot_game_matrix(matrix, title="Game Matrix", figsize=(8, 6)):
    """Plot a two-player game matrix of shape (rows, cols, 2)"""
    matrix = np.asarray(matrix)
    fig, ax = plt.subplots(figsize=figsize)
    
    # imshow cannot draw a (rows, cols, 2) array, so color each cell
    # by the joint payoff (sum of both agents' payoffs)
    im = ax.imshow(matrix.sum(axis=-1), cmap='RdYlBu', aspect='auto')
    
    # Annotate each cell with "agent 1 payoff, agent 2 payoff"
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            ax.text(j, i, f'{matrix[i, j][0]:.1f}, {matrix[i, j][1]:.1f}',
                    ha="center", va="center", color="black", fontsize=12)
    
    # Set labels (one tick per action)
    ax.set_xticks(range(matrix.shape[1]))
    ax.set_yticks(range(matrix.shape[0]))
    ax.set_xticklabels([f'Action {j}' for j in range(matrix.shape[1])])
    ax.set_yticklabels([f'Action {i}' for i in range(matrix.shape[0])])
    ax.set_xlabel('Agent 2 Action')
    ax.set_ylabel('Agent 1 Action')
    ax.set_title(title)
    
    # Add colorbar for the joint payoff shown by the cell colors
    cbar = fig.colorbar(im, ax=ax)
    cbar.set_label('Joint Payoff (sum over agents)')
    
    plt.tight_layout()
    plt.show()

def set_seed(s):
    """Set random seeds for reproducibility"""
    np.random.seed(s)
    random.seed(s)
    torch.manual_seed(s)

set_seed(42)


## Game Theory Foundations

We start with classic game theory environments to understand multi-agent interactions:

### Prisoner's Dilemma
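- **Mutual Cooperation (C, C)**: Both agents cooperate and each receives a moderate reward (3, 3)
- **Unilateral Defection (D, C) or (C, D)**: One agent defects while the other cooperates; the defector receives the highest reward (5) and the cooperator the lowest (0)
- **Mutual Defection (D, D)**: Both agents defect and each receives a low reward (1, 1)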

### Coordination Game  
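- **Pure Coordination**: Both agents must choose the same action to receive any reward, and one matching outcome pays more than the other
- **Battle of the Sexes**: The agents prefer different outcomes, but coordinating on either one is still better for both than miscoordinating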
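
To make the cost of miscoordination concrete, the following is a minimal sketch that computes each agent's expected payoff when both agents act uniformly at random. It assumes the payoff values defined in the next code cell; the helper name `expected_payoffs` is hypothetical and is not part of the assignment code.

```python
import numpy as np

# Payoff tensors indexed as [agent1_action, agent2_action] -> (reward_1, reward_2).
# Values copied from the matrices used in this notebook.
coordination = np.array([[[2, 2], [0, 0]],
                         [[0, 0], [1, 1]]], dtype=float)
battle = np.array([[[2, 1], [0, 0]],
                   [[0, 0], [1, 2]]], dtype=float)

def expected_payoffs(payoffs, p1, p2):
    """Expected reward per agent when actions are drawn independently from p1 and p2."""
    joint = np.outer(p1, p2)  # probability of every joint action
    # Weight each joint action's payoff pair by its probability and sum.
    return np.tensordot(joint, payoffs, axes=([0, 1], [0, 1]))

uniform = np.array([0.5, 0.5])
always_a = np.array([1.0, 0.0])

print("Coordination, random play:", expected_payoffs(coordination, uniform, uniform))
print("Coordination, both play A:", expected_payoffs(coordination, always_a, always_a))
print("Battle of the Sexes, random play:", expected_payoffs(battle, uniform, uniform))
```

Under random play, half of the joint actions are mismatches that pay nothing, so each agent's expected payoff (0.75) falls below even the worse coordinated outcome (1).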


In [None]:
# %% [code]
# Define game matrices
PRISONERS_DILEMMA = np.array([
    [[3, 3], [0, 5]],  # Agent 1: C, Agent 2: C,D
    [[5, 0], [1, 1]]   # Agent 1: D, Agent 2: C,D
])

COORDINATION_GAME = np.array([
    [[2, 2], [0, 0]],   # Agent 1: A, Agent 2: A,B
    [[0, 0], [1, 1]]    # Agent 1: B, Agent 2: A,B
])

BATTLE_OF_SEXES = np.array([
    [[2, 1], [0, 0]],   # Agent 1: A, Agent 2: A,B
    [[0, 0], [1, 2]]    # Agent 1: B, Agent 2: A,B
])

print("Prisoner's Dilemma Matrix:")
print("Agent 1\\Agent 2 | Cooperate | Defect")
print("Cooperate        |   3, 3    |  0, 5")
print("Defect           |   5, 0    |  1, 1")
print("\nNash Equilibrium: (Defect, Defect)")
print("Pareto Optimal: (Cooperate, Cooperate)")

plot_game_matrix(PRISONERS_DILEMMA, "Prisoner's Dilemma")


**Q:** Why is the Nash Equilibrium (Defect, Defect) suboptimal in the Prisoner's Dilemma?

**A:** (Defect, Defect) is suboptimal because both agents would be strictly better off under mutual cooperation (3, 3 versus 1, 1), so the equilibrium is Pareto-dominated. It is nevertheless the only equilibrium because Defect is a dominant strategy: whatever the other agent does, defecting pays more (5 > 3 if the other cooperates, 1 > 0 if the other defects). Neither agent can gain by unilaterally switching to Cooperate, so mutual cooperation is not stable without a binding commitment. This illustrates the fundamental tension between individual rationality and collective welfare in mixed-motive settings.
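
The dominance argument can be checked numerically. The sketch below is a minimal illustration, assuming the payoff values of the Prisoner's Dilemma matrix defined above; `best_responses` is a hypothetical helper introduced here, not part of the assignment code.

```python
import numpy as np

# Entry [a1, a2] holds (reward_1, reward_2); action 0 = Cooperate, 1 = Defect.
payoffs = np.array([
    [[3, 3], [0, 5]],
    [[5, 0], [1, 1]],
])

def best_responses(payoffs):
    """Return each agent's best response to every action of the other agent."""
    # Agent 1 picks the row that maximizes its payoff for each column of agent 2.
    br_agent1 = payoffs[:, :, 0].argmax(axis=0)
    # Agent 2 picks the column that maximizes its payoff for each row of agent 1.
    br_agent2 = payoffs[:, :, 1].argmax(axis=1)
    return br_agent1, br_agent2

br1, br2 = best_responses(payoffs)
print("Agent 1 best response to opponent playing (C, D):", br1)
print("Agent 2 best response to opponent playing (C, D):", br2)
print("Joint payoff at (C, C):", payoffs[0, 0].sum(), "vs (D, D):", payoffs[1, 1].sum())
```

Both best-response vectors contain only action 1 (Defect), which is what makes Defect dominant, while the joint payoff at (C, C) is larger than at (D, D).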


In [None]:
# %% [code]
plot_game_matrix(COORDINATION_GAME, "Coordination Game")
print("\nCoordination Game:")
print("Agent 1\\Agent 2 | Action A | Action B")
print("Action A         |   2, 2   |  0, 0")
print("Action B         |   0, 0   |  1, 1")
print("\nPure-Strategy Nash Equilibria: (A,A) and (B,B)")
print("Pareto Optimal: (A,A)")

plot_game_matrix(BATTLE_OF_SEXES, "Battle of the Sexes")
print("\nBattle of the Sexes:")
print("Agent 1\\Agent 2 | Action A | Action B")
print("Action A         |   2, 1   |  0, 0")
print("Action B         |   0, 0   |  1, 2")
print("\nPure-Strategy Nash Equilibria: (A,A) and (B,B)")
print("Agent 1 prefers (A,A), Agent 2 prefers (B,B)")


## Multi-Agent Environment

We'll create a flexible environment that wraps any two-player payoff matrix and can enumerate its pure-strategy Nash equilibria. Games with more than two agents are not implemented yet.


In [None]:
# %% [code]
@dataclass
class MultiAgentGame:
    """Multi-agent game environment"""
    payoff_matrix: np.ndarray
    num_agents: int = 2
    num_actions: int = 2
    
    def step(self, actions):
        """Execute actions and return rewards"""
        if self.num_agents == 2:
            return self.payoff_matrix[actions[0], actions[1]]
        else:
            # For more than 2 agents, we'll use a different structure
            raise NotImplementedError("Only 2-agent games implemented")
    
    def get_optimal_strategies(self):
        """Find all pure-strategy Nash equilibria by checking unilateral deviations"""
        equilibria = []
        
        # Check all pure strategy combinations
        for a1 in range(self.num_actions):
            for a2 in range(self.num_actions):
                is_equilibrium = True
                
                # Check if agent 1 wants to deviate
                for a1_dev in range(self.num_actions):
                    if a1_dev != a1:
                        if self.payoff_matrix[a1_dev, a2][0] > self.payoff_matrix[a1, a2][0]:
                            is_equilibrium = False
                            break
                
                # Check if agent 2 wants to deviate
                if is_equilibrium:
                    for a2_dev in range(self.num_actions):
                        if a2_dev != a2:
                            if self.payoff_matrix[a1, a2_dev][1] > self.payoff_matrix[a1, a2][1]:
                                is_equilibrium = False
                                break
                
                if is_equilibrium:
                    equilibria.append((a1, a2))
        
        return equilibria

# Test the environment
pd_env = MultiAgentGame(PRISONERS_DILEMMA)
print("Prisoner's Dilemma Nash Equilibria:", pd_env.get_optimal_strategies())

coord_env = MultiAgentGame(COORDINATION_GAME)
print("Coordination Game Nash Equilibria:", coord_env.get_optimal_strategies())


## Independent Q-Learning (IQL)

The simplest approach to multi-agent RL: each agent learns independently, treating other agents as part of the environment.

**Key Characteristics:**
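- Each agent maintains its own Q-table
- No communication between agents
- Non-stationarity: from each agent's perspective, the environment changes as the other agents update their policies
- No convergence guarantees in general, because non-stationarity violates the assumptions behind single-agent Q-learning convergence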
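
The characteristics above can be sketched in a few lines. The following is a minimal illustration, not the assignment's reference implementation: it assumes a stateless repeated matrix game in which each round is a one-step episode (so the update has no bootstrapped next-state term), epsilon-greedy exploration, and a hypothetical function name `run_iql` with illustrative hyperparameters.

```python
import numpy as np

def run_iql(payoffs, steps=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Stateless independent Q-learning on a repeated two-player matrix game."""
    rng = np.random.default_rng(seed)
    num_agents, num_actions = 2, payoffs.shape[0]
    # One Q-table per agent; with a single state each table is just a vector.
    q_tables = np.zeros((num_agents, num_actions))

    for _ in range(steps):
        # Each agent selects its action from its own Q-values only (no communication).
        actions = []
        for q in q_tables:
            if rng.random() < epsilon:
                actions.append(int(rng.integers(num_actions)))  # explore
            else:
                actions.append(int(np.argmax(q)))               # exploit
        rewards = payoffs[actions[0], actions[1]]

        # Each agent updates only the entry for its own action; the other agent's
        # changing behavior shows up as a non-stationary reward signal.
        for i, (a, r) in enumerate(zip(actions, rewards)):
            q_tables[i, a] += alpha * (r - q_tables[i, a])
    return q_tables

# Payoffs copied from the Prisoner's Dilemma matrix (0 = Cooperate, 1 = Defect).
prisoners_dilemma = np.array([[[3, 3], [0, 5]],
                              [[5, 0], [1, 1]]], dtype=float)
q = run_iql(prisoners_dilemma)
print("Q-tables:\n", q)
print("Greedy actions:", q.argmax(axis=1))
```

With these settings both agents typically end up preferring Defect, since defecting yields a higher reward against any opponent behavior. The exact Q-values depend on the seed and hyperparameters.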
