# **Homework 13: Multi-Agent Reinforcement Learning**

#### **Course:** Deep Reinforcement Learning

---
## Problem 1: Nash Equilibrium (Theory)

A Nash Equilibrium (NE) is a strategy profile in which no player can improve their expected payoff by unilaterally changing their strategy. For our games, we'll focus on finding the mixed-strategy NE, where players choose their actions probabilistically.

### 1.1 Standard Rock-Scissors-Paper

Given the standard RSP payoff matrix:

| Player 1 (rows) / Player 2 (columns) | Rock | Scissors | Paper |
| :--- | :--: | :---: | :---: |
| **Rock** | 0, 0 | 1, -1 | -1, 1 |
| **Scissors**| -1, 1 | 0, 0 | 1, -1 |
| **Paper** | 1, -1 | -1, 1 | 0, 0 |


**Your Task:** Analytically derive the mixed-strategy Nash Equilibrium for this game. Show the steps for setting up the indifference equations for Player 1 and solving for Player 2's equilibrium strategy probabilities $(q_R, q_S, q_P)$.

**Solution:**

For a mixed-strategy Nash Equilibrium, each player must be indifferent between all their pure strategies. Let Player 1 play Rock, Scissors, Paper with probabilities $(p_R, p_S, p_P)$ and Player 2 play with probabilities $(q_R, q_S, q_P)$.

**Player 1's indifference conditions:**
- Expected payoff from Rock = Expected payoff from Scissors = Expected payoff from Paper

Player 1's expected payoff from Rock: $0 \cdot q_R + 1 \cdot q_S + (-1) \cdot q_P = q_S - q_P$

Player 1's expected payoff from Scissors: $(-1) \cdot q_R + 0 \cdot q_S + 1 \cdot q_P = -q_R + q_P$

Player 1's expected payoff from Paper: $1 \cdot q_R + (-1) \cdot q_S + 0 \cdot q_P = q_R - q_S$

Setting them equal:
- $q_S - q_P = -q_R + q_P$ → $q_R + q_S = 2q_P$
- $q_S - q_P = q_R - q_S$ → $2q_S = q_R + q_P$

**Player 2's indifference conditions:**

Player 2's payoffs are the second entry in each cell of the table, i.e. the negation of Player 1's payoffs.

Player 2's expected payoff from Rock: $0 \cdot p_R + 1 \cdot p_S + (-1) \cdot p_P = p_S - p_P$

Player 2's expected payoff from Scissors: $(-1) \cdot p_R + 0 \cdot p_S + 1 \cdot p_P = -p_R + p_P$

Player 2's expected payoff from Paper: $1 \cdot p_R + (-1) \cdot p_S + 0 \cdot p_P = p_R - p_S$

Setting them equal:
- $p_S - p_P = -p_R + p_P$ → $p_R + p_S = 2p_P$
- $p_S - p_P = p_R - p_S$ → $p_R + p_P = 2p_S$

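**Solving the system:**
Substituting $q_R + q_S = 2q_P$ into the constraint $q_R + q_S + q_P = 1$ gives $3q_P = 1$, so $q_P = \frac{1}{3}$. Likewise, substituting $q_R + q_P = 2q_S$ gives $3q_S = 1$, so $q_S = \frac{1}{3}$ and therefore $q_R = \frac{1}{3}$. Player 1's equations have the same form, so the same steps with $p_R + p_S + p_P = 1$ apply.

The unique solution is: $p_R = p_S = p_P = \frac{1}{3}$ and $q_R = q_S = q_P = \frac{1}{3}$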

**Nash Equilibrium:** Both players play each action with probability $\frac{1}{3}$.
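
The indifference conditions above can be checked numerically. The following is a minimal sketch, assuming the payoff matrix is stored with Player 1's actions as rows and Player 2's as columns in the order Rock, Scissors, Paper; the variable names are illustrative only.

```python
import numpy as np

# Player 1's payoff matrix for standard RSP (rows: Player 1, columns: Player 2).
# Player 2's payoffs are the negation, since the game is zero-sum.
A = np.array([[0, 1, -1],
              [-1, 0, 1],
              [1, -1, 0]])

p = np.full(3, 1 / 3)  # Player 1's candidate equilibrium strategy
q = np.full(3, 1 / 3)  # Player 2's candidate equilibrium strategy

p1_payoffs = A @ q       # Player 1's expected payoff for each pure action
p2_payoffs = -(A.T @ p)  # Player 2's expected payoff for each pure action

# At a fully mixed NE, every pure action of a player earns the same payoff.
print("Player 1:", p1_payoffs)
print("Player 2:", p2_payoffs)
```

If all entries within each vector are equal, neither player can gain by shifting probability between actions, which is exactly the indifference requirement.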

### 1.2 Modified Rock-Scissors-Paper

Now, consider the modified RSP game where the stakes are higher:

| Player 1 (rows) / Player 2 (columns) | Rock | Scissors | Paper |
| :--- | :--: | :---: | :---: |
| **Rock** | 0, 0 | 1, -1 | -2, 2 |
| **Scissors**| -1, 1 | 0, 0 | 3, -3 |
| **Paper** | 2, -2 | -3, 3 | 0, 0 |


**Your Task:** As in the previous problem, derive the mixed-strategy Nash Equilibrium for this modified game.

**Solution:**

For the modified RSP game, let Player 1 play Rock, Scissors, Paper with probabilities $(p_R, p_S, p_P)$ and Player 2 play with probabilities $(q_R, q_S, q_P)$.

**Player 1's indifference conditions:**
Player 1's expected payoff from Rock: $0 \cdot q_R + 1 \cdot q_S + (-2) \cdot q_P = q_S - 2q_P$

Player 1's expected payoff from Scissors: $(-1) \cdot q_R + 0 \cdot q_S + 3 \cdot q_P = -q_R + 3q_P$

Player 1's expected payoff from Paper: $2 \cdot q_R + (-3) \cdot q_S + 0 \cdot q_P = 2q_R - 3q_S$

Setting them equal:
- $q_S - 2q_P = -q_R + 3q_P$ → $q_R + q_S = 5q_P$
- $q_S - 2q_P = 2q_R - 3q_S$ → $4q_S = 2q_R + 2q_P$ → $2q_S = q_R + q_P$

**Player 2's indifference conditions:**

Player 2's payoffs are the second entry in each cell of the table, i.e. the negation of Player 1's payoffs.

Player 2's expected payoff from Rock: $0 \cdot p_R + 1 \cdot p_S + (-2) \cdot p_P = p_S - 2p_P$

Player 2's expected payoff from Scissors: $(-1) \cdot p_R + 0 \cdot p_S + 3 \cdot p_P = -p_R + 3p_P$

Player 2's expected payoff from Paper: $2 \cdot p_R + (-3) \cdot p_S + 0 \cdot p_P = 2p_R - 3p_S$

Setting them equal:
- $p_S - 2p_P = -p_R + 3p_P$ → $p_R + p_S = 5p_P$
- $p_S - 2p_P = 2p_R - 3p_S$ → $4p_S = 2p_R + 2p_P$ → $p_R + p_P = 2p_S$

**Solving the system:**
From $q_R + q_S + q_P = 1$ and $q_R + q_S = 5q_P$:
$5q_P + q_P = 1$ → $q_P = \frac{1}{6}$

From $2q_S = q_R + q_P$ and $q_R + q_S = 5q_P = \frac{5}{6}$:
$2q_S = q_R + \frac{1}{6}$ and $q_R = \frac{5}{6} - q_S$

Substituting: $2q_S = \frac{5}{6} - q_S + \frac{1}{6} = 1 - q_S$
$3q_S = 1$ → $q_S = \frac{1}{3}$

Therefore: $q_R = \frac{5}{6} - \frac{1}{3} = \frac{1}{2}$

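Player 1's equations have the same form ($p_R + p_S = 5p_P$, $p_R + p_P = 2p_S$, and $p_R + p_S + p_P = 1$), so the same steps give: $p_R = \frac{1}{2}$, $p_S = \frac{1}{3}$, $p_P = \frac{1}{6}$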

**Nash Equilibrium:** 
- Player 1: $(\frac{1}{2}, \frac{1}{3}, \frac{1}{6})$ for (Rock, Scissors, Paper)
- Player 2: $(\frac{1}{2}, \frac{1}{3}, \frac{1}{6})$ for (Rock, Scissors, Paper)
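
The hand derivation above can be cross-checked by solving the indifference equations as a linear system. This is only a sketch: `solve_indifference` is a hypothetical helper, and it assumes a square payoff matrix with a fully mixed equilibrium (every action played with positive probability).

```python
import numpy as np

def solve_indifference(A):
    """Find the opponent mix q that makes the row player indifferent.

    Hypothetical helper: assumes a square payoff matrix whose equilibrium
    is fully mixed, so every row of A must yield the same expected payoff.
    """
    n = A.shape[0]
    # First n-1 rows: (A[0] - A[k]) @ q = 0 for k = 1..n-1 (equal payoffs).
    # Last row: the probabilities sum to one.
    M = np.vstack([A[0] - A[1:], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

A_mod = np.array([[0, 1, -2],
                  [-1, 0, 3],
                  [2, -3, 0]])

# Player 2's mix makes Player 1 indifferent between the rows of A_mod.
q = solve_indifference(A_mod)
# Player 1's mix makes Player 2 indifferent; seen from Player 2, the payoff
# matrix is (-A_mod) transposed so that Player 2's actions index the rows.
p = solve_indifference(-A_mod.T)

print("q =", q)
print("p =", p)
```

Both vectors can then be compared against the fractions derived by hand.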

---
## Problem 2: Learning by Observation - Fictitious Play (Implementation)

Fictitious Play is an intuitive learning algorithm where each agent models its opponent as playing a stationary strategy defined by the historical frequency of their past actions. The agent then plays a **best response** to this belief.

### 2.1 Implementation

**Your Task:** Implement the `simulate_fictitious_play` function below. It should take the payoff matrices for both players and the number of iterations as input. At each step, each player should choose the action that maximizes their expected payoff given the history of the opponent's plays.

**Algorithm:** At each time step $t > 0$, Player $i$ forms a belief that their opponent ($-i$) will play each action $a_{-i}$ with a probability equal to its historical frequency. The agent then chooses an action $a_{i,t}^*$ that is a best response to this belief.

Let $C_{t-1}(a_{-i})$ be the number of times opponent $-i$ has played action $a_{-i}$ in steps $0, \dots, t-1$, so the counts sum to $t$. Player $i$'s best response is:
$$a_{i,t}^* = \arg\max_{a_i \in A_i} \sum_{a_{-i} \in A_{-i}} u_i(a_i, a_{-i}) \cdot \frac{C_{t-1}(a_{-i})}{t}$$

**Note on Tie-Breaking:** If multiple actions yield the same maximal expected payoff, your agent should choose one of these best responses uniformly at random.
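
A single best-response step with random tie-breaking might look like the sketch below. The helper name `best_response` and the use of `np.isclose` to detect ties are assumptions made for illustration, not part of the assignment's required interface.

```python
import numpy as np

def best_response(payoff_matrix, opponent_counts, rng):
    """Return a best response to the opponent's empirical action frequencies.

    payoff_matrix[a, b] is this player's payoff for playing a against b.
    Ties are broken uniformly at random.
    """
    belief = opponent_counts / opponent_counts.sum()
    expected = payoff_matrix @ belief
    # np.isclose guards against floating-point noise when comparing payoffs.
    ties = np.flatnonzero(np.isclose(expected, expected.max()))
    return rng.choice(ties)

rng = np.random.default_rng(0)
A_std = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])

# The opponent has played Rock three times and Scissors once, so the
# expected payoffs are (0.25, -0.75, 0.5) and Paper is the unique best response.
action = best_response(A_std, np.array([3.0, 1.0, 0.0]), rng)

# With a uniform history every action ties, so any action may be returned.
tied_action = best_response(A_std, np.array([1.0, 1.0, 1.0]), rng)
```

For Player 2, the same helper can be called with the transposed payoff matrix so that Player 2's actions index the rows.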

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Implementation of simulate_fictitious_play function
def simulate_fictitious_play(A, B, iterations):
    """
    Simulates Fictitious Play for two players in a normal-form game.

    Args:
        A (np.ndarray): Payoff matrix for Player 1.
        B (np.ndarray): Payoff matrix for Player 2.
        iterations (int): The number of rounds to play.

    Returns:
        tuple: A tuple containing:
            - p1_freq_history (np.ndarray): History of Player 1's action frequencies.
            - p2_freq_history (np.ndarray): History of Player 2's action frequencies.
    """
    num_actions = A.shape[0]
    
    # Initialize action counts
    p1_counts = np.zeros(num_actions)
    p2_counts = np.zeros(num_actions)
    
    # Initialize history lists
    p1_freq_history = []
    p2_freq_history = []
    
    # Loop for the specified number of iterations
    for t in range(iterations):
        if t == 0:
            # On iteration 0, use a tie-breaking rule (play action 0)
            p1_action = 0
            p2_action = 0
        else:
            # Calculate best response to opponent's historical frequencies
            # Player 1's best response to Player 2's frequencies
            p2_freq = p2_counts / t
            p1_expected_payoffs = A @ p2_freq
            p1_best_actions = np.where(p1_expected_payoffs == np.max(p1_expected_payoffs))[0]
            p1_action = np.random.choice(p1_best_actions)
            
            # Player 2's best response to Player 1's frequencies
            p1_freq = p1_counts / t
            p2_expected_payoffs = B.T @ p1_freq
            p2_best_actions = np.where(p2_expected_payoffs == np.max(p2_expected_payoffs))[0]
            p2_action = np.random.choice(p2_best_actions)
        
        # Update action counts
        p1_counts[p1_action] += 1
        p2_counts[p2_action] += 1
        
        # Record the current action frequencies (every iteration after the first)
        if t > 0:
            p1_freq_history.append(p1_counts / (t + 1))
            p2_freq_history.append(p2_counts / (t + 1))
    
    return np.array(p1_freq_history), np.array(p2_freq_history)

# --- Payoff Matrices ---
A_std = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
A_mod = np.array([[0, 1, -2], [-1, 0, 3], [2, -3, 0]])


In [None]:
# Problem 2.2 Analysis
# Run simulations for both standard and modified RSP games

# Set random seed for reproducibility
np.random.seed(42)

# Run simulations
print("Running Fictitious Play simulation for standard RSP game...")
p1_freq_std, p2_freq_std = simulate_fictitious_play(A_std, -A_std, 1000000)

print("Running Fictitious Play simulation for modified RSP game...")
p1_freq_mod, p2_freq_mod = simulate_fictitious_play(A_mod, -A_mod, 1000000)

# Create plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Standard RSP Game
ax1.plot(p1_freq_std[:, 0], label='Rock', alpha=0.8)
ax1.plot(p1_freq_std[:, 1], label='Scissors', alpha=0.8)
ax1.plot(p1_freq_std[:, 2], label='Paper', alpha=0.8)
ax1.axhline(y=1/3, color='red', linestyle='--', alpha=0.7, label='NE (1/3)')
ax1.set_title('Standard RSP - Player 1 Action Frequencies')
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Action Frequency')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Modified RSP Game
ax2.plot(p1_freq_mod[:, 0], label='Rock', alpha=0.8)
ax2.plot(p1_freq_mod[:, 1], label='Scissors', alpha=0.8)
ax2.plot(p1_freq_mod[:, 2], label='Paper', alpha=0.8)
ax2.axhline(y=1/2, color='red', linestyle='--', alpha=0.7, label='NE Rock (1/2)')
ax2.axhline(y=1/3, color='orange', linestyle='--', alpha=0.7, label='NE Scissors (1/3)')
ax2.axhline(y=1/6, color='green', linestyle='--', alpha=0.7, label='NE Paper (1/6)')
ax2.set_title('Modified RSP - Player 1 Action Frequencies')
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Action Frequency')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Analysis
print("\n=== ANALYSIS ===")
print("Standard RSP Game:")
print(f"Final frequencies - Rock: {p1_freq_std[-1, 0]:.4f}, Scissors: {p1_freq_std[-1, 1]:.4f}, Paper: {p1_freq_std[-1, 2]:.4f}")
print("Expected NE: (1/3, 1/3, 1/3)")
print(f"Convergence to NE: {np.allclose(p1_freq_std[-1], [1/3, 1/3, 1/3], atol=0.01)}")

print("\nModified RSP Game:")
print(f"Final frequencies - Rock: {p1_freq_mod[-1, 0]:.4f}, Scissors: {p1_freq_mod[-1, 1]:.4f}, Paper: {p1_freq_mod[-1, 2]:.4f}")
print("Expected NE: (1/2, 1/3, 1/6)")
print(f"Convergence to NE: {np.allclose(p1_freq_mod[-1], [1/2, 1/3, 1/6], atol=0.01)}")

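print("\n=== CONCLUSION ===")
print("If both checks above print True, the empirical action frequencies have converged")
print("to the Nash Equilibrium in both games. This is the expected behavior: in two-player")
print("zero-sum games, the empirical frequencies of Fictitious Play converge to a NE")
print("(Robinson, 1951). Fictitious Play itself is not a no-regret algorithm in general.")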


In [None]:
# Implementation of simulate_epsilon_greedy_fp function
def simulate_epsilon_greedy_fp(A, B, iterations, epsilon):
    """
    Simulates epsilon-greedy Fictitious Play.
    
    Args:
        A (np.ndarray): Payoff matrix for Player 1.
        B (np.ndarray): Payoff matrix for Player 2.
        iterations (int): The number of rounds to play.
        epsilon (float): Probability of exploration (random action).
    
    Returns:
        tuple: A tuple containing:
            - p1_freq_history (np.ndarray): History of Player 1's action frequencies.
            - p2_freq_history (np.ndarray): History of Player 2's action frequencies.
    """
    num_actions = A.shape[0]
    
    # Initialize action counts
    p1_counts = np.zeros(num_actions)
    p2_counts = np.zeros(num_actions)
    
    # Initialize history lists
    p1_freq_history = []
    p2_freq_history = []
    
    # Loop for the specified number of iterations
    for t in range(iterations):
        if t == 0:
            # On iteration 0, use a tie-breaking rule (play action 0)
            p1_action = 0
            p2_action = 0
        else:
            # Calculate best response to opponent's historical frequencies
            # Player 1's best response to Player 2's frequencies
            p2_freq = p2_counts / t
            p1_expected_payoffs = A @ p2_freq
            p1_best_actions = np.where(p1_expected_payoffs == np.max(p1_expected_payoffs))[0]
            p1_best_action = np.random.choice(p1_best_actions)
            
            # Player 2's best response to Player 1's frequencies
            p1_freq = p1_counts / t
            p2_expected_payoffs = B.T @ p1_freq
            p2_best_actions = np.where(p2_expected_payoffs == np.max(p2_expected_payoffs))[0]
            p2_best_action = np.random.choice(p2_best_actions)
            
            # Epsilon-greedy action selection (each player explores independently)
            if np.random.random() < epsilon:
                # Explore: choose a uniformly random action
                p1_action = np.random.randint(num_actions)
            else:
                # Exploit: play the best response
                p1_action = p1_best_action

            if np.random.random() < epsilon:
                p2_action = np.random.randint(num_actions)
            else:
                p2_action = p2_best_action
        
        # Update action counts
        p1_counts[p1_action] += 1
        p2_counts[p2_action] += 1
        
        # Record the current action frequencies (every iteration after the first)
        if t > 0:
            p1_freq_history.append(p1_counts / (t + 1))
            p2_freq_history.append(p2_counts / (t + 1))
    
    return np.array(p1_freq_history), np.array(p2_freq_history)


In [None]:
# Problem 3.2 Analysis
# Run epsilon-greedy fictitious play with different epsilon values

# Set random seed for reproducibility
np.random.seed(42)

# Different epsilon values to test
epsilon_values = [0.01, 0.1, 0.3]
colors = ['blue', 'green', 'red']

# Create plots
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for i, epsilon in enumerate(epsilon_values):
    print(f"Running epsilon-greedy FP with epsilon = {epsilon}...")
    
    # Run simulation
    p1_freq, p2_freq = simulate_epsilon_greedy_fp(A_mod, -A_mod, 1000000, epsilon)
    
    # Plot results
    ax = axes[i]
    ax.plot(p1_freq[:, 0], label='Rock', alpha=0.8, color='blue')
    ax.plot(p1_freq[:, 1], label='Scissors', alpha=0.8, color='green')
    ax.plot(p1_freq[:, 2], label='Paper', alpha=0.8, color='red')
    
    # Add Nash Equilibrium lines
    ax.axhline(y=1/2, color='blue', linestyle='--', alpha=0.7, label='NE Rock (1/2)')
    ax.axhline(y=1/3, color='green', linestyle='--', alpha=0.7, label='NE Scissors (1/3)')
    ax.axhline(y=1/6, color='red', linestyle='--', alpha=0.7, label='NE Paper (1/6)')
    
    ax.set_title(f'ε-Greedy FP (ε = {epsilon}) - Player 1')
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Action Frequency')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Analysis
    print(f"\nEpsilon = {epsilon}:")
    print(f"Final frequencies - Rock: {p1_freq[-1, 0]:.4f}, Scissors: {p1_freq[-1, 1]:.4f}, Paper: {p1_freq[-1, 2]:.4f}")
    print(f"Expected NE: (1/2, 1/3, 1/6)")
    print(f"Convergence to NE: {np.allclose(p1_freq[-1], [1/2, 1/3, 1/6], atol=0.05)}")

plt.tight_layout()
plt.show()

print("\n=== ANALYSIS ===")
print("Impact of epsilon on learning dynamics:")
print("- Lower epsilon (0.01): mostly best responses, so the dynamics stay close to plain Fictitious Play")
print("- Higher epsilon (0.3): 30% of actions are uniformly random, which adds noise to the opponent's beliefs")
print("- Each agent plays (1 - epsilon) * best response + epsilon * uniform; compare the frequencies printed above with the NE")
print("- The exploration-exploitation trade-off affects how quickly and how closely the frequencies approach the NE")


In [None]:
# Implementation of simulate_regret_matching function
def simulate_regret_matching(A, B, iterations):
    """
    Simulates Regret Matching for two players.

    Returns:
        tuple: A tuple containing:
            - p1_avg_strat_hist (np.ndarray): History of Player 1's average strategy.
            - p1_inst_strat_hist (np.ndarray): History of Player 1's instantaneous strategy.
    """
    num_actions = A.shape[0]
    
    # Initialize regrets, strategy sums, and history lists
    p1_regrets = np.zeros(num_actions)
    p2_regrets = np.zeros(num_actions)
    
    p1_strategy_sum = np.zeros(num_actions)
    p2_strategy_sum = np.zeros(num_actions)
    
    p1_avg_strat_hist = []
    p1_inst_strat_hist = []
    
    # Loop for iterations
    for t in range(iterations):
        # Calculate the current strategy based on positive regrets
        # Player 1's strategy
        p1_positive_regrets = np.maximum(0, p1_regrets)
        p1_regret_sum = np.sum(p1_positive_regrets)
        
        if p1_regret_sum == 0:
            # If sum of positive regrets is 0, play uniformly random
            p1_strategy = np.ones(num_actions) / num_actions
        else:
            p1_strategy = p1_positive_regrets / p1_regret_sum
        
        # Player 2's strategy
        p2_positive_regrets = np.maximum(0, p2_regrets)
        p2_regret_sum = np.sum(p2_positive_regrets)
        
        if p2_regret_sum == 0:
            # If sum of positive regrets is 0, play uniformly random
            p2_strategy = np.ones(num_actions) / num_actions
        else:
            p2_strategy = p2_positive_regrets / p2_regret_sum
        
        # Store the instantaneous strategy and add to the strategy sum
        p1_inst_strat_hist.append(p1_strategy.copy())
        p1_strategy_sum += p1_strategy
        
        # Choose actions based on the current strategies
        p1_action = np.random.choice(num_actions, p=p1_strategy)
        p2_action = np.random.choice(num_actions, p=p2_strategy)
        
        # Update regrets for ALL actions based on the outcome
        # Player 1's regrets: payoff of each alternative minus the realized payoff
        p1_regrets += A[:, p2_action] - A[p1_action, p2_action]
        
        # Player 2's regrets
        p2_regrets += B[p1_action, :] - B[p1_action, p2_action]
        
        # Record the average strategy (every iteration after the first)
        if t > 0:
            p1_avg_strategy = p1_strategy_sum / (t + 1)
            p1_avg_strat_hist.append(p1_avg_strategy.copy())
    
    return np.array(p1_avg_strat_hist), np.array(p1_inst_strat_hist)


In [None]:
# Problem 4.2 Analysis
# Run regret matching simulation and create comparison plots

# Set random seed for reproducibility
np.random.seed(42)

print("Running Regret Matching simulation for modified RSP game...")
p1_avg_strat, p1_inst_strat = simulate_regret_matching(A_mod, -A_mod, 1000000)

# Create plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Subplot 1: Instantaneous strategy
ax1.plot(p1_inst_strat[:, 0], label='Rock', alpha=0.8, color='blue')
ax1.plot(p1_inst_strat[:, 1], label='Scissors', alpha=0.8, color='green')
ax1.plot(p1_inst_strat[:, 2], label='Paper', alpha=0.8, color='red')
ax1.set_title('Regret Matching - Player 1 Instantaneous Strategy')
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Action Probability')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Subplot 2: Average strategy with NE lines
ax2.plot(p1_avg_strat[:, 0], label='Rock', alpha=0.8, color='blue')
ax2.plot(p1_avg_strat[:, 1], label='Scissors', alpha=0.8, color='green')
ax2.plot(p1_avg_strat[:, 2], label='Paper', alpha=0.8, color='red')

# Add Nash Equilibrium lines
ax2.axhline(y=1/2, color='blue', linestyle='--', alpha=0.7, label='NE Rock (1/2)')
ax2.axhline(y=1/3, color='green', linestyle='--', alpha=0.7, label='NE Scissors (1/3)')
ax2.axhline(y=1/6, color='red', linestyle='--', alpha=0.7, label='NE Paper (1/6)')

ax2.set_title('Regret Matching - Player 1 Average Strategy')
ax2.set_xlabel('Iteration')
ax2.set_ylabel('Action Probability')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Analysis
print("\n=== ANALYSIS ===")
print("Instantaneous Strategy:")
print(f"Final instantaneous probabilities - Rock: {p1_inst_strat[-1, 0]:.4f}, Scissors: {p1_inst_strat[-1, 1]:.4f}, Paper: {p1_inst_strat[-1, 2]:.4f}")

print("\nAverage Strategy:")
print(f"Final average probabilities - Rock: {p1_avg_strat[-1, 0]:.4f}, Scissors: {p1_avg_strat[-1, 1]:.4f}, Paper: {p1_avg_strat[-1, 2]:.4f}")
print("Expected NE: (1/2, 1/3, 1/6)")
print(f"Average strategy convergence to NE: {np.allclose(p1_avg_strat[-1], [1/2, 1/3, 1/6], atol=0.01)}")

print("\n=== CONCLUSION ===")
print("Key observations:")
print("1. The instantaneous strategy oscillates and doesn't converge to NE")
print("2. The average strategy converges to the Nash Equilibrium")
print("3. This matches the theory: Regret Matching is a no-regret algorithm")
print("4. In two-player zero-sum games, no-regret play by both players drives the average strategies to a NE")
print("5. The instantaneous strategy has no such guarantee and can keep cycling")


In [None]:
def simulate_fictitious_play(A, B, iterations):
    """
    Simulates Fictitious Play for two players in a normal-form game.

    Args:
        A (np.ndarray): Payoff matrix for Player 1.
        B (np.ndarray): Payoff matrix for Player 2.
        iterations (int): The number of rounds to play.

    Returns:
        tuple: A tuple containing:
            - p1_freq_history (np.ndarray): History of Player 1's action frequencies.
            - p2_freq_history (np.ndarray): History of Player 2's action frequencies.
    """
    ### YOUR CODE HERE ###
    # Initialize action counts and history lists
    # Loop for the specified number of iterations
    # On iteration 0, use a tie-breaking rule (e.g., play action 0)
    # On subsequent iterations, calculate best response to opponent's historical frequencies
    # Update action counts
    # Periodically record the current action frequencies for plotting
    
    pass # Replace with your implementation

# --- Payoff Matrices ---
A_std = np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
A_mod = np.array([[0, 1, -2], [-1, 0, 3], [2, -3, 0]])

### 2.2 Analysis

**Your Task:**
1.  Run your simulation for **1,000,000 iterations** on both the **standard** and **modified** RSP games.
2.  Generate two plots, one for each game. Each plot should show the evolution of the players' action frequencies over time and include horizontal lines indicating the theoretical NE probabilities you calculated in Problem 1.
3.  **Analyze your results:** Do the action frequencies converge? If so, do they converge to the Nash Equilibrium? Explain the observed behavior.
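
Plotting all 1,000,000 points per curve can be slow. One option, sketched below, is to keep only a log-spaced subset of the recorded history before plotting. `thin_history` and `max_points` are hypothetical names, and the toy array merely stands in for a real frequency history.

```python
import numpy as np

def thin_history(freq_history, max_points=2000):
    """Keep roughly max_points log-spaced rows of a (T, num_actions) history.

    Hypothetical helper: returns the kept row indices and the matching rows.
    """
    T = len(freq_history)
    # Log-spaced positions in 1..T, rounded to integers and de-duplicated.
    positions = np.round(np.geomspace(1, T, num=min(max_points, T))).astype(int)
    idx = np.unique(positions) - 1  # convert to 0-based row indices
    return idx, freq_history[idx]

# Toy history with 1,000,000 rows standing in for a simulated frequency history.
history = np.tile([1 / 3, 1 / 3, 1 / 3], (1_000_000, 1))
steps, thinned = thin_history(history)
print(len(steps), thinned.shape)
```

The thinned rows can then be plotted against `steps` (for example with a logarithmic x-axis), which keeps the early transient visible without drawing every iteration.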

---
## Problem 3: Fictitious Play with Exploration (Implementation)

Our Fictitious Play agent is purely exploitative. In Reinforcement Learning, we know the importance of the **exploration-exploitation tradeoff**. Let's create an $\epsilon$-greedy version of Fictitious Play.

### 3.1 Implementation

**Your Task:** Create a new function, `simulate_epsilon_greedy_fp`. This function should be similar to your Fictitious Play implementation but include an `epsilon` parameter. At each step, with probability `epsilon`, the agent should choose a random action (explore). With probability `1-epsilon`, it should play the best response (exploit).
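
The explore/exploit rule for a single agent can be sketched as follows. `epsilon_greedy_action` is a hypothetical helper, and the sketch assumes each player draws its own exploration coin independently of the other.

```python
import numpy as np

def epsilon_greedy_action(best_action, num_actions, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise play best_action."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(best_action)

rng = np.random.default_rng(0)

# Suppose the best response is always action 0 and epsilon = 0.3.
# Action 0 should then be played with probability 0.7 + 0.3 / 3 = 0.8.
samples = [epsilon_greedy_action(0, 3, 0.3, rng) for _ in range(10_000)]
freq = np.bincount(samples, minlength=3) / len(samples)
print(freq)
```

Note that exploration can still land on the best response, so the best action is chosen with probability $1 - \epsilon + \epsilon / |A_i|$ rather than $1 - \epsilon$.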

In [None]:
def simulate_epsilon_greedy_fp(A, B, iterations, epsilon):
    """
    Simulates epsilon-greedy Fictitious Play.
    """
    ### YOUR CODE HERE ###
    
    pass # Replace with your implementation


### 3.2 Analysis

**Your Task:**
1.  Run the `simulate_epsilon_greedy_fp` function on the **modified** RSP game for **1,000,000 iterations** with three different `epsilon` values: `0.01`, `0.1`, and `0.3`.
2.  Plot the results for each simulation.
3.  **Analyze your results:** How does `epsilon` affect the learning dynamics? Does the agent's strategy still converge to the NE? If not, to what does it converge? Discuss the impact of exploration in this multi-agent context.

---
## Problem 4: Learning from "What If" - Regret Matching (Implementation & Theory)

Regret Matching is a no-regret learning algorithm. Instead of playing a best response to history, an agent's probability of choosing an action is proportional to the positive **regret** for not having chosen that action in the past. Its key property here is that, in two-player zero-sum games such as RSP, the **average strategy** over time converges to a Nash Equilibrium.

### 4.1 Implementation

**Your Task:** Implement the `simulate_regret_matching` function below.

**Algorithm:** Regret Matching works in two steps. First, update the cumulative regrets. Second, determine the next round's strategy.

1.  **Regret Calculation:** After playing action $a_i$ against opponent's action $a_{-i}$, the cumulative regret $R_t(s)$ for *not* having played action $s \in A_i$ is updated as follows:
    $$R_t(s) = R_{t-1}(s) + u_i(s, a_{-i}) - u_i(a_i, a_{-i})$$

2.  **Strategy Calculation:** The probability of playing action $s$ in the next round is proportional to its positive cumulative regret, $R_t^+(s) = \max(0, R_t(s))$.
    $$p_{t+1}(s) = \frac{R_t^+(s)}{\sum_{s' \in A_i} R_t^+(s')}$$
    If the sum of positive regrets is zero, play uniformly at random.
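
The two steps above can be traced on a single round of the modified game. This is a minimal sketch; `regret_matching_strategy` is a hypothetical helper name, and the round played (Rock against Paper) is chosen only for illustration.

```python
import numpy as np

def regret_matching_strategy(cum_regrets):
    """Map cumulative regrets to a strategy proportional to their positive parts."""
    positive = np.maximum(cum_regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret: fall back to the uniform strategy.
    return np.full(len(cum_regrets), 1.0 / len(cum_regrets))

A_mod = np.array([[0, 1, -2], [-1, 0, 3], [2, -3, 0]])
regrets = np.zeros(3)

# One round: Player 1 played Rock (0) and the opponent played Paper (2).
my_action, opp_action = 0, 2

# Step 1: regret of each alternative = its payoff minus the realized payoff.
regrets += A_mod[:, opp_action] - A_mod[my_action, opp_action]

# Step 2: next strategy is proportional to the positive regrets.
next_strategy = regret_matching_strategy(regrets)
print(regrets, next_strategy)
```

Here the realized payoff is $-2$, so the regrets become $(0, 5, 2)$ and the next strategy puts probability $5/7$ on Scissors and $2/7$ on Paper.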

In [None]:
def simulate_regret_matching(A, B, iterations):
    """
    Simulates Regret Matching for two players.

    Returns:
        tuple: A tuple containing:
            - p1_avg_strat_hist (np.ndarray): History of Player 1's average strategy.
            - p1_inst_strat_hist (np.ndarray): History of Player 1's instantaneous strategy.
    """
    num_actions = A.shape[0]
    
    ### YOUR CODE HERE ###
    # Initialize regrets, strategy sums, and history lists
    # Loop for iterations
    # Calculate the current strategy based on positive regrets
    #   (If sum of positive regrets is 0, play uniformly random)
    # Store the instantaneous strategy and add to the strategy sum
    # Choose actions based on the current strategies
    # Update regrets for ALL actions based on the outcome
    # Periodically record the average and instantaneous strategies
    
    pass # Replace with your implementation


### 4.2 Analysis

**Your Task:**
1.  Run your simulation for the **modified** RSP game for **1,000,000 iterations**.
2.  Generate a single figure with two subplots:
    * **Subplot 1:** Plot the **instantaneous strategy** of Player 1 over time.
    * **Subplot 2:** Plot the **average strategy** of Player 1 over time. Include horizontal lines for the NE.
3.  **Analyze your results:** Compare the two plots. Which one converges to the Nash Equilibrium?
4.  **(Bonus)** Explain why this is the expected theoretical outcome for Regret Matching algorithms.
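
One way to make the bonus argument concrete: in a two-player zero-sum game, the sum of both players' average regrets upper-bounds how much the players could gain by deviating from their empirical average play, so vanishing regret forces that gain toward zero. The sketch below measures this gain for a given strategy pair. The name `exploitability` is a common term for it, but the helper itself is an illustrative assumption, not part of the assignment.

```python
import numpy as np

def exploitability(A, p, q):
    """Sum of both players' best-response gains in a zero-sum game.

    A is Player 1's payoff matrix and Player 2 receives -A. The value is
    non-negative and equals zero exactly when (p, q) is a Nash Equilibrium.
    """
    value = p @ A @ q                       # Player 1's expected payoff under (p, q)
    p1_gain = np.max(A @ q) - value         # Player 1's gain from a best response
    p2_gain = value - np.min(p @ A)         # Player 2's gain from a best response
    return p1_gain + p2_gain

A_mod = np.array([[0, 1, -2], [-1, 0, 3], [2, -3, 0]])
ne = np.array([1 / 2, 1 / 3, 1 / 6])
uniform = np.full(3, 1 / 3)

expl_ne = exploitability(A_mod, ne, ne)
expl_uniform = exploitability(A_mod, uniform, uniform)
print(expl_ne, expl_uniform)
```

Evaluating this quantity on the final average strategies of both players gives a single number to track instead of comparing each probability to the NE separately.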