# L10c: Multiplicative Weights Update Algorithm and Zero-Sum Games
In this lab, we'll use the multiplicative weights update algorithm to find approximate Nash equilibria in zero-sum games, in particular focusing on the classic Rock-Paper-Scissors game.

> __Learning Objectives:__
>
> After completing this activity, students will be able to:
>
> * **Understand the Multiplicative Weights Algorithm:** We explore the theoretical foundations of the Multiplicative Weights Algorithm, including how it adapts strategies in zero-sum games by updating weights based on performance, leading to approximate Nash equilibria.
> * **Implement the Multiplicative Weights Algorithm:** We build and execute the MWA for zero-sum games using Rock-Paper-Scissors as an example. The implementation shows how to simulate repeated play and compute average strategies over time.
> * **Analyze convergence to Nash equilibrium:** We empirically check if the algorithm converges to an approximate Nash equilibrium by computing average strategies and verifying the ε-condition holds as the number of rounds increases.

Let's get started!
___

## Background: Multiplicative Weights Algorithm (MWA)
The **Multiplicative Weights Algorithm (MWA)** is a simple yet robust online learning method that embodies a similar idea to the weighted majority algorithm, i.e., learning from expert advice. Here, the learning rate $\eta$ plays a role analogous to $\varepsilon$ in the Weighted Majority Algorithm, controlling adaptation speed. 

Let’s walk through the setup and sketch out the algorithm.

### Problem Setting
Suppose we are faced with a repeated decision-making task over rounds $t = 1, 2, \ldots, T$. At each round, we have access to $N$ experts, each providing a recommendation or prediction. Our goal is to combine their advice adaptively in order to make strong decisions over time, even in adversarial or uncertain environments.

* Let $\mathbf{p}^{(t)} = \{p_1^{(t)}, p_2^{(t)}, \ldots, p_N^{(t)}\}$ denote our belief distribution over experts at round $t$, updated iteratively based on their past performance.
* We select an expert by sampling from this distribution—for example, using a Categorical distribution:
  $i \sim \texttt{Categorical}(\mathbf{p}^{(t)})$—and follow that expert’s recommendation.
* After the decision is made, the environment (or adversary) reveals the true outcome. We then compute a cost vector $\mathbf{m}^{(t)} = \{m_1^{(t)}, \dots, m_N^{(t)}\}$, where $m_i^{(t)} \in [-1, 1]$ denotes the cost incurred by expert $i$ at time $t$. A correct prediction receives a cost of $-1$, and an incorrect one receives a cost of $+1$.

### Algorithm
__Initialize__: Fix a learning rate $\eta\leq{1}/{2}$ and, for each expert $i$, initialize the weight $w_{i}^{(1)} = 1$.

For $t=1,2,\dots,T$:
1. Choose expert $i$ with probability $p_{i}^{(t)} = w_{i}^{(t)}/\sum_{j=1}^{N}w_{j}^{(t)}$. Ask expert $i$ what the outcome of the experiment should be, and denote the expert's answer by $\hat{y}_{i}^{(t)}$.
2. The adversary (nature) reveals the true outcome $y_{t}$ of the experiment at time $t$. Compute the cost of following expert $i$, denoted $m_{i}^{(t)}$:
    $$
    m_i^{(t)} =
    \begin{cases}
    -1 & \text{if } \hat{y}_i^{(t)} = y_t \quad \text{(correct)} \\
    +1 & \text{if } \hat{y}_i^{(t)} \neq y_t \quad \text{(incorrect)}
    \end{cases}
   $$
3. Update the weight of each expert $i$, then renormalize the weights to obtain the new probability distribution:
$$
\begin{align*}
w_{i}^{(t+1)} = w_{i}^{(t)}\cdot\left(1-\eta\cdot{m_{i}^{(t)}}\right)
\end{align*}
$$
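
The three steps above can be sketched in a few lines of Julia. This is a minimal illustration, not the lab's implementation: the function name `mwa_experts`, the toy cost matrix, and the seed are all assumptions made for the example. Costs are supplied directly as a $T\times N$ matrix with entries in $[-1,1]$, so the expert predictions themselves are not modeled.

```julia
using Random

# Hypothetical helper (not part of the lab's src/ code): run the linear
# multiplicative-weights update on a T×N cost matrix with entries in [-1, 1].
function mwa_experts(costs::Matrix{Float64}, η::Float64; rng = Random.default_rng())
    T, N = size(costs)
    w = ones(N)                       # w_i^(1) = 1 for every expert
    chosen = zeros(Int, T)            # index of the expert we followed each round
    for t in 1:T
        p = w ./ sum(w)               # p_i^(t) = w_i^(t) / Σ_j w_j^(t)
        # sample i ~ Categorical(p) by inverting the cumulative distribution
        chosen[t] = min(searchsortedfirst(cumsum(p), rand(rng)), N)
        w .*= 1 .- η .* costs[t, :]   # w_i^(t+1) = w_i^(t) (1 - η m_i^(t))
    end
    return w ./ sum(w), chosen
end

# Toy costs: expert 1 is always correct (-1), expert 2 is always wrong (+1),
# expert 3 alternates between wrong and correct.
T = 100
costs = hcat(fill(-1.0, T), fill(1.0, T), [isodd(t) ? 1.0 : -1.0 for t in 1:T])
p_final, chosen = mwa_experts(costs, 0.1; rng = MersenneTwister(1))
println("final distribution over experts: ", p_final)
```

In this toy setting nearly all of the probability mass should end up on expert 1, because its weight grows by a factor of $1+\eta$ each round while the others shrink or stay roughly flat.
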

This is a super simple algorithm, with some very nice properties. The weights are updated multiplicatively based on the performance of each expert, hence the name Multiplicative Weights Algorithm. The learning rate $\eta$ controls how aggressively the algorithm adapts to the experts' performance. And there is a theoretical guarantee that the algorithm will perform nearly as well as the best fixed expert in hindsight!

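To make this precise, let the regret $R(T)$ be the difference between the algorithm's expected cumulative cost and the cumulative cost of the best fixed expert in hindsight. For costs in $[-1,1]$ and $\eta\leq 1/2$, the standard analysis gives $R(T) \leq \eta T + \frac{\ln N}{\eta}$. By choosing $\eta = \sqrt{\frac{\ln N}{T}}$ (which satisfies $\eta\leq 1/2$ once $T \geq 4\ln N$), this regret bound becomes __sublinear__ in $T$: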
$$
R(T) \leq 2 \sqrt{T \ln N}
$$
This ensures that the algorithm's **average regret per round** vanishes as $T \to \infty$, meaning that MWA performs nearly as well as the best fixed expert in hindsight.
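
To get a feel for the numbers, the short sketch below evaluates the learning rate $\eta = \sqrt{\ln N / T}$ and the bound $2\sqrt{T\ln N}$ for a few horizons. The helper names `learning_rate` and `regret_bound` are made up for this illustration, and $N = 3$ is chosen only because it matches the three-action game used later in the lab.

```julia
# Hypothetical helpers for the formulas above (names are illustrative only).
learning_rate(N, T) = sqrt(log(N) / T)        # η = √(ln N / T)
regret_bound(N, T) = 2 * sqrt(T * log(N))     # R(T) ≤ 2 √(T ln N)

N = 3
for T in (100, 2_000, 100_000)
    η = learning_rate(N, T)
    R = regret_bound(N, T)
    println("T = $T: η = $(round(η, digits = 4)), ",
            "bound = $(round(R, digits = 2)), ",
            "bound / T = $(round(R / T, digits = 4))")
end
```

The bound itself grows with $T$, but the per-round value `bound / T` equals $2\eta$ and shrinks like $1/\sqrt{T}$, which is the sense in which the average regret vanishes.
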
___

## MWA Applied to zero-sum games
Let's consider the application of the multiplicative weights update algorithm to zero-sum games. 

> In [a zero-sum game](https://en.wikipedia.org/wiki/Zero-sum_game), players have _opposing interests_, and the players' payoffs sum to zero: one's gain is the other's loss. The multiplicative-weights (MW) algorithm finds (approximate) Nash equilibria by down-weighting poorly performing actions over repeated play.

Let's dig into some of the details of the game:
* **Game**: Consider a competitive setting with $k$ players. A game is called **zero-sum** if, for any outcome, the players' payoffs add to zero. The standard theory we use below focuses on the $k = 2$ case. Each player chooses an action $a \in \mathcal{A}$ from some finite action set $\mathcal{A}$ with $|\mathcal{A}| = N$. For the two-player case, we model payoffs with a matrix $\mathbf{M} \in \mathbb{R}^{N \times N}$ (for simplicity, assume both players have $N$ actions). If the row player chooses action $i$ and the column player chooses action $j$, then the row player's payoff is $m_{ij}$ and the column player's payoff is $-m_{ij}$. This is what we mean by __zero-sum__: whatever one player gains, the other loses.

* **Goals**: The row player wants to **maximize** their payoff. The column player wants to **minimize** the row player's payoff. Let the row player randomize over rows using a mixed strategy $\mathbf{p}$ (a probability distribution over the $N$ rows), and let the column player randomize over columns using a mixed strategy $\mathbf{q}$ (a probability distribution over the $N$ columns). The expected payoff to the row player is $\mathbf{p}^{\top}\mathbf{M}\mathbf{q}$ and because the game is zero-sum, the expected payoff to the column player is $-\mathbf{p}^{\top}\mathbf{M}\mathbf{q}$. So both players care about the same scalar $\mathbf{p}^{\top}\mathbf{M}\mathbf{q}$, but they pull it in opposite directions.

* **Nash Equilibrium**: A Nash equilibrium is a pair of (possibly mixed) strategies $(\mathbf{p}^*, \mathbf{q}^*)$ such that each player's strategy is a best response to the other's. In other words, given $\mathbf{q}^*$, the row player cannot switch from $\mathbf{p}^*$ to some other $\mathbf{p}$ and improve their expected payoff, and given $\mathbf{p}^*$, the column player cannot switch from $\mathbf{q}^*$ to some other $\mathbf{q}$ and further reduce the row player's expected payoff.

In a two-player zero-sum game, every Nash equilibrium corresponds to a **minimax solution**. The minimax theorem guarantees that:
$$
\max_{\mathbf{p}} \min_{\mathbf{q}} \mathbf{p}^{\top}\mathbf{M}\mathbf{q} = \min_{\mathbf{q}} \max_{\mathbf{p}} \mathbf{p}^{\top}\mathbf{M}\mathbf{q} = v
$$
where $v$ is called the value of the game. At equilibrium, the row player's strategy $\mathbf{p}^*$ guarantees at least $v$ no matter what the column player does, and the column player's strategy $\mathbf{q}^*$ holds the row player to at most $v$ no matter what the row player does. That shared value $v$ is the Nash equilibrium payoff.
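
As a small numerical illustration of these definitions, the sketch below uses matching pennies, a $2\times 2$ game that is an assumption of this example and not part of the lab. Because $\mathbf{p}^{\top}\mathbf{M}\mathbf{q}$ is linear in each argument, the inner minimum and maximum are attained at pure actions, so it is enough to scan the entries of $\mathbf{M}^{\top}\mathbf{p}$ and $\mathbf{M}\mathbf{q}$. The function names are hypothetical.

```julia
using LinearAlgebra

# Illustrative game (an assumption, not from the lab): matching pennies.
M = [1.0 -1.0; -1.0 1.0]

expected_payoff(p, q) = dot(p, M * q)      # pᵀ M q
row_guarantee(p) = minimum(M' * p)         # min over pure columns of pᵀ M e_j
col_guarantee(q) = maximum(M * q)          # max over pure rows of e_iᵀ M q

p_star = [0.5, 0.5]
q_star = [0.5, 0.5]
println("payoff at (p*, q*): ", expected_payoff(p_star, q_star))
println("row guarantee:      ", row_guarantee(p_star))
println("column guarantee:   ", col_guarantee(q_star))

# A lopsided row strategy can be exploited by the column player.
println("row guarantee for p = [0.8, 0.2]: ", row_guarantee([0.8, 0.2]))
```

For the uniform strategies all three printed quantities coincide, which is the value $v$ of this game. The lopsided strategy guarantees strictly less, so it is not a minimax strategy.
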
  
Finally, learning dynamics: if both players repeatedly play the game and update their mixed strategies using no-regret algorithms (algorithms with sublinear regret) such as multiplicative weights, then the time-averaged strategies approach an $\epsilon$-Nash equilibrium (equivalently, an $\epsilon$-minimax solution), where $\epsilon$ becomes small as the average regret becomes small.


### Algorithm
Let's outline a simple implementation of the multiplicative weights update algorithm for a two-player zero-sum game. Given a payoff matrix $\mathbf{M}\in\mathbb{R}^{N\times{N}}$, we want to find a _mixed strategy_, a probability distribution over actions, for the row player that minimizes expected loss.

__Initialization:__ Given a payoff matrix $\mathbf{M}\in\mathbb{R}^{N\times{N}}$, where the payoffs (elements of $\mathbf{M}$) are in the range $m_{ij}\in[-1, 1]$. 
Initialize the weights $w_{i}^{(1)} \gets 1$ for all actions $i\in\mathcal{A}$, where $\mathcal{A} = \{1,2,\dots,N\}$, and set the learning rate $\eta\in(0,1)$.

> __Choosing T__: The number of rounds $T$ determines the accuracy of the approximate Nash equilibrium. Because $\epsilon = O\bigl(\sqrt{\ln N / T}\bigr)$, reaching an $\epsilon$-Nash equilibrium requires $T$ on the order of $\frac{\ln N}{\epsilon^2}$, up to a constant factor. For example, with $N=10$ actions and desired accuracy $\epsilon=0.1$, we need roughly $\frac{\ln 10}{0.01} \approx 230$ rounds.

> __Choosing η__: The learning rate $\eta$ controls the step size of weight updates. Common rules of thumb include:
> - __Theory-based__: $\eta = \sqrt{\frac{\ln N}{T}}$ optimizes the convergence bound
> - __Simple rule__: $\eta = \frac{1}{\sqrt{T}}$ for practical applications  
> - __Adaptive__: Start with $\eta = 0.1$ and reduce by half if convergence stalls
> - __Constraint__: Keep $\eta < 1$ so that a linear update of the form $1-\eta\,\ell_i$ stays positive for losses bounded in $[-1,1]$; the exponential update used below is positive for any $\eta > 0$

For each round $t=1,2,\dots,T$ __do__:
1. Compute the normalization factor: $\Phi^{(t)} \gets \sum_{i=1}^{N}w_{i}^{(t)}$.
2. __Row player__ computes its strategy: The _row player_ will choose an action with probability $\mathbf{p}^{(t)} \gets \left\{w_{i}^{(t)}/\Phi^{(t)} \mid i = 1,2,\dots,N\right\}$. Let the row player action be $i^{\star}$.
3. __Column player__ computes its strategy: The _column player_ will choose action: $j\gets \arg\min_{j\in\mathcal{A}}\left\{\mathbf{p}^{(t)\top}\mathbf{M}\mathbf{e}_{j}\right\}$, so that $\mathbf{q}^{(t)} \gets \mathbf{e}_{j}$, where $\mathbf{e}_{j}$ is the $j$-th standard basis vector. The row player experiences loss vector $\boldsymbol{\ell}^{(t)} \gets \mathbf{L}\mathbf{q}^{(t)}$, where $\mathbf{L} = -\mathbf{M}$ is the loss matrix.
4. Update the weights: $w_i^{(t+1)} \gets w_i^{(t)}\;\exp\bigl(-\eta\,\ell_i^{(t)}\bigr)$ for all actions $i\in\mathcal{A}$ for the row player.
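
The loop above can be sketched as a self-contained Julia function. This is a hedged illustration under stated assumptions, not the lab's `play(...)` method: the name `mwa_zero_sum` is hypothetical, the row player's sampled action $i^{\star}$ is omitted because the weight update uses the full loss vector, and ties in the column player's best response are broken by taking the first minimizer. $T$ and $\eta$ are set with the rules of thumb given earlier, for an assumed target accuracy of $0.05$.

```julia
# Hypothetical sketch (not the lab's `play` method): the row player runs the
# exponential multiplicative-weights update, the column player best-responds.
function mwa_zero_sum(M::AbstractMatrix{<:Real}, T::Int, η::Float64)
    N = size(M, 1)
    w = ones(N)                           # w_i^(1) = 1
    p_sum = zeros(N)
    q_sum = zeros(size(M, 2))
    for t in 1:T
        p = w ./ sum(w)                   # row strategy p^(t) = w^(t) / Φ^(t)
        j = argmin(M' * p)                # column best response: argmin_j p^(t)ᵀ M e_j
        ℓ = -M[:, j]                      # loss vector ℓ^(t) = L q^(t), with L = -M
        w .*= exp.(-η .* ℓ)               # w_i^(t+1) = w_i^(t) exp(-η ℓ_i^(t))
        w ./= sum(w)                      # rescale for numerical safety; ratios are unchanged
        p_sum .+= p
        q_sum[j] += 1.0
    end
    return p_sum ./ T, q_sum ./ T         # time-averaged strategies
end

M_rps = [0 -1 1; 1 0 -1; -1 1 0]          # Rock-Paper-Scissors payoff matrix
ϵ_target = 0.05                           # assumed target accuracy
T_rounds = ceil(Int, log(3) / ϵ_target^2) # rule of thumb: T on the order of ln N / ϵ²
η_rate = sqrt(log(3) / T_rounds)          # theory-based learning rate
p_bar, q_bar = mwa_zero_sum(M_rps, T_rounds, η_rate)
println("T = $T_rounds, η = $η_rate")
println("p̄ = ", p_bar)
println("q̄ = ", q_bar)
```

With the column player minimizing $\mathbf{p}^{(t)\top}\mathbf{M}\mathbf{e}_j$ as in the column-player step, both averaged strategies should land close to uniform for Rock-Paper-Scissors; how close depends on $T$ and $\eta$.
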

### Convergence
After $T$ rounds, define the average strategies:  
$$
\bar p \;=\;\frac{1}{T}\sum_{t=1}^{T}p^{(t)}, 
\quad
\bar q \;=\;\frac{1}{T}\sum_{t=1}^{T}q^{(t)}.
$$
Then $(\bar p,\bar q)$ is an $\epsilon$-Nash equilibrium with
$$
  \max_{q}\,\bar p^\top M\,q
  \;-\;\min_{p}\,p^\top M\,\bar q
  \;\le\;\epsilon,
  \quad
  \epsilon = O\Bigl(\sqrt{\tfrac{\ln N}{T}}\Bigr).
$$

___

## Example: Rock-Paper-Scissors
Let's consider an example of a two-player zero-sum game: [Rock-Paper-Scissors](https://en.wikipedia.org/wiki/Rock_paper_scissors). In this game, each player _simultaneously_ chooses one of three possible actions: Rock, Paper, or Scissors. Each round has three possible outcomes for a player: win, lose, or draw.
> __Rules:__ A player who decides to play rock will beat another player who chooses scissors (`rock crushes scissors`), but will lose to one who has played paper (`paper covers rock`); a play of paper will lose to a play of scissors (`scissors cuts paper`). If both players choose the same shape, the game is a draw.

The payoff matrix for this game is the `3` $\times$ `3` matrix:
$$
\begin{align*}
\mathbf{M} = \begin{pmatrix}
0 & -1 & 1\\
1 & 0 & -1\\
-1 & 1 & 0
\end{pmatrix}
\end{align*}
$$
where the rows correspond to the actions of the _row player_ and the columns correspond to the actions of the _column player_, both ordered as rock, paper, scissors. The payoff for the _row player_ is $m_{ij}$, and the payoff for the _column player_ is $-m_{ij}$.
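
Before running any learning dynamics, a quick check of this matrix shows what the equilibrium should look like. The sketch below is illustrative only, and the variable names are assumptions. It verifies that $\mathbf{M}$ is skew-symmetric and computes the payoff of uniform play against every pure action of the opponent.

```julia
M = [0 -1 1; 1 0 -1; -1 1 0]              # rows/columns ordered rock, paper, scissors
println("skew-symmetric (M == -Mᵀ): ", M == -M')

p_uniform = fill(1/3, 3)
payoff_vs_columns = M' * p_uniform        # row payoff of uniform row play vs each pure column
payoff_vs_rows = M * p_uniform            # row payoff of each pure row vs uniform column play
println("uniform row vs. each column: ", payoff_vs_columns)
println("each row vs. uniform column: ", payoff_vs_rows)
```

Uniform play earns zero against every pure action on either side, so neither player can gain by deviating from it. The pair of uniform strategies is therefore a Nash equilibrium, and the value of the game is $v = 0$.
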

In [1]:
include("Include-student.jl"); # load my codes, packages, etc


__Build a model__. Let's construct an instance of [the `MyTwoPersonZeroSumGameModel` type](src/Types.jl) using [a custom `build(...)` method](src/Factory.jl). The model holds information associated with the game. We store the game model in the `model::MyTwoPersonZeroSumGameModel` variable:

In [2]:
model = let

    # setup 
    M = [0 -1 1; 1 0 -1 ; -1 1 0]; # rock paper scissors payoff matrix
    T = 2000; # number of rounds we play the game
    n = 3; # number of actions
    η = sqrt(log(n)/T); # learning rate

    # build a model -
    model = build(MyTwoPersonZeroSumGameModel, (
        ϵ = η, # learning rate
        n = n, # number of actions
        T = T, # number of rounds we play the game
        payoffmatrix = M, # payoff matrix
    ));

    model; # return the model
end;

__Play the game__. Next, we play the game. We pass the `model::MyTwoPersonZeroSumGameModel` instance into [the `play(...)` method](src/Online.jl) as the only argument. This method returns two arrays: the raw game output, in which each row is a game instance (round) and each column is a player's action, and the weights matrix.
* The `rps_sim::Array{Int64,2}` array holds the outcome of each game encoded as 1 = rock, 2 = paper and 3 = scissors. The first column is the _row player_, while the second is the _column player_.
* The `weights::Array{Float64,2}` array holds the _row player_ weights for each round of the game; normalizing a row gives the row player's distribution for that round.

Let's play the game for $T$ rounds and see what happens.

In [3]:
(rps_sim, weights) = play(model);

What's in the `rps_sim` and `weights` variables?

In [4]:
rps_sim

2000×2 Matrix{Int64}:
 1  3
 3  3
 3  3
 1  3
 1  3
 1  3
 1  3
 3  3
 2  3
 1  3
 ⋮  
 1  3
 1  3
 1  3
 1  3
 1  3
 1  3
 1  3
 1  3
 1  3

__Game outcome table__. `Unhide` the code block below to see how we constructed the game table [using the `pretty_table(...)` method exported by the `PrettyTables.jl` package](https://github.com/ronisbr/PrettyTables.jl).

> __Summary:__ Each row of the table displays the game's outcome. The first column shows the action of the _row player_, while the second column shows the (near) optimal action of the _column player_, given the action of the _row player_.

So what do we see?

In [5]:
let

    # initialize -
    T = model.T;
    moves = Dict{Int, String}(1 => "rock", 2=> "paper", 3=>"scissors"); # setup moves map
    df = DataFrame();

    # build rounds table -
    for t ∈ 1:T
        row_df = (
            game = t,
            row_player = rps_sim[t,1] |> i-> moves[i],
            col_player = rps_sim[t,2] |> i-> moves[i],
        )
        push!(df, row_df);
    end
    
    # build a table -
    pretty_table(
         df;
         backend = :text,
         table_format = TextTableFormat(borders = text_table_borders__compact)
    );
end

 ------- ------------ ------------
    game   row_player   col_player
   Int64       String       String
 ------- ------------ ------------
      1         rock     scissors
      2     scissors     scissors
      3     scissors     scissors
      4         rock     scissors
      5         rock     scissors
      6         rock     scissors
      7         rock     scissors
      8     scissors     scissors
      9        paper     scissors
     10         rock     scissors
     11     scissors     scissors
     12     scissors     scissors
     13     scissors     scissors
     14         rock     scissors
     15         rock     scissors
      ⋮            ⋮            ⋮
 ------- ------------ ------------
                  1985 rows omitted


### Check for convergence
To check if the algorithm has converged to an approximate Nash equilibrium, we need to compute the average strategies:

> **Average row player strategy $\bar{p}$**: The average of the probability distributions $\mathbf{p}^{(t)}$ over all rounds. We'll compute that from looking at the weights matrix.
> 
> **Average column player strategy $\bar{q}$**: The average of the one-hot vectors $\mathbf{q}^{(t)}$ (i.e., the empirical frequency of each action chosen by the column player).

From the weights matrix, we can compute $\mathbf{p}^{(t)} = \frac{\text{weights}[t, :]}{\sum \text{weights}[t, :]}$ for each $t$. For $\mathbf{q}^{(t)}$, since the column player plays deterministically, $\mathbf{q}^{(t)}$ is a one-hot vector based on the action in $\text{rps\_sim}[t, 2]$.

Let's compute these averages.

In [6]:
eps_approx = let


    # initialize -
    T = model.T;
    n = model.n;
    
    # Average row player strategy p̄ (compute from the weights)
    p_avg = zeros(Float64, n);
    for t in 1:T
        Φ = sum(weights[t, :]);
        p_t = weights[t, :] / Φ;
        p_avg += p_t; # does element-wise addition
    end
    p_avg = (1/T)*p_avg;

    # Average column player strategy q̄ (empirical frequencies)
    q_avg = zeros(Float64, n);
    for t in 1:T
        action = rps_sim[t, 2];
        q_avg[action] += 1.0;
    end
    q_avg = (1/T)*q_avg;
    
    # Let's print out the average strategies
    println("Average row player strategy (p̄): ", p_avg);
    println("Average column player strategy (q̄): ", q_avg);

    # Check for convergence
    M = model.payoffmatrix;
    max_part = maximum(p_avg' * M); # max_q p̄^T M q = max_j (p̄^T M)_j
    min_part = minimum(M * q_avg);  # min_p p^T M q̄ = min_i (M q̄)_i
    ε_approx = max_part - min_part;
    println("Approximate ε (max_q p̄^T M q - min_p p^T M q̄): ", ε_approx);
    ε_approx;
end;

Average row player strategy (p̄): [0.9846274315653998, 0.0006138829367003676, 0.014758685497901778]
Average column player strategy (q̄): [0.0, 0.0, 1.0]
Approximate ε (max_q p̄^T M q - min_p p^T M q̄): 1.9840135486286994


In [7]:
model.ϵ

0.023437281078104066

If the run did not converge, [the `@assert` statement](https://docs.julialang.org/en/v1/base/base/#Base.@assert) below will fail. What do we see?

In [8]:
let

    x = floor(Int, log10(model.ϵ));
    y = floor(Int, log10(eps_approx));

    @show x, y

    @assert isapprox(x,y)
end

(x, y) = (-2, 0)


AssertionError: AssertionError: isapprox(x, y)

> __Reading the result:__ In the run recorded above, the assertion fails. The measured gap ($\approx 1.98$) is two orders of magnitude larger than the learning rate ($\approx 0.023$), the average row strategy is almost entirely rock, and the column player chose scissors in every round. Against a row player who mostly plays rock, a column player minimizing the row player's payoff would choose paper, not scissors, so this run has not reached an approximate equilibrium.

## Summary
In this lab, we implemented the Multiplicative Weights Algorithm for zero-sum games, using Rock-Paper-Scissors to demonstrate how it finds approximate Nash equilibria through repeated play.

> __Key Takeaways:__
>
> * **Adaptive strategy updates:** The Multiplicative Weights Algorithm adjusts player strategies by updating weights based on payoffs, down-weighting poorly performing actions to encourage mixing.
> * **Convergence to equilibrium:** Over many rounds, the average strategies approach an approximate Nash equilibrium, where neither player can gain by unilaterally changing strategy.
> * **Application to zero-sum games:** The algorithm handles competitive settings like Rock-Paper-Scissors, where the equilibrium is uniform mixing (probability $1/3$ for each action). The time-averaged strategies should approach this equilibrium, which is what the ε-check verifies for a given run.

The Multiplicative Weights Algorithm effectively finds balanced strategies in zero-sum games without prior knowledge of the opponent's moves.
___
