# L6a: The Multiplicative Weights Algorithm (MWA)

---

In this lecture, we will discuss the Multiplicative Weights Algorithm (MWA) and its applications in online learning. The key ideas discussed in this lecture are as follows:
* [Online Learning](https://en.wikipedia.org/wiki/Online_machine_learning) is a type of machine learning where the model learns from a stream of data. Thus, unlike traditional supervised machine learning, the model is not trained on a fixed dataset. Instead, it learns from new data as it arrives. This idea dates back to the 1950s and has been studied extensively in the context of game theory, optimization, and machine learning.
* [The Multiplicative Weights Algorithm (MWA)](https://en.wikipedia.org/wiki/Multiplicative_weight_update_method) is a simple and powerful algorithm for online learning. The MWA is a type of sequential prediction algorithm that updates the weights of a set of experts based on their performance on past predictions. The key idea is to assign higher weights to experts that perform well and lower weights to experts that perform poorly. This allows the algorithm to adapt to changing data distributions and learn from its mistakes.

Today, we are going to borrow notes from several sources: [Arora et al., The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, Theory of Computing, Volume 8 (2012), pp. 121–164](https://theoryofcomputing.org/articles/v008a006/v008a006.pdf), the [15-859 CMU Lecture 16](https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture16.pdf), and the [15-859 CMU Lecture 17](https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture17.pdf).

---

## Weighted Majority Algorithm
Let's start with a precursor to the Multiplicative Weights Algorithm (MWA) called the Weighted Majority Algorithm, developed by Littlestone and Warmuth in 1994:
* [Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212-261.](https://www.sciencedirect.com/science/article/pii/S0890540184710091)

We illustrate this approach with an example found in [Arora et al., 2005, Princeton](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Arora-MWsurvey-CS-Princeton.pdf), which is a _prediction from expert advice_ problem. Let's take a look at the framing:

* __Game__: Suppose we want to predict the daily movement of a stock price, i.e., whether it will go `{up | down}` on the day; in this case, we are predicting a _binary_ outcome which we encode as `{1 | -1}`. Each morning we predict whether the price will go up or down that day. At the market close, we find out the `true` movement of the stock price for the day. _If our prediction is incorrect, we lose a dollar_. Thus, our objective is to _minimize_ our losses over the course of the game. However, we don't do this on our own. In making our predictions, we watch the predictions of $n$ experts (whose predictions may be correlated, and who may or may not be correct).

The first approach we'll consider to play this game is the Weighted Majority Algorithm (WM).
* __Goal__: The weighted majority algorithm (us) must keep our losses close to those of the _best_ expert. However, we must do this without knowing which expert is the best: the best expert is not known until the end of the sequence of days, but we must make decisions each day in the meantime. The algorithm does this by maintaining a weighting of the experts. Initially, all experts have equal weight. As time goes on, some experts are seen to make better predictions than others, and the algorithm increases their relative weight accordingly.

We play this game between an omniscient Adversary (nature, i.e., the market) and an Aggregator (us) who is advised by $n$ experts; we select $n$ as odd to avoid ties. The game proceeds in $T$ rounds $t = 1, 2, \ldots, T$. During each round, the aggregator (us) makes a _binary_ prediction $\hat{y}_t \in \{-1, 1\}$, and the adversary (market) then reveals the true outcome $y_t \in \{-1, 1\}$. Initially, the experts have weights $w^{(1)} = \left\{w_{i}^{(1)} = 1 \mid i = 1, 2, \ldots, n\right\}$.

In each round $t=1,2,\dots,T$:
1. The aggregator (us) makes a prediction $\hat{y}_t \in \{-1, 1\}$ based on the weighted majority of the experts' predictions. Let $y_{i}^{(t)} \in \{-1, 1\}$ denote the prediction of expert $i$ at time $t$. If the total weight of all experts predicting `1` is at least half of the total weight, i.e., $\sum_{i\,:\,y_{i}^{(t)}=1}w_{i}^{(t)}\geq\frac{1}{2}\sum_{i}w_{i}^{(t)}$, then the aggregator predicts `1`, otherwise it predicts `-1`.
2. The adversary (market) reveals the true outcome $y_t \in \{-1, 1\}$.
3. We decrease the weights of the experts who predicted incorrectly. For each expert $i$ who predicted incorrectly, we update the weight: $w_{i}^{(t+1)} = w_{i}^{(t)}(1-\epsilon)$, where $0<\epsilon\leq{1/2}$ is a learning rate parameter. The weights of the experts who predicted correctly are unchanged.
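
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weighted_majority`, its argument layout (one list of expert predictions per round), and the toy data are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    expert_predictions: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    epsilon: float = 0.5,
) -> Tuple[int, List[int], List[float]]:
    """Return (aggregator mistakes, per-expert mistakes, final weights)."""
    if not 0.0 < epsilon <= 0.5:
        raise ValueError("epsilon must satisfy 0 < epsilon <= 1/2")

    n = len(expert_predictions[0])
    weights = [1.0] * n          # all experts start with weight 1
    expert_mistakes = [0] * n
    aggregator_mistakes = 0

    for predictions, outcome in zip(expert_predictions, outcomes):
        # Step 1: predict 1 if the experts predicting 1 hold at least half the weight
        weight_up = sum(w for w, p in zip(weights, predictions) if p == 1)
        guess = 1 if weight_up >= sum(weights) / 2 else -1

        # Step 2: the adversary reveals the outcome; count our mistake, if any
        if guess != outcome:
            aggregator_mistakes += 1

        # Step 3: multiply the weight of every incorrect expert by (1 - epsilon)
        for i, p in enumerate(predictions):
            if p != outcome:
                weights[i] *= 1.0 - epsilon
                expert_mistakes[i] += 1

    return aggregator_mistakes, expert_mistakes, weights


# Hypothetical toy data: 4 rounds, 3 experts, outcomes encoded as {1, -1}
toy_predictions = [[1, -1, 1], [1, 1, -1], [-1, -1, 1], [1, -1, -1]]
toy_outcomes = [1, -1, 1, -1]
mistakes, expert_mistakes, weights = weighted_majority(toy_predictions, toy_outcomes)
print(mistakes, expert_mistakes, weights)
```

In this toy run the third expert is never wrong, so its weight stays at 1 while the other two weights shrink, and the aggregator ends up following it.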

The weighted majority algorithm has the following theoretical guarantee (which bounds the number of mistakes made by the aggregator):
* __Theorem__: After $t$ steps, let $m_{i}^{(t)}$ be the number of mistakes made by expert $i$ up to time $t$ and $m^{(t)}$ be the total number of mistakes made by the aggregator (us). Then for every expert $i$, we have:
$$
\begin{align*}
m^{(t)} \leq \frac{2\ln(n)}{\epsilon} + 2\left(1+\epsilon\right)m_{i}^{(t)}
\end{align*}
$$
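
To get a feel for the bound, we can evaluate its right-hand side for some illustrative numbers. The helper name `wm_mistake_bound` and the values $n = 9$, $\epsilon = 1/2$, and $m_{i}^{(t)} = 10$ are assumptions chosen only for this example.

```python
import math


def wm_mistake_bound(n_experts: int, epsilon: float, expert_mistakes: int) -> float:
    """Right-hand side of the bound: 2 ln(n) / epsilon + 2 (1 + epsilon) m_i."""
    return 2.0 * math.log(n_experts) / epsilon + 2.0 * (1.0 + epsilon) * expert_mistakes


# Hypothetical setting: 9 experts, epsilon = 1/2, best expert made 10 mistakes
bound = wm_mistake_bound(n_experts=9, epsilon=0.5, expert_mistakes=10)
print(round(bound, 2))  # approximately 38.79
```

Since the bound holds for every expert $i$, it holds in particular for the best one. The first term grows as $\epsilon$ shrinks while the second term shrinks, so the choice of $\epsilon$ trades off one against the other.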

The [proof of this theorem can be found here](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Arora-MWsurvey-CS-Princeton.pdf). The weighted majority algorithm is deterministic and has a simple implementation (any randomness comes from the adversary).

## Hedge strategy
Next, let's look at a modification to the weighted majority algorithm called the _Hedge strategy_ introduced by Freund and Schapire in 1997:
* [Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55:119–139, August 1997](https://www.cis.upenn.edu/~mkearns/teaching/COLT/adaboost.pdf)

In the _Hedge strategy_, the aggregator (us) makes a prediction $\hat{y}_t \in \{-1, 1\}$ by sampling from a probability distribution over the experts, which is computed from a modified weight update.
Suppose we maintain a probability $p_{i}^{(t)}$ for each expert $i$ at time $t$, where $p_{i}^{(t)} = {w_{i}^{(t)}/\Phi^{(t)}}$ and $\Phi^{(t)} = \sum_{i}w_{i}^{(t)}$ is a normalization factor (called the _potential_). Initially, the experts have weights $w^{(1)} = \left\{w_{i}^{(1)} = 1 \mid i = 1, 2, \ldots, n\right\}$, which gives the uniform (maximum entropy) distribution over the experts.
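
As a quick check of these definitions, the sketch below normalizes a weight vector by its potential and compares entropies. The helper names `expert_distribution` and `entropy`, and the example weights, are assumptions introduced for illustration.

```python
import math
from typing import List, Sequence


def expert_distribution(weights: Sequence[float]) -> List[float]:
    """Normalize weights by the potential (their sum) to get probabilities."""
    potential = sum(weights)
    if potential <= 0.0:
        raise ValueError("the potential must be positive")
    return [w / potential for w in weights]


def entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in probabilities if q > 0.0)


# Initial weights of 1 for n = 4 experts give the uniform distribution
p_initial = expert_distribution([1.0, 1.0, 1.0, 1.0])

# Hypothetical weights after some experts have been penalized
p_later = expert_distribution([1.0, 0.5, 0.25, 0.25])

print(p_initial, entropy(p_initial))
print(p_later, entropy(p_later))
```

The uniform distribution attains the largest possible entropy for $n$ experts, $\ln(n)$, and any unequal weighting gives a strictly smaller value.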

In each round $t=1,2,\dots,T$:
1. The aggregator (us) computes the probability distribution $\mathbf{p}^{(t)} = \left\{p_{1}^{(t)}, p_{2}^{(t)}, \ldots, p_{n}^{(t)}\right\}$ over the experts based on their weights up to time $t$.
2. Each expert $i$ makes a prediction $y_{i}^{(t)} \in \{-1, 1\}$.

There are two key differences between the Weighted Majority Algorithm and the Hedge strategy. First, the choice dynamics of the Weighted Majority Algorithm are deterministic, while the choices in the Hedge strategy are probabilistic. Second, the Hedge strategy uses an exponential update rule for the weights, while the Weighted Majority Algorithm uses a linear update rule (each mistake multiplies the weight by $1-\epsilon$).
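
To make the contrast concrete, here is a rough sketch of one way a Hedge-style loop could look. Several pieces are assumptions that are not spelled out in the steps above: the aggregator samples a single expert from $\mathbf{p}^{(t)}$ and copies its prediction, each expert suffers a 0/1 loss $\ell_{i}^{(t)}$ (1 if wrong, 0 if right), and the weights are updated as $w_{i}^{(t+1)} = w_{i}^{(t)}\exp\left(-\epsilon\,\ell_{i}^{(t)}\right)$. The function name `hedge`, the `seed` argument, and the toy data are likewise hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def hedge(
    expert_predictions: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    epsilon: float = 0.5,
    seed: int = 0,
) -> Tuple[int, List[float]]:
    """Return (aggregator mistakes, final weights) for a Hedge-style loop."""
    rng = random.Random(seed)    # local generator so runs are reproducible
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0

    for predictions, outcome in zip(expert_predictions, outcomes):
        # Compute the distribution over experts from the current weights
        potential = sum(weights)
        probabilities = [w / potential for w in weights]

        # Assumption: sample one expert and follow its prediction
        chosen = rng.choices(range(n), weights=probabilities, k=1)[0]
        if predictions[chosen] != outcome:
            mistakes += 1

        # Assumption: exponential update with a 0/1 loss per expert
        weights = [
            w * math.exp(-epsilon * (1.0 if p != outcome else 0.0))
            for w, p in zip(weights, predictions)
        ]

    return mistakes, weights


# Hypothetical toy data: 4 rounds, 3 experts, outcomes encoded as {1, -1}
toy_predictions = [[1, -1, 1], [1, 1, -1], [-1, -1, 1], [1, -1, -1]]
toy_outcomes = [1, -1, 1, -1]
hedge_mistakes, hedge_weights = hedge(toy_predictions, toy_outcomes)
print(hedge_mistakes, hedge_weights)
```

The final weights depend only on the experts' losses, so they are the same on every run, whereas the number of aggregator mistakes depends on the random draws.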
