# L10a: Online Learning: Weighted Majority and Multiplicative Weights Algorithms
This lecture will introduce the last topic for the course, namely online learning, and several algorithms that build upon following the advice of experts and providing feedback to the experts based on their performance. The lecture will cover the following topics:

> __Learning Objectives.__
>
> By the end of this lecture, you will be able to define and demonstrate mastery of the following key concepts:
>
> * [Online Learning](https://en.wikipedia.org/wiki/Online_machine_learning) is a type of machine learning where the model learns from a data stream. Thus, unlike traditional supervised machine learning, the model is not trained on a fixed dataset. Instead, it learns from new data as it arrives. This idea dates [back to the 1950s and has been studied extensively in the context of game theory, optimization, and machine learning](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Orabona-OnlineLearning-arXiv-2019.pdf).
> * [Regret minimization](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Orabona-OnlineLearning-AnnRev-2021.pdf) in online learning involves designing algorithms that aim to minimize the difference between the algorithm's performance and that of the best possible strategy in hindsight, often measured as cumulative regret over time. This approach is particularly useful in sequential decision-making scenarios where the environment is dynamic or uncertain, such as financial portfolio optimization or strategic games.
> * [The Multiplicative Weights Algorithm (MWA)](https://en.wikipedia.org/wiki/Multiplicative_weight_update_method) is a simple and powerful algorithm for online learning. The MWA is a type of sequential prediction algorithm that updates the weights of experts based on their performance on past predictions. The key idea is to assign higher weights to experts who perform well and lower weights to experts who perform poorly. This allows the algorithm to adapt to changing data distributions and learn from its mistakes.

This is our first lecture on online learning, which is a foundational component of reinforcement learning (along with concepts like value functions, policies, and exploration strategies). Online learning is an intuitively simple idea: a machine learns from experience, good and bad, much as we do from infancy to adulthood.

I'm super excited to share this module with you, so let's get started!

___

## General Reinforcement/Online Learning Problem
Suppose we have an agent that can be in a state $s \in \mathcal{S}$ and can take an action $a \in \mathcal{A}$. After taking action $a$ in state $s$, the agent receives a reward $r$. But how does the agent learn to choose the best possible action in each state to maximize its cumulative reward over time?

<div>
    <center>
        <img src="figs/Fig-Schematic-RL.svg" width="580"/>
    </center>
</div>

In reinforcement learning, an agent interacts with an environment by observing its current state $s \in \mathcal{S}$, selecting an action $a \in \mathcal{A}$, and receiving a reward that influences its future decisions. We'll explore three distinct approaches to this problem:

* __Multiplicative weights__ methods update expert weights based on the experts' past performance.
* __Bandit algorithms__ operate in stateless environments. On each round, they explore different actions to estimate their rewards and adapt their action-selection strategy based on the outcomes.
* __Q-learning__ is a value-based method that estimates the long-term value of each state-action pair, enabling the agent to learn optimal behavior in environments with temporal and sequential dynamics.


These approaches highlight different strategies for learning from interaction, but they all must balance a fundamental challenge in reinforcement learning: the tradeoff between exploring new actions to gather information and exploiting known actions to maximize reward.

___

## Online Learning and Regret Minimization
Online learning algorithms typically consist of the following components:
* __Sequential data processing__: These algorithms process data sequentially, updating the model incrementally as new data becomes available.
* __Prediction mechanism__: The algorithm makes predictions for each incoming data point based on its current model, often using techniques like online convex optimization or recursive least squares.
* __Immediate feedback and model update__: After each prediction, the algorithm receives immediate feedback and updates its model accordingly, minimizing the discrepancy between predicted and actual outcomes.

Suppose we are playing a number-guessing game in which we have to predict a number in each of $T$ rounds.

For $t = 1, 2, ..., T$ __do__:
1. An _adversary_ picks a number $y_t$ from the set $\{1, 2, ..., N\}$ and keeps it secret.
2. The _aggregator_ (us) makes a prediction $p_t$ for the number, based on some strategy or algorithm.
3. The adversary reveals the true number $y_t$, and we receive feedback using a loss function $l_t(p_t, y_{t})$, which measures the discrepancy between our prediction and the true number.

The cumulative difference between what we selected at each turn and the __best possible strategy in hindsight__ is called the _regret_:
$$
\begin{equation*}
R(T) = \underbrace{\sum_{t=1}^{T} l_t(p_t, y_t)}_{\text{aggregator}} - \underbrace{\min_{u} \sum_{t=1}^{T} l_t(u,y_t)}_{\text{best possible strategy in hindsight}}
\end{equation*}
$$

The objective of online learning is to design algorithms that _minimize cumulative regret_, ensuring that our online learning algorithm performs _nearly_ as well as the best possible strategy in hindsight.
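
To make the regret definition concrete, here is a minimal sketch for the number-guessing game. It assumes a 0-1 loss (one unit of loss for a wrong guess, zero for a correct one) and takes the "best possible strategy in hindsight" to be the best single fixed guess $u$; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
def zero_one_loss(prediction, outcome):
    """Assumed loss: 0 for a correct guess, 1 for an incorrect one."""
    return 0 if prediction == outcome else 1


def regret(predictions, outcomes, candidates, loss=zero_one_loss):
    """Cumulative loss of our predictions minus the loss of the best fixed guess."""
    algorithm_loss = sum(loss(p, y) for p, y in zip(predictions, outcomes))
    # Best fixed strategy in hindsight: the single guess u with the smallest total loss.
    best_fixed_loss = min(sum(loss(u, y) for y in outcomes) for u in candidates)
    return algorithm_loss - best_fixed_loss


# Toy data (illustrative only): numbers are drawn from {1, 2, 3} over T = 6 rounds.
outcomes = [3, 1, 3, 2, 3, 3]
predictions = [1, 1, 2, 2, 3, 1]
N = 3

R = regret(predictions, outcomes, range(1, N + 1))
print(R)  # our loss is 3, always guessing 3 would have lost 2, so R = 1
```

Under these assumptions, a regret of 1 means we made one more mistake than the best fixed guess would have made over the same sequence.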

__Key distinctions__

* __Adaptation__: Online learning algorithms are designed to adapt to changing data distributions and learn from their mistakes, while traditional supervised machine learning algorithms are trained on a fixed dataset.
* __Regret minimization__: Regret is a property of the _algorithm_, not of the player who uses it. Minimizing regret is therefore the job of the algorithm designer rather than of the player.

Now that we have a basic understanding of online learning and regret minimization, let's explore a simple implementation of this idea, the weighted majority algorithm.

___

## Weighted Majority Algorithm (WMA)
Let's start with the __Weighted Majority Algorithm (WMA)__ developed by Littlestone and Warmuth in 1994:
* [Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212-261.](https://www.sciencedirect.com/science/article/pii/S0890540184710091)

We illustrate this approach with an example found in [Arora et al., 2005, Princeton](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Arora-MWsurvey-CS-Princeton.pdf), which is a _prediction from expert advice_ problem. Let's take a look at the framing: 

* __Game__: Suppose we want to predict the daily movement of a stock price, i.e., whether it will go `{up | down}` on the day; in this case, we are predicting a _binary_ outcome which we encode as `{1 | -1}`. We predict whether the price will go up or down each morning. At the market close, we find out the `true` movement of the stock price for the day. _If our prediction is incorrect, we lose a dollar_. Thus, our objective is to _minimize_ our losses throughout the game. However, we don't do this on our own. In making our predictions, we watch the predictions of $n$ experts (whose predictions may be correlated and who may or may not be correct).
* __Goal__: The weighted majority algorithm (us) must limit our losses to approximately the same as those of the _best_ expert. However, we do this without knowing which expert is the best, i.e., it is not known until the end of the sequence of days who the best expert is, but we must make decisions each day. The algorithm does this by maintaining a weighting of the experts. Initially, all have equal weight. As time passes, some experts are seen to make better predictions than others, and the algorithm shifts relative weight toward them by penalizing the experts who err.

We play this game between an omniscient _adversary_ (nature, i.e., the market) and an _aggregator_ (us) who is advised by $n$ experts; we select $n$ to be odd to avoid ties. The game proceeds in $T$ rounds $t = 1, 2, \ldots, T$. During each round the aggregator (us) makes a _binary_ prediction $\hat{y}_t \in \{-1, 1\}$, and the adversary (market) then reveals the true outcome $y_t \in \{-1, 1\}$. Initially, the experts have weights $\left\{w_{i}^{(1)} = 1 \mid i = 1, 2, \ldots, n\right\}$.

### Algorithm
For each round $t=1,2,\dots,T$:
1. The aggregator (us) makes a prediction $\hat{y}_t \in \{-1, 1\}$ based on the weighted majority of the experts' predictions. Let $w^{(t)}$ denote the total weight of all experts predicting `1` at time $t$. If $w^{(t)}\geq\sum_{i}w_{i}^{(t)}/2$, then the aggregator predicts `1`; otherwise it predicts `-1`.
2. The adversary (market) reveals the actual outcome $y_t \in \{-1, 1\}$.
3. We decrease the weights of the experts who predicted incorrectly. For each expert $i$ who predicted incorrectly, we update the weight: $w_{i}^{(t+1)} = w_{i}^{(t)}(1-\epsilon)$, where $0<\epsilon\leq{1/2}$ is a learning rate parameter.
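
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the argument layout (`expert_predictions[t][i]` is the advice of expert `i` in round `t`), and the toy data are assumptions made for this example.

```python
def weighted_majority(expert_predictions, outcomes, epsilon=0.5):
    """Run the weighted majority rule; all predictions and outcomes are in {-1, 1}."""
    n = len(expert_predictions[0])
    weights = [1.0] * n          # every expert starts with weight 1
    mistakes = 0                 # mistakes made by the aggregator
    for advice, y in zip(expert_predictions, outcomes):
        # Step 1: predict 1 if the experts voting 1 hold at least half the total weight.
        total = sum(weights)
        weight_up = sum(w for w, a in zip(weights, advice) if a == 1)
        y_hat = 1 if weight_up >= total / 2 else -1
        # Step 2: the outcome y is revealed; count our mistake, if any.
        if y_hat != y:
            mistakes += 1
        # Step 3: multiply the weight of every incorrect expert by (1 - epsilon).
        weights = [w * (1 - epsilon) if a != y else w
                   for w, a in zip(weights, advice)]
    return mistakes, weights


# Toy example (illustrative only): 3 experts, 4 rounds, the market always goes up.
expert_predictions = [
    [1, -1, 1],
    [-1, -1, 1],
    [-1, 1, 1],
    [-1, -1, 1],
]
outcomes = [1, 1, 1, 1]

mistakes, weights = weighted_majority(expert_predictions, outcomes, epsilon=0.5)
print(mistakes, weights)  # 1 [0.125, 0.125, 1.0]
```

In this toy run the third expert is always right and keeps weight 1, while the other two are halved each time they err; the aggregator is misled only once (in round 2) before the weights tilt toward the reliable expert.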

__Theorem__: The weighted majority algorithm comes with a theoretical guarantee that bounds the number of mistakes the aggregator makes. Let $m_{i}^{(t)}$ be the number of mistakes made by expert $i$ up to time $t$, and let $m^{(t)}$ be the number of mistakes made by the aggregator (us) up to time $t$. Then, for every expert $i$, we have:
$$
\begin{align*}
m^{(t)} \leq \frac{2\ln(n)}{\epsilon} + 2\left(1+\epsilon\right)m_{i}^{(t)}
\end{align*}
$$

The [proof of this theorem can be found here](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-6/L6a/docs/Arora-MWsurvey-CS-Princeton.pdf). The WMA is deterministic and simple to implement (any randomness comes from the adversary).
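
To get a feel for the size of this bound, consider a worked example with illustrative numbers (they are assumptions, not taken from the source): $n = 10$ experts, $\epsilon = 1/2$, and a best expert who has made $m_{i}^{(t)} = 20$ mistakes so far.

$$
\begin{align*}
m^{(t)} & \leq \frac{2\ln(10)}{1/2} + 2\left(1+\tfrac{1}{2}\right)\cdot 20 \\
 & = 4\ln(10) + 60 \approx 9.21 + 60 = 69.21
\end{align*}
$$

Since $m^{(t)}$ is an integer, the aggregator makes at most 69 mistakes in this scenario. The first term depends only on the number of experts and $\epsilon$, while the second term scales with the mistakes of the expert we compare against.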

___

## Multiplicative Weights Algorithm (MWA)
The **Multiplicative Weights Algorithm (MWA)** is a simple yet robust online learning method that embodies a similar idea to the weighted majority algorithm, i.e., learning from expert advice. Here, the learning rate $\eta$ plays a role analogous to $\epsilon$ in the Weighted Majority Algorithm, controlling how quickly the weights adapt.

Let’s walk through the setup and sketch out the algorithm.

### Problem Setting
Suppose we are faced with a repeated decision-making task over rounds $t = 1, 2, \ldots, T$. At each round, we have access to $N$ experts, each providing a recommendation or prediction. Our goal is to combine their advice adaptively in order to make strong decisions over time, even in adversarial or uncertain environments.

* Let $\mathbf{p}^{(t)} = \{p_1^{(t)}, p_2^{(t)}, \ldots, p_N^{(t)}\}$ denote our belief distribution over experts at round $t$, updated iteratively based on their past performance.
* We select an expert by sampling from this distribution—for example, using a Categorical distribution:
  $i \sim \texttt{Categorical}(\mathbf{p}^{(t)})$—and follow that expert’s recommendation.
* After the decision is made, the environment (or adversary) reveals the true outcome. We then compute a cost vector $\mathbf{m}^{(t)} = \{m_1^{(t)}, \dots, m_N^{(t)}\}$, where $m_i^{(t)} \in [-1, 1]$ denotes the cost incurred by expert $i$ at time $t$. A correct prediction receives a cost of $-1$, and an incorrect one receives a cost of $+1$.

The **expected loss** of the algorithm at round $t$ is:
$$
L^{(t)} = \sum_{i=1}^{N} p_i^{(t)} m_i^{(t)}
$$
and the **total expected loss** over $T$ rounds is:
$$
L_A = \sum_{t=1}^{T} L^{(t)} = \sum_{t=1}^{T} \sum_{i=1}^{N} p_i^{(t)} m_i^{(t)}
$$
The goal of the algorithm is to keep this cumulative loss as close as possible to that of the **best expert in hindsight**. The difference between the two has a special name, the **regret**:
$$
R(T) = \underbrace{\sum_{t=1}^{T} \sum_{i=1}^{N} p_i^{(t)} m_i^{(t)}}_{\text{Algorithm's Loss}} - \underbrace{\min_{i \in \{1, \dots, N\}} \sum_{t=1}^{T} m_i^{(t)}}_{\text{Best Expert's Loss}}
$$
This quantity measures how much worse the algorithm performs compared to the best fixed decision rule. A key property of successful online learning algorithms like MWA is that they achieve **sublinear regret** which implies that the **average regret per round** vanishes as $T \to \infty$:
$$
\frac{R(T)}{T} \to 0
$$
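
The expected loss and the regret defined above are straightforward to compute once the distributions $\mathbf{p}^{(t)}$ and cost vectors $\mathbf{m}^{(t)}$ are known. The sketch below assumes both are stored as lists of lists indexed by round and then by expert; the function names and the toy numbers are hypothetical.

```python
def expected_loss(p, m):
    """Expected loss in one round: sum_i p_i * m_i."""
    return sum(p_i * m_i for p_i, m_i in zip(p, m))


def mwa_regret(distributions, costs):
    """Total expected loss minus the cumulative cost of the best single expert."""
    algorithm_loss = sum(expected_loss(p, m) for p, m in zip(distributions, costs))
    n_experts = len(costs[0])
    best_expert_loss = min(sum(m[i] for m in costs) for i in range(n_experts))
    return algorithm_loss - best_expert_loss


# Toy data (illustrative only): 2 experts, 3 rounds, costs in {-1, +1}.
distributions = [[0.5, 0.5], [0.25, 0.75], [0.25, 0.75]]
costs = [[1, -1], [-1, 1], [1, -1]]

R = mwa_regret(distributions, costs)
print(R)  # total expected loss is 0.0, the best expert's loss is -1, so R = 1.0
```

Note that with costs in $[-1, 1]$ the best expert's cumulative loss can be negative, so a positive regret does not require the algorithm's own loss to be positive.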


### Algorithm

__Initialize__: Fix a learning rate $0 < \eta\leq{1}/{2}$ and, for each expert $i = 1, \dots, N$, initialize the weight $w_{i}^{(1)} = 1$.

For $t=1,2,\dots,T$:
1. Choose expert $i$ with probability $p_{i}^{(t)} = w_{i}^{(t)}/\sum_{j=1}^{N}w_{j}^{(t)}$ and follow that expert's prediction. Let $\hat{y}_{i}^{(t)}$ denote the prediction of expert $i$ for the outcome of the experiment at time $t$.
2. The adversary (nature) reveals the true outcome $y_{t}$ of the experiment at time $t$. Compute the cost $m_{i}^{(t)}$ of every expert $i = 1, \dots, N$:
    $$
    m_i^{(t)} =
    \begin{cases}
    -1 & \text{if } \hat{y}_i^{(t)} = y_t \quad \text{(correct)} \\
    +1 & \text{if } \hat{y}_i^{(t)} \neq y_t \quad \text{(incorrect)}
    \end{cases}
    $$
3. Update the weight of every expert $i$ (normalizing the updated weights gives the probability distribution for the next round):
$$
\begin{align*}
w_{i}^{(t+1)} = w_{i}^{(t)}\cdot\left(1-\eta\cdot{m_{i}^{(t)}}\right)
\end{align*}
$$
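
The loop above can be sketched as follows. This is a minimal illustration under stated assumptions: it works in the full-information setting, where the cost of every expert is observed and every weight is updated each round, and it uses Python's standard `random` module for the sampling step. The function name, argument layout, and toy data are hypothetical.

```python
import random


def multiplicative_weights(expert_predictions, outcomes, eta=0.5, seed=0):
    """Run the multiplicative weights update with costs in {-1, +1}."""
    rng = random.Random(seed)    # seeded generator so the sampled experts are reproducible
    n = len(expert_predictions[0])
    weights = [1.0] * n
    total_expected_loss = 0.0
    chosen = []                  # index of the expert we followed in each round
    for advice, y in zip(expert_predictions, outcomes):
        # Step 1: normalize the weights and sample an expert to follow.
        total = sum(weights)
        p = [w / total for w in weights]
        chosen.append(rng.choices(range(n), weights=p, k=1)[0])
        # Step 2: cost is -1 for a correct expert and +1 for an incorrect one.
        m = [-1 if a == y else 1 for a in advice]
        total_expected_loss += sum(p_i * m_i for p_i, m_i in zip(p, m))
        # Step 3: multiplicative update w_i <- w_i * (1 - eta * m_i) for every expert.
        weights = [w * (1 - eta * m_i) for w, m_i in zip(weights, m)]
    return weights, total_expected_loss, chosen


# Toy example (illustrative only): expert 0 is always right, expert 1 always wrong.
expert_predictions = [[1, -1], [1, -1], [1, -1]]
outcomes = [1, 1, 1]

weights, loss, chosen = multiplicative_weights(expert_predictions, outcomes, eta=0.5)
print(weights)  # [3.375, 0.125]
```

With $\eta \leq 1/2$ and costs in $[-1, 1]$, the factor $1 - \eta m_i^{(t)}$ stays between $1/2$ and $3/2$, so the weights remain positive; correct experts are boosted and incorrect ones are shrunk.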

This is a super simple algorithm with some very nice properties. The weights are updated multiplicatively based on the performance of each expert, hence the name Multiplicative Weights Algorithm. The learning rate $\eta$ controls how aggressively the algorithm adapts to the experts' performance. And there is a theoretical guarantee that the algorithm will perform nearly as well as the best fixed expert in hindsight. Let's check that out!

### Theoretical Regret Bound
Assume all costs lie in the range $m_i^{(t)} \in [-1, 1]$, and fix a learning rate $\eta \leq \frac{1}{2}$. Then the Multiplicative Weights Algorithm (MWA) guarantees that for any expert $i$, after $T$ rounds:
$$
\begin{align*}
\sum_{t=1}^{T} \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)} & \leq \sum_{t=1}^{T} m_i^{(t)} + \eta \underbrace{\sum_{t=1}^{T} |m_i^{(t)}|}_{= T} + \frac{\ln N}{\eta} \\
\underbrace{\sum_{t=1}^{T} \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)} - \overbrace{\sum_{t=1}^{T} m_i^{(t)}}^{\text{best expert}}}_{R(T)} & \leq \eta T + \frac{\ln N}{\eta} \\
R(T) & \leq \eta T + \frac{\ln N}{\eta}\quad\blacksquare
\end{align*}
$$
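where we used the fact that $|m_i^{(t)}| = 1$. By choosing $\eta = \sqrt{\frac{\ln N}{T}}$ (which satisfies $\eta \leq \frac{1}{2}$ whenever $T \geq 4\ln N$), both terms on the right-hand side equal $\sqrt{T \ln N}$, and the regret bound becomes sublinear: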
$$
R(T) \leq 2 \sqrt{T \ln N}
$$
This ensures that the algorithm's **average regret per round** vanishes as $T \to \infty$, meaning that MWA performs nearly as well as the best fixed expert in hindsight.
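
The effect of the learning-rate choice can be checked numerically. The sketch below evaluates the bound $\eta T + \ln N / \eta$ for a tuned and an untuned $\eta$; the function names and the values of $T$ and $N$ are illustrative assumptions.

```python
import math


def regret_bound(eta, T, N):
    """Right-hand side of the regret bound: eta * T + ln(N) / eta."""
    return eta * T + math.log(N) / eta


def tuned_eta(T, N):
    """Learning rate that balances the two terms of the bound."""
    return math.sqrt(math.log(N) / T)


# Illustrative values: T = 10,000 rounds and N = 10 experts.
T, N = 10_000, 10
eta_star = tuned_eta(T, N)
tuned_bound = regret_bound(eta_star, T, N)
untuned_bound = regret_bound(0.5, T, N)

# The tuned bound matches 2 * sqrt(T * ln N) up to floating-point error.
print(math.isclose(tuned_bound, 2 * math.sqrt(T * math.log(N))))  # True
# The tuned learning rate respects the requirement eta <= 1/2.
print(eta_star <= 0.5)  # True
# A fixed eta = 1/2 gives a bound that grows linearly in T, hence much larger here.
print(untuned_bound > tuned_bound)  # True
# Average regret per round under the tuned bound.
print(tuned_bound / T)
```

The tuned $\eta$ depends on the horizon $T$, so this choice assumes the number of rounds is known in advance.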

___

### Additional Resources
This module draws on and was inspired by several sources: [Arora et al., The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, Theory of Computing, Volume 8 (2012), pp. 121–164](https://theoryofcomputing.org/articles/v008a006/v008a006.pdf), the [15-859 CMU Lecture 16](https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture16.pdf), and the [15-859 CMU Lecture 17](https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture17.pdf).

___

## Summary

In this notebook, we've explored online learning through expert advice aggregation and regret minimization frameworks:

> __Key takeaways:__
>
> 1. **Online learning and regret minimization**: Online learning processes data sequentially and adapts over time, unlike batch learning on fixed datasets. Regret measures the difference between the algorithm's cumulative loss and the best fixed strategy's loss. Successful algorithms achieve sublinear regret, where average regret per round approaches zero as the number of rounds increases.
> 2. **Weighted Majority Algorithm as a foundation**: The Weighted Majority Algorithm maintains weights for each expert and predicts based on the weighted majority of their predictions. When an expert predicts incorrectly, its weight is reduced multiplicatively by a factor of $(1 - \epsilon)$. The aggregator's total mistakes are bounded by a term logarithmic in the number of experts plus $2(1+\epsilon)$ times the mistakes of the best expert.
> 3. **Multiplicative Weights Algorithm for robust learning**: The Multiplicative Weights Algorithm converts expert weights into a probability distribution and selects experts randomly according to this distribution. It guarantees sublinear regret, performing nearly as well as the best expert in hindsight. The algorithm is simple to implement and applies to various sequential decision problems.

These expert aggregation methods establish the foundational principles for adaptive learning systems that balance exploration and exploitation in uncertain, dynamic environments.

___