# L12a: Stochastic Multi-Armed Bandit Problems
In this lecture, we explore the fundamentals of Reinforcement Learning (RL), a branch of machine learning focused on how agents should take actions in an environment to maximize cumulative reward over time. 

> __Learning Objectives:__
> 
> By the end of this module, you will be able to define and demonstrate mastery of the following key concepts:
>
> * __Exploration vs. Exploitation__: In reinforcement learning, the exploration vs. exploitation trade-off forces an agent to balance trying new actions to discover potentially better rewards (exploration) against leveraging its current knowledge to maximize immediate payoff (exploitation). Striking the right balance is crucial for learning an optimal policy that performs well both now and in the long run.
> * __Multi-Armed Bandits__: A sequential decision problem where an agent repeatedly selects from $K$ arms, each with an unknown reward distribution, and aims to maximize cumulative reward by striking a balance between exploration (testing less-tried arms) and exploitation (choosing the empirically best arm). Algorithms like ε-greedy, Upper Confidence Bound, and Thompson Sampling provide principled strategies with provable performance guarantees for managing this trade-off.
> * __Regret Minimization__: The fundamental metric for evaluating bandit algorithms, measuring the cumulative difference between an algorithm's performance and the best possible strategy in hindsight, with successful algorithms achieving sublinear regret bounds that ensure near-optimal performance over time.


At its core, RL is about learning from doing: an agent observes the state of its environment, takes actions, i.e., makes decisions, receives rewards, and updates its knowledge to improve future decision-making. Let’s get started!
___

## Exploration vs. Exploitation
In reinforcement learning, the problem on the surface is deceptively simple: an agent is in a state $s$ and can take an action $a\in A_{s}$, where $A_{s}$ is the set of actions currently available to the agent. The agent chooses an action $a$, implements it, receives a reward $r$, and transitions to a new state $s^{\prime}$. The goal is to learn a policy that maximizes cumulative reward over time, i.e., the best possible action in each state.

But when you think about it, the problem is actually quite complex. The agent must make decisions based on incomplete information, which gives rise to the exploration vs. exploitation trade-off, a fundamental challenge that every agent must navigate. It involves balancing two competing objectives:
1. **Exploration**: Trying new actions to discover potentially better rewards. This is essential for learning about the environment and finding optimal policies. Taking purely random actions, or actions that have not been tried often, to gather information about their outcomes and rewards is an example of exploration.
2. **Exploitation**: Leveraging current knowledge to maximize immediate payoff. This involves choosing actions that have previously yielded high rewards based on the agent's experience. However, if the agent only exploits, it may miss out on discovering better actions that could yield higher rewards in the long run. The agent never tries anything new—how boring!

Striking the right balance between exploration and exploitation is crucial for learning an optimal policy that performs well both now and in the long run. If an agent explores too much, it may miss out on immediate rewards. If it exploits too much, it may fail to discover better long-term strategies.

The exploration-exploitation trade-off is often formalized in algorithms that guide the agent's decision-making process. These algorithms provide principled strategies for managing the trade-off, ensuring that the agent can learn effectively while maximizing cumulative reward over time.
___

## What Is a Bandit Problem?  
A bandit problem is a class of online (sequential) decision-making tasks in which an agent repeatedly chooses among $K$ options—called _arms_—and receives a reward based on that choice.

At each round $t$, the agent selects one of the $K$ arms according to some decision rule, pulls it, and observes a reward $r_{t}$. Good pulls yield higher rewards; poor pulls yield lower rewards (or even losses). The agent’s goal is to maximize cumulative reward over time.

Here are a few examples of applications of bandit problems:
* __Clinical Trials__: Balances learning about new treatments (exploration) with assigning patients to the current best therapy (exploitation).
* __Financial Portfolio Design__: Dynamically allocates capital across assets to maximize returns while testing novel investments.
* __Adaptive Routing__: Chooses network paths to minimize delay, trading off probing unknown routes against using established fast ones.
* __Recommendation Systems__: Iteratively selects items to display—like movies or products—balancing novel suggestions against proven favorites.

__Additional Resources__: Our lecture notes for this week were inspired by Chapter 1 of "Introduction to Multi-Armed Bandits" by Aleksandrs Slivkins. This is an excellent resource (albeit quite technical) for learning more about bandit problems. [The book is available online](https://arxiv.org/abs/1904.07272). We also drew material from the [Bandit problem Thompson sampling tutorial by Russo et al., 2020](https://arxiv.org/abs/1707.02038).

___


In the stochastic multi-armed bandit problem, the agent must choose an action $a$ from the set of all possible actions $\mathcal{A}$, where $\lvert\mathcal{A}\rvert = K$, during each round $t = 1,2,\dots, T$ of the game or task. The agent receives a reward $r_{a}$ from the environment, where $r_{a}$ is sampled from some unknown distribution $\mathcal{D}_{a}$.

For $t = 1,2,\dots,T$:
1. _Agent_: The agent picks an action $a_{t} \in \mathcal{A}$. How the agent makes this choice is one of the main differences between algorithms for solving this problem.
2. _Adversary_: The agent implements action $a_{t}$, and the adversary (nature) returns a reward $r_{t}\in\left[0,1\right]$ sampled from the (unknown) distribution $\mathcal{D}_{a_{t}}$.
3. _Feedback_: The agent observes $r_{t}$ but nothing else. It cannot see the distribution $\mathcal{D}_{a}$; only the _adversary_ can see this.

The agent is interested in learning the mean of the reward distribution of each arm, $\mu(a) = \mathbb{E}_{r\sim\mathcal{D}_{a}}\left[r\right]$, by experimenting against the world (the adversary). The agent's goal is to maximize its total reward, while the algorithm designer's goal is to minimize the _regret_ of the algorithm that the agent uses to choose $a\in\mathcal{A}$.
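
The interaction protocol above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the `BernoulliBandit` class, the `run_random_agent` helper, and the three arm means are all hypothetical choices, and the reward distributions are assumed to be Bernoulli so that $r_{t}\in\{0,1\}\subset\left[0,1\right]$.

```python
import random


class BernoulliBandit:
    """K-armed bandit (hypothetical helper): arm a pays 1 with probability means[a], else 0."""

    def __init__(self, means, seed=None):
        self.means = list(means)        # true means mu(a), hidden from the agent
        self.rng = random.Random(seed)

    @property
    def K(self):
        return len(self.means)

    def pull(self, a):
        """Sample one reward from the distribution D_a of arm a."""
        return 1.0 if self.rng.random() < self.means[a] else 0.0


def run_random_agent(bandit, T, seed=None):
    """Play T rounds, choosing arms uniformly at random; return the (a_t, r_t) history."""
    rng = random.Random(seed)
    history = []
    for t in range(1, T + 1):
        a_t = rng.randrange(bandit.K)   # step 1: the agent picks an arm
        r_t = bandit.pull(a_t)          # step 2: nature samples a reward
        history.append((a_t, r_t))      # step 3: the agent sees r_t and nothing else
    return history


bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
history = run_random_agent(bandit, T=1000, seed=1)
total_reward = sum(r for _, r in history)
print(f"total reward over {len(history)} rounds: {total_reward}")
```

The random agent ignores everything it observes, so it serves only as a baseline: any sensible bandit algorithm replaces the `rng.randrange` line with a rule that uses the reward history.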

### Regret
Regret measures the difference between the reward that could have been achieved by always making the best decision (in hindsight), i.e., always playing the arm with the highest mean reward, and the reward the agent actually collected over the rounds.
> __Perspective__: Regret is a property of the algorithm, not the agent (which only cares about reward). Each decision-making framework the agent employs may lead to a different bound on regret. Thus, the goal of the algorithm designer is to minimize the regret of the agent's algorithm.

__Definition__: _Regret_. Let $\mu^{\star}$ be the mean reward of the best arm, i.e., $\mu^{\star} = \max_{a\in\mathcal{A}}\mu(a)$. The regret $R(T)$ of an algorithm after playing the game for $T$ rounds is defined as:
$$
\begin{align*}
R(T) = T\cdot\mu^{\star} - \sum_{t=1}^{T}r_{t}
\end{align*}
$$
The first term is the expected total reward that would have been obtained if the best arm had been chosen in every one of the $T$ rounds. The second term is the total reward obtained by the agent over the $T$ rounds, where $r_{t}$ is the reward received at round $t$ for the action chosen by the agent.
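
As a quick numerical check of the definition, the sketch below evaluates $R(T)$ for two short reward sequences. The function name `regret`, the value $\mu^{\star} = 0.8$, and the reward sequences are made up for illustration.

```python
def regret(mu_star, rewards):
    """R(T) = T * mu_star - sum_t r_t for a recorded sequence of rewards."""
    T = len(rewards)
    return T * mu_star - sum(rewards)


# Hypothetical example: the best arm has mean 0.8 and the agent plays T = 5 rounds.
typical = regret(0.8, [1.0, 0.0, 1.0, 1.0, 0.0])   # 5 * 0.8 - 3
lucky = regret(0.8, [1.0, 1.0, 1.0, 1.0, 1.0])     # 5 * 0.8 - 5
print(typical, lucky)
```

Because this definition uses the realized rewards $r_{t}$, the regret of a single run can be negative when the agent gets lucky (the second sequence). The theorems in this lecture bound the expected regret $\mathbb{E}\left[R(T)\right]$, which is never negative.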

### Explore-First Exploration
A straightforward approach to the multi-armed bandit problem is to explore each arm equally. This is called _uniform exploration_ or the explore-first algorithm. In this approach, the agent begins with a purely _exploratory phase_, pulling each arm $N$ times. After this exploration phase, the agent selects the arm with the highest empirical (sample) mean reward and plays it for the rest of the game. This is called the _exploitation phase_.

#### Explore-First Algorithm
The agent has $K$ arms, $\mathcal{A} = \left\{1,2,\dots,K\right\}$, and the total number of rounds is $T$. The agent uses the following algorithm to choose which arm to pull during each round:
1. _Initialization_: For each arm $a\in\mathcal{A}$, set $N_{a} = (T/K)^{2/3}\cdot\mathcal{O}\left(\log{T}\right)^{1/3}$ (the number of times we try action $a$).
2. _Exploration_: Play each arm $a\in\mathcal{A}$ for $N_{a}$ rounds and record the rewards. After the exploration phase, select the arm $a^{\star}$ with the highest empirical mean reward (break ties arbitrarily).
3. _Exploitation_: Play arm $a^{\star}$ for the remaining rounds.

__Theorem__: The _expected_ regret over $T$ rounds of the _uniform exploration_ algorithm is bounded by $\mathbb{E}\left[R(T)\right]\leq{T}^{2/3}\times\mathcal{O}\left(K\cdot\log{T}\right)^{1/3}$, where $K$ is the number of arms, $T$ is the total number of rounds and $N = (T/K)^{2/3}\cdot\mathcal{O}\left(\log{T}\right)^{1/3}$ is the number of rounds in the exploration phase for each action (choice).
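
The explore-first steps can be sketched as follows. This is an illustrative simulation under stated assumptions: rewards are Bernoulli, the hidden constant in $\mathcal{O}(\cdot)$ is taken to be 1, $T\geq K$, and the function name `explore_first` and the arm means are hypothetical.

```python
import math
import random


def explore_first(means, T, seed=0):
    """Run explore-first on a simulated Bernoulli bandit with true means `means`."""
    rng = random.Random(seed)
    K = len(means)

    def pull(a):
        # Simulated environment: reward 1 with probability means[a], else 0.
        return 1.0 if rng.random() < means[a] else 0.0

    # N = (T/K)^(2/3) * (log T)^(1/3), taking the O(.) constant as 1 (assumption),
    # capped so that the exploration phase fits inside the T rounds.
    N = math.ceil((T / K) ** (2 / 3) * math.log(T) ** (1 / 3))
    N = max(1, min(N, T // K))

    rewards = []
    totals = [0.0] * K
    for a in range(K):                      # exploration phase: N pulls per arm
        for _ in range(N):
            r = pull(a)
            totals[a] += r
            rewards.append(r)

    # Arm with the highest empirical mean (max() breaks ties by lowest index).
    a_star = max(range(K), key=lambda a: totals[a] / N)

    for _ in range(T - K * N):              # exploitation phase
        rewards.append(pull(a_star))
    return a_star, N, rewards


means = [0.2, 0.5, 0.8]
T = 10_000
a_star, N, rewards = explore_first(means, T)
regret = T * max(means) - sum(rewards)
print(f"N = {N}, chosen arm = {a_star}, realized regret = {regret:.1f}")
```

In this sketch nearly all of the regret comes from the exploration phase, where the two inferior arms are each pulled $N$ times regardless of how bad they look.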
___

### Epsilon-Greedy Exploration
One issue with the _uniform exploration_ algorithm is that it may not be the best choice for all problems. For example, performance in the exploration phase may be _bad_ if many of the arms have a large gap $\Delta({a})$:
* _What is this gap?_ Let the (true) mean reward for each arm be $\mu(a) = \mathbb{E}_{r\sim\mathcal{D}_{a}}\left[r\right]$, where $a\in\mathcal{A}$. The _best_ mean reward over the actions is $\mu^{\star} = \max_{a\in\mathcal{A}}\mu(a)$. Then, the gap $\Delta({a}) = \mu^{\star} - \mu(a)$ is the difference between the mean reward of the best arm and the mean reward of arm $a$. If the gap is _large_, the agent may miss out on many rewards by exploring each arm equally.
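
As a small worked example of the gap (the three mean values are made up for illustration), consider $K = 3$ arms with $\mu(1) = 0.9$, $\mu(2) = 0.8$, and $\mu(3) = 0.2$. Pulling each arm $N$ times in a uniform exploration phase costs, in expectation, $N\cdot\Delta(a)$ per arm:

$$
\begin{align*}
\mu^{\star} &= \max\left\{0.9,\,0.8,\,0.2\right\} = 0.9 \\
\Delta(1) &= 0,\quad \Delta(2) = 0.9 - 0.8 = 0.1,\quad \Delta(3) = 0.9 - 0.2 = 0.7 \\
N\cdot\sum_{a\in\mathcal{A}}\Delta(a) &= N\cdot\left(0 + 0.1 + 0.7\right) = 0.8\,N
\end{align*}
$$

Almost all of this expected exploration regret ($0.7\,N$ out of $0.8\,N$) comes from the single large-gap arm, even though a handful of pulls would already suggest that it is poor.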

With a large gap, it may be better to spread the exploration out over time and interleave it with exploitation. This is the idea behind the _epsilon-greedy_ algorithm. In each round, the agent chooses the empirically best arm with probability $1-\epsilon$ and a uniformly random arm with probability $\epsilon$. This spreads exploration more evenly over time and may lead to better performance in cases where the gap is large.

While [Slivkins](https://arxiv.org/abs/1904.07272) doesn't give a reference for the epsilon-greedy algorithm, other sources point (at least in part) to [Thompson and Thompson sampling, proposed in 1933 in the context of drug trials](https://arxiv.org/abs/1707.02038).

#### Epsilon-Greedy Algorithm
The agent has $K$ arms (choices), $\mathcal{A} = \left\{1,2,\dots,K\right\}$, and the total number of rounds is $T$. The agent uses the following algorithm to choose which arm to pull (which action to take) during each round:

For $t = 1,2,\dots,T$:
1. _Initialize_: Draw a random number $p$ uniformly from $\left[0,1\right]$ and compute the threshold $\epsilon_{t}\propto{t}^{-1/3}$. Note that in other sources, $\epsilon$ is a constant, not a function of $t$.
2. _Exploration_: If $p\leq\epsilon_{t}$, choose a random (uniform) arm $a_{t}\in\mathcal{A}$. Execute action $a_{t}$ and receive a reward $r_{t}$ from the _adversary_ (nature). 
3. _Exploitation_: Otherwise ($p>\epsilon_{t}$), choose the greedy action $a_{t} = a^{\star}_{t}$, i.e., the action with the highest average reward so far. Execute action $a_{t}$ and receive a reward $r_{t}$ from the _adversary_ (nature).
4. Update the list of rewards for $a_{t}\in\mathcal{A}$.

__Theorem__: The epsilon-greedy algorithm with exploration probability $\epsilon_{t}={t^{-1/3}}\cdot\left(K\cdot\log(t)\right)^{1/3}$ achieves a regret bound of $\mathbb{E}\left[R(t)\right]\leq{t}^{2/3}\cdot\mathcal{O}\left(K\cdot\log(t)\right)^{1/3}$ for each round $t$.
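
A minimal sketch of the epsilon-greedy loop is shown below. It assumes Bernoulli rewards, uses the $\epsilon_{t}$ schedule from the theorem clipped to 1 so that it is a valid probability, and treats an arm that has never been pulled as having an average reward of 0. These choices, along with the function name `epsilon_greedy` and the arm means, are illustrative assumptions.

```python
import math
import random


def epsilon_greedy(means, T, seed=0):
    """Run epsilon-greedy on a simulated Bernoulli bandit with true means `means`."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # number of pulls of each arm
    totals = [0.0] * K      # sum of rewards collected from each arm
    rewards = []

    def empirical_mean(a):
        # Assumption: an arm that has never been pulled has estimate 0.
        return totals[a] / counts[a] if counts[a] else 0.0

    for t in range(1, T + 1):
        # eps_t = t^(-1/3) * (K log t)^(1/3), clipped to at most 1.
        eps_t = min(1.0, (K * math.log(t) / t) ** (1 / 3))
        p = rng.random()
        if p <= eps_t:
            a_t = rng.randrange(K)                      # explore: uniform random arm
        else:
            a_t = max(range(K), key=empirical_mean)     # exploit: greedy arm
        r_t = 1.0 if rng.random() < means[a_t] else 0.0
        counts[a_t] += 1
        totals[a_t] += r_t
        rewards.append(r_t)
    return counts, rewards


means = [0.2, 0.5, 0.8]
T = 10_000
counts, rewards = epsilon_greedy(means, T)
regret = T * max(means) - sum(rewards)
print(f"pull counts = {counts}, realized regret = {regret:.1f}")
```

Note that the schedule gives $\epsilon_{1} = 0$ because $\log(1) = 0$, so the first round is a greedy pull made with no data. From the second round on, the clipped $\epsilon_{t}$ starts at 1 and decays toward 0.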
___

### Optimism Under Uncertainty
Let's consider the final approach for solving bandit problems for today: the optimism under uncertainty algorithm. The key assumption of this approach:
* __Assumption__: Assume each arm is as good as it can be given the observations so far, and choose the best arm based on these optimistic estimates. This intuition leads to the `UCB1` algorithm.

Given a history of rewards and the number of pulls for each arm, the `UCB1` algorithm calculates an upper confidence bound (UCB) and uses it to decide which arm to pull. 

__Definition__: Upper Confidence Bound (UCB). During each round $t =1,2, \dots, T$ the `UCB1` algorithm maximizes the sum $\bar{\mu}(a)+U(a,t)$, where $\bar{\mu}(a)$ is the _estimated_ mean return of arm $a\in\mathcal{A}$ at time $t$ and $U(a,t)$ is the _confidence radius_ (exploration bonus) of arm $a\in\mathcal{A}$ at time $t$. The sum itself is the arm's _upper confidence bound_:
$$
\begin{align*}
U(a,t) = \sqrt{\frac{2\log(t)}{N_{a}}}
\end{align*}
$$
where $N_{a}$ is the number of times that arm $a\in\mathcal{A}$ has been pulled up to time $t$. The `UCB1` algorithm chooses the arm $a^{\star}$ that maximizes the sum $\bar{\mu}(a)+U(a,t)$ during each round $t$. The `UCB1` algorithm was originally proposed by Auer, Cesa-Bianchi, and Fischer in 2002.
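
To see how the term $U(a,t)$ shapes the choice, the sketch below evaluates $\bar{\mu}(a)+U(a,t)$ for two arms at a single round. The snapshot values (round $t = 100$, the empirical means, and the pull counts) and the helper name `confidence_term` are hypothetical.

```python
import math


def confidence_term(t, n_pulls):
    """U(a, t) = sqrt(2 * log(t) / N_a) for an arm pulled n_pulls times by round t."""
    return math.sqrt(2.0 * math.log(t) / n_pulls)


# Hypothetical snapshot at round t = 100: arm -> (empirical mean, number of pulls).
t = 100
arms = {"A": (0.60, 50), "B": (0.45, 10)}
ucb = {name: mean + confidence_term(t, n) for name, (mean, n) in arms.items()}
chosen = max(ucb, key=ucb.get)
print(ucb, chosen)
```

Arm B has the lower empirical mean, but it has been pulled only 10 times, so its term $U(a,t)$ is more than twice that of arm A and B is the arm selected. This is the optimism principle at work: rarely tried arms get the benefit of the doubt.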

#### UCB1 Algorithm
The agent has $K$ arms (choices), $\mathcal{A} = \left\{1,2,\dots,K\right\}$, and the total number of rounds is $T\gg{K}$.

_Initialization_: Pull each arm $a\in\mathcal{A}$ once and record the rewards. For each arm $a\in\mathcal{A}$, set $N_{a} = 1$ and $\bar{\mu}(a) = r_{a}$.

For rounds $t = K+1,K+2,\dots,T$:
1. Compute the confidence term $U(a,t)$ for each arm $a\in\mathcal{A}$.
2. Choose the best arm $a^{\star} = \arg\max\,\left\{\bar{\mu}(a)+U(a,t)\mid\,a\in\mathcal{A}\right\}$ at time $t$.
3. Execute action $a^{\star}$ and receive a reward $r_{t}$ from the _adversary_ (nature).
4. Update the estimated mean reward $\bar{\mu}(a^{\star})$ and the number of pulls $N_{a^{\star}}$.

__Theorem__: For $K$ arms, the `UCB1` algorithm achieves an expected regret bound of $\mathbb{E}\left[R(T)\right]\leq\mathcal{O}\left(\sqrt{KT\cdot\log(T)}\right)$ over $T$ rounds.
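
The `UCB1` steps can be sketched as follows. As with the other examples, this is an illustrative simulation that assumes Bernoulli rewards; the function name `ucb1` and the arm means are hypothetical, and ties in the $\arg\max$ are broken by lowest arm index.

```python
import math
import random


def ucb1(means, T, seed=0):
    """Run UCB1 on a simulated Bernoulli bandit with true means `means`."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # N_a: number of pulls of each arm
    mu_bar = [0.0] * K      # estimated mean reward of each arm
    rewards = []

    def play(a):
        # Simulated environment: reward 1 with probability means[a], else 0.
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        mu_bar[a] += (r - mu_bar[a]) / counts[a]    # incremental mean update
        rewards.append(r)

    for a in range(K):                  # initialization: pull each arm once
        play(a)

    for t in range(K + 1, T + 1):
        # Pick the arm maximizing mu_bar(a) + sqrt(2 log(t) / N_a).
        a_star = max(
            range(K),
            key=lambda a: mu_bar[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
        )
        play(a_star)
    return counts, mu_bar, rewards


means = [0.2, 0.5, 0.8]
T = 10_000
counts, mu_bar, rewards = ucb1(means, T)
regret = T * max(means) - sum(rewards)
print(f"pull counts = {counts}, realized regret = {regret:.1f}")
```

Unlike explore-first and epsilon-greedy, the exploration here is adaptive: an arm is revisited only while its optimistic estimate remains competitive, so arms with a large gap are abandoned after relatively few pulls.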
___

## Summary

In this lecture, we explored the fundamentals of stochastic multi-armed bandit problems and algorithms for balancing exploration and exploitation:

> __Key takeaways:__
>
> 1. **Exploration vs. exploitation trade-off**: The fundamental challenge in sequential decision-making requires agents to balance trying new actions to discover potentially better rewards against leveraging current knowledge to maximize immediate payoffs, with principled algorithms providing theoretical guarantees for managing this trade-off effectively.
> 2. **Multi-armed bandit framework**: A sequential decision problem where agents repeatedly select from K options with unknown reward distributions, aiming to maximize cumulative reward through algorithms that balance exploration of less-tried arms with exploitation of empirically best-performing arms.
> 3. **Bandit algorithms with provable guarantees**: Explore-First, Epsilon-Greedy, and Upper Confidence Bound (UCB1) algorithms provide different strategies for managing the exploration-exploitation trade-off, with theoretical regret bounds ensuring near-optimal performance relative to the best possible strategy in hindsight.

These bandit algorithms establish foundational approaches for decision-making under uncertainty, with applications ranging from clinical trials and financial portfolio optimization to recommendation systems and adaptive routing.

___
