___
# L7c: Binary Bernoulli and Contextual Bandit Problems

In this lecture, we will discuss the binary Bernoulli bandit problem and the contextual bandit problem. The key concepts in this lecture are:
* __Binary Bernoulli Bandit Problem__: a bandit problem where the reward for taking action $a\in\mathcal{A}$ is binary, $r_{t}\in\left\{0,1\right\}$. However, the probability of getting reward `0` or `1` is unknown and needs to be estimated. While a binary reward distribution may seem limiting, this structure is _extremely_ useful in many real-world applications that are `true` or `false` situations.
* __Binary Contextual Bandit Problem__: The contextual bandit problem introduces the notion of state or context $s_{t}\in\mathcal{S}$ that is observed by the agent before taking an action. The reward is still binary, but now the reward distribution is dependent on the context of the agent. The goal is to learn a policy that maps context to actions that maximize the expected reward.

The notes for this lecture are adapted from the following sources:
1. Chapter 8 of "Introduction to Multi-Armed Bandits" by Aleksandrs Slivkins. This is an excellent resource (albeit quite technical) for learning more about bandit problems. [The book is available online here!](https://arxiv.org/abs/1904.07272)
2. Chapter 3 of [A Tutorial on Thompson sampling, Russo et al., 2020](https://arxiv.org/abs/1707.02038) explores the binary bandit problem in more detail.
___

## Binary Bernoulli Bandit Problem
The binary Bernoulli bandit problem is a special case of the stochastic bandit problem where the reward for taking action $a\in\mathcal{A}$ is binary, $r_{t}\in\left\{0,1\right\}$. The probability of getting reward `1` is unknown and needs to be estimated. The goal is to maximize the expected reward by selecting the best action at each time step.

* Unlike a completely general stochastic bandit problem, the binary Bernoulli bandit problem assumes the _agent models how the world responds_ using a (deceptively) simple reward distribution, [the Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution). Thus, _the agent has a model of the world_ (which is so super cool!).

The Bernoulli distribution is a discrete probability distribution that returns a value of `1` with probability $p$ and value `0` with probability $1-p$. The probability mass function of the Bernoulli distribution is given by:
$$
\begin{equation*}
\text{Bern}(r; p) = \begin{cases}
p & \text{if } r = 1,\\
1-p & \text{if } r = 0.
\end{cases}
\end{equation*}
$$
where $r\in\left\{0,1\right\}$ is the reward and $p\in[0,1]$ is the probability of getting reward `1`. The expected reward is given by: $\mathbb{E}[r] = p$ and the variance is given by: $\text{Var}[r] = p(1-p)$. 
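
As a quick numerical check of the mean and variance formulas, here is a minimal Python sketch. The helper names `bernoulli_pmf` and `sample_bernoulli`, the value $p = 0.3$, and the sample size are illustrative assumptions, not part of the lecture code.

```python
import random

def bernoulli_pmf(r: int, p: float) -> float:
    """Probability of reward r in {0, 1} under Bern(r; p)."""
    if r not in (0, 1):
        raise ValueError("reward must be 0 or 1")
    return p if r == 1 else 1.0 - p

def sample_bernoulli(p: float, rng: random.Random) -> int:
    """Draw one reward: 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)   # fixed seed so the run is repeatable
p = 0.3                  # hypothetical success probability
draws = [sample_bernoulli(p, rng) for _ in range(100_000)]

# Compare sample statistics with E[r] = p and Var[r] = p(1 - p)
mean = sum(draws) / len(draws)
variance = sum((r - mean) ** 2 for r in draws) / len(draws)
print(f"sample mean     = {mean:.3f}  (theory: {p:.3f})")
print(f"sample variance = {variance:.3f}  (theory: {p * (1 - p):.3f})")
```

With this many draws, the sample mean and variance should land close to $p$ and $p(1-p)$.
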
* _Ready to get your mind blown_? Ok, so here is the _really cool part_: the agent models the parameter $p$ using a _probability distribution_ (e.g., [a Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution)) and updates this distribution as it observes rewards. This is the essence of the Bayesian approach to bandit problems.
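
A short worked version of this update may help. Assume a $\text{Beta}(\alpha,\beta)$ prior on $p$, and suppose the agent has observed $S$ successes (reward `1`) and $F$ failures (reward `0`) for an arm. Multiplying the prior density by the Bernoulli likelihood gives another Beta distribution:
$$
\begin{equation*}
\underbrace{p^{\alpha-1}(1-p)^{\beta-1}}_{\text{prior}}\cdot\underbrace{p^{S}(1-p)^{F}}_{\text{likelihood}} = p^{\alpha+S-1}(1-p)^{\beta+F-1}\quad\Longrightarrow\quad p\mid\text{data}\sim\text{Beta}(\alpha+S,\beta+F)
\end{equation*}
$$
The posterior mean is $(\alpha+S)/(\alpha+\beta+S+F)$. For example, starting from $\alpha=\beta=1$ and observing $S=3$, $F=1$ gives $\text{Beta}(4,2)$ with mean $4/6\approx 0.67$.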

### $\epsilon$-Greedy Binary Bernoulli Bandit
The $\epsilon$-greedy algorithm is a simple and effective algorithm for solving the binary Bernoulli bandit problem. 
* The algorithm selects the _best action_ with probability $1-\epsilon$ and selects a random action with probability $\epsilon$. The pseudo-code for the $\epsilon$-greedy algorithm is given below [(a more detailed version can be found here)](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-7/L7c/docs/BBBPcode.pdf):

#### Pseudo-code
The agent has $K$ arms (choices), $\mathcal{A} = \left\{1,2,\dots,K\right\}$, and the total number of rounds is $T\gg{K}$. Initialize the parameters of [the Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) for each arm $a\in\mathcal{A}$ to $\alpha_{a} = 1$ and $\beta_{a} = 1$. The agent uses the following algorithm to choose which arm to pull (which action to take) during each round:

For $t = 1,2,\dots,T$:
1. _Initialize_: Roll a random number $p\in\left[0,1\right]$ and compute a threshold $\epsilon_{t}={t^{-1/3}}\cdot\left(K\cdot\log(t)\right)^{1/3}$.
2. _Exploration_: If $p\leq\epsilon_{t}$, choose a random (uniform) arm $a_{t}\in\mathcal{A}$. Execute the action $a_{t}$ and receive a reward $r_{t}\in\left\{0,1\right\}$ from the _adversary_ (nature). 
3. _Exploitation_: Else if $p>\epsilon_{t}$, choose action $a^{\star}_{t}$, the action with the _highest expected probability of success_ (still a greedy choice), using the agent's model of the world. Execute the action $a^{\star}_{t}$ and receive a reward $r^{\star}_{t}\in\left\{0,1\right\}$ from the _adversary_ (nature). 
    - We generate the highest probability estimate of success by sampling from the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) for each arm: $\mathbf{p}\gets\left\{\text{Beta}(\alpha(a)+\mathbf{S}(a),\beta(a)+\mathbf{F}(a))\mid\forall{a}\in\mathcal{A}\right\}$ where $\mathbf{S}(a)$ and $\mathbf{F}(a)$ are the number of successes and failures for arm $a$. The highest probability action is: $a^{\star} = \text{argmax}_{a\in\mathcal{A}}\left\{\mathbf{p}(a)\right\}$.
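4. _Update_: Update the success $\mathbf{S}$ and failure $\mathbf{F}$ counts for the arm that was pulled in this round, using the observed reward. The same update applies whether the arm was chosen by exploration ($a_{t}$, $r_{t}$) or exploitation ($a^{\star}_{t}$, $r^{\star}_{t}$); for the exploitation case:
$$
\begin{equation*}
\mathbf{S}(a^{\star}_{t}) \gets \mathbf{S}(a^{\star}_{t}) + r^{\star}_{t},\quad \mathbf{F}(a^{\star}_{t}) \gets \mathbf{F}(a^{\star}_{t}) + (1-r^{\star}_{t})
\end{equation*}
$$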
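
The four steps above can be sketched in Python as follows. This is a minimal simulation under stated assumptions, not the course implementation: the function names, the hidden success probabilities `true_p`, the seed, and the clipping of $\epsilon_{t}$ to at most 1 are all illustrative choices (clipping does not change behavior, since a threshold above 1 always explores).

```python
import math
import random

def epsilon_schedule(t, K):
    """epsilon_t = t^(-1/3) * (K log t)^(1/3); the clip to 1.0 is an added convenience."""
    return min(1.0, (K * math.log(t) / t) ** (1.0 / 3.0))

def run_epsilon_greedy(true_p, T, seed=0):
    """Simulate the pseudo-code on arms with hidden success probabilities true_p."""
    rng = random.Random(seed)
    K = len(true_p)
    alpha = [1.0] * K   # Beta prior parameter alpha_a = 1 for every arm
    beta = [1.0] * K    # Beta prior parameter beta_a = 1 for every arm
    S = [0] * K         # successes observed per arm
    F = [0] * K         # failures observed per arm
    total_reward = 0
    for t in range(1, T + 1):
        # Step 1: roll a random number and compute the threshold
        u = rng.random()
        if u <= epsilon_schedule(t, K):
            # Step 2 (exploration): pick an arm uniformly at random
            a = rng.randrange(K)
        else:
            # Step 3 (exploitation): sample each arm's Beta distribution, take the argmax
            samples = [rng.betavariate(alpha[i] + S[i], beta[i] + F[i]) for i in range(K)]
            a = max(range(K), key=lambda i: samples[i])
        # Nature returns a Bernoulli reward for the pulled arm
        r = 1 if rng.random() < true_p[a] else 0
        # Step 4: update the success and failure counts of the pulled arm
        S[a] += r
        F[a] += 1 - r
        total_reward += r
    return S, F, total_reward

# Hypothetical three-armed problem; the third arm is the best one
S, F, total = run_epsilon_greedy(true_p=[0.2, 0.5, 0.8], T=5000)
pulls = [s + f for s, f in zip(S, F)]
print("pulls per arm:", pulls)
print("posterior mean per arm:", [round((1 + s) / (2 + s + f), 3) for s, f in zip(S, F)])
```

In a run like this, most pulls should go to the arm with the largest success probability, while the decaying threshold keeps a shrinking share of rounds for exploration.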

#### How is this different from the general stochastic bandit problem?
* A key difference between the binary Bernoulli bandit problem and the general stochastic bandit problem is that the agent has a model of the world (i.e., [a Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)) whose parameter is modeled using [a Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution). The agent updates this model as it observes rewards. This is in contrast to the general stochastic bandit problem where the agent does not have a model of the world and must estimate the reward distribution using an empirical estimate of the mean reward.
* Using a model of the world allows the agent to make more informed decisions about which actions to take. This is the essence of the Bayesian approach to bandit problems. The agent has a model of the likely reward distribution for _each_ action and uses this model to select the best action at each time step.
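
One way to see the difference is to compare what each approach reports for the same counts. The sketch below assumes a $\text{Beta}(1,1)$ prior; the function names and the example counts are hypothetical.

```python
def empirical_mean(S, F):
    """Point estimate used without a model: successes / pulls (undefined with no pulls)."""
    n = S + F
    return S / n if n > 0 else None

def beta_posterior_summary(S, F, alpha=1.0, beta=1.0):
    """Mean and variance of Beta(alpha + S, beta + F), the agent's belief about p."""
    a, b = alpha + S, beta + F
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

# Same success ratio, increasing amounts of data
for S, F in [(0, 0), (1, 0), (8, 2), (80, 20)]:
    mean, variance = beta_posterior_summary(S, F)
    print(f"S={S:3d} F={F:3d}  empirical={empirical_mean(S, F)}  "
          f"posterior mean={mean:.3f}  posterior variance={variance:.5f}")
```

The empirical mean is a single number (and is undefined before the first pull), whereas the Beta model is defined from the start and its variance shrinks as more rewards are observed.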

## Binary Contextual Bandit Problem
The binary contextual bandit problem is a generalization of the binary Bernoulli bandit problem where the reward distribution is dependent on the context of the agent. Let's consider the following scenario:
* _Context matters_: Suppose we want to predict whether it will rain or not tomorrow. The reward is binary: `1` if it rains and `0` if it does not rain. The reward distribution is dependent on the context of the agent (e.g., the weather forecast, the season, location on the planet, etc.). Ultimately, the goal is to learn a policy that maps _context_ to _actions_ that maximize the expected reward.
* _Context can be observed_: The agent observes a context $s_{t}\in\mathcal{S}$ before taking an action. The reward is still binary, but now the reward distribution is dependent on the context of the agent. The observation of context is a type of _side information_ that can be used to improve the agent's decision-making.
* _Contextual bandit problems are everywhere_: The contextual bandit problem is used in many real-world applications. For example, in online advertising, the reward is binary (e.g., a user clicks on an ad or not) and the reward distribution is dependent on the context of the user (e.g., the user's demographics, browsing history, etc.). 

### Formal Definition
The binary contextual bandit problem is defined by a tuple $(\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{P})$ where:
* $\mathcal{S}$ is the set of contexts that the agent can observe before taking an action.
* $\mathcal{A}_{s}$ is the set of actions that the agent can take in context $s\in\mathcal{S}$, where $\mathcal{A}_{s}\subseteq\mathcal{A}$.
* $\mathcal{R}_{s}$ is the set of rewards that the agent can receive in context $s\in\mathcal{S}$, where $\mathcal{R}_{s}\subseteq\mathcal{R}$.
* $\mathcal{P}$ is the set of reward distributions that the agent can model. The reward distribution is dependent on the context $s_{t}\in\mathcal{S}$ and the action $a_{t}\in\mathcal{A}$, thus we may have $\mathcal{P} = \left\{P_{s,a}\mid s\in\mathcal{S},a\in\mathcal{A}_{s}\right\}$.
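
To make the goal concrete, here is one way to write the objective in the binary case. It assumes (this notation is not part of the definition above) that each $P_{s,a}$ is a Bernoulli distribution with an unknown success probability $p_{s,a}$, and that a policy is a map $\pi:\mathcal{S}\to\mathcal{A}$ with $\pi(s)\in\mathcal{A}_{s}$:
$$
\begin{equation*}
\mathbb{E}\left[r_{t}\mid s_{t}=s,\,a_{t}=a\right] = p_{s,a},\qquad
\pi^{\star}(s) = \text{argmax}_{a\in\mathcal{A}_{s}}\;p_{s,a}
\end{equation*}
$$
Under these assumptions, the best policy picks the arm with the largest success probability in each context, and the agent's task is to learn it without knowing $p_{s,a}$ in advance.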

For a _small number of contexts_ we can solve the contextual bandit problem by maintaining a separate model of the world for each context $s\in\mathcal{S}$ and updating these models as we observe rewards. At each time step, the agent selects the _best action_ for the observed context $s_{t}$ using the algorithm instance associated with that context.

_Initialize_: For each context $s\in\mathcal{S}$, create an instance $\texttt{ALG}_{s}$ of $\texttt{ALG}$ (e.g., $\epsilon$-greedy) and, within each instance, initialize the parameters of the Beta distribution for each arm $a\in\mathcal{A}$ to $\alpha_{a} = 1$ and $\beta_{a} = 1$.

For $t = 1,2,\dots,T$:
1. _Observe context_: The agent observes the context $s_{t}\in\mathcal{S}$ and invokes the algorithm $\texttt{ALG}_{s_{t}}$.
2. _Choose action_: The agent chooses an action $a_{t}$ using the algorithm $\texttt{ALG}_{s_{t}}$.
3. _Observe reward_: The agent receives a reward $r_{t}\in\mathcal{R}$ from the _adversary_ (nature).
4. _Update model_: The agent updates the model of the world for the context $s_{t}$ using the reward $r_{t}$.
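
The loop above can be sketched in Python as one learner per context. This is an illustrative sketch, not the course implementation: the class name `EpsilonGreedyBeta`, the per-context round counter used for $\epsilon_{t}$, the uniform draw of contexts, and the `rainy`/`sunny` reward table are all assumptions.

```python
import math
import random

class EpsilonGreedyBeta:
    """One ALG_s instance: an epsilon-greedy Beta-Bernoulli learner for a single context."""
    def __init__(self, K, rng):
        self.K, self.rng = K, rng
        self.S = [0] * K   # successes per arm in this context
        self.F = [0] * K   # failures per arm in this context
        self.t = 0         # rounds seen by this instance (assumed per-context clock)

    def choose(self):
        self.t += 1
        eps = min(1.0, (self.K * math.log(self.t) / self.t) ** (1.0 / 3.0))
        if self.rng.random() <= eps:
            return self.rng.randrange(self.K)   # explore
        # exploit: sample Beta(1 + S, 1 + F) per arm and take the argmax
        samples = [self.rng.betavariate(1 + self.S[a], 1 + self.F[a]) for a in range(self.K)]
        return max(range(self.K), key=lambda a: samples[a])

    def update(self, a, r):
        self.S[a] += r
        self.F[a] += 1 - r

def run_contextual(reward_prob, T, seed=0):
    """reward_prob[s][a] is the hidden success probability of arm a in context s."""
    rng = random.Random(seed)
    contexts = list(reward_prob)
    K = len(reward_prob[contexts[0]])
    algs = {s: EpsilonGreedyBeta(K, rng) for s in contexts}   # initialize ALG_s
    for _ in range(T):
        s = rng.choice(contexts)                  # 1. observe context (assumed uniform)
        a = algs[s].choose()                      # 2. choose action with ALG_s
        r = 1 if rng.random() < reward_prob[s][a] else 0   # 3. observe reward
        algs[s].update(a, r)                      # 4. update the model for context s
    return algs

# Hypothetical two-context, two-arm problem where the best arm depends on the context
reward_prob = {"rainy": [0.9, 0.1], "sunny": [0.2, 0.7]}
algs = run_contextual(reward_prob, T=4000)
pulls_by_context = {s: [x + y for x, y in zip(alg.S, alg.F)] for s, alg in algs.items()}
for s, pulls in pulls_by_context.items():
    print(s, "pulls per arm:", pulls)
```

Because each context has its own counts, the learner can favor a different arm in each context, which a single context-free bandit cannot do.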


### $\epsilon$-Greedy Binary Contextual Bandit
The $\epsilon$-greedy algorithm can be extended to the binary contextual bandit problem by incorporating the _context_ into the agent's decision-making process. 

* In our simple approach, we'll assume that the context $s_{t}$ is a binary vector of length $d$ (i.e., $s_{t}\in\left\{0,1\right\}^{d}$). The agent maintains a separate model of the world for each context $s\in\left\{0,1\right\}^{d}$ and updates these models as it observes rewards. At each time step, the agent selects the _best action_ for the context $s_{t}$ using the model associated with that context.
* Thus, we modify the $\epsilon$-greedy algorithm to incorporate an observation of the _context_ $s_{t}$ which can itself be _correct_ or _incorrect_. For example, perhaps the _context_ is a function of the physical position of the agent in a room, and the agent can observe this position with some error. The agent must then learn to make decisions based on the _observed_ context.
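
A noisy context observation could be modeled in several ways. The sketch below assumes one simple choice that is not specified in these notes: each bit of the true context flips independently with probability `flip_prob`. The function name `observe_context` and the numbers used are hypothetical.

```python
import random

def observe_context(true_context, flip_prob, rng):
    """Return a noisy copy of a binary context: each bit flips independently with flip_prob."""
    return tuple(bit ^ 1 if rng.random() < flip_prob else bit for bit in true_context)

rng = random.Random(1)
d, flip_prob = 3, 0.1
true_context = (1, 0, 1)   # a context in {0,1}^d with d = 3

# Estimate how often the whole context is observed correctly
trials = 20_000
matches = sum(observe_context(true_context, flip_prob, rng) == true_context
              for _ in range(trials))
fraction_correct = matches / trials
print(f"fraction observed correctly = {fraction_correct:.3f} "
      f"(theory: {(1 - flip_prob) ** d:.3f})")
```

The observed tuple can serve directly as the key that selects the per-context model. Note that a context of length $d$ has $2^{d}$ possible values, so the one-model-per-context approach is only practical for small $d$.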

## Lab
In `L7d`, we will implement the $\epsilon$-greedy binary Bernoulli bandit algorithm and simulate the agent's learning process. We will also explore the contextual bandit problem using a simple modification of the binary Bernoulli bandit problem. 

# Today?
That's a wrap! What are some of the interesting things we discussed today?