# L7c: Binary Bernoulli and Contextual Bandit Problems
In this lecture, we will discuss the binary Bernoulli bandit problem and the contextual bandit problem. The key concepts in this lecture are:
* __Binary Bernoulli Bandit Problem__: a bandit problem where the reward for taking action $a\in\mathcal{A}$ is binary, $r_{t}\in\left\{0,1\right\}$. However, the probability of getting reward `0` or `1` is unknown and needs to be estimated. While a binary reward distribution may seem limiting, this structure is _extremely_ useful in many real-world applications that are `true` or `false` situations.
* __Binary Contextual Bandit Problem__: The contextual bandit problem introduces the notion of state or context $s_{t}\in\mathcal{S}$ that is observed by the agent before taking an action. The reward is still binary, but now the reward distribution is dependent on the context of the agent. The goal is to learn a policy that maps context to actions that maximize the expected reward.

The notes for this lecture are adapted from the following sources:
1. Chapter 8 of "Introduction to Multi-Armed Bandits" by Aleksandrs Slivkins. This is an excellent resource (albeit quite technical) for learning more about bandit problems. [The book is available online here!](https://arxiv.org/abs/1904.07272)
2. The Binary Bernoulli bandit problem is discussed in detail in this tutorial: [A Tutorial on Thompson sampling, Russo et al., 2020](https://arxiv.org/abs/1707.02038). 

## Binary Bernoulli Bandit Problem
The binary Bernoulli bandit problem is a special case of the stochastic bandit problem where the reward for taking action $a\in\mathcal{A}$ is binary, $r_{t}\in\left\{0,1\right\}$. The probability of getting reward `1` is unknown and needs to be estimated. The goal is to maximize the expected reward by selecting the best action at each time step.

* Unlike a completely general stochastic bandit problem, the binary Bernoulli bandit problem assumes the _agent models how the world responds_ using a (deceptively) simple reward distribution, [the Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution). Thus, _the agent has a model of the world_ (which is so super cool!).

The Bernoulli distribution is a discrete probability distribution that returns a value of `1` with probability $p$ and value `0` with probability $1-p$. The probability mass function of the Bernoulli distribution is given by:
$$
\begin{equation*}
\text{Bern}(r; p) = \begin{cases}
p & \text{if } r = 1,\\
1-p & \text{if } r = 0.
\end{cases}
\end{equation*}
$$
where $r\in\left\{0,1\right\}$ is the reward and $p\in[0,1]$ is the probability of getting reward `1`. The expected reward is given by: $\mathbb{E}[r] = p$ and the variance is given by: $\text{Var}[r] = p(1-p)$. 
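
As a quick sanity check of the mean and variance formulas above, here is a minimal Python sketch that computes both moments directly from the probability mass function. The function names `bernoulli_pmf` and `bernoulli_moments` are illustrative choices, not part of the lecture materials.

```python
def bernoulli_pmf(r: int, p: float) -> float:
    """Probability of reward r under Bern(r; p)."""
    if r not in (0, 1):
        raise ValueError("reward must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return p if r == 1 else 1.0 - p


def bernoulli_moments(p: float) -> tuple[float, float]:
    """Mean and variance computed by summing over the two outcomes."""
    mean = sum(r * bernoulli_pmf(r, p) for r in (0, 1))
    var = sum((r - mean) ** 2 * bernoulli_pmf(r, p) for r in (0, 1))
    return mean, var


# Example: for p = 0.3 the moments should match p and p(1 - p).
mean, var = bernoulli_moments(0.3)
```

Summing over the two outcomes reproduces $\mathbb{E}[r] = p$ and $\text{Var}[r] = p(1-p)$ up to floating-point error.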
* _Ready to get your mind blown_? Ok, so here is the _really cool part_: the agent models the parameter $p$ using a _probability distribution_ (e.g., a Beta distribution) and updates this distribution as it observes rewards. This is the essence of the Bayesian approach to bandit problems.
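
To make the belief update concrete, here is a minimal Python sketch for a single action, assuming a Beta prior on $p$. It relies on the standard conjugacy result: with a $\text{Beta}(\alpha, \beta)$ prior and an observed Bernoulli reward $r$, the posterior is $\text{Beta}(\alpha + r, \beta + 1 - r)$. The class name `BetaBelief` and the reward sequence are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief over the unknown success probability p."""
    alpha: float = 1.0  # Beta(1, 1) is the uniform prior on [0, 1]
    beta: float = 1.0

    def update(self, r: int) -> None:
        """Conjugate update after observing a binary reward r."""
        if r not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        self.alpha += r
        self.beta += 1 - r

    def mean(self) -> float:
        """Posterior mean estimate of p."""
        return self.alpha / (self.alpha + self.beta)


# Illustrative (made-up) reward sequence: three successes, one failure.
belief = BetaBelief()
for r in [1, 0, 1, 1]:
    belief.update(r)
```

After these four observations the belief is $\text{Beta}(4, 2)$, so the posterior mean estimate of $p$ is $4/6 \approx 0.67$.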

### $\epsilon$-Greedy Algorithm for Binary Bernoulli Bandit
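The $\epsilon$-greedy algorithm is a simple and effective algorithm for the binary Bernoulli bandit problem. At each time step, the algorithm exploits with probability $1-\epsilon$ by selecting the action with the highest estimated probability of reward, and explores with probability $\epsilon$ by selecting an action uniformly at random. The pseudo-code below writes the exploration rate as `epsilon_{t}` because it may change with the time step $t$: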
```text
Initialize: alpha_{a} = 1, beta_{a} = 1 for all a in A
for t = 1, 2, ..., T do
    with probability epsilon_{t}:
        a = an action drawn uniformly at random from A
    otherwise:
        a = the action with the largest estimated reward probability alpha_{a} / (alpha_{a} + beta_{a})
    play a and observe the reward r_{t} in {0, 1}
    update: alpha_{a} = alpha_{a} + r_{t}, beta_{a} = beta_{a} + (1 - r_{t})
end for
```
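
The pseudo-code above can be sketched in Python as follows. This is a minimal simulation under stated assumptions, not a definitive implementation: the exploration rate is held constant (the pseudo-code allows it to vary with $t$), the estimated reward probability is taken to be the Beta posterior mean $\alpha_{a}/(\alpha_{a}+\beta_{a})$, and the true success probabilities `true_p` are made-up values used only to simulate the world.

```python
import random


def epsilon_greedy(true_p, T=1000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on a binary Bernoulli bandit.

    true_p  : hypothetical true success probability of each action
    epsilon : constant exploration rate (an assumption of this sketch)
    """
    rng = random.Random(seed)
    K = len(true_p)
    alpha = [1.0] * K  # 1 + number of observed rewards of 1 per action
    beta = [1.0] * K   # 1 + number of observed rewards of 0 per action
    total_reward = 0

    for _ in range(T):
        if rng.random() < epsilon:
            # Explore: pick an action uniformly at random.
            a = rng.randrange(K)
        else:
            # Exploit: pick the action with the largest estimated mean.
            # Ties go to the lowest index.
            a = max(range(K), key=lambda k: alpha[k] / (alpha[k] + beta[k]))

        # The world returns a Bernoulli reward for the chosen action.
        r = 1 if rng.random() < true_p[a] else 0

        # Update the parameters of the chosen action only.
        alpha[a] += r
        beta[a] += 1 - r
        total_reward += r

    return alpha, beta, total_reward


alpha, beta, total_reward = epsilon_greedy([0.2, 0.5, 0.8], T=500, seed=1)
```

Because only the selected action is updated, `alpha[a] + beta[a] - 2` counts how many times action `a` was played, which makes it easy to inspect how the algorithm split its time between actions.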