# ***The Multi-Armed Bandit***
##### (Moses Marsh, Jack Bennetto)

### Objectives: answer the following

 * What are ***exploitation***, ***exploration***, and ***regret*** in this context?
 * How is this framework related to traditional A/B testing?
 * What’s your favorite strategy?

In [17]:
import numpy as np
from scipy import stats

# True success probabilities (0.58, 0.42, 0.26), shuffled so we don't know which is which
probabilities = np.arange(.58, .2, -.16)
np.random.shuffle(probabilities)

## An example: treatments for tongue psoriasis

Suppose you're a doctor who has developed a new treatment for tongue psoriasis. You aren't sure it will actually help, so you want to compare it to a control group.

How would you do that?

What if you were a Bayesian?

The problem is that these are real people, suffering from a real problem, and you want to help as many of them as possible.

In [18]:
probability = {}
probability['control'] = probabilities[0]
probability['drug'] = probabilities[1]
results = dict(control=[], drug=[], total=[])

In [19]:
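def run_test(treatment):
    """Treat one patient with `treatment`, record the outcome, and print running totals."""
    # Draw a single Bernoulli outcome: 1 = got better, 0 = still sick
    result = stats.bernoulli(probability[treatment]).rvs()
    results[treatment].append(result)
    results['total'].append(result)
    if result:
        print("The patient got better!")
    else:
        print("The patient is still sick :(")
    print("        got better   didn't")
    # Use a separate loop variable so the `treatment` argument is not shadowed
    for group in ['control', 'drug', 'total']:
        print("{:10} {:5}  {:5}".format(group, results[group].count(1), results[group].count(0)))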

Let's test some patients, giving some the drug and some not.

In [24]:
run_test('control')

The patient is still sick :(
        got better   didn't
control        0      3
drug           1      1
total          1      4


In [25]:
run_test('drug')

The patient is still sick :(
        got better   didn't
control        0      3
drug           1      2
total          1      5


In [26]:
results

{'control': [0, 0, 0], 'drug': [1, 0, 0], 'total': [0, 1, 0, 0, 0, 0]}

In [28]:
print('Experimental success rates:')
for treatment in ['control', 'drug']:
    print("  {:10} {:5.2f}".format(treatment, np.mean(results[treatment])))

Experimental success rates:
  control     0.00
  drug        0.33


In [29]:
print('Actual probabilities of getting better:')
for treatment in ['control', 'drug']:
    print("  {:10} {:5.2f}".format(treatment, probability[treatment]))

Actual probabilities of getting better:
  control     0.26
  drug        0.42


With traditional A/B testing there are two phases. First we collect data to determine the best choice, known as **exploration**. Once we're done testing, we make a decision between our options and stick with it; this is **exploitation**. Sometimes it's necessary to keep the testing phase strictly before the deployment phase, but what if there is no such restriction? Then we can combine the two phases, starting out exploring possible results and gradually concentrating on the best choice.

## The Multi-Armed Bandit

The multi-armed bandit is a mathematical problem. Suppose we have two or more slot machines, and each slot machine (a.k.a. one-armed bandit) has a different (unknown!) chance of winning. What strategy should we follow to maximize our payoff after a finite number of plays?
- bandits: $\{B_i\}$
- bandit payout probabilities: $\{p_i\}$

In this case we're assuming they all have the same payoff amount ("binary bandits"), but different payout probabilities (there are many extensions of this problem, but this is a sufficient starting point).

In reality these "bandits" might be drugs, or web-site designs, or ad campaigns, or job-search strategies, or dating profiles, or anything where we want to exploit the "winner" of our hypothesis testing.

- exploration: collecting more data for each bandit to get a better estimate of the true payout probabilities
- exploitation: using whichever bandit has performed the best so far

Every strategy for optimization will have to balance exploration and exploitation.

Each strategy will also have to track the performance of each bandit:
- $n_i$: number of visits (or rounds, or pulls) to bandit $B_i$
- $w_i$: number of successes at bandit $B_i$
- $\hat{p}_i = w_i / n_i $: observed success rate of bandit $B_i$
  - if $n_i = 0$, this is undefined
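
The bookkeeping above can be sketched in a few lines. This is only an illustration: the class name `BanditStats` and its methods are hypothetical, not part of the assignment, and returning NaN for an unplayed bandit is one possible convention for the undefined case.

```python
import numpy as np

class BanditStats:
    """Track pulls and wins per bandit (hypothetical helper for illustration)."""

    def __init__(self, n_bandits):
        self.pulls = np.zeros(n_bandits, dtype=int)  # n_i
        self.wins = np.zeros(n_bandits, dtype=int)   # w_i

    def update(self, i, result):
        """Record one pull of bandit i with result 1 (win) or 0 (loss)."""
        self.pulls[i] += 1
        self.wins[i] += result

    def success_rates(self):
        """Return w_i / n_i, using NaN where n_i == 0 (rate undefined)."""
        rates = np.full(len(self.pulls), np.nan)
        played = self.pulls > 0
        rates[played] = self.wins[played] / self.pulls[played]
        return rates

tracker = BanditStats(3)
for bandit, result in [(0, 1), (0, 0), (1, 1)]:
    tracker.update(bandit, result)
rates = tracker.success_rates()
print(rates)
```

Here the third bandit has never been played, so its observed rate stays undefined until a strategy decides how to treat it.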



## Common strategies

There are a number of common strategies that you'll implement in the assignment. Some are better than others, although a "best" strategy would require knowledge of the distribution of the payoffs.

To quantify a strategy's performance, we run simulations where we know the true payout probabilities and calculate **regret**: the expected difference in winnings between our strategy and the optimal one.
- let $p^*$ be the max of $\{p_1, p_2, p_3, \ldots, p_k\}$ 
- let $p(t)$ be the true success probability of the bandit chosen at round $t$
- then our regret after $T$ rounds is
$$ r = Tp^* - \sum_{t=1}^T p(t) $$
- and our expected average regret (the long-run regret per round) is
$$ E[r] = \lim_{T \rightarrow \infty} \frac{r}{T} = p^* - \lim_{T \rightarrow \infty} \frac{1}{T}\sum_{t=1}^T p(t) $$

We want a strategy that minimizes regret:
- A ***zero-regret strategy*** is defined as one with $E[r] = 0$
- A zero-regret strategy does not guarantee that you will never choose a suboptimal bandit. Instead, it guarantees that as you continue to play, the fraction of rounds spent on suboptimal bandits shrinks toward zero.
- Note again that actually calculating regret requires knowing the true bandit probabilities.
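
As a minimal sketch of the regret calculation in a simulation, the function below takes the true probabilities and the sequence of bandit indices a strategy chose, and returns the cumulative regret after each round. The function name `regret` and the example values are illustrative assumptions.

```python
import numpy as np

def regret(true_probs, choices):
    """Cumulative regret after each round.

    true_probs: true payout probability of each bandit
    choices: index of the bandit chosen at each round
    """
    true_probs = np.asarray(true_probs, dtype=float)
    p_star = true_probs.max()                  # best achievable probability
    chosen = true_probs[np.asarray(choices)]   # p(t) for each round
    return np.cumsum(p_star - chosen)          # T p* - sum of p(t), for each T

true_probs = [0.26, 0.42, 0.58]   # illustrative values
choices = [0, 1, 2, 2, 2]         # a strategy that settles on the best bandit
r = regret(true_probs, choices)
print(r)
print("average regret:", r[-1] / len(choices))
```

Once the strategy settles on the best bandit, the cumulative regret stops growing and the average regret decays toward zero.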


### Greedy Algorithm

The simplest model is a "greedy" algorithm, where we always choose the bandit that's been the most successful so far. Since we want to be able to explore at least a little, we might assume that each bandit has already had a single success.
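
A minimal sketch of the greedy rule, assuming each bandit starts with one success in one pull (so every observed rate begins at 1.0) and that ties go to the lowest index. The function names and the simulation loop are illustrative, not the assignment's interface.

```python
import numpy as np

def greedy_choice(wins, pulls):
    """Index of the bandit with the highest observed success rate."""
    return int(np.argmax(wins / pulls))

def simulate_greedy(true_probs, n_rounds, seed=0):
    """Play n_rounds greedily against bandits with the given true probabilities."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    wins = np.ones(k)    # assumption: each bandit starts with one success...
    pulls = np.ones(k)   # ...out of one pull, so no rate is undefined
    for _ in range(n_rounds):
        i = greedy_choice(wins, pulls)
        result = rng.random() < true_probs[i]   # Bernoulli draw
        wins[i] += result
        pulls[i] += 1
    return wins, pulls

wins, pulls = simulate_greedy([0.26, 0.42, 0.58], n_rounds=100)
print(pulls - 1)   # real pulls per bandit, excluding the assumed initial one
```
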

What are the limitations of this?


### Epsilon-Greedy Algorithm

With epsilon-greedy we choose the best bandit so far most of the time, but sometimes (with probability $\epsilon$) we choose one randomly.

Again, there isn't a "best" value, but $\epsilon = 0.1$ is typical.

- ***explore*** with some fixed probability $\epsilon$
  - generate a random number between 0 and 1. If it is less than $\epsilon$, choose a random bandit
- ***exploit*** at all other times: choose the bandit with the highest $\hat{p}_i$ 
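
The two steps above can be sketched as follows. Treating a never-played bandit as having a success rate of 1.0 is an assumption made here so that the undefined case is handled; the function name and signature are illustrative.

```python
import numpy as np

def epsilon_greedy_choice(wins, pulls, epsilon=0.1, rng=None):
    """Pick a bandit: random with probability epsilon, otherwise the best so far."""
    rng = np.random.default_rng() if rng is None else rng
    wins = np.asarray(wins, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    if rng.random() < epsilon:
        return int(rng.integers(len(pulls)))   # explore: uniform random bandit
    # Exploit: highest observed rate; unplayed bandits get 1.0 (assumption)
    rates = np.where(pulls > 0, wins / np.maximum(pulls, 1), 1.0)
    return int(np.argmax(rates))

rng = np.random.default_rng(0)
picks = [epsilon_greedy_choice([1, 5, 2], [10, 10, 10], rng=rng) for _ in range(1000)]
print(np.bincount(picks, minlength=3))   # mostly bandit 1, occasionally the others
```
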


Is this a zero-regret strategy?

### Softmax

We choose a bandit randomly, with probabilities given by the softmax function of the observed payouts. For example, if there are three bandits, A, B, and C, the probability of choosing A is

$$ \frac{ e^{p_A/\tau} }{ e^{p_A/\tau} + e^{p_B/\tau} + e^{p_C/\tau} } $$
where

* $p_A$ is the average payoff of bandit A so far (assume 1.0 to start).
* $\tau$ is the "temperature" (generally constant).

How does this behave in the extremes?


* As $\tau \to \infty$, the algorithm will choose bandits equally.
* As $\tau \to 0$, it will choose the most successful so far.
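
To see the two extremes numerically, here is a small sketch that computes the standard softmax probabilities, where each bandit's weight is $e^{p_i/\tau}$ divided by the sum of the weights over all bandits. The function name and the example payoffs are illustrative; subtracting the maximum before exponentiating is a numerical-stability detail that does not change the result.

```python
import numpy as np

def softmax_probabilities(payoffs, tau):
    """Selection probability of each bandit given average payoffs and temperature."""
    scaled = np.asarray(payoffs, dtype=float) / tau
    scaled -= scaled.max()        # avoid overflow; cancels in the ratio
    weights = np.exp(scaled)
    return weights / weights.sum()

payoffs = [0.2, 0.5, 0.8]         # illustrative average payoffs so far
for tau in [1000.0, 0.1, 0.001]:
    print(tau, np.round(softmax_probabilities(payoffs, tau), 3))

# Choosing a bandit is then a weighted random draw
rng = np.random.default_rng(0)
choice = rng.choice(len(payoffs), p=softmax_probabilities(payoffs, 0.1))
```
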

What are the limitations?

### UCB1 Algorithm

Another approach is to choose bandits based on a combination of expected payoff and uncertainty. The UCB1 algorithm scores each bandit by its upper confidence bound and chooses the bandit with the highest score, favoring bandits with a high expected payout but also those with high uncertainty.

Choose a bandit to maximize

$$p_A + \sqrt{\frac{2 \ln{N}}{n_A}} $$

where

 * $p_A$ is the observed average payout of bandit $A$ so far.
 * $n_A$ is the number of times bandit $A$ has been played.
 * $N$ is the total number of trials so far.
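
A minimal sketch of this scoring rule follows. Playing every untried bandit once before applying the formula is an assumption made here to avoid dividing by zero; the function name is illustrative.

```python
import numpy as np

def ucb1_choice(wins, pulls):
    """Index of the bandit with the highest mean payout plus exploration bonus."""
    wins = np.asarray(wins, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    untried = np.flatnonzero(pulls == 0)
    if untried.size > 0:
        return int(untried[0])    # assumption: try each bandit once first
    total = pulls.sum()           # N, total trials so far
    scores = wins / pulls + np.sqrt(2 * np.log(total) / pulls)
    return int(np.argmax(scores))

# A rarely played bandit can outrank a well-tested one because of its bonus
print(ucb1_choice(wins=[5, 1], pulls=[10, 1]))
# With equal play counts the bonus is identical, so the higher mean wins
print(ucb1_choice(wins=[50, 40], pulls=[100, 100]))
```
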


### Bayesian Bandit

Use Bayesian statistics:

* Find the probability distribution of the payout probability of each bandit, given the results thus far. (how?)
* For each bandit, draw a sample from its distribution.
* Choose the bandit whose sample is highest.
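
One way to sketch these steps, assuming a uniform Beta(1, 1) prior on each bandit's payout probability, so that the distribution after $w_i$ wins in $n_i$ pulls is Beta($1 + w_i$, $1 + n_i - w_i$). The prior and the function name are assumptions for illustration; other priors are possible.

```python
import numpy as np

def bayesian_bandit_choice(wins, pulls, rng=None):
    """Sample a plausible payout probability per bandit and pick the largest."""
    rng = np.random.default_rng() if rng is None else rng
    wins = np.asarray(wins)
    losses = np.asarray(pulls) - wins
    # One draw from each bandit's Beta posterior (uniform prior assumed)
    samples = rng.beta(1 + wins, 1 + losses)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
picks = [bayesian_bandit_choice([2, 6], [10, 10], rng=rng) for _ in range(1000)]
print(np.bincount(picks, minlength=2))   # favors bandit 1 but still tries bandit 0
```

Because the choice comes from random samples rather than point estimates, bandits with little data have wide distributions and still get chosen sometimes, which supplies the exploration.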