# ***The Multi-Armed Bandit***
##### (Moses Marsh, Jack Bennetto)

### Objective

 * Explain how the multi-armed bandit addresses the trade-off between exploitation and exploration.
 * Implement the multi-armed bandit algorithm.
 * Measure the regret of a strategy.

### Agenda

 * What is a multi-armed bandit?
 * How do we use this to do smarter A/B tests?
 * Common strategies


In [89]:
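import numpy as np
from scipy import stats

# Three candidate success probabilities: 0.58, 0.42, 0.26.
probabilities = np.arange(.58, .2, -.16)
# Shuffle in place so we don't know in advance which treatment is better.
np.random.shuffle(probabilities)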

## An example: treatments for flatulence

Suppose you're a doctor who has developed a new treatment for flatulence. You aren't sure it will actually help, so you want to compare it to a control group.

How would you do that?

What if you were a Bayesian?

The problem is that these are real people, suffering from a real problem, and you want to help as many of them as possible.

In [90]:
probability = {}
probability['control'] = probabilities[0]
probability['drug'] = probabilities[1]
results = dict(control=[], drug=[], total=[])

In [91]:
def run_test(treatment):
    """Treat one patient, record the outcome, and print the running tally."""
    # Simulate one patient: 1 if they got better, 0 otherwise.
    result = stats.bernoulli(probability[treatment]).rvs()
    results[treatment].append(result)
    results['total'].append(result)
    if result:
        print("The patient got better!")
    else:
        print("The patient is still sick :(")
    print("        got better   didn't")
    # Use a separate loop name so we don't shadow the `treatment` argument.
    for group in ['control', 'drug', 'total']:
        print("{:10} {:5}  {:5}".format(group, results[group].count(1), results[group].count(0)))

Let's test some patients, giving some the drug and some not.

In [92]:
run_test('control')

The patient got better!
        got better   didn't
control        1      0
drug           0      0
total          1      0


In [93]:
run_test('drug')

The patient is still sick :(
        got better   didn't
control        1      0
drug           0      1
total          1      1


In [94]:
print('Probabilities of getting better:')
for treatment in ['control', 'drug']:
    print("  {:10} {:5.2f}".format(treatment, probability[treatment]))

Probabilities of getting better:
  control     0.58
  drug        0.42


With traditional A/B testing there are two phases. First we run a test to determine the best choice; this is **exploration**. Once we're done testing, we act on the results; this is **exploitation**. Keeping the phases separate is sometimes necessary, for example when the people running the test are different from the ones using the result, but what if they aren't? Then we can combine the two phases, exploring all the options at first and gradually concentrating on the best choice.

## The Multi-Armed Bandit

The multi-armed bandit is a mathematical problem. Suppose we have two or more slot machines, and each slot machine (a.k.a. one-armed bandit) has a different (unknown!) chance of winning. What strategy should we follow to maximize our payoff after a finite number of plays?

In this case we're assuming they all have the same payoff ("binary bandits"), but there are many versions of this problem.

In reality these might be drugs, or web-site designs, or ad campaigns, or job-search strategies, or dating profiles, or anything where we want to exploit the "winner" of our hypothesis testing.

To understand this we talk about minimizing **regret**, the expected difference in winnings between our strategy and the optimal one. (What does that mean?)
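As a rough illustration of the definition above, the sketch below computes the expected regret of a sequence of choices when the true win probabilities are known. The function name `expected_regret` and the example numbers are assumptions made for illustration; in a real test the true probabilities are hidden, so regret can only be measured in simulation.

```python
import numpy as np

def expected_regret(true_probs, choices):
    """Expected regret of a sequence of choices (hypothetical helper).

    true_probs: win probability of each bandit (unknown in practice).
    choices: index of the bandit played at each step.
    """
    true_probs = np.asarray(true_probs, dtype=float)
    choices = np.asarray(choices, dtype=int)
    # The optimal strategy plays the best bandit every time.
    optimal = true_probs.max() * len(choices)
    # In expectation, each play earns the win probability of the bandit chosen.
    earned = true_probs[choices].sum()
    return optimal - earned

# Made-up example: two bandits, and we played the worse one twice in five plays.
regret = expected_regret([0.58, 0.42], [0, 1, 0, 1, 0])
print(round(regret, 2))
```

Each play of the worse bandit costs 0.58 - 0.42 = 0.16 in expectation, so two such plays give a regret of 0.32.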

How would you solve this?

## Common strategies

There are a number of common strategies that you'll implement in the assignment. Some are better than others, although a "best" strategy would require knowledge of the distribution of the payoffs.

### Greedy Algorithm

The simplest model is a "greedy" algorithm, where we always choose the bandit that's been the most successful so far. Since we want to be able to explore at least a little, we might assume that each bandit has already had a single success.
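One possible sketch of the greedy rule follows. It assumes we track the wins and trials of each bandit in arrays, and it reads "a single success" as one pretend win in one pretend trial; the name `greedy_choice`, the simulated probabilities, and the tie-breaking rule are illustrative choices, not part of the assignment.

```python
import numpy as np

def greedy_choice(wins, trials):
    """Return the index of the bandit with the best success rate so far.

    Assumption: every bandit starts with one pretend success (one win in
    one trial), so untried bandits look attractive at first.
    """
    wins = np.asarray(wins, dtype=float)
    trials = np.asarray(trials, dtype=float)
    rates = (wins + 1) / (trials + 1)
    return int(np.argmax(rates))  # ties go to the lowest index

# Small simulation with made-up win probabilities, hidden from the strategy.
rng = np.random.default_rng(0)
true_probs = [0.58, 0.42, 0.26]
wins = np.zeros(3)
trials = np.zeros(3)
for _ in range(1000):
    arm = greedy_choice(wins, trials)
    reward = rng.random() < true_probs[arm]  # True with probability true_probs[arm]
    wins[arm] += reward
    trials[arm] += 1
print(trials)  # how often each bandit was played
```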

What are the limitations of this?


### Epsilon-Greedy Algorithm

With epsilon-greedy we choose the best bandit so far most of the time, but sometimes (with probability $\epsilon$) we choose one at random.

Again, there isn't a "best" value, but $\epsilon = 0.1$ is typical.
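A minimal sketch of epsilon-greedy, under the same assumptions as before: per-bandit win and trial counts, and one pretend success per bandit so that untried bandits have a defined rate. The function name and the `rng` parameter are hypothetical.

```python
import numpy as np

def epsilon_greedy_choice(wins, trials, epsilon=0.1, rng=None):
    """Pick a random bandit with probability epsilon, else the best so far."""
    rng = np.random.default_rng() if rng is None else rng
    wins = np.asarray(wins, dtype=float)
    trials = np.asarray(trials, dtype=float)
    if rng.random() < epsilon:
        # Explore: any bandit, including the current best, is equally likely.
        return int(rng.integers(len(wins)))
    # Exploit: highest success rate, with one pretend success per bandit.
    rates = (wins + 1) / (trials + 1)
    return int(np.argmax(rates))

rng = np.random.default_rng(1)
picks = [epsilon_greedy_choice([30, 10], [50, 50], epsilon=0.1, rng=rng)
         for _ in range(1000)]
print(picks.count(0), picks.count(1))
```

With two bandits and $\epsilon = 0.1$, the weaker bandit should be picked roughly 5% of the time, since half of the random picks land on the best bandit anyway.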

What are the limitations?

### Softmax

We choose a bandit randomly, with probabilities given by the softmax function of the average payouts so far.

For example, if there are three bandits, A, B, and C, the probability of choosing A is

$$ \frac{ e^{p_A/\tau} }{ e^{p_A /\tau} + e^{p_B /\tau} + e^{p_C /\tau} } $$
where

* $p_A$ is the average payoff of bandit A so far (assume 1.0 to start).
* $\tau$ is the "temperature" (generally constant).

How does this behave in the extremes?


* As $\tau \to \infty$, the algorithm will choose bandits equally.
* As $\tau \to 0$, it will choose the most successful so far.
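The softmax rule and its two extremes can be sketched as follows. The helper names and the default $\tau = 0.1$ are arbitrary choices for illustration. Subtracting the largest payoff before exponentiating leaves the probabilities unchanged and only guards against numerical overflow when $\tau$ is small.

```python
import numpy as np

def softmax_probs(avg_payoffs, tau=0.1):
    """Probability of choosing each bandit under the softmax rule."""
    p = np.asarray(avg_payoffs, dtype=float)
    # Shifting by the max cancels in the ratio but keeps exp() from overflowing.
    weights = np.exp((p - p.max()) / tau)
    return weights / weights.sum()

def softmax_choice(avg_payoffs, tau=0.1, rng=None):
    """Draw one bandit index according to the softmax probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax_probs(avg_payoffs, tau)
    return int(rng.choice(len(probs), p=probs))

payoffs = [0.6, 0.4, 0.2]  # made-up average payoffs so far
print(softmax_probs(payoffs, tau=1000))  # close to uniform
print(softmax_probs(payoffs, tau=0.01))  # almost all weight on the best bandit
```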

What are the limitations?

### UCB1 Algorithm

Another approach is to choose bandits based on a combination of expected payoff and uncertainty. The UCB1 algorithm scores each bandit by its upper confidence bound.

The UCB1 algorithm chooses the bandit whose upper confidence bound is the highest, favoring bandits with a high expected payout, but also those with high uncertainty.

Choose a bandit to maximize

$$p_A + \sqrt{\frac{2 \ln{N}}{n_A}} $$

where

 * $p_A$ is the expected payout of bandit $A$, estimated from its average payout so far.
 * $n_A$ is the number of times bandit $A$ has been played.
 * $N$ is the total number of trials so far.

The square-root term shrinks as bandit $A$ is played more often, so rarely played bandits get a larger bonus for uncertainty.
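A sketch of the UCB1 score, assuming per-bandit win and trial counts. Because the formula divides by $n_A$, this version plays each bandit once before applying it; that initialization and the name `ucb1_choice` are assumptions made for illustration.

```python
import numpy as np

def ucb1_choice(wins, trials):
    """Return the index of the bandit with the highest UCB1 score."""
    wins = np.asarray(wins, dtype=float)
    trials = np.asarray(trials, dtype=float)
    # Assumption: try every bandit once first, so n_A > 0 in the formula.
    untried = np.flatnonzero(trials == 0)
    if len(untried) > 0:
        return int(untried[0])
    total = trials.sum()  # N, the total number of trials so far
    # Average payout plus the uncertainty bonus sqrt(2 ln N / n_A).
    scores = wins / trials + np.sqrt(2 * np.log(total) / trials)
    return int(np.argmax(scores))

# The second bandit has been played only once, so its bonus dominates.
print(ucb1_choice([50, 1], [100, 1]))
# With equal trial counts the bonus is equal, so the better average wins.
print(ucb1_choice([90, 10], [100, 100]))
```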

### Bayesian Bandit

Use Bayesian statistics:

* Find the probability distribution of the payout of each bandit, given the results so far. (How?)
* For each bandit, draw a sample from its distribution.
* Choose the bandit whose sample is the highest.
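The three steps above can be sketched as follows for binary bandits. This version assumes a uniform Beta(1, 1) prior on each win probability, which gives a Beta(1 + wins, 1 + losses) posterior; the prior and the function name are assumptions for illustration, and other priors are possible.

```python
import numpy as np

def bayesian_bandit_choice(wins, trials, rng=None):
    """Sample each bandit's win probability from its posterior; pick the max."""
    rng = np.random.default_rng() if rng is None else rng
    wins = np.asarray(wins, dtype=float)
    trials = np.asarray(trials, dtype=float)
    losses = trials - wins
    # Assumed model: Beta(1, 1) prior, so the posterior is Beta(1 + wins, 1 + losses).
    samples = rng.beta(1 + wins, 1 + losses)
    return int(np.argmax(samples))

rng = np.random.default_rng(2)
# With no data every bandit has the same posterior, so each is chosen about equally.
early = [bayesian_bandit_choice([0, 0], [0, 0], rng=rng) for _ in range(1000)]
# With lots of data the posteriors are narrow and the better bandit dominates.
late = [bayesian_bandit_choice([900, 100], [1000, 1000], rng=rng) for _ in range(1000)]
print(early.count(0), late.count(0))
```

Exploration comes from the sampling step: a bandit with little data has a wide posterior, so it still produces the highest sample some of the time.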