# $k$-armed bandit
In this notebook we get familiar with the problem known as the $k$-armed bandit problem. The problem is the following: let $\mathbf{B}_1,\ldots,\mathbf{B}_k$ be $k$ random processes (the *arms*), any of which can be *run* at any given time $t\geq 0$. Let $R_i$ be the unknown *reward* of running process $i$. How can we maximize our total reward
\begin{align*}
R = \sum_{t=0}^{T-1}{R_{a(t)}}
\end{align*}
when a total of $T$ runs are made, where $a(t)$ denotes the arm pulled at time $t$? Notice that initially we do not know any of the rewards. As we start pulling the arms we start to *learn* the rewards; however, at each step we have to choose
1. whether to stick with the *best so far* arm, or
2. to try an arm that we have not yet tried and see whether it gives a higher reward.

The problem is more interesting when $T$ is of the same order as $k$ (or much smaller). In fact, if $T \gg k$, then a not-so-bad (although perhaps not optimal) strategy is to try all the arms in the first $k$ pulls, and then always pull the one with the highest reward for the remaining $T-k$ pulls.
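
The "try every arm once, then commit" strategy for $T \gg k$ can be sketched as follows. This is only an illustration under the assumption that each arm returns a fixed reward; the function name `try_all_then_commit` and the example rewards are hypothetical and not part of the notebook.

```python
import numpy as np

def try_all_then_commit(rewards_per_arm, total_pulls):
    """Total reward when every arm is pulled once and the best arm is pulled afterwards."""
    rewards_per_arm = np.asarray(rewards_per_arm, dtype=float)
    k = len(rewards_per_arm)
    if total_pulls < k:
        # Assumption of this sketch: there are enough pulls to try every arm
        raise ValueError("this sketch assumes total_pulls >= k")
    explore_total = rewards_per_arm.sum()                      # first k pulls
    exploit_total = (total_pulls - k) * rewards_per_arm.max()  # remaining T - k pulls
    return explore_total + exploit_total

# Hypothetical example: 3 arms, 30 pulls
example_rewards = np.array([0.2, 0.9, 0.5])
total = try_all_then_commit(example_rewards, total_pulls=30)
best_possible = 30 * example_rewards.max()
print("efficiency: {0:.3f}".format(total / best_possible))
```

The loss with respect to the best possible total comes only from the $k$ exploration pulls, which is why the strategy becomes attractive as $T$ grows relative to $k$.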

## Some experimentation
We start with the standard imports and define a function that generates the simplest kind of bandits, in the form of a $k$-vector of rewards $\mathbf{R} \in \mathbb{R}^k$.

In [1]:
import numpy as np
def make_bandits(k, min_R=0, max_R=1, random_state=None):
    # RandomState(None) seeds itself from the OS, so a single code path covers
    # both cases and random_state=0 is honoured as a valid seed
    return np.random.RandomState(random_state).uniform(min_R, max_R, k)
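
For reproducible experiments it can help to fix the seed. The sketch below is a possible variant built on NumPy's `Generator` API (`np.random.default_rng`); the name `make_bandits_rng` and the `seed` parameter are assumptions made for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

def make_bandits_rng(k, min_R=0.0, max_R=1.0, seed=None):
    # default_rng(None) draws fresh entropy from the OS;
    # any integer seed (0 included) gives a reproducible vector
    rng = np.random.default_rng(seed)
    return rng.uniform(min_R, max_R, k)

a = make_bandits_rng(5, seed=0)
b = make_bandits_rng(5, seed=0)
print(np.array_equal(a, b))  # same seed, same rewards
```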

## Exploit/Explore alternating strategy
The first strategy we analyze continuously alternates between *exploration* (trying new arms) and *exploitation* (pulling the best arm found so far). More specifically, the algorithm works as follows:
1. Start with a random arm and pull it.
2. Save the pulled arm as the ``best_so_far``.
3. As long as there are unseen arms, choose one at random and pull it; if its reward is better than that of ``best_so_far``, use it as the new ``best_so_far``.
4. Perform an *exploit* pull of ``best_so_far`` and then go back to 3.
5. If all arms have been pulled and there are still steps to perform, always pull ``best_so_far``.

In [6]:
def alternate_strategy(bandits, steps, explore_rate=0.5):
    rewards = np.zeros(steps)
    k = len(bandits)
    # visit the arms in random order; popping an arm removes it from the unseen ones
    # (np.delete returns a copy, so its result must not be discarded)
    unseen_arms = list(np.random.permutation(k))
    best_so_far = unseen_arms.pop()
    rewards[0] = bandits[best_so_far]
    total_explore_steps = int(steps*explore_rate)
    explore_step = 0
    t = 1
    while (t < steps):
        # Explore, as long as the budget allows and unseen arms remain
        if (explore_step < total_explore_steps and unseen_arms):
            new_arm = unseen_arms.pop()
            rewards[t] = bandits[new_arm]
            if (bandits[best_so_far] < bandits[new_arm]):
                best_so_far = new_arm
            explore_step += 1
            t += 1
        # Exploit
        if (t < steps):
            rewards[t] = bandits[best_so_far]
            t += 1
    return rewards

In [7]:
k = 2**10
steps = int(k/4)
bandits = make_bandits(k,0,100)
actual_rewards = alternate_strategy(bandits, steps)
obtained_reward = np.sum(actual_rewards)
max_reward = steps*np.max(bandits)
print("We got reward {0:.4f} and best possible was {1:.4f}".format(obtained_reward, max_reward))
print("Efficiency of strategy {0:.0f}%".format(100*obtained_reward/max_reward))

We got reward 19263.6137 and best possible was 25563.3789
Efficiency of strategy 75%


#### Some comments on randomness
We thought that choosing the next arm to explore at random was better than any other way of selecting it. This would be true if the arms were ordered in some specific way. In our setup, and in general, the arms have no specific order, that is, they are already randomly shuffled, so any additional randomness is useless. This means that the choices made in the explore stages of the above algorithm could have gone linearly through the arms and the average return would not have changed.
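
A quick simulation can illustrate the point. The sketch below is not part of the original experiments: the values of `k`, `m` and `trials` are arbitrary, and it assumes uniform rewards on $[0,1]$. It compares the best reward found after exploring the first `m` arms in order with the best found after exploring `m` arms chosen at random.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, trials = 50, 10, 20000   # arbitrary sizes chosen for the illustration

linear_best = np.empty(trials)
random_best = np.empty(trials)
for j in range(trials):
    arms = rng.uniform(0, 1, k)
    # explore the first m arms in order
    linear_best[j] = arms[:m].max()
    # explore m distinct arms chosen at random
    random_best[j] = arms[rng.choice(k, size=m, replace=False)].max()

print("linear: {0:.3f}  random: {1:.3f}".format(linear_best.mean(), random_best.mean()))
```

Both averages should be close to $m/(m+1)$, the expected maximum of $m$ independent uniform draws on $[0,1]$.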

It turns out that this strategy is not optimal. No matter what, when $T<k/2$ we always perform a total of $T/2$ exploration moves, and each exploit pull uses whichever arm is the ``best_so_far`` at that moment. If instead we make all the exploration moves at the beginning of the run, the remaining $T/2$ pulls are all made on the best of all the explored arms. In other words, the overall best is never worse than any of the partial bests.

This implies that the strategy can be improved by making all the exploration pulls at the beginning and then exploiting the best arm.
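
The comparison can be made explicit with a short sketch, under simplifying assumptions that are not spelled out above: $T = 2m$, both strategies explore the same arms $1,\ldots,m$ in the same order, and the alternating strategy follows each exploration pull with exactly one exploit pull. Writing $b_j = \max_{i \leq j} R_i$ for the best reward seen after $j$ exploration pulls,
\begin{align*}
R_{\text{alt}} &= \sum_{j=1}^{m} R_j + \sum_{j=1}^{m} b_j, \\
R_{\text{first}} &= \sum_{j=1}^{m} R_j + m\, b_m, \\
R_{\text{first}} - R_{\text{alt}} &= \sum_{j=1}^{m} \left(b_m - b_j\right) \geq 0,
\end{align*}
since $b_j \leq b_m$ for every $j \leq m$. The two totals coincide only when the first explored arm already attains the best reward among the $m$ explored arms.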

In [4]:
def explore_first_steategy(bandits, steps, explore_fraction=0.25):
    rewards = np.zeros(steps)
    # there are only len(bandits) arms, so never explore more rounds than that
    explore_rounds = min(int(steps*explore_fraction), len(bandits))
    # use a 'linear explore' strategy which works if 'bandits' is shuffled
    # first pull on the first arm
    rewards[0] = bandits[0]
    best = 0
    i = 1
    while i < explore_rounds:
        rewards[i] = bandits[i]
        if (bandits[i] > bandits[best]):
            best = i
        i += 1
    # now exploit the best
    while i < steps:
        rewards[i] = bandits[best]
        i += 1
    return rewards

In [5]:
explore_first_return = explore_first_steategy(bandits, steps)
print("Explore first total reward {0:.5f}".format(np.sum(explore_first_return)))
print("Explore first efficiency {0:.0f}%".format(np.sum(explore_first_return)*100/max_reward ))

Explore first total reward 22157.13794
Explore first efficiency 87%


## Real testing
So far we have seen a single instance of each model; in fact, we always used the same ``bandits`` rewards to test the returns of the different models. The results therefore depend too much on that specific vector of rewards, and more robust testing is needed, performing several random experiments.

In [8]:
n_samples = 1000
k = 100
n_steps = 100*k
ratio = 0.5
# exploring for more than k rounds makes no sense: there are only k arms
ef_ratio = min(ratio, k/n_steps)
E_alt = np.zeros(n_samples)
E_ef = np.zeros(n_samples)
for i in range(n_samples):
    # draw a fresh reward vector for every experiment
    bandit = make_bandits(k)
    max_return = np.max(bandit)
    ave_return = np.mean(bandit)
    # for alternate strategy it does not make sense to go beyond 0.5
    alt_rewards = alternate_strategy(bandit, n_steps, ratio)
    ef_rewards = explore_first_steategy(bandit, n_steps, ef_ratio)
    E_alt[i] = np.sum(alt_rewards)
    E_ef[i] = np.sum(ef_rewards)