# Baby Robot's Guide to Multi-Armed Bandits

The code in this notebook comes from a Towards Data Science series I've written on [Multi-Armed Bandits](https://towardsdatascience.com/multi-armed-bandits-part-1-b8d33ab80697).

I've adapted it so that, rather than picking the best socket to charge up a Baby Robot, it instead tries to pick the best Candy Cane machines for Santa's Elves.

![Photo by Ferenc Almasi on Unsplash](https://cdn-images-1.medium.com/max/800/0*WUKevSQJrdR5Xjwr)

Photo by __[Ferenc Almasi](https://medium.com/r/?url=https%3A%2F%2Funsplash.com%2F%40flowforfrank%3Futm_source%3Dmedium%26utm_medium%3Dreferral)__ on __[Unsplash](https://medium.com/r/?url=https%3A%2F%2Funsplash.com%3Futm_source%3Dmedium%26utm_medium%3Dreferral)__

# What are Multi-Armed Bandits?

When faced with a choice of various options, where each option gives you a different degree of reward, how do you find which is the best?
This type of problem is commonly referred to as the multi-armed bandit.

In the multi-armed bandit you are trying to win as much money as possible from playing a set of one-armed bandits (otherwise known as slot machines or fruit machines), each of which can give a different payout. You need to find which machine gives the biggest payout, so you can make as much money as possible in the allocated time.

Each play of a machine (or pull of the bandit’s arm) corresponds to one time slot and you only get to play for a fixed number of time slots.

In this competition we're not playing slot machines and trying to win as much money as possible. Instead we're using candy machines and trying to win Santa's Elves as many candy canes as possible. A much more important task!


# The Exploration-Exploitation Dilemma

When trying the candy machines we're faced with the problem of not knowing which machine will give the most candy. We therefore need to explore the possible choices in search of the best one.

However, because we've only got a limited number of tries, we can’t spend all our time searching for the best machine. To get the maximum amount of candy we'll need to exploit the knowledge we've gained, so that we don’t waste time trying bad machines that don't return any candy.

This is an example of the classic exploration-exploitation dilemma, in which you want to explore the possible options in search of the best one while, at the same time, wanting to exploit the information that has already been obtained, so that you can gain the maximum possible overall reward.


# Multi-Armed Bandit Strategies

So, to get the maximum amount of candy for the Elves, we've got to try the machines to find the good ones and, once found, we then need to use those machines. This problem, of finding the balance between exploration and exploitation, is what multi-armed bandit algorithms try to solve.

There are many such algorithms, but below I have implemented:

* **Optimistic Epsilon Greedy** 
* **Upper Confidence Bound (UCB)**
* **Bernoulli Thompson (Bayesian) Sampling**


## The Code

To implement the code for this problem I've broken it down into a couple of main classes:

* one to represent a candy machine & keep track of that machine's performance
* one to keep the complete set of machines and decide which machine to try next

The various strategies for choosing a machine will then build on top of these classes.

### Install the kaggle environment for the competition

In [None]:
!pip install kaggle-environments --upgrade -q

### Setup test components

In [None]:
from kaggle_environments import make
env = make("mab", debug=True)

In [None]:
# best of five testing
def bo5(file1, file2):
    env = make("mab", debug=True)

    total_1 = 0
    total_2 = 0
    for i in range(5):
        env.run([file1, file2])
        p1_score = env.steps[-1][0]['reward']
        p2_score = env.steps[-1][1]['reward']
        env.reset()
        print(f"Round {i+1}: {p1_score} - {p2_score}")
        total_1 += p1_score
        total_2 += p2_score
            
    print(f"\nMean Scores: {total_1/5} - {total_2/5}")

# **Optimistic Epsilon Greedy** 

A greedy strategy would just always choose the machine that has so far given the most candy. Since this approach doesn't search the machines to find which is the best, it has a very good chance of not using the best machine. Epsilon-Greedy tries to fix this by introducing a measure of exploration. By default it will use the machine it's so far found to be the best, but at random intervals it will try a machine picked at random. The degree of exploration is governed by its Epsilon parameter.

Additionally, if you start with the assumption that none of the machines are going to return any candy (i.e. all machines initially have an average reward of zero) then, as soon as you find a machine that does return some, it will instantly be better than all the others and will therefore become the default machine. As a result, all the other machines will have to wait for the epsilon-greedy search to reach them. So it's going to take a very long time to find the best machine and discount bad machines.

The optimistic greedy approach tries to fix this problem by assigning each machine a high initial expected reward, rather than an initial reward of zero. As a result, every machine gets tried once during the first round of testing. In this way bad machines will be tried and rejected, and potentially good machines will be left for further testing.
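
As a rough illustration of that first round, here's a minimal sketch, separate from the agent code, of a purely greedy picker on a toy set of four machines. The payout probabilities, the seed and the `first_round_choices` helper are all made up for this example; the update rule is assumed to match the running mean used by the candy machine class later in this notebook.

```python
import numpy as np

def first_round_choices(initial_estimate, probs, seed=0):
    """Greedy selection for len(probs) pulls; returns the machines chosen."""
    rng = np.random.default_rng(seed)
    k = len(probs)
    q = np.full(k, float(initial_estimate))            # reward estimates
    n = np.full(k, 1 if initial_estimate > 0 else 0)   # try counts
    chosen = []
    for _ in range(k):
        # pick the highest estimate, breaking ties at random
        best = np.flatnonzero(q == q.max())
        i = int(rng.choice(best))
        reward = float(rng.random() < probs[i])        # 1.0 for a win, else 0.0
        n[i] += 1
        q[i] += (reward - q[i]) / n[i]                 # running-mean update
        chosen.append(i)
    return chosen

probs = [0.2, 0.5, 0.8, 0.4]   # hypothetical payout probabilities
optimistic = first_round_choices(1.1, probs)
zero_start = first_round_choices(0.0, probs)
print("optimistic start tried:", sorted(optimistic))
print("zero start tried:      ", sorted(set(zero_start)))
```

With an initial estimate of 1.1 and rewards of at most 1, a tried machine's estimate can never be more than 1.05, so every untried machine is picked before any machine gets revisited. With a zero start, the first machine to pay out keeps winning the greedy choice, so some machines may never be tried at all.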

# Create the agent file

Specify the name of the file to create. We'll create this over a few cells, so after the first cell the '-a' option of `%%writefile` will be used to append to the file.

In [None]:
agent_file = "optimistic_epsilon_greedy.py"

In [None]:
%%writefile {agent_file}

import math
import numpy as np
import random


"""
    Helper Functions
"""

# return the index of the largest value in the supplied list
# - arbitrarily select between the largest values in the case of a tie
# (the standard np.argmax just chooses the first value in the case of a tie)
def random_argmax(value_list):
  """ a random tie-breaking argmax"""
  values = np.asarray(value_list)
  return int(np.argmax(np.random.random(values.shape) * (values==values.max())))

# Define a Candy Machine

Each machine keeps track of:

* its average reward
* the number of times it's been tried


To allow an [Optimistic Greedy](https://towardsdatascience.com/bandit-algorithms-34fd7890cb18) algorithm to be used, the initial reward estimate can be set to a value other than zero.
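
To see what the running average does with an optimistic start, here's a small standalone sketch. The `running_mean` helper and the reward sequence are made up for illustration; the update is assumed to follow the same "new estimate from old estimate" rule as the class in the next cell, with a non-zero initial estimate counted as one earlier try.

```python
def running_mean(rewards, initial_estimate=0.0):
    """Incrementally average rewards, treating a non-zero initial
    estimate as a single pseudo-observation."""
    q = initial_estimate
    n = 1 if initial_estimate > 0 else 0
    for r in rewards:
        n += 1
        q = (1 - 1.0 / n) * q + (1.0 / n) * r
    return q

rewards = [1, 0, 0, 1, 1]   # hypothetical win/lose sequence
plain = running_mean(rewards)
optimistic = running_mean(rewards, initial_estimate=1.1)
print(plain, optimistic)
```

The plain estimate is just the ordinary mean of the rewards (3 wins in 5 tries), while the optimistic one behaves like the mean of six values, the five rewards plus the initial 1.1, so its influence fades as more tries come in.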

In [None]:
%%writefile -a {agent_file}

class CandyMachine:
    """ the base candy machine class """
    
    Q = 0   # the estimate of this machine's reward value                
    n = 0   # the number of times this machine has been tried      
    
    def __init__(self, **kwargs):       
        # get the initial estimate from the kwargs
        self.initial_estimate = kwargs.pop('initial_estimate', 0.)         
        self.initialize() # reset the machine                         
        
    def initialize(self):        
        # estimate of this machine's reward value 
        # - set to supplied initial value
        self.Q = self.initial_estimate                  
        
        # the number of times this machine has been tried 
        # - set to 1 if an initialisation value is supplied
        self.n = 1 if self.initial_estimate  > 0 else 0        
        
                    
    def update(self,R):
        """ update this machine after it has returned reward value 'R' """     
    
        # increment the number of times this machine has been tried
        self.n += 1

        # the new estimate of the mean is calculated from the old estimate
        self.Q = (1 - 1.0/self.n) * self.Q + (1.0/self.n) * R
    
    def sample(self):
        """ return an estimate of the machine's reward value """
        return self.Q

# Create a machine tester

* This keeps a list of all the machines that the Elves can choose from.
* At each time step it calls the 'select_machine' function to choose one of the machines based on the current rewards.
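
The environment reports a running total rather than the reward from the last try, so the tester has to subtract the previous total to find what the last machine actually returned. A minimal sketch of that bookkeeping, using a made-up list of totals and a hypothetical `per_step_rewards` helper:

```python
def per_step_rewards(cumulative_rewards):
    """Recover each step's reward from a running total."""
    previous_total = 0
    rewards = []
    for total in cumulative_rewards:
        rewards.append(total - previous_total)
        previous_total = total
    return rewards

totals = [1, 1, 2, 2, 2, 3]        # hypothetical running totals
print(per_step_rewards(totals))    # [1, 0, 1, 0, 0, 1]
```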

In [None]:
%%writefile -a {agent_file}

class MachineTester():
    """ create and test a set of machines over a single test run """

    # the index of the last machine chosen
    machine_index = -1    
    
    # the total reward accumulated so far
    total_reward = 0
    
    def __init__(self, configuration, **kwargs):
        self.machine_count = configuration.banditCount        
        self.machines = [CandyMachine(**kwargs) for i in range(self.machine_count )]
        
    def __call__(self, observation):  
        
        if self.machine_index > -1:
            # the observation reward is the total reward plus the last reward received
            # - subtract the total reward to find the reward received from the last machine
            machine_reward = observation.reward - self.total_reward        

            # update reward estimate of the machine that was last used
            self.machines[self.machine_index].update( machine_reward )

            # update the total reward
            self.total_reward = observation.reward
        
        # choose a new machine
        self.machine_index = self.select_machine()
        return self.machine_index

    def select_machine( self ):
        """ choose the machine with the current highest mean reward 
            or arbitrarily select a machine in the case of a tie """
        return random_argmax([machine.sample() for machine in self.machines])  

### Optimistic Epsilon-Greedy Implementation

Create a specialized MachineTester that uses the [Epsilon Greedy](https://towardsdatascience.com/bandit-algorithms-34fd7890cb18) algorithm.

* On most time steps this will choose the machine with the current best reward.
* Now and again, when a random value is less than 'Epsilon', a machine will instead be chosen at random from the set of all machines.

This uses the main MachineTester class as its base, so will inherit the main functionality from there.
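
Because the random pick is made from the complete set of machines, the current best machine can also be chosen during exploration. The sketch below works out the resulting selection probabilities for a toy set of estimates; the `selection_probabilities` helper and its inputs are hypothetical and only meant to illustrate the rule described above.

```python
import numpy as np

def selection_probabilities(estimates, epsilon):
    """Probability of each machine being picked by epsilon-greedy."""
    estimates = np.asarray(estimates, dtype=float)
    k = len(estimates)
    # exploration: every machine gets an equal share of epsilon
    probs = np.full(k, epsilon / k)
    # exploitation: the remaining probability goes to the best machine(s),
    # shared equally in the case of a tie
    best = np.flatnonzero(estimates == estimates.max())
    probs[best] += (1.0 - epsilon) / len(best)
    return probs

probs = selection_probabilities([0.2, 0.7, 0.4, 0.1], epsilon=0.1)
print(probs)
```

With four machines and an epsilon of 0.1, the best machine ends up being chosen with probability 0.9 + 0.1/4, and each of the others with probability 0.1/4.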

In [None]:
%%writefile -a {agent_file}

class EpsilonGreedyTester( MachineTester ):

    def __init__(self, configuration, **kwargs ):  
        
        # create a machine tester
        super().__init__(configuration, **kwargs) 
        
        # get the probability of selecting a non-greedy action from the kwargs
        self.epsilon = kwargs.pop('epsilon', 0.0)        
        
    
    def select_machine( self ):
        """ Epsilon-Greedy Selection"""
        
        # probability of selecting a random machine
        p = np.random.random()

        # if the probability is less than epsilon then a random machine is chosen from the complete set
        if p < self.epsilon:
            machine_index = np.random.choice(self.machine_count)
        else:
            # choose the machine with the current highest mean reward or arbitrarily select a machine in the case of a tie
            machine_index = random_argmax([machine.sample() for machine in self.machines])                 
        
        return machine_index

# Create an agent


Create the agent that will be tested by making an instance of the EpsilonGreedyTester class.
At each time step this will be called to select a new machine.
Additionally it will be passed the reward obtained from the previous time step, so that the last selected machine can be updated, to keep track of how it's performing.

In [None]:
%%writefile -a {agent_file}

machine_tester = None

def agent(observation, configuration):    
    global machine_tester    
    if machine_tester is None: 
        machine_tester = EpsilonGreedyTester(configuration,initial_estimate = 1.1,epsilon = 0.1)                
    return machine_tester(observation) 

# Test the agent

- run the Optimistic Epsilon Greedy algorithm against the supplied default

In [None]:
# play against the default agent that's provided by the competition
env.run(["../input/santa-2020/submission.py", f"{agent_file}"])
env.render(mode="ipython", width=800, height=800)

In [None]:
# best of 5
print(f'Default vs {agent_file}')
bo5("../input/santa-2020/submission.py", f"{agent_file}")

# **Upper Confidence Bounds (UCB)**

When choosing which machine to get candy from it would be best to pick the best one every time (i.e. the one with the highest probability of giving us some candy). This is what's known as the *optimal policy*. As you may have realised, the optimal policy is only theoretical, since we don't actually know which is the best machine and so have to spend some time exploring, during which we'll be using sub-optimal machines.

The optimal policy, although only theoretical, can however be used to evaluate other policies, to see how close they come to being optimal. The difference between the return that would be achieved by the optimal policy and the amount of return actually achieved by the policy under test is known as the <b><i>regret</i></b>.

Epsilon-Greedy has linear regret. It continues to explore the set of all actions, long after it has gained sufficient knowledge to know which of these actions are bad actions to take.
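
To make regret a bit more concrete, here's a minimal sketch that assumes each machine has a fixed, known payout probability (a simplification, and the numbers and function names are made up for the example). It computes the expected regret of a sequence of choices, along with the smallest regret that Epsilon-Greedy can add on every step just from its random exploration.

```python
def expected_regret(choices, payout_probs):
    """Expected reward lost by making these choices instead of
    always using the best machine."""
    best = max(payout_probs)
    return sum(best - payout_probs[i] for i in choices)

def epsilon_greedy_regret_floor(payout_probs, epsilon):
    """Lower bound on the expected regret added per time step: with
    probability epsilon a machine is picked uniformly at random."""
    best = max(payout_probs)
    mean_gap = sum(best - p for p in payout_probs) / len(payout_probs)
    return epsilon * mean_gap

payout_probs = [0.2, 0.5, 0.8]   # hypothetical machines
print(expected_regret([0, 1, 2, 2, 2], payout_probs))
print(epsilon_greedy_regret_floor(payout_probs, epsilon=0.1))
```

In this toy case the five choices cost roughly 0.9 candy canes in expectation, and an epsilon of 0.1 adds at least 0.03 per step no matter how much has been learnt. Since that floor never shrinks, the total regret keeps growing in proportion to the number of steps.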

A better approach, in terms of maximising the total reward, would be to restrict the sampling over time to the actions showing the best performance. This is the exact approach taken by the [Upper Confidence Bound (UCB)](https://towardsdatascience.com/the-upper-confidence-bound-ucb-bandit-algorithm-c05c2bf4c13f) strategy. UCB is based on the principle of “<i>optimism in the face of uncertainty</i>”, which basically means that if you don’t know which action is best then you choose the one that could plausibly turn out to be the best, i.e. the one with the highest upper bound on its estimated reward.

### UCB Implementation

In [None]:
agent_file = "upper_confidence_bounds.py"

In [None]:
%%writefile {agent_file}

import math
import numpy as np
import random


"""
    Helper Functions
"""

# return the index of the largest value in the supplied list
# - arbitrarily select between the largest values in the case of a tie
# (the standard np.argmax just chooses the first value in the case of a tie)
def random_argmax(value_list):
  """ a random tie-breaking argmax"""
  values = np.asarray(value_list)
  return int(np.argmax(np.random.random(values.shape) * (values==values.max())))

The CandyMachine base class is identical to the one used in Epsilon Greedy.
It is only repeated here to put it into the UCB python file.

In [None]:
%%writefile -a {agent_file}

class CandyMachine:
    """ the base candy machine class """
    
    Q = 0   # the estimate of this machine's reward value                
    n = 0   # the number of times this machine has been tried      
    
    def __init__(self, **kwargs):       
        # get the initial estimate from the kwargs
        self.initial_estimate = kwargs.pop('initial_estimate', 0.)         
        self.initialize() # reset the machine                         
        
    def initialize(self):        
        # estimate of this machine's reward value 
        # - set to supplied initial value
        self.Q = self.initial_estimate                  
        
        # the number of times this machine has been tried 
        # - set to 1 if an initialisation value is supplied
        self.n = 1 if self.initial_estimate  > 0 else 0        
        
                    
    def update(self,R):
        """ update this machine after it has returned reward value 'R' """     
    
        # increment the number of times this machine has been tried
        self.n += 1

        # the new estimate of the mean is calculated from the old estimate
        self.Q = (1 - 1.0/self.n) * self.Q + (1.0/self.n) * R
    
    def sample(self):
        """ return an estimate of the machine's reward value """
        return self.Q

The UCBCandyMachine extends the CandyMachine base class. It overrides the "sample" function to create one that is based on uncertainty. 
(Note this now takes the time step, since the uncertainty term grows slowly with the total number of time steps and shrinks each time the machine is tried)
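
As a quick worked example of that idea, the sketch below scores two made-up machines using the usual UCB form of "mean estimate plus confidence level times sqrt(log(t) / n)". The `ucb_score` function and the numbers are only for illustration; the class in the next cell is what the agent actually uses.

```python
import math

def ucb_score(q, n, t, confidence_level=2.0):
    """Mean estimate plus an uncertainty bonus; untried machines score infinity."""
    if n == 0:
        return math.inf
    return q + confidence_level * math.sqrt(math.log(t) / n)

t = 100   # hypothetical time step
well_known = ucb_score(q=0.6, n=80, t=t, confidence_level=0.9)
barely_tried = ucb_score(q=0.5, n=5, t=t, confidence_level=0.9)
print(well_known, barely_tried)
```

Even though the second machine has the lower average, it has only been tried 5 times, so its uncertainty bonus is much larger and it gets picked next. As it's tried more often its bonus shrinks, and it only keeps being chosen if its average holds up.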

In [None]:
%%writefile -a {agent_file}

class UCBCandyMachine( CandyMachine ):

    def __init__( self, **kwargs ):    
        """ initialize the UCB Candy Machine """                  
        
        # store the confidence level controlling exploration
        self.confidence_level = kwargs.pop('confidence_level', 2.0)       
                
        # initialize the base Candy Machine
        super().__init__(**kwargs)
        
    def uncertainty(self, t): 
        """ calculate the uncertainty in the estimate of this machine's mean """
        if self.n == 0: return float('inf')                         
        return self.confidence_level * (np.sqrt(np.log(t) / self.n))         
        
    def sample(self,t):
        """ the UCB reward is the estimate of the mean reward plus its uncertainty """
        return self.Q + self.uncertainty(t)

The MachineTester class is pretty much identical to the one used for Epsilon Greedy, except now the time step is passed when selecting a machine.

In [None]:
%%writefile -a {agent_file}

class MachineTester():
    """ create and test a set of machines over a single test run """

    # the index of the last machine chosen
    machine_index = -1    
    
    # the total reward accumulated so far
    total_reward = 0
    
    def __init__(self, configuration, **kwargs):
        self.machine_count = configuration.banditCount        
        self.machines = [UCBCandyMachine(**kwargs) for i in range(self.machine_count )]
        
    def __call__(self, observation):  
        
        if self.machine_index > -1:
            # the observation reward is the total reward plus the last reward received
            # - subtract the total reward to find the reward received from the last machine
            machine_reward = observation.reward - self.total_reward        

            # update reward estimate of the machine that was last used
            self.machines[self.machine_index].update( machine_reward )

            # update the total reward
            self.total_reward = observation.reward
        
        # choose a new machine
        self.machine_index = self.select_machine(observation.step)
        return self.machine_index

    def select_machine( self, t ):
        """ choose the machine with the highest upper confidence bound 
            or arbitrarily select a machine in the case of a tie """
        return random_argmax([machine.sample(t+1) for machine in self.machines])  

In [None]:
%%writefile -a {agent_file}

machine_tester = None

def agent(observation, configuration):    
    global machine_tester    
    if machine_tester is None: 
        machine_tester = MachineTester(configuration,confidence_level=0.9)                
    return machine_tester(observation) 

# Test the agent

- run the UCB algorithm against the supplied default

In [None]:
# play against the default agent that's provided by the competition
env.run(["../input/santa-2020/submission.py", f"{agent_file}"])
env.render(mode="ipython", width=800, height=800)

In [None]:
# best of 5
print(f'Default vs {agent_file}')
bo5("../input/santa-2020/submission.py", f"{agent_file}")

# Bernoulli Thompson (Bayesian) Sampling

Up until now, all of the methods we’ve seen for tackling the Bandit Problem have selected their actions based on the current averages of the rewards received from those actions. [Thompson Sampling](https://towardsdatascience.com/thompson-sampling-fc28817eacb8) (also sometimes referred to as the <i>Bayesian Bandits</i> algorithm) takes a slightly different approach; rather than just refining an estimate of the mean reward it extends this, to instead build up a probability model from the obtained rewards, and then samples from this to choose an action.

In this way, not only is an increasingly accurate estimate of the possible reward obtained, but the model also provides a level of confidence in this reward, and this confidence increases as more samples are collected. This process of updating your beliefs as more evidence becomes available is known as <i>Bayesian Inference</i>.

When a random variable has only two possible outcomes its behaviour can be described by the Bernoulli distribution. When, as in this case, the available rewards are binary (win or lose), the Beta distribution is ideal for modelling our belief about a machine's probability of paying out. The Beta distribution takes two parameters, ‘α’ (alpha) and ‘β’ (beta). In the simplest terms these parameters can be thought of as, respectively, the counts of successes and failures (each starting at one in the code below, so that every machine begins with a uniform distribution). So, when we win a candy cane that machine's alpha value will be increased. When instead we lose, beta will be increased.
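
To show how the confidence builds up, here's a small sketch that summarises the Beta distribution after a few results and then after ten times as many. It assumes a starting point of α = β = 1, and the `beta_summary` helper and win/loss counts are made up for the example. The mean and variance come from the standard Beta formulas.

```python
import math

def beta_summary(wins, losses, prior_alpha=1, prior_beta=1):
    """Return (alpha, beta, mean, standard deviation) of the Beta
    distribution after the given numbers of wins and losses."""
    a = prior_alpha + wins
    b = prior_beta + losses
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, math.sqrt(variance)

few = beta_summary(wins=3, losses=2)
many = beta_summary(wins=30, losses=20)
print(few)
print(many)
```

Both cases have the same win rate, so their means are close, but the spread after 50 tries is far narrower than after 5. Sampling from the narrow distribution nearly always gives a value close to the mean, whereas the wide one will sometimes give a high value, which is what lets a little-tried machine still get chosen.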

In [None]:
agent_file = "bernoulli_thompson_sampling.py"

In [None]:
%%writefile {agent_file}

import math
import numpy as np
import random


"""
    Helper Functions
"""

# return the index of the largest value in the supplied list
# - arbitrarily select between the largest values in the case of a tie
# (the standard np.argmax just chooses the first value in the case of a tie)
def random_argmax(value_list):
  """ a random tie-breaking argmax"""
  values = np.asarray(value_list)
  return int(np.argmax(np.random.random(values.shape) * (values==values.max())))

In [None]:
%%writefile -a {agent_file}

class CandyMachine:
    """ the base candy machine class """
    
    Q = 0   # the estimate of this machine's reward value                
    n = 0   # the number of times this machine has been tried      
    
    def __init__(self, **kwargs):       
        # get the initial estimate from the kwargs
        self.initial_estimate = kwargs.pop('initial_estimate', 0.)         
        self.initialize() # reset the machine                         
        
    def initialize(self):        
        # estimate of this machine's reward value 
        # - set to supplied initial value
        self.Q = self.initial_estimate                  
        
        # the number of times this machine has been tried 
        # - set to 1 if an initialisation value is supplied
        self.n = 1 if self.initial_estimate  > 0 else 0        
        
                    
    def update(self,R):
        """ update this machine after it has returned reward value 'R' """     
    
        # increment the number of times this machine has been tried
        self.n += 1

        # the new estimate of the mean is calculated from the old estimate
        self.Q = (1 - 1.0/self.n) * self.Q + (1.0/self.n) * R
    
    def sample(self,t):
        """ return an estimate of the machine's reward value """
        return self.Q

In [None]:
%%writefile -a {agent_file}

class BernoulliThompsonCandyMachine( CandyMachine ):
    def __init__( self, **kwargs ):             
                
        self.α = 1  # the number of times this machine returned a candy cane
        self.β = 1  # the number of times no candy was returned
            
        super().__init__(**kwargs)          
                    
    def update(self,R):
        """ increase the number of times this machine has been used and 
            update the counts of the number of times it has and has not 
            returned some candy (alpha and beta) """
        self.n += 1    
        self.α += R
        self.β += (1-R)
        
    def sample(self,t):
        """ return a value sampled from the beta distribution """
        return np.random.beta(self.α,self.β)

In [None]:
%%writefile -a {agent_file}

class MachineTester():
    """ create and test a set of machines over a single test run """

    # the index of the last machine chosen
    machine_index = -1    
    
    # the total reward accumulated so far
    total_reward = 0
    
    def __init__(self, configuration, **kwargs):
        self.machine_count = configuration.banditCount        
        self.machines = [BernoulliThompsonCandyMachine(**kwargs) for i in range(self.machine_count )]
        
    def __call__(self, observation):  
        
        if self.machine_index > -1:
            # the observation reward is the total reward plus the last reward received
            # - subtract the total reward to find the reward received from the last machine
            machine_reward = observation.reward - self.total_reward        

            # update reward estimate of the machine that was last used
            self.machines[self.machine_index].update( machine_reward )

            # update the total reward
            self.total_reward = observation.reward
        
        # choose a new machine
        self.machine_index = self.select_machine(observation.step)
        return self.machine_index

    def select_machine( self, t ):
        """ choose the machine with the highest value sampled from its beta 
            distribution or arbitrarily select a machine in the case of a tie """
        return random_argmax([machine.sample(t+1) for machine in self.machines]) 

In [None]:
%%writefile -a {agent_file}

machine_tester = None

def agent(observation, configuration):    
    global machine_tester    
    if machine_tester is None: 
        machine_tester = MachineTester(configuration)                
    return machine_tester(observation) 

# Test the agent

- run the algorithm against the supplied default

In [None]:
# best of 5
print(f'Default vs {agent_file}')
bo5("../input/santa-2020/submission.py", f"{agent_file}")

## Head-to-Head Test

In [None]:
# best of 5
print(f'Optimistic Epsilon-Greedy vs UCB')
bo5("optimistic_epsilon_greedy.py", "upper_confidence_bounds.py")

In [None]:
print(f'Optimistic Epsilon-Greedy vs Bernoulli Thompson Sampling')
bo5("optimistic_epsilon_greedy.py", "bernoulli_thompson_sampling.py")

In [None]:
print(f'UCB vs Bernoulli Thompson Sampling')
bo5("upper_confidence_bounds.py", "bernoulli_thompson_sampling.py")