# Intro

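This notebook shows how to build a multi-armed bandit agent.

The approach used here (sampling from a Beta distribution for each bandit, commonly called Thompson sampling) is a widely used RL algorithm for the multi-armed bandit problem because it balances exploration and exploitation well.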

Algorithm logic:

- At each step, for each bandit, draw a random number from B(a+1, b+1), where B is the beta distribution, `a` is the total reward from this bandit, and `b` is the number of this bandit's historical losses.
- Select the bandit with the largest drawn number and use it as the next action.
- Repeat
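
The steps above can be sketched as a single select-and-update round. This is a minimal illustration, not the submission code: the names `select_bandit`, `wins`, and `losses` are hypothetical, and the reward is hard-coded where a real agent would read it from the environment.

```python
import numpy as np

def select_bandit(wins, losses, rng):
    """Draw one sample per bandit from Beta(a+1, b+1) and return the index of the largest."""
    samples = rng.beta(np.asarray(wins) + 1, np.asarray(losses) + 1)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
wins = [0.0, 0.0, 0.0]    # a: total reward per bandit
losses = [0.0, 0.0, 0.0]  # b: number of losses per bandit

chosen = select_bandit(wins, losses, rng)

# Assumption: pretend the chosen bandit paid out 1; in practice the
# environment reports the reward.
reward = 1
if reward > 0:
    wins[chosen] += reward
else:
    losses[chosen] += 1
```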

# Multi-armed bandit logic (high-level description)

## Beta-distribution


We use the Beta distribution B(a+1, b+1) to model the win probability of each bandit. It has two parameters, `a` and `b`, which we can interpret as the number of wins and the number of losses for that bandit. In this case the mode of the distribution is a/(a+b), the average win rate (defined once a+b > 0), and the mean is (a+1)/(a+b+2) = 1/(1 + (b+1)/(a+1)), which is close to the win rate if a and b are large enough.

Alternatively, we can use B(a,b) and initialize everything by calling each bandit once - but the results will be very close.
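
As a quick numeric check of the mode and mean formulas for B(a+1, b+1), the sketch below keeps the win rate at 20% and grows the counts. The helper names `beta_mode` and `beta_mean` are hypothetical, introduced only for this illustration.

```python
def beta_mode(a, b):
    # Mode of Beta(a+1, b+1); only defined here for a + b > 0.
    return a / (a + b)

def beta_mean(a, b):
    # Mean of Beta(a+1, b+1).
    return (a + 1) / (a + b + 2)

for a, b in [(2, 8), (20, 80), (200, 800)]:
    print(f"a={a}, b={b}: mode={beta_mode(a, b):.3f}, mean={beta_mean(a, b):.3f}")
```

The mode stays at 0.2 while the mean moves from 0.25 toward 0.2 as the counts grow.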

Let's look at what the B(a+1, b+1) probability density function looks like:

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

space = np.linspace(0, 1, 1000)
fig, ax = plt.subplots(figsize = (20,10))
for a, b in [(0,0), (2, 8), (10, 40), (20, 80), (40, 160)]:
    ax.plot(space, stats.beta.pdf(space, a+1, b+1),label=f'a={a}, b={b}')
ax.legend(prop={'size': 20})

As we can see, the distribution is initially uniform, but as we increase a and b, the density concentrates around the current win rate. This distribution reflects our belief about a bandit's win probability, given its current number of wins and losses. The more data we have, the more confident our estimate becomes.
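
To put a number on "more confident", we can print the standard deviation of B(a+1, b+1) for the same 20% win rate at growing sample sizes. This is only a sketch. The list name `spreads` is introduced for this illustration.

```python
from scipy import stats

spreads = []
for a, b in [(2, 8), (10, 40), (20, 80), (40, 160)]:
    # Standard deviation of Beta(a+1, b+1): smaller means a tighter estimate.
    spread = stats.beta.std(a + 1, b + 1)
    spreads.append(spread)
    print(f"a={a}, b={b}: std={spread:.3f}")
```

The printed values shrink as a and b grow, which matches the narrowing curves in the plot.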

# Selecting a bandit on each step

Let's look at a simplified version of this competition: assume that we have only 3 bandits, no decay, and no competitor agent, so we play the 3 bandits by ourselves.

Let's also assume that these bandits have win probabilities (unknown to our agent) of 20%, 50%, and 80%.

For each bandit we track two parameters, `a` and `b`: its number of wins and losses. At each step, we sample a number p_i for each bandit from the corresponding beta distribution and select the bandit with the largest sample.

Initially all bandits have a = 0, b = 0. The beta distribution for each of them is uniform, so we effectively select a random bandit. After a few steps the distributions start to differ: for bandits with a high win probability, the density shifts toward higher values. Let's look at an artificial example:

In [None]:
fig, ax = plt.subplots(figsize = (20,10))
for a, b, proba in [(2, 9, '20%'), (8, 7, '50%'), (15, 3, '80%')]:
    ax.plot(space, stats.beta.pdf(space, a+1, b+1),label=proba)
ax.legend(prop={'size': 20})

After 40-50 rounds, we can see that the distributions are already very different. When we sample p_i now, the largest sample will most likely come from the bandit with the highest win rate. So we call the best bandit more and more often and maximize the total reward. After about 100 steps, we will see something like this:

In [None]:
fig, ax = plt.subplots(figsize = (20,10))
for a, b, proba in [(3, 11, '20%'), (12, 11, '50%'), (45, 12, '80%')]:
    ax.plot(space, stats.beta.pdf(space, a+1, b+1),label=proba)
ax.legend(prop={'size': 20})

This is how the multi-armed bandit algorithm balances exploration and exploitation. Every bandit call contributes to exploration, because we update a or b and become more confident in our estimate. At the same time, the more promising a bandit is, the more often we call it. After 1000 rounds, we will have called the "80%" bandit many more times than the other two and collected a relatively high total reward.
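
The simplified 3-bandit game described above can be simulated in a few lines. This is a sketch under the stated assumptions (no decay, no competitor, binary rewards). The variable names and the random seed are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
true_probs = np.array([0.2, 0.5, 0.8])  # hidden from the agent
wins = np.zeros(3)                      # a for each bandit
losses = np.zeros(3)                    # b for each bandit
pulls = np.zeros(3, dtype=int)          # how often each bandit was called

for _ in range(1000):
    # Sample p_i from Beta(a+1, b+1) for every bandit and pick the largest.
    samples = rng.beta(wins + 1, losses + 1)
    k = int(np.argmax(samples))
    # Play the chosen bandit and record a win or a loss.
    if rng.random() < true_probs[k]:
        wins[k] += 1
    else:
        losses[k] += 1
    pulls[k] += 1

print("pulls per bandit:", pulls)
print("total reward:", int(wins.sum()))
```

The pull counts show how quickly the agent shifts its calls toward the "80%" bandit.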

# Competition specific changes

I have described the general logic above, but this competition has two additional factors we need to consider:
1. Decay
2. Competitor

After every call, the probability of reward from that particular bandit decays by 3%. To account for this, I decay the `a` parameter (the win count) of a bandit by the same 3% each time any agent calls it. In the current version, I don't use any other adjustments.
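
The decay adjustment can be sketched in isolation as follows. The helper `apply_decay` is hypothetical and works on the raw win count `a`. The submission below stores a+1 instead, so it applies the equivalent update `(x - 1) * decay_rate + 1`.

```python
def apply_decay(wins, last_actions, decay_rate=0.97):
    """Shrink the win count of every bandit called in the last round.

    A bandit called by both agents is decayed twice.
    """
    for k in last_actions:
        wins[k] *= decay_rate
    return wins

wins = [10.0, 4.0, 7.0]
apply_decay(wins, last_actions=[0, 2])  # bandits 0 and 2 were called
print(wins)
```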

# Submission

In [None]:
!pip install kaggle-environments --upgrade

In [None]:
%%writefile submission.py

import numpy as np

bandit_state = None  # per-bandit [a + 1, b + 1] parameters of the Beta distribution
total_reward = 0     # cumulative reward reported so far
last_step = None     # bandit we chose on the previous step

def multi_armed_bandit_agent(observation, configuration):
    global bandit_state, total_reward, last_step

    step = 1.0  # you can regulate the exploration/exploitation balance using this param
    decay_rate = 0.97  # how much we decay the win count after each call
        
    if observation.step == 0:
        # initial bandit state
        bandit_state = [[1,1] for i in range(configuration["banditCount"])]
    else:       
        # updating bandit_state using the result of the previous step
        last_reward = observation["reward"] - total_reward
        total_reward = observation["reward"]
        
        # work out whether we are player 0 or 1 by matching our last action
        # (if both players chose the same bandit, either index points to it)
        player = int(last_step == observation.lastActions[1])
        
        if last_reward > 0:
            bandit_state[observation.lastActions[player]][0] += last_reward * step
        else:
            bandit_state[observation.lastActions[player]][1] += step
        
        bandit_state[observation.lastActions[0]][0] = (bandit_state[observation.lastActions[0]][0] - 1) * decay_rate + 1
        bandit_state[observation.lastActions[1]][0] = (bandit_state[observation.lastActions[1]][0] - 1) * decay_rate + 1

    # draw a random number from the Beta distribution for each bandit and select the luckiest one
    best_proba = -1
    best_agent = None
    for k in range(configuration["banditCount"]):
        proba = np.random.beta(bandit_state[k][0],bandit_state[k][1])
        if proba > best_proba:
            best_proba = proba
            best_agent = k
        
    last_step = best_agent
    return best_agent

In [None]:
%%writefile random_agent.py

import random

def random_agent(observation, configuration):
    return random.randrange(configuration.banditCount)

In [None]:
from kaggle_environments import make

env = make("mab", debug=True)

env.reset()
env.run(["random_agent.py", "submission.py"])
env.render(mode="ipython", width=800, height=700)

In [None]:
env.reset()
env.run(["submission.py", "submission.py"])
env.render(mode="ipython", width=800, height=700)