#### Forked from [Ilia Larchenko](https://www.kaggle.com/ilialar/simple-multi-armed-bandit) and then made changes.

Please upvote the original Notebook as well.

### Multi-armed bandit problems are some of the simplest reinforcement learning (RL) problems to solve. We have an agent which we allow to choose actions, and each action has a reward that is returned according to a given, underlying probability distribution. The game is played over many episodes (single actions in this case) and the goal is to maximize your reward.

To explain further: how do you most efficiently identify the best machine to play while still exploring the many options in real time? This problem is not an exercise in theoretical abstraction. It is an analogy for a problem that organisations face all the time: how to identify the best message to present to customers (a message here can be a webpage, an advertisement, or an image) so that it maximises some business objective (e.g. click-through rate or sign-ups).

The classic approach to making decisions across variants with unknown performance outcomes is to perform multiple A/B tests. These are typically run by evenly directing a percentage of traffic across each of the variants over a number of weeks, then performing statistical tests to identify which variant is the best. This is perfectly fine when there are a small number of variations of the message (e.g. 2–4), but can be quite inefficient in terms of both time and opportunity cost when there are many.

One simple example is in the optimization of click-through rates (CTR) of online ads. Perhaps you have 10 ads that essentially say the same thing
(maybe the words and designs are slightly different from one another). At first, you want to know which ad performs best and yields the highest CTR.


A similar problem: say you have a limited resource (e.g., an advertising budget) and some choices (10 ad variants). How will you allocate the resource among those choices to maximize your gain?

First, you have to “explore” and try the ads one by one. If you see that Ad 1 performs unusually well, you will want to “exploit” it and run it for the rest of the campaign, rather than waste money on underperforming ads. There’s one catch, though. Early on, Ad 1 might be performing well, so we’re tempted to use it again and again. But what if Ad 2 would have caught up and, had we let things unfold, produced higher gains? We’ll never know, because we committed to Ad 1 too early. This explore-exploit tradeoff is typical of the tradeoffs in many data analysis and machine learning projects. That’s why it’s recommended to set performance targets beforehand instead of wondering about the what-ifs later. Even with the most sophisticated techniques and algorithms, tradeoffs and constraints are always there.

This is where Reinforcement Learning (RL) comes in. In a nutshell, RL is about reinforcing the correct or desired behaviors as time passes: a reward for every correct behavior and a punishment otherwise.

The general reinforcement learning problem is a very general setting. Actions affect subsequent observations. Rewards are only observed corresponding to the chosen actions. The environment may be either fully or partially observed. Accounting for all this complexity at once may ask too much of researchers. Moreover, not every practical problem exhibits all this complexity. As a result, researchers have studied a number of special cases of reinforcement learning problems.
When the environment is fully observed, we call the RL problem a Markov Decision Process (MDP). When the state does not depend on the previous actions, we call the problem a contextual bandit problem. When there is no state, just a set of available actions with initially unknown rewards, this problem is the classic multi-armed bandit problem.

While in most learning problems we have a continuously parametrized function f where we want to learn its parameters (e.g., a deep network), in a bandit problem we only have a finite number of arms that we can pull, i.e., a finite number of actions that we can take.


---

#### From this [post](https://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html), a more mathematical explanation...

A Bernoulli multi-armed bandit can be described as a tuple of $\langle \mathcal{A}, \mathcal{R} \rangle$, where:

- We have $K$ machines with reward probabilities $\{\theta_1, \dots, \theta_K\}$.
- At each time step $t$, we take an action $a$ on one slot machine and receive a reward $r$.
- $\mathcal{A}$ is a set of actions, each referring to the interaction with one slot machine. The value of action $a$ is the expected reward, $Q(a) = \mathbb{E}[r \mid a] = \theta$. If action $a_t$ at time step $t$ is on the $i$-th machine, then $Q(a_t) = \theta_i$.
- $\mathcal{R}$ is a reward function. In the case of a Bernoulli bandit, we observe a reward $r$ in a stochastic fashion. At time step $t$, $r_t = \mathcal{R}(a_t)$ returns reward 1 with probability $Q(a_t)$ and 0 otherwise.

It is a simplified version of a Markov decision process, as there is no state $\mathcal{S}$.
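The definition above can be made concrete with a small simulator. This is a minimal sketch, not code from the original post: the class name `BernoulliBandit`, its method names, and the reward probabilities `[0.2, 0.5, 0.8]` are all assumptions chosen for illustration.

```python
import numpy as np


class BernoulliBandit:
    """K slot machines; machine i pays 1 with probability thetas[i], else 0.

    Hypothetical helper for illustration only.
    """

    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas, dtype=float)
        self.rng = np.random.default_rng(seed)

    @property
    def k(self):
        # Number of machines (arms)
        return len(self.thetas)

    def pull(self, action):
        # Stochastic reward r_t = R(a_t): 1 with probability Q(a_t), else 0
        return int(self.rng.random() < self.thetas[action])

    def q(self, action):
        # True action value Q(a) = theta_a (unknown to the agent in practice)
        return float(self.thetas[action])


bandit = BernoulliBandit([0.2, 0.5, 0.8])  # assumed reward probabilities
rewards = [bandit.pull(2) for _ in range(10_000)]
empirical_mean = sum(rewards) / len(rewards)
print(f"Q(a=2) = {bandit.q(2)}, empirical mean reward = {empirical_mean:.3f}")
```

Pulling the same arm many times shows the empirical mean reward approaching $Q(a)$, which is exactly what an agent has to estimate from experience.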

The goal is to maximize the cumulative reward $\sum_{t=1}^{T} r_t$. If we know the optimal action with the best reward, then the goal is the same as minimizing the potential regret, i.e. the loss incurred by not picking the optimal action.

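The optimal reward probability $\theta^*$ of the optimal action $a^*$ is:

$\theta^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a) = \max_{1 \le i \le K} \theta_i$

Our loss function is the total regret we might have by not selecting the optimal action up to the time step $T$:

$\mathcal{L}_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(\theta^* - Q(a_t)\right)\right]$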
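As a small worked example of the regret formula, the sketch below evaluates the sum inside the expectation for a fixed sequence of actions. The function name `total_regret` and the reward probabilities are assumptions made for illustration; in a real problem the $\theta_i$ are unknown, so regret can only be computed in simulation.

```python
import numpy as np


def total_regret(thetas, actions):
    """Sum of (theta* - Q(a_t)) over a given sequence of actions.

    Hypothetical helper: assumes the true reward probabilities are known.
    """
    thetas = np.asarray(thetas, dtype=float)
    actions = np.asarray(actions, dtype=int)
    return float(np.sum(thetas.max() - thetas[actions]))


thetas = [0.2, 0.5, 0.8]            # assumed reward probabilities
always_best = [2] * 100             # always play the optimal arm
always_worst = [0] * 100            # always play the worst arm

regret_best = total_regret(thetas, always_best)
regret_worst = total_regret(thetas, always_worst)
print(regret_best, regret_worst)
```

Always playing the optimal arm gives zero regret, while always playing the worst arm accumulates $\theta^* - \theta_1 = 0.6$ of regret per step.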

---

### Bandit Strategies

Based on how we do exploration, there are several ways to solve the multi-armed bandit problem:

- No exploration: the most naive approach and a bad one.
- Exploration at random.
- Smart exploration, with a preference for uncertainty.
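The first two strategies can be compared with an epsilon-greedy simulation, a common way to implement exploration at random. This is a sketch under stated assumptions: the function `epsilon_greedy`, the reward probabilities, and the values of `epsilon` are illustrative choices, not part of the original notebook.

```python
import numpy as np


def epsilon_greedy(thetas, epsilon, steps, seed=0):
    """Simulate epsilon-greedy on a Bernoulli bandit; return pulls per arm."""
    rng = np.random.default_rng(seed)
    thetas = np.asarray(thetas, dtype=float)
    counts = np.zeros(len(thetas), dtype=int)   # pulls per arm
    values = np.zeros(len(thetas))              # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(len(thetas)))    # explore: uniform random arm
        else:
            arm = int(np.argmax(values))            # exploit: best estimate so far
        reward = float(rng.random() < thetas[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts


thetas = [0.2, 0.5, 0.8]  # assumed reward probabilities
greedy_counts = epsilon_greedy(thetas, epsilon=0.0, steps=5_000)
explore_counts = epsilon_greedy(thetas, epsilon=0.1, steps=5_000)
print("no exploration: ", greedy_counts)
print("epsilon = 0.1:  ", explore_counts)
```

With `epsilon=0.0` and all estimates starting at zero, the agent never leaves the first arm it tries, which illustrates why no exploration is a bad strategy. With a small amount of random exploration it gets the chance to discover the better arms.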

#### `numpy.random.beta` draws samples from the Beta distribution, which has the probability density function

![img](https://i.imgur.com/TRlNMCd.png)
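To see why the Beta distribution is useful here, the sketch below first compares Beta samples after few and after many observations, then runs Thompson sampling on a simulated Bernoulli bandit. It is a simplified illustration, not the competition agent: the reward probabilities, step count, and variable names are assumptions, and it uses NumPy's `default_rng` generator rather than `numpy.random.beta` directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(a, b) has mean a / (a + b); a larger a + b gives a narrower distribution.
few = rng.beta(1 + 4, 1 + 1, size=10_000)      # belief after 4 wins, 1 loss
many = rng.beta(1 + 80, 1 + 20, size=10_000)   # belief after 80 wins, 20 losses
print(f"few observations:  mean={few.mean():.3f}, std={few.std():.3f}")
print(f"many observations: mean={many.mean():.3f}, std={many.std():.3f}")

# Thompson sampling on a simulated Bernoulli bandit (assumed probabilities).
thetas = np.array([0.2, 0.5, 0.8])
alpha = np.ones(len(thetas))   # 1 + observed wins per arm
beta = np.ones(len(thetas))    # 1 + observed losses per arm
pulls = np.zeros(len(thetas), dtype=int)

for _ in range(2_000):
    draws = rng.beta(alpha, beta)          # one plausible win rate per arm
    arm = int(np.argmax(draws))            # play the arm with the highest draw
    reward = int(rng.random() < thetas[arm])
    alpha[arm] += reward                   # update the belief for the played arm
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per arm:", pulls)
```

Arms with few observations have wide distributions, so they occasionally produce a high draw and get explored. As evidence accumulates, the draws concentrate around the true win rate and the best arm is played most of the time.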

---

The function below is inspired by this [Kernel](https://www.kaggle.com/ilialar/simple-multi-armed-bandit#Competition-specific-changes).

In [None]:
!pip install kaggle-environments --upgrade

In [None]:
%%writefile my-sub-file.py

import numpy as np

basic_state = None
reward_full = 0
step_ending = None

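def basic_mab(observation, configuration):
    """Thompson-sampling agent: keeps Beta parameters [a, b] per bandit and plays the best draw."""

    no_reward_step = 0.1  # added to b after a pull that earned no reward
    decay_rate = 0.99     # shrinks a toward 1 each time a bandit is pulled

    global basic_state, reward_full, step_ending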

    if observation.step == 0:
        # Start every bandit with a uniform Beta(1, 1) prior, stored as [a, b]
        basic_state = [[1,1] for i in range(configuration["banditCount"])]
    else:
        # Reward earned on the previous step (observation["reward"] is treated as a running total)
        reward_final = observation["reward"] - reward_full
        reward_full = observation["reward"]

        # Index of our own action in lastActions (0 or 1)
        player = int(step_ending == observation.lastActions[1])

        if reward_final > 0:
            basic_state[observation.lastActions[player]][0] += reward_final
        else:
            basic_state[observation.lastActions[player]][1] += no_reward_step

        # Decay a toward 1 for the bandits pulled by either player
        basic_state[observation.lastActions[0]][0] = (basic_state[observation.lastActions[0]][0] - 1) * decay_rate + 1
        basic_state[observation.lastActions[1]][0] = (basic_state[observation.lastActions[1]][0] - 1) * decay_rate + 1

    # Thompson sampling: draw one value from each bandit's Beta distribution and play the bandit with the highest draw
    best_probability = -1
    agent_optimal = None
    for k in range(configuration["banditCount"]):
        probability = np.random.beta(basic_state[k][0],basic_state[k][1])
        if probability > best_probability:
            best_probability = probability
            agent_optimal = k

    step_ending = agent_optimal
    return agent_optimal

#### References

- CS229 Supplemental Lecture notes: [Hoeffding’s inequality](http://cs229.stanford.edu/extra-notes/hoeffding.pdf).

- RL Course by David Silver - Lecture 9: [Exploration and Exploitation](https://youtu.be/sGuiWX07sKw)

- Olivier Chapelle and Lihong Li. “An empirical evaluation of Thompson sampling.” NIPS. 2011.

-  https://web.eecs.umich.edu/~teneket/pubs/MAB-Survey.pdf

-  https://arxiv.org/pdf/1904.07272.pdf

In [None]:
%%writefile random_agent.py

import random

def random_agent(observation, configuration):
    return random.randrange(configuration.banditCount)

In [None]:
from kaggle_environments import make

env = make("mab", debug=True)

env.reset()
env.run(["random_agent.py", "my-sub-file.py"])
env.render(mode="ipython", width=800, height=700)

In [None]:
env.reset()
env.run(["my-sub-file.py", "my-sub-file.py"])
env.render(mode="ipython", width=800, height=700)