In [None]:
# default_exp noncontextual_bandits

# Intro



The aim of this notebook is to provide a low-tech overview of multi-armed bandit (MAB) solvers, specifically A/B testing and Thompson Sampling (TS). Others, such as Epsilon Greedy and UCB, MIGHT follow, but since TS is relatively undisputed when it comes to decision making, for a couple of reasons, this might not happen.

We will first look at the problem at hand, the multi-armed bandit, and how we can emulate an environment that houses one. Then we turn to the classical solution to MABs: A/B testing. By evaluating some of its weaknesses, which stem from the separation between an exploration phase and an exploitation phase, we will arrive at a theoretical construct that already almost resembles Thompson Sampling, from where it is an easy jump to the full thing.

![alt](https://miro.medium.com/max/1596/1*EnNqxjvgYcgP3qcvceEeHg.jpeg)


This notebook is aimed at people who want to apply TS with a working understanding of what it does; it forgoes the mathematical descriptions and proofs necessary to argue for TS in an academic setting.

Note that we're focusing on the simplest use case imaginable, which means we can use a simple beta-binomial model to compute our Thompson Sampling. This does little more than elaborate counting, AND STILL it is useful for outshining A/B tests in a categorical case.
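To make the "elaborate counting" concrete, here is a minimal sketch of one beta-binomial Thompson Sampling decision. It is not the implementation used later in this notebook: the function name `thompson_choose`, the uniform Beta(1, 1) prior, and the example counts are all assumptions made for illustration.

```python
import numpy as np

def thompson_choose(successes, failures, rng):
    """Pick an arm by drawing one sample from each arm's Beta posterior.

    Assumes a uniform Beta(1, 1) prior, so the posterior for each arm is
    Beta(1 + successes, 1 + failures): nothing more than counts.
    """
    samples = rng.beta(1 + successes, 1 + failures)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)

# Hypothetical click counts for two arms (e.g. blue and yellow).
successes = np.array([90, 10])
failures = np.array([10, 90])

# Repeat the decision many times to see how often each arm gets picked.
picks = [thompson_choose(successes, failures, rng) for _ in range(1000)]
share_arm_0 = picks.count(0) / len(picks)
print(share_arm_0)
```

With counts this lopsided, the sampler picks the first arm almost every time; with only a handful of observations per arm, the posteriors overlap and both arms still get tried.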

Note also that we stripped anything complex from the notebook to focus on the underlying concepts rather than, say, computing how many people we need in our groups to get statistically significant results (a concept we're also ignoring because... Bayes Rule(s)!).

# What is a multi-armed bandit situation?

In the multi-armed bandit setting, we are tasked with finding the best-performing action out of a selection of possible actions. An example could be the color of a button on a website. We could make it Blue or Yellow. Our theories are that Blue is easier to see but Yellow is a more inviting color. We now have to figure out which of these effects is stronger, i.e. which button color gets clicked more often (1).


We do not know which theory is correct and, unfortunately, we will never KNOW which button is better, since for that we'd need to know what would have happened if the users who saw yellow had been presented with blue and vice versa, for all eternity. Maybe the users who have seen it so far preferred yellow because it was summer. Or the background of an ad that popped up made it very hard to see the blue button.
This is, in essence, what we refer to as the dilemma of 'Exploration vs Exploitation': Do we rely on the information we have available to make a decision, or do we collect more data so that our future decisions will be better informed?

![alt](http://www.primarydigit.com/uploads/2/0/1/6/20168087/2155136.png?387)

In this notebook, we want to look at a few possible solutions to this problem.


# How do we test these approaches?

To do so, we need to simulate a situation in which a customer makes a choice that we do not know up front, since we don't have actual customers to ask. To that end, we generate an 'average customer' who reacts to the color of the button with probabilities reflecting how the real population MIGHT react (these are, of course, made up for the simulation). The probabilities are stored in our variable 'Theta'; it reflects how likely the general population is to click on a blue vs a yellow button. We convert a probability into a reaction by sampling a number between 0 and 1 and comparing it to the theta value for that choice; see customer_reaction().
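The sampling step described above can be sketched as follows. This is only an illustration of the idea, not the library's `customer_reaction()`: the function name `simulated_customer_reaction`, its signature, and the theta values are assumptions.

```python
import numpy as np

# Hypothetical click probabilities for each button color.
theta = {"blue": 0.6, "yellow": 0.4}

def simulated_customer_reaction(choice, theta, rng):
    """Return 1 (click) if a uniform draw in [0, 1) falls below theta[choice], else 0."""
    return int(rng.random() < theta[choice])

rng = np.random.default_rng(42)

# Show the blue button to many simulated customers and measure the click rate.
clicks = [simulated_customer_reaction("blue", theta, rng) for _ in range(10_000)]
observed_rate = float(np.mean(clicks))
print(observed_rate)  # should land near theta["blue"]
```

Each individual reaction is random, but over many customers the observed click rate approaches the theta value for that color, which is exactly the quantity a bandit solver has to estimate.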




 (1) This is not a real-life situation; there we might take additional parameters into account, such as the page background, which we'd ideally test, too.

In [None]:
import numpy as np

In [None]:
from thompson_sampling.helpers import showcase_code

In [None]:
from thompson_sampling.multi_armed_bandits import non_contextual_categorical_bandit

In [None]:
showcase_code('thompson_sampling/multi_armed_bandits.py', method_name='non_contextual_categorical_bandit')