**Thompson Sampling** is a simple **reinforcement learning** algorithm. In reinforcement learning, the agent generates its own training data by interacting with the world. It learns the consequences of its actions through trial and error rather than being told them explicitly.

Popular uses of Thompson Sampling include finding the advertisement that produces the maximum reward, Netflix-style item-based recommender systems, bidding and stock exchange, traffic light control, and industrial automation.

The basic algorithm is: <br/>
- Step 1: At each round n, we consider two numbers for each ad i:<br/>
        i. the number of times ad i received reward 1 up to round n<br/>
        ii. the number of times ad i received reward 0 up to round n
- Step 2: For each ad i, we take a random draw from a Beta distribution with parameters (number of 1 rewards + 1, number of 0 rewards + 1)<br/>
- Step 3: We select the ad with the highest random draw
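
The steps above can be sketched for a single round as follows. This is a minimal illustration, not the full loop: the helper name `select_arm` and the reward counts are hypothetical, chosen only to show how one Beta draw per ad decides the selection.

```python
import random

def select_arm(successes, failures, rng=random):
    """Draw one sample per ad from Beta(successes + 1, failures + 1)
    and return the index of the ad with the highest draw."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Hypothetical counts for three ads after some rounds (assumed values)
successes = [2, 40, 5]   # times each ad received reward 1
failures = [30, 20, 25]  # times each ad received reward 0

rng = random.Random(0)  # seeded generator so the sketch is repeatable
picks = [select_arm(successes, failures, rng) for _ in range(1000)]
print(picks.count(0), picks.count(1), picks.count(2))
```

With these counts, the second ad has by far the best observed reward rate, so it should win almost every draw. Ads with few observations have wider Beta distributions and therefore still get picked occasionally, which is how the algorithm keeps exploring.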

First, we import the libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

Then, we import the dataset

In [None]:
dataset = pd.read_csv('ad.csv')  # Use whatever dataset is available to you
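
If no dataset is at hand, one option is to simulate one. The sketch below assumes the layout the rest of this notebook relies on: one row per round and one 0/1 column per ad. The click rates, the helper names `simulate_clicks` and `write_csv`, the column names, and the file name `synthetic_ads.csv` are all made up for illustration.

```python
import csv
import random

def simulate_clicks(n_rounds, click_rates, seed=0):
    """Return n_rounds rows, each holding one 0/1 reward per ad."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in click_rates]
            for _ in range(n_rounds)]

def write_csv(path, rows):
    """Write the rows to a CSV file with one header column per ad."""
    header = [f"Ad {i + 1}" for i in range(len(rows[0]))]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Assumed click probabilities for 10 hypothetical ads
click_rates = [0.05, 0.13, 0.09, 0.16, 0.27, 0.01, 0.11, 0.20, 0.10, 0.04]
rows = simulate_clicks(10000, click_rates)

# Uncomment to save the file, then load it with pd.read_csv('synthetic_ads.csv')
# write_csv("synthetic_ads.csv", rows)
```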

Then, we implement the Thompson Sampling algorithm

In [None]:
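import random

N = 10000  # Total number of rounds; assumes the dataset has at least N rows
d = 10  # Total number of ads; assumes the dataset has at least d columns
ads_selected = []
number_of_rewards_1 = [0] * d  # Times each ad received reward 1
number_of_rewards_0 = [0] * d  # Times each ad received reward 0
total_reward = 0
for n in range(N):
    ad = 0
    max_random = 0
    # Draw one sample per ad from Beta(rewards_1 + 1, rewards_0 + 1)
    for i in range(d):
        random_beta = random.betavariate(number_of_rewards_1[i] + 1, number_of_rewards_0[i] + 1)
        # Keep the ad with the highest draw so far
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    # Observe the reward of the chosen ad in this round and update its counts
    reward = dataset.values[n, ad]
    if reward == 1:
        number_of_rewards_1[ad] += 1
    else:
        number_of_rewards_0[ad] += 1
    total_reward = total_reward + reward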
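
One way to sanity-check the loop is to compare its total reward against picking ads uniformly at random on the same data. The sketch below does this on simulated rewards. The click rates and the function names `thompson_total_reward` and `random_total_reward` are assumptions for illustration, not part of the dataset used above.

```python
import random

def thompson_total_reward(rewards):
    """Run Thompson Sampling over a table of 0/1 rewards
    (rows = rounds, columns = ads) and return the total reward."""
    d = len(rewards[0])
    ones = [0] * d   # times each ad received reward 1
    zeros = [0] * d  # times each ad received reward 0
    total = 0
    for row in rewards:
        draws = [random.betavariate(ones[i] + 1, zeros[i] + 1) for i in range(d)]
        ad = draws.index(max(draws))
        reward = row[ad]
        if reward == 1:
            ones[ad] += 1
        else:
            zeros[ad] += 1
        total += reward
    return total

def random_total_reward(rewards):
    """Baseline: pick an ad uniformly at random in every round."""
    return sum(row[random.randrange(len(row))] for row in rewards)

random.seed(0)
rates = [0.05, 0.10, 0.30]  # hypothetical click probabilities
rewards = [[1 if random.random() < p else 0 for p in rates] for _ in range(5000)]

ts_total = thompson_total_reward(rewards)
rand_total = random_total_reward(rewards)
print("Thompson Sampling:", ts_total, "Random selection:", rand_total)
```

Because Thompson Sampling concentrates on the ad with the highest click rate once it has enough evidence, its total should come out clearly above the random baseline on data like this.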

Finally, we visualize the results

In [None]:
plt.hist(ads_selected)
plt.title('Histogram of ad selections')
plt.xlabel('Ads selected')
plt.ylabel('Number of times each ad was selected')
plt.show()
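
The same information can also be read as plain counts, which is handy when a plot is not available. The list below is a short hypothetical stand-in for `ads_selected`, used only to keep the sketch self-contained.

```python
from collections import Counter

# Hypothetical selections; in the notebook this would be the ads_selected list
ads_selected = [4, 4, 0, 4, 7, 4, 4, 2, 4, 4]

counts = Counter(ads_selected)                   # ad index -> times selected
best_ad, best_count = counts.most_common(1)[0]   # most frequently selected ad
for ad, count in sorted(counts.items()):
    print(f"ad {ad}: selected {count} times")
```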