<a href="https://www.kaggle.com/code/salimhammadi07/multi-armed-bandits-using-greedy-policy?scriptVersionId=129532730" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# I. Understanding the problem

In probability theory and machine learning, the multi-armed bandit problem (sometimes called the K-armed or N-armed bandit problem) is a problem in which a fixed, limited set of resources must be allocated among competing (alternative) choices in a way that maximizes the expected gain. Each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or as resources are allocated to that choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits") who has to decide which machines to play, how many times to play each machine, in which order to play them, and whether to continue with the current machine or try a different one. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.

![](https://upload.wikimedia.org/wikipedia/commons/8/82/Las_Vegas_slot_machines.jpg)

Source: https://en.wikipedia.org/wiki/Multi-armed_bandit

# II. Importing dependencies

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import seaborn as sns
import os
import random
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('fivethirtyeight') 
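np.random.seed(0) # seed NumPy's RNG (used by the environment and the agent)
random.seed(0) # seed Python's RNG (used to draw the bandit probabilities)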

# III. Setting up the environment

In [None]:
class Environment:

    def __init__(self, nbr_bandits, probs = None):
        # Success probability of each bandit (None avoids a shared mutable default argument)
        self.probs = probs if probs is not None else []
        self.nbr_bandits = nbr_bandits

    def step(self, action):
        # Pull an arm and get a stochastic reward (1 for success, 0 for failure)
        return 1 if (np.random.random() < self.probs[action]) else 0

    def brandits(self):
        # Print the success probability of each bandit
        for i in range(self.nbr_bandits):
            print("Bandit #{} = {}% success rate".format(i, self.probs[i] * 100))

In [None]:
nbr_bandits = 10
probs = []
for i in range(nbr_bandits):
    probs.append(random.random()) 
    
env = Environment(nbr_bandits, probs) 

# IV. Setting up the Agent

In [None]:
class Agent:
    def __init__(self, nbr_action, epsilon = 0.1):
        self.nbr_action = nbr_action # total number of actions (arms) available to the agent
        self.epsilon = epsilon # exploration rate: probability of taking a random action instead of the greedy one
        self.n = [0]*nbr_action # number of times each action has been taken
        self.Q = [0]*nbr_action # estimated value Q(a) of each action
        
    def Q_value(self, action, reward):
        # Update Q action-value given (action, reward)
        self.n[action] += 1
        self.Q[action] += (1.0/self.n[action]) * (reward - self.Q[action])
        
    def act(self):
        # Epsilon-greedy policy
        
        # explore
        if np.random.random() < self.epsilon: 
            return np.random.randint(self.nbr_action)
        # exploit
        else: 
            return self.Q.index(max(self.Q))
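
The update in `Q_value` is the incremental form of the sample mean: after `n` pulls of an arm, `Q` equals the average of the rewards seen for that arm, without storing them. The snippet below is a minimal sketch that checks this on made-up data; the reward sequence and the variable names `arm_rewards`, `q`, and `n` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical 0/1 rewards observed for a single arm (illustration only)
rng = np.random.default_rng(0)
arm_rewards = rng.integers(0, 2, size=1000)

q, n = 0.0, 0
for r in arm_rewards:
    n += 1
    q += (1.0 / n) * (r - q)  # same rule as Agent.Q_value

# The incremental estimate matches the plain average of all rewards
print(np.isclose(q, arm_rewards.mean()))
```

Keeping only a count and a running estimate per arm means memory use stays constant no matter how many steps the agent takes.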

# VI. Experiment

## VI.1 ε-Greedy Policy

### VI.1.1 Implementation

In [None]:
episodes = 10000
agent = Agent(nbr_bandits)  
actions, rewards = [], []
for episode in tqdm(range(episodes)):
    action = agent.act() # sample policy
    reward = env.step(action) # take step + get reward
    agent.Q_value(action, reward) # update Q
    actions.append(action)
    rewards.append(reward)
#     print("Running multi-armed bandits with Action = {}, Reward obtained = {}".format(action, reward))
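
The loop above runs a single agent with the default `epsilon = 0.1`. To see how the exploration rate affects the outcome, the same loop can be wrapped in a function and run for several values. This is a rough sketch under stated assumptions: the helper `run_epsilon_greedy`, the three success rates in `demo_probs`, and the chosen epsilon values are all hypothetical and do not come from the notebook, and a single seeded run per value is too noisy to draw firm conclusions from.

```python
import numpy as np

def run_epsilon_greedy(probs, epsilon, steps=2000, seed=0):
    """Run one epsilon-greedy agent on Bernoulli arms and return its average reward."""
    rng = np.random.default_rng(seed)
    n_arms = len(probs)
    counts = np.zeros(n_arms, dtype=int)  # pulls per arm
    q = np.zeros(n_arms)                  # estimated value per arm
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(n_arms))  # explore: uniform random arm
        else:
            action = int(np.argmax(q))          # exploit: ties go to the lowest index
        reward = int(rng.random() < probs[action])
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # incremental mean
        total += reward
    return total / steps

demo_probs = [0.2, 0.5, 0.8]  # hypothetical success rates
results = {eps: run_epsilon_greedy(demo_probs, eps) for eps in (0.0, 0.01, 0.1, 0.5)}
for eps, avg in results.items():
    print(f"epsilon={eps}: average reward {avg:.3f}")
```

Averaging each setting over many seeds would give a fairer comparison than one run per value.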

### VI.1.2 Evaluation 

In [None]:
env.brandits()

In [None]:
print('Total rewards earned is {} out of {}'.format(sum(rewards), episodes))

In [None]:
plt.figure(figsize = (20, 10))

# Count how many times each action was selected (computed once, labelled by action index)
counts = pd.Series(actions).value_counts()
ax = counts.plot(kind='barh')
for index, value in enumerate(counts):
    ax.text(value, index, str(value))

In [None]:

df = pd.DataFrame({"The Estimated Rewards Qt(a)" : agent.Q,
                   "The Expected Rewards Q*(a)" : probs})
df.plot(kind="bar",figsize = (20,10))
plt.title("Comparing the Estimated Rewards to the Expected Rewards")
plt.xlabel("Bandits")
plt.ylabel("Bandit success probabilities")

In [None]:
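# Running average reward: mean of the rewards over the first t steps, for t = 1..episodes
# (a cumulative sum avoids re-summing the whole prefix at every step)
acc_rewards = np.cumsum(rewards) / np.arange(1, episodes + 1)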

df = pd.DataFrame({"steps" : range(episodes) ,
                   "Average Rewards" : acc_rewards})
df.plot(x = 'steps',y ='Average Rewards',figsize = (20,10))
plt.title("Average Rewards per step")
plt.xlabel("Steps")
plt.ylabel("Average Rewards")

In [1]:
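# The optimal action is the arm with the highest true success probability
best_action = int(np.argmax(probs))
# True where the agent picked the optimal arm (summing raw action indices would be meaningless)
is_optimal = np.array(actions) == best_action
# Running percentage of optimal choices over the first t steps, for t = 1..episodes
pct_optimal = 100 * np.cumsum(is_optimal) / np.arange(1, episodes + 1)

df = pd.DataFrame({"steps" : range(episodes) ,
                   "% Optimal Action" : pct_optimal})
df.plot(x = 'steps',y ='% Optimal Action',figsize = (20,10))
plt.title("% Optimal Action per step")
plt.xlabel("Steps")
plt.ylabel("% Optimal Action")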
