# Preparation

This notebook implements an epsilon-greedy multi-armed bandit.



Install and import the gym library:

In [None]:
!pip install gym > /dev/null 2>&1
import gym

The Gym library provides a large collection of 'environments' with a shared interface for testing reinforcement learning models. A listing of these environments is available in the Gym documentation. An environment for bandit problems isn't provided by default and needs to be downloaded and installed:

In [None]:
!git clone https://github.com/JKCooper2/gym-bandits.git > /dev/null 2>&1
!pip install /content/gym-bandits/. > /dev/null 2>&1
import gym_bandits

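This challenge compares 10 bandits and picks the one with the highest payout. Although every bandit's payout follows the same kind of distribution (Gaussian, as the environment name used below indicates), the average payout differs from bandit to bandit.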

![alt text](https://i.stack.imgur.com/SazYv.png)
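To make the idea of "same distribution, different average" concrete, here is a minimal sketch that does not use gym at all. The names `true_means`, `pull`, and `samples_per_arm` are hypothetical, and the unit variance is an assumption made for illustration, not a statement about how `gym_bandits` generates its payouts.

```python
import numpy as np

# Hypothetical stand-in for a ten-armed Gaussian bandit: every arm pays out
# from a normal distribution with the same spread but its own hidden mean.
rng = np.random.default_rng(0)
n_arms = 10
true_means = rng.normal(loc=0.0, scale=1.0, size=n_arms)  # assumed values

def pull(arm):
    """Return one noisy payout from the chosen arm."""
    return rng.normal(loc=true_means[arm], scale=1.0)

# Pulling each arm many times recovers its hidden mean approximately.
samples_per_arm = 2000
estimates = np.array([
    np.mean([pull(arm) for _ in range(samples_per_arm)])
    for arm in range(n_arms)
])

print("best arm by true mean:     ", int(np.argmax(true_means)))
print("best arm by sample average:", int(np.argmax(estimates)))
```

Sampling every arm thousands of times is wasteful when each pull has a cost, which is why the tasks below balance trying new bandits against replaying the best one found so far.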

Re-run this code to reset NumPy's random seed, so that the random values drawn afterwards are reproducible:

In [None]:
import numpy as np
np.random.seed(42) 

Create a variable to hold a 10-armed bandit environment:

In [None]:
env = gym.make('BanditTenArmedGaussian-v0')

Below is a summary of how OpenAI Gym models reinforcement learning, which will be useful for the upcoming tasks:

1.   An **agent** is a machine learning algorithm.
2.   The agent picks one of the possible **actions** from a defined list of actions.
3.   The action is fed to the **environment**, which in our example above is the ten-armed bandit environment.
4.   The environment evaluates the action and produces a **reward** signal (for example, positive or negative).
5.   The environment then produces an **observation** representing its current status.
6.   The observation and reward signals are fed back to the agent and influence the decision for its next action.
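The six steps above can be sketched as a plain loop. `ToyEnv`, `RandomAgent`, and `env_demo` are hypothetical names invented for this illustration; they only imitate the four-value `step()` return used later in this notebook and are not part of gym.

```python
import random

class ToyEnv:
    """Hypothetical two-action environment with a Gym-style step()."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def step(self, action):
        # Action 1 pays more on average than action 0.
        reward = self._rng.gauss(1.0 if action == 1 else 0.0, 1.0)
        observation = 0   # a bandit has a single, unchanging state
        done = True       # every pull is treated as its own episode
        info = {}
        return observation, reward, done, info

class RandomAgent:
    """Agent that ignores feedback and picks actions uniformly at random."""

    def __init__(self, n_actions, seed=0):
        self._rng = random.Random(seed)
        self.n_actions = n_actions

    def act(self, observation):
        return self._rng.randrange(self.n_actions)

env_demo = ToyEnv()
agent = RandomAgent(n_actions=2)

observation, total_reward = 0, 0.0
for _ in range(100):
    action = agent.act(observation)                          # steps 1-2
    observation, reward, done, info = env_demo.step(action)  # steps 3-5
    total_reward += reward  # step 6: a learning agent would use this feedback

print("total reward over 100 steps:", total_reward)
```

The random agent never uses the reward; the tasks below replace it with an agent that keeps a running payout estimate for each bandit.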


# Tasks

In [None]:
import numpy as np

env.seed(34)

numberofbandits = 10
q_table = np.zeros(numberofbandits)
n_table = np.ones(numberofbandits)

epsilon = 0.9
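Note that the pseudo-code in this notebook exploits the best-known bandit when the random number is *less than* epsilon, so `epsilon = 0.9` here means "exploit about 90% of the time". Many textbooks use the opposite convention, where epsilon is the exploration probability. The sketch below illustrates the notebook's convention; `choose_action`, `q_demo`, and `exploit_prob` are hypothetical names used only for this example.

```python
import numpy as np

def choose_action(q_values, exploit_prob, rng):
    """Pick the best-known arm with probability exploit_prob, else a random arm."""
    if rng.random() < exploit_prob:
        return int(np.argmax(q_values))      # exploit
    return int(rng.integers(len(q_values)))  # explore (may also hit the best arm)

rng = np.random.default_rng(0)
q_demo = np.array([0.1, 0.5, 0.2])  # made-up estimates; arm 1 looks best

picks = np.array([choose_action(q_demo, 0.9, rng) for _ in range(10_000)])
greedy_share = np.mean(picks == 1)
print("share of pulls on the best-known arm:", greedy_share)
```

The share comes out slightly above 0.9 because a random pick also lands on the best-known arm one time in three in this three-arm example.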

Implement the multi-armed bandit model, based on the following pseudo-code: 

```
Create a for loop and run it 1000 times

      If statement: generate a random number between 0 and 1; if this number is less than 
      epsilon, use the best-scoring bandit discovered so far
      
            Inside the if statement: get the position (index) in our array of the current max value within our
            table, this index is the bandit that is giving the best payout so far.
            
            Inside the if statement: set your action variable equal to the index we discovered in the last 
            statement
            
      Else: if the number is greater than or equal to epsilon,
      choose a random bandit 
      
          
            Inside the else statement: generate a random number between 0 and the total number of bandits
            
            Inside the else statement: set your action variable equal to the generated random number.
            
        
            
      Inside the loop: feed the action variable into our environment by calling its step function with 
      the action chosen in either of the branches above
      
      Inside the loop: now that we have gained some new information from our environment, we need to update our 
      Q_table. We do this based on the formula: Q_n+1 = Q_n + (R_n - Q_n)/n, in other words:
      NewQvalue = OldQvalue + ((reward - OldQvalue)/numberOfTimesLeverHasBeenPulledForThisBandit)
      

      Inside the loop: now that we have updated our Q table, we also need to update the table that is keeping
      track of how many times each bandit's lever has been pulled. Do this by adding +1 in the position
      of our currently selected bandit in the n_table array
      
      
Outside the loop: once everything is done, we would like to print the Bandit with the highest score! Using
a print statement, and numpy's argmax function, using our Q table, print the bandit with the highest
average payout
      
   
   
  
```
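The update rule in the pseudo-code is an incremental mean: dividing the error `reward - OldQvalue` by the pull counter keeps the estimate equal to the average of all rewards seen so far, without storing them. A minimal check with made-up rewards for a single bandit, using the same starting values as `q_table` (zero) and `n_table` (one):

```python
import numpy as np

rewards = np.array([1.0, 3.0, 2.0, 6.0])  # made-up rewards for one bandit

q = 0.0  # running estimate, starts at zero like q_table
n = 1    # pull counter, starts at one like n_table
for r in rewards:
    q = q + (r - q) / n  # NewQvalue = OldQvalue + (reward - OldQvalue) / n
    n = n + 1

print(q, rewards.mean())  # prints: 3.0 3.0
```

Because the counter starts at one, the first update sets the estimate to the first reward exactly, so the initial zero never biases the average.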



In [None]:
import random  # needed for random.uniform and random.randrange below

#Create a for loop and run it 1000 times
loop_size = 1000
for _ in range(loop_size):

 # If statement: generate a random number between 0 and 1; if this number is less than 
 # epsilon, use the best-scoring bandit discovered so far
 if random.uniform(0, 1) < epsilon:
    
    #Inside the if statement: get the position (index) in our array of the current max value within our
    #table, this index is the bandit that is giving the best payout so far.
    index_current_max = np.argmax(q_table)
    
    #Inside the if statement: set your action variable equal to the index we discovered in the last 
    #statement
    action = index_current_max
 
 #Else: if the number is greater than or equal to epsilon, choose a random bandit 
 else:
    #Inside the else statement: generate a random number between 0 and the total number of bandits
    #Inside the else statement: set your action variable equal to the generated random number.
    action = random.randrange(len(n_table))
     
 #Inside the loop: feed the action variable into our environment by calling its step function with 
 #the action chosen in either of the branches above
 new_state, reward, done, info = env.step(action)
 
 #Inside the loop: now that we have gained some new information from our environment, we need to update our 
 #Q_table. We do this based on the formula: Q_n+1 = Q_n + (R_n - Q_n)/n, in other words:
 #NewQvalue = OldQvalue + ((reward - OldQvalue)/numberOfTimesLeverHasBeenPulledForThisBandit)
 old_qvalue = q_table[action]
 new_qvalue = old_qvalue + ((reward - old_qvalue)/ n_table[action])
 q_table[action] = new_qvalue
 #Inside the loop: now that we have updated our Q table, we also need to update the table that is keeping
 #track of how many times each bandit's lever has been pulled. Do this by adding +1 in the position
 #of our currently selected bandit in the n_table array
 n_table[action] = n_table[action] + 1
 
#Outside the loop: once everything is done, we would like to print the Bandit with the highest score! Using
#a print statement, and numpy's argmax function, using our Q table, print the bandit with the highest
#average payout
print('Bandit with highest payout:', np.argmax(q_table))

Bandit with highest payout: 3


In [None]:
q_table


array([ 1.04311304, -0.34239422,  0.09329693,  1.56039608, -0.65421688,
       -0.57121621,  1.39800432,  1.46131413, -0.89527164,  0.33579774])

In [None]:
n_table

array([ 10.,  11.,  20., 883.,   9.,  10.,  30.,  11.,  10.,  16.])
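One way to read these two tables together is sketched below. The arrays are copied from the outputs above; `pulls` and `share` are hypothetical helper names. Since `n_table` was initialised with ones, the actual pull count per bandit is one less than the stored value.

```python
import numpy as np

# Values copied from the recorded outputs above.
q_table_result = np.array([1.04311304, -0.34239422, 0.09329693, 1.56039608,
                           -0.65421688, -0.57121621, 1.39800432, 1.46131413,
                           -0.89527164, 0.33579774])
n_table_result = np.array([10., 11., 20., 883., 9., 10., 30., 11., 10., 16.])

pulls = n_table_result - 1   # undo the initial value of one per bandit
share = pulls / pulls.sum()  # fraction of all pulls spent on each bandit

print("total pulls:", int(pulls.sum()))
print("most-pulled bandit:", int(np.argmax(pulls)))
print("its share of pulls:", round(float(share.max()), 3))
print("highest estimate:", int(np.argmax(q_table_result)))
```

In this recorded run, bandit 3 has both the highest estimate and about 88% of the 1000 pulls, which is roughly what exploiting about 90% of the time would suggest. Bandits 6 and 7 have estimates close to bandit 3 but only a handful of pulls each, so their estimates are much noisier.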