# Multi-arm Bandits
## Feedback
**Evaluative Feedback**: 
- In its pure form, depends only on the action taken
- Tells us how good the action we took was
- Doesn't tell us which action was best

**Instructive Feedback**:  
- In its pure form, independent of the action taken
- Tells us which action was best to take
- Doesn't indicate how well our action (or any other for that matter) performed
- Used in its pure form for Supervised Learning


## K-armed Bandit

You repeatedly face a choice among K different options (actions). After each action, you receive an immediate numerical reward whose distribution depends on the action taken.  
The objective is to maximize the expected total reward over N time steps.

$A_t$: Action selected at time $t$  
$R_t$: Reward for $A_t$  
$q_*(a) = \mathbb{E}[R_t \mid A_t=a]$: The expected value of the reward, given that action $a$ is selected

So what is the problem? We can just always choose the action with the highest expected reward!  
This is true! When we know $q_*(a)$, we can be *greedy*, **exploit** this information, and choose the most valuable action.  
But in most situations we do not know $q_*(a)$.  
Because we don't know $q_*(a)$, we need to **explore** to increase our certainty about it for the different actions, i.e. make our estimate at time $t$, $Q_t(a)$, as close to $q_*(a)$ as possible.





## Action-Value Methods
These methods are used to estimate the true *value* of an action and to use the estimates when selecting actions

### Method 1: *Sample Average*
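The estimated value of an action is the mean of the rewards received from taking that action up to the current time.  
Formally:
$\Large{Q_t(a) = \frac{\sum_{i=1}^{t-1}{R_i \cdot \mathbb{1}_{A_i=a}}}{\sum_{i=1}^{t-1}{\mathbb{1}_{A_i=a}}}}$  
$\mathbb{1}_{predicate}$ = 1 if the predicate is true, else 0. If the denominator is zero (the action has never been taken), $Q_t(a)$ is set to a default value such as 0.  

As the number of times $a$ is selected grows, $\sum_{i=1}^{t-1}{\mathbb{1}_{A_i=a}} \rightarrow \infty$, the law of large numbers gives $Q_t(a) \rightarrow q_*(a)$  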
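
As a minimal sketch of the sample-average formula above, the two indicator sums can be computed directly with NumPy. The function name `sample_average`, the `default` argument for untried actions, and the toy history are illustrative assumptions, not part of the notebook code.

```python
import numpy as np

def sample_average(actions, rewards, a, default=0.0):
    """Sample-average estimate Q_t(a) from the history of steps 1..t-1."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    taken = actions == a               # indicator 1_{A_i = a}
    n = taken.sum()                    # number of times a was selected
    if n == 0:
        return default                 # a was never tried (assumed default)
    return rewards[taken].sum() / n    # mean reward over steps with A_i = a

# Toy history, made up for illustration: action 2 was taken on three steps
actions = [1, 2, 2, 2, 3]
rewards = [-1.0, 1.0, -2.0, 2.0, 0.0]
q2 = sample_average(actions, rewards, a=2)   # (1 - 2 + 2) / 3
q4 = sample_average(actions, rewards, a=4)   # never taken, falls back to default
```

Recomputing the sums from the full history on every step is the simplest form; it is enough to illustrate the definition, even though it is not the cheapest way to maintain the estimate.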

We can couple this equation, with the selection method: $A_t=\underset{a}{\arg\max} {Q_t(a)}$ for a *greedy* selection process

#### $\epsilon$-greedy Selection Method
Since we also want to **explore**, we act greedily most of the time but, with a small probability $\epsilon$, select an action uniformly at random instead:  
$A_t = \begin{cases} \underset{a}{\arg\max}\ Q_t(a) & \text{with probability } 1-\epsilon \\ \text{a uniformly random action} & \text{with probability } \epsilon \end{cases}$  

In this case, we take an **exploratory** (random) step on a fraction $\epsilon$ of the time steps, and **exploit** on the remaining $1-\epsilon$  
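
The selection rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the notebook's implementation: `epsilon_greedy` is a hypothetical helper, the estimates in `Q` are made up, and ties in the greedy branch are resolved by `np.argmax`, which returns the first maximal index.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Select an action index from the estimates Q with the epsilon-greedy rule."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform over all actions
    return int(np.argmax(Q))               # exploit: first maximal estimate

rng = np.random.default_rng(0)
Q = np.array([0.1, 0.5, -0.2])             # made-up estimates, action 1 is greedy
picks = np.array([epsilon_greedy(Q, 0.1, rng) for _ in range(10_000)])
greedy_fraction = np.mean(picks == 1)      # share of steps that chose the greedy action
```

Because the random branch draws from all actions, it can also return the greedy one, so `greedy_fraction` ends up somewhat above $1-\epsilon$.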

-----

**Exercise 2.1**: In $\epsilon$-greedy action selection, for the case of two actions and $\epsilon$ = 0.5, what is the probability that the greedy action is selected?  

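**Answer**: The greedy action is selected on every exploiting step, and also on those exploratory steps where the uniformly random choice happens to land on it. With two actions, the random choice picks the greedy action half of the time, so:  
$\Pr(\text{greedy}) = (1-\epsilon) + \epsilon \cdot \frac{1}{2} = 0.5 + 0.25 = 0.75 = 75\%$  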
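
The probability of selecting the greedy action can be double-checked by summing over the two branches of the selection rule. The sketch below uses exact fractions and assumes a single greedy action and a uniform random branch over all $k$ actions; `greedy_probability` is a hypothetical helper.

```python
from fractions import Fraction

def greedy_probability(epsilon, k):
    """P(greedy action is selected) for epsilon-greedy with k actions."""
    exploit = 1 - epsilon                  # greedy branch always picks the greedy action
    explore = epsilon * Fraction(1, k)     # random branch picks it 1/k of the time
    return exploit + explore

p = greedy_probability(Fraction(1, 2), 2)
print(p)  # 3/4
```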

-----

## 10-armed Testbed

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

class bandit_arm():
    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std
        self.distribution = np.random.normal
        
    def r(self):
        while True:
            yield self.distribution(loc=self.mean, 
                                    scale=self.std)

class bandit():
    def __init__(self, k: int, eps: float):
        self.A = self._create_arms(k)
        self.R = dict()
        self.epsilon = eps
        self.total_reward = 0
        self.cum_rewards = []
        
    def game(self, T: int):
        # Reset all per-game state so that repeated games do not mix results
        self.R = self._init_rewards()
        self.total_reward = 0
        self.cum_rewards = []
        for i in range(T):
            Rt = self.play()
            self.total_reward += Rt
            self.cum_rewards.append(self.total_reward)
        return self.total_reward
            
    def play(self):
        a = self._choose_action()
        return self._do_action(a)
        
    def _create_arms(self, k: int):
        return [bandit_arm(np.random.normal(loc=0, scale=20), abs(np.random.normal(loc=0, scale=2))) for a in range(k)]
    
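    def _init_rewards(self):
        # Seed every arm with a single 0 so that Q_1(a) = 0 for all arms.
        # Note: the seed is stored as a sample, so it is included in the
        # sample average and in the "# chosen" counts printed below.
        rewards = dict()
        for arm in range(len(self.A)):
            rewards[arm] = np.zeros(1)
        return rewards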
        
    def _Q(self, Ra: np.array):
        def _sample_average(Ra: np.array):
            mu = Ra.mean()
            return mu
    
        value_function = _sample_average
        return value_function(Ra)
    
    def _choose_action(self):
        u = np.random.uniform()
        if u > self.epsilon:
            # Exploit: pick the arm with the highest estimate
            # (uses self rather than the global instance b)
            Qt_with_indexes = [(self._Q(self.R[k]), k) for k in self.R.keys()]
            Qt = [q for q, _ in Qt_with_indexes]
            if Qt:
                chosen_arm = np.argmax(Qt)
                chosen_arm = Qt_with_indexes[chosen_arm][1]
            else:
                chosen_arm = np.random.choice(range(len(self.A)))
        else:
            # Explore: pick an arm uniformly at random
            chosen_arm = np.random.choice(range(len(self.A)))
        return chosen_arm
    
    def _do_action(self, a):
        # Draw one reward from the chosen arm and record it in the history
        Rt = next(self.A[a].r())
        self.R.setdefault(a, np.array([]))
        self.R[a] = np.append(self.R[a], Rt)
        return Rt

k = 5
epsilon = 0.1
b = bandit(k, epsilon)
T = 2000
total_reward = b.game(T)
print(f'Total Reward: {total_reward}')
for i in b.R:
    print(f'arm {i}:\t # chosen: {len(b.R[i])}\t mean: {np.mean(b.R[i])}\t true: {b.A[i].mean}\t delta: {abs(b.A[i].mean-np.mean(b.R[i]))}\t delta %: {abs(1-b.A[i].mean/np.mean(b.R[i]))}\n')

Total Reward: 31896.887336695825
arm 0:	 # chosen: 40	 mean: -6.421229486912091	 true: -6.961145778345863	 delta: 0.5399162914337712	 delta %: 0.08408300817378378

arm 1:	 # chosen: 32	 mean: -24.183908519868076	 true: -25.01095101124033	 delta: 0.8270424913722536	 delta %: 0.03419804911570856

arm 2:	 # chosen: 107	 mean: 0.6090254975355811	 true: 0.6560828318406078	 delta: 0.04705733430502668	 delta %: 0.07726660787675388

arm 3:	 # chosen: 47	 mean: -2.6495736440723805	 true: -2.4598047298925043	 delta: 0.18976891417987618	 delta %: 0.07162243427519999

arm 4:	 # chosen: 1779	 mean: 18.54243160305969	 true: 18.466928280679273	 delta: 0.0755033223804169	 delta %: 0.004071921309822035



-----

**Exercise 2.2**: *Bandit example.* Consider a k-armed bandit problem with $k = 4$ actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using $\epsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a) = 0$, for all $a$. Suppose the initial sequence of actions and rewards is $A_1 = 1, R_1 = -1, A_2 = 2, R_2 = 1, A_3 = 2, R_3 = -2, A_4 = 2, R_4 = 2, A_5 = 3, R_5 = 0$. On some of these time steps the $\epsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?

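**Answer**:
According to the definition we have $k=4$, sample-average estimates, and $Q_1(a)=0$ for all $a$.  
Let's track the estimates $Q_t(a)$ available before each selection:
- $A_1 = 1$, $R_1 = -1$: all $Q_1(a)=0$, so every action is tied for greedy. Possibly random.
- $A_2 = 2$, $R_2 = 1$: $Q_2(1) = -1$ and the rest are 0, so actions 2, 3 and 4 are tied for greedy. Possibly random.
- $A_3 = 2$, $R_3 = -2$: $Q_3(2) = 1$ is the maximum, so action 2 is greedy. Possibly random.
- $A_4 = 2$, $R_4 = 2$: $Q_4(2) = -0.5$ while $Q_4(3) = Q_4(4) = 0$, so action 2 is not greedy. Definitely random.
- $A_5 = 3$, $R_5 = 0$: $Q_5(2) = 1/3$ is the maximum, so action 3 is not greedy. Definitely random.

The $\epsilon$ case definitely occurred on time steps 4 and 5. It could possibly have occurred on time steps 1, 2 and 3 as well, because a random selection can also land on a greedy action.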
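
The step-by-step tracking can be replayed mechanically. The sketch below is only an illustration: it takes the action/reward pairs with the signs used in the tracked steps above, recomputes the sample averages before every step, and marks a step as "definitely" exploratory when the chosen action is not among the greedy ones. The variable names are assumptions.

```python
k = 4
# (action, reward) pairs of the exercise; actions are 1-based
history = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]

sums = [0.0] * k      # total reward collected per action
counts = [0] * k      # number of times each action was taken
verdicts = []
for a, r in history:
    # Sample-average estimates before this step (0 for untried actions)
    Q = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
    best = max(Q)
    greedy = {i + 1 for i in range(k) if Q[i] == best}
    # A non-greedy choice can only come from the random (epsilon) branch
    verdicts.append("possibly" if a in greedy else "definitely")
    sums[a - 1] += r
    counts[a - 1] += 1

print(verdicts)
# ['possibly', 'possibly', 'possibly', 'definitely', 'definitely']
```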

-----

**Exercise 2.3**: In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.

**Answer**: Since we are asked about the long run, we will simplify and assume that the algorithm always chooses the optimal action when not in $\epsilon$ (exploratory) mode, and that an exploratory step yields the mean reward of a non-optimal action.  
Now, let's look at the algorithms:  

The difference between them is the value of $\epsilon$, and we will look at an example of:  
- $T: 1000$
- $RL1: \epsilon=0.1$ 
- $RL2: \epsilon=0.3$
- Notations:
  - $\mu(a^*)$: Mean of optimal action
  - $\mu(a)$: Mean for not-optimal action

So, what will be the difference in expected cumulative reward between 1 and 2?

$ T \times \{[(1-\epsilon_1)\mu(a^*) + \epsilon_1\mu(a)] - [(1-\epsilon_2)\mu (a^*) + \epsilon_2\mu(a)]\} = $  
$= T \times [(\epsilon_2 - \epsilon_1)\mu(a^*) + (\epsilon_1 - \epsilon_2)\mu(a)] =$  
$= T \times [(\epsilon_2 - \epsilon_1)\mu(a^*) - (\epsilon_2 - \epsilon_1)\mu(a)] =$  
$= T \times [(\epsilon_2 - \epsilon_1)\times(\mu(a^*) - \mu(a))] =$  
$= T \times \Delta\epsilon\times\Delta\mu =$  
$= 1000 \times 0.2\times(\mu(a^*)-\mu(a)) = 200 \times \Delta\mu $

Under this simplification, the method with the smaller $\epsilon$ collects $200 \times \Delta\mu$ more reward over the 1000 steps.
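
The derivation can be checked numerically. The sketch below plugs made-up means into the simplified model above, where a greedy step earns $\mu(a^*)$ and an exploratory step earns $\mu(a)$; the function name `expected_total` and the values of `mu_opt` and `mu_other` are assumptions chosen so that $\Delta\mu = 1$.

```python
def expected_total(T, epsilon, mu_opt, mu_other):
    """Expected cumulative reward under the simplified long-run model."""
    return T * ((1 - epsilon) * mu_opt + epsilon * mu_other)

T = 1000
mu_opt, mu_other = 1.5, 0.5      # made-up means, so delta mu = 1.0
eps1, eps2 = 0.1, 0.3

diff = expected_total(T, eps1, mu_opt, mu_other) - expected_total(T, eps2, mu_opt, mu_other)
closed_form = T * (eps2 - eps1) * (mu_opt - mu_other)   # T * delta eps * delta mu
```

Both `diff` and `closed_form` come out at about 200 (up to floating-point rounding), matching $200 \times \Delta\mu$ for $\Delta\mu = 1$.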

-----