# Multi-armed Bandits
## Feedback
**Evaluative Feedback**: 
- In its pure form, depends only on the action taken
- Tells us how good the action we took was
- Doesn't tell us which action was best

**Instructive Feedback**:  
- In its pure form, independent of the action taken
- Tells us which action was best to take
- Doesn't indicate how well our action (or any other for that matter) performed
- Used in its pure form for Supervised Learning


## K-armed Bandit

You repeatedly face a choice among K different options (actions). After each action, you receive an immediate numerical reward that depends on the action taken.  
The objective is to maximize the total reward over N time steps.

$A_t$: Action selected at time $t$  
$R_t$: Reward for $A_t$  
$\large{q_*(a)}$: $\large{\mathbb{E}[R_t|A_t=a]}$    //The expected value of reward, given action $a$ is selected

So what is the problem? We can just always choose the action with the highest expected reward!  
This is true! When we know $q_*(a)$ we can be *greedy* and **exploit** this information by choosing the most valuable action.  
BUT, in most situations we will not know what $q_*(a)$ is.  
Because we don't know $q_*(a)$, we need to **explore** to increase our certainty about $q_*(a)$ for the different actions, making our estimate at time $t$, $Q_t(a)$, as close to $q_*(a)$ as possible.
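
To make the setup concrete, here is a minimal sketch of a simulated bandit. The `Bandit` class, the example values in `q_star`, and the choice of unit-variance Gaussian rewards are all assumptions made for illustration; the notes do not fix a reward distribution.

```python
import random


class Bandit:
    """Hypothetical k-armed bandit: each arm pays a noisy reward around a hidden mean."""

    def __init__(self, q_star, seed=0):
        self.q_star = list(q_star)      # true action values q_*(a), hidden from the agent
        self.rng = random.Random(seed)

    def pull(self, a):
        # Assumption: R_t ~ Normal(q_*(a), 1)
        return self.rng.gauss(self.q_star[a], 1.0)


bandit = Bandit(q_star=[0.2, 1.5, -0.3])

# If q_* were known, exploiting is just an argmax over the true values.
best_action = max(range(len(bandit.q_star)), key=lambda a: bandit.q_star[a])

# A single reward is noisy, so one pull says little about q_*(a) ...
one_reward = bandit.pull(best_action)

# ... but the mean of many pulls gets close to it.
n = 10_000
mean_reward = sum(bandit.pull(best_action) for _ in range(n)) / n
print(best_action, one_reward, mean_reward)
```

An agent only ever sees the outputs of `pull`, never `q_star` itself, which is why it has to spend some pulls exploring.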





## Action-Value Methods
These methods are used to estimate the true *value* of an action

### Method 1: *Sample Average*
The value of an action is estimated as the mean reward received from taking that action up to the current time.  
We can formulate it as:  
$\Large{Q_t(a) = \frac{\sum_{i=1}^{t-1}{R_i \cdot \mathbb{1}_{A_i=a}}}{\sum_{i=1}^{t-1}{\mathbb{1}_{A_i=a}}}}$  
where $\mathbb{1}_{predicate}$ = 1 if the predicate is true, else 0  

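As the number of times action $a$ has been taken grows, i.e. $\sum_{i=1}^{t-1}{\mathbb{1}_{A_i=a}} \rightarrow \infty$, the estimate converges to the true value: $Q_t(a)$ $\rightarrow$ $q_*(a)$ (by the law of large numbers)  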

We can couple this estimate with the selection rule $A_t=\underset{a}{\arg\max}\, {Q_t(a)}$ to get a *greedy* selection process
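
The sample-average formula and the greedy rule can be sketched in a few lines. The function names, the example history, and the default estimate of `0.0` for an action that has never been tried are assumptions for illustration.

```python
def sample_average(actions, rewards, a):
    """Q_t(a): mean of the rewards received on the steps where action a was taken."""
    total = sum(r for act, r in zip(actions, rewards) if act == a)  # numerator
    count = sum(1 for act in actions if act == a)                   # denominator
    if count == 0:
        return 0.0  # assumption: default estimate for an untried action
    return total / count


def greedy_action(actions, rewards, n_actions):
    """A_t = argmax_a Q_t(a); ties go to the lowest index (an assumption)."""
    estimates = [sample_average(actions, rewards, a) for a in range(n_actions)]
    return max(range(n_actions), key=lambda a: estimates[a])


# Hypothetical history A_1..A_5 and R_1..R_5
actions = [0, 1, 0, 2, 0]
rewards = [1.0, 0.5, 2.0, -1.0, 3.0]

q0 = sample_average(actions, rewards, 0)   # (1 + 2 + 3) / 3
q1 = sample_average(actions, rewards, 1)   # 0.5 / 1
q3 = sample_average(actions, rewards, 3)   # never chosen, falls back to the default
choice = greedy_action(actions, rewards, n_actions=4)
```

Recomputing the sums from the full history on every step is the most literal reading of the formula, not the most efficient one.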

#### $\large\epsilon$-greedy Selection Method
Since we also want to **explore**, we draw a random number $e$ uniformly from $[0, 1]$ at each step and explore whenever $e \leq \epsilon$, i.e. about $\epsilon$ of the time:  
$A_t(e)= \begin{cases} \underset{a}{\arg\max}\, {Q_t(a)} & \text{for } 1 \geq e \gt \epsilon \\ Random(a) & \text{for } \epsilon \geq e \geq 0 \end{cases}$  

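In this case, we know that we **explore** for $\epsilon$ of the time, and **exploit** for $1-\epsilon$ of the time. Note that if $Random(a)$ draws uniformly from all actions, an exploration step can still happen to pick the greedy action.  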
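
The selection rule can be sketched as follows. The function name `epsilon_greedy`, the example estimates in `Q`, and the tie-breaking behaviour are assumptions for illustration.

```python
import random


def epsilon_greedy(Q, epsilon, rng=random):
    """Pick an action index from the estimates Q using the epsilon-greedy rule."""
    e = rng.random()  # e drawn uniformly from [0, 1)
    if e > epsilon:
        # Exploit: greedy action (ties go to the lowest index, an assumption)
        return max(range(len(Q)), key=lambda a: Q[a])
    # Explore: uniform over ALL actions, the greedy one included
    return rng.randrange(len(Q))


Q = [0.1, 0.9, 0.4]  # hypothetical current estimates Q_t(a)
rng = random.Random(1)

greedy_picks = [epsilon_greedy(Q, 0.0, rng) for _ in range(100)]   # never explores
explore_picks = [epsilon_greedy(Q, 1.0, rng) for _ in range(100)]  # always explores
```

Setting `epsilon` to `0.0` recovers the purely greedy process, while `1.0` ignores the estimates entirely.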

-----

**Exercise 2.1**: In $\epsilon$-greedy action selection, for the case of two actions and $\epsilon = 0.5$, what is the probability that the greedy action is selected?  

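**Answer**: The exploit branch is taken $1-\epsilon = 0.5$ of the time and always selects the greedy action. The explore branch is taken $\epsilon = 0.5$ of the time and, choosing uniformly between the two actions, selects the greedy action half of the time, so:  
$\Pr(\text{greedy}) = (1-\epsilon) + \frac{\epsilon}{2} = 0.5 + 0.25 = 0.75 = 75\%$  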
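
An answer to this exercise can be sanity-checked by simulation. The sketch below counts how often the greedy action is chosen, assuming the exploration step draws uniformly from all actions, the greedy one included. The function name, its defaults, and the choice of action `0` as the greedy one are hypothetical.

```python
import random


def greedy_fraction(epsilon, n_actions=2, trials=100_000, seed=0):
    """Fraction of steps on which epsilon-greedy ends up taking the greedy action."""
    rng = random.Random(seed)
    greedy = 0  # assumption: action 0 is the greedy one
    hits = 0
    for _ in range(trials):
        if rng.random() > epsilon:
            a = greedy                    # exploit
        else:
            a = rng.randrange(n_actions)  # explore uniformly, greedy action included
        hits += (a == greedy)
    return hits / trials


frac = greedy_fraction(0.5)
print(frac)
```

Increasing `n_actions` shows the fraction moving toward $1-\epsilon$, since exploration lands on the greedy action less and less often.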

-----