<a href="https://colab.research.google.com/github/shengy90/reinforcement-learning-an-introduction/blob/master/Chapter_2_Multiarmed_Bandits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this chapter, we look at the simplest form of a reinforcement learning problem: the k-armed bandit problem.

# 2.1 A k-armed Bandit Problem

**Problem specification**
- you repeatedly have to make a choice amongst k different possible actions
- after each choice, you receive a numerical reward chosen from a *stationary probability distribution* that depends on your chosen action
- your objective is to maximise the *expected total reward* over some time period 
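
To make the setup concrete, here is a minimal sketch of a stationary k-armed bandit. The class name `GaussianBandit`, the Gaussian reward model and the unit variance are assumptions chosen for illustration (they mirror the 10-armed testbed discussed later); the problem itself only requires *some* stationary reward distribution per action.

```python
import random


class GaussianBandit:
    """Hypothetical stationary k-armed bandit with Gaussian rewards."""

    def __init__(self, k=10, seed=None):
        self.rng = random.Random(seed)
        self.k = k
        # True action values q_*(a); the agent never sees these directly.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Return a reward drawn from N(q_*(action), 1)."""
        return self.rng.gauss(self.q_star[action], 1.0)

    def optimal_action(self):
        """Index of the action with the highest true value."""
        return max(range(self.k), key=lambda a: self.q_star[a])
```

An agent would call `pull(a)` once per time step and try to maximise the sum of the returned rewards.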

**Important concepts:**

- each of the *k* actions has an expected (or mean) reward given that the action was selected: $$q_{*}(a) \dot = \mathbb{E}\big[R_{t} \mid A_{t} = a\big]$$ where 
$A_{t}$ is the action selected at time $t$, $R_{t}$ is the corresponding reward, and $q_{*}(a)$ is the expected reward given that $a$ was selected *(aka the value of action $a$)*.

- we usually do not know the value of each action with certainty, although we might have estimates. We denote the *estimated value of action $a$ at time step $t$ as $Q_{t}(a)$*, and we would like $Q_{t}(a)$ to be as close to $q_{*}(a)$ as possible.

- *greedy actions* are actions whose *estimated value* $Q_{t}(a)$ is the greatest. Choosing a *greedy* action is also known as *exploiting*. *Exploration* refers to choosing a *nongreedy* action. Reward might be lower in the short run when exploring, but it *could* be higher in the long run, as exploring allows us to discover actions that might have higher rewards.

- there are many ways to 'balance' (or optimise) between exploration and exploitation, though many of these methods assume *stationarity*, which may or may not apply

In this chapter, we'll explore simple balancing methods for the k-armed bandit problem and show that they work better than *always-exploit* methods.

# 2.2 Action-value Methods

**`Action-value methods`** are methods that *estimate the values of actions* and use these estimates to *make decisions*. An example:



$$Q_{t}(a) \dot = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}$$

where:
- $\mathbb{1}_{predicate}$ is a random variable that is 1 if the predicate is true and 0 if not
- if the denominator is 0, we define $Q_{t}(a)$ as some default value, e.g. 0 (to avoid a division-by-zero error)
- as the denominator tends to infinity (i.e. action $a$ is chosen many, many times), $Q_{t}(a)$ converges to $q_{*}(a)$ by the law of large numbers
- this is also known as the `sample-average` method, as it estimates each action value by averaging the rewards sampled when that action was taken
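
The sample-average formula can be sketched directly from a recorded history. The function name `sample_average` and the list-based history are illustrative assumptions; the default of 0 for an untried action follows the convention mentioned above.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Estimate Q_t(a) from the actions and rewards observed before time t."""
    # Rewards received on the steps where action `a` was taken.
    taken = [r for act, r in zip(actions, rewards) if act == a]
    if not taken:
        # Denominator is zero: fall back to the default value.
        return default
    return sum(taken) / len(taken)


actions = [0, 1, 0, 2, 0]
rewards = [1.0, 0.5, 3.0, -1.0, 2.0]
print(sample_average(actions, rewards, a=0))  # 2.0
print(sample_average(actions, rewards, a=3))  # 0.0 (never taken)
```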

**Recall that a `greedy` action is an action with the highest estimated value.** We can define greedy action selection as:  

$$A_{t} \dot = \arg\max_a Q_{t}(a)$$

An alternative to always acting greedily is to behave greedily *most of the time*, but every once in a while, with a small probability $\varepsilon$, select randomly from among all the actions with equal probability instead (regardless of their estimated values). We call this method the $\varepsilon$-greedy method.


**Benefits of $\varepsilon$-greedy methods**
- as $t$ tends to $\infty$, every action is eventually selected infinitely many times, ensuring that $Q_t(a) \to q_{*}(a)$ for all $a$
- this also implies that the probability of selecting the optimal action converges to greater than $1-\varepsilon$
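
A minimal sketch of $\varepsilon$-greedy selection follows. It assumes the common convention (used by Sutton and Barto) that the exploratory draw is uniform over *all* actions, the greedy one included, and that ties among greedy actions are broken at random. The function name `epsilon_greedy` and the `rng` parameter are illustrative.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Pick an action index given the current estimates Q_t(a)."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (the greedy one included).
        return rng.randrange(len(q_estimates))
    # Exploit: break ties among maximal estimates at random.
    best = max(q_estimates)
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])
```

Under this convention the greedy action is chosen with probability $1 - \varepsilon + \varepsilon / k$, which is why the optimal action is eventually selected with probability greater than $1 - \varepsilon$.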

# 2.3 The 10-armed Testbed

TODO: Recreate the example to show that $\varepsilon > 0$ has higher average reward compared to $\varepsilon = 0$.

- The advantage of $\varepsilon$-greedy methods depends on the reward variance: noisier rewards take more exploration to find the optimal action, so $\varepsilon$-greedy methods fare even better relative to greedy methods. 
- On the other hand, if the reward variance were 0, the greedy method might perform best, since it would know the true value of each action after trying it once. This holds only in a stationary setting (recall our stationarity assumption). In a nonstationary setting, learned estimates become outdated and, without exploration, we would never find the new optimal action. 
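
As a starting point for the comparison, here is a rough sketch of a testbed run. It assumes true values $q_{*}(a) \sim \mathcal{N}(0, 1)$, rewards $\sim \mathcal{N}(q_{*}(a), 1)$, sample-average estimates initialised to 0, and deterministic tie-breaking (lowest index). The function names `run_bandit` and `average_over_runs` and the default run counts are illustrative choices, not a reproduction of the book's figure.

```python
import random


def run_bandit(epsilon, k=10, steps=1000, rng=random):
    """Average reward per step of one sample-average epsilon-greedy run."""
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]  # true action values
    q_est = [0.0] * k   # estimates Q(a)
    counts = [0] * k    # N(a), number of times each action was taken
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                      # explore
        else:
            a = max(range(k), key=lambda i: q_est[i])  # exploit
        reward = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (reward - q_est[a]) / counts[a]   # incremental mean
        total += reward
    return total / steps


def average_over_runs(epsilon, runs=200, seed=0):
    """Mean of run_bandit over several independently drawn bandit problems."""
    rng = random.Random(seed)
    return sum(run_bandit(epsilon, rng=rng) for _ in range(runs)) / runs
```

Comparing `average_over_runs(0.0)` with `average_over_runs(0.1)` should show the $\varepsilon$-greedy agent collecting more reward on average, although individual runs can go either way.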

TODO: exercises 2.2 and 2.3

# 2.4 Incremental Implementation

#### **Computing the sample averages of observed rewards in a computationally efficient manner**

For any given action:

$$Q_{n} \dot = \frac{1}{n-1} \sum_{i=1}^{n-1} R_{i}$$

where $R_{i}$ refers to the reward received after the $i$th selection of this action, and $Q_n$ is the estimate of its action value after it has been selected $n-1$ times. 

The simplest solution would be to keep a record of every observed reward $R_i$ and recompute the average each time. But the memory and computation required by this naive implementation would grow without bound as more rewards are seen. Instead, we can use an incremental update that computes $Q_{n+1}$ from $Q_{n}$ and $R_{n}$ alone: $Q_{n+1} = Q_n + \frac{1}{n}\big[R_n - Q_n\big]$.

> **TL;DR** New Estimate $\leftarrow$ Old Estimate + Step Size [Target - Old Estimate], where [Target - Old Estimate] is also called the 'error' in the estimate, and is *reduced* by taking a step towards the *target*. 
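
The update rule can be checked against the ordinary mean with a short sketch. The function name `incremental_mean` is illustrative. Because the first step size is $1/1$, the initial estimate is overwritten by the first reward.

```python
def incremental_mean(rewards, q_init=0.0):
    """Running sample average computed with constant memory."""
    q = q_init
    for n, r in enumerate(rewards, start=1):
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        q += (r - q) / n
    return q


rewards = [1.0, 3.0, 2.0, 6.0]
print(incremental_mean(rewards))    # 3.0
print(sum(rewards) / len(rewards))  # 3.0
```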

##### **Derivation**

![derivation](https://github.com/shengy90/reinforcement-learning-an-introduction/blob/master/misc/fig1.png?raw=true)

# 2.5 Tracking a Nonstationary Problem

The examples above are stationary: the reward probabilities *do not change over time*. In real-world applications, however, this is often not the case. When reward probabilities change over time, it makes more sense to give more weight to recent rewards than to long-past ones. 

A common way to do this is to use a constant 'step-size' parameter, $\alpha$:

$$ Q_{n+1} \dot = Q_{n} + \alpha \big[R_n - Q_n \big]$$

where $\alpha \in (0,1]$ is constant. $Q_{n+1}$ now becomes a weighted average of the past rewards and the initial estimate $Q_1$, with weights that decay exponentially the further back a reward lies. 

##### **Derivation (Todo)**
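
A sketch of the standard argument, obtained by unrolling the constant-$\alpha$ update one step at a time (this is the usual textbook expansion, added here as a placeholder rather than the author's own write-up):

$$
\begin{aligned}
Q_{n+1} &= Q_n + \alpha \big[R_n - Q_n\big] \\
&= \alpha R_n + (1-\alpha) Q_n \\
&= \alpha R_n + (1-\alpha)\big[\alpha R_{n-1} + (1-\alpha) Q_{n-1}\big] \\
&= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
&= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
\end{aligned}
$$

The weights sum to 1, since $(1-\alpha)^n + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} = 1$, and the weight on $R_i$ shrinks by a factor of $(1-\alpha)$ for every step that has passed since it was observed.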

#### **Varying $\alpha$**

There might be times where we wish to vary $\alpha$ from step to step. In this case, there are 2 conditions that we might care about:

1. $\sum_{n=1}^{\infty} \alpha_{n}(a) = \infty$ : this guarantees that the steps are large enough to eventually overcome any initial conditions or random fluctuations, so the estimate cannot get stuck away from the true value.

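2. $\sum_{n=1}^{\infty} \alpha_{n}(a)^2 \lt \infty$ : this guarantees that the steps eventually become small enough for the estimates to converge. The sample-average step size $\alpha_n = \frac{1}{n}$ meets both conditions, whereas a constant $\alpha$ violates the second one. **For non-stationary situations, we might not want the estimates to converge** but rather to keep varying in response to the most recent rewards! 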

#### **TODO:PROGRAMMING EXAMPLES**
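
One possible example, offered as a sketch: a single arm whose true mean jumps partway through, tracked once with the sample-average step size $1/n$ and once with a constant $\alpha$. The jump from 0 to 5, the value $\alpha = 0.1$ and the helper name `track` are all assumptions made for illustration.

```python
import random


def track(rewards, step_size):
    """Final estimate after applying Q <- Q + step_size(n) * (R - Q)."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += step_size(n) * (r - q)
    return q


rng = random.Random(0)
# Hypothetical nonstationary arm: the true mean jumps from 0 to 5 halfway.
rewards = [rng.gauss(0.0 if t < 500 else 5.0, 1.0) for t in range(1000)]

q_avg = track(rewards, lambda n: 1.0 / n)  # sample average
q_const = track(rewards, lambda n: 0.1)    # constant alpha = 0.1

print(f"sample average: {q_avg:.2f}, constant alpha: {q_const:.2f}")
```

The sample average weights all 1000 rewards equally, so it should end near the overall mean of about 2.5, while the constant-$\alpha$ estimate should end close to the current true mean of 5.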


# 2.6 Optimistic Initial Values

- the above methods depend, to some extent, on the initial action-value estimates $Q_1(a)$, i.e. they are *biased* by their initial estimates. For sample-average methods the bias disappears once every action has been selected at least once; with constant $\alpha$ the bias is permanent, though it decreases over time. 
- this bias could be helpful in some cases, e.g. as a way of supplying prior knowledge about what level of rewards to expect
- it could also be used as a simple way to encourage exploration in stationary problems by setting the initial estimates to a 'high' value. We call this an `optimistic initial value`. 
- It is, however, not well suited to nonstationary problems: the drive to explore comes only from the initial estimates and is therefore temporary (recall that their influence decreases over time), whereas in a nonstationary setting we would want the agent to keep exploring. 

**TODO: EXAMPLES**
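
One small example, as a sketch: a purely greedy agent with constant $\alpha$, run once with neutral and once with optimistic initial estimates. The true values, the function name `greedy_counts` and the `noise_std` parameter (set to 0 here so the output is deterministic) are assumptions made for illustration.

```python
import random


def greedy_counts(q_true, q_init, alpha=0.1, steps=100, noise_std=1.0, rng=random):
    """Run a purely greedy agent and return how often each arm was pulled."""
    k = len(q_true)
    q_est = [q_init] * k
    counts = [0] * k
    for _ in range(steps):
        a = max(range(k), key=lambda i: q_est[i])  # ties go to the lowest index
        reward = rng.gauss(q_true[a], noise_std)
        q_est[a] += alpha * (reward - q_est[a])
        counts[a] += 1
    return counts


q_true = [0.2, 0.5, 1.0]  # hypothetical true action values
print(greedy_counts(q_true, q_init=0.0, steps=30, noise_std=0.0))  # [30, 0, 0]
print(greedy_counts(q_true, q_init=5.0, steps=3, noise_std=0.0))   # [1, 1, 1]
```

With $Q_1 = 0$ the first reward already looks good, so the agent never leaves the first arm. With $Q_1 = 5$ every reward is "disappointing", which pushes the agent to try each arm before settling.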

# 2.7 Upper-Confidence-Bound Action Selection

- $\varepsilon$-greedy methods are better than greedy ones because they force the agent to explore actions that could turn out to have higher values than the current greedy one.
- but $\varepsilon$-greedy methods explore *indiscriminately*, with no preference for actions that are *nearly greedy* or *particularly uncertain*
- it might be better to select among the non-greedy actions based on how likely they are to actually be optimal.

One way to effectively do this:

$$A_t \dot = \arg\max_a \Bigg[Q_t(a) + c \sqrt{\frac{\ln t}{N_{t}(a)}} \Bigg],$$

where $N_t(a)$ denotes the number of times action $a$ was selected prior to time $t$, whilst $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered a maximising action. The square-root term is a measure of the uncertainty (or variance) in the estimate of $a$'s value. Each time $a$ is selected, $N_t(a)$ increases and this uncertainty is reduced; each time an action other than $a$ is selected, $\ln t$ increases but $N_t(a)$ does not, so the uncertainty grows. 
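
The selection rule can be sketched as follows. The function name `ucb_action` and the default `c=2.0` are illustrative assumptions; untried actions are returned first, following the usual convention that an action with $N_t(a) = 0$ counts as maximising.

```python
import math


def ucb_action(q_est, counts, t, c=2.0):
    """Select an action by the upper-confidence-bound rule at time step t >= 1."""
    # An action that has never been tried is treated as maximising.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_est, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


# Arm 1 has the lower estimate but far fewer pulls, so its bonus is larger.
print(ucb_action(q_est=[1.0, 0.8], counts=[100, 2], t=103))  # 1
```

Setting `c=0` removes the bonus and reduces the rule to greedy selection once every action has been tried.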

# 2.8 Gradient Bandit Algorithms

So far, we've considered methods that estimate action values and use these estimates to select actions. There are other ways to do this, e.g. learning a numerical 'preference' for each action. Only the preference of one action relative to another matters, and the preferences determine the action probabilities through a *soft-max distribution*:

$$ Pr(A_t=a) \dot = \frac{e^{H_t(a)}}{\sum_{b=1}^k e^{H_t(b)}} \dot = \pi_{t}(a)$$

where $\pi_{t}(a)$ refers to the probability of taking action $a$ at time $t$ and $H_{t}(a)$ refers to the preference such that:

$$H_{t+1}(A_t) \dot = H_t(A_t) + \alpha(R_t - \bar R_t) \big(1 - \pi_t(A_t)\big), \quad \text{and}$$

$$H_{t+1}(a) \dot = H_t(a) - \alpha(R_t - \bar R_t) \pi_t(a) \quad \text{for all } a \neq A_t,$$


where $\alpha > 0$ is a step-size parameter and $\bar R_t \in \mathbb{R}$ is the average of the rewards up to $t-1$ (with $\bar R_1 = R_1$). $\bar R_t$ serves as a 'baseline' against which the reward is compared: if the reward is higher than the baseline, then the probability of taking $A_t$ in the future is increased, and vice versa. 
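
A minimal sketch of one preference update follows. Non-selected actions use the rule shown above; the selected action uses the standard companion rule $H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar R_t)(1 - \pi_t(A_t))$. The function names and the default $\alpha = 0.1$ are illustrative assumptions, and the baseline is passed in rather than tracked inside the function.

```python
import math


def softmax(prefs):
    """Convert preferences H_t(a) into probabilities pi_t(a)."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_update(prefs, action, reward, baseline, alpha=0.1):
    """Return updated preferences after taking `action` and seeing `reward`."""
    pi = softmax(prefs)
    advantage = reward - baseline
    # Indicator is 1 for the selected action, 0 otherwise, which covers
    # both update rules in a single expression.
    return [
        h + alpha * advantage * ((1.0 if a == action else 0.0) - pi[a])
        for a, h in enumerate(prefs)
    ]


prefs = [0.0, 0.0, 0.0]
new_prefs = gradient_update(prefs, action=1, reward=2.0, baseline=1.0)
print(softmax(new_prefs)[1] > softmax(prefs)[1])  # True
```

The reward here beats the baseline, so the preference for action 1 rises and the others fall; a reward below the baseline would do the opposite.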

# 2.9 Associative Search (Contextual Bandits)

So far, we've only looked at `nonassociative tasks`, i.e. tasks in which we don't need to *associate different actions with different situations*. In some cases, we might want to learn a mapping from situations to the actions that are best in those situations. For example, in a self-driving car scenario, you could have *if the traffic light is red, stop; else go*. 
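
As a toy sketch of the idea, the learner below keeps a separate row of action-value estimates per context, which amounts to running one small bandit per situation. The traffic-light encoding, the reward function `traffic_reward`, the function name `learn_policy` and the optimistic initial value used to force exploration are all assumptions made for illustration.

```python
def learn_policy(contexts, reward_fn, n_contexts, n_actions, q_init=2.0):
    """Greedy sample-average learner with one row of estimates per context."""
    q = [[q_init] * n_actions for _ in range(n_contexts)]
    n = [[0] * n_actions for _ in range(n_contexts)]
    for s in contexts:
        a = max(range(n_actions), key=lambda i: q[s][i])
        r = reward_fn(s, a)
        n[s][a] += 1
        q[s][a] += (r - q[s][a]) / n[s][a]
    # Learned policy: the best estimated action in each context.
    return [max(range(n_actions), key=lambda i: q[s][i]) for s in range(n_contexts)]


def traffic_reward(s, a):
    """Hypothetical reward: context 0 = red, 1 = green; action 0 = stop, 1 = go."""
    return 1.0 if a == s else 0.0


contexts = [0, 1] * 5  # alternate red and green lights
print(learn_policy(contexts, traffic_reward, n_contexts=2, n_actions=2))  # [0, 1]
```

The learned policy maps red to stop and green to go, something a single nonassociative set of estimates could not represent.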