<a href="https://colab.research.google.com/github/shengy90/reinforcement-learning-an-introduction/blob/master/Chapter_6_Temporal_Difference_Learning_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Temporal-Difference Learning:
- A combination of Monte Carlo and Dynamic Programming ideas 
- TD methods can learn directly from raw experience without a model of the environment's dynamics 
- TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)

**2 types of problems in TD methods:**
1. `policy evaluation` or `prediction`: estimating the value function $v_{\pi}$ for a given policy $\pi$.
2. `finding optimal policy` or `control` : using generalised policy iteration to discover the optimal policy 

**2 ways of finding the optimal policy:**
- `on-policy`: learns the action-value function $q_\pi(s,a)$ for the same policy that is used to select actions, exploratory actions included.
- `off-policy`: learns the action-value function of a target policy (e.g. the greedy policy) while actions are selected by a different, typically more exploratory, behaviour policy.

# 6.1 TD Prediction

Given some experience following policy $\pi$, TD methods update their estimate $V$ of $v_\pi$ for the nonterminal states $S_t$ occurring in that experience:

$$
V(S_t) \leftarrow V(S_t) + \alpha \big[R_{t+1} + \gamma V(S_{t+1}) - V(S_{t}) \big]
$$

On transition to the state $S_{t+1}$ (and thereby receiving $R_{t+1}$), we update $V(S_t)$ immediately. This simplest TD method is called TD(0), or one-step TD.

The quantity in brackets is a kind of error: it measures the difference between the current estimate $V(S_t)$ and the better estimate $R_{t+1} + \gamma V(S_{t+1})$.

This quantity, $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, is known as the `TD error`.

**Note that:**
- The TD error at each time step is the error in the estimate *made at that time*.
- Because the TD error depends on the next state and the next reward, it is not actually available until one time step later: $\delta_t$ is the error in $V(S_t)$, available at time $t+1$.
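
As a minimal sketch of the TD(0) update, the code below estimates state values for a small random walk. The environment is an assumption made for illustration: five nonterminal states (1 to 5) between two terminal states (0 and 6), equiprobable left/right moves, and a reward of +1 only on reaching state 6. The names `td0_update` and `td0_random_walk` are hypothetical. For this walk the true value of state $s$ is $s/6$, so the printed estimates can be compared against that.

```python
import random

def td0_update(V, s, r, s_next, alpha, gamma):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    delta = r + gamma * V[s_next] - V[s]   # TD error, known only after the step
    V[s] += alpha * delta
    return delta

def td0_random_walk(num_episodes=100, alpha=0.1, gamma=1.0, seed=0):
    """Tabular TD(0) prediction under the random policy (assumed toy environment)."""
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]          # terminal states keep value 0
    for _ in range(num_episodes):
        s = 3                              # every episode starts in the centre
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            td0_update(V, s, r, s_next, alpha, gamma)
            s = s_next
    return V

V = td0_random_walk()
print([round(v, 3) for v in V[1:6]])       # compare with 1/6, 2/6, ..., 5/6
```

Note that each update happens inside the episode loop, right after the transition, which is what distinguishes it from a Monte Carlo update.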

# 6.2 Advantages of TD Prediction Methods

- TD methods update their estimates based in part on other estimates. They learn a guess *from a guess*.

**Compared to DP and Monte Carlo methods:**
- Unlike DP methods, TD methods don't require a model of the environment, i.e. of its reward and next-state probability distributions.
- Unlike Monte Carlo methods, TD methods are naturally implemented in an online, fully incremental fashion: they update after every time step instead of waiting for the episode to finish.

# 6.3 Optimality of TD(0)

**Suppose that there is available only a finite amount of experience, say 10 episodes, or 100 time steps...**
- A common approach with incremental learning is to present the experience repeatedly until the method converges.
- Given an approximate value function $V$, the increments $\alpha \big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big]$ are computed for every time step at which a nonterminal state is visited, but the value function is changed only once, by the sum of all the increments.
- Then all the available experience is processed again with the new value function to produce a new overall increment, and so on until the value function converges.

> This is also called 'batch updating' because updates are only made after processing each complete batch of training data

**Under batch updating**
- TD(0) converges deterministically to a single answer independent of the learning rate, $\alpha$, as long as $\alpha$ is sufficiently small
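
A minimal sketch of batch TD(0), under stated assumptions: the eight short episodes are made-up illustration data over two states `"A"` and `"B"`, each transition is stored as a `(state, reward, next_state)` tuple with `None` marking termination, and the function name `batch_td0` is hypothetical. Running it with two different small step sizes illustrates that the answer does not depend on $\alpha$.

```python
def batch_td0(episodes, states, alpha=0.01, gamma=1.0, tol=1e-10, max_sweeps=100_000):
    """Sweep the fixed batch repeatedly; apply the summed increments once per sweep."""
    V = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        increments = {s: 0.0 for s in states}
        for episode in episodes:
            for s, r, s_next in episode:
                v_next = V[s_next] if s_next is not None else 0.0
                # Increments are accumulated; V is left unchanged during the sweep.
                increments[s] += alpha * (r + gamma * v_next - V[s])
        for s in states:
            V[s] += increments[s]
        if max(abs(d) for d in increments.values()) < tol:
            break
    return V

# Illustrative batch: one episode A -> B -> end, seven episodes starting in B.
episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]
    + [[("B", 1.0, None)]] * 6
    + [[("B", 0.0, None)]]
)
V_small = batch_td0(episodes, ["A", "B"], alpha=0.01)
V_large = batch_td0(episodes, ["A", "B"], alpha=0.05)
print(V_small, V_large)   # both approach V(A) = V(B) = 0.75
```

In this batch, six of the eight visits to `B` end with reward 1, so $V(B)$ settles at $6/8$; the single visit to `A` always leads to `B` with reward 0, so batch TD(0) assigns $V(A) = V(B)$.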

# 6.4 Sarsa: On-policy TD Control

**Recall:**
1. An episode consists of an alternating sequence of states and state-action pairs:
    $$... S_t \to A_t \to R_{t+1} \to S_{t+1} \to A_{t+1} \to ...$$
2. We're interested in learning the action-value function $q_\pi(s,a)$ instead of the state-value function $v_\pi(s)$, because choosing actions from state values alone requires a model of the environment (to look one step ahead), which is a requirement we're relaxing for TD methods.


$$
Q(S_t, A_t) \leftarrow 
Q(S_t, A_t) + \alpha \big[R_{t+1} + 
\gamma Q(S_{t+1}, A_{t+1}) - Q(S_{t},A_{t}) \big]
$$

**SARSA**:

- This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1},A_{t+1})$ is defined as 0.
- This rule uses every element of the quintuple of events : $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ and thus, *SARSA*!
- As this is an on-policy method, we continually estimate $q_\pi$ for the behaviour policy $\pi$, and at the same time change $\pi$ towards greediness with respect to $q_\pi$.
- SARSA converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (e.g. $\epsilon$-greedy with $\epsilon = 1/t$).

**Pseudo-code**

- Parameters: $\alpha \in (0,1], \text{small } \epsilon > 0$ 
- Initialise $Q(s,a)$ for all $s \in S^+$, $a \in A(s)$, arbitrarily except $Q(\text{terminal},\cdot )=0$ 
- Loop for each episode: 
    - Initialise $S$
    - Choose $A$ from $S$ using policy derived from Q (e.g. $\epsilon$-greedy)
    - Loop for each step of episode:
        - Take action A, observe R, S'
        - Choose A' from S' using policy derived from Q (e.g. $\epsilon$-greedy)
        - Update Q(S,A) $\leftarrow$ Q(S,A) + $\alpha$[R + $\gamma$Q(S',A') - Q(S,A)]
        - Update $S \leftarrow S', A \leftarrow A'$
    - until S is terminal 
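
The pseudo-code above can be sketched in Python as follows. The environment is an assumption made for illustration: a six-cell corridor with the start at cell 0, the terminal goal at cell 5, two actions (left, right) and a reward of -1 per step. The names `step`, `epsilon_greedy`, `sarsa_update` and `sarsa` are hypothetical helpers, not part of any library.

```python
import random

N_STATES, GOAL = 6, 5          # corridor cells 0..5; cell 5 is terminal
MOVES = (-1, +1)               # action 0 = left, action 1 = right

def step(s, a):
    """Assumed corridor dynamics: reward -1 per step, walls clip movement."""
    return -1.0, min(max(s + MOVES[a], 0), GOAL)

def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, else greedy with random tie-break."""
    if rng.random() < epsilon:
        return rng.randrange(len(MOVES))
    best = max(Q[s])
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Uses the full quintuple (S, A, R, S', A')."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def sarsa(num_episodes=300, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)   # chosen before updating
            sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)
            s, a = s_next, a_next
    return Q

Q = sarsa()
for s, row in enumerate(Q):
    print(s, [round(q, 2) for q in row])
```

The action `a_next` that appears in the update is the one actually taken on the next step, which is what makes this sketch on-policy.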

# 6.5 Q-learning: Off-policy TD Control

$$
Q(S_t,A_t) \leftarrow Q(S_t, A_t) + 
\alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]
$$

In this case, the learned action-value function $Q$ directly approximates $q_*$, the optimal action-value function, independent of the policy being followed. The policy still determines which state-action pairs are visited and updated, but all that is required for correct convergence is that all pairs continue to be updated.

**Pseudo-code**
- Parameters: step size $\alpha \in (0,1], \text{ small } \epsilon \gt 0$
- Initialise $Q(s,a)$ for all $s \in S^+, a \in A(s)$, arbitrarily except $Q(terminal, \cdot) = 0$ 
- Loop for each episode:
    - Initialise $S$
    - Loop for each step of episode:
        - Choose A from S using policy derived from Q (e.g. $\epsilon$-greedy)
        - Take action A, observe R, S'
        - Update Q(S,A) $\leftarrow$ Q(S,A) + $\alpha$[R + $\gamma \max_{a}$Q(S',a) - Q(S,A)]
        - Update $S \leftarrow S'$
    - until S is terminal 
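
A minimal Python sketch of the Q-learning loop, under the same kind of assumption as any toy example: a hypothetical six-cell corridor with the start at cell 0, the terminal goal at cell 5, actions left/right and a reward of -1 per step. The helper names `step`, `q_learning_update` and `q_learning` are invented for illustration.

```python
import random

GOAL = 5                       # corridor cells 0..5; cell 5 is terminal
MOVES = (-1, +1)               # action 0 = left, action 1 = right

def step(s, a):
    """Assumed corridor dynamics: reward -1 per step, walls clip movement."""
    return -1.0, min(max(s + MOVES[a], 0), GOAL)

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Move Q[s][a] towards the greedy target R + gamma * max_a' Q[s_next][a']."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

def q_learning(num_episodes=300, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]   # Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = 0
        while s != GOAL:
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(len(MOVES))
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            r, s_next = step(s, a)
            q_learning_update(Q, s, a, r, s_next, alpha, gamma)
            s = s_next                           # no A' is carried over
    return Q

Q = q_learning()
print(["left" if q[0] > q[1] else "right" for q in Q[:GOAL]])
```

The target uses `max(Q[s_next])` regardless of which action the behaviour policy takes next. Raising `epsilon` towards 1.0 makes the behaviour more random while the learned target remains the greedy policy.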

# 6.6 Expected Sarsa

A modification of Q-learning, in that it takes the *expected* value instead of the *max* to update Q(S,A):

$$
\begin{align}
Q(S_t,A_t)
& \leftarrow Q(S_t,A_t) + \alpha \big[ 
    R_{t+1} + 
    \gamma \mathbb{E}_\pi [Q(S_{t+1},A_{t+1}) \mid S_{t+1}] - Q(S_t,A_t)
    \big] \\
& \leftarrow Q(S_t,A_t) + \alpha \big[ 
    R_{t+1}
    + \gamma \sum_{a} \pi(a|S_{t+1})Q(S_{t+1},a) - Q(S_t,A_t) 
    \big] \\
\end{align}
$$

This algorithm moves *deterministically* in the same direction as Sarsa moves *in expectation*, and is therefore called *Expected Sarsa*. It is computationally more complex than Sarsa but eliminates the variance due to the random selection of $A_{t+1}$, so given the same amount of experience it can be expected to perform slightly better than Sarsa.
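
To make the expectation concrete, here is a small sketch of the Expected Sarsa target. It assumes the policy $\pi$ is $\epsilon$-greedy with respect to the current $Q$ (one common choice, not the only one), and the function names are hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities pi(a|s) of an epsilon-greedy policy (ties share the greedy mass)."""
    n = len(q_values)
    best = max(q_values)
    greedy = [a for a, q in enumerate(q_values) if q == best]
    probs = [epsilon / n] * n
    for a in greedy:
        probs[a] += (1.0 - epsilon) / len(greedy)
    return probs

def expected_sarsa_target(r, q_next, epsilon, gamma):
    """R + gamma * sum_a pi(a|S') Q(S', a)."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """One Expected Sarsa update of Q[s][a] in place."""
    target = expected_sarsa_target(r, Q[s_next], epsilon, gamma)
    Q[s][a] += alpha * (target - Q[s][a])

q_next = [1.0, 3.0]
print(expected_sarsa_target(1.0, q_next, epsilon=0.2, gamma=0.5))  # about 2.4
print(expected_sarsa_target(1.0, q_next, epsilon=0.0, gamma=0.5))  # 2.5, the Q-learning target
```

With `epsilon=0.0` the policy is greedy, the expectation collapses to the maximum, and the target coincides with the Q-learning target.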

# 6.7 Maximisation Bias and Double Learning

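All of the control methods so far involve a maximisation in the construction of their target policies. In Q-learning, the target policy is the greedy policy given the current action values (defined with a max). In SARSA, the policy is often $\epsilon$-greedy, which also involves a maximisation. Because a maximum over estimated values is used implicitly as an estimate of the maximum value, this can lead to a significant positive bias.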

Imagine a state $s$ in which the true values $q(s,a)$ are all zero, but the estimates $Q(s,a)$ are uncertain and distributed around zero, some above and some below. The maximum of the true values is zero, but the maximum of the estimates is positive (hence a positive bias). This is called *maximisation bias*.

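One way to avoid this maximisation bias is **Double Learning**. We learn two independent estimates, $Q_1$ and $Q_2$, and at each time step only one of the two, chosen at random, is updated. One estimate selects the maximising action, $A^* = \arg\max_a Q_1(a)$, and the other evaluates it, $Q_2(A^*)$; because the two estimates are independent, $Q_2(A^*)$ is an unbiased estimate of the value of $A^*$. The behaviour policy can still use both estimates, e.g. by acting $\epsilon$-greedily with respect to their sum or average.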
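
The sketch below illustrates the bias numerically and shows one form of a double-learning update, in which one table selects the action and the other evaluates it. The setup is an assumption chosen for illustration: ten actions whose true values are all zero, with estimates drawn as independent standard normal noise. The names `max_bias_demo` and `double_q_update` are hypothetical.

```python
import random

def max_bias_demo(num_actions=10, num_trials=2000, seed=0):
    """Average the single-estimate max and the double estimate over many trials."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(num_trials):
        # Two independent noisy estimates of true values that are all zero.
        q1 = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        q2 = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        single_total += max(q1)                        # select and evaluate with q1
        best = max(range(num_actions), key=q1.__getitem__)
        double_total += q2[best]                       # select with q1, evaluate with q2
    return single_total / num_trials, double_total / num_trials

def double_q_update(Q_a, Q_b, s, a, r, s_next, alpha, gamma):
    """Update Q_a: Q_a picks the next action, Q_b supplies its value.

    The caller is assumed to flip a coin each step to decide which table plays Q_a.
    """
    best = max(range(len(Q_a[s_next])), key=Q_a[s_next].__getitem__)
    Q_a[s][a] += alpha * (r + gamma * Q_b[s_next][best] - Q_a[s][a])

single, double = max_bias_demo()
print(round(single, 3), round(double, 3))   # single is clearly positive, double is near zero
```

The true maximum is zero in every trial, so any systematic offset in the first number is pure maximisation bias.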