diff --git a/statistics/reinforcement-learning/4 RL Model-Free Prediction.md b/statistics/reinforcement-learning/4 RL Model-Free Prediction.md
index ec3b463..6183a4e 100644
--- a/statistics/reinforcement-learning/4 RL Model-Free Prediction.md
+++ b/statistics/reinforcement-learning/4 RL Model-Free Prediction.md
@@ -407,6 +407,62 @@ $$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
 
 > [!tip] Eligibility traces
 > ![[Pasted image 20231027090342.png]]
+>
+> If you observe this sequence, what do you think caused the shock? This is the *credit assignment problem*: was it the bell or the light? There are a few ways to think about it:
+>
+> - Frequency heuristic: assign credit to the most frequent state
+> - Recency heuristic: assign credit to the most recent state
+>
+> The idea of *eligibility traces* is to combine both heuristics:
+> - Look over time at the states we visit; every time we visit state $s$ we increase its eligibility trace, and we let the trace decay while the state is not visited.
+> - The diagram below shows the eligibility trace for one given state; the vertical lines at the bottom mark each time the state is visited.
+>
+> ![[Pasted image 20231027103240.png]]
+>
+> $$
+> \begin{align}
+> E_0(s) &= 0 \\
+> E_t(s) &= \gamma \lambda E_{t-1}(s) + 1(S_t = s)
+> \end{align}
+> $$
+>
+> When we see an error, we update the value function of each state in proportion to its eligibility trace, i.e. in proportion to how much we think being in that state caused the error.
+
+- We keep an eligibility trace for every state $s \in \mathbb{S}$
+- We update the value function $V(s)$ for state $s$ in proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s)$
+    - Recall that $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
+    - Hence our update equation is now $V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$
+
+![[Pasted image 20231027104019.png]]
+
+- This is like the TD-error being broadcast backwards to every state in the past, with each state's value updated in proportion to its eligibility trace.
+
+> [!info] $TD(\lambda)$ and $TD(0)$
+> When $\lambda = 0$, only the current state is updated:
+>
+> $$
+> \begin{align}
+> E_t(s) &= 1(S_t = s) \\
+> V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s)
+> \end{align}
+> $$
+>
+> This is equivalent to the $TD(0)$ update: $V(S_t) \leftarrow V(S_t) + \alpha \delta_t$
+
+> [!info] $TD(\lambda)$ and MC
+> When $\lambda = 1$, credit is deferred until the end of the episode.
+> - Consider episodic environments with off-line updates (updates are accumulated and applied at the end of the episode)
+> - Over the course of an episode, the total update for $TD(1)$ is the same as the total update for MC
+>
+> **Theorem**
+>
+> *The sum of off-line updates is identical for forward view and backward view $TD(\lambda)$*
+>
+> $$\sum_{t=1}^{T} \alpha \delta_t E_t(s) = \sum_{t=1}^{T} \alpha (G_t^\lambda - V(S_t)) 1(S_t = s)$$
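+
+To see why the $\lambda = 1$ case matches MC (this check is not from the lecture, but it follows directly from the definitions above): take a state $s$ visited only once, at time $k$, and hold $V$ fixed over the episode (off-line updates). With $\lambda = 1$ the trace is $E_t(s) = \gamma^{t-k}$ for $t \geq k$, and the discounted TD-errors telescope into the MC error:
+
+$$
+\begin{align}
+\sum_{t=k}^{T-1} \alpha \delta_t E_t(s)
+&= \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right) \\
+&= \alpha \left( \sum_{t=k}^{T-1} \gamma^{t-k} R_{t+1} + \gamma^{T-k} V(S_T) - V(S_k) \right) \\
+&= \alpha \left( G_k - V(S_k) \right) \qquad \text{since } V(S_T) = 0
+\end{align}
+$$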
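+
+Putting the backward view together: below is a minimal sketch of backward-view $TD(\lambda)$ prediction with accumulating traces (my own sketch, not from the lecture). It assumes a small tabular problem with integer states, a `policy(s)` function, and a Gymnasium-style `env` exposing `reset()`/`step()`.
+
+```python
+import numpy as np
+
+def td_lambda_backward(env, policy, n_states, n_episodes=500,
+                       alpha=0.1, gamma=0.99, lam=0.9):
+    """Backward-view TD(lambda) prediction with accumulating eligibility traces."""
+    V = np.zeros(n_states)
+    for _ in range(n_episodes):
+        E = np.zeros(n_states)                  # E_0(s) = 0
+        s, _ = env.reset()
+        done = False
+        while not done:
+            s_next, r, terminated, truncated, _ = env.step(policy(s))
+            done = terminated or truncated
+            # TD-error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
+            delta = r + (0.0 if terminated else gamma * V[s_next]) - V[s]
+            # decay every trace, then bump the visited state:
+            # E_t(s) = gamma * lambda * E_{t-1}(s) + 1(S_t = s)
+            E *= gamma * lam
+            E[s] += 1.0
+            # broadcast the TD-error to all states in proportion to their traces
+            V += alpha * delta * E
+            s = s_next
+    return V
+```
+
+Setting `lam=0` recovers $TD(0)$. Note that this sketch applies updates on-line, whereas the exact $TD(1)$ = MC equivalence above holds for off-line (end-of-episode) updates.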
+
+## Summary
+
+![[Pasted image 20231027105327.png]]
+
+
-1:30:38
diff --git a/statistics/reinforcement-learning/attachments/Pasted image 20231027103240.png b/statistics/reinforcement-learning/attachments/Pasted image 20231027103240.png
new file mode 100644
index 0000000..f5b3f30
Binary files /dev/null and b/statistics/reinforcement-learning/attachments/Pasted image 20231027103240.png differ
diff --git a/statistics/reinforcement-learning/attachments/Pasted image 20231027104019.png b/statistics/reinforcement-learning/attachments/Pasted image 20231027104019.png
new file mode 100644
index 0000000..f207424
Binary files /dev/null and b/statistics/reinforcement-learning/attachments/Pasted image 20231027104019.png differ
diff --git a/statistics/reinforcement-learning/attachments/Pasted image 20231027105327.png b/statistics/reinforcement-learning/attachments/Pasted image 20231027105327.png
new file mode 100644
index 0000000..50ed394
Binary files /dev/null and b/statistics/reinforcement-learning/attachments/Pasted image 20231027105327.png differ