# Moving to Parameterized Functions

## Parameterizing the Value Function

- Approximate $v_\pi(s)$ as $\hat{v}(s,\mathbf{w})$

$$
\hat{v}(s,\underbrace{\mathbf{w}}_{\text{weights}}) \approx v_\pi(s)
$$

- Example of Parameterized Value Function

$$
\hat{v}(s,\mathbf{w}) \doteq \underbrace{w_1 X + w_2 Y}_{\text{We only have to store the two weights}}
$$

## Linear Value Function Approximation

$$
\begin{align*}
\hat{v}(s,\mathbf{w}) &\doteq \sum_{i}{w_i \underbrace{x_i(s)}_{\text{features}}}\\
                      &= \langle\mathbf{w},\mathbf{x}(s)\rangle
\end{align*}
$$
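
The linear form above is just an inner product, which can be sketched in a few lines. The function name `v_hat` and the numbers are illustrative assumptions, not part of the original notes.

```python
from typing import Sequence


def v_hat(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear value estimate: the inner product <w, x(s)>."""
    if len(w) != len(x):
        raise ValueError("weights and features must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical numbers: x_s holds two features of one state, w two weights.
w = [0.5, -1.0]
x_s = [2.0, 3.0]
print(v_hat(w, x_s))  # 0.5 * 2.0 + (-1.0) * 3.0 = -2.0
```

Only the weight vector is stored; the features are computed from the state on demand.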

## Generalization and Discrimination

- Tabular representations have provided good *discrimination*, but no *generalization*
- Generalization is important for faster learning
- Having both generalization and discrimination is ideal




## Framing Value Estimation as Supervised Learning

- Monte-Carlo

$$
\{(S_1,G_1),(S_2,G_2),(S_3,G_3),\ldots\}
$$

- TD

$$
\{(S_1,R_2+\gamma\hat{v}(S_2,\mathbf{w})),(S_2,R_3+\gamma\hat{v}(S_3,\mathbf{w})),(S_3,R_4+\gamma\hat{v}(S_4,\mathbf{w})),\ldots\}
$$
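
As a rough sketch, both training sets can be built from one recorded episode. The helper names `mc_pairs` and `td_pairs`, the 0-based indexing (`rewards[t]` plays the role of $R_{t+1}$), and the toy episode are assumptions made for illustration.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Pair = Tuple[Hashable, float]


def mc_pairs(states: Sequence[Hashable], rewards: Sequence[float],
             gamma: float) -> List[Pair]:
    """Monte Carlo pairs (S_t, G_t); returns are accumulated backwards."""
    pairs: List[Pair] = []
    g = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        g = r + gamma * g
        pairs.append((s, g))
    pairs.reverse()
    return pairs


def td_pairs(states: Sequence[Hashable], rewards: Sequence[float],
             gamma: float, v_hat: Callable[[Hashable], float]) -> List[Pair]:
    """TD pairs (S_t, R_{t+1} + gamma * v_hat(S_{t+1})).

    The value after the last recorded state is taken to be 0 (terminal).
    """
    pairs: List[Pair] = []
    for t, (s, r) in enumerate(zip(states, rewards)):
        next_value = v_hat(states[t + 1]) if t + 1 < len(states) else 0.0
        pairs.append((s, r + gamma * next_value))
    return pairs


# Toy episode: a -> b -> c -> terminal, with a reward of 1 on the last step.
states = ["a", "b", "c"]
rewards = [0.0, 0.0, 1.0]
estimates = {"a": 0.0, "b": 2.0, "c": 4.0}  # hypothetical current v_hat

print(mc_pairs(states, rewards, 0.5))
print(td_pairs(states, rewards, 0.5, estimates.__getitem__))
```

The Monte Carlo targets depend only on the observed rewards, while the TD targets change whenever the current estimates change.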

## The Function Approximator should be Compatible with Online Updates


- We can frame the policy evaluation task as a *supervised learning* problem
- But not all methods from supervised learning are ideal for reinforcement learning
  - If we want to use a function approximation technique, we should make sure it can work in the online setting
  - The data in reinforcement learning is temporally correlated, whereas many supervised learning methods assume independent samples

## The Function Approximator should be Compatible with Bootstrapping

- TD: Target depends on $\mathbf{w}$
- Supervised Learning: Target is fixed and given

## The Mean Squared Value Error Objective

- Mean Squared Value Error

$$
\sum_{s}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2}
$$

- How to choose $\mu(s)$?
  - The fraction of time we spend in state $s$ when following policy $\pi$

## Adapting the Weights to Minimize the Mean Squared Value Error Objective

$$
\overline{VE}=\sum_{s}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2}
$$
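
A minimal sketch of this objective for a small, enumerable state space. The dictionaries `mu`, `v_pi`, and `v_hat` hold made-up numbers; in practice $v_\pi$ is unknown, which is why sampled targets are used later.

```python
from typing import Dict, Hashable


def value_error(mu: Dict[Hashable, float],
                v_pi: Dict[Hashable, float],
                v_hat: Dict[Hashable, float]) -> float:
    """Mean squared value error: sum over s of mu(s) * (v_pi(s) - v_hat(s))^2."""
    return sum(mu[s] * (v_pi[s] - v_hat[s]) ** 2 for s in mu)


# Hypothetical two-state problem; the policy spends 75% of its time in "a".
mu = {"a": 0.75, "b": 0.25}
v_pi = {"a": 1.0, "b": 3.0}
v_hat = {"a": 1.5, "b": 1.0}

print(value_error(mu, v_pi, v_hat))  # 0.75 * 0.25 + 0.25 * 4.0 = 1.1875
```

The large error in state "b" is discounted by its small weight $\mu(b)$, so errors in frequently visited states matter most.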


## Introducing Gradient Descent

- Gradient Descent

$$
\mathbf{w}_{t+1} \doteq \mathbf{w}_t - \alpha \nabla J(\mathbf{w}_t)
$$

- Gradient Descent can be used to find stationary points of objectives
- These solutions are not always globally optimal
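
The update rule can be tried on a one-dimensional objective. The objective $J(w)=(w-3)^2$, the step size, and the function name `gradient_descent` are assumptions chosen only to make the sketch concrete.

```python
from typing import Callable


def gradient_descent(grad: Callable[[float], float], w0: float,
                     alpha: float, steps: int) -> float:
    """Repeat w <- w - alpha * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# J(w) = (w - 3)^2 has gradient 2 * (w - 3) and a single minimum at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, alpha=0.1, steps=100)
print(w_final)  # very close to 3
```

This objective is convex, so the stationary point found is the global minimum; for a non-convex objective the same loop may stop at a local minimum instead.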


## Gradient of the Mean Squared Value Error Objective

$$
\begin{align*}
&\nabla \sum_{s\in\mathcal{S}}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2}\\
&= \sum_{s\in\mathcal{S}}{\mu(s)\nabla[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2}\\
&= -\sum_{s\in\mathcal{S}}{\mu(s)2[v_\pi(s)-\hat{v}(s,\mathbf{w})]} \nabla \hat{v}(s,\mathbf{w})& \text{(chain rule)}
\end{align*}
$$

so

$$
\begin{align*}
&\Delta \mathbf{w} \propto \sum_{s\in\mathcal{S}}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]\mathbf{x}(s)}& (\because\nabla\hat{v}(s,\mathbf{w})=\mathbf{x}(s)\text{ for linear }\hat{v})
\end{align*}
$$

## From Gradient Descent to Stochastic Gradient Descent

### Gradient Monte Carlo

$$
\begin{gather*}
\mathbf{w}_{t+1}\doteq\mathbf{w}_t+\alpha[\underbrace{v_\pi(S_t)}_{\text{unknown}}-\hat{v}(S_t,\mathbf{w}_t)]\nabla\hat{v}(S_t,\mathbf{w}_t)\\
\mathbf{w}_{t+1}\doteq\mathbf{w}_t+\alpha[G_t-\hat{v}(S_t,\mathbf{w}_t)]\nabla\hat{v}(S_t,\mathbf{w}_t)
\end{gather*}
$$

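Replacing $v_\pi(S_t)$ with $G_t$ is justified because $G_t$ is an unbiased sample of it, $\mathbb{E}_\pi[G_t \mid S_t=s]=v_\pi(s)$, so the two updates agree in expectation: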

$$
\begin{gather*}
\mathbb{E}_\pi[2[v_\pi(S_t)-\hat{v}(S_t,\mathbf{w})]\nabla\hat{v}(S_t,\mathbf{w})]\\
=\mathbb{E}_\pi[2[G_t-\hat{v}(S_t,\mathbf{w})]\nabla\hat{v}(S_t,\mathbf{w})]
\end{gather*}
$$
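
A minimal sketch of the gradient Monte Carlo update for a linear $\hat{v}$, where $\nabla\hat{v}(S_t,\mathbf{w})=\mathbf{x}(S_t)$. The function name `gradient_mc_episode` and the toy data are assumptions; the returns $G_t$ are taken as already computed from a finished episode.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def gradient_mc_episode(w: Sequence[float],
                        features: Sequence[Sequence[float]],
                        returns: Sequence[float],
                        alpha: float) -> List[float]:
    """Apply w <- w + alpha * (G_t - w.x_t) * x_t for every step of one episode.

    features[t] is x(S_t) and returns[t] is G_t. A new weight list is returned.
    """
    w = list(w)
    for x, g in zip(features, returns):
        error = g - dot(w, x)  # G_t - v_hat(S_t, w)
        for i, x_i in enumerate(x):
            w[i] += alpha * error * x_i
    return w


# Toy episode: the same single-feature state is visited twice with return 2.
w = gradient_mc_episode([0.0], features=[[1.0], [1.0]], returns=[2.0, 2.0], alpha=0.5)
print(w)  # [1.5]: 0 -> 1.0 -> 1.5, moving toward the sampled return of 2
```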

## State Aggregation

- State aggregation treats certain states as the same
- State aggregation is another example of linear function approximation
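
One way to see why state aggregation is linear: each state gets a one-hot feature vector that marks its group, so $\hat{v}(s,\mathbf{w})$ is simply the weight of that group. The sketch below assumes integer states `0 .. num_states - 1` split into contiguous, equally sized groups; the function name and sizes are illustrative.

```python
from typing import List


def aggregate_features(state: int, num_states: int, num_groups: int) -> List[float]:
    """One-hot vector marking the group that `state` (0-based) falls into."""
    if not 0 <= state < num_states:
        raise ValueError("state out of range")
    group_size = -(-num_states // num_groups)  # ceiling division
    x = [0.0] * num_groups
    x[state // group_size] = 1.0
    return x


# Hypothetical setup: 1000 states aggregated into 10 groups of 100.
x = aggregate_features(250, num_states=1000, num_groups=10)
w = [float(g) for g in range(10)]  # made-up weights, one per group
value = sum(w_i * x_i for w_i, x_i in zip(w, x))
print(x.index(1.0), value)  # state 250 falls in group 2, so the value is w[2] = 2.0
```

All states in a group share one weight, so an update from any of them generalizes to the rest, at the cost of not discriminating between them.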


## Semi-Gradient TD for Policy Evaluation

### The TD Update for Function Approximation

$$
\begin{gather*}
\mathbf{w} \leftarrow \mathbf{w} + \alpha [U_t - \hat{v}(S_t, \mathbf{w})]\nabla \hat{v}(S_t, \mathbf{w})\\
U_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})
\end{gather*}
$$

- $U_t$ is a biased estimate of $v_\pi(S_t)$ because it bootstraps from $\hat{v}$ $\rightarrow$ $\mathbf{w}$ may not converge to a local optimum of $\overline{VE}$

### TD is a semi-gradient method

$$
\begin{align*}
\nabla \frac{1}{2} [U_t - \hat{v}(S_t, \mathbf{w})]^2 &= (U_t - \hat{v}(S_t, \mathbf{w})) (\nabla U_t - \nabla\hat{v}(S_t, \mathbf{w})) \\
                                                      &\neq \underbrace{- (U_t - \hat{v}(S_t, \mathbf{w}))\nabla\hat{v}(S_t, \mathbf{w})}_{\text{The TD Update}}
\end{align*}
$$

For TD:

$$
\begin{align*}
\nabla U_t &= \nabla (R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}))\\
           &= \gamma \nabla\hat{v}(S_{t+1}, \mathbf{w})\\
           &\neq 0
\end{align*}
$$
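
The difference can be checked numerically for a linear $\hat{v}$, where $\nabla\hat{v}(s,\mathbf{w})=\mathbf{x}(s)$. This is a sketch with made-up weights and features; the function names are assumptions. Both functions return a gradient of the squared TD error, so a descent step would move in the opposite direction.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def td_error(w, x_t, reward, x_next, gamma) -> float:
    return reward + gamma * dot(w, x_next) - dot(w, x_t)


def full_gradient(w, x_t, reward, x_next, gamma) -> List[float]:
    """Gradient of 0.5 * delta^2, including the dependence of U_t on w."""
    delta = td_error(w, x_t, reward, x_next, gamma)
    return [delta * (gamma * xn - xt) for xt, xn in zip(x_t, x_next)]


def semi_gradient(w, x_t, reward, x_next, gamma) -> List[float]:
    """Same expression with the gradient of U_t treated as zero (TD's choice)."""
    delta = td_error(w, x_t, reward, x_next, gamma)
    return [-delta * xt for xt in x_t]


w = [1.0, 2.0]
x_t, x_next = [1.0, 0.0], [0.0, 1.0]
full = full_gradient(w, x_t, reward=1.0, x_next=x_next, gamma=0.5)
semi = semi_gradient(w, x_t, reward=1.0, x_next=x_next, gamma=0.5)
print(full, semi)  # they agree on the first component and differ on the second
```

Here the TD error is 1, and the full gradient has an extra component of 0.5 along $\mathbf{x}(S_{t+1})$ that the semi-gradient drops.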

## Comparing TD and Monte Carlo with State Aggregation

- The TD update for function approximation can be *biased*
- We often prefer TD learning over Monte Carlo anyway because it can converge more quickly


## The Linear TD Update

- Recall the semi-gradient TD update; for linear $\hat{v}$, $\nabla \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$

$$
\begin{align*}
\mathbf{w} &\leftarrow \mathbf{w} + \alpha \delta_t \nabla \hat{v}(S_t, \mathbf{w})&  (\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})-\hat{v}(S_{t}, \mathbf{w})) \\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha \delta_t \mathbf{x}(S_t)
\end{align*}
$$

- Tabular TD is a special case of linear TD
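
The linear update, and the way tabular TD falls out of it, can be sketched as follows. The function name `linear_td_update`, the three-state problem, and the step sizes are assumptions for illustration; a terminal next state can be represented by an all-zero feature vector.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def linear_td_update(w: Sequence[float], x_t: Sequence[float], reward: float,
                     x_next: Sequence[float], alpha: float, gamma: float) -> List[float]:
    """One semi-gradient TD(0) step: w <- w + alpha * delta_t * x(S_t)."""
    delta = reward + gamma * dot(w, x_next) - dot(w, x_t)
    return [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x_t)]


# With one-hot features, w[i] is exactly the table entry for state i.
w = [0.0, 1.0, 0.0]
x_state0 = [1.0, 0.0, 0.0]
x_state1 = [0.0, 1.0, 0.0]

# Transition from state 0 to state 1 with reward 1.
w_new = linear_td_update(w, x_state0, reward=1.0, x_next=x_state1, alpha=0.5, gamma=0.5)
print(w_new)  # [0.75, 1.0, 0.0]: only the entry of the visited state changes
```

The TD error is $1 + 0.5\cdot 1 - 0 = 1.5$, and only `w[0]` moves by $\alpha\delta_t = 0.75$, which is the tabular TD(0) update for state 0.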

### The Utility of Linear Function Approximation

- Linear methods are *simpler to understand and analyze* mathematically
- With *good features*, linear methods can learn quickly and achieve good prediction accuracy



## The True Objective for TD

### The Expected TD Update

$$
\begin{align*}
\mathbf{w}_{t+1} &\doteq \mathbf{w}_t + \alpha [R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)-\hat{v}(S_t, \mathbf{w}_t)]\mathbf{x}_t & (\hat{v}(s, \mathbf{w})\doteq \mathbf{w}^T\mathbf{x}(s))\\
                 &= \mathbf{w}_t + \alpha [R_{t+1} + \gamma \mathbf{w}^T_{t}\mathbf{x}_{t+1}-\mathbf{w}^T_{t}\mathbf{x}_{t}]\mathbf{x}_t \\
                 &= \mathbf{w}_t + \alpha [\underbrace{R_{t+1}\mathbf{x}_t}_{\mathbf{b}} - \underbrace{\mathbf{x}_{t}(\mathbf{x}_{t}-\gamma \mathbf{x}_{t+1})^T}_{\mathbf{A}}\mathbf{w}_t]
\end{align*}
$$

so

$$
\begin{align*}
&\mathbb{E}[\Delta\mathbf{w}_t] = \alpha(\mathbf{b}-\mathbf{A}\mathbf{w}_t)&(\mathbf{b}\doteq \mathbb{E}[R_{t+1}\mathbf{x}_t],\ \mathbf{A}\doteq\mathbb{E}[\mathbf{x}_t(\mathbf{x}_t-\gamma\mathbf{x}_{t+1})^T])
\end{align*}
$$

### The TD Fixed Point

$$
\begin{gather*}
\mathbb{E}[\Delta \mathbf{w}_{TD}] = \alpha (\mathbf{b}-\mathbf{A}\mathbf{w}_{TD}) = 0\\
\Rightarrow \mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b} \\
\mathbf{w}_{TD}\text{ minimizes }(\mathbf{b}-\mathbf{A}\mathbf{w})^T(\mathbf{b}-\mathbf{A}\mathbf{w})
\end{gather*}
$$
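
For a problem small enough to enumerate, $\mathbf{A}$ and $\mathbf{b}$ can be computed exactly and the fixed point solved directly. The two-state cycle below, its rewards, and the helper names are assumptions chosen so the arithmetic is easy to follow; the 2x2 system is solved with Cramer's rule to avoid any dependency.

```python
from typing import List, Sequence, Tuple

# (probability, x_t, reward, x_next): expectation under the on-policy distribution.
Transition = Tuple[float, Sequence[float], float, Sequence[float]]


def expected_A_b(transitions: Sequence[Transition], gamma: float):
    """A = E[x_t (x_t - gamma * x_next)^T] and b = E[R_{t+1} x_t]."""
    d = len(transitions[0][1])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for p, x, r, xn in transitions:
        for i in range(d):
            b[i] += p * r * x[i]
            for j in range(d):
                A[i][j] += p * x[i] * (x[j] - gamma * xn[j])
    return A, b


def solve_2x2(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A w = b for a 2x2 matrix using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0.0:
        raise ValueError("A is singular")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


# Deterministic cycle s0 -> s1 (reward 1) -> s0 (reward 0), one-hot features.
# Each state is visited half of the time.
transitions = [
    (0.5, [1.0, 0.0], 1.0, [0.0, 1.0]),
    (0.5, [0.0, 1.0], 0.0, [1.0, 0.0]),
]
A, b = expected_A_b(transitions, gamma=0.5)
w_td = solve_2x2(A, b)
print(w_td)  # approximately [4/3, 2/3]
```

With one-hot features the representation is tabular, so the fixed point coincides with the true values: $v(s_0)=1+0.5\,v(s_1)$ and $v(s_1)=0.5\,v(s_0)$ give $v(s_0)=4/3$ and $v(s_1)=2/3$.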

### Relating the TD Fixed Point and the Minimum of the Value Error

$$
\overline{VE}(\mathbf{w}_{TD})\leq\frac{1}{1-\gamma}\min_{\mathbf{w}}\overline{VE}(\mathbf{w})
$$
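
To get a feel for the bound, plug in a concrete discount factor (the value $\gamma = 0.9$ is chosen here only for illustration):

$$
\gamma = 0.9:\quad \overline{VE}(\mathbf{w}_{TD})\leq\frac{1}{1-0.9}\min_{\mathbf{w}}\overline{VE}(\mathbf{w}) = 10\,\min_{\mathbf{w}}\overline{VE}(\mathbf{w})
$$

The factor grows as $\gamma \to 1$ (it is 100 for $\gamma = 0.99$), so the guarantee becomes loose for long horizons. When the features can represent $v_\pi$ exactly, the minimum is 0 and the bound forces $\overline{VE}(\mathbf{w}_{TD}) = 0$ as well.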

### Summary

- Linear semi-gradient TD is guaranteed to converge to a fixed point, called the *TD fixed point*
- The value error at the TD fixed point is at most $\frac{1}{1-\gamma}$ times the minimum mean squared value error