# Moving to Parameterized Functions

## Parameterizing the Value Function

- approximate $v_\pi(s)$ as $\hat{v}(s,\mathbb{w})$

$$
\hat{v}(s,\underbrace{\mathbb{w}}_{\text{weights}}) \approx v_\pi(s)
$$

- Example of Parameterized Value Function

$$
\hat{v}(s,\mathbb{w}) \doteq \underbrace{w_1 X + w_2 Y}_{\text{We only have to store the two weights}}
$$

## Linear Value Function Approximation

$$
\begin{align*}
\hat{v}(s,\mathbb{w}) &\doteq \sum_{i}{w_i \underbrace{x_i(s)}_{\text{features}}}\\
                      &= \langle\mathbb{w},\mathbb{x}(s)\rangle
\end{align*}
$$
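
A minimal sketch of the linear form, assuming a hypothetical state given by two grid coordinates `(X, Y)` that double as the features, in the spirit of the two-weight example earlier. The names `State`, `features`, and `v_hat` are illustrative and not part of the original notes.

```python
from typing import Sequence, Tuple

State = Tuple[float, float]  # hypothetical state: (X, Y) coordinates


def features(s: State) -> Tuple[float, float]:
    """Hypothetical feature vector x(s): the state's two coordinates."""
    return (s[0], s[1])


def v_hat(s: State, w: Sequence[float]) -> float:
    """Linear value estimate: the inner product <w, x(s)>."""
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))


w = [0.5, -1.0]
print(v_hat((2.0, 3.0), w))  # 0.5 * 2.0 + (-1.0) * 3.0
```

Only the two entries of `w` are stored, no matter how many distinct `(X, Y)` states exist.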

## Generalization and Discrimination

- Tabular representations have provided good *discrimination*, but no *generalization*
- Generalization is important for faster learning
- Having both generalization and discrimination is ideal




## Framing Value Estimation as Supervised Learning

- Monte-Carlo

$$
\{(S_1,G_1),(S_2,G_2),(S_3,G_3),\ldots\}
$$

- TD

$$
\{(S_1,R_2+\gamma\hat{v}(S_2,\mathbb{w})),(S_2,R_3+\gamma\hat{v}(S_3,\mathbb{w})),(S_3,R_4+\gamma\hat{v}(S_4,\mathbb{w})),\ldots\}
$$
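
The two training sets can be sketched as follows, assuming a single recorded episode where `rewards[t]` holds $R_{t+1}$, the reward received after leaving `states[t]`, and where the value after the final transition is taken to be 0. The helper names `mc_pairs` and `td_pairs` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def mc_pairs(states: Sequence[int], rewards: Sequence[float],
             gamma: float) -> List[Tuple[int, float]]:
    """Monte Carlo pairs (S_t, G_t), with returns accumulated backwards."""
    pairs = []
    g = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        pairs.append((s, g))
    pairs.reverse()
    return pairs


def td_pairs(states: Sequence[int], rewards: Sequence[float], gamma: float,
             v_hat: Callable[[int], float]) -> List[Tuple[int, float]]:
    """TD pairs (S_t, R_{t+1} + gamma * v_hat(S_{t+1}))."""
    pairs = []
    for t, (s, r) in enumerate(zip(states, rewards)):
        # After the last recorded state the episode has ended, so the value is 0.
        next_value = v_hat(states[t + 1]) if t + 1 < len(states) else 0.0
        pairs.append((s, r + gamma * next_value))
    return pairs


states = [0, 1, 2]
rewards = [0.0, 0.0, 1.0]
print(mc_pairs(states, rewards, gamma=0.5))
print(td_pairs(states, rewards, gamma=0.5, v_hat=lambda s: 1.0))
```

Note that the TD targets change whenever the estimate passed as `v_hat` changes, while the Monte Carlo targets depend only on the observed rewards.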

## The Function Approximator should be Compatible with Online Updates


- We can frame the policy evaluation task as a *supervised learning* problem
- But not all methods from supervised learning are ideal for reinforcement learning
  - If we want to use a function approximation technique, we should make sure it can work in the online setting
  - The data in reinforcement learning is temporally correlated, because consecutive samples come from the same trajectory

## The Function Approximator should be Compatible with Bootstrapping

- TD: Target depends on $\mathbb{w}$
- Supervised Learning: Target is fixed and given

## The Mean Squared Value Error Objective

- Mean Squared Value Error

$$
\sum_{s}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbb{w})]^2}
$$

- How to choose $\mu(s)$?
  - The fraction of time we spend in $s$ when following policy $\pi$
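
As a small worked illustration, the objective can be computed directly for a toy problem. This sketch assumes three states with made-up values for $\mu$, $v_\pi$, and $\hat{v}$; the function name `value_error` is hypothetical.

```python
from typing import Sequence


def value_error(mu: Sequence[float], v_pi: Sequence[float],
                v_hat: Sequence[float]) -> float:
    """Sum over states of mu(s) * (v_pi(s) - v_hat(s))^2."""
    return sum(m * (v - vh) ** 2 for m, v, vh in zip(mu, v_pi, v_hat))


mu = [0.5, 0.25, 0.25]      # made-up state weighting, sums to 1
v_pi = [1.0, 2.0, 3.0]      # made-up true values
v_hat = [1.0, 1.0, 1.0]     # made-up estimates

print(value_error(mu, v_pi, v_hat))  # 0.5*0 + 0.25*1 + 0.25*4
```

Errors in states that are visited often (large $\mu(s)$) count more toward the objective than errors in rarely visited states.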

## Adapting the Weights to Minimize the Mean Squared Value Error Objective

$$
\overline{VE}=\sum_{s}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbb{w})]^2}
$$


## Introducing Gradient Descent

- Gradient Descent

$$
\mathbb{w}_{t+1} \doteq \mathbb{w}_t - \alpha \nabla J(\mathbb{w}_t)
$$

- Gradient Descent can be used to find stationary points of objectives
- These solutions are not always globally optimal
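
The following sketch illustrates both points on a made-up one-dimensional objective, $J(w) = w^4 - 2w^2 + 0.5w$, which has two minima. The objective, step size, and starting points are assumptions chosen for illustration only.

```python
def J(w: float) -> float:
    """Hypothetical non-convex objective with two minima."""
    return w**4 - 2.0 * w**2 + 0.5 * w


def grad_J(w: float) -> float:
    """Derivative of J with respect to w."""
    return 4.0 * w**3 - 4.0 * w + 0.5


def gradient_descent(w0: float, alpha: float = 0.01, steps: int = 2000) -> float:
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_J(w)  # w_{t+1} = w_t - alpha * grad J(w_t)
    return w


w_left = gradient_descent(-1.5)
w_right = gradient_descent(1.5)
print(w_left, J(w_left))
print(w_right, J(w_right))
```

Both runs end at a stationary point, but the run started at `1.5` settles in a minimum with a higher objective value than the run started at `-1.5`.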


## Gradient of the Mean Squared Value Error Objective

$$
\begin{align*}
&\nabla \sum_{s\in\mathcal{S}}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbb{w})]^2}\\
&= \sum_{s\in\mathcal{S}}{\mu(s)\nabla[v_\pi(s)-\hat{v}(s,\mathbb{w})]^2}\\
&= -\sum_{s\in\mathcal{S}}{\mu(s)2[v_\pi(s)-\hat{v}(s,\mathbb{w})]} \nabla \hat{v}(s,\mathbb{w})& \text{(chain rule)}
\end{align*}
$$

so

$$
\begin{align*}
&\Delta \mathbb{w} \propto \sum_{s\in\mathcal{S}}{\mu(s)[v_\pi(s)-\hat{v}(s,\mathbb{w})]\,\mathbb{x}(s)}& (\because\nabla\hat{v}(s,\mathbb{w})=\mathbb{x}(s)\text{ when }\hat{v}\text{ is linear})
\end{align*}
$$
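
To sanity-check the derivation, the sketch below evaluates the analytic gradient for a linear $\hat{v}$ and compares it with a central finite-difference estimate. The three-state problem (weights `mu`, values `v_pi`, feature vectors `xs`) is made up, and all function names are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def value_error(w, mu, v_pi, xs) -> float:
    """Sum over states of mu(s) * (v_pi(s) - <w, x(s)>)^2."""
    return sum(m * (v - dot(w, x)) ** 2 for m, v, x in zip(mu, v_pi, xs))


def value_error_gradient(w, mu, v_pi, xs) -> List[float]:
    """Analytic gradient: -sum_s mu(s) * 2 * (v_pi(s) - v_hat(s, w)) * x(s)."""
    grad = [0.0] * len(w)
    for m, v, x in zip(mu, v_pi, xs):
        error = v - dot(w, x)
        for i, x_i in enumerate(x):
            grad[i] += -2.0 * m * error * x_i
    return grad


mu = [0.5, 0.3, 0.2]
v_pi = [1.0, 2.0, 3.0]
xs = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
w = [0.0, 0.0]

grad = value_error_gradient(w, mu, v_pi, xs)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    diff = value_error(w_plus, mu, v_pi, xs) - value_error(w_minus, mu, v_pi, xs)
    numeric.append(diff / (2.0 * eps))

print(grad)
print(numeric)
```

The weight update moves in the direction of the negative gradient, which is why the leading minus sign disappears in $\Delta\mathbb{w}$.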

## From Gradient Descent to Stochastic Gradient Descent

### Gradient Monte Carlo

$$
\begin{gather*}
\mathbb{w}_{t+1}\doteq\mathbb{w}_t+\alpha[\underbrace{v_\pi(S_t)}_{\text{?}}-\hat{v}(S_t,\mathbb{w}_t)]\nabla\hat{v}(S_t,\mathbb{w}_t)\\
\mathbb{w}_{t+1}\doteq\mathbb{w}_t+\alpha[G_t-\hat{v}(S_t,\mathbb{w}_t)]\nabla\hat{v}(S_t,\mathbb{w}_t)
\end{gather*}
$$

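Replacing $v_\pi(S_t)$ with $G_t$ is justified because the return is an unbiased sample of the value, $\mathbb{E}_\pi[G_t \mid S_t=s]=v_\pi(s)$, so the expected update is unchanged: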

$$
\begin{gather*}
\mathbb{E}_\pi[2[v_\pi(S_t)-\hat{v}(S_t,\mathbb{w})]\nabla\hat{v}(S_t,\mathbb{w})]\\
=\mathbb{E}_\pi[2[G_t-\hat{v}(S_t,\mathbb{w})]\nabla\hat{v}(S_t,\mathbb{w})]
\end{gather*}
$$
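
A minimal sketch of Gradient Monte Carlo with a linear approximator, under several assumptions that are not in the original notes: a hypothetical 5-state random walk that starts in the middle, a reward of 1 only when exiting on the right, $\gamma = 1$, one-hot features, and updates applied in reverse time order within each episode for convenience.

```python
import random
from typing import List, Tuple

N_STATES = 5   # hypothetical random walk with states 0..4
GAMMA = 1.0


def features(s: int) -> List[float]:
    """One-hot features, an illustrative choice; any x(s) could be used."""
    x = [0.0] * N_STATES
    x[s] = 1.0
    return x


def v_hat(s: int, w: List[float]) -> float:
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))


def generate_episode(rng: random.Random) -> Tuple[List[int], List[float]]:
    """Step left or right with equal probability until leaving the chain.
    Exiting on the right gives reward 1; every other transition gives 0."""
    states, rewards = [], []
    s = N_STATES // 2
    while 0 <= s < N_STATES:
        states.append(s)
        s += rng.choice((-1, 1))
        rewards.append(1.0 if s == N_STATES else 0.0)
    return states, rewards


def gradient_mc(episodes: int, alpha: float, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    w = [0.0] * N_STATES
    for _ in range(episodes):
        states, rewards = generate_episode(rng)
        g = 0.0
        # Walk backwards so G_t = R_{t+1} + gamma * G_{t+1} builds up incrementally.
        for s, r in zip(reversed(states), reversed(rewards)):
            g = r + GAMMA * g
            error = g - v_hat(s, w)
            # For a linear approximator, grad v_hat(s, w) = x(s).
            for i, x_i in enumerate(features(s)):
                w[i] += alpha * error * x_i
    return w


w = gradient_mc(episodes=10000, alpha=0.002)
print([round(w_i, 2) for w_i in w])
```

For this particular walk the true values are $1/6, 2/6, \ldots, 5/6$, so the learned weights should land near those numbers, with some residual noise because the step size is held constant.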