# Policy Gradient



## Learning Policies Directly

### Constraints on the Policy Parameterization

$$
\begin{align*}
\pi(a|s,\theta) &\geq 0 && \forall a \in \mathcal{A},\ s \in \mathcal{S}\\
\sum_{a\in\mathcal{A}}\pi(a|s,\theta) &= 1 && \forall s \in \mathcal{S}
\end{align*}
$$

### The Softmax Policy Parameterization

$$
\pi(a|s,\theta)\doteq\frac{\underbrace{e^{h(s,a,\theta)}}_{\text{Exponentiated Action Preference}}}{\sum_{b\in\mathcal{A}}{e^{h(s,b,\theta)}}}
$$
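
A minimal sketch of the softmax parameterization for a single state, assuming the action preferences $h(s,\cdot,\theta)$ are already available as a NumPy array. The function name `softmax_policy` and the example preferences are illustrative, not part of the original notes. Subtracting the maximum preference is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_policy(preferences):
    """Map a 1-D array of action preferences h(s, ., theta) to action probabilities."""
    # Shifting by the max does not change the result but avoids overflow in exp.
    shifted = preferences - np.max(preferences)
    exp_prefs = np.exp(shifted)
    return exp_prefs / np.sum(exp_prefs)

# Hypothetical preferences for three actions in one state.
h = np.array([2.0, 1.0, -1.0])
pi = softmax_policy(h)
```

By construction every entry of `pi` is positive and the entries sum to one, so both constraints on the policy parameterization hold for any real-valued preferences.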


## Advantages of Policy Parameterization

### Parameterized stochastic policies are useful because

- They can autonomously *decrease exploration* over time
- They can avoid failures due to deterministic policies with *limited function approximation*
- Sometimes the policy is less complicated than the value function



## The Objective for Learning Policies

### The Average Reward Objective

$$
r(\pi)=\underbrace{\sum_{s}\mu(s)\underbrace{\sum_{a}\pi(a|s,\theta)\underbrace{\sum_{s',r}p(s',r|s,a)r}_{\mathbb{E}[R_t|S_t=s,A_t=a]}}_{\mathbb{E}_\pi[R_t|S_t=s]}}_{\mathbb{E}_\pi[R_t]}
$$
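
The nested sums can be evaluated directly on a small tabular problem. The sketch below assumes a hypothetical 2-state, 2-action MDP whose transition probabilities `P`, expected rewards `R`, and uniform policy `pi` are all made up for illustration. It also assumes the chain induced by the policy is ergodic, so that the stationary distribution $\mu$ is unique.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are made up for illustration).
# P[s, a, t] is the probability of moving from s to t under action a;
# R[s, a] is the expected immediate reward E[R | s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy pi[s, a]

# State-to-state transition matrix under pi.
P_pi = np.einsum("sa,sat->st", pi, P)

# Stationary distribution: mu @ P_pi = mu with sum(mu) = 1, solved by least squares.
n = P_pi.shape[0]
A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
b = np.append(np.zeros(n), 1.0)
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# r(pi) = sum_s mu(s) sum_a pi(a|s) E[R | s, a]
expected_reward_per_state = (pi * R).sum(axis=1)
r_pi = float(mu @ expected_reward_per_state)
```

For these numbers the stationary distribution is $\mu=(8/17,\,9/17)$ and the average reward is $r(\pi)=13/17\approx 0.765$.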

### Optimizing The Average Reward Objective

- Policy-Gradient Method

$$
\nabla r(\pi)=\nabla\sum_{s}\mu(s)\sum_{a}\pi(a|s,\theta)\sum_{s',r}p(s',r|s,a)r
$$

### The Challenge of Policy Gradient Methods

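- We can use the average reward as an objective for policy optimization, but its gradient is hard to estimate: the state distribution $\mu(s)$ depends on $\theta$, so $\nabla_\theta$ cannot simply be moved inside the sum over states
- By contrast, in the value error objective $\mu(s)$ does not depend on $\mathbf{w}$, so the gradient passes straight through the sum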

$$
\nabla_\theta r(\pi)=\nabla_\theta\sum_{s}\underbrace{\mu(s)}_{\text{Depends on }\theta}\sum_{a}\pi(a|s,\theta)\sum_{s',r}p(s',r|s,a)r
$$

$$
\begin{align*}
\nabla_\mathbf{w}\overline{VE} &= \nabla_\mathbf{w}\sum_{s}\mu(s)[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2 \\
                               &= \sum_{s}\mu(s)\nabla_\mathbf{w}[v_\pi(s)-\hat{v}(s,\mathbf{w})]^2
\end{align*}
$$



## The Policy Gradient Theorem

### The Gradient of the Objective

$$
\begin{align*}
\nabla r(\pi) &= \nabla\sum_{s}\mu(s)\sum_{a}\pi(a|s,\theta)\sum_{s',r}p(s',r|s,a)r \\
              &= \sum_{s}\nabla\mu(s)\sum_{a}\pi(a|s,\theta)\sum_{s',r}p(s',r|s,a)r + \sum_{s}\mu(s)\nabla\sum_{a}\pi(a|s,\theta)\sum_{s',r}p(s',r|s,a)r
\end{align*}
$$

### The Policy Gradient Theorem

- The *policy gradient theorem* gives an expression for the gradient of the average reward

$$
\nabla r(\pi) = \sum_{s}\mu(s)\sum_{a}\nabla \pi(a|s,\theta)q_\pi(s,a)
$$
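
One way to build confidence in the theorem is to compare it against a finite-difference gradient on a tiny problem. The sketch below assumes a tabular softmax policy with one preference `theta[s, a]` per state-action pair and a hypothetical ergodic 2-state MDP. All function names and numbers are illustrative. The differential values $q_\pi$ are obtained by solving the average-reward Bellman equations exactly, which is only feasible because the model is known and small.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def stationary_distribution(P_pi):
    # Solve mu @ P_pi = mu together with sum(mu) = 1.
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def average_reward(theta, P, R):
    pi = softmax(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    mu = stationary_distribution(P_pi)
    return float(mu @ (pi * R).sum(axis=1))

def policy_gradient(theta, P, R):
    """Right-hand side of the policy gradient theorem for tabular softmax."""
    pi = softmax(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    mu = stationary_distribution(P_pi)
    r_s = (pi * R).sum(axis=1)
    r_bar = mu @ r_s
    n = P_pi.shape[0]
    # Differential state values: (I - P_pi) v = r_s - r_bar, pinned by mu @ v = 0.
    A = np.vstack([np.eye(n) - P_pi, mu[None, :]])
    b = np.append(r_s - r_bar, 0.0)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    q = R - r_bar + P @ v  # differential action values, shape (S, A)
    # For tabular softmax, sum_a grad pi(a|s) q(s,a) has entries
    # pi(b|s) * (q(s,b) - sum_a pi(a|s) q(s,a)) in row s and zeros elsewhere.
    baseline = (pi * q).sum(axis=1, keepdims=True)
    return mu[:, None] * pi * (q - baseline)

def finite_difference_gradient(theta, P, R, eps=1e-6):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (average_reward(theta + step, P, R)
                     - average_reward(theta - step, P, R)) / (2 * eps)
    return grad

# Hypothetical MDP and policy parameters (made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
theta = np.array([[0.5, -0.2],
                  [0.1, 0.3]])

theorem_grad = policy_gradient(theta, P, R)
numeric_grad = finite_difference_gradient(theta, P, R)
```

The two arrays should agree up to finite-difference error, even though `policy_gradient` never differentiates $\mu(s)$.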



## Estimating the Policy Gradient

### Getting Stochastic Samples of the Gradient

$$
\begin{gather*}
\nabla r(\pi)=\sum_{s}\mu(s)\sum_{a}\nabla \pi(a|s,\theta)q_\pi(s,a)\\
\theta_{t+1}\doteq\theta_{t}+\alpha\sum_{a}\nabla\pi(a|S_t,\theta_t)q_\pi(S_t,a)\\
\\
S_0,A_0,R_1,S_1,A_1,\ldots,S_t,A_t,R_{t+1},\ldots
\end{gather*}
$$

### Unbiasedness of the Stochastic Samples

$$
\begin{align*}
\nabla r(\pi) &=\sum_{s}\mu(s)\sum_{a}\nabla\pi(a|s,\theta)q_\pi(s,a)\\
              &=\mathbb{E}_\mu\left[\sum_{a}\nabla\pi(a|S,\theta)q_\pi(S,a)\right]
\end{align*}
$$

### Getting Stochastic Samples with One Action

$$
\begin{align*}
&\sum_{a}\nabla\pi(a|S,\theta)q_\pi(S,a)\\
&=\sum_{a}\pi(a|S,\theta)\frac{1}{\pi(a|S,\theta)}\nabla\pi(a|S,\theta)q_\pi(S,a)\\
&=\mathbb{E}_\pi\left[\frac{\nabla\pi(A|S,\theta)}{\pi(A|S,\theta)}q_\pi(S,A)\right]
\end{align*}
$$
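
The identity above can be checked numerically for a single state. The sketch assumes a tabular softmax policy over three actions with made-up preferences `theta` and made-up action values `q`. It compares the sum over all actions with the exact expectation and with a Monte Carlo average over sampled actions.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.4, 1.0])  # hypothetical preferences for one state
q = np.array([1.0, -0.5, 2.0])      # hypothetical action values q(S, a)

pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Jacobian of tabular softmax: grad_pi[a, b] = d pi(a) / d theta[b]
#                                            = pi(a) * (1[a == b] - pi(b)).
grad_pi = np.diag(pi) - np.outer(pi, pi)

# Sum over all actions: sum_a grad pi(a) q(a).
all_actions = (grad_pi * q[:, None]).sum(axis=0)

# Same quantity written as an expectation under pi of (grad pi / pi) * q.
grad_log_pi = grad_pi / pi[:, None]
expectation = (pi[:, None] * grad_log_pi * q[:, None]).sum(axis=0)

# Monte Carlo version: sample one action at a time from pi and average.
actions = rng.choice(len(pi), size=200_000, p=pi)
sample_mean = (grad_log_pi[actions] * q[actions, None]).mean(axis=0)
```

`all_actions` and `expectation` match to floating-point precision, while `sample_mean` approaches them as the number of sampled actions grows. This is what makes a single sampled action an unbiased stand-in for the full sum.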

### Stochastic Gradient Ascent for Policy Parameters

$$
\begin{align*}
\theta_{t+1}&\doteq\theta_{t}+\alpha\frac{\nabla\pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}q_\pi(S_t,A_t)\\
            &=\theta_{t}+\alpha\nabla\ln\pi(A_t|S_t,\theta_t)q_\pi(S_t,A_t)&(\because \nabla\ln\left(f(x)\right)=\frac{\nabla f(x)}{f(x)})
\end{align*}
$$

### Computing the Update

$$
\theta_{t+1}=\theta_{t}+\alpha\underbrace{\nabla\ln\pi(A_t|S_t,\theta_t)}_{\text{gradient of the log policy (computable)}}\underbrace{q_\pi(S_t,A_t)}_{\text{estimate of the differential value (computable)}}
$$
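
A minimal sketch of one such update, assuming a tabular softmax policy with preferences `theta[s, a]` and an action-value estimate supplied by the caller (for example from a learned critic). The helper names `grad_log_softmax` and `policy_gradient_step` are hypothetical. For tabular softmax, $\nabla\ln\pi(A_t|S_t,\theta)$ is zero except in the row for $S_t$, where it equals the one-hot vector for $A_t$ minus $\pi(\cdot|S_t,\theta)$.

```python
import numpy as np

def grad_log_softmax(theta, state, action):
    """Gradient of ln pi(action | state, theta) for tabular softmax preferences."""
    prefs = theta[state] - theta[state].max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    grad = np.zeros_like(theta)
    grad[state] = -pi             # minus pi(b | state) for every action b
    grad[state, action] += 1.0    # plus one for the action actually taken
    return grad

def policy_gradient_step(theta, state, action, q_estimate, alpha=0.1):
    """One stochastic gradient ascent step; q_estimate stands in for q_pi(S_t, A_t)."""
    return theta + alpha * q_estimate * grad_log_softmax(theta, state, action)

# Hypothetical transition: 2 states, 3 actions, all preferences start at zero.
theta = np.zeros((2, 3))
new_theta = policy_gradient_step(theta, state=0, action=2, q_estimate=1.5, alpha=0.1)
```

With a positive value estimate the preference for the taken action rises and the others fall, so the action becomes more likely in that state. A negative estimate has the opposite effect.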
