# Policy Gradient Methods

## Introduction

In this chapter we consider methods that, instead of deriving a policy from learned action values, learn a parameterized policy that can select actions without consulting a value function. A value function may still be used to learn the policy parameter, but is not required for action selection. We use the notation $\theta \in R^{d^\prime}$ for the policy's parameter vector. Thus we write $\pi(a | s, \theta) = P(A_t = a | S_t=s, \theta_t = \theta)$ for the probability that action $a$ is taken at time $t$, given that the environment is in state $s$ at time $t$ with parameter $\theta$.
If a method uses a learned value function as well, then the value function's weight vector is denoted $w \in R^d$ as usual, as in $\hat{v} (s, w)$.

In this chapter we consider methods for learning the policy parameter based on the gradient of some scalar performance measure $J(\theta)$ with respect to the policy parameter. These methods seek to maximize performance, so their updates approximate gradient ascent in J:

$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}$

where $\widehat{\nabla J(\theta_t)} \in R^{d^{\prime}}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$ (much as a minibatch gradient approximates the full gradient). All methods that follow this general schema we call policy gradient methods, whether or not they also learn an approximate value function. Methods that learn approximations to both policy and value functions are often called actor-critic methods, where 'actor' is a reference to the learned policy, and 'critic' refers to the learned value function, usually a state-value function. First we treat the episodic case, in which **performance is defined as the value of the start state under the parameterized policy**, before going on to consider the continuing case, in which performance is defined as the average reward rate.
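
The update rule above can be illustrated with a minimal sketch. It is not a reinforcement learning problem: the objective $J(\theta) = -(\theta - 3)^2$, the Gaussian noise model, and all function names are assumptions chosen only to show stochastic gradient ascent with an estimate whose expectation equals the true gradient.

```python
import random

def true_gradient(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2 (an assumption)."""
    return -2.0 * (theta - 3.0)

def noisy_gradient(theta, rng):
    """Stochastic estimate: the true gradient plus zero-mean Gaussian noise."""
    return true_gradient(theta) + rng.gauss(0.0, 1.0)

def gradient_ascent(theta0, alpha, steps, seed=0):
    """Repeatedly apply theta <- theta + alpha * (estimate of grad J(theta))."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta = theta + alpha * noisy_gradient(theta, rng)
    return theta

theta_final = gradient_ascent(theta0=0.0, alpha=0.01, steps=5000)
print(theta_final)  # fluctuates near the maximizer theta = 3
```

With a small step size the noise averages out, so the iterate hovers around the maximizer even though no single gradient estimate is exact.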

## Policy Approximation and its Advantages

In policy gradient methods, the policy can be parameterized in any way, as long as $\pi(a | s, \theta)$ is differentiable with respect to its parameters, that is, as long as $\nabla \pi(a | s, \theta)$ (the column vector of partial derivatives of $\pi(a | s, \theta)$ with respect to the components of $\theta$) exists and is finite for all $s \in S$, $a \in A(s)$, and $\theta \in R^{d^\prime}$. In practice, **to ensure exploration we generally require that the policy never becomes deterministic**, that is, that $\pi(a | s, \theta)$ stays strictly between 0 and 1. In this section we introduce the most common parameterization for discrete action spaces and point out the advantages it offers over action-value methods. Policy-based methods also offer useful ways of dealing with continuous action spaces.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences $h(s, a | \theta) \in R$ for each state-action pair. The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to an exponential soft-max distribution:

$\pi (a | s, \theta) = \frac{e^{h(s, a | \theta)}}{\sum_{b} e^{h(s, b | \theta)}}$

Note that the denominator here is just what is required so that the action probabilities in each state sum to one. We call this kind of policy parameterization **soft-max in action preferences**.

The action preferences $h(s, a | \theta)$ themselves can be parameterized arbitrarily. For example, they might be computed by a deep neural network, where $\theta$ is the vector of all the connection weights of the network. Or the preferences could simply be linear in features,

$h(s, a | \theta) = \theta^T \phi(s, a)$
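
As a minimal sketch of soft-max in action preferences with linear features, the following computes $\pi(\cdot | s, \theta)$ for a single state. The function name `softmax_policy`, the parameter values, and the feature vectors are illustrative assumptions, not taken from the text. Subtracting the largest preference before exponentiating leaves the probabilities unchanged and avoids overflow.

```python
import math

def softmax_policy(theta, features):
    """Return pi(a | s, theta) for every action in one state.

    theta:    parameter vector (list of floats).
    features: one feature vector phi(s, a) per action (hypothetical values).
    """
    # Linear preferences h(s, a | theta) = theta^T phi(s, a).
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in features]
    # Shift by the maximum preference for numerical stability.
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)  # the normalizing denominator
    return [e / z for e in exps]

theta = [0.5, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # phi(s, a) for three actions
probs = softmax_policy(theta, features)
print(probs)  # the action with the highest preference gets the highest probability
```

Here the preferences are 0.5, -1.0 and -0.5, so the first action is the most likely and the second the least likely, while every action keeps a nonzero probability.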

One advantage of parameterizing policies according to the soft-max in action preferences is that the approximate policy can approach a deterministic policy, whereas with $\epsilon$-greedy action selection over action values there is always an $\epsilon$ probability of selecting a random action. Of course, one could select according to a soft-max distribution based on action values, but this alone would not allow the policy to approach a deterministic policy. Instead, the action-value estimates would converge to their corresponding true values, which would differ by a finite amount, translating to specific probabilities other than 0 and 1. If the soft-max distribution included a temperature parameter, then the temperature could be reduced over time to approach determinism, but in practice it would be difficult to choose
the reduction schedule, or even the initial temperature, without more prior knowledge of the true action values than we would like to assume. Action preferences are different because they do not approach specific values; instead they are driven to produce the optimal stochastic policy. If the optimal policy is deterministic, then the preferences of the optimal actions will be driven infinitely higher than all suboptimal actions (if permitted by the parameterization).
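
The contrast can be made concrete with a small sketch. The numbers are made up for illustration: two actions whose true values differ by 0.1, and preference gaps of 1, 5 and 20. The helper name `softmax` and its `temperature` argument are assumptions for this example.

```python
import math

def softmax(values, temperature=1.0):
    """Soft-max distribution over a list of numbers, with optional temperature."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

# Soft-max over converged action values: the gap is fixed at 0.1, so the
# probability of the better action settles near 0.52 and goes no further.
p_values = softmax([1.0, 0.9])

# Lowering the temperature sharpens the same distribution, but the schedule
# has to be chosen by hand.
p_cold = softmax([1.0, 0.9], temperature=0.01)

# Soft-max over preferences: the gap is free to grow, so the probability of
# the preferred action can approach 1.
p_prefs = [softmax([gap, 0.0])[0] for gap in (1.0, 5.0, 20.0)]

print(p_values[0], p_cold[0], p_prefs)
```

Only the last case approaches determinism without an externally tuned schedule, which is the point made in the paragraph above.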

A second advantage of parameterizing policies according to the soft-max in action preferences is that it enables the selection of actions with arbitrary probabilities. In problems with significant function approximation, the best approximate policy may be stochastic. For example, in card games with imperfect information the optimal play is often to do two different things with specific probabilities, such as when bluffing in poker. **Action-value methods have no natural way of finding stochastic optimal policies, whereas policy-approximating methods can**.

Perhaps the simplest advantage that policy parameterization may have over action-value parameterization is that the policy may be a simpler function to approximate. Problems vary in the complexity of their policies and action-value functions. For some,
the action-value function is simpler and thus easier to approximate. For others, the policy
is simpler. In the latter case a policy-based method will typically learn faster and yield a
superior asymptotic policy.

Finally, we note that the choice of policy parameterization is sometimes a good way
of injecting prior knowledge about the desired form of the policy into the reinforcement
learning system. This is often the most important reason for using a policy-based learning
method.

## The Policy Gradient Theorem

In addition to the practical advantages of policy parameterization over $\epsilon$-greedy action selection, there is also an important theoretical advantage. With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas in $\epsilon$-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values, if that change results in a different action having the maximal value. Largely because of this, stronger convergence guarantees are available for
policy-gradient methods than for action-value methods. In particular, it is the continuity of the policy dependence on the parameters that enables policy-gradient methods to approximate gradient ascent.
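
A small numerical sketch of this smoothness argument follows. The two-action setup, the perturbation size `delta`, and both helper functions are assumptions for illustration. A change of $2 \times 10^{-6}$ in one estimate flips the greedy action, so the $\epsilon$-greedy probabilities jump, while the soft-max probabilities barely move.

```python
import math

def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Action probabilities under epsilon-greedy selection."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def softmax_probs(prefs):
    """Action probabilities under a soft-max distribution."""
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

delta = 1e-6
before = [1.0, 1.0 - delta]  # action 0 is (barely) greedy
after = [1.0, 1.0 + delta]   # action 1 is (barely) greedy

# Change in the probability of action 0 under each scheme.
eg_jump = abs(epsilon_greedy_probs(after)[0] - epsilon_greedy_probs(before)[0])
sm_jump = abs(softmax_probs(after)[0] - softmax_probs(before)[0])
print(eg_jump, sm_jump)
```

The $\epsilon$-greedy probability of action 0 moves by roughly $1 - \epsilon$, whereas the soft-max probability moves by an amount on the order of `delta`.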

The episodic and continuing cases define the performance measure, $J(\theta)$, differently and thus have to be treated separately to some extent. Nevertheless, we will try to present both cases uniformly, and we develop a notation so that the major theoretical results can be described with a single set of equations.

In this section, we treat the episodic case, for which we define the performance measure as the value of the start state of the episode. We can simplify the notation without losing any meaningful generality by **assuming that every episode starts in some particular (non-random) state $s_0$**. Then, in the episodic case, we define performance as

$J(\theta) = v_{\pi_{\theta}} (s_0)$

where $v_{\pi_{\theta}}$ is the true value function for $\pi_{\theta}$, the policy determined by $\theta$. From here on in our discussion, we will assume no discounting ($\gamma = 1$) for the episodic case, although for completeness, we do include the possibility of discounting in the boxed algorithm.

With function approximation, it may seem challenging to change the policy parameter in a way that ensures improvement. The problem is that performance depends on both the action selections and the distribution of states in which those selections are made, and that both of these are affected by the policy parameter. Given a state, the effect of the policy parameter on the actions, and thus on reward, can be computed in a relatively straightforward way from knowledge of the parameterization. But the effect of the policy on the state distribution ($\mu$) is a function of the environment and is typically unknown.
How can we estimate the performance gradient with respect to the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?

Fortunately, there is an excellent theoretical answer to this challenge in the form of the policy gradient theorem, which provides an analytic expression for the gradient of performance with respect to the policy parameter (which is what we need to approximate for gradient ascent) that does not involve the derivative of the state distribution. The policy gradient theorem for the episodic case establishes that

$\nabla J(\theta) \propto \sum_{s} \mu (s) \sum_a q_{\pi} (s, a) \nabla \pi(a | s, \theta)$

where the gradients are column vectors of partial derivatives with respect to the components of $\theta$, and $\pi$ denotes the policy corresponding to parameter vector $\theta$. In the episodic case, the constant of proportionality is the average length of an episode, and in the continuing case it is 1, so that the relationship is actually an equality. The distribution $\mu$ here is the on-policy state distribution under $\pi$.

<img src='pngs/on-policy distribution.png'>
<img src='pngs/on-policy distribution 2.png'>
<img src='pngs/proof_PGT_1.png'>

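Let $\eta (s) = \sum_{k=0}^{\infty} P^{k}_{\pi}(s | s_0)$, where $P^{k}_{\pi}(s | s_0)$ is the probability of being in state $s$ after $k$ steps when starting from $s_0$ and following $\pi$. Thus $\eta(s)$ is the expected number of visits to state $s$ in an episode that starts at $s_0$.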

Then

$\nabla J(\theta) $

$= \nabla v_{\pi} (s_0) $

$= \sum_{s} \eta (s) \sum_{a} \nabla \pi(a | s) q_{\pi} (s, a) $

$= \sum_{s^\prime} \eta (s^\prime) \sum_{s} \frac{\eta (s)}{\sum_{s^{\prime\prime}} \eta (s^{\prime\prime})} \sum_{a} \nabla \pi(a | s) q_{\pi} (s, a)$

$= \sum_{s^\prime} \eta (s^\prime) \sum_{s} \mu (s) \sum_{a} \nabla \pi(a | s) q_{\pi} (s, a)$

$\propto  \sum_{s} \mu (s) \sum_{a} \nabla \pi(a | s) q_{\pi} (s, a)$
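
The unnormalized form of the result, $\nabla J(\theta) = \sum_{s} \eta(s) \sum_{a} \nabla \pi(a | s) q_{\pi}(s, a)$, can be checked numerically on a tiny example. The sketch below is only an illustration under stated assumptions: the two-state episodic MDP, its rewards, the tabular soft-max policy, and every name in the code are hypothetical. It compares the theorem's expression with a central finite-difference estimate of $\nabla v_{\pi}(s_0)$, with $\gamma = 1$.

```python
import math

# Hypothetical toy MDP: states 0 and 1 are non-terminal, 2 is terminal.
# MDP[s][a] lists (probability, next_state, reward) outcomes.
TERMINAL = 2
MDP = {
    0: {0: [(1.0, 1, 0.0)],
        1: [(0.5, 0, 1.0), (0.5, TERMINAL, 0.0)]},
    1: {0: [(1.0, TERMINAL, 2.0)],
        1: [(0.5, 0, 0.0), (0.5, TERMINAL, 5.0)]},
}
STATES, ACTIONS, START = [0, 1], [0, 1], 0

def policy(theta, s):
    """Tabular soft-max: theta[s][a] plays the role of h(s, a | theta)."""
    m = max(theta[s])
    exps = [math.exp(h - m) for h in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def backup(v, s, a):
    """Expected reward plus next-state value for taking a in s."""
    return sum(p * (r + v[s2]) for p, s2, r in MDP[s][a])

def evaluate(theta, sweeps=500):
    """Iterative policy evaluation; returns (v, q) for the soft-max policy."""
    v = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    for _ in range(sweeps):
        for s in STATES:
            pi = policy(theta, s)
            v[s] = sum(pi[a] * backup(v, s, a) for a in ACTIONS)
    q = {s: [backup(v, s, a) for a in ACTIONS] for s in STATES}
    return v, q

def expected_visits(theta, sweeps=500):
    """eta(s): expected number of visits to s in an episode from START."""
    eta = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        new = {s: (1.0 if s == START else 0.0) for s in STATES}
        for s in STATES:
            pi = policy(theta, s)
            for a in ACTIONS:
                for p, s2, _r in MDP[s][a]:
                    if s2 != TERMINAL:
                        new[s2] += eta[s] * pi[a] * p
        eta = new
    return eta

def theorem_gradient(theta):
    """sum_s eta(s) sum_a q(s, a) grad pi(a | s), indexed by (s, b)."""
    _, q = evaluate(theta)
    eta = expected_visits(theta)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in STATES:
        pi = policy(theta, s)
        for b in ACTIONS:
            # d pi(a|s) / d theta[s][b] = pi(a|s) * (1[a == b] - pi(b|s))
            grad[s][b] = eta[s] * sum(
                q[s][a] * pi[a] * ((1.0 if a == b else 0.0) - pi[b])
                for a in ACTIONS)
    return grad

def finite_difference_gradient(theta, eps=1e-5):
    """Central differences on J(theta) = v_pi(START)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in STATES:
        for b in ACTIONS:
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (evaluate(up)[0][START]
                          - evaluate(down)[0][START]) / (2 * eps)
    return grad

theta = [[0.2, -0.1], [0.4, 0.3]]
analytic = theorem_gradient(theta)
numeric = finite_difference_gradient(theta)
max_gap = max(abs(analytic[s][b] - numeric[s][b])
              for s in STATES for b in ACTIONS)
print(max_gap)  # small: the two gradients agree up to numerical error
```

Note that the check never differentiates the state distribution: `expected_visits` is evaluated only at the current $\theta$, which is exactly what the theorem permits. Dividing $\eta$ by its sum would give $\mu$, and that sum is the average episode length mentioned above.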

## REINFORCE: Monte Carlo Policy Gradient

