# Policy Gradient Theorem Derivation

Consider an MDP with state space $S$, discrete action space $A = \{a_1, a_2, a_3\}$, reward function $R$, discount factor $\gamma$, and a policy parameterized via a softmax function as follows:

$$ 
\pi(a|s) = \frac{\exp(z(s,a))}{\sum_{a' \in A} \exp(z(s,a'))},
$$

where $z(s,a)$ are the parameters (or scores) corresponding to the state-action pair $(s,a)$. The performance objective is defined as

$$ 
J(\pi) = \mathbb{E}_{s \sim d^\pi}[V^\pi(s)],
$$

with $d^\pi(s)$ being the steady-state distribution induced by the policy $\pi$, and the state-value function given by

$$ 
V^\pi(s) = \sum_{a \in A} \pi(a|s) Q^\pi(s,a).
$$

Our goal is to show that the gradient of $J(\pi)$ with respect to the policy parameter $z(s,a)$ is

$$ 
\frac{\partial J(\pi)}{\partial z(s,a)} = d^\pi(s) \; \pi(a|s) \; A^\pi(s,a),
$$

where the advantage function is defined as

$$ 
A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).
$$
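As a concrete illustration of these definitions, here is a minimal Python sketch for a single state with three actions. The scores `z` and action values `q` are made-up numbers chosen only for illustration, and the function names are hypothetical helpers, not part of any library.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def state_value(pi, q):
    """V(s) = sum_a pi(a|s) * Q(s,a)."""
    return sum(p * qa for p, qa in zip(pi, q))

def advantages(pi, q):
    """A(s,a) = Q(s,a) - V(s) for every action a."""
    v = state_value(pi, q)
    return [qa - v for qa in q]

# Illustrative numbers only (assumed, not from the text):
# scores z(s, .) and action values Q(s, .) at one state s.
z = [0.5, -1.0, 2.0]
q = [1.0, 0.0, 3.0]

pi = softmax(z)
adv = advantages(pi, q)
```

One property worth noting: the policy-weighted advantages always sum to zero, $\sum_{a} \pi(a|s) A^\pi(s,a) = 0$, which follows directly from the definition of $V^\pi(s)$.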

## Step 1: Policy Gradient Theorem

The policy gradient theorem states that

$$ 
\nabla_{\theta} J(\pi) = \sum_{s \in S} d^\pi(s) \sum_{a \in A} \nabla_{\theta} \pi(a|s) \; Q^\pi(s,a),
$$

where $\theta$ denotes the parameters of the policy. In our case, $\theta$ is the table of scores $z(s,a)$, one per state-action pair. Using the likelihood ratio trick (the log-derivative identity), we have

$$ 
\nabla_{\theta} \pi(a|s) = \pi(a|s) \; \nabla_{\theta} \log \pi(a|s).
$$

Substituting this expression into the gradient gives

$$ 
\nabla_{\theta} J(\pi) = \sum_{s \in S} d^\pi(s) \sum_{a \in A} \pi(a|s) \; \nabla_{\theta} \log \pi(a|s) \; Q^\pi(s,a).
$$

## Step 2: Differentiating $\log \pi(a|s)$ with Respect to $z(s,a)$

Since the policy is parameterized as a softmax, we write its log as

$$ 
\log \pi(a|s) = z(s,a) - \log \Bigl( \sum_{a' \in A} \exp(z(s,a')) \Bigr).
$$

Taking the derivative with respect to $z(s,a)$ (for fixed $s$ and $a$):

1. The derivative of $z(s,a)$ with respect to itself is $1$.

2. For the second term, by the chain rule,

$$ 
\frac{\partial}{\partial z(s,a)} \log \Bigl( \sum_{a' \in A} \exp(z(s,a')) \Bigr) = \frac{\exp(z(s,a))}{\sum_{a' \in A} \exp(z(s,a'))} = \pi(a|s).
$$

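Thus,

$$ 
\frac{\partial}{\partial z(s,a)} \log \pi(a|s) = 1 - \pi(a|s).
$$

For a different action $a' \neq a$ at the same state, the first term of $\log \pi(a'|s)$ does not depend on $z(s,a)$, so only the normalizer contributes and the derivative is $-\pi(a|s)$. Both cases combine into

$$ 
\frac{\partial}{\partial z(s,a)} \log \pi(a'|s) = \mathbb{1}[a' = a] - \pi(a|s),
$$

and the derivative of $\log \pi(a'|s')$ vanishes for any other state $s' \neq s$.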
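The softmax log-derivative can be sanity-checked numerically. The sketch below compares the closed form $\partial \log \pi(a'|s) / \partial z(s,a) = \mathbb{1}[a' = a] - \pi(a|s)$, whose diagonal entries are the $1 - \pi(a|s)$ case, against central finite differences at one state. The scores and the step size `eps` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def log_softmax(scores):
    """log pi(a|s) for every action, using a max shift for numerical stability."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scores))
    return [x - log_norm for x in scores]

def analytic_jacobian(scores):
    """jac[b][a] = d log pi(b|s) / d z(s,a) = 1[b == a] - pi(a|s)."""
    pi = [math.exp(lp) for lp in log_softmax(scores)]
    n = len(scores)
    return [[(1.0 if b == a else 0.0) - pi[a] for a in range(n)] for b in range(n)]

def numeric_jacobian(scores, eps=1e-6):
    """Central finite-difference estimate of the same Jacobian."""
    n = len(scores)
    jac = [[0.0] * n for _ in range(n)]
    for a in range(n):
        up = list(scores)
        down = list(scores)
        up[a] += eps
        down[a] -= eps
        lp_up, lp_down = log_softmax(up), log_softmax(down)
        for b in range(n):
            jac[b][a] = (lp_up[b] - lp_down[b]) / (2 * eps)
    return jac

# Illustrative scores z(s, .) at a single state (assumed values).
z = [0.5, -1.0, 2.0]
exact = analytic_jacobian(z)
approx = numeric_jacobian(z)
max_error = max(abs(exact[b][a] - approx[b][a]) for b in range(3) for a in range(3))
```

The off-diagonal entries are nonzero: raising the score of one action lowers the log-probability of every other action at that state through the shared normalizer.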

## Step 3: Combining with the Policy Gradient Expression

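Substitute the derivative back into the gradient expression. Because $z(s,a)$ enters only the policy at state $s$, the terms for every other state vanish. Within state $s$, however, every action $a'$ contributes, since $z(s,a)$ also appears in the normalizer of $\pi(a'|s)$; the derivative is $\frac{\partial}{\partial z(s,a)} \log \pi(a'|s) = \mathbb{1}[a' = a] - \pi(a|s)$. The theorem's expression differentiates only the policy, not $Q^\pi$, so

$$ 
\frac{\partial J(\pi)}{\partial z(s,a)} = d^\pi(s) \sum_{a' \in A} \pi(a'|s) \bigl( \mathbb{1}[a' = a] - \pi(a|s) \bigr) Q^\pi(s,a').
$$

The indicator picks out the term $a' = a$, and $\pi(a|s)$ factors out of the remaining sum, so we obtain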

$$ 
\frac{\partial J(\pi)}{\partial z(s,a)} = d^\pi(s) \; \pi(a|s) \Bigl[ Q^\pi(s,a) - \sum_{a' \in A} \pi(a'|s) \; Q^\pi(s,a') \Bigr].
$$

Since the state-value function is defined as

$$ 
V^\pi(s) = \sum_{a' \in A} \pi(a'|s) \; Q^\pi(s,a'),
$$

the expression inside the brackets becomes the advantage function

$$ 
A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s).
$$

Thus, the gradient simplifies to

$$ 
\frac{\partial J(\pi)}{\partial z(s,a)} = d^\pi(s) \; \pi(a|s) \; A^\pi(s,a).
$$
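The algebra in this step can be checked on a small table. The sketch below evaluates the sum $d^\pi(s) \sum_{a'} \pi(a'|s) \bigl( \mathbb{1}[a' = a] - \pi(a|s) \bigr) Q^\pi(s,a')$ directly and compares it with the closed form $d^\pi(s) \, \pi(a|s) \, A^\pi(s,a)$. The two-state tables `z`, `q`, and `d` are invented numbers used purely for illustration; they are not derived from any particular MDP.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative tables (assumed values): scores, action values, state weights.
z = {"s0": [0.5, -1.0, 2.0], "s1": [0.0, 0.3, -0.7]}
q = {"s0": [1.0, 0.0, 3.0], "s1": [-2.0, 0.5, 1.5]}
d = {"s0": 0.4, "s1": 0.6}

def gradient_via_sum(s, a):
    """d(s) * sum_b pi(b|s) * (1[b == a] - pi(a|s)) * Q(s,b)."""
    pi = softmax(z[s])
    total = 0.0
    for b, (p_b, q_b) in enumerate(zip(pi, q[s])):
        indicator = 1.0 if b == a else 0.0
        total += p_b * (indicator - pi[a]) * q_b
    return d[s] * total

def gradient_closed_form(s, a):
    """d(s) * pi(a|s) * A(s,a), with A(s,a) = Q(s,a) - V(s)."""
    pi = softmax(z[s])
    v = sum(p * qa for p, qa in zip(pi, q[s]))
    return d[s] * pi[a] * (q[s][a] - v)

max_gap = max(
    abs(gradient_via_sum(s, a) - gradient_closed_form(s, a))
    for s in z
    for a in range(3)
)
```

The two expressions agree up to floating-point rounding for every state-action pair, as the derivation predicts.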

**Final Result:**

$$ 
\bigl[ \nabla_z J(\pi) \bigr]_{(s,a)} = \frac{\partial J(\pi)}{\partial z(s,a)} = d^\pi(s) \; \pi(a|s) \; A^\pi(s,a) \quad \text{for all } s \in S,\ a \in A.
$$

This derivation shows that under the softmax policy parameterization, the gradient of the expected return with respect to the parameter $z(s,a)$ is given by the product of the steady-state distribution $d^\pi(s)$, the policy probability $\pi(a|s)$, and the advantage function $A^\pi(s,a)$.
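For an end-to-end check, consider the simplest special case: a single-state problem (a bandit), where $d^\pi(s) = 1$ and $Q^\pi(s,a)$ is just a fixed reward $r(a)$ that does not depend on the policy. The objective is then $J = \sum_a \pi(a) r(a)$ and the result above predicts $\partial J / \partial z(a) = \pi(a) \bigl( r(a) - J \bigr)$. The sketch below compares this with central finite differences of $J$. The reward values are illustrative assumptions, and the check does not carry over directly to multi-state MDPs, where $d^\pi$ and $Q^\pi$ themselves change with the policy.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(z, rewards):
    """J = sum_a pi(a) * r(a) for a single-state problem."""
    return sum(p * r for p, r in zip(softmax(z), rewards))

def analytic_gradient(z, rewards):
    """dJ/dz(a) = pi(a) * (r(a) - J), i.e. pi(a) times the advantage."""
    pi = softmax(z)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]

def numeric_gradient(z, rewards, eps=1e-6):
    """Central finite-difference estimate of dJ/dz(a) for each action a."""
    grad = []
    for a in range(len(z)):
        up = list(z)
        down = list(z)
        up[a] += eps
        down[a] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad

# Illustrative scores and rewards (assumed values).
z = [0.5, -1.0, 2.0]
rewards = [1.0, 0.0, 3.0]

exact = analytic_gradient(z, rewards)
approx = numeric_gradient(z, rewards)
max_error = max(abs(e - n) for e, n in zip(exact, approx))
```

The gradient components sum to zero across actions, reflecting that adding the same constant to every score at a state leaves the softmax policy unchanged.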