# Policy gradient intuition

Policy gradient is one of the most popular algorithms in deep reinforcement learning.
As we have learned, policy gradient is a policy-based method by which we can find
the optimal policy without computing the Q function. It finds the optimal policy by
directly parameterizing the policy using some parameter $\theta$.


The policy gradient method uses a stochastic policy. We have learned that with a
stochastic policy, we select an action based on the probability distribution over the
action space. Say we have a stochastic policy $\pi$; it gives the probability of taking
an action $a$ given the state $s$, denoted by $\pi(a|s)$. In the policy gradient
method, we use a parameterized policy, so we denote our policy as $\pi_{\theta}(a|s)$,
where $\theta$ indicates that our policy is parameterized.


Wait! What do we mean when we say a parameterized policy? What is it exactly?
Remember with DQN, we learned that we parameterize our Q function to compute
the Q value? We can do the same here, except instead of parameterizing the Q
function, we will directly parameterize the policy to compute the optimal policy.
That is, we can use any function approximator to learn the optimal policy, and θ is
the parameter of our function approximator. We generally use a neural network as
our function approximator. Thus, we have a policy π parameterized by θ where θ is
the parameter of the neural network.

Say we have a neural network with a parameter θ. First, we feed the state of the
environment as an input to the network, and it outputs the probability of each
action that can be performed in that state. That is, it outputs a probability
distribution over the action space. We have learned that with policy gradient, we use
a stochastic policy. So, the stochastic policy selects an action by sampling from the
probability distribution given by the neural network. In this way, we can directly
compute the policy without using the Q function.
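
The idea of a network that maps a state to action probabilities can be sketched in a few lines of NumPy. This is only a minimal illustration, not a trained policy: the one-hot state encoding, the single hidden layer, the layer sizes, and the random initial weights are all assumptions made for the example, and `policy` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 16 grid cells, 8 hidden units, 4 actions
n_states, n_hidden, n_actions = 16, 8, 4

# theta: the parameters of the policy network (randomly initialized, untrained)
theta = {
    "W1": rng.normal(scale=0.1, size=(n_states, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.1, size=(n_hidden, n_actions)),
    "b2": np.zeros(n_actions),
}

def policy(state, theta):
    """Return pi_theta(a|s): a probability for every action in the given state."""
    hidden = np.tanh(state @ theta["W1"] + theta["b1"])
    logits = hidden @ theta["W2"] + theta["b2"]
    logits = logits - logits.max()   # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # softmax: non-negative, sums to 1

# Assumed encoding: the state is a one-hot vector over the grid cells
state = np.zeros(n_states)
state[5] = 1.0

action_probs = policy(state, theta)
# The stochastic policy samples an action index from the distribution
action = rng.choice(n_actions, p=action_probs)
print(action_probs, action)
```

Because the weights are untrained, the probabilities come out close to uniform. Learning consists of adjusting `theta` so that better actions receive higher probability.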

Let's understand how the policy gradient method works with an example. Let's take
our favorite grid world environment for better understanding. We know that in the
grid world environment our action space has four possible actions: up, down, left,
and right.

Given any state as an input, the neural network will output the probability
distribution over the action space. That is, as shown in the following figure, when we feed the
state E as an input to the network, it will return the probability distribution over all
actions in our action space. Now, our stochastic policy will select an action based on
the probability distribution given by the neural network. So, it will select action up
10% of the time, down 10% of the time, left 10% of the time, and right 70% of the time:

![title](Images/3.png)
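
To see what "selects right 70% of the time" means in practice, we can sample many actions from the distribution in the example and count how often each one appears. This is a small sketch that assumes the network has already returned the probabilities from the example for state E. The sample count and random seed are arbitrary choices.

```python
import numpy as np

actions = ["up", "down", "left", "right"]
# Probabilities from the example, assumed to be the network output for state E
action_probs = np.array([0.1, 0.1, 0.1, 0.7])

rng = np.random.default_rng(42)

# Sample 10,000 actions from the stochastic policy
samples = rng.choice(len(actions), size=10_000, p=action_probs)

# Empirical frequency of each action across all samples
frequencies = np.bincount(samples, minlength=len(actions)) / samples.size

for name, freq in zip(actions, frequencies):
    print(f"{name}: {freq:.3f}")
```

The empirical frequencies should land close to the probabilities given by the network, with small deviations due to sampling.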

We should not confuse DQN with the policy gradient method. With
DQN, we feed the state as an input to the network, and it returns the Q values of all
possible actions in that state; we then select the action that has the maximum Q value.
But in the policy gradient method, we feed the state as input to the network, and it
returns the probability distribution over the action space, and our stochastic policy
uses the probability distribution returned by the neural network to select an action.
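
The difference in how the two methods pick an action can be sketched as follows. The network outputs here are made-up numbers for a single state, used only to contrast the two selection rules. The DQN side shows greedy selection as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["up", "down", "left", "right"]

# Hypothetical outputs of the two kinds of network for the same state
q_values = np.array([1.2, 0.4, -0.3, 2.5])       # DQN: one Q value per action
action_probs = np.array([0.1, 0.1, 0.1, 0.7])    # policy gradient: probabilities

# DQN: pick the action with the maximum Q value (deterministic given q_values)
dqn_action = int(np.argmax(q_values))

# Policy gradient: sample an action from the probability distribution
pg_action = int(rng.choice(len(actions), p=action_probs))

print("DQN selects:", actions[dqn_action])
print("Policy gradient samples:", actions[pg_action])
```

Note that Q values are unbounded scores and need not sum to 1, whereas the policy network's outputs form a valid probability distribution.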

Okay, in the policy gradient method, the network returns the probability distribution
(action probabilities) over the action space, but how accurate are the probabilities?
How does the network learn?

__We will discuss this in detail in the next section.__