# Deep Deterministic Policy Gradient

Deep Deterministic Policy Gradient (DDPG) is a deep reinforcement learning (DRL) algorithm for continuous control, designed specifically for environments with continuous action spaces.
It extends DQN to continuous action spaces by introducing a deterministic actor that directly outputs continuous actions. The algorithm is model-free, online, and off-policy.


### Experience Replay buffer
Experience replay is typically used in off-policy reinforcement learning algorithms. The set $\mathcal D$ holds the agent's previous experiences. The idea is to capture the samples generated by the agent as it interacts with its environment and store them for later reuse. The buffer does not store associated values (Q-values), only the raw transition data.
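
As a minimal sketch of the idea, the buffer below stores raw transitions and samples them uniformly. The class name `ReplayBuffer`, the tuple layout `(state, action, reward, next_state, done)`, and the fixed capacity that evicts the oldest entries are assumptions made for illustration, not something these notes prescribe.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity store of raw transitions."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        # Only raw data is stored, never Q-values.
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        batch = self._rng.sample(list(self._storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self._storage)
```

A training loop would typically call `add` after every environment step and `sample` once enough transitions have accumulated.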

### Continuous action spaces
When the action space is continuous, we can't exhaustively evaluate it, and solving the optimization problem is highly non-trivial. A general-purpose optimization algorithm would make calculating $\max_a Q^*(s,a)$ a painfully expensive subroutine that would need to run every time the agent must decide on an action.
Because the action space is continuous, the function $Q^*(s,a)$ is presumed to be differentiable with respect to the action argument. We can exploit this differentiability by using gradient-based methods to learn a policy $\mu(s)$ whose output approximates the maximizing action.


## The Q-learning side
The Q-learning side starts from the Bellman equation describing the optimal action-value function, $Q^*(s,a)$. Given $Q^*$, the optimal action in state $s$ can be found by solving
$a^*(s)= \arg\max_a Q^*(s,a)$

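Suppose we use a neural network $Q_{\phi}$ to approximate the optimal action-value function. DDPG uses the mean-squared Bellman error (MSBE), which measures how close $Q_{\phi}$ comes to satisfying the Bellman equation on transitions $(s,a,r,s',d)$ sampled from $\mathcal D$, where $d$ indicates whether $s'$ is terminal:

$L(\phi, \mathcal D)= \underset{(s,a,r,s',d) \sim \mathcal D}{\mathrm E}\left[\left(Q_{\phi}(s,a)-\left(r+\gamma(1-d)\max_{a'}Q_{\phi}(s', a')\right)\right)^2\right]$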
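
To make the loss concrete, here is a small sketch that evaluates the mean-squared Bellman error on a minibatch. The function name `msbe` and its arguments are hypothetical, and the next-state values are assumed to be computed already, so the sketch only shows how reward, discount, and the done flag combine into the target.

```python
def msbe(q_values, rewards, dones, next_q_values, gamma=0.99):
    """Mean-squared Bellman error over a minibatch of transitions."""
    squared_errors = []
    for q, r, d, q_next in zip(q_values, rewards, dones, next_q_values):
        # (1 - d) removes the bootstrap term when s' is terminal.
        target = r + gamma * (1.0 - d) * q_next
        squared_errors.append((q - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Two transitions; the second one ends the episode (d = 1).
loss = msbe(
    q_values=[1.0, 2.0],
    rewards=[1.0, 0.0],
    dones=[0.0, 1.0],
    next_q_values=[2.0, 5.0],
    gamma=0.5,
)
print(loss)  # 2.5
```

In the first transition the target is $1 + 0.5 \cdot 2 = 2$, giving a squared error of 1. In the second the episode has ended, so the target is just the reward 0 and the squared error is 4.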



#### Target Networks
A target network is used to deal with non-stationary target values and make learning more stable. The target is the value we want our approximator $Q_{\phi}$ to take on:

$r+\gamma(1-d)\max_{a'}Q_{\phi}(s', a')$

The problem is that the target depends on the same parameters $\phi$ that we are trying to train, which makes MSBE minimization unstable. The solution is to compute the target with a second set of parameters that stays close to $\phi$ but lags behind it in time. This lagged copy is called the **target network**, and its parameters are denoted $\phi_{target}$. They are not trained by gradient descent.
The target network is updated once per main network update by Polyak averaging:

$\phi_{target} \leftarrow \rho\,\phi_{target} +(1-\rho)\,\phi$

where $\rho$ is a hyperparameter between 0 and 1, usually chosen close to 1.

Polyak averaging is an optimization technique that sets the final parameters to an average of the (recent) parameters visited along the optimization trajectory.
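
The target update can be sketched in a few lines. This assumes parameters are held as flat lists of floats and names the averaging coefficient `rho`; both choices, along with the function name `polyak_update`, are illustrative assumptions.

```python
def polyak_update(target_params, main_params, rho=0.995):
    """Move each target parameter a small step toward its main-network value."""
    return [
        rho * t + (1.0 - rho) * m
        for t, m in zip(target_params, main_params)
    ]


# rho = 0.5 is used here only to keep the arithmetic easy to follow.
new_target = polyak_update([0.0, 2.0], [1.0, 4.0], rho=0.5)
print(new_target)  # [0.5, 3.0]
```

With `rho` close to 1, the target parameters change slowly, which is what keeps the regression target nearly stationary between updates.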

**The target policy network** computes an action that approximately maximizes $Q_{\phi_{target}}$. Its parameters are denoted $\theta_{target}$, and it is obtained the same way as the target Q-function: by Polyak averaging the policy parameters over the course of training.

Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:

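$L(\phi, \mathcal D)= \underset{(s,a,r,s',d) \sim \mathcal D}{\mathrm E}\left[\left(Q_{\phi}(s,a)-\left(r+\gamma(1-d)Q_{\phi_{target}}(s', \mu_{\theta_{target}}(s'))\right)\right)^2\right]$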
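
The sketch below shows how the regression targets in this loss might be computed with both target networks in play. The two toy functions stand in for the target actor and target critic and are purely hypothetical; in practice both would be neural networks.

```python
def ddpg_targets(rewards, next_states, dones, target_actor, target_critic, gamma=0.99):
    """Compute r + gamma * (1 - d) * Q_target(s', mu_target(s')) per transition."""
    targets = []
    for r, s_next, d in zip(rewards, next_states, dones):
        a_next = target_actor(s_next)           # action chosen by the target policy
        q_next = target_critic(s_next, a_next)  # value from the target Q-function
        targets.append(r + gamma * (1.0 - d) * q_next)
    return targets


# Hypothetical stand-ins for the target networks (scalar state and action).
def toy_target_actor(s):
    return 0.5 * s


def toy_target_critic(s, a):
    return s + a


targets = ddpg_targets(
    rewards=[1.0, 2.0],
    next_states=[2.0, 4.0],
    dones=[0.0, 1.0],
    target_actor=toy_target_actor,
    target_critic=toy_target_critic,
    gamma=0.5,
)
print(targets)  # [2.5, 2.0]
```

Note that the target policy replaces the explicit maximization over actions: the next action is a single forward pass rather than an inner optimization problem.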


## The Policy Learning Side
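The goal is to learn a deterministic policy $\mu_{\theta}(s)$ that outputs the action maximizing $Q_{\phi}(s,a)$. Because the action space is continuous and the Q-function is assumed differentiable with respect to the action, we can perform gradient ascent (with respect to the policy parameters $\theta$ only, treating $\phi$ as constant) to solve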

$\max_{\theta} \underset{s \sim {\mathcal D}}{{\mathrm E}}\left[ Q_{\phi}(s, \mu_{\theta}(s)) \right].$
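
As a toy illustration of this objective, the sketch below runs gradient ascent on a one-parameter policy against a fixed, hand-written critic. Everything here is an assumption chosen so the gradient can be written by hand: the critic $Q(s,a) = -(a-2s)^2$, the linear policy $\mu_\theta(s) = \theta s$, the sampled states, and the learning rate. A real implementation would use neural networks and automatic differentiation.

```python
def dq_da(s, a):
    """Derivative of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def policy(theta, s):
    """Toy deterministic policy mu_theta(s) = theta * s."""
    return theta * s


def policy_gradient(theta, states):
    # Chain rule: dQ/dtheta = dQ/da * dmu/dtheta, and dmu/dtheta = s here.
    grads = [dq_da(s, policy(theta, s)) * s for s in states]
    return sum(grads) / len(grads)


states = [0.5, 1.0, -1.5, 2.0]  # stand-in for states sampled from the buffer
theta = 0.0
learning_rate = 0.1
for _ in range(200):
    # Ascent step: only the policy parameter moves, the critic stays fixed.
    theta += learning_rate * policy_gradient(theta, states)
```

For this critic the best action in state $s$ is $a = 2s$, so `theta` should approach 2 as the updates proceed.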
