<div align='center'>
    <h1> 
        <a href='https://arxiv.org/pdf/1509.02971'> Deep Deterministic Policy Gradient (DDPG) </a> 
    </h1>
</div>

# Intro

Deep Deterministic Policy Gradient (DDPG) is a `model-free` (no transition probability model is used), `off-policy` (uses a replay buffer), `actor-critic` (two neural networks) algorithm that combines elements of policy gradient methods with deep Q-learning. DDPG extends DQN to `continuous action spaces`.

- DDPG uses four neural networks:
    - The Actor network.
    - The Critic network.
    - The target Actor network.
    - The target Critic network.

$ \ $
- DDPG has two objective functions: one for the Actor network (`deterministic policy gradient loss`), $L^{DPG} = J(\pi_{\theta}) = \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))]$, and another for the Critic network (`mean-squared Bellman error (MSBE) loss`). The target Actor and target Critic networks are not trained by gradient steps; they are updated with the polyak averaging rule.

- DDPG maximizes $L^{DPG}$ with `gradient ascent` and minimizes MSBE with `gradient descent`.

- DDPG uses `temporal difference learning` (bootstrapping) and an `experience replay buffer` (for stability) to learn an `off-policy Action-Value function` $Q(s, a)$ (estimated by the `Critic Network`) via the `Bellman optimality equation`.
  
- DDPG uses the aforementioned $Q(s, a)$ value function to learn a `deterministic policy` $a_t = \mu_{\theta} (s_t)$ estimated by the `Actor Network`.

- Unlike DQN, DDPG does not use an $\epsilon$-greedy policy for action selection. Instead, the behavior policy takes the actions generated by the Actor network and adds `noise` to them to encourage `exploration` of the environment.

- DDPG is suitable for `continuous action spaces`.
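
For reference, the MSBE loss named in the list above can be written with the same symbols used in the Algorithm section ($Q_{\phi_{targ}}$ and $\mu_{\theta_{targ}}$ are the target networks, $\gamma$ is the discount factor, and $d$ is the done flag). The shorthand $s'$ for the next state $s_{t+1}$ is an assumption made here for brevity:

$$L^{MSBE}(\phi) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}} \Big[ \big( y(r, s', d) - Q_{\phi}(s, a) \big)^2 \Big], \qquad y(r, s', d) = r + \gamma (1-d) Q_{\phi_{targ}}(s', \mu_{\theta_{targ}}(s')).$$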

# The RL Goal in DDPG

The goal of reinforcement learning is to maximize the expected cumulative reward (a.k.a. expected return) $J(\pi_{\theta})$ under a parameterized policy $\pi_{\theta}$:

$$\max_{\theta} J(\pi_{\theta}) = \max_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)].$$

In DDPG, the RL goal is translated to learning a parameterized deterministic policy $\mu_{\theta}(s)$, represented by the Actor network, whose actions maximize the expected Q-value estimated by the `Critic network` $Q_{\phi}$:

$$\max_{\theta} J(\pi_{\theta}) = \max_{\theta} \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))].$$

Where the state $s$ is sampled from a replay buffer: $s \sim \mathcal{D}$.

In DDPG, the parameters of the `Critic Network` are updated by `gradient descent` on the mean-squared Bellman error (MSBE) loss to improve its Q-value predictions, while the parameters of the `Actor Network` $\mu_{\theta}$ are updated by `gradient ascent` on the Q-value with respect to the Actor's parameters:

\begin{eqnarray}
\theta^{\mu}_{t+1} &=& \theta^{\mu}_t + \alpha \nabla_{\theta^{\mu}} J(\mu_{\theta})\\
&=& \theta^{\mu}_t + \alpha \nabla_{\theta^{\mu}} \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))] \\
&=& \theta^{\mu}_t + \alpha \mathbb{E}_{s \sim \mathcal{D}} \Bigg[ \nabla_{\theta^{\mu}} Q_{\phi} (s, \mu_{\theta}(s)) \Bigg].
\end{eqnarray}
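
The update rule above can be illustrated on a toy problem. This is only a sketch under stated assumptions: the scalar linear actor `actor(theta, s) = theta * s` and the hand-written critic $Q(s, a) = -(a - 2s)^2$ are hypothetical stand-ins for the neural networks, chosen so that the best policy is known in advance ($\theta = 2$).

```python
import random

def actor(theta, s):
    """Hypothetical deterministic policy mu_theta(s) = theta * s."""
    return theta * s

def dq_da(s, a):
    """Gradient w.r.t. the action of the assumed critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (a - 2.0 * s)

def actor_gradient(theta, states):
    """Mini-batch estimate of grad_theta E[Q(s, mu_theta(s))]."""
    # d mu_theta(s) / d theta = s for the linear actor above.
    grads = [dq_da(s, actor(theta, s)) * s for s in states]
    return sum(grads) / len(grads)

random.seed(0)
# States standing in for the contents of the replay buffer D.
buffer_states = [random.uniform(-1.0, 1.0) for _ in range(1000)]

theta, alpha = 0.0, 0.1  # initial actor parameter and learning rate
for _ in range(500):
    batch = random.sample(buffer_states, 32)
    theta += alpha * actor_gradient(theta, batch)  # gradient ascent step

print(theta)  # approaches 2.0, the maximizer of the assumed critic
```

Note the plus sign in the update: the actor climbs the critic's value surface, whereas the critic itself is trained by descent on its loss.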

# Deriving the Deterministic Policy Gradient

The general form of the Policy Gradient, which follows from the **Policy Gradient Theorem** (Sutton et al., 1999), is: 

$$ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \Bigg[ \sum_{t=0}^T \nabla_{\theta} \log \left( \pi_{\theta} (a_t | s_t) \right) \Phi_t \Bigg].$$

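In the case of DDPG, $\Phi_t = Q_{\phi} (s_t, \mu_{\theta}(s_t))$. 

Because $\mu_{\theta}$ is deterministic, the gradient is taken through the Critic rather than through $\log \pi_{\theta}$, and the **Policy Gradient Theorem** in DDPG can be reformulated as: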

\begin{eqnarray}
        \nabla_{\theta^{\mu}} \Bigg( J(\mu_{\theta}) \Bigg) &=& \nabla_{\theta^{\mu}} \Bigg( \int  \underbrace{\mathbb{P}(\tau | \theta)}_{\text{Trajectory prob.}} Q_{\phi} (s, \mu_{\theta}(s)) \Bigg) \\
        &=& \nabla_{\theta^{\mu}} \Bigg( \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))] \Bigg)\\
        &=&  \mathbb{E}_{s \sim \mathcal{D}} \Bigg[ \nabla_{\theta^{\mu}} Q_{\phi} (s, \mu_{\theta}(s)) \Bigg] \\
        &\underbrace{=}_{\text{after chain rule}}&  \mathbb{E}_{s \sim \mathcal{D}} \Bigg[ \nabla_a Q_{\phi} (s, a)|_{a=\mu_{\theta}(s)} \nabla_{\theta^{\mu}} \mu_{\theta}(s) \Bigg] \\
        &\approx&  \underbrace{\frac{1}{|B|} \sum_{t, s \in B}}_{\text{empirical average}} \Bigg[ \nabla_a Q_{\phi} (s_t, a_t)|_{a_t=\mu_{\theta}(s_t)} \nabla_{\theta^{\mu}} \mu_{\theta}(s_t) \Bigg].
\end{eqnarray}

where $B$ is a mini-batch of transitions sampled from the replay buffer $\mathcal{D}$ and $|B|$ is the number of transitions in that mini-batch.
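
The chain-rule step in the derivation can be checked numerically. The sketch below assumes a hypothetical scalar actor $\mu_{\theta}(s) = \tanh(\theta s)$ and a hypothetical fixed critic $Q(s, a) = -(a - 0.5)^2 + s a$; neither comes from the paper. It compares the chain-rule gradient with a central finite difference of the objective.

```python
import math

def actor(theta, s):
    """Hypothetical deterministic policy mu_theta(s) = tanh(theta * s)."""
    return math.tanh(theta * s)

def critic(s, a):
    """Hypothetical fixed critic Q(s, a); a real critic is a trained network."""
    return -(a - 0.5) ** 2 + s * a

def objective(theta, states):
    """J(theta): mean over the batch of Q(s, mu_theta(s))."""
    return sum(critic(s, actor(theta, s)) for s in states) / len(states)

def dpg_gradient(theta, states):
    """Mean of (dQ/da at a = mu_theta(s)) * (d mu_theta(s) / d theta)."""
    total = 0.0
    for s in states:
        a = actor(theta, s)
        dq_da = -2.0 * (a - 0.5) + s      # dQ/da evaluated at a = mu_theta(s)
        dmu_dtheta = s * (1.0 - a ** 2)   # derivative of tanh(theta * s) w.r.t. theta
        total += dq_da * dmu_dtheta
    return total / len(states)

states = [-1.0, -0.3, 0.2, 0.8]  # a small mini-batch of states
theta, h = 0.7, 1e-5

analytic = dpg_gradient(theta, states)
numeric = (objective(theta + h, states) - objective(theta - h, states)) / (2 * h)
print(abs(analytic - numeric))  # the two estimates agree up to finite-difference error
```

In a deep learning framework, automatic differentiation performs this chain rule when the objective $Q_{\phi}(s, \mu_{\theta}(s))$ is backpropagated through the Critic into the Actor.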

# Algorithm

---
**Algorithm (Pseudocode): Deep Deterministic Policy Gradient (adapted from OpenAI Spinning Up)**

---

- Initialize the environment to a random state $s_t$. Initialize an empty replay buffer $\mathcal{D}$. Initialize random parameters $\theta^{\mu}$ and $\phi^{Q}$ for Actor and Critic networks, respectively. Set the target Actor and target Critic parameters to the main parameters: $\theta^{\mu_{targ}} \leftarrow \theta^{\mu}$, $\phi^{Q_{targ}} \leftarrow \phi^{Q}$.

- **Repeat:**

    - Feed the current state $s_t$ to the Actor network $\mu_{\theta}$, which returns an action $a_t = \mu_{\theta} (s_t)$ (a continuous value, not a probability, since the policy is deterministic). 

    - Add noise, typically Gaussian $\epsilon \sim \mathcal{N}$, to the action $a_t$ to encourage exploration.
    
    - Execute the action $a_t$ in the environment, which returns a reward $r_t$, the next state $s_{t+1}$, and a boolean done flag $d$ that tells whether the episode has ended.
 
    - Store the experience/transition as a tuple $(s_t, a_t, r_{t}, s_{t+1}, d)$ in the replay buffer $\mathcal{D}$. Sampling past transitions from the replay buffer decorrelates the training data, which stabilizes learning.

    - If the next state $s_{t+1}$ is the terminal state, reset the environment.

    - **Update Critic network (Q function)**:

        - Sample a random mini-batch $B$ of transitions $(s_t, a_t, r_t, s_{t+1}, d)$ from the replay buffer $\mathcal{D}$.

        - Use the `target Actor network` $\mu_{\theta_{targ}}$ to get the target action for the next state: $a_{\mu_{targ}}=\mu_{\theta_{targ}}(s_{t+1})$.

        - Feed the next state and this target action to the `target Critic network` $Q_{\phi_{targ}}$ to get the target value: $$y_t (r, s_{t+1}, d) = r_t + \gamma(1-d) Q_{\phi_{targ}}(s_{t+1}, \mu_{\theta_{targ}}(s_{t+1})).$$

        - Feed the state and action sampled from the replay buffer to the `Critic network` $Q_{\phi}$ to get the predicted value $Q_{\phi}(s_t, a_t)$. 

        - With the transitions in the sampled mini-batch ($\tau \sim B$), update the Critic network's parameters $\phi^{Q}$ with one step of `gradient descent` on the `mean-squared Bellman error (MSBE) loss function`, using the gradient:

        $$\nabla_{\phi} \frac{1}{|B|} \sum_{t, (\tau \sim B)} \left(y_t (r, s_{t+1}, d) - Q_{\phi}(s_t, a_t) \right)^2.$$

    - **Update Actor network (Policy)**:

        - Take the states $s_t$ from the sampled mini-batch and feed them to the `Actor network` $\mu_{\theta}$ to get the corresponding actions $a_t^{\mu}=\mu_{\theta}(s_t)$. These actions may differ from the ones stored in the replay buffer.
    
        - Feed each state and action pair to the `Critic network` to get the critic value $Q_{\phi}(s_t, \mu_{\theta}(s_t))$.
    
        - Update the Actor network's parameters $\theta^{\mu}$ with one step of `gradient ascent` on the $L^{DPG}$ objective (performance objective) w.r.t. the Actor's parameters $\theta^{\mu}$:

        \begin{eqnarray}
        \theta^{\mu}_{t+1} &=&  \theta^{\mu}_t + \alpha \nabla_{\theta^{\mu}} J(\mu_{\theta}).\\
        \nabla_{\theta^{\mu}} J(\mu_{\theta})
        &=& \nabla_{\theta^{\mu}} \Bigg( \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))] \Bigg) \\
        &=& \nabla_{\theta^{\mu}} \frac{1}{|B|} \sum_{t, s\in B} Q_{\phi}(s_t, \mu_{\theta}(s_t))\\
        &=& \frac{1}{|B|} \sum_{t, s \in B} \Bigg[ \nabla_a Q_{\phi} (s_t, a_t)|_{a_t=\mu_{\theta}(s_t)} \nabla_{\theta^{\mu}} \mu_{\theta}(s_t) \Bigg].
        \end{eqnarray}
   
    - **Update the parameters of the target Actor and target Critic networks via polyak averaging**:

        \begin{align}
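        \theta^{\mu_{targ}} &\leftarrow \rho \theta^{\mu} + (1-\rho) \theta^{\mu_{targ}} . \\
        \phi^{Q_{targ}} &\leftarrow \rho \phi^{Q} + (1-\rho) \phi^{Q_{targ}}.
        \end{align}

        Where $\rho \in (0, 1)$ is a hyperparameter. With this convention, a small $\rho$ makes the target networks track the main networks slowly.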

**until convergence or until a maximum number of episodes.**

---
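
The Critic update in the pseudocode above can be sketched in plain Python. Everything here is an illustrative assumption: `target_actor`, `target_critic`, and `critic` are hypothetical closed-form stand-ins for the neural networks, states and actions are scalars, and `GAMMA = 0.99` is an arbitrary discount factor. The point is to show how the done flag $d$ masks the bootstrapped term and how the MSBE is averaged over a mini-batch.

```python
GAMMA = 0.99  # discount factor (assumed value)

def target_actor(s):
    """Hypothetical stand-in for the target Actor mu_targ(s)."""
    return 0.5 * s

def target_critic(s, a):
    """Hypothetical stand-in for the target Critic Q_targ(s, a)."""
    return s + a

def critic(s, a):
    """Hypothetical stand-in for the main Critic Q_phi(s, a)."""
    return 2.0 * s * a

def td_target(r, s_next, done):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))."""
    a_next = target_actor(s_next)
    return r + GAMMA * (1.0 - done) * target_critic(s_next, a_next)

def msbe_loss(batch):
    """Mean-squared Bellman error over a mini-batch of (s, a, r, s', d) tuples."""
    errors = [(td_target(r, s_next, d) - critic(s, a)) ** 2
              for (s, a, r, s_next, d) in batch]
    return sum(errors) / len(errors)

batch = [
    (1.0, 0.5, 1.0, 2.0, 0.0),    # non-terminal: bootstraps from the target networks
    (2.0, -1.0, -1.0, 5.0, 1.0),  # terminal: the target reduces to the reward alone
]
loss = msbe_loss(batch)
print(td_target(1.0, 2.0, 0.0), td_target(-1.0, 5.0, 1.0), loss)
```

In a real implementation, the gradient of this loss with respect to the Critic's parameters would be taken while treating the target $y$ as a constant.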

# Implementation
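
As a starting point, here is a minimal sketch of three building blocks from the algorithm above: the replay buffer, noisy action selection, and the polyak update. It is not a full DDPG implementation. The names `ReplayBuffer`, `noisy_action`, and `polyak_update` are hypothetical, actions are assumed to be scalars bounded by `low` and `high` (clipping is an added assumption), and network parameters are represented as plain lists of floats.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are discarded first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

def noisy_action(mu, sigma, low, high):
    """Add Gaussian noise to the actor's output and clip to the action bounds."""
    a = mu + random.gauss(0.0, sigma)
    return max(low, min(high, a))

def polyak_update(target_params, main_params, rho):
    """theta_targ <- rho * theta + (1 - rho) * theta_targ, element-wise."""
    return [rho * p + (1.0 - rho) * p_targ
            for p, p_targ in zip(main_params, target_params)]

random.seed(0)
buffer = ReplayBuffer(capacity=3)
for i in range(5):
    # Dummy transitions (s, a, r, s', d); only the 3 most recent are kept.
    buffer.store(float(i), 0.0, 1.0, float(i + 1), 0.0)
batch = buffer.sample(2)

action = noisy_action(mu=0.9, sigma=0.5, low=-1.0, high=1.0)
new_target = polyak_update(target_params=[0.0, 0.0], main_params=[1.0, 2.0], rho=0.1)
print(len(buffer), new_target)
```

The `polyak_update` helper follows the convention used in the pseudocode, where $\rho$ weights the main parameters, so a small `rho` moves the target only slightly toward the main network at each step.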

# References

[1] [Continuous Control With Deep Reinforcement Learning, Lillicrap et al. 2016.](https://arxiv.org/pdf/1509.02971)

[2] [Deterministic Policy Gradient Algorithms, Silver et al. 2014.](http://proceedings.mlr.press/v32/silver14.pdf)

[3] [Deep Deterministic Policy Gradient, OpenAI Spinning Up.](https://spinningup.openai.com/en/latest/algorithms/ddpg.html)