<div align='center'>
    <h1> 
        <a href='#'> Deep Deterministic Policy Gradient (DDPG) </a> 
    </h1>
</div>

# Intro

Deep Deterministic Policy Gradient (DDPG) is a model-free (it does not need the environment's transition probabilities), `off-policy`, `actor-critic` algorithm that combines elements of policy gradient methods with deep Q-learning. DDPG extends DQN to continuous action spaces. It uses `temporal difference learning` (bootstrapping) and an `experience replay buffer` (off-policy) to learn the Q-value function (represented by the Critic network) via the Bellman optimality equation.

The goal of reinforcement learning is to maximize the expected cumulative reward (a.k.a. the expected return) under the policy $\pi_{\theta}$:

$$\text{max }_{\theta} J(\pi_{\theta}) = \text{max }_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[\mathcal{R}(\tau)].$$

In DDPG, the RL goal is translated to learning a deterministic policy $\mu_{\theta}(s)$ whose actions maximize the expected Q-value estimated by the `Critic network`:

$$\text{max }_{\theta} \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))].$$

Here, the states $s$ are sampled from a replay buffer: $s \sim \mathcal{D}$.
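
As a rough illustration of the replay buffer $\mathcal{D}$, here is a minimal sketch using only the Python standard library. The class name `ReplayBuffer`, its method names, and the `capacity` and `seed` parameters are assumptions made for this example; they do not come from the reference implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one mini-batch.
        batch = self.rng.sample(list(self.storage), batch_size)
        # Transpose the list of transitions into one tuple per field.
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)
```

Calling `sample` with a `batch_size` larger than the number of stored transitions raises a `ValueError`, so a training loop would typically wait until the buffer holds at least one mini-batch.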

To achieve the RL goal, the parameters of the `Actor Network` are updated by `gradient ascent` on the Q-value function with respect to the Actor's parameters. The `Critic Network`'s parameters are updated by `gradient descent` on its loss function (e.g., Mean Squared Error) to improve its Q-value predictions.
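
The Actor's gradient flows through the Critic via the chain rule. As a short worked equation (the standard deterministic policy gradient form, assuming the Critic's parameters $\phi$ are held fixed during the Actor update):

$$\nabla_{\theta} Q_{\phi}(s, \mu_{\theta}(s)) = \nabla_{\theta} \mu_{\theta}(s) \, \nabla_{a} Q_{\phi}(s, a) \big|_{a = \mu_{\theta}(s)}.$$

In words, the Critic indicates in which direction the action should move to increase the Q-value, and the Actor's Jacobian turns that direction into a change of the Actor's parameters.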

Unlike DQN, DDPG does not use an $\epsilon$-greedy policy for exploration. Instead, the behavior policy takes the action produced by the Actor network (the deterministic target policy) and adds `noise` to it to encourage `exploration` of the environment.
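
A minimal sketch of this behavior policy, assuming zero-mean Gaussian noise with standard deviation `sigma` and an action range `[low, high]`. The function name, the parameters, and the clipping step are assumptions for illustration and are not specified in the text above.

```python
import random


def noisy_action(action, sigma, low, high, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip (hypothetical helper)."""
    noisy = []
    for a in action:
        perturbed = a + rng.gauss(0.0, sigma)
        # Clipping keeps the action inside the range the environment accepts (assumed bounds).
        noisy.append(min(max(perturbed, low), high))
    return noisy
```

With `sigma=0.0` the function returns the Actor's action unchanged, which corresponds to acting with the deterministic policy at evaluation time.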

- DDPG optimizes a `deterministic policy` $\mu_{\theta}$ and it is suitable for `continuous action spaces`.

- DDPG uses four neural networks:
    - The Actor network.
    - The Critic network.
    - The target Actor network.
    - The target Critic network.

# Algorithm

---
**Algorithm (Pseudocode): Deep Deterministic Policy Gradient (adapted from OpenAI)**

---

- Initialize the environment to a random state $s_t$. Initialize an empty replay buffer $\mathcal{D}$. Initialize random parameters $\theta^{\mu}$ and $\phi^{Q}$ for Actor and Critic networks, respectively. Set the target Actor and target Critic parameters to the main parameters: $\theta^{\mu'} \leftarrow \theta^{\mu}$, $\phi^{Q'} \leftarrow \phi^{Q}$.

- **Repeat:**

    - Feed the current state $s_t$ to the Actor neural network $\mu_{\theta}$, which returns an action $a_t = \mu_{\theta} (s_t)$ (a continuous value, not a probability, since the policy is deterministic).

    - Add noise, typically Gaussian $\epsilon \sim \mathcal{N}$, to the action $a_t$ and execute the noisy action in the environment, which returns a reward $r_t$, the next state $s_{t+1}$, and a done flag $d$ (boolean) that tells whether the episode has ended.

    - At each time step, store the experience/transition as a tuple $(s_t, a_t, r_{t}, s_{t+1}, d)$ in the replay buffer $\mathcal{D}$. Sampling from the replay buffer decorrelates consecutive transitions, which stabilizes training.

    - **Update Critic network (Q function)**:

        - Sample a random mini-batch of transitions $B=\{(s_t, a_t, r_t, s_{t+1}, d)\}$ from the replay buffer $\mathcal{D}$.

        - Use the `target Actor network` $\mu_{\theta}'$ to get the target action for the next state: $a^{\mu'}=\mu_{\theta}'(s_{t+1})$.

        - Feed the next state and the target Actor's action to the `target Critic network` $Q_{\phi}'$ to get the target value: $$y_t (r, s_{t+1}, d) = r_t + \gamma(1-d) Q_{\phi}'(s_{t+1}, \mu_{\theta}'(s_{t+1})).$$

        - Feed the state and action sampled from the replay buffer to the `Critic network` $Q_{\phi}$ to get the predicted value $Q_{\phi}(s_t, a_t)$.

        - With the transitions sampled from the replay buffer ($\tau \sim B$), update the Critic network's parameters $\phi^{Q}$ by taking a `gradient descent` step using the gradient of the mean squared error loss:

        $$\nabla_{\phi} \frac{1}{|B|} \sum_{t, (\tau \sim B)} \left(y_t (r, s_{t+1}, d) - Q_{\phi}(s_t, a_t)\right)^2.$$

    - **Update Actor network (Policy)**:

        - Feed each state $s_t$ of the mini-batch sampled from the replay buffer to the `Actor network` $\mu_{\theta}$ to get the corresponding action $a_t^{\mu}=\mu_{\theta}(s_t)$. This action may differ from the one stored in the replay buffer.

        - Feed this state-action pair to the `Critic network` to get the critic value $Q_{\phi}(s_t, \mu_{\theta}(s_t))$.

        - Update the Actor network's parameters $\theta^{\mu}$ by `gradient ascent` on the Q-value (with actions from the parameterized policy) w.r.t. the Actor's parameters $\theta^{\mu}$:

        \begin{eqnarray}
        \theta^{\mu}_{t+1} &=&  \theta^{\mu}_t + \alpha \nabla_{\theta^{\mu}} \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))].\\
        \nabla_{\theta^{\mu}} \mathbb{E}_{s \sim \mathcal{D}}[Q_{\phi} (s, \mu_{\theta}(s))]
        &\approx& \nabla_{\theta^{\mu}} \frac{1}{|B|} \sum_{t, s\in B} Q_{\phi}(s_t, \mu_{\theta}(s_t)).
        \end{eqnarray}

        Where $\alpha$ is the Actor's learning rate.
   
    - **Update the parameters of the target Actor and target Critic networks using the soft update rule**:

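        \begin{align}
        \theta^{\mu'} &\leftarrow \rho \theta^{\mu} + (1-\rho) \theta^{\mu'} . \\
        \phi^{Q'} &\leftarrow \rho \phi^{Q} + (1-\rho) \phi^{Q'}.
        \end{align}

        Where $\rho \in (0, 1)$ is a hyperparameter, typically small, that controls how quickly the target networks track the main networks.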

**Repeat until convergence.**

---
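
To make the numeric parts of the pseudocode concrete, here is a small sketch of the target value, the Critic's mean squared error loss, and the soft update, written with plain Python lists instead of neural networks. The function names are hypothetical, the Q-values are passed in as numbers rather than computed by a Critic, and the soft update is read as a convex combination in which $\rho$ weights the main parameters.

```python
def td_targets(rewards, dones, next_q_values, gamma):
    """y = r + gamma * (1 - d) * Q'(s', mu'(s')) for each transition in the mini-batch."""
    return [
        r + gamma * (1.0 - float(d)) * q_next
        for r, d, q_next in zip(rewards, dones, next_q_values)
    ]


def critic_mse(targets, predictions):
    """Mean squared error between the targets y and the predictions Q(s, a)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def soft_update(target_params, main_params, rho):
    """target <- rho * main + (1 - rho) * target, applied element by element."""
    return [rho * m + (1.0 - rho) * t for t, m in zip(target_params, main_params)]


# Toy mini-batch of two transitions; the second one ends its episode.
rewards = [1.0, 0.5]
dones = [False, True]
next_q = [2.0, 3.0]          # stand-ins for Q'(s', mu'(s'))
predicted_q = [2.0, 1.0]     # stand-ins for Q(s, a)

y = td_targets(rewards, dones, next_q, gamma=0.9)
loss = critic_mse(y, predicted_q)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], rho=0.1)
```

For the terminal transition the factor $(1-d)$ removes the bootstrapped term, so its target is just the reward. In a real implementation the loss would be differentiated with an automatic differentiation library rather than evaluated on fixed numbers.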

# Implementation

# References

[1] https://spinningup.openai.com/en/latest/algorithms/ddpg.html