# DDPG (Deep Deterministic Policy Gradient) — Continuous Control

DDPG is an **off-policy actor–critic** algorithm for **continuous action spaces**.

It learns:

- an **actor** $\mu_\theta(s)$ that outputs a *deterministic* action
- a **critic** $Q_\phi(s,a)$ that estimates the action-value

DDPG is a foundational algorithm: understanding it makes TD3/SAC much easier.

## Learning goals

By the end you should be able to:

- explain DDPG’s **actor–critic** structure and why it’s **off-policy**
- derive the **critic target** using **target networks**
- write the **deterministic policy gradient** update for the actor
- understand the role of **experience replay** and **exploration noise**

For a full low-level PyTorch implementation (with Plotly diagnostics), see `01_ddpg_from_scratch.ipynb`.

## The moving parts (precisely)

### 1) Actor (policy)

A deterministic policy network:

$$a = \mu_\theta(s)$$

In continuous-control Gym/Gymnasium environments, we typically use a `tanh` head and **scale** to the action bounds.
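A minimal NumPy sketch of the squash-and-scale step described above. The helper name `scale_action` and the bounds `low` and `high` are illustrative assumptions, not taken from the notebook; in Gymnasium they would typically come from `env.action_space.low` and `env.action_space.high`.

```python
import numpy as np

def scale_action(raw, low, high):
    """Squash unbounded actor outputs with tanh, then map (-1, 1) onto (low, high)."""
    squashed = np.tanh(raw)                       # each component lies in (-1, 1)
    return low + 0.5 * (squashed + 1.0) * (high - low)

# Hypothetical bounds for a 2-D action space (illustrative values only).
low = np.array([-2.0, 0.0])
high = np.array([2.0, 1.0])

raw_output = np.array([0.0, 10.0])                # stand-in for the actor's pre-activation output
action = scale_action(raw_output, low, high)
print(action)                                     # first component is the midpoint, second is near the upper bound
```

The affine map is used instead of a plain multiplication so that asymmetric bounds (here `[0, 1]` for the second component) are also handled.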

### 2) Critic (Q-function)

A state–action value function approximator:

$$Q_\phi(s,a) \approx Q^{\mu}(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t} \mid s_0=s, a_0=a, a_{t>0}=\mu(s_t)\right]$$
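To make the quantity inside the expectation concrete, here is a small sketch that computes the discounted sum for one finite reward sequence. A single rollout like this is only one Monte Carlo sample of $Q^{\mu}(s,a)$, and truncating at a finite horizon is an assumption of the example; the function name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```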

### 3) Target networks (stabilization)

Maintain slowly-updated copies:

- target actor $\mu_{\theta'}$
- target critic $Q_{\phi'}$

**Soft update** after each gradient step:

$$\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'$$
$$\phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'$$
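The soft update can be sketched on plain NumPy arrays standing in for network weights. The value `tau=0.005` is a commonly used choice and an assumption here, as is the helper name `soft_update`; the same pattern would apply to the tensors of a PyTorch module.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """In-place Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for target, online in zip(target_params, online_params):
        target *= (1.0 - tau)
        target += tau * online

# Stand-ins for the weight arrays of an online network and its target copy.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

soft_update(target, online, tau=0.005)
print(target[1])  # each entry has moved 0.5% of the way from 0 toward 1
```

With a small `tau`, the target parameters trail the online parameters slowly, which keeps the TD target from shifting abruptly between gradient steps.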

### 4) Replay buffer (off-policy learning)

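Store transitions $(s,a,r,s',d)$, where $d$ flags a terminal next state, and sample minibatches uniformly at random. This breaks the temporal correlation between consecutive transitions and lets each transition be reused across many updates.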
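A minimal replay buffer sketch using only the standard library. The class name `ReplayBuffer`, its methods, and the toy scalar transitions are illustrative assumptions; this version samples distinct transitions uniformly, which is one common way to draw a minibatch.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample distinct transitions and regroup them field by field."""
        batch = random.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy scalar transitions; real states and actions would be arrays.
    buffer.add(s=t, a=0.1 * t, r=1.0, s_next=t + 1, done=(t == 4))

states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(len(buffer), states)  # only the 3 most recent transitions remain
```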

### 5) Exploration noise

Because $\mu_\theta$ is deterministic, exploration usually uses additive noise:

$$a_t = \mu_\theta(s_t) + \epsilon_t,\quad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
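A short sketch of Gaussian exploration noise. Clipping the noisy action back into the valid range is an assumption added here (the formula above does not include it), and the bounds, `sigma`, and the helper name `noisy_action` are illustrative.

```python
import numpy as np

def noisy_action(mu_s, sigma, low, high, rng):
    """Add zero-mean Gaussian noise to the deterministic action, then clip to bounds."""
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(mu_s))
    return np.clip(mu_s + noise, low, high)

rng = np.random.default_rng(0)
mu_s = np.array([0.9, -0.2])   # stand-in for the actor output mu_theta(s)
a = noisy_action(mu_s, sigma=0.1, low=-1.0, high=1.0, rng=rng)
print(a)
```

The noise is only used while collecting data; at evaluation time the actor output would normally be used directly.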

## Core updates (equations)

### Critic update

Given a replay minibatch, define the TD target (using target networks):

$$y = r + \gamma(1-d)\,Q_{\phi'}(s', \mu_{\theta'}(s'))$$

Minimize MSE:

$$\mathcal{L}(\phi) = \mathbb{E}\big[(Q_{\phi}(s,a) - y)^2\big]$$
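The target and loss above can be checked on a tiny hand-made minibatch. The arrays `q_next` and `q_pred` are stand-ins for the outputs of the target critic and the online critic; all numbers and helper names are illustrative assumptions.

```python
import numpy as np

def td_targets(r, d, q_next, gamma):
    """y = r + gamma * (1 - d) * Q_target(s', mu_target(s'))."""
    return r + gamma * (1.0 - d) * q_next

def critic_loss(q_pred, y):
    """Mean squared TD error over the minibatch."""
    return np.mean((q_pred - y) ** 2)

r = np.array([1.0, 0.0, 2.0])
d = np.array([0.0, 0.0, 1.0])         # the last transition is terminal
q_next = np.array([10.0, 5.0, 7.0])   # stand-in for target critic outputs
q_pred = np.array([10.0, 5.0, 2.0])   # stand-in for online critic outputs

y = td_targets(r, d, q_next, gamma=0.9)
print(y)                       # targets 10.0, 4.5 and 2.0: the terminal one ignores q_next
print(critic_loss(q_pred, y))  # only the middle sample has a nonzero error of 0.5
```

In an autograd framework, `y` would be treated as a constant so that gradients flow only through $Q_\phi(s,a)$.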

### Actor update (deterministic policy gradient)

Maximize the critic’s value under the actor:

$$J(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\big[Q_\phi(s, \mu_\theta(s))\big]$$

Gradient (applied by autograd in PyTorch):

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_a Q_\phi(s,a)\vert_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\right]$$

In code we typically minimize the **actor loss** $\mathcal{L}_{\text{actor}} = -\mathbb{E}[Q_\phi(s, \mu_\theta(s))]$.
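The chain rule in the gradient above can be verified numerically on a toy problem. This sketch assumes a linear actor $\mu_\theta(s) = \theta^\top s$ with a scalar action and a hand-written quadratic critic; both are illustrative stand-ins for neural networks, and the gradient is computed by hand rather than by autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(64, 3))      # minibatch of states
theta = np.array([0.5, -0.3, 0.1])     # linear actor: mu_theta(s) = theta . s
c = np.array([1.0, 0.0, -1.0])         # toy critic peaks at a = c . s

def actor(theta, s):
    return s @ theta

def critic(s, a):
    return -(a - s @ c) ** 2

def objective(theta):
    """J(theta): mean critic value of the actor's own actions."""
    return np.mean(critic(states, actor(theta, states)))

# Chain rule: dQ/da evaluated at a = mu_theta(s), times d mu / d theta = s.
a = actor(theta, states)
dq_da = -2.0 * (a - states @ c)                       # shape (64,)
dpg_grad = np.mean(dq_da[:, None] * states, axis=0)   # shape (3,)

# Central finite differences on J(theta) as an independent check.
eps = 1e-6
fd_grad = np.array([
    (objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(dpg_grad, fd_grad, atol=1e-6))

# One gradient-ascent step on J (equivalently, a descent step on the actor loss -J).
theta_new = theta + 0.1 * dpg_grad
print(objective(theta_new) > objective(theta))
```

In PyTorch the same gradient would come from backpropagating the actor loss through the critic into the actor, with only the actor's parameters being stepped.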

## Next notebook

- `01_ddpg_from_scratch.ipynb`: full DDPG in low-level PyTorch (replay buffer, target networks, training loop) + Plotly visuals for:
  - score per episode (learning curve)
  - Q-values / targets over training
  - policy evolution on fixed probe states


In [None]:
import os
import platform

import numpy as np
import plotly
import plotly.graph_objects as go
import plotly.io as pio

try:
    import torch
    TORCH_AVAILABLE = True
except Exception as e:
    TORCH_AVAILABLE = False
    _TORCH_IMPORT_ERROR = e

try:
    import gymnasium as gym
    GYMNASIUM_AVAILABLE = True
except Exception:
    GYMNASIUM_AVAILABLE = False

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

print('Python', platform.python_version())
print('NumPy', np.__version__)
print('Plotly', plotly.__version__)
print('Torch', torch.__version__ if TORCH_AVAILABLE else _TORCH_IMPORT_ERROR)
print('Gymnasium', gym.__version__ if GYMNASIUM_AVAILABLE else 'not installed')


In [None]:
# A tiny Plotly sketch: actor–critic block diagram (conceptual)
fig = go.Figure()

# Boxes with labels. Annotation text is HTML-like: use <br> for line breaks and
# <sub> for subscripts ('\n' and LaTeX mixed with plain text do not render).
fig.add_shape(type='rect', x0=0.05, x1=0.35, y0=0.55, y1=0.85, line=dict(width=2))
fig.add_annotation(x=0.20, y=0.70, text='Actor<br>μ<sub>θ</sub>(s)', showarrow=False, font=dict(size=14))

fig.add_shape(type='rect', x0=0.55, x1=0.95, y0=0.55, y1=0.85, line=dict(width=2))
fig.add_annotation(x=0.75, y=0.70, text='Critic<br>Q<sub>φ</sub>(s, a)', showarrow=False, font=dict(size=14))

# Arrows: tail at (ax, ay), head at (x, y), all in data coordinates.
# axref/ayref accept 'pixel' or an axis id such as 'x'/'y', not 'paper'.
arrow = dict(xref='x', yref='y', axref='x', ayref='y', text='',
             showarrow=True, arrowhead=3, arrowsize=1.2)
fig.add_annotation(x=0.45, y=0.70, ax=0.35, ay=0.70, **arrow)
fig.add_annotation(x=0.55, y=0.65, ax=0.35, ay=0.65, **arrow)

# Labels, placed clear of the arrow lines
fig.add_annotation(x=0.45, y=0.80, text='s', showarrow=False)
fig.add_annotation(x=0.45, y=0.60, text='a', showarrow=False)

fig.update_xaxes(visible=False, range=[0, 1])
fig.update_yaxes(visible=False, range=[0, 1])
fig.update_layout(title='DDPG: actor produces action; critic evaluates (s,a)', height=280)
fig