# DDPG (Deep Deterministic Policy Gradient) — from scratch in PyTorch

DDPG is an **off-policy actor–critic** algorithm for **continuous control**.

This notebook implements DDPG at a **low level** in PyTorch:

- replay buffer
- actor + critic networks
- target networks with **soft updates**
- exploration noise for deterministic policies
- a clean training loop + Plotly diagnostics

We’ll train on a simple continuous environment (default: `Pendulum-v1`) and plot:

- **score per episode** (learning curve)
- **Q-values / TD targets** during learning
- **policy evolution** on fixed probe states


## Learning goals

By the end you should be able to:

- explain the **actor–critic** factorization in DDPG and what each network learns
- write the **critic target** with target networks *precisely*
- understand why DDPG needs (1) **experience replay** and (2) **target networks**
- implement DDPG updates in **low-level PyTorch** (no RL libraries)
- interpret common diagnostics: returns, losses, Q-values, and policy drift

## Prerequisites

- basic PyTorch (`nn.Module`, optimizers)
- Bellman equation / TD learning intuition
- continuous action spaces (Box)


## 1) DDPG structure (actor–critic) and target networks

### Actor (deterministic policy)

The actor is a deterministic policy network:

$$a = \mu_\theta(s)$$

In practice we output a `tanh`-bounded action and then **scale** to match the environment’s action bounds.
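
As a rough illustration of that scaling step, here is a minimal sketch in plain Python. The helper name `scale_action` is hypothetical (it is not part of the notebook), and the bounds used are the Pendulum-v1 torque limits.

```python
import math

def scale_action(raw, low, high):
    """Map an unbounded pre-activation `raw` into [low, high] via tanh."""
    scale = (high - low) / 2.0   # half-width of the action range
    bias = (high + low) / 2.0    # midpoint of the action range
    return scale * math.tanh(raw) + bias

# Pendulum-v1 torque bounds are [-2, 2]
for raw in (-10.0, 0.0, 0.5, 10.0):
    print(raw, round(scale_action(raw, -2.0, 2.0), 4))
```

Large positive or negative pre-activations saturate near the bounds, so the actor can never emit an out-of-range action on its own; only the added exploration noise can push it outside.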

### Critic (action-value function)

The critic estimates the Q-value for a state–action pair:

$$Q_\phi(s,a) \approx Q^{\mu}(s,a)$$

### Target networks (the stabilizer)

The TD target is computed by bootstrapping from the very networks being trained, so each gradient step also shifts the target.
To reduce this moving-target instability, DDPG maintains slowly updated copies:

- target actor: $\mu_{\theta'}$
- target critic: $Q_{\phi'}$

Soft-update them after each gradient step:

$$\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'$$
$$\phi' \leftarrow \tau\,\phi + (1-\tau)\,\phi'$$
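
To make the update rule concrete, the sketch below applies it to two scalar "parameters" in plain Python. The lists `online` and `target` are stand-ins for $\theta$ and $\theta'$, and `soft_update` is an illustrative helper, not the notebook's implementation.

```python
def soft_update(target, online, tau):
    """Return the Polyak-averaged target: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

online = [1.0, -2.0]   # stand-in for the online parameters (theta)
target = [0.0, 0.0]    # stand-in for the target parameters (theta')
tau = 0.005

one_step = soft_update(target, online, tau)
print(one_step)        # one update moves each target weight 0.5% of the way

for _ in range(1000):  # with frozen online weights the gap shrinks geometrically
    target = soft_update(target, online, tau)
print(target)
```

With the online weights held fixed, the remaining gap after $n$ updates is $(1-\tau)^n$ of the initial gap, which is why a small $\tau$ makes the targets lag the online networks by hundreds of updates.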

### Critic target (precise)

For a transition $(s,a,r,s',d)$ sampled from replay (where $d\in\{0,1\}$ indicates terminal), the TD target is

$$y = r + \gamma(1-d)\,Q_{\phi'}\big(s',\mu_{\theta'}(s')\big)$$

and we fit the critic via

$$\mathcal{L}(\phi) = \mathbb{E}\big[(Q_\phi(s,a)-y)^2\big].$$
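
A small worked example may help check the target by hand. The three transitions and the critic outputs below are made-up numbers chosen only for illustration, and `td_target` is a hypothetical helper.

```python
def td_target(r, d, q_next, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_target(s', mu_target(s'))."""
    return r + gamma * (1.0 - d) * q_next

# Hypothetical minibatch: (reward, done flag, target-critic value at s')
batch = [(-1.0, 0.0, -10.0), (-0.5, 0.0, -4.0), (-2.0, 1.0, -7.0)]
q_pred = [-10.5, -4.0, -2.5]  # hypothetical critic outputs Q_phi(s, a)

targets = [td_target(r, d, q_next) for r, d, q_next in batch]
critic_loss = sum((q - y) ** 2 for q, y in zip(q_pred, targets)) / len(targets)

print(targets)      # the terminal transition (d = 1) keeps only its reward
print(critic_loss)  # mean squared TD error over the minibatch
```

The third transition shows the role of $(1-d)$: once the episode has truly ended there is no future value to bootstrap from, so the target collapses to the reward.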

### Actor objective (deterministic policy gradient)

The actor is trained to maximize the critic’s value under its actions:

$$J(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\big[Q_\phi(s,\mu_\theta(s))\big].$$

In code we minimize the actor loss

$$\mathcal{L}_{actor}(\theta) = -\mathbb{E}\big[Q_\phi(s,\mu_\theta(s))\big].$$

The gradient is the deterministic policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_a Q_\phi(s,a)\rvert_{a=\mu_\theta(s)}\,\nabla_\theta \mu_\theta(s)\right].$$

PyTorch computes this automatically when we backprop through `Q(s, actor(s))`.
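
The chain rule above can be sanity-checked without autograd. The sketch below uses a toy setup that is entirely an assumption for illustration: a scalar linear actor $\mu_\theta(s)=\theta s$ and a fixed quadratic critic whose preferred action is $a=2s$. It compares the analytic deterministic policy gradient with a finite-difference estimate and then runs gradient ascent.

```python
def critic(s, a):
    """Toy fixed critic: the best action in state s is a = 2 * s."""
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)

def actor(theta, s):
    """Toy linear deterministic policy mu_theta(s) = theta * s."""
    return theta * s

def objective(theta, states):
    """J(theta): mean critic value under the actor's own actions."""
    return sum(critic(s, actor(theta, s)) for s in states) / len(states)

def dpg(theta, states):
    """Chain rule: mean of dQ/da at a = mu_theta(s) times d mu / d theta (= s here)."""
    return sum(dq_da(s, actor(theta, s)) * s for s in states) / len(states)

states = [-1.0, 0.5, 2.0]
theta, eps = 0.5, 1e-6
finite_diff = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
print(dpg(theta, states), finite_diff)  # the two estimates should agree closely

for _ in range(200):                     # gradient ascent on J(theta)
    theta += 0.05 * dpg(theta, states)
print(theta)                             # approaches 2.0, the critic's preferred gain
```

In the real algorithm the critic is itself learned and imperfect, so the actor climbs whatever surface the critic currently represents, including its errors.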


## 2) Algorithm sketch (pseudocode)

1. Initialize actor $\mu_\theta$, critic $Q_\phi$
2. Initialize target networks $\mu_{\theta'}\leftarrow\mu_\theta$, $Q_{\phi'}\leftarrow Q_\phi$
3. Initialize replay buffer $\mathcal{D}$
4. For each environment step:
   - act with exploration: $a=\mu_\theta(s)+\epsilon$
   - store $(s,a,r,s',d)$ in $\mathcal{D}$
   - sample minibatch from $\mathcal{D}$
   - critic: regress $Q_\phi(s,a)$ to $y=r+\gamma(1-d)Q_{\phi'}(s',\mu_{\theta'}(s'))$
   - actor: ascend $\nabla_\theta Q_\phi(s,\mu_\theta(s))$
   - soft update targets: $(\theta',\phi')\leftarrow \tau(\theta,\phi)+(1-\tau)(\theta',\phi')$


In [None]:
import math
import os
import platform
import time
from dataclasses import dataclass

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

try:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    TORCH_AVAILABLE = True
except Exception as e:
    TORCH_AVAILABLE = False
    _TORCH_IMPORT_ERROR = e

# Gymnasium first; fall back to gym
try:
    import gymnasium as gym
    GYM_BACKEND = 'gymnasium'
except Exception:
    import gym
    GYM_BACKEND = 'gym'

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

print('Python', platform.python_version())
print('NumPy', np.__version__)
print('Pandas', pd.__version__)
print('Plotly', plotly.__version__)
print('Gym backend', GYM_BACKEND, 'version', gym.__version__)
print('Torch', torch.__version__ if TORCH_AVAILABLE else _TORCH_IMPORT_ERROR)


In [None]:
# --- Run configuration ---
FAST_RUN = True  # set False for longer training

ENV_ID = 'Pendulum-v1'
SEED = 42

NUM_EPISODES = 40 if FAST_RUN else 250
MAX_STEPS_PER_EPISODE = None  # None means use env default

REPLAY_SIZE = 200_000
BATCH_SIZE = 128
GAMMA = 0.99
TAU = 0.005

ACTOR_LR = 1e-3
CRITIC_LR = 1e-3

START_STEPS = 2_000  # random actions before using the actor + noise
UPDATE_AFTER = 1_000  # start gradient updates after this many steps
UPDATES_PER_STEP = 1

NOISE_SIGMA = 0.1  # exploration noise std (in action units after scaling)

HIDDEN_SIZES = (256, 256)
GRAD_CLIP_NORM = 1.0

PROBE_N = 32
PROBE_EVERY_EPISODES = 5

DEVICE = 'cuda' if TORCH_AVAILABLE and torch.cuda.is_available() else 'cpu'
print('DEVICE:', DEVICE)


In [None]:
def set_global_seeds(seed: int):
    np.random.seed(seed)
    if TORCH_AVAILABLE:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


def env_reset(env, seed: int | None = None):
    out = env.reset(seed=seed) if seed is not None else env.reset()
    if isinstance(out, tuple):
        obs, info = out
    else:
        obs, info = out, {}
    return obs, info


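def env_step(env, action):
    out = env.step(action)
    if len(out) == 5:
        # Gymnasium / newer gym API: termination and time-limit truncation are separate flags
        next_obs, reward, terminated, truncated, info = out
        done = bool(terminated or truncated)
    else:
        # Older gym API: one done flag; the TimeLimit wrapper marks truncation in info
        next_obs, reward, done, info = out
        done = bool(done)
        terminated = done and not info.get('TimeLimit.truncated', False)
    # Expose the true terminal flag so callers can keep bootstrapping through truncation
    info = dict(info)
    info['terminated'] = bool(terminated)
    return next_obs, float(reward), done, info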


def make_env(env_id: str, seed: int):
    env = gym.make(env_id)
    _ = env_reset(env, seed=seed)
    try:
        env.action_space.seed(seed)
        env.observation_space.seed(seed)
    except Exception:
        pass
    return env


def action_scale_and_bias(action_space):
    # Works for gymnasium.spaces.Box and gym.spaces.Box
    high = np.asarray(action_space.high, dtype=np.float32)
    low = np.asarray(action_space.low, dtype=np.float32)
    scale = (high - low) / 2.0
    bias = (high + low) / 2.0
    return scale, bias


In [None]:
class ReplayBuffer:
    def __init__(self, obs_dim: int, act_dim: int, size: int, seed: int):
        self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros((size, act_dim), dtype=np.float32)
        self.rew_buf = np.zeros((size, 1), dtype=np.float32)
        self.done_buf = np.zeros((size, 1), dtype=np.float32)

        self.max_size = int(size)
        self.ptr = 0
        self.size = 0
        self.rng = np.random.default_rng(seed)

    def add(self, obs, act, rew: float, next_obs, done: bool):
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.next_obs_buf[self.ptr] = next_obs
        self.done_buf[self.ptr] = float(done)

        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size: int):
        idx = self.rng.integers(0, self.size, size=batch_size)
        batch = dict(
            obs=self.obs_buf[idx],
            act=self.act_buf[idx],
            rew=self.rew_buf[idx],
            next_obs=self.next_obs_buf[idx],
            done=self.done_buf[idx],
        )
        return batch


In [None]:
def mlp(sizes, activation=nn.ReLU, output_activation=nn.Identity):
    layers = []
    for i in range(len(sizes) - 1):
        act = activation if i < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)


class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_sizes, action_scale, action_bias):
        super().__init__()
        self.net = mlp([obs_dim, *hidden_sizes, act_dim], activation=nn.ReLU, output_activation=nn.Tanh)
        self.register_buffer('action_scale', torch.as_tensor(action_scale, dtype=torch.float32))
        self.register_buffer('action_bias', torch.as_tensor(action_bias, dtype=torch.float32))

    def forward(self, obs):
        a = self.net(obs)
        return self.action_scale * a + self.action_bias


class Critic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_sizes):
        super().__init__()
        self.net = mlp([obs_dim + act_dim, *hidden_sizes, 1], activation=nn.ReLU, output_activation=nn.Identity)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.net(x)


In [None]:
@dataclass
class DDPGConfig:
    gamma: float = GAMMA
    tau: float = TAU
    actor_lr: float = ACTOR_LR
    critic_lr: float = CRITIC_LR
    batch_size: int = BATCH_SIZE
    grad_clip_norm: float | None = GRAD_CLIP_NORM


class DDPGAgent:
    def __init__(self, obs_dim: int, act_dim: int, action_scale, action_bias, hidden_sizes, device: str, cfg: DDPGConfig):
        self.device = torch.device(device)
        self.cfg = cfg

        self.actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
        self.critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)

        # Target networks start as exact copies
        self.target_actor = Actor(obs_dim, act_dim, hidden_sizes, action_scale, action_bias).to(self.device)
        self.target_critic = Critic(obs_dim, act_dim, hidden_sizes).to(self.device)
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.target_critic.load_state_dict(self.critic.state_dict())

        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=cfg.actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=cfg.critic_lr)

    @torch.no_grad()
    def act(self, obs: np.ndarray, noise_sigma: float = 0.0):
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=self.device).unsqueeze(0)
        action = self.actor(obs_t).cpu().numpy().squeeze(0)
        if noise_sigma > 0:
            action = action + np.random.normal(0.0, noise_sigma, size=action.shape).astype(np.float32)
        return action

    def update(self, batch):
        obs = torch.as_tensor(batch['obs'], dtype=torch.float32, device=self.device)
        act = torch.as_tensor(batch['act'], dtype=torch.float32, device=self.device)
        rew = torch.as_tensor(batch['rew'], dtype=torch.float32, device=self.device)
        next_obs = torch.as_tensor(batch['next_obs'], dtype=torch.float32, device=self.device)
        done = torch.as_tensor(batch['done'], dtype=torch.float32, device=self.device)

        # --- Critic update ---
        with torch.no_grad():
            next_act = self.target_actor(next_obs)
            target_q_next = self.target_critic(next_obs, next_act)
            y = rew + self.cfg.gamma * (1.0 - done) * target_q_next

        q = self.critic(obs, act)
        critic_loss = F.mse_loss(q, y)

        self.critic_opt.zero_grad(set_to_none=True)
        critic_loss.backward()
        if self.cfg.grad_clip_norm is not None:
            torch.nn.utils.clip_grad_norm_(self.critic.parameters(), self.cfg.grad_clip_norm)
        self.critic_opt.step()

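        # --- Actor update ---
        # Maximize Q(s, actor(s)) by minimizing its negative. Backprop also writes gradients
        # into the critic's parameters, but only actor_opt steps here, and
        # critic_opt.zero_grad() clears them at the start of the next update.
        self.actor_opt.zero_grad(set_to_none=True)
        actor_actions = self.actor(obs)
        actor_loss = -self.critic(obs, actor_actions).mean()
        actor_loss.backward()
        if self.cfg.grad_clip_norm is not None:
            torch.nn.utils.clip_grad_norm_(self.actor.parameters(), self.cfg.grad_clip_norm)
        self.actor_opt.step()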

        # --- Soft update target networks: p_targ <- (1 - tau) * p_targ + tau * p ---
        with torch.no_grad():
            for p, p_targ in zip(self.actor.parameters(), self.target_actor.parameters()):
                p_targ.mul_(1.0 - self.cfg.tau)
                p_targ.add_(p, alpha=self.cfg.tau)

            for p, p_targ in zip(self.critic.parameters(), self.target_critic.parameters()):
                p_targ.mul_(1.0 - self.cfg.tau)
                p_targ.add_(p, alpha=self.cfg.tau)

        metrics = {
            'critic_loss': float(critic_loss.item()),
            'actor_loss': float(actor_loss.item()),
            'q_mean': float(q.detach().mean().item()),
            'y_mean': float(y.detach().mean().item()),
        }
        return metrics


In [None]:
def moving_average(x, window: int):
    x = np.asarray(x, dtype=np.float64)
    if len(x) < window:
        return x
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode='valid')


def train_ddpg(env_id: str, seed: int):
    set_global_seeds(seed)
    env = make_env(env_id, seed=seed)

    obs_dim = int(np.prod(env.observation_space.shape))
    act_dim = int(np.prod(env.action_space.shape))

    act_scale, act_bias = action_scale_and_bias(env.action_space)

    buf = ReplayBuffer(obs_dim, act_dim, size=REPLAY_SIZE, seed=seed)
    agent = DDPGAgent(
        obs_dim=obs_dim,
        act_dim=act_dim,
        action_scale=act_scale,
        action_bias=act_bias,
        hidden_sizes=HIDDEN_SIZES,
        device=DEVICE,
        cfg=DDPGConfig(),
    )

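    # Prefer the registered time limit from the env spec; fall back to 200 (Pendulum-v1's limit)
    spec_max_steps = getattr(getattr(env, 'spec', None), 'max_episode_steps', None)
    max_steps = MAX_STEPS_PER_EPISODE or spec_max_steps or 200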

    logs = {
        'episode': [],
        'episode_return': [],
        'episode_length': [],
        'global_step_end': [],
        # per-update metrics
        'update_step': [],
        'actor_loss': [],
        'critic_loss': [],
        'q_mean': [],
        'y_mean': [],
        # probe snapshots
        'probe_episode': [],
        'probe_action_stat': [],
        'probe_q': [],
    }

    probe_states = None

    global_step = 0
    update_step = 0

    t0 = time.time()
    for ep in range(1, NUM_EPISODES + 1):
        obs, _ = env_reset(env, seed=seed + ep)
        obs = np.asarray(obs, dtype=np.float32).reshape(-1)

        ep_return = 0.0
        ep_len = 0

        for _ in range(max_steps):
            if global_step < START_STEPS:
                action = env.action_space.sample()
            else:
                action = agent.act(obs, noise_sigma=NOISE_SIGMA)

            # clip to action bounds
            action = np.clip(action, env.action_space.low, env.action_space.high).astype(np.float32)

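            next_obs, reward, done, info = env_step(env, action)
            next_obs = np.asarray(next_obs, dtype=np.float32).reshape(-1)

            # Store the true terminal flag if env_step reports one: a time-limit cutoff is not
            # a terminal state, so the TD target should still bootstrap from next_obs.
            terminal = bool(info.get('terminated', done))
            buf.add(obs, action, reward, next_obs, terminal)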

            obs = next_obs
            ep_return += reward
            ep_len += 1
            global_step += 1

            # gradient updates
            if global_step >= UPDATE_AFTER and buf.size >= BATCH_SIZE:
                for _u in range(UPDATES_PER_STEP):
                    batch = buf.sample(BATCH_SIZE)
                    metrics = agent.update(batch)

                    logs['update_step'].append(update_step)
                    logs['actor_loss'].append(metrics['actor_loss'])
                    logs['critic_loss'].append(metrics['critic_loss'])
                    logs['q_mean'].append(metrics['q_mean'])
                    logs['y_mean'].append(metrics['y_mean'])
                    update_step += 1

            if done:
                break

        logs['episode'].append(ep)
        logs['episode_return'].append(ep_return)
        logs['episode_length'].append(ep_len)
        logs['global_step_end'].append(global_step)

        # Fix a set of probe states once replay has enough data
        if probe_states is None and buf.size >= max(PROBE_N, BATCH_SIZE):
            probe_states = buf.sample(PROBE_N)['obs']

        # Snapshot policy + Q on probe states to visualize policy evolution
        if probe_states is not None and (ep % PROBE_EVERY_EPISODES == 0 or ep == NUM_EPISODES):
            with torch.no_grad():
                ps = torch.as_tensor(probe_states, dtype=torch.float32, device=agent.device)
                pa = agent.actor(ps).cpu().numpy()
                pq = agent.critic(ps, agent.actor(ps)).cpu().numpy().reshape(-1)

            if pa.shape[1] == 1:
                policy_stat = pa[:, 0]  # 1D actions
            else:
                policy_stat = np.linalg.norm(pa, axis=1)  # multi-dim summary

            logs['probe_episode'].append(ep)
            logs['probe_action_stat'].append(policy_stat)
            logs['probe_q'].append(pq)

        if ep % 10 == 0 or ep == 1 or ep == NUM_EPISODES:
            elapsed = time.time() - t0
            print(f'Episode {ep:4d} | return {ep_return:8.1f} | len {ep_len:3d} | steps {global_step:6d} | elapsed {elapsed:6.1f}s')

    env.close()
    return logs


logs = train_ddpg(ENV_ID, seed=SEED)
print('Done. Episodes:', len(logs['episode']), 'Updates:', len(logs['update_step']))


## 3) Plotly diagnostics

DDPG can *look like it’s learning* while the critic is quietly diverging, so we’ll monitor:

- **episode return** (score)
- **critic loss** and **actor loss**
- **Q-values vs TD targets** (sanity check)
- **policy evolution** on fixed probe states (is the policy drifting smoothly?)
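
One cheap way to turn the Q-value plot into a check is to compare the logged values against the largest return that is possible at all. The sketch below is a hypothetical helper, not part of the notebook. It assumes rewards are bounded in magnitude, and for Pendulum-v1 it assumes the documented range of roughly $[-16.27, 0]$. The traces are synthetic.

```python
def q_bound(max_abs_reward, gamma):
    """Upper bound on |Q| when every reward satisfies |r| <= max_abs_reward."""
    return max_abs_reward / (1.0 - gamma)

def flag_divergence(q_means, y_means, max_abs_reward, gamma, slack=1.5):
    """Return update indices where the mean Q or mean target leaves the feasible range."""
    limit = slack * q_bound(max_abs_reward, gamma)
    return [
        i for i, (q, y) in enumerate(zip(q_means, y_means))
        if abs(q) > limit or abs(y) > limit
    ]

# Synthetic traces for illustration only (not results from this notebook)
q_trace = [-5.0, -40.0, -150.0, -900.0, -4000.0]
y_trace = [-6.0, -42.0, -155.0, -950.0, -3800.0]

print(q_bound(16.27, 0.99))  # about 1627 for the assumed Pendulum reward range
print(flag_divergence(q_trace, y_trace, max_abs_reward=16.27, gamma=0.99))
```

On this notebook's logs the same call would take `logs['q_mean']` and `logs['y_mean']`. A non-empty result means the critic is predicting returns that no policy could obtain.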


In [None]:
# --- Learning curve: score per episode ---
df_ep = pd.DataFrame({
    'episode': logs['episode'],
    'return': logs['episode_return'],
    'length': logs['episode_length'],
})

ma_window = 10
ma = moving_average(df_ep['return'].values, window=ma_window)
ma_x = df_ep['episode'].values[ma_window - 1:]

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_ep['episode'], y=df_ep['return'], mode='lines', name='Return'))
if len(ma) == len(ma_x):
    fig.add_trace(go.Scatter(x=ma_x, y=ma, mode='lines', name=f'Return (MA {ma_window})'))
fig.update_layout(title='DDPG learning curve (score per episode)', xaxis_title='Episode', yaxis_title='Return')
fig.show()


In [None]:
# --- Q-values, TD targets, and losses over update steps ---
df_up = pd.DataFrame({
    'update_step': logs['update_step'],
    'critic_loss': logs['critic_loss'],
    'actor_loss': logs['actor_loss'],
    'q_mean': logs['q_mean'],
    'y_mean': logs['y_mean'],
})

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['q_mean'], mode='lines', name='Q(s,a) mean'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['y_mean'], mode='lines', name='TD target y mean'))
fig.update_layout(title='Critic outputs vs TD targets (mean over minibatch)', xaxis_title='Update step', yaxis_title='Value')
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['critic_loss'], mode='lines', name='Critic loss (MSE)'))
fig.add_trace(go.Scatter(x=df_up['update_step'], y=df_up['actor_loss'], mode='lines', name='Actor loss (-Q)'))
fig.update_layout(title='Actor/Critic losses', xaxis_title='Update step', yaxis_title='Loss')
fig.show()


In [None]:
# --- Policy evolution on fixed probe states ---
if len(logs['probe_episode']) > 0:
    probe_eps = logs['probe_episode']
    z_action = np.stack(logs['probe_action_stat'], axis=1)  # (PROBE_N, T)
    z_q = np.stack(logs['probe_q'], axis=1)  # (PROBE_N, T)

    fig = go.Figure(data=go.Heatmap(
        z=z_action,
        x=probe_eps,
        y=list(range(z_action.shape[0])),
        colorscale='RdBu',
        zmid=0.0,
        colorbar=dict(title='action (1D) or ||a||'),
    ))
    fig.update_layout(title='Policy evolution on fixed probe states', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
    fig.show()

    fig = go.Figure(data=go.Heatmap(
        z=z_q,
        x=probe_eps,
        y=list(range(z_q.shape[0])),
        colorscale='Viridis',
        colorbar=dict(title='Q(s, mu(s))'),
    ))
    fig.update_layout(title='Q-values on probe states (critic under current actor)', xaxis_title='Episode snapshot', yaxis_title='Probe state index')
    fig.show()
else:
    print('No probe snapshots recorded (try increasing NUM_EPISODES or reducing PROBE_EVERY_EPISODES).')


## 4) Stable-Baselines implementation (if you want a reference)

If you want a battle-tested baseline, Stable-Baselines has DDPG implementations.

Notes:

- `stable-baselines3` (PyTorch) and `stable-baselines` (TensorFlow) are different packages.
- This repository’s environment may not have them installed; the code below is for reference.

### Stable-Baselines3 (PyTorch)

```python
# pip install stable-baselines3
import numpy as np
import gymnasium as gym

from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make('Pendulum-v1')

n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)
```

### Stable-Baselines (TensorFlow; older/archived)

```python
# pip install stable-baselines  (requires TensorFlow 1.x; does not work with TensorFlow 2)
import numpy as np
import gym

from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.noise import NormalActionNoise

env = gym.make('Pendulum-v1')

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)
```


## 5) Pitfalls + diagnostics

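- **Exploration**: a deterministic policy explores only through the noise you add; with too little noise the replay buffer lacks diverse actions and learning can stall.
- **Q-value blow-up**: if $Q$ grows without bound, reduce the learning rates, add gradient clipping, and check the reward scale.
- **Action scaling**: scale `tanh` outputs to the environment bounds; otherwise the actor cannot reach the full action range, or the environment clips its actions and the critic is trained on actions that were never executed.
- **Replay warm-up**: start updates only after enough diverse transitions exist.
- **Overestimation bias**: DDPG can overestimate Q-values; TD3 addresses this with twin critics (taking the minimum in the target), target policy smoothing, and delayed actor updates.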
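
For the overestimation point, the sketch below shows what the TD3-style target looks like next to the DDPG one. It is a simplified scalar illustration under stated assumptions: the helper names are hypothetical, and the default `sigma=0.2` and `noise_clip=0.5` follow the values commonly quoted for TD3.

```python
import random

def td3_target(r, d, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of two target critics."""
    return r + gamma * (1.0 - d) * min(q1_next, q2_next)

def smoothed_target_action(mu_next, low, high, rng, sigma=0.2, noise_clip=0.5):
    """Target policy smoothing: add clipped Gaussian noise, then clip to the bounds."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
    return max(low, min(high, mu_next + noise))

rng = random.Random(0)
print(smoothed_target_action(1.9, -2.0, 2.0, rng))  # stays within [1.4, 2.0]

# Two hypothetical target-critic estimates for the same next state-action pair
print(td3_target(-1.0, 0.0, q1_next=-10.0, q2_next=-12.0))  # uses the pessimistic -12.0
```

Taking the minimum biases the target downward, which counteracts the upward bias that arises when the actor exploits critic errors.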


## 6) Hyperparameters explained (the ones that matter)

### `GAMMA` ($\gamma$)
Discount factor in the TD target:

$$y=r+\gamma(1-d)Q_{\phi'}(s',\mu_{\theta'}(s')).$$

- closer to 1 → longer-horizon credit assignment, but targets lean more on bootstrapped estimates, so critic errors propagate further
- smaller → more myopic, often more stable
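
A quick way to build intuition for $\gamma$ is to convert it into a horizon. The helpers below are illustrative names, not notebook code. They compute the sum of the discount weights and the number of steps after which a reward counts for less than half.

```python
import math

def effective_horizon(gamma):
    """Sum of the discount weights 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

def steps_until_weight(gamma, weight):
    """Smallest k such that gamma**k <= weight."""
    return math.ceil(math.log(weight) / math.log(gamma))

for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), steps_until_weight(gamma, 0.5))
```

With `GAMMA = 0.99` the effective horizon is about 100 steps, which is the same order as Pendulum-v1's 200-step episodes.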

### `TAU` ($\tau$)
Soft-update rate for target networks:

$$\theta'\leftarrow\tau\theta+(1-\tau)\theta'.$$

- smaller (e.g. 0.001) → targets change slowly (stable, but may learn slower)
- larger (e.g. 0.02) → targets track the online networks faster (less lag, but closer to the moving-target problem)
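
To compare values of $\tau$ on a common scale, it can help to express each as a half-life: the number of updates the target needs to close half of its gap to a frozen online network. The function name below is a hypothetical helper.

```python
import math

def target_half_life(tau):
    """Updates needed for (1 - tau)**n to fall to 0.5."""
    return math.log(0.5) / math.log(1.0 - tau)

for tau in (0.001, 0.005, 0.02):
    print(tau, round(target_half_life(tau)))
```

Under this measure the default `TAU = 0.005` corresponds to a lag of roughly 140 gradient updates.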

### `REPLAY_SIZE`
Maximum transitions stored.

- too small → poor diversity, correlated samples
- very large → more diversity but older data (off-policy mismatch) and more memory
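
The memory side of this trade-off is easy to estimate. The sketch below assumes the same layout as the `ReplayBuffer` in this notebook (float32 arrays for `obs`, `next_obs`, `act`, `rew`, and `done`). The image-sized example is hypothetical.

```python
def replay_bytes(size, obs_dim, act_dim, bytes_per_float=4):
    """Memory for obs, next_obs, act, rew, and done arrays stored as float32."""
    floats_per_transition = 2 * obs_dim + act_dim + 2
    return size * floats_per_transition * bytes_per_float

# Pendulum-v1: 3-dimensional observation, 1-dimensional action
print(replay_bytes(200_000, obs_dim=3, act_dim=1) / 1e6, 'MB')

# A hypothetical task with 84 * 84 * 4 float32 observations is far more expensive
print(replay_bytes(200_000, obs_dim=84 * 84 * 4, act_dim=1) / 1e9, 'GB')
```

For Pendulum-v1 the default `REPLAY_SIZE` costs only a few megabytes, so the buffer size here is limited by data staleness rather than memory.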

### `BATCH_SIZE`
Minibatch size for gradient updates.

- larger → smoother gradients, higher compute
- smaller → noisier updates (cheaper per step, but can destabilize the critic)

### `START_STEPS`
How long to act randomly before relying on the actor.

- helps fill replay with diverse transitions
- if too short, early actor updates overfit to narrow experience

### `UPDATE_AFTER`
Delay before starting gradient updates.

- ensures the critic’s first targets aren’t based on tiny replay buffers

### `UPDATES_PER_STEP`
How many gradient updates to do per environment step.

- `1` is the standard simple choice
- larger values increase sample reuse but can overfit to replay and amplify instability
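
To see how `START_STEPS`, `UPDATE_AFTER`, and `UPDATES_PER_STEP` interact, the sketch below counts random-action steps and gradient updates for a loop structured like the training loop in this notebook. `schedule_counts` is a hypothetical helper and assumes the buffer already holds at least one batch when updates begin.

```python
def schedule_counts(total_steps, start_steps, update_after, updates_per_step):
    """Count random-action steps and gradient updates over a whole run."""
    # Random actions are taken while the pre-step counter is below start_steps
    random_steps = min(total_steps, start_steps)
    # Updates run once the post-step counter reaches update_after
    first_update_step = max(update_after, 1)
    update_steps = max(0, total_steps - first_update_step + 1)
    return random_steps, update_steps * updates_per_step

# FAST_RUN settings: 40 episodes of 200 steps each on Pendulum-v1
print(schedule_counts(40 * 200, start_steps=2_000, update_after=1_000, updates_per_step=1))
```

With the defaults, a quarter of the short run is spent acting randomly, and the first thousand gradient updates are trained purely on random-policy data.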

### `NOISE_SIGMA`
Exploration noise standard deviation (added to the actor’s action).

- too small → agent may not discover better actions
- too large → behavior becomes too random; critic targets get noisy
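
The effect of the noise scale depends on how close the actor's action already is to a bound. The sketch below is a scalar illustration with a hypothetical helper `noisy_action`. It adds Gaussian noise to a fixed action of 1.8 in the range $[-2, 2]$ and measures how often the result is clipped.

```python
import random

def noisy_action(mu, sigma, low, high, rng):
    """Add zero-mean Gaussian noise to a deterministic action, then clip to the range."""
    return max(low, min(high, mu + rng.gauss(0.0, sigma)))

rng = random.Random(0)
for sigma in (0.1, 0.5, 2.0):
    samples = [noisy_action(1.8, sigma, -2.0, 2.0, rng) for _ in range(10_000)]
    clipped = sum(a == 2.0 or a == -2.0 for a in samples) / len(samples)
    print(sigma, round(clipped, 3))  # fraction of actions that hit a bound
```

When a large share of actions is clipped, the noise no longer explores around the policy's action and mostly produces bang-bang behavior at the bounds.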

### `HIDDEN_SIZES`
Network capacity for actor/critic.

- bigger networks can fit complex Q-functions but may be harder to train

### `GRAD_CLIP_NORM`
Gradient norm clipping (optional).

- helps prevent occasional exploding gradients; this notebook clips the actor and critic gradients separately, each to the same maximum norm
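
Norm clipping rescales all gradients by one shared factor instead of clipping each entry. The plain-Python sketch below illustrates the arithmetic. It is meant to mirror the general behavior of `torch.nn.utils.clip_grad_norm_`, but the helper and its `eps` value are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)  # direction is preserved, only the length shrinks

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)             # gradients already inside the limit are left alone
```

Because the direction is preserved, clipping acts like a per-step cap on the effective learning rate rather than a change to the objective.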


## 7) Exercises + references

### Exercises

1. Replace Gaussian exploration with **Ornstein–Uhlenbeck** noise and compare learning.
2. Add **LayerNorm** to the actor/critic MLPs; does it stabilize training?
3. Implement **TD3** changes (twin critics + target policy smoothing) and compare the Q-value diagnostics.
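
As a possible starting point for exercise 1, here is a minimal scalar Ornstein–Uhlenbeck process in plain Python. The class name and the `dt` value are assumptions; `theta=0.15` and `sigma=0.2` are the values reported in the DDPG paper. Extending it to vector actions and wiring it into the agent is left as part of the exercise.

```python
import random

class OUNoise:
    """Discretized OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Call at the start of each episode so noise does not carry over."""
        self.x = self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

noise = OUNoise(seed=0)
samples = [noise.sample() for _ in range(5)]
print(samples)  # consecutive values are correlated, unlike independent Gaussian draws
```

Unlike the Gaussian noise used above, each sample here depends on the previous one, so the perturbation drifts smoothly over several steps.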

### References

- Lillicrap et al., *Continuous control with deep reinforcement learning* (DDPG): https://arxiv.org/abs/1509.02971
- OpenAI Spinning Up (DDPG explanation + tips): https://spinningup.openai.com/en/latest/algorithms/ddpg.html
- Stable-Baselines (archived TF implementations): https://github.com/hill-a/stable-baselines
- Stable-Baselines3 docs: https://stable-baselines3.readthedocs.io/
