## 🌀 04-d2 · Diffusion-Policy Pattern with Ray Train  
In this notebook you build a **mini diffusion-policy pipeline** on a **real Pendulum-v1 offline dataset** and run it end-to-end on an Anyscale cluster with **Ray Train V2**.

### What you’ll learn & take away  
* How to use **Ray Data** to stream and preprocess Gymnasium rollouts in parallel across CPU workers  
* How to scale training across **multiple A10G GPUs** using `TorchTrainer` with a minimal `LightningModule`  
* How to **checkpoint every epoch** with `ray.train.report()` for robust fault tolerance and auto-resume  
* How to log and visualize metrics using **Ray’s built-in results and observability tooling**  
* How to generate actions from a trained policy directly in-notebook, with **no need to repackage or redeploy**  
* How to run the full pipeline on **Anyscale Workspaces** with no infrastructure setup or cluster config required  

### 🔢 What problem are you solving? (Inverted Pendulum, Diffusion-Style)

You’re training a policy to **swing up and balance an inverted pendulum** — a classic control problem.  
In the Gym `Pendulum-v1` environment, the agent sees the current state of the pendulum and must decide what **torque** to apply at the pivot.

---

### What's a policy?

A **policy** is a function that maps the current state to an action:

$$
\pi_\theta(s_{k}) \;\longrightarrow\; u_{k}
$$

Here:
- The **state** $s_k$ describes where the pendulum is and how fast it’s moving  
- The **action** $u_k$ is the torque you apply to influence future motion  
- The **goal** is to learn a policy that keeps the pendulum upright by generating the right torque at every step

---

### Environment state and action

At each timestep:

| Symbol        | Dim    | Meaning                           |
|---------------|--------|-----------------------------------|
| $\theta_{k}$    | scalar | Angle of the pendulum             |
| $\dot\theta_{k}$| scalar | Angular velocity                  |
| $u_{k}$         | scalar | Torque applied to the base        |

The pendulum starts at a random angle and must swing up and balance.

Encode the state as:

$$
s_{k} = [\cos\theta_{k},\ \sin\theta_{k},\ \dot\theta_{k}] \in \mathbb{R}^3
$$

This avoids the angle discontinuity at $\pm\pi$ and keeps the two angle components in $[-1, 1]$; the angular velocity $\dot\theta_{k}$ keeps its original scale.
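
A minimal sketch of this encoding, using only the Python standard library. The helper name `encode_state` is hypothetical and not part of the notebook or Gymnasium; it only illustrates why two angles on either side of the $\pm\pi$ wrap-around end up close together after encoding.

```python
import math

def encode_state(theta: float, theta_dot: float) -> list:
    """Map (angle, angular velocity) to [cos(theta), sin(theta), theta_dot]."""
    return [math.cos(theta), math.sin(theta), theta_dot]

# Two angles just on either side of the +/-pi wrap-around, same velocity.
theta_a = math.pi - 0.01
theta_b = -math.pi + 0.01
a = encode_state(theta_a, 0.5)
b = encode_state(theta_b, 0.5)

# The raw angles are almost 2*pi apart, but the encoded states nearly coincide.
raw_gap = abs(theta_a - theta_b)
gap = math.dist(a, b)
print(f"raw angle gap: {raw_gap:.3f}, encoded gap: {gap:.3f}")
```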

---

### 1. Dataset tuples

Train on a **log of actions** from a random policy, then inject artificial noise to simulate the diffusion process:

$$
\varepsilon_{k} \sim \mathcal{N}(0, 1), \quad t_{k} \sim \text{Uniform}\{0,\dots,T{-}1\}
$$

and construct a noisy action:

$$
\tilde{u}_k = u_{k} + \varepsilon_{k}
$$
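
The sampling above could be sketched as follows. The function name `make_training_tuple` and the value `T = 50` are assumptions chosen for illustration; the notebook's actual preprocessing may differ, for example by operating on whole batches of tensors.

```python
import random

T = 50  # assumed number of diffusion timesteps (illustrative value)

def make_training_tuple(state, action, rng):
    """Return (state, noisy_action, timestep, noise) for one logged transition."""
    eps = rng.gauss(0.0, 1.0)    # epsilon_k ~ N(0, 1)
    t = rng.randrange(T)         # t_k ~ Uniform{0, ..., T-1}
    noisy_action = action + eps  # u~_k = u_k + epsilon_k
    return state, noisy_action, t, eps

rng = random.Random(0)   # seeded so the example is reproducible
state = [1.0, 0.0, 0.0]  # cos(theta)=1, sin(theta)=0: upright and at rest
s, u_noisy, t, eps = make_training_tuple(state, action=0.3, rng=rng)
```

Returning the noise alongside the noisy action keeps the regression target available for the training loss.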

---

### 2. Training objective

Train a model $f_\theta$ to predict the injected noise, given the state, the noisy action, and the timestep:

$$
\mathcal{L} = \mathbb{E}_{s_{k},\varepsilon_k,t_{k}}\ \big\|f_\theta(s_k, \tilde{u}_k, t_{k}) - \varepsilon_k\big\|_2^2
$$

Minimizing this loss teaches the model to **de-noise** $\tilde{u}_{k}$ back toward the logged action $u_k$.
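
As a rough illustration of this objective, the sketch below computes the mean squared error over a small hand-written batch. Both `noise_prediction_loss` and `zero_denoiser` are hypothetical names; `zero_denoiser` is a placeholder standing in for $f_\theta$, not the notebook's model. In PyTorch the same quantity would typically be computed with `torch.nn.functional.mse_loss`.

```python
def noise_prediction_loss(model, batch):
    """Mean squared error between predicted and injected noise over a batch."""
    total = 0.0
    for state, noisy_action, t, eps in batch:
        pred = model(state, noisy_action, t)
        total += (pred - eps) ** 2
    return total / len(batch)

def zero_denoiser(state, noisy_action, t):
    """Placeholder model that always predicts zero noise."""
    return 0.0

# Each entry: (state, noisy action, timestep, injected noise).
batch = [
    ([1.0, 0.0, 0.0], 0.8, 3, 0.5),
    ([0.0, 1.0, 0.2], -1.0, 7, -1.0),
]
loss = noise_prediction_loss(zero_denoiser, batch)
```

A model that always predicts zero pays the full squared noise on every sample, so any useful denoiser should score below this baseline.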

---

### 3. Reverse diffusion (sampling)

At inference time, start from noise $x_T \sim \mathcal{N}(0, 1)$ and de-noise step by step with step size $\eta$:

$$
x_{t} \;\leftarrow\; x_{t} - \eta \cdot f_\theta(s, x_{t}, t), \quad t = T{-}1, \dots, 0
$$

After $T$ steps:

$$
x_0 \approx u^\star
$$

is a valid torque for the current state — a sample from your learned diffusion policy.
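
The update rule can be sketched as a short loop. Everything named here is an assumption for illustration: `sample_action`, the toy stand-in `toy_denoiser`, the constant `TARGET`, and the values `num_steps=50` and `eta=0.2`. A trained network would replace `toy_denoiser`.

```python
import random

def sample_action(model, state, num_steps, eta, rng):
    """Start from Gaussian noise and apply x <- x - eta * f(s, x, t) for t = T-1..0."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(num_steps)):
        x = x - eta * model(state, x, t)
    return x

TARGET = 0.3  # hypothetical torque that the toy model pulls samples toward

def toy_denoiser(state, x, t):
    """Stand-in for f_theta: treats the offset of x from TARGET as the noise."""
    return x - TARGET

action = sample_action(
    toy_denoiser, [1.0, 0.0, 0.0], num_steps=50, eta=0.2, rng=random.Random(0)
)
```

With this toy model each step shrinks the distance to `TARGET` by a factor of `1 - eta`, so the sample converges geometrically regardless of the initial noise draw.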

---

### 🧭 How you’ll scale this policy learning workload using Ray on Anyscale

This tutorial shows how to take a **local PyTorch + Gymnasium workflow** and migrate it to a fully **distributed, fault-tolerant Ray pipeline running on Anyscale** with minimal code changes.

Here’s how the transition works:

1. **Gym rollouts → Ray Dataset**  
   Generate simulation rollouts from `Pendulum-v1` and stream them directly into a **Ray Dataset**, enabling distributed preprocessing (for example, normalization) and automatic partitioning across workers.

2. **Local Training → Cluster-scale Distributed Training**  
   Wrap a minimal `LightningModule` in a Ray Train `train_loop`, then launch training with **TorchTrainer** across 8 A10G GPUs. Ray handles data sharding, worker setup, and device placement without boilerplate.

3. **Manual State Saving → Structured Checkpointing & Resumption**  
   At the end of each epoch, save model weights and metadata with `ray.train.report(checkpoint=...)`. Ray then **auto-resumes training** from the latest checkpoint after restarts. This requires no further logic.

4. **Ad-hoc Coordination → Declarative Orchestration**  
   Replace manual logging, retry logic, and resource management with **Ray-native configs** (`ScalingConfig`, `CheckpointConfig`, `FailureConfig`), letting Ray + Anyscale own the orchestration.

5. **Notebook-only Inference → Cluster-aware Evaluation**  
   After training, perform **reverse diffusion sampling** in-notebook using the latest checkpoint. The same sampling code can scale out to distributed Ray tasks or serve as the basis for a production rollout.
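
The configuration side of steps 2 to 4 might look roughly like the sketch below. The helper `build_trainer` is hypothetical, and the values `num_to_keep=2` and `max_failures=3` are assumptions rather than settings taken from this notebook; only the worker count of 8 comes from the text above. Ray is imported inside the function so the sketch can be loaded without Ray installed, but calling it requires `ray[train]`.

```python
def build_trainer(train_loop_per_worker, train_dataset, storage_path):
    """Assemble a TorchTrainer with scaling, checkpoint, and failure configs."""
    # Lazy imports: only needed when the trainer is actually built.
    from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        # One training worker per GPU, 8 GPUs in total.
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
        run_config=RunConfig(
            storage_path=storage_path,  # shared location for checkpoints
            checkpoint_config=CheckpointConfig(num_to_keep=2),  # assumed retention
            failure_config=FailureConfig(max_failures=3),       # assumed retry budget
        ),
        # Ray shards this dataset across the training workers.
        datasets={"train": train_dataset},
    )
```

A call such as `build_trainer(train_loop, ds, "/mnt/cluster_storage/run")` followed by `.fit()` would launch training; the storage path shown is a placeholder, not a path defined by this tutorial.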

This flow upgrades a local notebook into a **multi-node, resilient training + inference pipeline**, using Ray’s native abstractions and running seamlessly inside an Anyscale Workspace, without sacrificing dev agility.
