## 🕒 04c · Time-Series Workload Pattern with Ray Train  
In this notebook you tackle **New York City (NYC) taxi-demand forecasting** (2014 half-hourly counts) and scale a *sequence-to-sequence Transformer* across an Anyscale cluster using **Ray Train V2**.

### What you’ll learn & take away  
- **Ray Train V2 distributed loops** – wrap a PyTorch Transformer in `TorchTrainer` and run it across 8 GPUs with a *single* `ScalingConfig` line.  
- **Fault-tolerant checkpointing on Anyscale** – recover seamlessly from pre-emptions or node failures with automatic epoch-level checkpoints.  
- **Remote GPU inference from checkpoints** – launch short-lived GPU tasks for batch forecasts without redeploying the whole trainer.  

By the end, you know how to take a single-node notebook forecast and scale its data, training, and inference stages on any Anyscale cluster.  

### 🔢 What problem are you solving? (NYC Taxi Demand Forecasting with a Transformer)

You want to predict the **next 24 hours (48 half-hour slots)** of taxi pickups in NYC, given one week of historical demand.  
Accurate short-term forecasts help ride-hailing fleets, traffic planners, and dynamic pricing engines allocate resources efficiently.

---

### What's a Sequence-to-Sequence Transformer?

A **Transformer** models a sequence by stacking self-attention layers that capture long-range dependencies without recurrence.  
Your architecture learns a function  

$$
f_\theta : \underbrace{\mathbb{R}^{T\times 1}}_{\text{past}} \;\longrightarrow\; \underbrace{\mathbb{R}^{F}}_{\text{future}}
$$

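where $T=168$ past half-hour slots and $F=48$ future slots. At half-hourly resolution, 168 slots span 3.5 days; a full week of history would be $T=336$.  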
During training you use **teacher forcing**: the decoder receives the ground-truth target sequence shifted by one step, so the model can focus on learning residual patterns instead of having to invent its own initial context.
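
As a minimal sketch of that shift, assuming the decoder is seeded with the last observed input value (the notebook's actual start token may differ), the decoder input is the target sequence moved right by one slot. `make_decoder_input` is a hypothetical helper name used only for illustration.

```python
def make_decoder_input(past, future):
    """Build the teacher-forcing decoder input for one training window.

    past   : list of T observed values (encoder input)
    future : list of F ground-truth values (training target)

    Assumption: the decoder is seeded with the last observed value,
    followed by the ground truth shifted right by one step.
    """
    if not past or not future:
        raise ValueError("past and future must be non-empty")
    return [past[-1]] + future[:-1]


past = [0.1, 0.2, 0.3, 0.4]   # toy history (T = 4)
future = [0.5, 0.6, 0.7]      # toy target  (F = 3)
decoder_input = make_decoder_input(past, future)
print(decoder_input)          # [0.4, 0.5, 0.6]
```

At step $k$ the decoder sees the true value for step $k-1$, so its input has the same length as the target and never contains the value it is asked to predict.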

---

### 🧭 How you migrate this time-series workload to a distributed multi-node setup using Ray on Anyscale
This tutorial walks through the end-to-end process of **migrating a single-GPU PyTorch forecasting pipeline to a distributed Ray cluster running on Anyscale**.

Here’s how you make that transition:

1. **Local CSV → Shared Parquet**  
   Download the NYC Taxi dataset as a CSV, resample it to 30-minute intervals, normalize the values, and save it as **Parquet shards** in a shared filesystem (`/mnt/cluster_storage`) — the default storage for Anyscale clusters.

2. **Single-loop preprocessing → Sliding window generation for Distributed Data Parallel (DDP)**  
   Create overlapping input/output windows (past → future) to train a forecasting model. While this preprocessing is local and sequential here, it mirrors pipelines that parallelize with **Ray Data** in large-scale settings. (See other tutorials in this module that incorporate Ray Data for reference.)

3. **Vanilla PyTorch → Distributed Ray Train**  
   Define a `train_loop_per_worker()` function and use **Ray Train** to launch **8 GPU workers** across the cluster. Each worker loads its own Parquet shard, trains on it under Distributed Data Parallel (DDP) with gradients synchronized across workers, and reports live metrics.

4. **Manual device logic → Scalable cluster orchestration**  
   Instead of managing GPUs or process groups manually, configure `ScalingConfig`, `RunConfig`, and `FailureConfig`. **Ray + Anyscale handle fault-tolerant execution across nodes.**

5. **Offline inference → Distributed forecasting with remote Ray tasks**  
   Define a `@ray.remote` forecasting function that loads a trained checkpoint and runs prediction on the latest data window. This allows **parallel, stateless inference** on any GPU in the cluster.
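
The normalization and windowing in steps 1 and 2 can be sketched in plain Python as follows. This is an illustration only: `zscore` and `make_windows` are hypothetical helper names, z-score scaling is an assumed choice of normalization, and the toy series stands in for the resampled taxi counts.

```python
from statistics import mean, pstdev


def zscore(series):
    """Scale a series to zero mean and unit variance (assumed normalization)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("series is constant; cannot normalize")
    return [(x - mu) / sigma for x in series]


def make_windows(series, input_window, horizon, stride=1):
    """Return overlapping (past, future) pairs from a 1-D series."""
    windows = []
    last_start = len(series) - input_window - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_window
        windows.append((series[start:split], series[split:split + horizon]))
    return windows


# Toy example: 10 points, 4 steps of history, 2 steps of forecast.
series = [float(v) for v in range(10)]
pairs = make_windows(series, input_window=4, horizon=2)
print(len(pairs))   # 5
print(pairs[0])     # ([0.0, 1.0, 2.0, 3.0], [4.0, 5.0])
```

With the window sizes quoted earlier, the equivalent call would be `make_windows(zscore(counts), input_window=168, horizon=48)`, where `counts` is a placeholder for the half-hourly pickup series.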

This pattern takes a local academic-style time-series workflow and scales it into a **cluster-resilient, fault-tolerant forecasting pipeline**, all while preserving your native PyTorch modeling code.
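
The Ray Train portion of this pattern (steps 3 and 4) might look roughly like the sketch below. It is not the notebook's actual code: the linear model and random batches are placeholders for the Transformer and the Parquet shards, and the run name, storage subdirectory, learning rate, epoch count, and `max_failures` value are all assumptions. `build_trainer` is a hypothetical helper name.

```python
import os
import tempfile


def train_loop_per_worker(config):
    """Runs once on each Ray Train worker. Imports are local so the sketch
    can be loaded on a machine without Ray or PyTorch installed."""
    import torch
    import ray.train
    import ray.train.torch

    # Placeholder model (assumption): swap in the notebook's Transformer.
    model = torch.nn.Linear(config["input_window"], config["horizon"])
    model = ray.train.torch.prepare_model(model)  # device placement + DDP wrap
    device = ray.train.torch.get_device()
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.MSELoss()

    for epoch in range(config["epochs"]):
        # Synthetic batch (assumption) standing in for this worker's shard.
        x = torch.randn(32, config["input_window"], device=device)
        y = torch.randn(32, config["horizon"], device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Every worker reports metrics; only rank 0 attaches a checkpoint.
        with tempfile.TemporaryDirectory() as tmp:
            checkpoint = None
            if ray.train.get_context().get_world_rank() == 0:
                state = getattr(model, "module", model).state_dict()
                torch.save(state, os.path.join(tmp, "model.pt"))
                checkpoint = ray.train.Checkpoint.from_directory(tmp)
            ray.train.report(
                {"epoch": epoch, "loss": loss.item()}, checkpoint=checkpoint
            )


def build_trainer(num_workers=8):
    """Assemble a TorchTrainer; call .fit() on the result to launch training."""
    from ray.train import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"input_window": 168, "horizon": 48, "lr": 1e-3, "epochs": 2},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=True),
        run_config=RunConfig(
            name="nyc_taxi_transformer",                      # hypothetical run name
            storage_path="/mnt/cluster_storage/ray_results",  # hypothetical subdirectory
            failure_config=FailureConfig(max_failures=3),     # assumed retry budget
        ),
    )
```

On a cluster with enough GPUs, `result = build_trainer().fit()` would start the run. A complete loop would also call `ray.train.get_checkpoint()` at the start of `train_loop_per_worker` to restore model state after a failure, which this sketch omits for brevity.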