# 04c Time-Series workload pattern with Ray Train  
In this notebook you tackle **New York City (NYC) taxi-demand forecasting** (2014 half-hourly counts) and scale a *sequence-to-sequence Transformer* across an Anyscale cluster using **Ray Train V2**.

### What you learn and take away  
- **Ray Train V2 distributed loops**: wrap a PyTorch Transformer in `TorchTrainer` and run it across 8 GPUs with a *single* `ScalingConfig` line.  
- **Fault-tolerant checkpointing on Anyscale**: recover seamlessly from pre-emptions or node failures with automatic epoch-level checkpoints.  
- **Inference from checkpoints**: use **Ray Data** with stateful GPU actors to perform scalable batch forecasts directly from saved checkpoints.  

By the end, you’ll know how to take a single-node notebook forecast and scale data loading, training, and inference seamlessly across an Anyscale cluster.
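
As a rough sketch of the first two takeaways, the snippet below wires a placeholder training function into `TorchTrainer`. Only the 8 GPU workers and the `/mnt/cluster_storage` path come from this notebook; the run name, `max_failures=3`, `num_to_keep=2`, and the loop body are illustrative assumptions.

```python
import ray.train
from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config: dict) -> None:
    """Placeholder loop: the real notebook trains the Transformer here."""
    for epoch in range(config["epochs"]):
        # Report per-epoch metrics back to the driver (values are dummies).
        ray.train.report({"epoch": epoch, "loss": 0.0})


def build_trainer() -> TorchTrainer:
    """Assemble the trainer without starting it."""
    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 2},  # assumed value
        # The "single line" that scales the loop to 8 GPU workers.
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
        run_config=RunConfig(
            name="nyc_taxi_transformer",  # hypothetical run name
            storage_path="/mnt/cluster_storage",  # shared storage on the cluster
            failure_config=FailureConfig(max_failures=3),  # assumed retry budget
            checkpoint_config=CheckpointConfig(num_to_keep=2),  # assumed retention
        ),
    )
```

On a cluster with 8 GPUs available, `build_trainer().fit()` would launch the run and return a result object with the final metrics and checkpoint.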

### What problem are you solving? (NYC taxi demand forecasting with a Transformer)

You want to predict the **next 24 hours (48 half-hour slots)** of taxi pickups in NYC, given one week of historical demand.  
Accurate short-term forecasts help ride-hailing fleets, traffic planners, and dynamic pricing engines allocate resources efficiently.
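
One way to picture the past-to-future pairing is a sliding window over the demand series. The helper below is a minimal pure-Python sketch; `make_windows`, its `stride` parameter, and the toy series are hypothetical names and data, not the notebook's own implementation.

```python
from typing import List, Tuple

Window = Tuple[List[float], List[float]]


def make_windows(
    series: List[float], input_window: int, horizon: int, stride: int = 1
) -> List[Window]:
    """Return overlapping (past, future) pairs cut from one series."""
    windows: List[Window] = []
    # Last index at which a full past + future span still fits.
    last_start = len(series) - input_window - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_window
        past = series[start:split]
        future = series[split:split + horizon]
        windows.append((past, future))
    return windows


# Toy example: 10 points, 4 past steps, 2 future steps.
toy_series = [float(i) for i in range(10)]
toy_pairs = make_windows(toy_series, input_window=4, horizon=2)
for past, future in toy_pairs:
    print(past, "->", future)
```

With a stride of 1, a series of $N$ points yields $N - T - F + 1$ windows, so consecutive windows share all but one time step.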

---

### What's a sequence-to-sequence Transformer?

A **Transformer** models a sequence by stacking self-attention layers, which capture long-range dependencies without recurrence.  
Your architecture learns a function  

$$
f_\theta : \underbrace{\mathbb{R}^{T\times 1}}_{\text{past}} \;\longrightarrow\; \underbrace{\mathbb{R}^{F}}_{\text{future}}
$$

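where $T=168$ half-hour slots and $F=48$ half-hour slots (24 hours). At 30-minute resolution, 168 slots cover 3.5 days; a full week would be $T=336$.  
During training you use **teacher forcing** (a design choice): the decoder receives the ground-truth target sequence shifted by one step instead of its own predictions, so early mistakes don't compound and training stabilizes faster.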
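
The shift behind teacher forcing can be sketched in a few lines of plain Python. This assumes the decoder is seeded with the last observed value of the past window, which is one common choice and not necessarily the one this notebook makes; `make_decoder_input` and the toy numbers are hypothetical.

```python
from typing import List


def make_decoder_input(past: List[float], future: List[float]) -> List[float]:
    """Shift the target right by one step, seeding with the last past value."""
    return [past[-1]] + future[:-1]


toy_past = [10.0, 12.0, 11.0, 13.0]  # toy history, T = 4
toy_future = [14.0, 15.0, 16.0]      # toy target, F = 3
decoder_in = make_decoder_input(toy_past, toy_future)

# At step i the decoder sees decoder_in[i] and is trained to emit toy_future[i].
for seen, target in zip(decoder_in, toy_future):
    print(f"decoder sees {seen:5.1f} -> learns to predict {target:5.1f}")
```

At inference time no ground truth exists, so the decoder input has to be built from the model's own previous outputs (or a fixed placeholder) instead.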

---

### How to migrate this time-series workload to a distributed multi-node setup using Ray on Anyscale
This tutorial walks through the end-to-end process of **migrating a single-GPU PyTorch forecasting pipeline to a distributed Ray cluster running on Anyscale**.

Follow these steps to make the transition:

1. **Migrate local CSV data to shared Parquet**  
   Download the NYC taxi dataset as a CSV, resample it to 30-minute intervals, normalize the values, and save it as **Parquet shards** in a shared filesystem (`/mnt/cluster_storage`), the default shared storage on Anyscale clusters.

2. **Generate sliding windows for Distributed Data Parallel (DDP) training**  
   Create overlapping input and output windows (past to future) to train a forecasting model. While this preprocessing is local and sequential in this tutorial, it mirrors pipelines that you can parallelize with **Ray Data** in large-scale settings.

3. **Define a vanilla PyTorch training function for Ray Train**  
   Define a `train_loop_per_worker()` function and use **Ray Train** to launch **8 GPU workers** across the cluster. Each worker loads its own Parquet shard, runs the forward and backward pass on it under Distributed Data Parallel (DDP), which synchronizes gradients across workers, and reports live metrics.

4. **Configure Ray for scalable cluster orchestration**  
   Instead of managing GPUs or process groups manually, configure `ScalingConfig`, `RunConfig`, and `FailureConfig`. **Ray and Anyscale handle fault-tolerant execution across nodes.**

5. **Perform distributed batch inference with Ray Data**  
   Use **Ray Data** with stateful GPU actors to load the trained checkpoint once per worker and run scalable, parallel forecasts on the latest data windows.  
   This enables **efficient, reusable, and fault-tolerant** inference across the cluster.
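
The stateful-actor pattern from step 5 can be sketched as follows. Treat it as an illustration under stated assumptions: `Forecaster` and `run_batch_forecast` are hypothetical names, the "model" is a stand-in that repeats the last observed value rather than a Transformer restored from a checkpoint, and `concurrency=2` and `batch_size=32` are arbitrary.

```python
import numpy as np
import ray


class Forecaster:
    """Callable class: Ray Data creates one instance per actor and reuses it."""

    def __init__(self, horizon: int):
        # In the real pipeline, load the trained Transformer from the checkpoint
        # here and move it to the GPU. This runs once per actor, not per batch.
        self.horizon = horizon

    def __call__(self, batch: dict) -> dict:
        past = batch["data"]  # array of shape (batch_size, T)
        # Placeholder forecast: repeat the last observed value for every slot.
        forecast = np.repeat(past[:, -1:], self.horizon, axis=1)
        return {"forecast": forecast}


def run_batch_forecast(windows: np.ndarray, horizon: int = 48) -> list:
    """Score an (N, T) array of input windows with a pool of actors."""
    ds = ray.data.from_numpy(windows)  # one row per window, column named "data"
    predictions = ds.map_batches(
        Forecaster,
        fn_constructor_kwargs={"horizon": horizon},
        concurrency=2,  # number of actors in the pool (assumed value)
        batch_size=32,  # assumed value
        # num_gpus=1,   # add on a GPU cluster so each actor reserves one GPU
    )
    return predictions.take_all()
```

Because the constructor runs once per actor, the cost of loading the checkpoint is paid per worker rather than per batch, which is the point of using a class instead of a plain function in `map_batches`.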

This pattern takes a local academic-style time-series workflow and scales it into a **cluster-resilient, fault-tolerant forecasting pipeline**, all while preserving your native PyTorch modeling code.