## 🔄 02 · Integrating Ray Train with Ray Data  
In this module you’ll extend distributed training with **Ray Train** by adding **Ray Data** to the pipeline. Instead of relying on a local PyTorch DataLoader, you’ll stream batches directly from a distributed **Ray Dataset**, enabling scalable preprocessing and just-in-time data loading across the cluster.  

### What you’ll learn & take away  
* When to integrate **Ray Data** with Ray Train — e.g., for CPU-heavy preprocessing, online augmentations, or multi-format data ingestion  
* How to replace `DataLoader` with **`iter_torch_batches()`** to stream batches into your training loop  
* How to shard, shuffle, and preprocess data in parallel across the cluster before feeding it into GPUs  
* How to define a **training loop** that consumes Ray Dataset shards instead of DataLoader tuples  
* How to prepare datasets (for example, in Parquet format) so they can be efficiently read and transformed with Ray Data  
* How to pass Ray Datasets into the `TorchTrainer` with the `datasets` parameter  
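
As a rough preview of how these pieces fit together, here is a minimal sketch. It assumes `ray[train]` and `torch` are installed. The toy linear model, the synthetic `make_rows` helper, the column names `x` and `y`, and the hyperparameters are hypothetical stand-ins, not the model or dataset this module uses. Ray is imported inside the functions only so that the helper can be used without a running cluster.

```python
def make_rows(n: int) -> list:
    """Synthetic rows for y = 2x with x in [0, 1); illustrative data only."""
    return [{"x": i / n, "y": 2.0 * (i / n)} for i in range(n)]


def train_loop_per_worker(config: dict) -> None:
    import torch
    import ray.train
    from ray.train.torch import prepare_model

    # Toy model (assumption); prepare_model wraps it for distributed training.
    model = prepare_model(torch.nn.Linear(1, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.MSELoss()

    # Each worker receives its own shard of the dataset registered as "train".
    shard = ray.train.get_dataset_shard("train")

    for _ in range(config["num_epochs"]):
        # Batches arrive as dicts of tensors keyed by column name,
        # not as (inputs, labels) tuples from a DataLoader.
        for batch in shard.iter_torch_batches(
            batch_size=config["batch_size"], dtypes=torch.float32
        ):
            x = batch["x"].unsqueeze(1)
            y = batch["y"].unsqueeze(1)
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


def main() -> None:
    import ray.data
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    train_ds = ray.data.from_items(make_rows(256))
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 0.1, "num_epochs": 2, "batch_size": 32},
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        # The key "train" must match the name passed to get_dataset_shard.
        datasets={"train": train_ds},
    )
    trainer.fit()
```

Calling `main()` on a machine with Ray and PyTorch installed would launch two CPU workers, each iterating over its own shard of the dataset.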

> With Ray Data, you can scale preprocessing and training independently — CPUs handle input pipelines, GPUs focus on training — ensuring **higher utilization and throughput** in your distributed workloads.  

The code blocks in this module depend on the previous module, **Introduction to Ray Train**.

### 🔎 Integrating Ray Train with Ray Data  

Use both Ray Train and Ray Data when you face one of the following challenges:  

| Challenge | Detail | Solution |
| --- | --- | --- |
| Need to perform online or just-in-time data processing | The training pipeline requires processing data on the fly, such as data augmentation, normalization, or other transformations that may differ for each training epoch. | Ray Data applies transformations lazily as batches stream into the Ray Train training loop, so the processing runs just in time. |
| Need to improve hardware utilization | Training and data processing need to be scaled independently to keep GPUs fully utilized, especially when preprocessing is CPU-intensive. | Ray Data can distribute data processing across multiple CPU nodes, while Ray Train runs the training loop on GPUs. |
| Need a consistent interface for loading data | The training process may need to load data from various sources, such as Parquet, CSV, or lakehouses. | Ray Data provides a consistent interface for loading, shuffling, sharding, and batching data for training loops. |
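
To make the just-in-time row of the table concrete, here is a small sketch of a per-row transform attached to a Ray Dataset with `Dataset.map`. The function `scale_and_jitter`, the helper `build_pipeline`, and the column names `pixel` and `label` are hypothetical. The sketch assumes the transformed dataset is not materialized, in which case Ray Data should re-run the transform as batches are streamed, so the random jitter can differ between epochs.

```python
import random
from typing import Optional


def scale_and_jitter(row: dict, rng: Optional[random.Random] = None) -> dict:
    """Scale a pixel value to [0, 1] and add a small random perturbation.

    Hypothetical augmentation; 'pixel' and 'label' are assumed column names.
    """
    rng = rng if rng is not None else random.Random()
    scaled = row["pixel"] / 255.0
    jitter = rng.uniform(-0.01, 0.01)
    # Clamp so the augmented value stays inside the valid range.
    pixel = min(1.0, max(0.0, scaled + jitter))
    return {"pixel": pixel, "label": row["label"]}


def build_pipeline(rows: list):
    """Attach the transform lazily; nothing executes until the data is consumed."""
    import ray.data  # imported here so the transform is usable without Ray

    ds = ray.data.from_items(rows)
    # map() applies the function to each row on CPU workers in the cluster.
    return ds.map(scale_and_jitter)
```

The dataset returned by `build_pipeline` could then be passed to `TorchTrainer` through the `datasets` parameter, leaving the CPU-side transform to run alongside training rather than as a separate preprocessing job.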