## 🖼️ 04a · Computer-Vision Pattern with Ray Train

This notebook walks through an end-to-end, **real-world computer-vision workflow** that runs on an Anyscale cluster using **Ray Train**. You pull a slice of the Food-101 dataset, push it through a lightweight preprocessing pipeline, store it efficiently in Parquet, and then fine-tune a ResNet-18 in a fault-tolerant, distributed manner. Along the way, you lean on Ray’s helpers to prepare data loaders, coordinate workers, checkpoint automatically, resume after failure, and even launch GPU inference jobs, all without writing low-level distributed code.

### What you’ll learn & take away

- Launch distributed training with **Ray Train’s `TorchTrainer`** and configure it for multi-GPU, multi-node execution.  
- Use **Ray Train’s built-in utilities** (`prepare_model`, `prepare_data_loader`, `get_checkpoint`, `train.report`) to wrap your existing PyTorch code without modifying your modeling logic.  
- Save and resume from **automatic, fault-tolerant checkpoints** across epochs.  
- Offload batch **inference as a Ray remote task**, allowing you to treat inference as a scalable workload.  
- Run end-to-end training and evaluation without needing to understand the low-level mechanics of distributed systems.

By the end of the tutorial you have a working model, clear loss curves, and a hands-on feel for how Ray Train simplifies distributed computer-vision workloads.

### 🔢 What problem are you solving? (Image classification with Food-101-Lite)

In this notebook you train a neural network to **classify food photos** into one of **10 categories**  
using the **Food-101-Lite** dataset—a compact, 10-class subset of the original Food-101 benchmark.

---

### Inputs  

Every sample is a 3-channel Red-Green-Blue (RGB) image, resized to $224 \times 224$:

$$
x \;\in\; [0,1]^{3 \times 224 \times 224}\;.
$$

You apply standard vision transforms (normalization, random crop/flip) and batch the data with plain **PyTorch DataLoader** (wrapped by `ray.train.torch.prepare_data_loader` for distributed training).
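
To make the normalization step concrete, here is a minimal pure-Python sketch of what happens to a single RGB pixel: it is scaled into $[0,1]$ and then standardized per channel. The ImageNet mean and standard deviation used here are an assumption (they are the usual choice for pretrained ResNets), and `to_unit_range` and `normalize_pixel` are hypothetical helper names, not part of the notebook.

```python
# Assumed ImageNet channel statistics, commonly paired with pretrained
# ResNets; the notebook itself may use different values.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def to_unit_range(pixel):
    """Map an 8-bit intensity (0..255) to the interval [0, 1]."""
    return pixel / 255.0


def normalize_pixel(rgb):
    """Scale an (R, G, B) pixel to [0, 1], then standardize each channel."""
    return tuple(
        (to_unit_range(value) - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print(white, black)
```

After standardization the values are no longer confined to $[0,1]$; a black pixel, for example, maps to negative values in every channel.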

---

### Labels  

Each image belongs to one of ten classes:

['pizza', 'hamburger', 'sushi', 'ramen', 'fried rice',
'steak', 'hot dog', 'pancake', 'burrito', 'caesar salad']


The label is an integer $y \in \{0, \dots, 9\}$ used for supervision.
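
As a small illustration, the mapping between class names and integer labels can be sketched as follows. This assumes the indices follow the order of the list above, which the notebook may or may not do; `CLASS_TO_IDX` and `IDX_TO_CLASS` are hypothetical names.

```python
CLASSES = [
    "pizza", "hamburger", "sushi", "ramen", "fried rice",
    "steak", "hot dog", "pancake", "burrito", "caesar salad",
]

# Assumption: label indices follow list order (pizza -> 0, ..., caesar salad -> 9).
CLASS_TO_IDX = {name: idx for idx, name in enumerate(CLASSES)}
IDX_TO_CLASS = dict(enumerate(CLASSES))

print(CLASS_TO_IDX["sushi"], IDX_TO_CLASS[9])
```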

---

### What does the model learn?

You train a compact CNN (for example, **ResNet-18**) to map an image $x$ to class probabilities:

$$
f_\theta(x)\;=\;\hat{y}\;\in\;\mathbb{R}^{10}.
$$

Training minimizes the **cross-entropy loss**

$$
\mathcal{L}(x,y)\;=\;-\log \bigl(\hat{y}_{\,y}\bigr),
$$

so the network assigns high likelihood to the correct class.
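
The loss above can be worked through by hand. The sketch below assumes the network emits raw scores (logits) that a softmax turns into the probabilities $\hat{y}$; that is the usual PyTorch setup, but it is an assumption here. `softmax` and `cross_entropy` are illustrative helpers, not the notebook's code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# An untrained model that scores all 10 classes equally pays log(10).
uniform_loss = cross_entropy([0.0] * 10, label=3)

# A model that strongly favors the correct class pays much less.
confident_logits = [10.0 if i == 3 else 0.0 for i in range(10)]
confident_loss = cross_entropy(confident_logits, label=3)

print(uniform_loss, confident_loss)
```

The uniform case gives a useful sanity check: with 10 classes, a loss near $\log 10 \approx 2.30$ at the start of training means the model is still guessing.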

---

### 🧭 How you migrate this computer vision workload to a distributed setup using Ray on Anyscale
In this tutorial, you start with a small PyTorch-based image classification task (training a ResNet-18 on a 10% slice of the Food-101 dataset) and progressively migrate it into a fully distributed, fault-tolerant training job using **Ray Train on Anyscale**. The goal is to show exactly how to scale *your existing workflow* without rewriting it from scratch.

Here’s how you do it:

1. **Preprocess data and persist it in a distributed-friendly format**  
   You take raw images from Hugging Face’s `food101` dataset, apply `torchvision` resizing and center-cropping, and serialize them to **Parquet** using `pyarrow`. The Parquet files are written to the **Anyscale cluster’s shared storage volume** (`/mnt/cluster_storage`), so every worker on every node can read them without duplication or sync issues.

2. **Create a lightweight PyTorch `Dataset` for Parquet ingestion**  
   Instead of using Ray Data or Hugging Face `Dataset`, you implement a custom `Food101Dataset` that reads directly from the Parquet files. This gives you control over how rows and row groups are read. While this isn’t yet fully distributed, it simulates a real-world scenario where a developer starts with something simple before optimizing. **Note:** this tutorial uses PyTorch-style data loading to demonstrate (1) low-level control in a PyTorch-native environment and (2) how to move pre-existing PyTorch code into a distributed Anyscale environment. Other tutorials in this module incorporate Ray Data, so you can see how the two approaches differ.

3. **Integrate Ray Train into the training loop**  
   You encapsulate your existing PyTorch training logic in a `train_loop_per_worker()` function, which Ray Train executes on each worker (typically one per GPU). Inside this loop, you:

   - Wrap the model with `prepare_model()` to make it compatible with distributed data parallelism.  
   - Wrap the `DataLoader` with `prepare_data_loader()` to enable device placement and Ray worker context handling.  
   - Add a `torch.utils.data.DistributedSampler` to each `DataLoader`, so that **data is correctly sharded across workers**—each worker only processes a unique subset of the training and validation datasets.  
   - As required by the `DistributedSampler`, call `sampler.set_epoch(epoch)` at the start of each epoch so the data is reshuffled correctly.
   - Use Ray’s `Checkpoint` API to save and resume from checkpoints as needed.  
   - Report training and validation metrics with `train.report()` after each epoch.

4. **Launch training with `TorchTrainer` on an Anyscale cluster**  
   You instantiate a `TorchTrainer` that runs:
   - With `num_workers=8` and `use_gpu=True` (for example, across 8 A10 or A100 GPUs on Anyscale).  
   - With `RunConfig` that sets checkpoint retention and auto-resume (with `max_failures=3`).  
   - On infrastructure that's provisioned and scheduled by Anyscale with no manual Ray cluster setup required.  

   Once launched, Ray automatically handles:
   - Multi-node orchestration  
   - Worker assignment and device pinning  
   - Failure recovery and retry logic  
   - Checkpointing and logging

5. **Validate fault tolerance**  
   You run `trainer.fit()` a second time. If a failure or manual intervention interrupted the previous run, Ray resumes from the latest checkpoint. This demonstrates **real-world robustness** without any manual checkpoint management or scripting.

6. **Launch distributed GPU inference tasks**  
   At the end, you define a Ray remote function (`@ray.remote(num_gpus=1)`) that loads the best checkpoint and runs inference on a single image from the validation set. You run this task on one GPU from the cluster.
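
To see why step 3 needs both a `DistributedSampler` and the `set_epoch` call, the pure-Python sketch below imitates the sharding idea: every worker builds the same epoch-seeded permutation and keeps a strided slice of it. This is a simplified imitation for intuition, not PyTorch's implementation, and `shard_indices` is a hypothetical helper.

```python
import random


def shard_indices(dataset_size, num_workers, rank, epoch, seed=0):
    """Return the sample indices one worker would process in one epoch."""
    indices = list(range(dataset_size))
    # Seeding with seed + epoch gives all workers the same permutation
    # within an epoch and a different permutation in the next epoch.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so every worker gets the same count.
    remainder = len(indices) % num_workers
    if remainder:
        indices += indices[: num_workers - remainder]
    # Strided slice: worker `rank` keeps every num_workers-th index.
    return indices[rank::num_workers]


# 16 samples split across 4 workers: 4 disjoint indices per worker.
shards = [shard_indices(16, 4, rank, epoch=0) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

If the epoch never changed, each worker would see the same samples in the same order every epoch, which is the situation `sampler.set_epoch(epoch)` avoids.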

All of this runs inside a **managed Anyscale workspace**. You don’t need to start or SSH into clusters, track node IPs, or configure NCCL. The entire setup is **declarative and self-contained in this notebook**, and can be re-run or scaled up by changing a single parameter (`num_workers`).
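
A minimal sketch of the trainer configuration from step 4 might look like the following. Only `num_workers=8`, `use_gpu=True`, and `max_failures=3` come from the text above; the checkpoint retention count, the run name, and the use of `/mnt/cluster_storage` as the checkpoint location are assumptions, and `build_trainer` is a hypothetical helper.

```python
# Only num_workers, use_gpu, and max_failures come from the tutorial text;
# the remaining values are illustrative assumptions.
TRAINER_SETTINGS = {
    "num_workers": 8,
    "use_gpu": True,
    "max_failures": 3,
    "num_to_keep": 2,                        # assumed retention count
    "storage_path": "/mnt/cluster_storage",  # assumed checkpoint location
    "run_name": "food101_resnet18",          # hypothetical run name
}


def build_trainer(train_loop_per_worker, train_loop_config, settings=TRAINER_SETTINGS):
    """Assemble a TorchTrainer from a plain settings dictionary."""
    # Imported here so this cell can be defined without Ray installed.
    from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config=train_loop_config,
        scaling_config=ScalingConfig(
            num_workers=settings["num_workers"],
            use_gpu=settings["use_gpu"],
        ),
        run_config=RunConfig(
            name=settings["run_name"],
            storage_path=settings["storage_path"],
            checkpoint_config=CheckpointConfig(num_to_keep=settings["num_to_keep"]),
            failure_config=FailureConfig(max_failures=settings["max_failures"]),
        ),
    )
```

Scaling up would then amount to passing something like `{**TRAINER_SETTINGS, "num_workers": 16}` as `settings` and calling `.fit()` on the returned trainer.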

This tutorial mirrors how many ML teams operate in practice: starting with a working PyTorch training loop and migrating it to the cloud without rewriting core logic. With Ray Train on Anyscale, the migration is clean, incremental, and production-ready.
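
To make the migration concrete, here is a hedged sketch of how a per-worker training loop could combine the pieces described above. It is not the notebook's actual loop: the `build_model` and `build_train_dataset` factories and the `lr`, `batch_size`, and `epochs` keys are hypothetical entries in `train_loop_config`, the checkpoint file name `state.pt` is an assumption, and validation is omitted for brevity.

```python
import os
import tempfile


def train_loop_per_worker(config):
    # Imported here so this cell can be defined without Ray or PyTorch
    # installed; on the cluster workers the imports resolve normally.
    import ray.train
    import torch
    from ray.train import Checkpoint
    from ray.train.torch import prepare_data_loader, prepare_model
    from torch.utils.data import DataLoader, DistributedSampler

    ctx = ray.train.get_context()

    # Hypothetical factories passed in through train_loop_config.
    model = prepare_model(config["build_model"]())
    train_ds = config["build_train_dataset"]()
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    criterion = torch.nn.CrossEntropyLoss()

    # Each worker reads a disjoint shard of the training set.
    sampler = DistributedSampler(
        train_ds,
        num_replicas=ctx.get_world_size(),
        rank=ctx.get_world_rank(),
        shuffle=True,
    )
    loader = prepare_data_loader(
        DataLoader(train_ds, batch_size=config["batch_size"], sampler=sampler)
    )

    # Resume from the latest checkpoint if Ray Train provides one.
    start_epoch = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"), map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        sampler.set_epoch(epoch)  # reshuffle shards for this epoch
        model.train()
        running_loss, num_batches = 0.0, 0
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            num_batches += 1

        metrics = {"epoch": epoch, "train_loss": running_loss / max(num_batches, 1)}
        with tempfile.TemporaryDirectory() as tmp_dir:
            checkpoint = None
            if ctx.get_world_rank() == 0:  # only rank 0 writes the checkpoint
                state = {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch,
                }
                torch.save(state, os.path.join(tmp_dir, "state.pt"))
                checkpoint = Checkpoint.from_directory(tmp_dir)
            # Every worker calls report; only rank 0 attaches a checkpoint.
            ray.train.report(metrics, checkpoint=checkpoint)
```

The inner forward, backward, and optimizer steps are unchanged PyTorch; the Ray-specific additions are confined to the setup, the resume block, and the reporting call. This sketch saves and reloads the state of the prepared (possibly DDP-wrapped) model, which assumes the run resumes with the same worker configuration.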