## 01 · Imports  
Before you start, gather every library you rely on throughout this notebook. Pull in standard-library utilities for file handling, Matplotlib and Pillow for plotting and image handling, PyTorch and TorchVision for deep-learning components, Ray Train for distributed orchestration, Hugging Face Datasets for quick data access, and PyArrow plus Pandas for fast Parquet I/O. Importing everything up-front keeps the rest of the tutorial clean and predictable.

In [None]:
# 01. Imports

# ————————————————————————
# Standard Library Utilities
# ————————————————————————
import os, io, tempfile, shutil  # file I/O and temp dirs
import json                      # reading/writing configs
import random, uuid              # randomness and unique IDs

# ————————————————————————
# Core Data & Storage Libraries
# ————————————————————————
import pandas as pd              # tabular data handling
import numpy as np               # numerical ops
import pyarrow as pa             # in-memory columnar format
import pyarrow.parquet as pq     # reading/writing Parquet files
from tqdm import tqdm            # progress bars

# ————————————————————————
# Image Handling & Visualization
# ————————————————————————
from PIL import Image
import matplotlib.pyplot as plt  # plotting loss curves, images

# ————————————————————————
# PyTorch + TorchVision Core
# ————————————————————————
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms as T
from torchvision.models import resnet18
from torchvision.transforms import Compose, Resize, CenterCrop

# ————————————————————————
# Ray Train: Distributed Training Primitives
# ————————————————————————
import ray
import ray.train as train
from ray.train.torch import (
    prepare_model,
    prepare_data_loader,
    TorchTrainer,
)
from ray.train import (
    ScalingConfig,
    RunConfig,
    FailureConfig,
    CheckpointConfig,
    Checkpoint,
    get_checkpoint,
    get_context,
)

# ————————————————————————
# Dataset Access
# ————————————————————————
from datasets import load_dataset  # Hugging Face Datasets


### 02 · Load 10 % of Food-101  
Next, grab roughly 7,500 images, the first 10% of the Food-101 training split, using a single call to `load_dataset`. This trimmed subset trains quickly while still being large enough to demonstrate Ray’s scaling behavior.

In [None]:
# 02. Load 10% of food101 (~7,500 images)
ds = load_dataset("food101", split="train[:10%]") 

### 03 · Resize and Encode Images  
Here you preprocess each image: resize it so the shorter side is 256 pixels, center-crop to 224 × 224 pixels (the input size expected by most ImageNet models), and then encode the result as Joint Photographic Experts Group (JPEG) bytes. By storing bytes instead of full Python Imaging Library (PIL) objects, you keep the dataset compact and Parquet-friendly.

In [None]:
# 03. Resize + encode as JPEG bytes
transform = Compose([Resize(256), CenterCrop(224)])
records = []

for example in tqdm(ds, desc="Preprocessing images", unit="img"):
    try:
        img = transform(example["image"])
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        records.append({
            "image_bytes": buf.getvalue(),
            "label": example["label"]
        })
    except Exception:
        # Skip images that fail to transform or encode as JPEG
        continue
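
The size arithmetic behind `Resize(256)` followed by `CenterCrop(224)` can be sketched in plain Python. This is an illustration only: the helper names `resized_size` and `center_crop_box` are hypothetical, and the rounding rules are an assumption based on how TorchVision is understood to handle an integer size (shorter side scaled to the target, longer side truncated, crop offsets rounded).

```python
# Sketch of the size arithmetic for Resize(256) + CenterCrop(224).
# Assumption: mirrors TorchVision's behavior for an integer `size` argument.

def resized_size(width: int, height: int, target: int = 256) -> tuple:
    """Scale so the shorter side equals `target`, preserving aspect ratio."""
    if width <= height:
        return target, int(target * height / width)
    return int(target * width / height), target


def center_crop_box(width: int, height: int, crop: int = 224) -> tuple:
    """Return the (left, top, right, bottom) box of a centered square crop."""
    left = int(round((width - crop) / 2.0))
    top = int(round((height - crop) / 2.0))
    return left, top, left + crop, top + crop


w, h = resized_size(512, 384)   # a landscape 512x384 input
box = center_crop_box(w, h)     # region kept by the center crop
print((w, h), box)
```

The takeaway is that only the shorter side is guaranteed to be 256 pixels after the resize; the center crop is what makes every output exactly 224 × 224.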

### 04 · Visual Sanity Check  
Before committing to hours of training, take nine random samples and plot them with their class names. This quick inspection lets you verify that each label matches its image and that the images are resized correctly.

In [None]:
# 04. Visualize the dataset

label_names = ds.features["label"].names  # maps int → string

samples = random.sample(records, 9)

fig, axs = plt.subplots(3, 3, figsize=(8, 8))
fig.suptitle("Sample Resized Images from food101-lite", fontsize=16)

for ax, rec in zip(axs.flatten(), samples):
    img = Image.open(io.BytesIO(rec["image_bytes"]))
    label_name = label_names[rec["label"]]
    ax.imshow(img)
    ax.set_title(label_name)
    ax.axis("off")

plt.tight_layout()
plt.show()
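
A numeric check can complement the visual one. The sketch below counts records per class name; `toy_records`, `toy_label_names`, and `class_counts` are hypothetical stand-ins that mimic the shape of `records` and `label_names`, so running it doesn't overwrite the real data.

```python
from collections import Counter

# Hypothetical toy data in the same shape as `records` and `label_names`.
toy_records = [
    {"image_bytes": b"", "label": 0},
    {"image_bytes": b"", "label": 0},
    {"image_bytes": b"", "label": 1},
]
toy_label_names = ["apple_pie", "baby_back_ribs"]


def class_counts(records, label_names):
    """Count how many records fall under each class name."""
    return dict(Counter(label_names[r["label"]] for r in records))


counts = class_counts(toy_records, toy_label_names)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

Calling `class_counts(records, label_names)` on the real data would show how many classes the 10% slice covers and how evenly they are represented.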

### 05 · Persist to Parquet  
Now write the images and labels to a Parquet file. Because Parquet is columnar, you can read just the columns you need during training, which speeds up I/O, especially when multiple workers read in parallel under Ray.

In [None]:
# 05. Write Dataset to Parquet

output_dir = "/mnt/cluster_storage/food101_lite/parquet_256"
os.makedirs(output_dir, exist_ok=True)

table = pa.Table.from_pydict({
    "image_bytes": [r["image_bytes"] for r in records],
    "label": [r["label"] for r in records]
})
pq.write_table(table, os.path.join(output_dir, "shard_0.parquet"))

print(f"Wrote {len(records)} records to {output_dir}")