## 03 · Prepare Dataset for Ray Data  

Ray Data works well with tabular data stored in **columnar formats** such as Parquet.  
In this step we:  

- Convert the MNIST dataset into a **pandas DataFrame** with two columns:  
  * `"image"` → raw pixel values of each 28×28 image, stored as nested lists  
  * `"label"` → digit class (0–9)  
- Write the DataFrame to disk in **Parquet format** under `/mnt/cluster_storage/`.  

Parquet is efficient for both reading and distributed processing, making it a good fit for Ray Data pipelines.  

In [None]:
# 03. Convert MNIST dataset into Parquet for Ray Data

# Build a DataFrame with image arrays and labels
df = pd.DataFrame({
    "image": dataset.data.tolist(),   # raw image pixels (as lists)
    "label": dataset.targets.tolist() # digit labels 0–9 (plain Python ints, assuming targets is a tensor)
})

# Persist the dataset in Parquet format (columnar, efficient for Ray Data)
df.to_parquet("/mnt/cluster_storage/cifar10.parquet")

## 04 · Load Dataset into Ray Data  

Now that the training data is stored as Parquet, we can load it back into a **Ray Dataset**:  

- Use `ray.data.read_parquet()` to create a distributed Ray Dataset from the Parquet file.  
- Each row has two columns: `"image"` (raw pixel array) and `"label"` (digit class).  
- Ray Data splits the dataset into blocks that are **distributed across the Ray cluster**, so multiple workers can read and process it in parallel.  

This Ray Dataset will later be passed to the `TorchTrainer` for distributed training.  


In [None]:
# 04. Load the Parquet dataset into a Ray Dataset

# Read the Parquet file → creates a distributed Ray Dataset
train_ds = ray.data.read_parquet("/mnt/cluster_storage/cifar10.parquet")
