## Preprocess

To train a classifier that predicts the class of an input image, you first need to convert the classes to labels (unique integers). Before you do this, apply the same data ingestion and preprocessing as in the previous notebook.

In [None]:
def add_class(row):
    row["class"] = row["path"].rsplit("/", 3)[-2]
    return row
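
As a quick illustration of what `add_class` extracts, the sketch below applies the same `rsplit` logic to a made-up path. The path and breed name are hypothetical; the only assumption is that the class is the name of the image's parent directory, as the function above implies.

```python
def add_class(row):
    # The class is the name of the image file's parent directory.
    row["class"] = row["path"].rsplit("/", 3)[-2]
    return row


# Hypothetical row; real rows come from `read_images(..., include_paths=True)`.
example_row = {"path": "doggos-dataset/train/border_collie/example_001.jpg"}
example_row = add_class(example_row)
print(example_row["class"])
```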


In [None]:
# Preprocess data splits.
train_ds = ray.data.read_images(
    "s3://doggos-dataset/train", include_paths=True, shuffle="files"
)
train_ds = train_ds.map(add_class)
val_ds = ray.data.read_images("s3://doggos-dataset/val", include_paths=True)
val_ds = val_ds.map(add_class)


Define a `Preprocessor` class that:
- creates an embedding for each image. Because the embedding layer's weights are frozen, a later step moves the embedding layer outside of the model, so the embeddings are computed once here instead of repeatedly in the model's forward pass, which saves unnecessary compute.
- converts the classes into labels for the classifier.
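
The compute saving from embedding ahead of time can be sketched with a toy example. Everything here is an assumption for illustration: `frozen_embed` is a trivial stand-in for a frozen embedding model (it is not CLIP), and the "images" are made-up lists of numbers. The point is only that a frozen embedder returns the same output for the same input, so it needs to run once per image rather than once per image per epoch.

```python
stats = {"embed_calls": 0}


def frozen_embed(pixels):
    """Toy stand-in for a frozen embedding model (hypothetical, not CLIP)."""
    stats["embed_calls"] += 1
    return [sum(pixels) / len(pixels), max(pixels)]


images = [[0, 10, 20], [5, 5, 5], [30, 0, 0], [1, 2, 3]]  # made-up "images"
num_epochs = 3

# Embed once during preprocessing ...
cached_embeddings = [frozen_embed(img) for img in images]

# ... then every epoch iterates over the cached embeddings only.
for _ in range(num_epochs):
    for embedding in cached_embeddings:
        pass  # the classifier's training step would consume `embedding` here

calls_with_cache = stats["embed_calls"]
calls_without_cache = len(images) * num_epochs  # embedding inside each forward pass
print(calls_with_cache, calls_without_cache)
```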

While you could do this step as a simple operation, organizing it as a class lets you save the preprocessor and load it for inference later.

In [None]:
def convert_to_label(row, class_to_label):
    if "class" in row:
        row["label"] = class_to_label[row["class"]]
    return row
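
A small sketch of how `convert_to_label` behaves, using a mapping built with the same `enumerate` pattern that `Preprocessor.fit` uses. The class names are hypothetical, and sorting them is only an assumption to make the example reproducible. Rows without a `class` key pass through unchanged, which is the situation at inference time.

```python
def convert_to_label(row, class_to_label):
    if "class" in row:
        row["label"] = class_to_label[row["class"]]
    return row


# Hypothetical class names; sorted only so the example is reproducible.
classes = sorted(["pug", "beagle", "collie"])
class_to_label = {tag: i for i, tag in enumerate(classes)}

# A training row has a class, so it gets a label.
labeled = convert_to_label({"class": "collie"}, class_to_label)

# A row without a class is returned unchanged.
unlabeled = convert_to_label({"path": "some/image.jpg"}, class_to_label)

print(class_to_label)
print(labeled)
print(unlabeled)
```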


In [None]:
import json

import numpy as np
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor
from doggos.embed import EmbedImages


In [None]:
class Preprocessor:
    """Maps classes to integer labels and embeds images."""

    def __init__(self, class_to_label=None):
        self.class_to_label = class_to_label or {}  # Avoid a mutable default argument.
        self.label_to_class = {v: k for k, v in self.class_to_label.items()}

    def fit(self, ds, column):
        self.classes = ds.unique(column=column)
        self.class_to_label = {tag: i for i, tag in enumerate(self.classes)}
        self.label_to_class = {v: k for k, v in self.class_to_label.items()}
        return self

    def transform(self, ds, concurrency=4, batch_size=64, num_gpus=1):
        ds = ds.map(
            convert_to_label,
            fn_kwargs={"class_to_label": self.class_to_label},
        )
        ds = ds.map_batches(
            EmbedImages,
            fn_constructor_kwargs={
                "model_id": "openai/clip-vit-base-patch32",
                "device": "cuda",
            },
            concurrency=concurrency,
            batch_size=batch_size,
            num_gpus=num_gpus,
            accelerator_type="T4",
        )
        ds = ds.drop_columns(["image"])
        return ds

    def save(self, fp):
        with open(fp, "w") as f:
            json.dump(self.class_to_label, f)
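
The class above defines `save`, but this section doesn't show a matching loader. The following is a minimal sketch of the round trip, assuming the mapping is stored as plain JSON exactly as `save` writes it. `save_mapping` and `load_mapping` are hypothetical helper names, not part of the original code, and the class names are made up.

```python
import json
import os
import tempfile


def save_mapping(class_to_label, fp):
    # Mirrors `Preprocessor.save`: write the mapping as JSON.
    with open(fp, "w") as f:
        json.dump(class_to_label, f)


def load_mapping(fp):
    # Hypothetical counterpart: read the mapping and rebuild the inverse.
    with open(fp) as f:
        class_to_label = json.load(f)
    label_to_class = {v: k for k, v in class_to_label.items()}
    return class_to_label, label_to_class


class_to_label = {"beagle": 0, "collie": 1, "pug": 2}  # made-up classes

with tempfile.TemporaryDirectory() as tmp_dir:
    fp = os.path.join(tmp_dir, "class_to_label.json")
    save_mapping(class_to_label, fp)
    restored, label_to_class = load_mapping(fp)

print(restored)
print(label_to_class)
```

Because the class names are the JSON keys and the integer labels are the values, the mapping survives the round trip with its types intact. A loaded mapping could then be passed to `Preprocessor(class_to_label=...)`.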


In [None]:
# Preprocess.
preprocessor = Preprocessor()
preprocessor = preprocessor.fit(train_ds, column="class")
train_ds = preprocessor.transform(ds=train_ds)
val_ds = preprocessor.transform(ds=val_ds)


2025-08-22 00:26:17,487	INFO dataset.py:3057 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-08-22 00:26:17,490	INFO logging.py:295 -- Registered dataset logger for dataset dataset_72_0
2025-08-22 00:26:17,522	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_72_0. Full logs are in /tmp/ray/session_2025-08-21_18-48-13_464408_2298/logs/ray-data
2025-08-22 00:26:17,523	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_72_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(add_class)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


2025-08-22 00:26:29,748	INFO streaming_executor.py:231 -- ✔️  Dataset dataset_72_0 execution finished in 12.22 seconds


<div class="alert alert-block alert"> <b> Data processing</b> 

See this extensive guide on [data loading and preprocessing](https://docs.ray.io/en/latest/train/user-guides/data-loading-preprocessing.html) for the last-mile preprocessing you need to do prior to training your models. Ray Data also supports performant joins, filters, aggregations, and similar operations for the more structured data processing your workloads may need.

In [None]:
import os
import shutil


In [None]:
# Write processed data to cloud storage.
preprocessed_data_path = os.path.join(
    "/mnt/cluster_storage", "doggos/preprocessed_data"
)
if os.path.exists(preprocessed_data_path):  # Clean up.
    shutil.rmtree(preprocessed_data_path)
preprocessed_train_path = os.path.join(preprocessed_data_path, "preprocessed_train")
preprocessed_val_path = os.path.join(preprocessed_data_path, "preprocessed_val")
train_ds.write_parquet(preprocessed_train_path)
val_ds.write_parquet(preprocessed_val_path)


2025-08-22 00:26:30,402	INFO logging.py:295 -- Registered dataset logger for dataset dataset_80_0
2025-08-22 00:26:30,433	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_80_0. Full logs are in /tmp/ray/session_2025-08-21_18-48-13_464408_2298/logs/ray-data
2025-08-22 00:26:30,435	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_80_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(add_class)->Map(convert_to_label)] -> ActorPoolMapOperator[MapBatches(EmbedImages)] -> TaskPoolMapOperator[MapBatches(drop_columns)->Write]


(autoscaler +25s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +25s) [autoscaler] [4xT4:48CPU-192GB] Attempting to add 1 node to the cluster (increasing from 0 to 1).
(autoscaler +30s) [autoscaler] [4xT4:48CPU-192GB|g4dn.12xlarge] [us-west-2a] [on-demand] Launched 1 instance.
(autoscaler +1m15s) [autoscaler] Cluster upscaled to {104 CPU, 8 GPU}.


(_MapWorker pid=3320, ip=10.0.4.102) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(MapBatches(drop_columns)->Write pid=44781, ip=10.0.171.239) FilenameProvider have to provide proper filename template including '{{i}}' macro to ensure unique filenames when writing multiple files. Appending '{{i}}' macro to the end of the file. For more details on the expected filename template checkout PyArrow's `write_to_dataset` API

(autoscaler +3m10s) [autoscaler] [8CPU-32GB] Attempting to add 1 node to the cluster (increasing from 0 to 1).
(autoscaler +3m10s) [autoscaler] [8CPU-32GB|m5.2xlarge] [us-west-2a] [on-demand] Launched 1 instance.


2025-08-22 00:29:24,485	INFO streaming_executor.py:231 -- ✔️  Dataset dataset_83_0 execution finished in 13.88 seconds
2025-08-22 00:29:24,531	INFO dataset.py:4621 -- Data sink Parquet finished. 720 rows a

<div class="alert alert-block alert"> <b> Store often, save compute</b> 

Store the preprocessed data into shared cloud storage to:
- save a record of what this preprocessed data looks like
- avoid triggering the entire preprocessing for each batch the model processes
- avoid calling [`materialize`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.materialize.html) on the preprocessed data, because you shouldn't force large data to fit in memory