## Evaluation

This tutorial concludes by evaluating the trained model on the test dataset. Evaluation is essentially the same as a batch inference workload: you apply the model to batches of data and then calculate metrics from the predictions and the true labels. Ray Data is heavily optimized for throughput, so preserving row order isn't a priority. For evaluation, however, each prediction must stay paired with its true label. Achieve this by preserving the entire row and adding the predicted label as another column to each row.
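
As a minimal sketch of the row-preserving idea, the toy example below uses plain Python dictionaries in place of Ray Data batches. The `toy_predict` function, the `feature` column, and the threshold "model" are hypothetical stand-ins, not part of this tutorial's code. Because each output batch still carries its own `label` column, the order in which batches arrive doesn't affect the label and prediction pairing.

```python
import random


def toy_predict(batch):
    """Return a copy of the batch with a "prediction" column added."""
    out = dict(batch)  # keep every existing column
    # Hypothetical stand-in model: predict 1 when the feature is positive.
    out["prediction"] = [1 if x > 0 else 0 for x in batch["feature"]]
    return out


# Two small column-oriented batches, similar in shape to a dict batch.
batches = [
    {"feature": [-2.0, 3.0], "label": [0, 1]},
    {"feature": [-1.0, 4.0], "label": [0, 1]},
]

# Simulate batches arriving out of order.
random.Random(0).shuffle(batches)

outputs = [toy_predict(b) for b in batches]

# Labels and predictions stay paired because they live in the same rows.
pairs = [
    (label, pred)
    for out in outputs
    for label, pred in zip(out["label"], out["prediction"])
]
```

Each `(label, prediction)` pair comes from a single row, so no join or re-sort against the original dataset is needed before computing metrics.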

In [None]:
import json
import os
from urllib.parse import urlparse

from sklearn.metrics import multilabel_confusion_matrix


In [None]:
class TorchPredictor:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model
        self.model.eval()

    def __call__(self, batch, device="cuda"):
        self.model.to(device)
        batch["prediction"] = self.model.predict(collate_fn(batch))
        return batch

    def predict_probabilities(self, batch, device="cuda"):
        self.model.to(device)
        predicted_probabilities = self.model.predict_probabilities(collate_fn(batch))
        batch["probabilities"] = [
            {
                self.preprocessor.label_to_class[i]: float(prob)
                for i, prob in enumerate(probabilities)
            }
            for probabilities in predicted_probabilities
        ]
        return batch

    @classmethod
    def from_artifacts_dir(cls, artifacts_dir):
        with open(os.path.join(artifacts_dir, "class_to_label.json"), "r") as fp:
            class_to_label = json.load(fp)
        preprocessor = Preprocessor(class_to_label=class_to_label)
        model = ClassificationModel.load(
            args_fp=os.path.join(artifacts_dir, "args.json"),
            state_dict_fp=os.path.join(artifacts_dir, "model.pt"),
        )
        return cls(preprocessor=preprocessor, model=model)


In [None]:
# Load and preprocess the eval dataset.
artifacts_dir = urlparse(best_run.artifact_uri).path
predictor = TorchPredictor.from_artifacts_dir(artifacts_dir=artifacts_dir)
test_ds = ray.data.read_images("s3://doggos-dataset/test", include_paths=True)
test_ds = test_ds.map(add_class)
test_ds = predictor.preprocessor.transform(ds=test_ds)


In [None]:
# y_pred (batch inference).
pred_ds = test_ds.map_batches(
    predictor,
    concurrency=4,
    batch_size=64,
    num_gpus=1,
    accelerator_type="T4",
)
pred_ds.take(1)


2025-08-22 00:34:12,802	INFO logging.py:295 -- Registered dataset logger for dataset dataset_96_0
2025-08-22 00:34:12,814	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_96_0. Full logs are in /tmp/ray/session_2025-08-21_18-48-13_464408_2298/logs/ray-data
2025-08-22 00:34:12,815	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_96_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(add_class)->Map(convert_to_label)] -> ActorPoolMapOperator[MapBatches(EmbedImages)] -> TaskPoolMapOperator[MapBatches(drop_columns)] -> TaskPoolMapOperator[MapBatches(TorchPredictor)] -> LimitOperator[limit=1]


2025-08-22 00:34:50,050	INFO streaming_executor.py:231 -- ✔️  Dataset dataset_96_0 execution finished in 37.23 seconds


[{'path': 'doggos-dataset/test/basset/basset_10005.jpg',
  'class': 'basset',
  'label': 2,
  'embedding': array([ 8.86104554e-02, -5.89382686e-02,  1.15464866e-01,  2.15815112e-01,
         -3.43266308e-01, -3.35150540e-01,  1.48883224e-01, -1.02369718e-01,
         -1.69915810e-01,  4.34856862e-03,  2.41593361e-01,  1.79200619e-01,
          4.34402555e-01,  4.59785998e-01,  1.59284808e-02,  4.16959971e-01,
          5.20779848e-01,  1.86366066e-01, -3.43496174e-01, -4.00813907e-01,
         -1.15213782e-01, -3.04853529e-01,  1.77998394e-01,  1.82090014e-01,
         -3.56360346e-01, -2.30711952e-01,  1.69025257e-01,  3.78455579e-01,
          8.37044120e-02, -4.81875241e-02,  3.17967087e-01, -1.40099749e-01,
         -2.15949178e-01, -4.72761095e-01, -3.01893711e-01,  7.59940967e-02,
         -2.64865339e-01,  5.89084566e-01, -3.75831634e-01,  3.11807573e-01,
         -3.82964134e-01, -1.86417520e-01,  1.07007243e-01,  4.81416702e-01,
         -3.70819569e-01,  9.12090182e-01,  3.13

In [None]:
def batch_metric(batch):
    """Return per-class TN, FP, FN, and TP counts for one batch of predictions."""
    labels = batch["label"]
    preds = batch["prediction"]
    mcm = multilabel_confusion_matrix(labels, preds)
    tn, fp, fn, tp = [], [], [], []
    for i in range(mcm.shape[0]):
        tn.append(mcm[i, 0, 0])  # True negatives
        fp.append(mcm[i, 0, 1])  # False positives
        fn.append(mcm[i, 1, 0])  # False negatives
        tp.append(mcm[i, 1, 1])  # True positives
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}
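
To make the per-class counts concrete, here is a pure-Python sketch of the one-vs-rest counting that `batch_metric` reads out of each 2x2 matrix (laid out as `[[TN, FP], [FN, TP]]`). The helper name `one_vs_rest_counts` and its `classes` argument are hypothetical and only for illustration. Passing a fixed class list is an assumption of this sketch: it gives every batch the same number of rows in the same class order, which the scikit-learn function can also do through its `labels` parameter.

```python
def one_vs_rest_counts(labels, preds, classes):
    """Return per-class TN, FP, FN, and TP lists, one entry per class."""
    counts = {"TN": [], "FP": [], "FN": [], "TP": []}
    n = len(labels)
    for c in classes:
        # Treat class c as "positive" and every other class as "negative".
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        tn = n - tp - fp - fn
        counts["TN"].append(tn)
        counts["FP"].append(fp)
        counts["FN"].append(fn)
        counts["TP"].append(tp)
    return counts


# Toy batch: 4 samples, 3 classes, 2 correct predictions.
labels = [0, 1, 2, 2]
preds = [0, 2, 2, 1]
counts = one_vs_rest_counts(labels, preds, classes=[0, 1, 2])
print(counts)
# {'TN': [3, 2, 1], 'FP': [0, 1, 1], 'FN': [0, 1, 1], 'TP': [1, 0, 1]}
```

Every misclassified sample shows up twice: once as a false negative for its true class and once as a false positive for the predicted class.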


In [None]:
# Aggregated metrics after processing all batches.
metrics_ds = pred_ds.map_batches(batch_metric)
aggregate_metrics = metrics_ds.sum(["TN", "FP", "FN", "TP"])

# Aggregate the confusion matrix components across all batches.
tn = aggregate_metrics["sum(TN)"]
fp = aggregate_metrics["sum(FP)"]
fn = aggregate_metrics["sum(FN)"]
tp = aggregate_metrics["sum(TP)"]

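# Calculate micro-averaged metrics from the summed one-vs-rest counts.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
# This accuracy is computed over the per-class binary matrices, so it counts
# true negatives and reads higher than top-1 multiclass accuracy.
accuracy = (tp + tn) / (tp + tn + fp + fn)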


2025-08-22 00:34:50,290	INFO logging.py:295 -- Registered dataset logger for dataset dataset_99_0


2025-08-22 00:34:50,303	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_99_0. Full logs are in /tmp/ray/session_2025-08-21_18-48-13_464408_2298/logs/ray-data
2025-08-22 00:34:50,304	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_99_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[Map(add_class)->Map(convert_to_label)] -> ActorPoolMapOperator[MapBatches(EmbedImages)] -> TaskPoolMapOperator[MapBatches(drop_columns)] -> TaskPoolMapOperator[MapBatches(TorchPredictor)] -> TaskPoolMapOperator[MapBatches(batch_metric)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


2025-08-22 00:38:03,968	INFO streaming_executor.py:231 -- ✔️  Dataset dataset_99_0 execution finished in 193.66 seconds


In [None]:
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")
print(f"Accuracy: {accuracy:.2f}")


Precision: 0.84
Recall: 0.84
F1: 0.84
Accuracy: 0.98
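
Precision, recall, and F1 are identical here while accuracy is noticeably higher. The sketch below suggests why, under the assumption that each image has exactly one true class and one predicted class. The totals are hypothetical toy numbers, not results from this run, and `micro_metrics` is an illustrative helper rather than part of the tutorial code. In that single-label setting every error adds one false positive and one false negative, so the summed FP and FN match, and the accuracy formula is lifted by the many true negatives.

```python
def micro_metrics(tn, fp, fn, tp):
    """Compute micro-averaged metrics from summed one-vs-rest counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    total = tp + tn + fp + fn
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Accuracy over the per-class binary decisions (includes TN).
        "binary_accuracy": (tp + tn) / total if total > 0 else 0.0,
        # Each sample is a TP or an FN for its true class, so TP + FN
        # equals the number of samples in the single-label case.
        "top1_accuracy": tp / (tp + fn) if (tp + fn) > 0 else 0.0,
    }


# Hypothetical totals: 4 samples, 3 classes, 2 correct predictions.
m = micro_metrics(tn=6, fp=2, fn=2, tp=2)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

With these toy totals, precision, recall, F1, and top-1 accuracy all equal 0.50, while the binary accuracy is about 0.67. If a single headline number is needed, the top-1 figure is probably the more conservative one to report.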



**🚨 Note**: Reset this notebook using the **"🔄 Restart"** button located in the notebook's menu bar. This frees up all the variables, utils, etc. used in this notebook.