# Ray Data on Docker Compose
Streaming GPU inference with Ray Data, which is designed to keep GPUs saturated and to schedule CPU and GPU stages in a single pipeline.

## Setup

Start the Ray stack and launch Jupyter:

```bash
# 1. Build images
docker compose build

# 2. Start MinIO + Ray + App
docker compose up -d minio minio-setup ray-head app

# 3. Upload sample data
./scripts/upload-data.sh

# 4. Launch Jupyter Lab
docker compose exec app jupyter lab --ip 0.0.0.0 --port 8888 --allow-root --no-browser --notebook-dir=/app/notebook
```

Then open http://localhost:8888 in your browser.

## What is Ray Data?

Ray Data is a streaming data framework designed for **GPU-heavy ML workloads**. Key concepts:

- **Datasets** — distributed, streaming collections of Arrow-backed rows
- **map_batches** — the core operation: apply a function to batches of data
- **ActorPoolStrategy** — persistent GPU workers with model loaded once per actor
- **Streaming execution** — bounded memory, backpressure-aware
- **Heterogeneous scheduling** — CPU preprocessing → GPU inference seamlessly
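
The streaming and backpressure idea from the list above can be illustrated without Ray. The following is only a conceptual sketch using the standard library, not how Ray Data is implemented: a producer feeds a bounded buffer and blocks whenever the slower consumer falls behind, so only a few items are in flight at a time. The function name `run_streaming` and the `capacity` parameter are hypothetical.

```python
import queue
import threading


def run_streaming(items, capacity=2):
    """Push items through a bounded buffer; return (results, peak buffer size)."""
    buffer = queue.Queue(maxsize=capacity)  # bounded: put() blocks when full
    sentinel = object()                     # marks the end of the stream
    results = []
    peak = 0

    def producer():
        for item in items:
            buffer.put(item)  # blocks while the consumer is behind (backpressure)
        buffer.put(sentinel)

    thread = threading.Thread(target=producer)
    thread.start()
    while True:
        peak = max(peak, buffer.qsize())  # never exceeds `capacity`
        item = buffer.get()
        if item is sentinel:
            break
        results.append(item * 2)  # stand-in for the slow downstream stage
    thread.join()
    return results, peak


results, peak = run_streaming(range(10), capacity=2)
print(results, peak)
```

However long the input is, the buffer never holds more than `capacity` items, which is the property that keeps memory bounded in a streaming pipeline.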

## Architecture

```
Client (app) → Ray Head (GPU execution) → MinIO (S3 storage)
```

The app connects to the Ray cluster as a client. Ray schedules tasks on the head node (or workers). Data reads/writes go through MinIO.
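
The notebook reads `s3://` paths without passing a filesystem, so the Compose environment presumably supplies the MinIO endpoint and credentials. If that configuration is missing, one option is to build the filesystem explicitly. This is a sketch under assumptions: the `S3_ENDPOINT` variable, the `minio:9000` address, and the `minioadmin` fallback credentials are guesses based on the `minio` service name and MinIO defaults, not values taken from this repository. The helper names are hypothetical.

```python
import os


def minio_s3_options(env=None):
    """Build keyword arguments for pyarrow.fs.S3FileSystem (assumed defaults)."""
    env = os.environ if env is None else env
    return {
        "access_key": env.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        # Assumption: MinIO is reachable under its Compose service name.
        "endpoint_override": env.get("S3_ENDPOINT", "minio:9000"),
        "scheme": "http",  # assumption: no TLS inside the Compose network
    }


def read_taxi():
    """Read the taxi dataset through an explicitly configured filesystem."""
    # Imported lazily so minio_s3_options() works without Ray or PyArrow installed.
    import ray
    from pyarrow import fs

    filesystem = fs.S3FileSystem(**minio_s3_options())
    return ray.data.read_parquet("s3://lake/taxi/", filesystem=filesystem)
```

The same `filesystem` argument is accepted by the other `ray.data.read_*` functions and by `write_parquet`.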

In [None]:
import ray

ray.init("ray://ray-head:10001")

resources = ray.cluster_resources()
print("Cluster resources:")
for k, v in sorted(resources.items()):
    print(f"  {k}: {v}")
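
The resource dictionary printed above can also be checked programmatically before the GPU section runs. This is a minimal sketch: `require_gpus` is a hypothetical helper, and it relies on `ray.cluster_resources()` reporting GPUs under the `"GPU"` key.

```python
def require_gpus(resources, needed=1):
    """Return the available GPU count, or raise if fewer than `needed` exist."""
    available = resources.get("GPU", 0)  # key is absent on CPU-only clusters
    if available < needed:
        raise RuntimeError(
            f"Need {needed} GPU(s) but the cluster reports {available}."
        )
    return available


# In the notebook this would be called as: require_gpus(ray.cluster_resources())
print(require_gpus({"CPU": 8.0, "GPU": 1.0}))
```

Failing early here gives a clearer error than a `map_batches` call that waits indefinitely for a GPU that never becomes available.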

## Read Tabular Data

In [None]:
ds = ray.data.read_parquet("s3://lake/taxi/")

print(f"Schema: {ds.schema()}")
print(f"Count: {ds.count():,}")
ds.show(5)

## Basic Transformations

`map_batches` applies a function to each batch. CPU transforms need no special configuration.
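
To make the batch shape concrete, here is a plain-Python sketch of what a batch function receives: a dictionary that maps each column name to that column's values for a group of rows. This is an illustration only, not Ray's implementation. In recent Ray versions the column values are NumPy arrays rather than lists. The `to_batches` helper and the `total` column are hypothetical.

```python
def to_batches(rows, batch_size):
    """Group row dicts into columnar batches of the form {column: [values]}."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}


def add_total(batch):
    """Batch function: derive a new column from two existing columns."""
    batch["total"] = [
        fare + tip for fare, tip in zip(batch["fare_amount"], batch["tip_amount"])
    ]
    return batch


rows = [
    {"fare_amount": 10.0, "tip_amount": 2.0},
    {"fare_amount": 20.0, "tip_amount": 5.0},
    {"fare_amount": 8.0, "tip_amount": 0.0},
]
batches = [add_total(batch) for batch in to_batches(rows, batch_size=2)]
print(batches)
```

Because the function sees whole columns at once, vectorized operations apply naturally, which is what the NumPy-based transform in the next cell does.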

In [None]:
import numpy as np


def add_tip_pct(batch):
    """Add tip percentage column."""
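    fare = np.asarray(batch["fare_amount"], dtype=float)
    tip = np.asarray(batch["tip_amount"], dtype=float)
    # Divide only where fare > 0; other rows keep 0.0. This avoids the
    # divide-by-zero warnings that np.where(fare > 0, tip / fare, ...) triggers.
    batch["tip_pct"] = np.divide(
        tip * 100, fare, out=np.zeros_like(fare), where=fare > 0
    )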
    return batch


transformed = ds.map_batches(add_tip_pct)
transformed.select_columns(["fare_amount", "tip_amount", "tip_pct"]).show(10)

## Read Images

In [None]:
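# mode="RGB" gives every image three channels, which the HWC -> CHW permute
# and ResNet-50 in the next section expect.
images = ray.data.read_images("s3://bucket/images/", mode="RGB")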
print(f"Image count: {images.count()}")
images.show(2)

## GPU Inference with ActorPoolStrategy

The `ImageClassifier` loads ResNet-50 **once per actor** and reuses it across batches.
This avoids the cost of loading a model for every batch.
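
The load-once pattern can be seen in isolation, without Ray or a GPU. In this sketch a counter stands in for the expensive model load. `CountingModel`, `run_with_actor`, and `run_stateless` are hypothetical names used only for illustration.

```python
class CountingModel:
    """Callable class whose constructor stands in for an expensive model load."""

    loads = 0  # class-level counter of how many times __init__ ran

    def __init__(self):
        CountingModel.loads += 1
        self.scale = 2  # stand-in for model weights

    def __call__(self, batch):
        return {"value": [v * self.scale for v in batch["value"]]}


def run_with_actor(batches):
    """Actor-style: construct once, then call once per batch."""
    model = CountingModel()
    return [model(batch) for batch in batches]


def run_stateless(batches):
    """Task-style: construct a fresh model for every batch."""
    return [CountingModel()(batch) for batch in batches]


batches = [{"value": [1, 2]}, {"value": [3]}, {"value": [4, 5]}]
run_with_actor(batches)
print("loads with one long-lived instance:", CountingModel.loads)
```

Both functions return the same results. The difference is how often the constructor runs, which is the cost that passing a class to `map_batches` with an actor pool avoids.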

In [None]:
import torch
from torchvision.models import ResNet50_Weights, resnet50


class ImageClassifier:
    """Stateful GPU actor — model loaded ONCE, reused for all batches."""

    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.weights = ResNet50_Weights.DEFAULT
        self.model = resnet50(weights=self.weights).to(self.device).eval()
        self.preprocess = self.weights.transforms()
        self.categories = self.weights.meta["categories"]
        print(f"[ImageClassifier] ResNet-50 loaded on {self.device}")

    def __call__(self, batch: dict) -> dict:
        tensors = torch.stack(
            [
                self.preprocess(torch.from_numpy(img).permute(2, 0, 1))
                for img in batch["image"]
            ]
        ).to(self.device)

        with torch.no_grad():
            logits = self.model(tensors)

        top_idx = logits.argmax(dim=1).cpu().numpy()
        return {
            "prediction": [self.categories[i] for i in top_idx],
            "confidence": logits.softmax(dim=1).max(dim=1).values.cpu().numpy(),
        }


predictions = images.map_batches(
    ImageClassifier,
    compute=ray.data.ActorPoolStrategy(size=1),
    num_gpus=1,
    batch_size=32,
)

## Inspect Predictions

In [None]:
predictions.show(10)

# Class distribution
pdf = predictions.to_pandas()
print("\nTop-10 predicted classes:")
print(pdf["prediction"].value_counts().head(10).to_string())

print(
    f"\nConfidence — avg: {pdf['confidence'].mean():.4f}, "
    f"min: {pdf['confidence'].min():.4f}, "
    f"max: {pdf['confidence'].max():.4f}"
)
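
The `confidence` column is the largest softmax probability for each image, so it always lies between 0 and 1. The arithmetic can be sketched in plain Python for a single row of logits. The three-class example and the `top_prediction` helper are hypothetical, since ResNet-50 produces 1000 logits per image.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, categories):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return categories[idx], probs[idx]


label, confidence = top_prediction([2.0, 1.0, 0.1], ["cat", "dog", "bird"])
print(label, round(confidence, 3))
```

A value near 1 means one class dominates. A value near `1 / number_of_classes` means the logits are nearly uniform.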

## Write Results

In [None]:
predictions.write_parquet("s3://bucket/notebook_predictions/")
print("Written to s3://bucket/notebook_predictions/")

# Read back to verify
saved = ray.data.read_parquet("s3://bucket/notebook_predictions/")
print(f"Read back {saved.count():,} rows")
saved.show(5)

## Cleanup

In [None]:
ray.shutdown()
print("Ray disconnected.")