# Daft on Docker Compose
Multimodal-native DataFrame engine with a Rust core — from laptop to cluster with zero code changes.

## Setup

Start the Ray+Daft stack and launch Jupyter:

```bash
# 1. Build images
docker compose build

# 2. Start MinIO + Ray + App
docker compose up -d minio minio-setup ray-head app

# 3. Upload sample data
./scripts/upload-data.sh

# 4. Launch Jupyter Lab
docker compose exec app jupyter lab --ip 0.0.0.0 --port 8888 --allow-root --no-browser --notebook-dir=/app/notebook
```

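Then open http://localhost:8888 in your browser. If Jupyter asks for a token, copy it from the URL that the `jupyter lab` command prints on startup.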
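
If the page does not load right away, the service may still be starting. The sketch below polls a TCP port with the standard library until it accepts connections. `wait_for_port` is a hypothetical helper written for this guide, not part of Daft, Ray, or Jupyter.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (not executed here): wait_for_port("localhost", 8888)
```

Calling `wait_for_port("localhost", 8888)` returns `True` once Jupyter Lab is listening. The same check works for MinIO if the Compose file publishes port 9000 on the host, which is an assumption here.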

## What is Daft?

Daft is a **multimodal-native DataFrame engine** with a Rust core. Key concepts:

- **Lazy evaluation** — builds a query plan, executes on `.collect()` or `.show()`
- **Rust-native ops** — `url.download()`, `image.decode()`, `image.resize()` run in Rust, not Python
- **Streaming execution** — bounded memory via Swordfish, Daft's streaming execution engine
- **Seamless scaling** — same code runs locally or on a Ray cluster (`DAFT_RUNNER=ray`)
- **Class UDFs** — GPU models loaded once per worker, reused across batches

## Architecture

```
Daft Client (app) → Ray Backend (ray-head, GPU) → MinIO (S3)
```

Daft uses Ray as its distributed execution backend. Setting the `DAFT_RUNNER=ray` environment variable switches Daft to that backend; the notebook code itself does not change.

In [None]:
import os
import daft
from daft import col

# S3/MinIO configuration
io_config = daft.io.IOConfig(
    s3=daft.io.S3Config(
        endpoint_url=os.environ.get("AWS_ENDPOINT_URL", "http://minio:9000"),
        key_id=os.environ.get("AWS_ACCESS_KEY_ID", "minioadmin"),
        access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "minioadmin"),
        region_name="us-east-1",
    )
)
daft.set_planning_config(default_io_config=io_config)

print(f"Daft runner: {os.environ.get('DAFT_RUNNER', 'native')}")
print(f"Daft version: {daft.__version__}")
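
The configuration cell falls back silently to the MinIO defaults when a variable is unset. When debugging connection errors it can help to see which settings came from the environment. The sketch below reports only the source of each setting and never prints a secret. `setting_sources` is a hypothetical helper, and the fallback values are copied from the cell above.

```python
import os

# Fallbacks copied from the configuration cell above.
DEFAULTS = {
    "AWS_ENDPOINT_URL": "http://minio:9000",
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
}


def setting_sources(env=None):
    """Map each setting name to "environment" or "default"."""
    env = os.environ if env is None else env
    return {name: "environment" if name in env else "default" for name in DEFAULTS}


# Report where each value comes from without revealing the value itself.
for name, source in setting_sources().items():
    print(f"{name}: {source}")
```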

## Read Parquet from S3

In [None]:
taxi = daft.read_parquet("s3://lake/taxi/*.parquet")
taxi.schema()

In [None]:
taxi.show(5)

## DataFrame Operations

Daft's expression API supports filter, select, groupby, and aggregation — all lazily evaluated.

In [None]:
# Filter high-value trips
high_value = taxi.where((col("fare_amount") > 10.0) & (col("trip_distance") > 5.0))
high_value.select("trip_distance", "fare_amount", "tip_amount", "total_amount").show(10)

In [None]:
# Revenue by payment type
(
    taxi.groupby("payment_type")
    .agg(
        col("total_amount").sum().alias("total_revenue"),
        col("total_amount").count().alias("trip_count"),
        col("tip_amount").mean().alias("avg_tip"),
    )
    .sort(col("total_revenue"), desc=True)
    .show()
)

## Read Image Metadata

In [None]:
images = daft.read_parquet("s3://bucket/image_metadata.parquet")
images.schema()

In [None]:
images.show(5)

## Multimodal Pipeline — Rust-Native Ops

Download, decode, and resize images using Daft's built-in Rust operations.
Pillow is not needed for this step: the operations run in parallel on native Rust threads.

In [None]:
processed = (
    images.with_column("image_bytes", col("image_url").url.download())
    .with_column("image", col("image_bytes").image.decode())
    .with_column("resized", col("image").image.resize(224, 224))
)
print(f"Processed {processed.count_rows()} images")
processed.select("image_url", "resized").show(3)

## GPU Embedding — CLIP via Class UDF

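The `ImageEmbedder` class UDF loads `openai/clip-vit-base-patch32` **once per worker** and embeds each incoming batch of images, on the GPU when CUDA is available and on the CPU otherwise. Rows whose bytes are missing or fail to decode receive an all-zero 512-dimensional vector.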

In [None]:
import numpy as np


@daft.cls
class ImageEmbedder:
    """GPU-bound: CLIP model loaded once, encode batches of images."""

    def __init__(self):
        import logging
        import torch

        logging.getLogger("transformers").setLevel(logging.ERROR)
        from transformers import CLIPModel, CLIPProcessor

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(
            self.device
        )
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.model.eval()

    @daft.method.batch(
        return_dtype=daft.DataType.fixed_size_list(daft.DataType.float32(), 512)
    )
    def __call__(self, image_bytes_col):
        import io
        import torch
        from PIL import Image

        default_embedding = np.zeros(512, dtype=np.float32)
        embeddings = []
        for img_bytes in image_bytes_col.to_pylist():
            if img_bytes is None:
                embeddings.append(default_embedding)
                continue
            try:
                img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
                inputs = self.processor(images=img, return_tensors="pt").to(self.device)
                with torch.no_grad():
                    features = self.model.get_image_features(**inputs)
                embeddings.append(features[0].cpu().numpy().astype(np.float32))
            except Exception:
                embeddings.append(default_embedding)
        return np.array(embeddings, dtype=np.float32)


embedder = ImageEmbedder()

embedded = (
    images.with_column("image_bytes", col("image_url").url.download())
    .with_column("embedding", embedder(col("image_bytes")))
    .exclude("image_bytes")
)
embedded.show(3)
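
Because failed rows come back as all-zero vectors, downstream code usually needs to mask them out before comparing embeddings. The sketch below shows one way to do that with NumPy on embeddings that have already been brought back to the client. `valid_rows` and `cosine_similarity` are hypothetical helpers, and the 4-dimensional vectors are made-up stand-ins for the 512-dimensional CLIP output.

```python
import numpy as np


def valid_rows(embeddings: np.ndarray) -> np.ndarray:
    """Boolean mask that is False for all-zero (fallback) embeddings."""
    return np.any(embeddings != 0.0, axis=1)


def cosine_similarity(query: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of embeddings."""
    query = query / np.linalg.norm(query)
    rows = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return rows @ query


# Made-up vectors standing in for real CLIP embeddings.
batch = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],  # fallback row: download or decode failed
        [1.0, 1.0, 0.0, 0.0],
    ],
    dtype=np.float32,
)

mask = valid_rows(batch)
# Compare the first image against every row that has a real embedding.
scores = cosine_similarity(batch[0], batch[mask])
print(mask.tolist())
print(scores.round(3).tolist())
```

Masking first also avoids the division by zero that normalizing an all-zero row would cause.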

## Write and Verify

In [None]:
embedded.write_parquet("s3://bucket/notebook_embeddings/")
print("Written to s3://bucket/notebook_embeddings/")

# Read back
saved = daft.read_parquet("s3://bucket/notebook_embeddings/")
print(f"Read back {saved.count_rows():,} rows")
saved.show(5)

## Zero-Copy Interop — Daft to Arrow to Polars

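`to_arrow()` materializes a Daft DataFrame on the client as a PyArrow table. Daft's in-memory format is Arrow-based, so the conversion avoids copying where possible, and the resulting table can be passed to any Arrow-compatible library such as Polars.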

In [None]:
# Daft → Arrow → Polars (zero-copy where possible)
arrow_table = taxi.limit(1000).to_arrow()
print(f"Arrow table: {arrow_table.num_rows} rows, {arrow_table.num_columns} columns")

try:
    import polars as pl

    polars_df = pl.from_arrow(arrow_table)
    print(f"Polars DataFrame: {polars_df.shape}")
    print(polars_df.head(3))
except ImportError:
    print("Polars not installed — skipping interop demo")

## Cleanup

No explicit cleanup is needed inside the notebook: Daft uses the Ray cluster managed by Docker Compose.
Stop the stack with `docker compose down`.