## 3. Transforming Data

- Use either Ray tasks or Ray actors to transform datasets.
- Actors let you cache expensive state initialization (e.g., loading a model onto a GPU) and reuse it across batches.
- Ray Data simplifies general-purpose parallel GPU and CPU compute in Ray.

Here is a sample data pipeline that streams image data through a classification model and a segmentation model on a heterogeneous cluster of CPUs and GPUs.

<img src='https://docs.ray.io/en/releases-2.6.1/_images/stream-example.png' width=60%/>

To transform data, we can use the `map_batches` API. This API allows us to apply a transformation to each batch of data.

In [14]:
import numpy as np
from torchvision.transforms import Compose, ToTensor, Normalize


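def normalize(
    batch: dict[str, np.ndarray], min_: float, max_: float
) -> dict[str, np.ndarray]:
    # `min_` and `max_` are not used below; they only show how `fn_kwargs` reaches the function.
    # ToTensor scales uint8 pixels to [0, 1]; Normalize((0.5,), (0.5,)) then maps them to [-1, 1].
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    batch["image"] = [transform(image) for image in batch["image"]]
    return batch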


ds_normalized = ds.map_batches(normalize, fn_kwargs={"min_": 0, "max_": 255})
ds_normalized

2024-12-06 09:25:49,233	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:25:49,234	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]



MapBatches(normalize)
+- Dataset(
      num_rows=?,
      schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8), path: string}
   )
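
The arithmetic behind `normalize` can be checked with NumPy alone. The sketch below assumes the documented behavior of torchvision's `ToTensor` (scales `uint8` pixels from [0, 255] to [0, 1] and puts the channel dimension first) and of `Normalize((0.5,), (0.5,))` (computes `(x - 0.5) / 0.5`). `normalize_like_torchvision` and the sample image are hypothetical and are not part of the pipeline above.

```python
import numpy as np


def normalize_like_torchvision(image: np.ndarray) -> np.ndarray:
    """Mimic ToTensor() followed by Normalize((0.5,), (0.5,)) for one grayscale image."""
    scaled = image.astype(np.float32) / 255.0  # ToTensor: uint8 [0, 255] -> float [0, 1]
    chw = scaled[np.newaxis, :, :]  # ToTensor: (H, W) -> (1, H, W)
    return (chw - 0.5) / 0.5  # Normalize: (x - mean) / std


# Hypothetical 28x28 image that covers the full uint8 range.
image = np.linspace(0, 255, 28 * 28).astype(np.uint8).reshape(28, 28)
normalized = normalize_like_torchvision(image)
```

A pixel value of 0 maps to -1 and a pixel value of 255 maps to 1, which is the range the assertions in the next cell check for.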

### Execution mode

- Most transformations are **lazy**. 
- They **don't execute until you** write a dataset to storage or decide to **materialize** (or consume) the dataset.
- To materialize a very small subset of the data, you can use the **`take_batch`** method.

In [15]:
normalized_batch = ds_normalized.take_batch(batch_size=10)

for image in normalized_batch["image"]:
    assert image.shape == (1, 28, 28) # channel, height, width
    assert image.min() >= -1 and image.max() <= 1 # normalized to [-1, 1]

2024-12-06 09:26:40,187	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:26:40,188	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> LimitOperator[limit=10]
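
As a rough analogy for lazy execution, the plain-Python sketch below builds a pipeline that does no work until it is consumed. It does not use Ray, and every name in it is hypothetical. Ray Data processes data in blocks, so in a real pipeline more rows than requested may be transformed; the sketch only illustrates the "define now, run on consume" idea.

```python
from itertools import islice

processed: list[int] = []  # records every row the transformation actually touched


def transform(row: int) -> int:
    processed.append(row)
    return row * 2


rows = range(1_000)
lazy_pipeline = map(transform, rows)  # nothing runs yet, similar to a lazy Dataset
calls_before_consume = len(processed)

# Consuming a small subset triggers work only for the rows that are needed.
first_ten = list(islice(lazy_pipeline, 10))
calls_after_consume = len(processed)
```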



### Stateful transformations with actors

In cases like **batch inference**, you want to spin up a number of **actor processes** that are initialized once with your model and reused to process multiple batches.

To implement this, you can pass a callable class to the `map_batches` API. The class implements:

- `__init__`: Initialize any expensive state.
- `__call__`: Perform the stateful transformation.

For example, we can implement a `MNISTClassifier` that:
- loads a pre-trained model from a local file
- accepts a batch of images and generates the predicted label

In [16]:
import torch


class MNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to("cuda")

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch
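
The last step of `__call__` turns logits into labels with `np.argmax`. A minimal sketch with hypothetical logits (3 images and 4 classes, rather than the 10 MNIST classes) shows what `axis=1` does:

```python
import numpy as np

# Hypothetical logits: one row per image, one column per class.
logits = np.array(
    [
        [0.1, 2.3, -1.0, 0.5],
        [1.7, 0.2, 0.3, -0.4],
        [-0.6, 0.0, 0.9, 0.8],
    ]
)

# axis=1 picks the index of the largest logit within each row.
predicted_label = np.argmax(logits, axis=1)
```

The result has one entry per image, so it can be stored as a new column alongside the batch.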

In [17]:
# Download the model from S3 to EFS-backed cluster storage
!aws s3 cp s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt /mnt/cluster_storage/model.pt

download: s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt to ../../../mnt/cluster_storage/model.pt


We can now use the `map_batches` API to apply the transformation to each batch of data.

In [18]:
ds_preds = ds_normalized.map_batches(
    MNISTClassifier,
    fn_constructor_kwargs={"local_path": "/mnt/cluster_storage/model.pt"},
    num_gpus=0.1,
    concurrency=1,
    batch_size=100,
)

<div class="alert alert-block alert-warning">

<b>Note:</b> We pass the callable class itself, not an instance. Ray constructs it inside each actor, passing <code>fn_constructor_kwargs</code> to the constructor, when the transformation actually runs.

</div>
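
To illustrate the "construct once, call many times" pattern without Ray, here is a minimal plain-Python sketch. `AddOffset` and `run_like_an_actor` are hypothetical names used only for illustration; they are not how Ray implements actors.

```python
class AddOffset:
    """Hypothetical stateful transform following the __init__/__call__ pattern."""

    instances_created = 0

    def __init__(self, offset: int):
        # Stands in for expensive setup such as loading a model.
        type(self).instances_created += 1
        self.offset = offset

    def __call__(self, batch: dict[str, list[int]]) -> dict[str, list[int]]:
        batch["shifted"] = [value + self.offset for value in batch["value"]]
        return batch


def run_like_an_actor(cls, constructor_kwargs, batches):
    """Construct `cls` once, then reuse the same instance for every batch."""
    worker = cls(**constructor_kwargs)
    return [worker(batch) for batch in batches]


batches = [{"value": [1, 2]}, {"value": [3]}, {"value": [4, 5, 6]}]
results = run_like_an_actor(AddOffset, {"offset": 10}, batches)
```

The constructor runs once no matter how many batches flow through, which is why expensive setup belongs in `__init__` rather than in `__call__`.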

In [19]:
batch_preds = ds_preds.take_batch(100)

2024-12-06 09:32:17,604	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:32:17,605	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> LimitOperator[limit=100]



In [None]:
batch_preds

### Materializing Data

You can choose to materialize the entire dataset into the Ray object store, which is distributed across the cluster. The object store keeps data in memory first and spills to disk when memory runs out.

To materialize the dataset, we can use the `materialize()` method.

Use this **only** when you require the full dataset to compute downstream outputs.

In [20]:
ds_preds.materialize()

2024-12-06 09:35:32,187	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:35:32,187	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)]



MaterializedDataset(
   num_blocks=1,
   num_rows=500,
   schema={
      image: numpy.ndarray(shape=(1, 28, 28), dtype=float),
      path: string,
      predicted_label: int64
   }
)