# Intro to Ray Data

This notebook provides an overview of Ray Data and shows how to use it to load and transform data in a distributed manner.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 0:</b> Introduction</li>
    <li><b>Part 1:</b> When to use Ray Data</li>
    <li><b>Part 2:</b> Loading Data</li>
    <li><b>Part 3:</b> Transforming Data</li>
    <li><b>Part 4:</b> Data Operations: Grouping, Aggregation, and Shuffling</li>
    <li><b>Part 5:</b> Persisting Data</li>
</ul>
</div>


## 1. When to use Ray Data

- To **load and preprocess data for distributed ML workloads**.
- Datasets are the main abstraction.
- Compared to other loading solutions, Datasets are more **flexible** and provide [**higher overall performance**](https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant). 
- Especially performant when pre-processing must run in a **streaming fashion** across a **large dataset** on a **heterogeneous cluster of CPUs and GPUs**.
- Use as **last-mile bridge** from storage or ETL pipeline outputs **to distributed applications** and libraries in Ray. 

<img src='https://docs.ray.io/en/releases-2.34.0/_images/dataset-loading-1.svg' width=60%/>


## 2. Loading Data

- Ray Data uses Ray tasks to read data from remote storage.
- When reading from a file-based datasource (e.g., S3), it creates a number of read tasks proportional to the number of CPUs in the cluster.
- Each read task reads its assigned files and produces an output block.
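
To make the idea of read tasks concrete, here is a minimal plain-Python sketch of dividing a file listing among a fixed number of tasks. It is a toy model and not Ray Data's actual scheduling logic: the function `assign_files_to_read_tasks`, the task count of 8, and the file names are all assumptions made for illustration. Only the bucket prefix comes from this notebook.

```python
def assign_files_to_read_tasks(paths: list[str], num_tasks: int) -> list[list[str]]:
    """Split file paths into at most `num_tasks` contiguous, near-equal chunks."""
    num_tasks = max(1, min(num_tasks, len(paths)))
    base, extra = divmod(len(paths), num_tasks)
    chunks, start = [], 0
    for i in range(num_tasks):
        # The first `extra` tasks take one additional file each.
        size = base + (1 if i < extra else 0)
        chunks.append(paths[start:start + size])
        start += size
    return chunks


# Hypothetical file names; 10 classes with 50 images each, as in the dataset below.
prefix = "s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index"
paths = [f"{prefix}/{label}/img_{i}.png" for label in range(10) for i in range(50)]

tasks = assign_files_to_read_tasks(paths, num_tasks=8)
print([len(t) for t in tasks])  # [63, 63, 63, 63, 62, 62, 62, 62]
```

In this model each chunk would be handled by one read task, and each task would emit one block.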

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/dataset-read-cropped-v2.svg" width="500px">

Let's load some `MNIST` data from S3.

In [22]:
# Our dataset contains 50 images per class, one prefix per digit
!aws s3 ls s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/

                           PRE 0/
                           PRE 1/
                           PRE 2/
                           PRE 3/
                           PRE 4/
                           PRE 5/
                           PRE 6/
                           PRE 7/
                           PRE 8/
                           PRE 9/


We will use the `read_images` function to load the image data.

In [21]:
import ray

ds = ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", include_paths=True)
ds

2024-12-06 09:56:54,059	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:56:54,060	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]


Dataset(
   num_rows=?,
   schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8), path: string}
)

Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions.

### More about Datasets

- A Dataset consists of a list of Ray object references to **blocks**. 
- Having multiple blocks in a dataset allows for **parallel transformation and ingest**.

The following figure visualizes a **tabular dataset with three blocks, each holding 1000 rows**:

<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>

- A Dataset is just a list of Ray object references.
- Fits into Ray's general compute framework.
- It can be freely passed between Ray tasks, actors, and other libraries like any other object reference.
- This flexibility is a unique characteristic of Ray Datasets.
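
As a rough illustration of why multiple blocks enable parallelism, the sketch below models a dataset as a list of blocks and transforms the blocks concurrently. This is a toy model in plain Python, not Ray's implementation: threads stand in for Ray tasks, plain lists stand in for object references, and `transform_block` is a hypothetical transformation.

```python
from concurrent.futures import ThreadPoolExecutor

# Three blocks of 1000 rows each, mirroring the figure; rows are plain dicts here.
blocks = [[{"id": b * 1000 + i} for i in range(1000)] for b in range(3)]


def transform_block(block: list[dict]) -> list[dict]:
    """Apply a row-wise transformation to a single block."""
    return [{"id": row["id"], "double": row["id"] * 2} for row in block]


# Each block is independent, so the blocks can be processed at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    transformed = list(pool.map(transform_block, blocks))

print(len(transformed), sum(len(b) for b in transformed))  # 3 3000
```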

## 3. Transforming Data

- Use either Ray tasks or Ray actors to transform datasets. 
- Using actors allows for expensive state initialization (e.g., for GPU-based tasks) to be cached.
- Ray Data simplifies general purpose parallel GPU and CPU compute in Ray. 

Here is a sample data pipeline for streaming image data across a classification and segmentation model on a heterogeneous cluster of CPUs and GPUs.

<img src='https://docs.ray.io/en/releases-2.6.1/_images/stream-example.png' width=60%/>

To transform data, we can use the `map_batches` API. This API allows us to apply a transformation to each batch of data.

In [14]:
import numpy as np
from torchvision.transforms import Compose, ToTensor, Normalize


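def normalize(
    batch: dict[str, np.ndarray], min_: float, max_: float
) -> dict[str, np.ndarray]:
    # NOTE: min_ and max_ arrive via fn_kwargs but are unused here. ToTensor
    # already scales uint8 pixels to [0, 1], and Normalize maps that to [-1, 1].
    transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
    batch["image"] = [transform(image) for image in batch["image"]]
    return batch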


ds_normalized = ds.map_batches(normalize, fn_kwargs={"min_": 0, "max_": 255})
ds_normalized

2024-12-06 09:25:49,233	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:25:49,234	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles]


MapBatches(normalize)
+- Dataset(
      num_rows=?,
      schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8), path: string}
   )
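
The arithmetic behind the `normalize` transform can be checked by hand. The sketch below reproduces the two steps for single pixel values in plain Python: `ToTensor` divides a `uint8` pixel by 255, and `Normalize((0.5,), (0.5,))` then computes `(value - 0.5) / 0.5`. The helper names `to_unit_range` and `standardize` are made up for this illustration and are not part of torchvision.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic ToTensor's scaling of a uint8 pixel to [0, 1]."""
    return pixel / 255.0


def standardize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Mimic Normalize((0.5,), (0.5,)): (value - mean) / std."""
    return (value - mean) / std


# The darkest pixel maps to -1, the brightest to 1, and mid-gray lands near 0.
for pixel in (0, 128, 255):
    print(pixel, round(standardize(to_unit_range(pixel)), 4))
```

This is why normalized images are expected to fall in the range [-1, 1].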

### Execution mode

- Most transformations are **lazy**. 
- They **don't execute until you** write a dataset to storage or decide to **materialize** (or consume) the dataset.
- To materialize a very small subset of the data, you can use the **`take_batch`** method.
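
Lazy execution can be illustrated with plain Python generators. The sketch below is only an analogy, not how Ray Data is implemented: `read_rows` and `normalize_rows` are hypothetical stages, and Ray Data works on whole blocks, so it may process more rows than the number requested.

```python
from itertools import islice

calls = []  # records which rows were actually processed


def read_rows(n: int):
    for i in range(n):
        calls.append(("read", i))
        yield i


def normalize_rows(rows):
    for row in rows:
        calls.append(("normalize", row))
        yield row / 255


# Building the pipeline runs nothing yet.
pipeline = normalize_rows(read_rows(500))
assert calls == []

# Consuming 10 rows triggers only the work needed for those rows.
first_ten = list(islice(pipeline, 10))
print(len(first_ten), len(calls))  # 10 20
```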

In [15]:
normalized_batch = ds_normalized.take_batch(batch_size=10)

for image in normalized_batch["image"]:
    assert image.shape == (1, 28, 28) # channel, height, width
    assert image.min() >= -1 and image.max() <= 1 # normalized to [-1, 1]

2024-12-06 09:26:40,187	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:26:40,188	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> LimitOperator[limit=10]


### Stateful transformations with actors

In cases like **batch inference**, you want to spin up a number of **actor processes** that are initialized once with your model and reused to process multiple batches.

To implement this, you can pass a callable class to the `map_batches` API. The class implements:

- `__init__`: Initialize any expensive state.
- `__call__`: Perform the stateful transformation.

For example, we can implement a `MNISTClassifier` that:
- loads a pre-trained model from a local file
- accepts a batch of images and generates the predicted label

In [16]:
import torch


class MNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to("cuda")

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

In [17]:
# Download the model from S3 to EFS storage
!aws s3 cp s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt /mnt/cluster_storage/model.pt

download: s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt to ../../../mnt/cluster_storage/model.pt


We can now use the `map_batches` API to apply the transformation to each batch of data.

In [18]:
ds_preds = ds_normalized.map_batches(
    MNISTClassifier,
    fn_constructor_kwargs={"local_path": "/mnt/cluster_storage/model.pt"},
    num_gpus=0.1,
    concurrency=1,
    batch_size=100,
)

<div class="alert alert-block alert-warning">

<b>Note:</b> We pass the callable class itself, not an instance. Ray Data instantiates it with the arguments from <code>fn_constructor_kwargs</code> when the transformation actually runs.

</div>
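
The pattern of passing a class plus constructor arguments, and letting the framework build the instance later, can be sketched in plain Python. Everything here is hypothetical and only mimics the idea: `CountingModel`, its `scale` argument, and `run_with_single_worker` are illustrative names, not Ray APIs.

```python
class CountingModel:
    """Stand-in for a model wrapper; `scale` is a hypothetical constructor argument."""

    instances_created = 0

    def __init__(self, scale: int):
        # Expensive setup (such as loading model weights) would go here.
        CountingModel.instances_created += 1
        self.scale = scale

    def __call__(self, batch: dict) -> dict:
        batch["output"] = [x * self.scale for x in batch["input"]]
        return batch


def run_with_single_worker(cls, constructor_kwargs: dict, batches: list[dict]) -> list[dict]:
    """Instantiate `cls` once, then reuse the instance for every batch."""
    worker = cls(**constructor_kwargs)
    return [worker(batch) for batch in batches]


batches = [{"input": [1, 2]}, {"input": [3]}, {"input": [4, 5, 6]}]
results = run_with_single_worker(CountingModel, {"scale": 10}, batches)
print(CountingModel.instances_created, results[2]["output"])  # 1 [40, 50, 60]
```

The constructor runs once even though three batches are processed, which is the benefit of keeping expensive state in a long-lived worker.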

In [19]:
batch_preds = ds_preds.take_batch(100)

2024-12-06 09:32:17,604	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:32:17,605	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> LimitOperator[limit=100]


In [None]:
batch_preds

### Materializing Data

You can materialize the entire dataset into the Ray object store, which is distributed across the cluster. Blocks are held primarily in memory and spill to disk when memory runs short.

To materialize the dataset, we can use the `materialize()` method.

Use this **only** when you require the full dataset to compute downstream outputs.

In [20]:
ds_preds.materialize()

2024-12-06 09:35:32,187	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:35:32,187	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)]


MaterializedDataset(
   num_blocks=1,
   num_rows=500,
   schema={
      image: numpy.ndarray(shape=(1, 28, 28), dtype=float),
      path: string,
      predicted_label: int64
   }
)

## 4. Data Operations: Grouping, Aggregation, and Shuffling

Let's look at some more involved transformations.

### Custom batching using **`groupby`**

In case you want to generate batches according to a **specific key**, you can use **`groupby` to group the data** by the key and then use **`map_groups`** to apply the transformation.

For instance, let's compute the accuracy of the model by **"ground truth label"**.

In [23]:
def add_label(batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    batch["ground_truth_label"] = [int(path.split("/")[-2]) for path in batch["path"]]
    return batch


def compute_accuracy(group: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    return {
        "accuracy": [np.mean(group["predicted_label"] == group["ground_truth_label"])],
        "ground_truth_label": group["ground_truth_label"][:1],
    }


ds_preds.map_batches(add_label).groupby("ground_truth_label").map_groups(compute_accuracy).to_pandas()

2024-12-06 09:59:09,827	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 09:59:09,828	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> TaskPoolMapOperator[MapBatches(add_label)] -> AllToAllOperator[Sort] -> TaskPoolMapOperator[MapBatches(compute_accuracy)]


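|   | accuracy | ground_truth_label |
|---|----------|--------------------|
| 0 | 0.98 | 0 |
| 1 | 1.0 | 1 |
| 2 | 1.0 | 2 |
| 3 | 0.96 | 3 |
| 4 | 1.0 | 4 |
| 5 | 1.0 | 5 |
| 6 | 0.98 | 6 |
| 7 | 0.98 | 7 |
| 8 | 1.0 | 8 |
| 9 | 0.98 | 9 |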
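
The same per-label computation can be written in plain Python, which may help clarify what `add_label` and `compute_accuracy` do. This is a sketch on made-up data: the file names and predictions below are hypothetical, and only the convention that the parent directory name is the digit label comes from the dataset layout.

```python
from collections import defaultdict


def label_from_path(path: str) -> int:
    """The parent directory name is the ground-truth digit."""
    return int(path.split("/")[-2])


# Hypothetical rows: file names and predictions are made up for illustration.
rows = [
    {"path": "mnist/50_per_index/3/a.png", "predicted_label": 3},
    {"path": "mnist/50_per_index/3/b.png", "predicted_label": 8},
    {"path": "mnist/50_per_index/7/c.png", "predicted_label": 7},
    {"path": "mnist/50_per_index/7/d.png", "predicted_label": 7},
]

# Group hit/miss flags by ground-truth label, then average within each group.
groups = defaultdict(list)
for row in rows:
    truth = label_from_path(row["path"])
    groups[truth].append(row["predicted_label"] == truth)

accuracy = {label: sum(hits) / len(hits) for label, hits in sorted(groups.items())}
print(accuracy)  # {3: 0.5, 7: 1.0}
```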


<div class="alert alert-block alert-warning">

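<b>Note:</b> The execution plan logged above shows the full pipeline running again. <code>materialize()</code> returns a new <code>MaterializedDataset</code> rather than caching <code>ds_preds</code> in place, so its result must be assigned (for example, <code>ds_preds = ds_preds.materialize()</code>) for later operations to reuse the materialized blocks.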

</div>

### Aggregations

Ray Data also supports a **variety of aggregations**. These aggregations can be **chained**.

For instance, we can compute the **mean** accuracy across the entire dataset.

In [24]:
ds_preds.map_batches(add_label).map_batches(compute_accuracy).mean(on="accuracy")

2024-12-06 10:01:43,102	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 10:01:43,103	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> TaskPoolMapOperator[MapBatches(add_label)->MapBatches(compute_accuracy)] -> AllToAllOperator[Aggregate] -> LimitOperator[limit=1]


0.988
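
Note that this pipeline averages one accuracy value per batch rather than one value per row. The two results agree when every batch holds the same number of rows, but they can differ otherwise. The following plain-Python sketch, using made-up prediction and label pairs, shows the difference:

```python
def batch_accuracy(batch: list[tuple[int, int]]) -> float:
    """Fraction of (prediction, truth) pairs that match."""
    return sum(p == t for p, t in batch) / len(batch)


# Hypothetical (prediction, truth) pairs split into batches of unequal size.
batches = [
    [(1, 1), (2, 2), (3, 3), (4, 4)],  # 4 of 4 correct
    [(5, 0)],                          # 0 of 1 correct
]

mean_of_batch_accuracies = sum(batch_accuracy(b) for b in batches) / len(batches)
all_pairs = [pair for b in batches for pair in b]
overall_accuracy = batch_accuracy(all_pairs)
print(mean_of_batch_accuracies, overall_accuracy)  # 0.5 0.8
```

When batch sizes vary, weighting each batch accuracy by its row count recovers the overall accuracy.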

As of version 2.34.0, Ray Data provides the following aggregation functions:

- `count`
- `max`
- `mean`
- `min`
- `sum`
- `std`

See the [aggregation API docs](https://docs.ray.io/en/latest/data/api/grouped_data.html#ray.data.aggregate.AggregateFn) for details.

### Shuffling data 

Ray Data offers **different options to shuffle data**, with varying degrees of randomness and performance.

#### File-based shuffle on read

To randomly **shuffle the ordering of input files before reading**, call a read function that supports shuffling, such as `read_images()`, and pass the `shuffle="files"` parameter.

In [25]:
ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", shuffle="files")

2024-12-06 10:04:12,645	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 10:04:12,645	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadImage]


Dataset(
   num_rows=500,
   schema={image: numpy.ndarray(shape=(28, 28), dtype=uint8)}
)

#### Shuffling block order

- This option randomizes the **order of blocks in a dataset**. 
- Blocks are the **basic unit** of data that Ray Data stores in the object store.
- Applying this operation alone **doesn’t involve heavy computation and communication**. 
- However, it requires Ray Data to **materialize all blocks in memory** before applying the operation. 
- Only use this option when your dataset is small enough to fit into the object store memory.

To perform block order shuffling, use `randomize_block_order`.

In [26]:
ds_randomized_blocks = ds_preds.randomize_block_order()
ds_randomized_blocks.materialize()

2024-12-06 10:05:33,949	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 10:05:33,950	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> AllToAllOperator[RandomizeBlockOrder]


MaterializedDataset(
   num_blocks=1,
   num_rows=500,
   schema={
      image: numpy.ndarray(shape=(1, 28, 28), dtype=float),
      path: string,
      predicted_label: int64
   }
)

#### Shuffle rows globally

- To randomly **shuffle all rows globally**, call `random_shuffle()`. 
- This is the **slowest option** for shuffling, and requires **transferring data across the network between workers**.
- This option achieves the **best randomness** among all options.
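
The difference between the three shuffle levels can be sketched in plain Python. This is a toy model under simplifying assumptions, not Ray Data's implementation: each hypothetical "file" becomes exactly one block, and a seeded `random.Random` stands in for the shuffle logic.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Four "files" of three rows each; in this toy model one file becomes one block.
files = [[f"f{f}_r{r}" for r in range(3)] for f in range(4)]

# 1. File shuffle: permute the file order before reading; rows keep their order.
file_shuffled = rng.sample(files, k=len(files))

# 2. Block-order shuffle: permute whole blocks; rows inside a block stay together.
block_shuffled = rng.sample(files, k=len(files))

# 3. Global shuffle: permute every row, mixing rows across blocks.
all_rows = [row for block in files for row in block]
global_shuffled = rng.sample(all_rows, k=len(all_rows))

print(file_shuffled[0], global_shuffled[:3])
```

In the first two cases rows from the same file or block always stay adjacent, which is why only the global shuffle removes correlations between neighboring rows.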


In [27]:
ds_randomized_rows = ds_preds.random_shuffle()
ds_randomized_rows.materialize()

2024-12-06 10:06:32,993	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 10:06:32,993	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> AllToAllOperator[RandomShuffle]


MaterializedDataset(
   num_blocks=1,
   num_rows=500,
   schema={
      image: numpy.ndarray(shape=(1, 28, 28), dtype=float),
      path: string,
      predicted_label: int64
   }
)

## 5. Persisting Data

Finally, you can persist a dataset to storage using any of the "write" functions that Ray Data supports.

Let's write our predictions to a Parquet dataset.

In [28]:
ds_preds.write_parquet("/mnt/cluster_storage/mnist_preds")

2024-12-06 10:09:17,311	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-12-06_07-31-15_546129_2446/logs/ray-data
2024-12-06 10:09:17,312	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ExpandPaths] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(normalize)] -> ActorPoolMapOperator[MapBatches(MNISTClassifier)] -> TaskPoolMapOperator[Write]


Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of write functions.

In [None]:
# cleanup
!rm -rf /mnt/cluster_storage/mnist_preds

### Outlook: Ray Data in Production

1. Runway AI is using Ray Data to scale its ML workloads. See [this interview with Runway AI](https://siliconangle.com/2024/10/02/runway-transforming-ai-driven-filmmaking-innovative-tools-techniques-raysummit/) to learn more.
2. Netflix is using Ray Data for multi-modal inference pipelines. See [this talk at the Ray Summit 2024](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog/session/1722028596844001bCg0) to learn more.