# Introduction to Ray Data: Ray Data + Structured Data
© 2025, Anyscale. All Rights Reserved

💻 **Launch Locally**: You can run this notebook locally, but performance will be reduced.

🚀 **Launch on Cloud**: A Ray Cluster (Click [here](http://console.anyscale.com/register) to easily start a Ray cluster on Anyscale) is recommended to run this notebook.

This notebook provides an overview of Ray Data and shows how to use it to load, transform, and write data in a distributed manner.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 0:</b> What is Ray Data?</li>
    <li><b>Part 1:</b> How to use Ray Data?</li>
    <li><b>Part 2:</b> Loading Data</li>
    <li><b>Part 3:</b> Transforming Data</li>
    <li><b>Part 4:</b> Writing Data</li>
    <li><b>Part 5:</b> Data Operations: Shuffling, Grouping and Aggregation</li>
    <li><b>Part 6:</b> When to use Ray Data</li>
    <li><b>Part 7:</b> Ray Data in Production</li>
    <li><b>Part 8:</b> Upcoming Features in Ray Data</li>
</ul>
</div>

**Imports**

In [None]:
import ray
import pandas as pd

## 0. What is Ray Data?

Ray Data is a distributed data processing library that provides a Python API for parallel data processing. 

It is built on top of Ray, a fast and simple framework for building and running distributed applications. Ray Data is designed to be easy to use, scalable, and fault-tolerant.

## 1. How to Use Ray Data?

A typical Ray Data workflow has three steps:

1. **Create a Ray Dataset** from external storage or in-memory data.
2. **Apply transformations** to the data.
3. **Write the outputs** to external storage or **feed the outputs** to training workers.


## 2. Loading Data

Our dataset is the New York City Taxi & Limousine Commission's Trip Record Data.

**Dataset features**

| Column | Description | 
| ------ | ----------- |
| `trip_distance` | Trip distance in miles (float). |
| `passenger_count` | Number of passengers. |
| `PULocationID` | TLC Taxi Zone in which the taximeter was engaged. |
| `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged. |
| `payment_type` | Numeric code signifying how the passenger paid for the trip. |
| `tolls_amount` | Total amount of all tolls paid during the trip. |
| `tip_amount` | Tip amount. Automatically populated for credit card tips; cash tips are not included. |
| `total_amount` | Total amount charged to passengers. Does not include cash tips. |


In [None]:
COLUMNS = [
    "trip_distance",
    "passenger_count",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "tolls_amount",
    "tip_amount",
    "total_amount",
]

DATA_PATH = "s3://anyscale-public-materials/nyc-taxi-cab"

Let's read the data for a single month with pandas. This can take up to 2 minutes.

In [None]:
df = pd.read_parquet(
    f"{DATA_PATH}/yellow_tripdata_2011-05.parquet",
    columns=COLUMNS,
)

df.head()

Let's check how much memory the DataFrame uses, in MiB.

In [None]:
df.memory_usage(deep=True).sum() / 1024**2

Let's check how many files there are in the dataset.

In [None]:
!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --human-readable | wc -l

Even though we load only a subset of the columns, a single file already consumes about 1 GB of memory. This quickly becomes a problem if we want to scale to the entire dataset (~155 files) on a small node.
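As a back-of-the-envelope check, the approximate figures quoted above can be multiplied out. This is only a rough estimate: it assumes every file is about as large as the one we loaded, and the 32 GB node size is a hypothetical value chosen for illustration.

```python
# Rough estimate based on the approximate figures quoted in the text.
per_file_gb = 1.0      # ~1 GB in memory per monthly file (selected columns only)
num_files = 155        # approximate number of files in the dataset
node_memory_gb = 32    # hypothetical memory budget of a single small node

total_gb = per_file_gb * num_files
fits_on_one_node = total_gb <= node_memory_gb

print(f"Estimated in-memory size: ~{total_gb:.0f} GB")
print(f"Fits on one node: {fits_on_one_node}")
```

Under these assumptions the full dataset is several times larger than the memory of a single small node, which is the motivation for distributing the load.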

Let's instead make use of a distributed data preprocessing library like Ray Data to load the full dataset in a distributed manner.

In [None]:
ds = ray.data.read_parquet(
    DATA_PATH,
    columns=COLUMNS,
)

Ray Data provides equivalents of common pandas read functions, such as `read_csv`, `read_parquet`, and `read_json`.

Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions.

### Dataset

Let's view our dataset.

In [None]:
ds

By default, Ray Data adopts **lazy execution**: the data is not loaded into memory until it is needed. Only a small part of the dataset is read up front to infer the schema.

A Dataset specifies a sequence of transformations that will be applied to the data. 
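To illustrate what "a sequence of transformations" executed lazily means, here is a minimal pure-Python sketch. It is a toy model, not Ray Data's implementation: the `LazyPlan` class, the `read_rows` source, and the `stats` counter are all hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, float]


class LazyPlan:
    """Toy stand-in for a lazily executed dataset (not Ray's implementation)."""

    def __init__(
        self,
        source: Callable[[], Iterable[Row]],
        steps: Optional[List[Callable[[Row], Row]]] = None,
    ) -> None:
        self._source = source
        self._steps = list(steps or [])

    def map(self, fn: Callable[[Row], Row]) -> "LazyPlan":
        # Only record the step and return a new plan; no rows are read here.
        return LazyPlan(self._source, self._steps + [fn])

    def take(self, n: int) -> List[Row]:
        # Execution happens only now, and only for the rows requested.
        out: List[Row] = []
        if n <= 0:
            return out
        for row in self._source():
            for fn in self._steps:
                row = fn(row)
            out.append(row)
            if len(out) == n:
                break
        return out


stats = {"rows_read": 0}


def read_rows() -> Iterable[Row]:
    # Pretend data source that counts how many rows were actually read.
    for i in range(1_000_000):
        stats["rows_read"] += 1
        yield {"trip_distance": float(i)}


plan = LazyPlan(read_rows).map(
    lambda r: {**r, "trip_distance_km": r["trip_distance"] * 1.609344}
)
rows_read_before = stats["rows_read"]
first_rows = plan.take(2)

print(rows_read_before)    # 0
print(stats["rows_read"])  # 2
```

Defining the plan reads nothing; asking for two rows reads exactly two rows from the source.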

The data itself will be organized into blocks, where each block is a collection of rows.

The following figure visualizes a tabular dataset with three blocks, each holding 1000 rows:

<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>

Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.

## 3. Transforming Data

Let's create a simple function to generate features from the data. Here is how we would do so using pandas:

In [None]:
def adjust_total_amount(df: pd.DataFrame) -> pd.DataFrame:
    df["adjusted_total_amount"] = df["total_amount"] - df["tip_amount"]
    return df

df = adjust_total_amount(df)

We can take the same function and apply it to the Ray dataset using `map_batches`. 

`map_batches` will batch each block of the dataset and apply the function to each batch in parallel.

In [None]:
ds_adjusted = ds.map_batches(adjust_total_amount, batch_format="pandas")
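To make the "batch, then apply in parallel" idea concrete, here is a minimal pure-Python sketch. It is a conceptual model only and does not reflect how Ray Data schedules work: `iter_batches` and `adjust_batch` are hypothetical helpers, rows are plain dictionaries rather than a DataFrame, and a thread pool stands in for Ray tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

Row = Dict[str, float]


def iter_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    # Slice the rows into consecutive batches of at most batch_size rows.
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def adjust_batch(batch: List[Row]) -> List[Row]:
    # Same logic as adjust_total_amount, applied to a list of row dicts.
    return [
        {**row, "adjusted_total_amount": row["total_amount"] - row["tip_amount"]}
        for row in batch
    ]


rows = [{"total_amount": 10.0 + i, "tip_amount": 2.0} for i in range(10)]
batches = list(iter_batches(rows, batch_size=4))

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map applies the function to each batch and preserves batch order.
    results = list(pool.map(adjust_batch, batches))

print([len(b) for b in results])                # [4, 4, 2]
print(results[0][0]["adjusted_total_amount"])   # 8.0
```

The function only ever sees one batch at a time, which is why the same function body can be reused unchanged whether it runs on one batch or on many in parallel.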

<div class="alert alert-block alert-warning">
<b>Note</b> 

The default `batch_format` in Ray Data is `numpy`, which means that each batch is passed to your function as a dictionary mapping column names to NumPy arrays. For optimal performance, it is recommended to **avoid converting the data to pandas DataFrames unless necessary**.

</div>

Let's add another transformation. For the sake of this example, we will compute the tip percentage.

In [None]:
def compute_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    df["tip_percentage"] = df["tip_amount"] / df["total_amount"]
    return df

df = compute_tip_percentage(df)

We would apply it again using `map_batches`. Note that we can control certain additional parameters such as the batch size to use.

In [None]:
ds_tip = ds_adjusted.map_batches(compute_tip_percentage, batch_format="pandas", batch_size=1024)

### Execution mode

Most transformations in Ray Data are **lazy**; that is, they don't execute until you either:
- **write a dataset to storage**,
- explicitly **materialize** the data, or
- **iterate over the dataset** (usually when feeding data to model training).

To explicitly *materialize* a very small subset of the data, you can use the `take_batch` method.

In [None]:
ds.take_batch()

Let's view a batch of the transformed data.

In [None]:
ds_tip.take_batch()

## 4. Writing Data

Let's write the adjusted data. Here is how we would do it with pandas:

In [None]:
storage_folder = '/mnt/cluster_storage' # Change this to a local folder if you run the notebook locally

In [None]:
df.to_parquet(f"{storage_folder}/adjusted_data.parquet")

Let's check the file we just wrote:

In [None]:
!ls -lh {storage_folder}/adjusted_data.parquet

Here is how we would do so with Ray Data:

In [None]:
!rm -rf {storage_folder}/adjusted_data_ray/ # remove the output directory if it already exists
ds_limited = ds_adjusted.limit(df.shape[0]) # we limit to avoid writing too much data
ds_limited.write_parquet(f"{storage_folder}/adjusted_data_ray/")

Ray Data provides equivalents of common pandas write functions, such as `write_parquet` for `to_parquet` and `write_csv` for `to_csv`.

See the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of write functions.

Let's check the files in the directory:

In [None]:
!ls -lh {storage_folder}/adjusted_data_ray/

Notice that we have **multiple files** in the directory. This is because Ray Data writes data in a **distributed manner**. 

**Each task writes its own file**, and the number of files is proportional to the number of CPUs in the cluster.
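The "one file per task" pattern can be sketched in plain Python. This is an illustration under stated assumptions, not Ray Data's writer: `write_block` is a hypothetical helper, the `part-00000.csv` naming scheme is invented for the example, CSV is used instead of Parquet to stay within the standard library, and the blocks are written sequentially rather than by parallel tasks.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List

Row = Dict[str, float]


def write_block(block: List[Row], out_dir: Path, index: int) -> Path:
    # Each block goes to its own file, named by the block's index.
    path = out_dir / f"part-{index:05d}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(block[0].keys()))
        writer.writeheader()
        writer.writerows(block)
    return path


blocks = [
    [{"trip_distance": 1.2, "total_amount": 9.5}],
    [{"trip_distance": 3.4, "total_amount": 15.0}],
    [{"trip_distance": 0.8, "total_amount": 6.25}],
]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for i, block in enumerate(blocks):
        write_block(block, out_dir, i)
    written = sorted(p.name for p in out_dir.iterdir())

print(written)  # ['part-00000.csv', 'part-00001.csv', 'part-00002.csv']
```

Because no writer needs to coordinate with another, the output is a directory of files rather than a single file.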

**Ray Data uses Ray tasks** to process data.

When reading from a file-based datasource (e.g., S3, GCS), each read task reads its assigned files and produces an output block, which in turn is consumed by the next task in the pipeline.
    
<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/rag-app/dataset-read-cropped-v2.svg" width="500px">

<div class="alert alert-block alert-warning">
<b>Note</b> 

We passed `/mnt/cluster_storage/` as the path to write the data. This is a path on the Ray cluster's shared storage. If instead you use a path that is only local to one of the nodes in a multi-node cluster, you will see errors like `FileNotFoundError: [Errno 2] No such file or directory: '/path/to/file'`.

This is because Ray Data is designed to work with distributed storage systems like S3, HDFS, etc. If you want to write to local storage, you can add the special prefix `local://` to the path, for example `local:///path/to/file`. However, to do so you will need to ensure that Ray is allowed to schedule and run tasks on the head node of the cluster.
</div>

## 5. Data Operations: Shuffling, Grouping and Aggregation

Let's look at some more involved transformations.

### Shuffling data 

Ray Data offers several ways to shuffle data, which trade off randomness against performance.

#### File-based shuffle on read

To randomly shuffle the ordering of input files before reading, use the `shuffle="files"` parameter.

In [None]:
ds_file_shuffled = ray.data.read_parquet(DATA_PATH, columns=COLUMNS, shuffle="files")

In [None]:
ds_file_shuffled

#### Shuffling block order
This option randomizes the order of blocks in a dataset.

Applying this operation alone doesn’t involve heavy computation and communication. However, it requires Ray Data to materialize all blocks before applying the operation.

Let's read the data and shuffle the block order.

In [None]:
ds = ray.data.read_parquet(
    f"{DATA_PATH}/yellow_tripdata_2011-05.parquet",
    columns=COLUMNS,
)

To perform block order shuffling, use `randomize_block_order`.

In [None]:
ds_block_based_shuffle = ds.randomize_block_order()
ds_block_based_shuffle.to_pandas()

#### Shuffle all rows globally
To randomly shuffle all rows globally, call `random_shuffle()`. This is the slowest shuffle option because it requires transferring data across the network between workers, but it achieves the best randomness of all the options.


In [None]:
ds_row_based_shuffle = ds.random_shuffle()

In [None]:
ds_row_based_shuffle.to_pandas()
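The difference between shuffling block order and shuffling all rows can be seen on a tiny example. The sketch below is a pure-Python model with made-up row ids and a fixed seed; it only illustrates the two granularities and says nothing about how Ray Data implements them.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # three blocks of row ids

# Block-order shuffle: permute whole blocks; rows inside a block stay together.
block_shuffled = blocks[:]
rng.shuffle(block_shuffled)

# Global shuffle: flatten, permute every row, then split into blocks again.
rows = [row for block in blocks for row in block]
rng.shuffle(rows)
row_shuffled = [rows[i:i + 3] for i in range(0, len(rows), 3)]

print(block_shuffled)
print(row_shuffled)
```

In the block-order result every block is still one of the original blocks, so neighbouring rows remain neighbours. In the global result any row can land next to any other, which is what makes it both more random and more expensive.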

### Custom batching using `groupby` and aggregations

In case you want to generate batches according to a specific key, you can use `groupby` to group the data by the key and then use `map_groups` to apply the transformation.

For instance, let's compute the average trip distance per payment type. Here is how we would do it with pandas:

In [None]:
df.groupby("payment_type")["trip_distance"].mean()

Here is how we would do the same operation with Ray Data:

In [None]:
num_cpus = 8
ds.repartition(num_cpus).groupby("payment_type").mean("trip_distance").to_pandas()

Here are the main aggregation functions available in Ray Data:
- count
- max
- mean
- min
- sum
- std

See the [grouped data API docs](https://docs.ray.io/en/latest/data/api/grouped_data.html#computations-or-descriptive-stats) for details.

<div class="alert alert-block alert-warning">

<b>Note:</b> This is an area of active development in Ray Data. The current implementation of groupby is not as optimized as it could be. We are working on improving the performance of `groupby` and `map_groups` operations.

In more detail, the current implementation relies on a sort operation, whereas a hash-based implementation could be used instead. Additionally, we had to repartition the data manually to maximize parallelism; in the future, Ray Data should be able to repartition the data dynamically.

</div>
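The note above contrasts sort-based and hash-based grouping. The following pure-Python sketch shows both approaches computing the same per-key mean on a few made-up rows. The function names `mean_by_key_sorted` and `mean_by_key_hashed` are hypothetical, and neither function is Ray Data's actual implementation.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from typing import Dict, List

rows = [
    {"payment_type": 1, "trip_distance": 2.0},
    {"payment_type": 2, "trip_distance": 1.0},
    {"payment_type": 1, "trip_distance": 4.0},
    {"payment_type": 2, "trip_distance": 3.0},
]


def mean_by_key_sorted(rows: List[dict], key: str, value: str) -> Dict[int, float]:
    # Sort-based: order rows by key so equal keys are adjacent, then scan once.
    ordered = sorted(rows, key=itemgetter(key))
    out = {}
    for k, group in groupby(ordered, key=itemgetter(key)):
        values = [r[value] for r in group]
        out[k] = sum(values) / len(values)
    return out


def mean_by_key_hashed(rows: List[dict], key: str, value: str) -> Dict[int, float]:
    # Hash-based: keep a running [sum, count] per key; no sort is needed.
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        acc = totals[r[key]]
        acc[0] += r[value]
        acc[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


sorted_result = mean_by_key_sorted(rows, "payment_type", "trip_distance")
hashed_result = mean_by_key_hashed(rows, "payment_type", "trip_distance")

print(sorted_result)  # {1: 3.0, 2: 2.0}
print(hashed_result)  # {1: 3.0, 2: 2.0}
```

Both produce the same answer; the hash-based version avoids ordering the whole input, which is the kind of saving the note refers to.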

## 6. When to use Ray Data

Ray Data is especially performant when you need to:
- run data processing in a **streaming fashion**,
- run across a **large dataset**, or
- run inside a **heterogeneous cluster of CPUs and GPUs**.

Here is one use case for Batch Inference with Ray Data over a large dataset:

<img src='https://docs.ray.io/en/releases-2.6.1/_images/stream-example.png' width=60%/>
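As a rough intuition for processing data "in a streaming fashion", the sketch below chains three generator stages so that each batch flows through the whole pipeline before the next one is read. It is a simplified single-process model with hypothetical stage names (`read_batches`, `preprocess`, `predict`) and a stand-in "model" that just sums a batch; Ray Data's streaming execution across a cluster is considerably more elaborate.

```python
from typing import Iterator, List


def read_batches(num_batches: int, batch_size: int) -> Iterator[List[int]]:
    # Produce one batch at a time instead of loading everything up front.
    for b in range(num_batches):
        yield list(range(b * batch_size, (b + 1) * batch_size))


def preprocess(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # Stand-in for a CPU preprocessing stage.
    for batch in batches:
        yield [x * 2 for x in batch]


def predict(batches: Iterator[List[int]]) -> Iterator[int]:
    # Stand-in for an inference stage; here the "model" just sums the batch.
    for batch in batches:
        yield sum(batch)


predictions = list(predict(preprocess(read_batches(num_batches=3, batch_size=4))))
print(predictions)  # [12, 44, 76]
```

Only one batch per stage is in flight at any time, so memory use depends on the batch size rather than on the size of the whole dataset.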


Ray Data also integrates seamlessly with Ray Train, making it an optimal choice for **data preprocessing in machine learning training pipelines**, especially when you need to:
- **Independently scale out data loading and transformation** from model training.
- **Enable fault tolerance** for model training.


## 7. Ray Data in Production

1. Runway AI is using Ray Data to scale its ML workloads. See [this interview with Runway AI](https://siliconangle.com/2024/10/02/runway-transforming-ai-driven-filmmaking-innovative-tools-techniques-raysummit/) to learn more.
2. Netflix is using Ray Data for multi-modal batch inference pipelines. See [this talk at the Ray Summit 2024](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog/session/1722028596844001bCg0) to learn more.
3. Spotify uses Ray Data for large-scale data processing. See [this talk at the Ray Summit 2023](https://www.anyscale.com/blog/how-spotify-built-a-robust-ray-platform-with-a-frictionless-developer) to learn more.

## 8. Upcoming Features in Ray Data

Here are some relevant upcoming features in Ray Data:

For structured data:
- improved `groupby` and `map_groups` performance
- using parquet metadata for computing statistics like `count`
- enabling predicate pushdown for parquet files when calling `filter`
- supporting `join` and `merge` operations
- optimizing performance of the `Preprocessor` API for distributed feature engineering
- running Spark on Ray more seamlessly


For all data types:
- data checkpointing for fault tolerance
- optimizing data connectors
- concurrent execution of multiple datasets

In [None]:
# Run this cell for file cleanup 
!rm {storage_folder}/adjusted_data.parquet
!rm -rf {storage_folder}/adjusted_data_ray/