# Data Processing with Ray Data
© 2025, Anyscale. All Rights Reserved

💻 **Launch Locally**: You can run this notebook locally.

🚀 **Launch on Cloud**: Think about running this notebook on a Ray Cluster (Click [here](http://console.anyscale.com/register) to easily start a Ray cluster on Anyscale)


Data preprocessing is a crucial step in any machine learning workflow, and Ray Data provides a scalable and flexible way to handle this process on large datasets. With Ray Data, you can efficiently load, transform, filter, and join datasets using a distributed framework that leverages the power of multiple CPUs or nodes. This makes it possible to preprocess data that would be too large to fit into memory on a single machine, enabling seamless scaling from your laptop to a cluster.

Ray Data supports a variety of preprocessing operations, such as filtering rows based on conditions, joining multiple datasets, and applying custom transformations like tokenization or label encoding. For example, you can load a large public dataset like IMDB reviews, filter for only positive reviews, join with metadata, and tokenize the text—all in a distributed and parallelized manner. This approach not only speeds up data preparation but also integrates smoothly with downstream machine learning tasks, making Ray Data a powerful tool for modern data pipelines.

### Outline of the notebook
This notebook shows a few examples of how to preprocess data with Ray Data.

<div class="alert alert-block alert-info">
<ul>
    <li>Library Imports</li>
    <li>Initialize Ray and Load a Dataset</li>
    <li>Convert to Ray Dataset</li>
    <li>Create a Second Dataset</li>
    <li>Filter Ray Dataset</li>
    <li>Join Two Ray Datasets</li>
    <li>Preprocessing with a Tokenizer</li>
    <li>Convert to Pandas DataFrame</li>
    <li>Shutdown Ray</li>
    <li>Summary</li>
</ul>
</div>

## Library Imports

In [2]:
import ray
from datasets import load_dataset
import pandas as pd
# showcase ray data transformations
from ray.data.preprocessors import Tokenizer

### Initialize Ray and Load a Dataset
Load a public dataset from Hugging Face and explore the dataset with a few sample rows.

In [3]:

# Start Ray
ray.init()


2025-07-11 08:03:29,512	INFO worker.py:1908 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


Python version: 3.11.6
Ray version: 2.47.1
Dashboard: http://127.0.0.1:8265


In [4]:
# Load the IMDB dataset from Hugging Face Datasets (train split)
imdb = load_dataset("imdb", split="train")


In [5]:
print(imdb) # show the dataset structure

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


In [6]:
imdb[0:3]  # Show the first 3 rows of the dataset

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [7]:
# 0 = negative, 1 = positive
print('Number of positive labels = ', sum(imdb['label']))

Number of positive labels =  12500


## Convert to Ray Dataset
Add a unique ID to each row and create a Ray Dataset. A Ray Dataset is split into blocks that are distributed across the nodes of the cluster, which enables parallel computation.

In [8]:
# Add id and convert to Ray Dataset
ds_reviews = ray.data.from_items([
    {"id": i, "text": row["text"], "label": row["label"]}
        for i, row in enumerate(imdb)
]) # add id to each row


In [9]:
print(ds_reviews)

MaterializedDataset(
   num_blocks=200,
   num_rows=25000,
   schema={id: int64, text: string, label: int64}
)


### Create a Second Dataset
Here we create a second dataset of fake metadata whose IDs overlap with some of the rows in `ds_reviews`. In practice, you would use real metadata or another dataset. It is used later to demonstrate a join between two Ray Datasets.

In [10]:

# Create a second dataset: pretend we have metadata (source column) for every 100th row.
# This is just for demonstration purposes; in practice, you would have a real metadata source.
ds_meta = ray.data.from_items([
    {"id": i, "source": "imdb"} for i in range(0, len(imdb), 100)
])


In [11]:
print(ds_meta)

MaterializedDataset(
   num_blocks=200,
   num_rows=250,
   schema={id: int64, source: string}
)


## Filter Ray Dataset
This shows how to filter rows in a Ray Dataset. In this example, a filter is applied to keep only positive reviews.

* A filter can be an expression or a function. The expression API is faster than the functional API.
* The `concurrency` parameter launches multiple workers. If it is not set (the default), the number of workers is determined by the available resources and the number of input blocks.

In [18]:
# Filter: keep only positive reviews (label == 1).

# Option 1: expression API, faster than a Python lambda.
# concurrency=2 parallelizes the filtering operation; for a large dataset,
# consider a larger concurrency value for better performance.
# filtered = ds_reviews.filter(expr="label == 1", concurrency=2)

# Option 2: functional API, slower than the expression API (used in this run).
filtered = ds_reviews.filter(lambda row: row["label"] == 1)
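
To make the row-level semantics of the predicate concrete, here is a minimal plain-Python sketch that does not use Ray at all. The sample rows and the `is_positive` helper are invented for illustration; only the `row["label"] == 1` condition comes from the notebook.

```python
# Plain-Python sketch of row-wise filtering; no Ray required.
# The sample rows below are invented for illustration.
sample_rows = [
    {"id": 0, "text": "dull and overlong", "label": 0},
    {"id": 1, "text": "a wonderful surprise", "label": 1},
    {"id": 2, "text": "not my favorite", "label": 0},
    {"id": 3, "text": "loved every minute", "label": 1},
]


def is_positive(row):
    """Predicate with the same condition as the lambda passed to ds_reviews.filter."""
    return row["label"] == 1


# Keep only the rows for which the predicate returns True.
positive_rows = [row for row in sample_rows if is_positive(row)]

print([row["id"] for row in positive_rows])  # [1, 3]
```

Ray Data applies the same kind of predicate to each row, but spreads the work across blocks and workers instead of looping in a single process.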



In [19]:
print(filtered)

Filter(<lambda>)
+- Dataset(num_rows=25000, schema={id: int64, text: string, label: int64})


## Join Two Ray Datasets
Join the filtered reviews with the metadata dataset on `id`. This is an inner join, so only reviews whose `id` also appears in the metadata dataset are kept.

Note: `num_partitions=2` is used here for demonstration; adjust it based on your dataset size and performance needs. The join operation can be expensive, so consider the size of your datasets.

In [20]:
# Join: inner join on 'id'
joined = filtered.join(ds_meta, on=("id",), join_type="inner", num_partitions=2)
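
As a rough illustration of what an inner join on `id` keeps and drops, the following plain-Python sketch joins two toy lists of row dicts. It does not use Ray; the toy data and the `inner_join` helper are hypothetical and exist only for this example.

```python
# Plain-Python sketch of an inner join on "id"; no Ray required.
# Both toy datasets are invented for illustration.
reviews = [
    {"id": 3, "text": "loved it", "label": 1},
    {"id": 4, "text": "a classic", "label": 1},
    {"id": 5, "text": "great cast", "label": 1},
    {"id": 6, "text": "would watch again", "label": 1},
]
# Metadata exists only for every third id, mirroring the sparse ds_meta.
meta = [{"id": i, "source": "imdb"} for i in range(0, 9, 3)]  # ids 0, 3, 6


def inner_join(left, right, key):
    """Merge rows whose key appears in both inputs (keys assumed unique in `right`)."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


joined_rows = inner_join(reviews, meta, key="id")
for row in joined_rows:
    print(row)
```

Only ids 3 and 6 appear in both inputs, so the result has two rows, each carrying the review columns plus `source`. The distributed join follows the same matching rule but partitions the data first, which is what `num_partitions` controls.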


In [21]:

# Show a few joined results
for row in joined.take(5):
    print(row)

2025-07-11 08:42:12,344	INFO logging.py:295 -- Registered dataset logger for dataset dataset_7_0
2025-07-11 08:42:12,351	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_7_0. Full logs are in /tmp/ray/session_2025-07-11_08-03-28_326524_18345/logs/ray-data
2025-07-11 08:42:12,351	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_7_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Filter(<lambda>)], InputDataBuffer[Input] -> JoinOperator[Join(num_partitions=2)] -> LimitOperator[limit=5]


2025-07-11 08:42:13,579	INFO streaming_executor.py:227 -- ✔️  Dataset dataset_7_0 execution finished in 1.23 seconds


{'id': 12500, 'text': 'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.', 'label': 1, 'source': 'imdb'}
{'id': 13000, 'text': "Visually stunning and full of Eastern Philosophy, this amazing martial arts fantasy is brought to you by master director Tsui Hark, the man behind some of the best films Hong Kong cinema has 

## Preprocessing with a Tokenizer
This section showcases a preprocessing step that tokenizes the text column using a custom tokenization function. Preprocessing steps like this are useful for preparing data to train a model.

In [22]:
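import string

# Define a tokenizer for the text column
def tokenization_fn(s):
    # Strip ASCII punctuation, then split on whitespace
    for character in string.punctuation:
        s = s.replace(character, "")
    return s.split()

# Alternative: without output_columns, the tokens replace the "text" column
# tokenizer = Tokenizer(columns=["text"], tokenization_fn=tokenization_fn)

# Write the tokens to a new "text_tokenized" column and keep the original text
tokenizer = Tokenizer(columns=["text"], output_columns=["text_tokenized"],
                      tokenization_fn=tokenization_fn)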

In [23]:

# print tokenizer sample
for row in tokenizer.transform(filtered).take(5):
    print(row)


2025-07-11 08:47:48,646	INFO logging.py:295 -- Registered dataset logger for dataset dataset_9_0
2025-07-11 08:47:48,650	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_9_0. Full logs are in /tmp/ray/session_2025-07-11_08-03-28_326524_18345/logs/ray-data
2025-07-11 08:47:48,651	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_9_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Filter(<lambda>)->Tokenizer] -> LimitOperator[limit=5]


2025-07-11 08:47:48,837	INFO streaming_executor.py:227 -- ✔️  Dataset dataset_9_0 execution finished in 0.19 seconds


{'id': 12500, 'text': 'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.', 'label': 1, 'text_tokenized': ['Zentropa', 'has', 'much', 'in', 'common', 'with', 'The', 'Third', 'Man', 'another', 'noirlike', 'film', 'set', 'among', 'the', 'rubble', 'of', 'postwar', 'Europe', 'Like', 'TTM', 'there', 'is', 'much', 'inventive
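
The IMDB reviews contain `<br />` HTML line breaks, and stripping punctuation alone turns each tag into a stray `br` token or glues it onto the previous word. The sketch below reproduces this with the notebook's `tokenization_fn` on a short string taken from the start of one of the reviews. It then shows one possible variant, `tokenization_fn_no_breaks`, which is a hypothetical helper and not part of the notebook or of Ray. It needs only the standard library.

```python
import re
import string


def tokenization_fn(s):
    """Same logic as the notebook's tokenizer: drop punctuation, split on whitespace."""
    for character in string.punctuation:
        s = s.replace(character, "")
    return s.split()


def tokenization_fn_no_breaks(s):
    """Hypothetical variant: replace <br /> tags with spaces before tokenizing."""
    s = re.sub(r"<br\s*/?>", " ", s)
    return tokenization_fn(s)


sample = "I miss Dark Angel!..<br /><br />I understand"

print(tokenization_fn(sample))
# ['I', 'miss', 'Dark', 'Angelbr', 'br', 'I', 'understand']

print(tokenization_fn_no_breaks(sample))
# ['I', 'miss', 'Dark', 'Angel', 'I', 'understand']
```

If cleaner tokens matter for the downstream model, a function like this could be passed as `tokenization_fn` in place of the original. Whether removing the tags is the right choice depends on the task.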

### Convert to Pandas DataFrame
Convert the processed Ray Dataset to a Pandas DataFrame for further analysis.

In [24]:
# Get a tokenized Pandas DataFrame from the 'joined' Ray Dataset
# Note: This will convert the Ray Dataset to a Pandas DataFrame, which may not
# be suitable for very large datasets due to memory constraints.
preprocessed_df = tokenizer.transform(joined).to_pandas()


2025-07-11 08:48:51,767	INFO logging.py:295 -- Registered dataset logger for dataset dataset_10_0
2025-07-11 08:48:51,771	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_10_0. Full logs are in /tmp/ray/session_2025-07-11_08-03-28_326524_18345/logs/ray-data
2025-07-11 08:48:51,771	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_10_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Filter(<lambda>)], InputDataBuffer[Input] -> JoinOperator[Join(num_partitions=2)] -> TaskPoolMapOperator[Tokenizer]


2025-07-11 08:48:52,845	INFO streaming_executor.py:227 -- ✔️  Dataset dataset_10_0 execution finished in 1.07 seconds


In [25]:
# Display the first few rows of the preprocessed DataFrame
preprocessed_df.head()

| | id | text | label | source | text_tokenized |
|---|---|---|---|---|---|
| 0 | 12500 | Zentropa has much in common with The Third Man... | 1 | imdb | [Zentropa, has, much, in, common, with, The, T... |
| 1 | 13000 | Visually stunning and full of Eastern Philosop... | 1 | imdb | [Visually, stunning, and, full, of, Eastern, P... |
| 2 | 13100 | The fourth of five westerns Anthony Mann did w... | 1 | imdb | [The, fourth, of, five, westerns, Anthony, Man... |
| 3 | 13600 | `I miss Dark Angel!..<br /><br />I understand n...` | 1 | imdb | [I, miss, Dark, Angelbr, br, I, understand, no... |
| 4 | 13700 | I like the good and evil battle. I liked Eddie... | 1 | imdb | [I, like, the, good, and, evil, battle, I, lik... |


### Shutdown Ray

In [26]:
# Shutdown Ray
ray.shutdown()

### Summary

This notebook demonstrates how to preprocess large datasets using Ray Data in a distributed and scalable way. It covers loading a public dataset (IMDB reviews), converting it to a Ray Dataset, filtering and joining data, applying tokenization, and converting the results to a Pandas DataFrame for further analysis. The workflow highlights how Ray Data enables efficient data processing for machine learning pipelines, even with datasets that are too large for a single machine.