## Preprocessing with a Tokenizer
This section shows a preprocessing step that tokenizes the `text` column with a custom tokenization function. Steps like this prepare the data for model training.

In [22]:
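import string

# Repeated here so the cell is self-contained; harmless if already imported above
from ray.data.preprocessors import Tokenizer

# Define a tokenizer for the text column
def tokenization_fn(s):
    # Remove every ASCII punctuation character, then split on whitespace
    for character in string.punctuation:
        s = s.replace(character, "")
    return s.split()

# Write the tokens to a new "text_tokenized" column. Without output_columns,
# the tokens would replace the contents of the "text" column.
tokenizer = Tokenizer(columns=["text"], output_columns=["text_tokenized"],
                      tokenization_fn=tokenization_fn)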
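
To see what the tokenization function produces before it runs inside Ray, it can be called directly on a short string. The sketch below repeats the function from the cell above and applies it to a made-up sentence; the sample text is illustrative and is not a row from the dataset.

```python
import string

def tokenization_fn(s):
    # Same logic as the notebook cell: drop ASCII punctuation, split on whitespace.
    for character in string.punctuation:
        s = s.replace(character, "")
    return s.split()

# Hypothetical sample text, not taken from the IMDB data.
sample = "A noir-like film, set in post-war Europe!"
tokens = tokenization_fn(sample)
print(tokens)
```

Because the hyphen is deleted rather than replaced with a space, hyphenated words are merged into a single token, which is why `noir-like` appears as `noirlike` in the tokenized output.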

In [23]:

# Print the first five tokenized rows
for row in tokenizer.transform(filtered).take(5):
    print(row)


2025-07-11 08:47:48,646	INFO logging.py:295 -- Registered dataset logger for dataset dataset_9_0
2025-07-11 08:47:48,650	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_9_0. Full logs are in /tmp/ray/session_2025-07-11_08-03-28_326524_18345/logs/ray-data
2025-07-11 08:47:48,651	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_9_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Filter(<lambda>)->Tokenizer] -> LimitOperator[limit=5]



2025-07-11 08:47:48,837	INFO streaming_executor.py:227 -- ✔️  Dataset dataset_9_0 execution finished in 0.19 seconds


{'id': 12500, 'text': 'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.', 'label': 1, 'text_tokenized': ['Zentropa', 'has', 'much', 'in', 'common', 'with', 'The', 'Third', 'Man', 'another', 'noirlike', 'film', 'set', 'among', 'the', 'rubble', 'of', 'postwar', 'Europe', 'Like', 'TTM', 'there', 'is', 'much', 'inventive

### Convert to Pandas DataFrame
Convert the processed Ray Dataset to a Pandas DataFrame for further analysis.

In [24]:
# Get a tokenized Pandas DataFrame from the 'joined' Ray Dataset
# Note: This will convert the Ray Dataset to a Pandas DataFrame, which may not
# be suitable for very large datasets due to memory constraints.
preprocessed_df = tokenizer.transform(joined).to_pandas()


2025-07-11 08:48:51,767	INFO logging.py:295 -- Registered dataset logger for dataset dataset_10_0
2025-07-11 08:48:51,771	INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_10_0. Full logs are in /tmp/ray/session_2025-07-11_08-03-28_326524_18345/logs/ray-data
2025-07-11 08:48:51,771	INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_10_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Filter(<lambda>)], InputDataBuffer[Input] -> JoinOperator[Join(num_partitions=2)] -> TaskPoolMapOperator[Tokenizer]



2025-07-11 08:48:52,845	INFO streaming_executor.py:227 -- ✔️  Dataset dataset_10_0 execution finished in 1.07 seconds


In [25]:
# Display the first few rows of the preprocessed DataFrame
preprocessed_df.head()

| | id | text | label | source | text_tokenized |
|---|---|---|---|---|---|
| 0 | 12500 | `Zentropa has much in common with The Third Man...` | 1 | imdb | `[Zentropa, has, much, in, common, with, The, T...` |
| 1 | 13000 | `Visually stunning and full of Eastern Philosop...` | 1 | imdb | `[Visually, stunning, and, full, of, Eastern, P...` |
| 2 | 13100 | `The fourth of five westerns Anthony Mann did w...` | 1 | imdb | `[The, fourth, of, five, westerns, Anthony, Man...` |
| 3 | 13600 | `I miss Dark Angel!..<br /><br />I understand n...` | 1 | imdb | `[I, miss, Dark, Angelbr, br, I, understand, no...` |
| 4 | 13700 | `I like the good and evil battle. I liked Eddie...` | 1 | imdb | `[I, like, the, good, and, evil, battle, I, lik...` |
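
Row 3 above contains tokens such as `Angelbr` and `br`. The reviews include `<br />` tags, and removing punctuation strips the angle brackets and slash but leaves the letters `br` behind, sometimes glued to the preceding word. One possible way to handle this is sketched below, assuming it is acceptable to replace the tags with spaces before punctuation is removed. The name `tokenization_fn_no_html` and the regular expression are illustrative choices, not part of the original notebook, and the sample string is the visible prefix of row 3.

```python
import re
import string

# Matches <br>, <br/> and <br />, case-insensitively. This assumes break tags
# are the only markup in the text; other tags would need their own handling.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def tokenization_fn_no_html(s):
    # Replace break tags with a space so neighbouring words stay separate.
    s = BR_TAG.sub(" ", s)
    # Remove punctuation in one pass, then split on whitespace.
    s = s.translate(STRIP_PUNCT)
    return s.split()

sample = "I miss Dark Angel!..<br /><br />I understand"
tokens = tokenization_fn_no_html(sample)
print(tokens)
```

A function like this could be passed as `tokenization_fn` when constructing the `Tokenizer`, in place of the original function.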


### Shutdown Ray

In [26]:
# Shut down Ray now that processing is complete
ray.shutdown()

### Summary

This notebook demonstrates how to preprocess large datasets with Ray Data in a distributed, scalable way. It covers loading a public dataset (IMDB reviews), converting it to a Ray Dataset, filtering and joining data, applying tokenization, and converting the results to a Pandas DataFrame for further analysis. The workflow shows how Ray Data supports data processing for machine learning pipelines, even when a dataset is too large for a single machine. The final conversion to Pandas is the exception: it loads the full result into memory, so it is only suitable when that result fits on one machine.