# Introduction to Ray Data 

In this problem, we introduce Ray Data and show how to work with Ray Datasets. We highly encourage you to first read the [documentation](https://docs.ray.io/en/latest/data/data.html) for Ray Data.

The dataset we will use for this problem is the Electronics subset of the Amazon Reviews dataset. It has been provided to you in Parquet format at the path ``~/public/pa2``. In the first section, you will use the ``read_parquet`` method to read the Parquet dataset into a ``ray.data.Dataset`` object. Ray Data uses Ray tasks to read files in parallel. [This](https://docs.ray.io/en/latest/data/data-internals.html) is a useful resource for understanding how data loading works.

In [1]:
# Run this command once to upgrade Ray, then comment it out to avoid
# reinstalling the package in the grader account during autograding.

# !pip install "ray[default]" --upgrade

In [None]:
import ray
import re
import json
import math
import os
ray.shutdown()
ray.init()

ds = ray.data.read_parquet("pa2_data_100k.parquet")

# Sort the dataset to ensure deterministic results
ds = ds.sort(["asin", "unixReviewTime"])

2025-11-02 17:20:18,090	INFO worker.py:2004 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


2025-11-02 17:20:19,262	INFO parquet_datasource.py:699 -- Estimated parquet encoding ratio is 1.683.
2025-11-02 17:20:19,263	INFO parquet_datasource.py:759 -- Estimated parquet reader batch size at 285570 rows


(Map(lowercase)->MapBatches(scale)->MapBatches(add_column)->MapBatches(preprocessor) pid=4221) Token indices sequence length is longer than the specified maximum sequence length for this model (1278 > 1024). Running this sequence through the model will result in indexing errors


In [3]:
# Print out the schema of the dataset
ds.schema()

Column          Type
------          ----
reviewTime      string
reviewerName    string
summary         string
unixReviewTime  int64
asin            string
reviewText      string
reviewerID      string
verified        bool
overall         double

# Check `num_blocks`
Go through the documentation listed at the top to understand what `num_blocks` is!

In [4]:
ds = ds.materialize()
ds.num_blocks()

2025-11-02 17:20:19,574	INFO logging.py:293 -- Registered dataset logger for dataset dataset_2_0
2025-11-02 17:20:19,584	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_2_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:19,584	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_2_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet] -> AllToAllOperator[Sort]


2025-11-02 17:20:20,942	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_2_0 execution finished in 1.36 seconds


44

In [5]:
# View the first 5 entries using the Dataset.take() function.
# Store the result in task1_1_first_5_entries.

task1_1_first_5_entries = ds.take(5)

2025-11-02 17:20:21,044	INFO dataset.py:3500 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-11-02 17:20:21,051	INFO logging.py:293 -- Registered dataset logger for dataset dataset_4_0
2025-11-02 17:20:21,054	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_4_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:21,055	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_4_0: InputDataBuffer[Input] -> LimitOperator[limit=5]


2025-11-02 17:20:21,128	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_4_0 execution finished in 0.07 seconds


In [6]:
# read (NDJSON) and verify
task1_1_path = os.path.expanduser("task1_1_expected_output.txt")
        
with open(task1_1_path, "r", encoding="utf-8") as f:
    expected_first_5_entries = [json.loads(line) for line in f if line.strip()]

assert len(task1_1_first_5_entries) == len(expected_first_5_entries) == 5, "Expected 5 entries."

for i, (got, exp) in enumerate(zip(task1_1_first_5_entries, expected_first_5_entries)):
    assert got == exp, f"Mismatch at index {i}: {got} != {exp}"

print("✅ Verified: each of the first 5 entries matches the saved output.")

✅ Verified: each of the first 5 entries matches the saved output.


# Adding a column 
To add a column to a Ray Dataset, we use the ``Dataset.add_column()`` method. Its documentation can be found [here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.add_column.html).
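
One subtlety arises when the new column holds a running row number: a counter computed inside each batch restarts at 0 for every batch, so the ids are globally unique only if each batch also knows how many rows came before it. The following is a minimal plain-Python sketch of that idea. The `number_rows` helper and the list-of-batches layout are hypothetical names used for illustration and are not part of the Ray Data API.

```python
from itertools import accumulate

def number_rows(batches):
    """Attach a globally increasing "id" column to a list of column-dict batches."""
    sizes = [len(b["unixReviewTime"]) for b in batches]
    # Offset of each batch = number of rows in all earlier batches.
    offsets = [0] + list(accumulate(sizes))[:-1]
    out = []
    for batch, start in zip(batches, offsets):
        new_batch = dict(batch)  # shallow copy so the input batch is not mutated
        new_batch["id"] = list(range(start, start + len(batch["unixReviewTime"])))
        out.append(new_batch)
    return out

# Two made-up batches of different sizes.
batches = [
    {"unixReviewTime": [10, 11, 12]},
    {"unixReviewTime": [13, 14]},
]
numbered = number_rows(batches)
ids = [i for b in numbered for i in b["id"]]
print(ids)  # [0, 1, 2, 3, 4]
```

Without the offsets, the second batch would be numbered 0 and 1 again.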

In [7]:
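# Add a column called id to the dataset, numbering the entries from 0 to
# ds.count() - 1. Note that add_ids() restarts at 0 for every batch, so the ids
# are only globally unique if all rows arrive in a single batch.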

import numpy as np

def add_ids(batch):
    batch["id"] = np.arange(len(batch["unixReviewTime"]))
    return batch

ds = ds.map_batches(add_ids, batch_size=100000)

ds = ds.materialize()  # Why did we do this? Read the cell below.
task_1_2_first_5_entries = ds.take(5)

2025-11-02 17:20:21,177	INFO logging.py:293 -- Registered dataset logger for dataset dataset_6_0
2025-11-02 17:20:21,182	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_6_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:21,183	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_6_0: InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(add_ids)]


2025-11-02 17:20:21,452	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_6_0 execution finished in 0.27 seconds
2025-11-02 17:20:21,474	INFO logging.py:293 -- Registered dataset logger for dataset dataset_8_0
2025-11-02 17:20:21,476	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_8_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:21,476	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_8_0: InputDataBuffer[Input] -> LimitOperator[limit=5]


2025-11-02 17:20:21,500	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_8_0 execution finished in 0.02 seconds


In [8]:
# read (NDJSON) and verify

task1_2_path = os.path.expanduser("task1_2_expected_output.txt")

with open(task1_2_path, "r", encoding="utf-8") as f:
    expected_first_5_entries = [json.loads(line) for line in f if line.strip()]

assert len(task_1_2_first_5_entries) == len(expected_first_5_entries) == 5, "Expected 5 entries."

for i, (got, exp) in enumerate(zip(task_1_2_first_5_entries, expected_first_5_entries)):
    assert got == exp, f"Mismatch at index {i}: {got} != {exp}"

print("✅ Verified: each of the first 5 entries matches the saved output.")

✅ Verified: each of the first 5 entries matches the saved output.


# Lazy Execution in Ray 

As you may have noticed, we added a ``ds.materialize()`` call in the cell above. We do this because Ray Data uses lazy, streaming execution by default: transformations are not run until a result is requested. You can read more about it [here](https://docs.ray.io/en/latest/data/data-internals.html#execution). We call ``materialize`` here to execute the transformation that adds the ``id`` column on the entire dataset.
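
As a rough analogy only, lazy execution behaves much like a chain of Python generators: building the pipeline runs nothing, pulling a few rows runs just enough work to produce them, and consuming everything runs the rest. The sketch below uses hypothetical helpers (`read_rows`, `add_flag`) and plain generators. Ray's actual executor operates on blocks and in parallel, so this illustrates the concept rather than the implementation.

```python
def read_rows(log):
    """Hypothetical source: yields rows one at a time and records each read."""
    for i in range(5):
        log.append(f"read {i}")
        yield {"id": i}

def add_flag(rows, log):
    """Hypothetical transform: runs only when a downstream consumer pulls a row."""
    for row in rows:
        log.append(f"transform {row['id']}")
        yield {**row, "even": row["id"] % 2 == 0}

log = []
pipeline = add_flag(read_rows(log), log)  # building the pipeline runs nothing
steps_after_build = len(log)

first_two = [next(pipeline) for _ in range(2)]  # like take(2): pulls only 2 rows
steps_after_take = len(log)

materialized = first_two + list(pipeline)  # like materialize(): runs the rest
print(steps_after_build, steps_after_take, len(log), len(materialized))  # 0 4 10 5
```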

# Compute Statistics
Just like in pandas, we can compute statistics on our data using built-in functions such as mean, min, and max on the columns of our Dataset.

In [9]:
# Calculate the mean, minimum, and maximum of the overall rating using
# built-in Dataset methods. The min_vote and max_vote names are kept because
# the verification cell below expects them.

mean_overall = ds.mean("overall")
min_vote = ds.min("overall")
max_vote = ds.max("overall")

2025-11-02 17:20:21,524	INFO logging.py:293 -- Registered dataset logger for dataset dataset_10_0
2025-11-02 17:20:21,525	INFO hash_aggregate.py:180 -- Estimated memory requirement for aggregating aggregator (partitions=1, aggregators=1, dataset (estimate)=0.0GiB): shuffle=44.9MiB, output=44.9MiB, total=89.8MiB, 
2025-11-02 17:20:21,526	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_10_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:21,527	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_10_0: InputDataBuffer[Input] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]


2025-11-02 17:20:21,863	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_10_0 execution finished in 0.34 seconds
2025-11-02 17:20:21,903	INFO logging.py:293 -- Registered dataset logger for dataset dataset_12_0
2025-11-02 17:20:21,912	INFO hash_aggregate.py:180 -- Estimated memory requirement for aggregating aggregator (partitions=1, aggregators=1, dataset (estimate)=0.0GiB): shuffle=44.9MiB, output=44.9MiB, total=89.8MiB, 
2025-11-02 17:20:21,913	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_12_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:21,914	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_12_0: InputDataBuffer[Input] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]


2025-11-02 17:20:22,222	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_12_0 execution finished in 0.31 seconds
2025-11-02 17:20:22,241	INFO logging.py:293 -- Registered dataset logger for dataset dataset_14_0
2025-11-02 17:20:22,243	INFO hash_aggregate.py:180 -- Estimated memory requirement for aggregating aggregator (partitions=1, aggregators=1, dataset (estimate)=0.0GiB): shuffle=44.9MiB, output=44.9MiB, total=89.8MiB, 
2025-11-02 17:20:22,244	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_14_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:22,244	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_14_0: InputDataBuffer[Input] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]


2025-11-02 17:20:22,543	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_14_0 execution finished in 0.30 seconds


In [10]:
# --- read back and verify ---
task1_3_path = os.path.expanduser("task1_3_expected_output.txt")

with open(task1_3_path, "r", encoding="utf-8") as f:
    expected_metrics = [json.loads(line) for line in f if line.strip()][0]

# reuse the values computed in the previous cell
_mean_overall = mean_overall
_min_vote = min_vote
_max_vote = max_vote

assert math.isclose(float(_mean_overall), float(expected_metrics["mean_overall"]), rel_tol=1e-9, abs_tol=1e-12), \
    f"mean_overall mismatch: {_mean_overall} != {expected_metrics['mean_overall']}"
assert math.isclose(float(_min_vote), float(expected_metrics["min_vote"]), rel_tol=1e-9, abs_tol=1e-12), \
    f"min_vote mismatch: {_min_vote} != {expected_metrics['min_vote']}"
assert math.isclose(float(_max_vote), float(expected_metrics["max_vote"]), rel_tol=1e-9, abs_tol=1e-12), \
    f"max_vote mismatch: {_max_vote} != {expected_metrics['max_vote']}"

print("✅ Verified: mean_overall, min_vote, and max_vote match the saved output.")

✅ Verified: mean_overall, min_vote, and max_vote match the saved output.


# Preprocessors in Ray Data

Ray Data is part of the Ray AI Runtime and is built to be a scalable data processing library for ML applications. Hence, it has a rich library of common preprocessors that we need when serving ML models. [Here](https://docs.ray.io/en/latest/data/api/doc/ray.data.preprocessor.Preprocessor.html#ray.data.preprocessor.Preprocessor) is how the built-in preprocessors work.
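
To make the fit/transform pattern concrete, here is a minimal plain-Python sketch of max-abs scaling, which divides each value by the largest absolute value in its column. The helpers `fit_max_abs` and `transform_max_abs` and the sample ratings are made up for illustration. This is not the library's implementation, and the handling of an all-zero column is an assumption of the sketch.

```python
def fit_max_abs(values):
    """The 'fit' step: find the largest absolute value in the column."""
    return max(abs(v) for v in values)

def transform_max_abs(values, max_abs):
    """The 'transform' step: divide every value by the fitted maximum."""
    if max_abs == 0:
        # Assumption for this sketch: leave an all-zero column unchanged.
        return list(values)
    return [v / max_abs for v in values]

overall = [5.0, 4.0, 1.0, 3.0]  # made-up ratings for illustration
max_abs = fit_max_abs(overall)
scaled = transform_max_abs(overall, max_abs)
print(scaled)  # [1.0, 0.8, 0.2, 0.6]
```

Separating the two steps means the statistic is computed once over the full column and can then be reused to transform other data in the same way.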

In [11]:
# Scale each 'overall' value by the column's maximum absolute value using MaxAbsScaler

from ray.data.preprocessors import MaxAbsScaler

scaler = MaxAbsScaler(columns=["overall"])
scaler = scaler.fit(ds)
ds = scaler.transform(ds)
ds = ds.materialize()

task1_4_first_5_entries = ds.take(5)

2025-11-02 17:20:22,577	INFO logging.py:293 -- Registered dataset logger for dataset dataset_16_0
2025-11-02 17:20:22,579	INFO hash_aggregate.py:180 -- Estimated memory requirement for aggregating aggregator (partitions=1, aggregators=1, dataset (estimate)=0.0GiB): shuffle=44.9MiB, output=44.9MiB, total=89.8MiB, 
2025-11-02 17:20:22,581	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_16_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:22,581	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_16_0: InputDataBuffer[Input] -> HashAggregateOperator[HashAggregate(key_columns=(), num_partitions=1)] -> LimitOperator[limit=1]


2025-11-02 17:20:22,977	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_16_0 execution finished in 0.40 seconds
2025-11-02 17:20:23,021	INFO logging.py:293 -- Registered dataset logger for dataset dataset_18_0
2025-11-02 17:20:23,024	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_18_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:23,030	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_18_0: InputDataBuffer[Input] -> TaskPoolMapOperator[MaxAbsScaler]


2025-11-02 17:20:23,158	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_18_0 execution finished in 0.13 seconds
2025-11-02 17:20:23,174	INFO logging.py:293 -- Registered dataset logger for dataset dataset_20_0
2025-11-02 17:20:23,175	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_20_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:23,175	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_20_0: InputDataBuffer[Input] -> LimitOperator[limit=5]


2025-11-02 17:20:23,248	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_20_0 execution finished in 0.07 seconds


In [12]:
# read (NDJSON) and verify
task1_4_path = os.path.expanduser("task1_4_expected_output.txt")

with open(task1_4_path, "r", encoding="utf-8") as f:
    expected_first_5_entries = [json.loads(line) for line in f if line.strip()]

assert len(task1_4_first_5_entries) == len(expected_first_5_entries) == 5, "Expected 5 entries."

for i, (got, exp) in enumerate(zip(task1_4_first_5_entries, expected_first_5_entries)):
    assert got == exp, f"Mismatch at index {i}: {got} != {exp}"

print("✅ Verified: each of the first 5 entries matches the saved output.")

✅ Verified: each of the first 5 entries matches the saved output.


# Applying a transform over the entire dataset

To apply a function to the entire dataset, we use the ``Dataset.map()`` method. It transforms the dataset row by row according to the function you pass into it. ``Dataset.map()`` uses Ray tasks to transform the blocks of the dataset. [Here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map.html#ray.data.Dataset.map) is an example of how to use it.

In [13]:
# Create a function named lowercase() that accepts a single row of data as
# input and converts the text in the 'summary' column to lowercase letters.
# Use Dataset.map() to apply this function over the entire dataset.
def lowercase(row):
    s = row.get("summary")
    if isinstance(s, str):
        row["summary"] = s.lower()
    return row

ds = ds.map(lowercase)

task1_5_first_5_entries = ds.take(5)

2025-11-02 17:20:23,291	INFO logging.py:293 -- Registered dataset logger for dataset dataset_22_0
2025-11-02 17:20:23,293	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_22_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:23,293	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_22_0: InputDataBuffer[Input] -> LimitOperator[limit=5] -> TaskPoolMapOperator[Map(lowercase)]


2025-11-02 17:20:23,386	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_22_0 execution finished in 0.09 seconds


In [14]:
# read (NDJSON) and verify
task1_5_path = os.path.expanduser("task1_5_expected_output.txt")

with open(task1_5_path, "r", encoding="utf-8") as f:
    expected_first_5_entries = [json.loads(line) for line in f if line.strip()]

assert len(task1_5_first_5_entries) == len(expected_first_5_entries) == 5, "Expected 5 entries."

for i, (got, exp) in enumerate(zip(task1_5_first_5_entries, expected_first_5_entries)):
    assert got == exp, f"Mismatch at index {i}: {got} != {exp}"

print("✅ Verified: each of the first 5 entries matches the saved output.")

✅ Verified: each of the first 5 entries matches the saved output.


# Applying a vectorized transformation over the entire dataset

If your transformation can be vectorized, i.e., applied to multiple rows at once, you can apply it over batches using the ``Dataset.map_batches()`` method. [Here](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html) is an example of how to use it.

Hint: set the parameter ``batch_format="pandas"`` for this dataset when using ``map_batches()`` to avoid Ray Data-specific issues.
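
The difference from a row-wise map is that the function is called once per batch and receives whole columns rather than a single row. The plain-Python sketch below illustrates this with hypothetical helpers (`iter_batches`, `scale_batch`) and made-up values. It mimics the batching idea only and does not use the Ray Data API.

```python
def iter_batches(rows, batch_size):
    """Group a list of row dicts into column-oriented batches."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One list per column, holding the values for every row in the chunk.
        yield {key: [r[key] for r in chunk] for key in chunk[0]}

def scale_batch(batch):
    """Map a whole column from [0, 1] to [-1, 1] in one call."""
    batch["overall"] = [2 * v - 1 for v in batch["overall"]]
    return batch

rows = [{"overall": v} for v in [0.0, 0.25, 0.5, 1.0, 0.75]]  # made-up values
scaled = []
calls = 0
for batch in iter_batches(rows, batch_size=2):
    calls += 1  # the transform runs once per batch, not once per row
    scaled.extend(scale_batch(batch)["overall"])

print(calls, scaled)  # 3 [-1.0, -0.5, 0.0, 1.0, 0.5]
```

Five rows with a batch size of 2 yield three calls, and the last batch is smaller than the others.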

In [15]:
import pandas as pd

def scale(batch: pd.DataFrame):
    # Scale from [0, 1] to [-1, 1]
    # Formula: new_value = 2 * old_value - 1
    batch["overall"] = 2 * batch["overall"] - 1
    return batch

ds = ds.map_batches(scale, batch_size=128, batch_format="pandas")

task1_6_first_5_entries = ds.take(5)

2025-11-02 17:20:23,413	INFO logging.py:293 -- Registered dataset logger for dataset dataset_24_0
2025-11-02 17:20:23,415	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_24_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:20:23,415	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_24_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(lowercase)->MapBatches(scale)] -> LimitOperator[limit=5]


2025-11-02 17:21:52,456	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_24_0 execution finished in 89.04 seconds


In [16]:
# read (NDJSON) and verify
task1_6_path = os.path.expanduser("task1_6_expected_output.txt")

with open(task1_6_path, "r", encoding="utf-8") as f:
    expected_first_5_entries = [json.loads(line) for line in f if line.strip()]

assert len(task1_6_first_5_entries) == len(expected_first_5_entries) == 5, "Expected 5 entries."

for i, (got, exp) in enumerate(zip(task1_6_first_5_entries, expected_first_5_entries)):
    assert got == exp, f"Mismatch at index {i}: {got} != {exp}"

print("✅ Verified: each of the first 5 entries matches the saved output.")

✅ Verified: each of the first 5 entries matches the saved output.


# Cleaning up reviewText

Write a function called ``preprocessor()`` that takes in a batch of size 128. For each row, convert the ``reviewText`` to lowercase letters, remove all punctuation (we suggest using a regex), and tokenize the result. A GPT-2 tokenizer has been instantiated for you; use the ``tokenizer.encode()`` method to tokenize each ``reviewText``. Add the tokenized representations to your dataset under the column ``tokenizedText``.
(You might have to add the new column beforehand.)

Apply this preprocessor transform using map_batches.
Print the first 5 entries of this transformed dataset.
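
The text-cleaning step can be tried on its own before wiring it into a batch function. The sketch below is one possible way to lowercase a string and strip punctuation with a regex. The `clean_text` helper and the sample sentence are hypothetical, and tokenization is deliberately left out.

```python
import re

# Matches any character that is neither a word character nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")

def clean_text(text):
    """Lowercase the text, then delete every punctuation character."""
    return PUNCTUATION.sub("", text.lower())

cleaned = clean_text("Great product, works AS advertised!!")
print(cleaned)  # great product works as advertised
```

Note that `\w` also matches the underscore, so this pattern keeps `_` in the text; add it to the character class if underscores should be removed too.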


In [17]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
# Write a `preprocessor` function that can tokenize text for a batch of data. 
# Store the result of the map_batches in `transformed`. Use map_batches again.

ds = ds.add_column("tokenizedText", lambda df: None)

def preprocessor(batch: pd.DataFrame):
    import re
    
    # Process each review text
    processed_texts = []
    for text in batch["reviewText"]:
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation using regex
        text = re.sub(r'[^\w\s]', '', text)
        processed_texts.append(text)
    
    # Update the reviewText column
    batch["reviewText"] = processed_texts
    
    # Tokenize each text
    tokenized = []
    for text in processed_texts:
        tokens = tokenizer.encode(text)
        tokenized.append(tokens)
    
    # Update the tokenizedText column
    batch["tokenizedText"] = tokenized
    
    return batch

transformed = ds.map_batches(preprocessor, batch_size=128, batch_format="pandas")

transformed_results = transformed.take(5)
decode_txt = tokenizer.decode(transformed_results[0]["tokenizedText"])

2025-11-02 17:21:55,058	INFO logging.py:293 -- Registered dataset logger for dataset dataset_27_0
2025-11-02 17:21:55,060	INFO streaming_executor.py:159 -- Starting execution of Dataset dataset_27_0. Full logs are in /tmp/ray/session_2025-11-02_17-20-16_753217_4187/logs/ray-data
2025-11-02 17:21:55,060	INFO streaming_executor.py:160 -- Execution plan of Dataset dataset_27_0: InputDataBuffer[Input] -> TaskPoolMapOperator[Map(lowercase)->MapBatches(scale)->MapBatches(add_column)->MapBatches(preprocessor)] -> LimitOperator[limit=5]


2025-11-02 17:23:29,696	INFO streaming_executor.py:279 -- ✔️  Dataset dataset_27_0 execution finished in 94.63 seconds


In [19]:
# Check if you have tokenized your text correctly 

assert decode_txt == transformed_results[0]['reviewText'] 

task1_7_path = os.path.expanduser("task1_7_expected_output.txt")

with open(task1_7_path, "r", encoding="utf-8") as f: 
    content = f.read()
assert content == decode_txt
print("✅ decode_txt round-trip matches expected output")


✅ decode_txt round-trip matches expected output
