# NeMo Curator Lab

**Assignment 2, Part a** | UC Berkeley EE 194/290-16: Scalable AI | Spring 2026

## Objective

In this assignment, students will use NeMo Curator to download and curate Wikipedia data within a chosen domain. Then, they will compare the non-curated data with the curated data and reason about how data curation may impact downstream tasks.

This assignment will be scored out of 70 points, with an opportunity to earn an additional 5 bonus points during the evaluation section. Refer to the headings for breakdowns of the scoring rubric.

**Deliverable**: Submit this notebook with all cells implemented and run, with every cell's output included.

## Environment Setup [0 points]

Refer to Curator's [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html) to install and run Curator via Docker or `uv`. If using `uv`, include the `text_cuda12` extra (or install `all` which includes `text_cuda12`). Verify the environment and library setups with:

In [None]:
# verifies curator is installed
import nemo_curator

print(nemo_curator.__version__)

In [None]:
!nvidia-smi

In [None]:
# verifies that gpu dependencies are installed
import cudf

print(cudf.__version__)

Initialize and start a Ray client:

In [None]:
from nemo_curator.core.client import RayClient

ray_client = RayClient(num_cpus=32)
ray_client.start()

Note: If you encounter out-of-memory errors while downloading the data, you may need to stop the Ray client (with `ray_client.stop()`), reduce `num_cpus` above, and start a fresh Ray client.

## Data Setup [5 points]

Use the following code to download and extract data from the latest Wikipedia dump. Limit it to English text only.

You are encouraged to experiment with the URL limit and record limit. Some helpful information:
- There are ~70 URLs per dump
- The number of JSONL files written == `url_limit`
- There are ~20,000 records per URL
- The number of rows per JSONL file == `record_limit`
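As a rough sizing aid, the figures above imply an upper bound on how many documents a given setting produces. The sketch below simply multiplies the two limits, capped at the approximate per-dump figures quoted above. `estimate_documents` is a hypothetical helper (not part of Curator), and treating `None` as "no limit" is an assumption of this sketch.

```python
from typing import Optional

# Approximate figures quoted above; real dumps vary.
APPROX_URLS_PER_DUMP = 70
APPROX_RECORDS_PER_URL = 20_000


def estimate_documents(url_limit: Optional[int], record_limit: Optional[int]) -> int:
    """Upper-bound estimate of the total number of rows written."""
    urls = APPROX_URLS_PER_DUMP if url_limit is None else min(url_limit, APPROX_URLS_PER_DUMP)
    records = (
        APPROX_RECORDS_PER_URL
        if record_limit is None
        else min(record_limit, APPROX_RECORDS_PER_URL)
    )
    return urls * records


print(estimate_documents(2, 100))      # 200
print(estimate_documents(10, 5_000))   # 50000
```

Because only a fraction of pages will match your chosen domain later on, the starting dataset usually needs to be much larger than the number of documents you want to end up with.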

In [None]:
from nemo_curator.stages.text.download.wikipedia.stage import WikipediaDownloadExtractStage

language = "en"  # do not change this
download_dir = "./wiki_downloads"
url_limit = 2  # experiment with this
record_limit = 100  # experiment with this

# Initialize the Wikipedia download stage
wiki_stage = WikipediaDownloadExtractStage(
    language=language,
    download_dir=download_dir,
    url_limit=url_limit,
    record_limit=record_limit,
)

In [None]:
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

wiki_data_dir = "./wiki_data"

# Initialize the JSONL writer stage
jsonl_writer = JsonlWriter(wiki_data_dir, write_kwargs={"force_ascii": False})

In [None]:
from nemo_curator.pipeline import Pipeline

# Initialize the pipeline
pipeline = Pipeline("download_wiki_pipeline")

# Add the stages to the pipeline
pipeline.add_stage(wiki_stage)
pipeline.add_stage(jsonl_writer)

In [None]:
# Run the pipeline
results = pipeline.run()

## Implement a Stage [10 points]

Use the code snippet below to read and inspect a portion of the dataset:

In [None]:
import pandas as pd
from nemo_curator.utils.file_utils import get_all_file_paths_under

# Read and inspect the first JSONL file from wiki_data_dir
file_paths = get_all_file_paths_under(wiki_data_dir)
df = pd.read_json(file_paths[0], lines=True)
df.head(10)

In the Wikipedia dataset, the last portion of a document's text typically corresponds to the "Categories" section of the Wikipedia page. For example, the text for the Wikipedia page on [Anarchism](https://en.wikipedia.org/wiki/Anarchism) looks like:

```text
Anarchism is a political philosophy and movement... \n\nExternal links \n\n Anarchy Archives – an online research center on the history and theory of anarchism.\n\n \nAnti-capitalism\nAnti-fascism\nEconomic ideologies\nFar-left politics\nLeft-wing ideologies\nLibertarian socialism\nLibertarianism\nPolitical culture\nPolitical ideologies\nPolitical movements\nSocial theories\nTypes of socialism
```

where `\n\n` denotes a new section on the page, and the last section of the page is a list of categories (in the above example, the categories are Anti-capitalism, Anti-fascism, ... Types of socialism).
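To make this structure concrete, here is a minimal plain-Python sketch that applies the same idea to a shortened version of the example string. `extract_categories` is a hypothetical helper name, not part of Curator.

```python
def extract_categories(text: str) -> list[str]:
    """Return the lines of the final blank-line-separated section of a page's text."""
    # Everything after the last "\n\n" is treated as the categories section
    last_section = text.rsplit("\n\n", 1)[-1]
    return last_section.strip().split("\n")


sample = (
    "Anarchism is a political philosophy and movement..."
    "\n\nExternal links\n\nAnarchy Archives"
    "\n\n \nAnti-capitalism\nAnti-fascism\nTypes of socialism"
)

print(extract_categories(sample))
# ['Anti-capitalism', 'Anti-fascism', 'Types of socialism']
```

Note that if a text contains no `\n\n` at all, this sketch returns the lines of the entire text, so pages without a categories section can produce misleading "categories".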

In Pandas, we can create a new column called "categories" by using the following operations:

In [None]:
# In the text column, grab everything after the final occurrence of "\n\n"
df["categories"] = df["text"].str.rsplit("\n\n", n=1).str[-1].str.strip().str.split("\n")

# Add the title of the page to the list of categories
df["categories"] = df.apply(
    lambda row: [row["title"]] + row["categories"], axis=1
)

In [None]:
df.head(10)

With this in mind, implement a stage in Curator which adds a "categories" column to the entire dataset. Here is a skeleton to help:

In [None]:
from dataclasses import dataclass
from nemo_curator.stages.base import ProcessingStage
from nemo_curator.tasks import DocumentBatch

@dataclass
class CategoriesAdder(ProcessingStage[DocumentBatch, DocumentBatch]):
    """
    Adds a "categories" column to the dataset.
    """

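    # Type annotations make these dataclass fields, so they can be set per instance
    text_field: str = "text"
    title_field: str = "title"
    categories_field: str = "categories"  # the column itself holds list[str] values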

    def inputs(self):
        return [["data"], [self.text_field, self.title_field]]

    def outputs(self):
        return [["data"], [self.categories_field]]

    def process(self, batch: DocumentBatch) -> DocumentBatch:
        # implement this function
        return batch

In [None]:
from nemo_curator.stages.text.io.reader import JsonlReader

# Create a pipeline and add stages to it
pipeline = Pipeline("wiki_categories_pipeline")

jsonl_reader = JsonlReader(wiki_data_dir)
pipeline.add_stage(jsonl_reader)

categories_adder = CategoriesAdder()
pipeline.add_stage(categories_adder)

wiki_categories_dir = "./wiki_categories_data"
jsonl_writer = JsonlWriter(wiki_categories_dir)
pipeline.add_stage(jsonl_writer)

In [None]:
# Run the pipeline
results = pipeline.run()

Feel free to do some exploratory analyses of the categories of data available. Include your explorations below.
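One possible starting point is a frequency count of categories across the written files. The sketch below uses only the standard library and assumes each output file has a `.jsonl` extension and that every row stores `categories` as a JSON list. `count_categories` and `count_categories_in_dir` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_categories(lines: Iterable[str]) -> Counter:
    """Count category occurrences in an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            counts.update(record.get("categories") or [])
    return counts


def count_categories_in_dir(data_dir: str) -> Counter:
    """Aggregate category counts over every *.jsonl file in data_dir."""
    total: Counter = Counter()
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            total.update(count_categories(f))
    return total
```

For example, `count_categories_in_dir("./wiki_categories_data").most_common(20)` would list the 20 most frequent categories, which can help when choosing domain terms for the next section.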

In [None]:
# exploratory analysis here

## Implement a Filter [10 points]

Next, choose a topic of interest and implement the following filter. Aim for roughly 1,000 to 10,000 documents (rows), or more, in your `wiki_domain_dir`.

Note that the code below expects `domains` to be a list of strings. Listing several related terms helps keep more of the documents that match your target area. For example, if your target domain is political theory, an appropriate usage might be `domains=["politics", "political theory", "conservatism", "liberalism", ...]` to catch as many relevant matches as possible.
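How case-insensitive substring matching behaves can be sketched in plain Python. `matches_any_domain` is a hypothetical helper, not part of Curator; the example shows both the benefit of substring matching and one pitfall, namely short terms that match unrelated categories.

```python
def matches_any_domain(categories: list[str], domains: list[str]) -> bool:
    """True if any domain term occurs, ignoring case, inside any category string."""
    lowered = [category.lower() for category in categories]
    return any(domain.lower() in category for domain in domains for category in lowered)


# A stem such as "politic" matches "Political theories" and "Politics" alike
print(matches_any_domain(["Political theories"], ["politic"]))   # True

# Pitfall: a short term can match unrelated categories
print(matches_any_domain(["Martial arts"], ["art"]))             # True

# No term occurs in any category
print(matches_any_domain(["Types of socialism"], ["Politics"]))  # False
```

Choosing terms that are specific enough to avoid accidental matches, while still covering spelling variants, is part of the design work in this section.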

In [None]:
from nemo_curator.stages.text.filters.doc_filter import DocumentFilter

class DomainFilter(DocumentFilter):
    """
    Throw away documents whose categories column does not contain the given substring(s)

    Advice:
    - Check if any of the domains are in the categories, even as a substring
    - This function should not be case-sensitive
    """

    def __init__(self, domains: list[str]):
        super().__init__()
        self._domains = domains
        self._name = "domain_filter"

    def score_document(self, categories: list[str]) -> list[str]:
        return categories  # do not modify this function

    def keep_document(self, categories: list[str]) -> bool:
        # implement this function
        # keep_document should return True if the document contains any of the domains and False otherwise
        return False

In [None]:
from nemo_curator.stages.text.modules import Filter

# Create a pipeline and add stages to it
pipeline = Pipeline("wiki_domain_pipeline")

jsonl_reader = JsonlReader(wiki_categories_dir)
pipeline.add_stage(jsonl_reader)

domains = []  # replace with your list of domains
domain_filter = DomainFilter(domains=domains)
pipeline.add_stage(Filter(domain_filter, filter_field="categories"))

wiki_domain_dir = "./wiki_domain_data"
jsonl_writer = JsonlWriter(wiki_domain_dir)
pipeline.add_stage(jsonl_writer)

In [None]:
# Run the pipeline
results = pipeline.run()

You can use the following bash command to count the number of documents within your `wiki_domain_dir`:

In [None]:
!wc -l ./wiki_domain_data/*.jsonl

Feel free to do some exploratory analyses of the categories of data available. Include your explorations below.

In [None]:
# exploratory analysis here

## Implement a Pipeline [30 points]

Create and run a data curation pipeline. Explain your reasoning for each stage included and how you expect it to be relevant to your target domain and/or findings from your exploratory analysis.

Some options include:
- Heuristic filters: using `ScoreFilter` and/or `Filter` as the stage(s), with an existing or custom `DocumentFilter` as the parameter(s)
- Modifiers: using `Modify` as the stage(s), with an existing or custom `DocumentModifier` as the parameter(s)
- Deduplication: exact, fuzzy, and/or semantic deduplication workflow(s)
- Quality classifiers: determining educational value using a FineWeb-Edu classifier, etc.

Please note that since deduplication is not a map-style operation, you do **not** add it to a `Pipeline`. Instead, deduplication is initialized as a **workflow** (e.g., `workflow = TextSemanticDeduplicationWorkflow(...)`) and run with `workflow.run()`. See the existing [tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text/deduplication) and/or NeMo Curator documentation for more help.
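To see why deduplication needs a view of the whole dataset rather than one batch at a time, consider this minimal exact-deduplication sketch in plain Python. It is an illustration only and not how Curator's workflows are implemented; `exact_dedup` is a hypothetical name.

```python
import hashlib


def exact_dedup(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen: set[str] = set()  # hashes of every text seen so far, across the whole dataset
    unique: list[str] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = ["alpha", "beta", "alpha", "gamma", "beta"]
print(exact_dedup(docs))  # ['alpha', 'beta', 'gamma']
```

Whether a document is kept depends on every document that came before it (the `seen` set), so the decision cannot be made independently per row or per batch the way a filter or modifier can.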

You are encouraged to implement your own custom filters and/or stages in Curator! You are also encouraged to look at the number of dropped documents or other interesting statistics per stage.
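One simple way to track dropped documents is to record the row count after each stage (for example with `wc -l` on each output directory) and compare consecutive counts. The sketch below assumes you have collected those counts yourself. `retention_report` is a hypothetical helper, and the stage names and numbers in the example are placeholders, not real results.

```python
def retention_report(counts: list[tuple[str, int]]) -> list[str]:
    """Format how many documents each stage dropped relative to the previous one."""
    report = []
    previous = None
    for name, n in counts:
        if previous is None:
            report.append(f"{name}: {n} documents")
        else:
            dropped = previous - n
            pct = 100 * dropped / previous if previous else 0.0
            report.append(f"{name}: {n} documents ({dropped} dropped, {pct:.1f}%)")
        previous = n
    return report


# Placeholder counts for illustration only
for line in retention_report([("input", 1000), ("stage_a", 900), ("stage_b", 450)]):
    print(line)
```

A stage that removes almost nothing, or almost everything, is usually worth a closer look before moving on.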

Save your final curated dataset in a directory called `wiki_curated_domain_data/`.

### Scoring

- A deduplication workflow is required [10 points]
- Additional stages (not including read/write) are 5 points each (up to 15 points)
- Justification for each stage is required [5 points]

In [None]:
# Initialize the pipeline
pipeline = Pipeline("wiki_curation_pipeline")

jsonl_reader = JsonlReader(wiki_domain_dir)
pipeline.add_stage(jsonl_reader)

In [None]:
# add stages here
# justify your choices either in the comments or in a markdown cell
# remember that the order of the stages matters!

In [None]:
curated_data_dir = "./wiki_curated_domain_data"
jsonl_writer = JsonlWriter(curated_data_dir)
pipeline.add_stage(jsonl_writer)

In [None]:
# Run the pipeline
results = pipeline.run()

## Evaluate the Data [15 points]

In this section, we will compare the non-curated data within `wiki_domain_data/` with the curated data within `wiki_curated_domain_data/`.

The best way to compare the effectiveness of curated versus non-curated data is to train separate models with each dataset and evaluate the resulting models themselves. However, in the interest of time, we will use **perplexity** as our evaluation metric.

Perplexity colloquially refers to how "confusing" a piece of text is to an LLM. A lower perplexity score indicates that the model is less "perplexed," meaning it is more confident and accurate in predicting the next token in a sequence. We would therefore expect the curated data to have lower perplexity than the non-curated data, but this may not hold for your datasets, and that is okay. In either case, reason about the results with respect to your chosen domain.
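Concretely, perplexity is the exponential of the average negative log-probability that the model assigns to each token. The sketch below computes it from a list of per-token probabilities, which is a simplification: the provided function further down obtains the same quantity by exponentiating the model's cross-entropy loss. `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability over the tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# If every token is predicted with probability 1/4, the perplexity is about 4:
# the model is as uncertain as if it were choosing uniformly among 4 options.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))

# More confident predictions give lower perplexity
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.2, 0.1, 0.3]))
```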

The cell below includes a basic function for calculating the perplexity of a text using the `gpt2` model. Calculate and plot the perplexities of the non-curated versus curated datasets.

Consider the following:
- The provided function is slow for computing the perplexities of hundreds or thousands of documents. Convert it into a Curator stage.
- The `gpt2` model is a nice lightweight and generic model to use here. Evaluation using the `gpt2` model is required. Additionally, you are encouraged to try it out with different models depending on your chosen domain (e.g., if your domain is medicine, consider evaluating with a model specifically intended for medical and clinical text).
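When comparing the two datasets, a few summary statistics alongside the plot can make differences easier to state. Here is a minimal sketch using only the standard library, assuming you already have one perplexity value per document. `summarize` is a hypothetical helper, and the input values are placeholders rather than real results.

```python
import statistics


def summarize(perplexities: list[float]) -> dict[str, float]:
    """Basic summary statistics for a list of per-document perplexities."""
    return {
        "n": len(perplexities),
        "mean": statistics.mean(perplexities),
        "median": statistics.median(perplexities),
        "min": min(perplexities),
        "max": max(perplexities),
    }


# Placeholder values for illustration only
example = [10.0, 20.0, 30.0, 400.0]
print(summarize(example))
```

A single very high value can pull the mean far from the median, as in the placeholder list above, so reporting the median or plotting on a logarithmic axis may give a fairer comparison.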

### Scoring

- Conversion to Curator stage [10 points]
- Plots for the perplexities of the non-curated versus curated data [5 points]
- Extension to domain-specific models and/or metrics, with a comprehensive analysis per method [5 bonus points]

In [None]:
# do not modify this cell

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import math

# Load pretrained model and tokenizer
model_name = "gpt2"  # small, fast for experiments
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
model.eval()

def compute_perplexity(texts):
    """
    texts: list of strings
    returns: list of perplexity scores
    """
    perplexities = []

    for text in texts:
        # Tokenize
        encodings = tokenizer(text, return_tensors="pt", truncation=True)
        input_ids = encodings.input_ids.to("cuda")

        with torch.no_grad():
            outputs = model(input_ids, labels=input_ids)
            # Cross-entropy loss per token
            loss = outputs.loss
        # Perplexity = exp(loss)
        perplexity = math.exp(loss.item())
        perplexities.append(perplexity)

    return perplexities

# Example usage
texts = [
    "This is a well-written coherent sentence.",
    "Ths txt has typos and is hard to read."
]

perplexities = compute_perplexity(texts)
for t, p in zip(texts, perplexities):
    print(f"Perplexity: {p:.2f} | Text: {t}")

In [None]:
# implement a curator stage here

In [None]:
# calculate perplexities of the non-curated and curated datasets here

In [None]:
# plot perplexities here

You are invited to experiment with other methods to evaluate the non-curated versus curated datasets. Include them below.

In [None]:
# additional explorations here

## Conclusion

Reason about the assignment and/or your findings here.

Once you are done with any Curator-related pipelines, stop the Ray client with:

In [None]:
ray_client.stop()