# NeMo Data Designer Lab: Synthetic Pre-Training Data via Document Rephrasing

**Assignment 2, Part b** | UC Berkeley EE 194/290-16: Scalable AI | Spring 2026

**Cells marked "YOUR CODE HERE" require your implementation.**

In Part a, you used [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) to curate a domain-specific
Wikipedia corpus. In this notebook, you will use **[NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner)** to generate
synthetic pre-training data from that curated corpus via **document rephrasing**,
the core technique behind [Nemotron-CC](https://arxiv.org/abs/2412.02595) (Su et al., 2024) and [BeyondWeb](https://arxiv.org/abs/2506.10426)
(Xu et al., 2025).

The idea is simple but powerful: take high-quality source documents and rephrase them
into diverse synthetic formats (QA pairs, distilled passages, knowledge extracts)
that improve pre-training data quality and diversity. A cost/efficiency thread runs
throughout: you will estimate FLOPs and GPU hours for every generation pipeline you
build, reasoning about the compute-quality frontier.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Launch and validate a GPU inference server for large-scale text generation
2. Estimate computational costs (FLOPs, GPU-hours) for generation workloads
3. Build single-strategy and multi-strategy document rephrasing pipelines
4. Generate structured outputs (diverse QA pairs) using Pydantic schemas
5. Evaluate synthetic data quality using perplexity and LLM-as-judge metrics
6. Scale a generation pipeline from prototype to production

### Prerequisites

- **Part a (Curator notebook)**: You need the curated Wikipedia corpus from Part a
  as seed data. A small fallback sample is provided if your Curator output is
  unavailable.
- **Hardware**: A dedicated node with 8xH100 GPUs.
- **Docker**: The inference server runs as a Docker container
  (`nvidia-container-toolkit` must be installed for GPU access).

### Pipeline Overview

The notebook follows a five-stage pipeline:

1. **Seed data**: Load the curated corpus from Part a and configure it as a
   NeMo Data Designer seed source.
2. **Inference server**: Launch a [vLLM](https://github.com/vllm-project/vllm) Docker container serving [Nemotron-3 Nano](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
   on your 8xH100 node. Validate the endpoint and measure baseline throughput.
3. **Rephrasing strategies**: Build progressively more sophisticated pipelines:
   a single Wikipedia-style rephrase, structured diverse QA, and finally a
   multi-strategy pipeline combining 4 approaches (QA, distill, knowledge extract,
   knowledge list).
4. **Evaluation**: Assess output quality with GPT-2 perplexity and LLM-as-judge
   scoring before committing to a full generation run.
5. **Scale up**: Run `create()` to generate the full dataset for Part B
   pre-training.

## Environment Setup (0 pts)

Install NeMo Data Designer, verify imports, and configure model access. This section is
fully guided. Follow the cells in order.

In [2]:
# Install data-designer
# If you already have it installed, this will upgrade to the latest version.
%pip install -U data-designer

Note: you may need to restart the kernel to use updated packages.


In [4]:
# Verify imports
import data_designer.config as dd
from data_designer.interface import DataDesigner

print("NeMo Data Designer imported successfully.")

NeMo Data Designer imported successfully.


In [5]:
# Define a custom model provider pointing at the local vLLM server.
# ModelProvider handles endpoint URLs and API keys; ModelConfig references
# the provider by name.
model_providers = [
    dd.ModelProvider(
        name="local-vllm",
        endpoint="http://localhost:8000/v1",
        provider_type="openai",  # vLLM exposes an OpenAI-compatible API
        api_key=None,  # no authentication needed for local server
    )
]

# Initialize NeMo Data Designer with our custom provider.
data_designer = DataDesigner(model_providers=model_providers)

In [6]:
# Configure the model. The provider="local-vllm" tells NeMo Data Designer to route
# requests through the provider we just defined (our local vLLM server).
model_configs = [
    dd.ModelConfig(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",  # must match the served model
        provider="local-vllm",
        alias="rephraser",
        # Disable thinking for rephrasing. Nemotron-3 Nano is a hybrid reasoning
        # model, but for straightforward rephrasing tasks, thinking wastes tokens.
        # We want direct output only. For tasks that benefit from reasoning (complex
        # QA, tool use, multi-step problems), remove extra_body or set
        # enable_thinking=True to let the model use its built-in chain-of-thought.
        inference_parameters=dd.ChatCompletionInferenceParams(
            max_tokens=2048,
            temperature=0.7,
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        ),
    )
]

# NOTE: To use NVIDIA's hosted API instead of a local server, skip the custom
# ModelProvider and use the built-in "nvidia" provider:
#
#   model_configs = [
#       dd.ModelConfig(
#           model="nvidia/nemotron-3-nano-30b-a3b",
#           provider="nvidia",  # uses https://integrate.api.nvidia.com
#           alias="rephraser",
#       )
#   ]
#
# This requires NVIDIA_API_KEY set in your environment. For this lab, we use
# the local endpoint since you have dedicated GPUs.

---

## Inference Serving (10 pts)

Before generating synthetic data, you need a running inference server. This section
walks through booting a model, confirming it serves, and measuring basic throughput.
This is a lightweight introduction; Assignment A4 covers inference optimization in
depth. Here, we just need the server running so NeMo Data Designer can call it.

### Launching the Inference Server

You have a dedicated 8xH100 node with Docker installed. We will run vLLM as a
Docker container. Run these commands in a **separate terminal** (the server runs
as a long-lived process).

**Step 1**: Download the custom reasoning parser plugin. Nemotron-3 Nano is a
**hybrid reasoning model** that can interleave "thinking" tokens (chain-of-thought)
with final response tokens. The reasoning parser separates these two streams so
vLLM can handle them correctly (e.g., stripping thinking tokens from the response,
or exposing them via the API for inspection). The plugin implements the
Nemotron-3-specific parsing logic.

```bash
mkdir -p ~/vllm-plugins
wget -O ~/vllm-plugins/nano_v3_reasoning_parser.py \
    https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
```

**Step 2**: Launch the server with reasoning and tool-call support:

```bash
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v ~/vllm-plugins:/plugins \
    -p 8000:8000 \
    vllm/vllm-openai:v0.14.0 \
    --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --port 8000 \
    --reasoning-parser nano_v3 \
    --reasoning-parser-plugin /plugins/nano_v3_reasoning_parser.py \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```

Key Docker flags:
- `--gpus all`: exposes all 8 H100s to the container (requires `nvidia-container-toolkit`).
  With `--tensor-parallel-size 4`, vLLM shards the model across 4 of them.
- `-v ~/.cache/huggingface:/root/.cache/huggingface`: mounts the Hugging Face model
  cache so weights persist across container restarts.
- `-v ~/vllm-plugins:/plugins`: mounts the directory containing the reasoning parser
  plugin so vLLM can load it from inside the container.
- `-p 8000:8000`: maps the container's port 8000 to the host.

Key vLLM flags:
- `--reasoning-parser nano_v3` / `--reasoning-parser-plugin`: register the custom
  parser that separates thinking tokens from final output in the API response.
- `--enable-auto-tool-choice` / `--tool-call-parser qwen3_coder`: enable structured
  tool calling (used by NeMo Data Designer for structured output / function calling).
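
To make the role of the reasoning parser concrete, here is a minimal sketch of the kind of separation it performs. The `<think>...</think>` delimiters and the `split_reasoning` helper are assumptions for illustration only; the actual delimiters and logic live in the downloaded `nano_v3_reasoning_parser.py` plugin.

```python
import re

# Hypothetical delimiters, assumed for illustration. The real plugin
# defines the tokens that Nemotron-3 Nano actually emits.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, final_answer)."""
    # Collect every thinking span, then remove those spans from the answer.
    reasoning = "\n".join(m.strip() for m in THINK_PATTERN.findall(raw))
    answer = THINK_PATTERN.sub("", raw).strip()
    return reasoning, answer


raw_output = (
    "<think>The user wants a definition.</think>"
    "Synthetic data is generated by a model."
)
reasoning, answer = split_reasoning(raw_output)
print("reasoning:", reasoning)
print("answer:   ", answer)
```

A server-side parser works on the token stream rather than on a finished string, but the outcome is similar: the thinking text and the final answer end up in separate fields of the API response.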

Wait until you see `Uvicorn running on http://0.0.0.0:8000`. This may take several
minutes as the model weights are downloaded and loaded. The server exposes an
OpenAI-compatible API that NeMo Data Designer (and the `openai` Python client) can call
directly.
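
Instead of watching the container logs, you can poll the endpoint until it answers. The sketch below is one possible approach using only the standard library; `wait_for_server` and `fetch_model_ids` are hypothetical helper names, and the ten-minute limit is an arbitrary assumption.

```python
import json
import time
import urllib.request


def fetch_model_ids(url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs listed by an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def wait_for_server(url, fetch=fetch_model_ids, max_wait_s=600.0, poll_s=5.0,
                    sleep=time.sleep):
    """Poll `url` until `fetch` succeeds or `max_wait_s` seconds have passed."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        try:
            return fetch(url)
        except OSError:
            # URLError, ConnectionError, and socket timeouts are all OSError
            # subclasses; the server is probably still loading weights.
            sleep(poll_s)
    raise TimeoutError(f"No response from {url} after {max_wait_s:.0f} s")


# Usage once the container is starting (not executed here):
# model_ids = wait_for_server("http://localhost:8000/v1/models")
# print(model_ids)
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.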

### Alternative servers

**[SGLang](https://github.com/sgl-project/sglang):**
```bash
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
        --port 8000
```

**[TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/):**
```bash
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
    trtllm-serve \
        --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
        --port 8000
```
TRT-LLM requires an engine build step before serving but can offer higher throughput
for production workloads. See the
[TRT-LLM documentation](https://nvidia.github.io/TensorRT-LLM/) for setup details.

All three servers expose OpenAI-compatible endpoints, so the rest of this notebook
works unchanged regardless of which you choose.

### Your Task

1. Launch the vLLM server following the instructions above.
2. Run the health check cell to confirm the server is responding.
3. Send a test request and verify you get a coherent response.
4. Measure sequential throughput (tokens/sec) using the provided prompts.
5. Compute the effective MFU and explain in a sentence or two why it is low at batch size 1.

In [7]:
# Health check: confirm the server is up and serving your model.
import requests

response = requests.get("http://localhost:8000/v1/models")
print(response.json())

{'object': 'list', 'data': [{'id': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16', 'object': 'model', 'created': 1772098939, 'owned_by': 'vllm', 'root': 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16', 'parent': None, 'max_model_len': 262144, 'permission': [{'id': 'modelperm-9ec175c6e984bd13', 'object': 'model_permission', 'created': 1772098939, 'allow_create_engine': False, 'allow_sampling': True, 'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': False}]}]}


In [8]:
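# Test request: send a single chat completion to verify end-to-end.
# NOTE: this request leaves thinking enabled. If `content` prints as None, the
# model most likely spent its whole max_tokens budget on reasoning tokens, which
# the reasoning parser returns separately. Raise max_tokens, or disable thinking
# with extra_body={"chat_template_kwargs": {"enable_thinking": False}}.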
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "What is synthetic data generation?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

None


In [9]:
# Throughput measurement: send requests one at a time and measure tokens/sec.
# This is sequential (batch size 1) to establish a baseline. Real throughput
# improves with concurrent requests and batching (often 5-10x).
# Wall-clock time includes prefill, so this slightly understates pure decode speed.

import time

prompts = [
    "Explain the concept of synthetic data generation in machine learning.",
    "What are the main challenges in training large language models?",
    "Describe the transformer architecture and its key innovations.",
    "How does distributed training work across multiple GPUs?",
    "What is the difference between pre-training and fine-tuning?",
]

total_prompt_tokens = 0
total_completion_tokens = 0
start = time.time()

for prompt in prompts:
    resp = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    total_prompt_tokens += resp.usage.prompt_tokens
    total_completion_tokens += resp.usage.completion_tokens

elapsed = time.time() - start
tokens_per_sec = total_completion_tokens / elapsed

print(f"Requests sent:       {len(prompts)}")
print(f"Prompt tokens:       {total_prompt_tokens}")
print(f"Completion tokens:   {total_completion_tokens}")
print(f"Wall-clock time:     {elapsed:.2f} s")
print(f"Throughput (decode):  {tokens_per_sec:.1f} tokens/sec")

Requests sent:       5
Prompt tokens:       136
Completion tokens:   1280
Wall-clock time:     3.73 s
Throughput (decode):  343.3 tokens/sec
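
The baseline above sends one request at a time. A rough way to see the effect of concurrency is to issue the same requests from a thread pool, as sketched below. `measure_concurrent_throughput` is a hypothetical helper, and `fake_request` is a stand-in so the sketch runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_concurrent_throughput(send_request, prompts, max_workers=8):
    """Run send_request(prompt) -> completion token count across a thread pool.

    Returns (tokens_per_sec, total_completion_tokens, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total / elapsed, total, elapsed


def fake_request(prompt):
    """Stand-in for a real API call: waits briefly, reports 256 tokens."""
    time.sleep(0.05)
    return 256


tps, total, elapsed = measure_concurrent_throughput(fake_request, ["p"] * 8)
print(f"{total} tokens in {elapsed:.2f} s -> {tps:.0f} tokens/sec")
```

To measure the real server, `send_request` could wrap `client.chat.completions.create(...)` and return `resp.usage.completion_tokens`. The server batches concurrent requests internally, so aggregate tokens/sec should rise even though each individual request may take longer.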


### Effective MFU

How efficient is our inference server? We can compare observed throughput against the
GPU's theoretical peak to estimate **effective MFU** (Model FLOPs Utilization).

For a dense transformer, each token requires approximately **2P FLOPs** in the
forward pass, where P is the number of parameters. Nemotron-3 Nano is a hybrid
Mamba+MoE+Attention model, so the estimate uses its **active parameters** (~3.2B)
rather than its total parameters (31.6B): only a fraction of the experts is activated
per token. Multiplying observed tokens/sec by FLOPs per token gives observed TFLOPS,
which we compare against the H100's dense FP16/BF16 peak of 989 TFLOPS.

At batch size 1, expect low MFU because autoregressive decode is **memory-bandwidth-bound**,
not compute-bound. The GPU spends most of its time loading weights from HBM rather
than doing arithmetic. We will revisit this in the Cost of Generation section.
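
One way to see the memory-bandwidth limit is to ask how fast decode could run if the only cost were streaming the active weights from HBM once per token. The sketch below is a back-of-the-envelope bound, not a measurement. It assumes 3.35 TB/s of HBM bandwidth per H100, perfect scaling across the 4 tensor-parallel GPUs, and no KV-cache or communication traffic.

```python
# Inputs taken from this notebook.
active_params = 3.2e9        # active parameters per token
bytes_per_param = 2          # bf16
num_gpus = 4                 # tensor-parallel size
observed_tokens_per_sec = 343.3

# Assumed hardware figure: per-GPU HBM bandwidth in bytes/sec.
hbm_bandwidth = 3.35e12

# At batch size 1, every decoded token must read the active weights once.
bytes_per_token = active_params * bytes_per_param
ceiling_tokens_per_sec = num_gpus * hbm_bandwidth / bytes_per_token

print(f"Bandwidth-bound ceiling: {ceiling_tokens_per_sec:,.0f} tokens/sec")
print(f"Observed / ceiling:      {observed_tokens_per_sec / ceiling_tokens_per_sec:.1%}")
```

Under these assumptions the observed throughput is a double-digit percentage of the bandwidth ceiling, while MFU is a small fraction of one percent. That gap is what "memory-bandwidth-bound" means in practice.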

In [10]:
# Effective MFU estimation

active_params = 3.2e9          # Nemotron-3 Nano: 3.2B active parameters (MoE)
flops_per_token = 2 * active_params
num_gpus = 4                   # tensor-parallel-size from our vLLM launch command
h100_peak_tflops = 989         # per-GPU FP16 peak TFLOPS

observed_tflops = (tokens_per_sec * flops_per_token) / 1e12
total_peak_tflops = h100_peak_tflops * num_gpus
mfu = observed_tflops / total_peak_tflops

print(f"Active parameters:    {active_params / 1e9:.1f} B")
print(f"FLOPs per token:      {flops_per_token:.2e}")
print(f"Observed throughput:  {tokens_per_sec:.1f} tokens/sec")
print(f"Observed TFLOPS:      {observed_tflops:.2f}")
print(f"Peak TFLOPS ({num_gpus} GPUs): {total_peak_tflops:.0f}")
print(f"Effective MFU:        {mfu:.4f}  ({mfu * 100:.2f}%)")

Active parameters:    3.2 B
FLOPs per token:      6.40e+09
Observed throughput:  343.3 tokens/sec
Observed TFLOPS:      2.20
Peak TFLOPS (4 GPUs): 3956
Effective MFU:        0.0006  (0.06%)


---

## Seed Data (5 pts)

Load the curated Wikipedia corpus you produced in Part a (the NeMo Curator lab).
This becomes the **seed data** for all rephrasing. Every synthetic document will
be derived from one of these source articles. We will also configure it as a NeMo Data
Designer seed source for use in later sections.

### Your Task

1. Load your Part a Curator output from `wiki_curated_domain_data/`. If your Curator output is unavailable, use the provided sample data.
2. Report: number of documents, text length statistics (mean, median, min, max), approximate total tokens, and category breakdown.
3. Configure a `LocalFileSeedSource` pointing at your data.
4. Examine 2-3 documents and comment on what makes them good or poor candidates for rephrasing.

In [None]:
# Load and inspect the Curator output.
# The Curator lab writes JSONL files to wiki_curated_domain_data/.
# Load all JSONL files into a pandas DataFrame.

import glob
import pandas as pd

jsonl_files = sorted(glob.glob("wiki_curated_domain_data/*.jsonl"))
print(f"Found {len(jsonl_files)} JSONL files: {jsonl_files}")

seed_df = pd.concat([pd.read_json(f, lines=True) for f in jsonl_files], ignore_index=True)
print(f"Loaded {len(seed_df)} documents")
print(f"Columns: {list(seed_df.columns)}")
seed_df.head()

Found 3 JSONL files: ['wiki_curated_domain_data/8dff7abad519.jsonl', 'wiki_curated_domain_data/acef2699d502.jsonl', 'wiki_curated_domain_data/seed.jsonl']
Loaded 18 documents
Columns: ['text', 'title', 'id', 'url', 'language', 'source_id', 'file_name', 'categories']


| | text | title | id | url | language | source_id | file_name | categories |
|---|---|---|---|---|---|---|---|---|
| 0 | Interoperability is a characteristic of a prod... | Interoperability | 41285 | https://en.wikipedia.org/wiki/Interoperability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Interoperability, Computing terminology, Tele... |
| 1 | In communications and computing, a machine-rea... | Machine-readable medium and data | 41341 | https://en.wikipedia.org/wiki/Machine-readable... | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Machine-readable medium and data, Computing t... |
| 2 | Maintainability is the ease of maintaining or ... | Maintainability | 41347 | https://en.wikipedia.org/wiki/Maintainability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Maintainability, Telecommunications engineeri... |
| 3 | In telecommunications and data storage, Manche... | Manchester code | 41350 | https://en.wikipedia.org/wiki/Manchester%20code | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Manchester code, Line codes, Department of Co... |
| 4 | The μ-law algorithm (sometimes written mu-law,... | Mu-law algorithm | 41382 | https://en.wikipedia.org/wiki/Mu-law%20algorithm | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Mu-law algorithm, Audio codecs, ITU-T recomme... |


In [12]:
# Text length distribution: how long are the curated articles?
# Report: number of documents, mean/median/min/max text length.

text_lengths = seed_df["text"].str.len()

print(f"Number of documents: {len(seed_df)}")
print(f"Mean text length:    {text_lengths.mean():.0f} chars")
print(f"Median text length:  {text_lengths.median():.0f} chars")
print(f"Min text length:     {text_lengths.min()} chars")
print(f"Max text length:     {text_lengths.max()} chars")
print(f"Std text length:     {text_lengths.std():.0f} chars")

Number of documents: 18
Mean text length:    8038 chars
Median text length:  4734 chars
Min text length:     947 chars
Max text length:     29874 chars
Std text length:     10173 chars


In [13]:
# Rough token count estimate (chars / 4 is a common approximation for English text).
# Report the approximate total corpus size in tokens.

total_chars = text_lengths.sum()
approx_tokens = total_chars / 4

print(f"Total characters:       {total_chars:,}")
print(f"Approximate tokens:     {approx_tokens:,.0f}")
print(f"Avg tokens per document: {approx_tokens / len(seed_df):,.0f}")

Total characters:       144,675
Approximate tokens:     36,169
Avg tokens per document: 2,009


In [14]:
# Category breakdown: what domains are represented?
# If your data has a "categories" column, report unique categories and their counts.

if "categories" in seed_df.columns:
    all_cats = seed_df["categories"].explode()
    cat_counts = all_cats.value_counts()
    print(f"Unique categories: {len(cat_counts)}")
    print(f"\nTop categories:")
    print(cat_counts.head(20).to_string())
else:
    print("No 'categories' column found in the dataset.")

Unique categories: 28

Top categories:
categories
Computing terminology                                       6
Telecommunications engineering                              6
Interoperability                                            3
Product testing                                             3
Machine-readable medium and data                            3
Storage media                                               3
Optical character recognition                               3
Maintainability                                             3
Design for X                                                3
Maintenance                                                 3
Software quality                                            3
Broad-concept articles                                      3
Manchester code                                             3
Line codes                                                  3
Department of Computer Science, University of Manchester    3
History of computing

If your Curator output is unavailable, a small sample dataset is provided in
`wiki_curated_domain_data/` for testing. It uses the same JSONL format and schema
as the Curator pipeline output, so all downstream code works unchanged.

In [15]:
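# Configure the seed source for NeMo Data Designer.
# This tells NeMo Data Designer where to read source documents from. In later sections,
# prompt templates will reference seed columns like {{ text }} and {{ title }}.
# NOTE: seed.jsonl is written into the same directory that the loading cell globs,
# so re-running that cell reads it back in and can duplicate documents. If your
# document or category counts look inflated, deduplicate (e.g., on "id") first.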

from data_designer.config import LocalFileSeedSource

seed_path = "wiki_curated_domain_data/seed.jsonl"
seed_df.to_json(seed_path, orient="records", lines=True)

seed_source = LocalFileSeedSource(path=seed_path)
print(f"Seed source configured: {seed_source}")
print(f"Path: {seed_source.path}")

Seed source configured: seed_type='local' path='wiki_curated_domain_data/seed.jsonl'
Path: wiki_curated_domain_data/seed.jsonl
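
Because the prompt templates depend on specific seed columns, it can help to confirm that every record actually has them before running a pipeline. The sketch below is one way to do that with the standard library. `check_seed_columns` is a hypothetical helper, and the demo records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def check_seed_columns(path, required=("text", "title")):
    """Return (line_number, problem) tuples for records missing required columns."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for col in required:
                value = record.get(col)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_no, f"missing or empty '{col}'"))
    return problems


# Demo on a throwaway file; the second record has an empty "text" field.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_seed.jsonl"
    demo_path.write_text(
        json.dumps({"title": "Manchester code", "text": "A line code."}) + "\n"
        + json.dumps({"title": "Empty stub", "text": ""}) + "\n",
        encoding="utf-8",
    )
    demo_problems = check_seed_columns(demo_path)

print(demo_problems)
```

On your own data, `check_seed_columns(seed_path)` should return an empty list if every record can fill both `{{ text }}` and `{{ title }}`.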


In [16]:
# Explore: examine a few documents in detail. What makes good vs. poor
# candidates for rephrasing? Look at text quality and information density.

for i in range(min(3, len(seed_df))):
    row = seed_df.iloc[i]
    print(f"{'=' * 80}")
    print(f"Document {i}: {row['title']}")
    print(f"Length: {len(row['text'])} chars (~{len(row['text']) // 4} tokens)")
    if "categories" in seed_df.columns:
        print(f"Categories: {row['categories']}")
    print(f"{'-' * 80}")
    print(row["text"][:800])
    print("...\n" if len(row["text"]) > 800 else "\n")

print("=" * 80)
print("Rephrasing suitability:")
print("- Good candidates: long, information-dense articles with clear factual content")
print("  and well-structured sections. These give the LLM rich material to rephrase.")
print("- Poor candidates: very short stubs, list-heavy articles with little prose,")
print("  or articles dominated by tables/data that lose meaning when rephrased.")

Document 0: Interoperability
Length: 29874 chars (~7468 tokens)
Categories: ['Interoperability', 'Computing terminology', 'Telecommunications engineering', 'Product testing']
--------------------------------------------------------------------------------
Interoperability is a characteristic of a product or system to work with other products or systems. While the term was initially defined for information technology or systems engineering services to allow for information exchange, a broader definition takes into account social, political, and organizational factors that impact system-to-system performance.

Types of interoperability include syntactic interoperability, where two systems can communicate with each other, and cross-domain interoperability, where multiple organizations work together and exchange information.

Types

If two or more systems use common data formats and communication protocols, then they are capable of communicating with each other, and they exhibit syntactic 

---

## Cost of Generation (5 pts + bonus)

Before generating synthetic data at scale, estimate the computational cost.
Compare two model architectures and build intuition for how architecture
choices affect the cost of synthetic data generation.

### Your Task

1. Estimate the FLOPs and GPU-hours for rephrasing your dataset with: (a) an 8B dense model and (b) a 30B-total, 3B-active MoE.
2. Answer: Which model has the lowest FLOPs/token? Which has the highest memory requirement? When would you choose the dense model over the MoE?

**Bonus (5 pts):** Compute the critical batch size at which the MoE becomes compute-bound on H100. Show your arithmetic intensity calculation.

In [17]:
# Model comparison: estimate generation cost for dense vs. MoE.

num_docs = len(seed_df)
total_input_tokens = int(approx_tokens)
total_output_tokens = total_input_tokens  # assume ~1:1 ratio for rephrasing
total_tokens = total_input_tokens + total_output_tokens

# --- (a) 8B Dense Model ---
dense_params = 8e9
dense_flops_per_token = 2 * dense_params
dense_total_flops = total_tokens * dense_flops_per_token

# --- (b) 30B-total, 3B-active MoE (Nemotron-3 Nano) ---
moe_total_params = 30e9
moe_active_params = 3.2e9
moe_flops_per_token = 2 * moe_active_params
moe_total_flops = total_tokens * moe_flops_per_token

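# GPU-hours: assume 4x H100, realistic MFU ~30% for batched inference.
# NOTE: dividing total FLOPs by the node's aggregate FLOPs/sec gives wall-clock
# hours on the 4-GPU node; multiply by num_gpus to convert to GPU-hours.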
num_gpus = 4
h100_peak_tflops = 989
total_peak_flops_per_sec = num_gpus * h100_peak_tflops * 1e12
assumed_mfu = 0.30
effective_flops_per_sec = total_peak_flops_per_sec * assumed_mfu

dense_gpu_hours = dense_total_flops / effective_flops_per_sec / 3600
moe_gpu_hours = moe_total_flops / effective_flops_per_sec / 3600

# Memory requirements (bf16 = 2 bytes per param, weights only)
dense_mem_gb = dense_params * 2 / 1e9
moe_mem_gb = moe_total_params * 2 / 1e9  # must load ALL params into memory

print(f"Dataset: {num_docs} docs, ~{total_input_tokens:,} input tokens, ~{total_output_tokens:,} output tokens")
print(f"{'=' * 60}")
print(f"{'Metric':<30} {'8B Dense':>14} {'30B MoE (3B active)':>20}")
print(f"{'-' * 60}")
print(f"{'FLOPs/token':<30} {dense_flops_per_token:.2e} {moe_flops_per_token:>20.2e}")
print(f"{'Total FLOPs':<30} {dense_total_flops:.2e} {moe_total_flops:>20.2e}")
print(f"{'GPU-hours (4xH100, 30% MFU)':<30} {dense_gpu_hours:>14.6f} {moe_gpu_hours:>20.6f}")
print(f"{'Memory (bf16 weights)':<30} {dense_mem_gb:>13.1f}G {moe_mem_gb:>19.1f}G")
print(f"{'FLOPs/token ratio vs dense':<30} {'1.0x':>14} {moe_flops_per_token/dense_flops_per_token:>19.2f}x")
print()
print("Key takeaways:")
print(f"- The MoE uses {moe_flops_per_token/dense_flops_per_token:.1%} of the dense model's FLOPs/token")
print(f"  ({moe_active_params/1e9:.1f}B active vs {dense_params/1e9:.0f}B), so it's ~{dense_flops_per_token/moe_flops_per_token:.1f}x cheaper per token.")
print(f"- But the MoE needs {moe_mem_gb:.0f} GB vs {dense_mem_gb:.0f} GB for weights, {moe_mem_gb/dense_mem_gb:.1f}x more memory.")
print(f"- Choose the dense model when memory is constrained (e.g., fewer/smaller GPUs)")
print(f"  or when you need simpler serving (no expert routing overhead).")
print(f"  Choose the MoE when you have enough memory and want lower cost per token.")

Dataset: 18 docs, ~36,168 input tokens, ~36,168 output tokens
Metric                               8B Dense  30B MoE (3B active)
------------------------------------------------------------
FLOPs/token                    1.60e+10             6.40e+09
Total FLOPs                    1.16e+15             4.63e+14
GPU-hours (4xH100, 30% MFU)          0.000271             0.000108
Memory (bf16 weights)                   16.0G                60.0G
FLOPs/token ratio vs dense               1.0x                0.40x

Key takeaways:
- The MoE uses 40.0% of the dense model's FLOPs/token
  (3.2B active vs 8B), so it's ~2.5x cheaper per token.
- But the MoE needs 60 GB vs 16 GB for weights, 3.8x more memory.
- Choose the dense model when memory is constrained (e.g., fewer/smaller GPUs)
  or when you need simpler serving (no expert routing overhead).
  Choose the MoE when you have enough memory and want lower cost per token.


In [18]:
# Bonus: Arithmetic intensity analysis
# Find the critical batch size where MoE decode becomes compute-bound on H100.

# H100 specs (per GPU)
h100_compute = 989e12      # FP16 peak FLOPS
h100_bandwidth = 3.35e12   # HBM bandwidth in bytes/sec

# Machine balance point: the arithmetic intensity (FLOPs/byte) where
# compute and memory bandwidth are both fully utilized.
machine_balance = h100_compute / h100_bandwidth
print(f"H100 machine balance: {machine_balance:.1f} FLOPs/byte")

# During autoregressive decode at batch size B:
#   Compute = B * 2 * P_active FLOPs  (one forward pass per token, B tokens in parallel)
#   Memory  = P_active * 2 bytes      (load active weights once from HBM, bf16)
#
# Arithmetic intensity = compute / memory = (B * 2 * P_active) / (P_active * 2) = B
#
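# The arithmetic intensity equals the batch size! This is because each weight
# element is loaded once but used B times (once per sequence in the batch).
#
# Simplification: this treats the MoE like a dense 3.2B model. With B > 1,
# different tokens can route to different experts, so the bytes loaded per step
# grow toward the full 30B weights and the true critical batch size is higher.
# KV-cache reads are also ignored.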

print(f"\nFor MoE decode (P_active = {moe_active_params/1e9:.1f}B):")
print(f"  Compute per step = B × 2 × {moe_active_params:.2e} = B × {2*moe_active_params:.2e} FLOPs")
print(f"  Memory per step  = {moe_active_params:.2e} × 2 bytes = {moe_active_params*2:.2e} bytes")
print(f"  Arithmetic intensity = B FLOPs/byte")
print()

critical_batch = int(machine_balance)
print(f"Critical batch size = machine balance = {critical_batch}")
print(f"\n  B < {critical_batch}: memory-bandwidth-bound (GPU waits on HBM reads)")
print(f"  B > {critical_batch}: compute-bound (GPU arithmetic units are the bottleneck)")
print(f"\nAt batch size 1, we achieve only ~1/{critical_batch} of peak compute utilization,")
print(f"which explains the very low MFU observed in the throughput measurement above.")

H100 machine balance: 295.2 FLOPs/byte

For MoE decode (P_active = 3.2B):
  Compute per step = B × 2 × 3.20e+09 = B × 6.40e+09 FLOPs
  Memory per step  = 3.20e+09 × 2 bytes = 6.40e+09 bytes
  Arithmetic intensity = B FLOPs/byte

Critical batch size = machine balance = 295

  B < 295: memory-bandwidth-bound (GPU waits on HBM reads)
  B > 295: compute-bound (GPU arithmetic units are the bottleneck)

At batch size 1, we achieve only ~1/295 of peak compute utilization,
which explains the very low MFU observed in the throughput measurement above.
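
The estimate above loads only the active weights at every batch size. For an MoE, a larger batch touches more experts, so more bytes are read per step. The sketch below shows how that changes the answer under a simple model. The expert count (`num_experts = 128`) and routing width (`top_k = 6`) are hypothetical values chosen for illustration, routing is assumed uniform and independent across tokens, and KV-cache traffic is ignored.

```python
# Values from this notebook.
p_total = 30e9                        # total parameters
p_active = 3.2e9                      # active parameters per token
machine_balance = 989e12 / 3.35e12    # H100 FLOPs per byte

# Hypothetical routing configuration (assumed for illustration only).
num_experts = 128
top_k = 6

# Split parameters into a shared part and equal-sized experts so that
# shared + top_k * expert = p_active and shared + num_experts * expert = p_total.
expert_params = (p_total - p_active) / (num_experts - top_k)
shared_params = p_active - top_k * expert_params


def bytes_loaded(batch_size: int) -> float:
    """Expected bf16 weight bytes read per decode step."""
    # Probability that a given expert is chosen by at least one token.
    frac_touched = 1.0 - (1.0 - top_k / num_experts) ** batch_size
    return 2.0 * (shared_params + num_experts * expert_params * frac_touched)


def arithmetic_intensity(batch_size: int) -> float:
    """FLOPs per byte for one decode step at the given batch size."""
    return 2.0 * batch_size * p_active / bytes_loaded(batch_size)


critical_batch = 1
while arithmetic_intensity(critical_batch) < machine_balance:
    critical_batch += 1

print(f"Intensity at B=1:      {arithmetic_intensity(1):.2f} FLOPs/byte")
print(f"Routing-aware critical batch size: {critical_batch}")
```

Under these assumptions the critical batch size comes out roughly an order of magnitude larger than the dense-style estimate, because at large batch sizes nearly all 30B parameters are read each step while each token still performs only 3.2B parameters' worth of arithmetic.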


---

## Single-Strategy Rephrasing (Guided, 15 pts)

Time to use NeMo Data Designer. The core pattern is straightforward: seed documents go in,
rephrased documents come out. An LLM text column takes a prompt template, fills in
seed columns via Jinja2, and generates new text.

The Nemotron-CC paper (Su et al., 2024) defines several named rephrasing
strategies: **Open Rewrite** (Wikipedia-style), **Diverse QA**, **Knowledge
Extraction**, **Knowledge List**, and others. Each strategy transforms source
documents into a different format, surfacing different aspects of the content.
Even simple rephrasing of low-quality web documents improves pre-training data
quality. The key is the prompt that controls the style, tone, and structure of the
output. Here we implement the **Wikipedia-style rephrase** (Open Rewrite) strategy
as a guided introduction to the NeMo Data Designer pipeline.
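
To illustrate what "fills in seed columns" means, the sketch below substitutes `{{ column }}` placeholders from a single seed record. It is a deliberately small regex stand-in, not Jinja2 and not the NeMo Data Designer implementation; `render` is a hypothetical helper and the record is made up.

```python
import re

# Matches placeholders such as {{ title }} or {{text}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, record: dict) -> str:
    """Replace each {{ name }} placeholder with record[name]."""
    def substitute(match: re.Match) -> str:
        # A missing column raises KeyError, which surfaces template typos early.
        return str(record[match.group(1)])
    return PLACEHOLDER.sub(substitute, template)


template = "Title: {{ title }}\n\nOriginal document:\n{{ text }}"
record = {"title": "Manchester code", "text": "A line code used in telecommunications."}
prompt = render(template, record)
print(prompt)
```

Each seed row produces one rendered prompt, and the model's completion for that prompt becomes the new column value for the row.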

### Your Task

1. Implement the **Wikipedia-style rephrase** strategy from Nemotron-CC (also referred to as "Open Rewrite" in the paper). This strategy rewrites each source document as a clean encyclopedia entry. Build the pipeline using `LLMTextColumnConfig`.
2. Run `preview()` on 3 records and inspect the output. Compare original vs. rephrased text for at least 2 documents.
3. Estimate the FLOPs cost of running this strategy across your full dataset.
4. In 2-3 sentences: Is information preserved, lost, or added? How does the structure change?
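
For the FLOPs estimate in step 3, one detail is easy to miss: the model config sets `max_tokens=2048`, so the output for a long article is capped even if the source is much longer. The sketch below is one way to account for that. `estimate_strategy_flops` is a hypothetical helper, the 80-token prompt overhead is an assumed figure for the instruction text, and the three document lengths are the min, median, and max reported earlier.

```python
def estimate_strategy_flops(doc_char_lengths, active_params=3.2e9, max_tokens=2048,
                            prompt_overhead_tokens=80, chars_per_token=4):
    """Estimate (input_tokens, output_tokens, total_flops) for one rephrasing pass."""
    total_in = 0
    total_out = 0
    for n_chars in doc_char_lengths:
        doc_tokens = n_chars // chars_per_token
        # Input = source document plus the fixed instruction text (assumed size).
        total_in += doc_tokens + prompt_overhead_tokens
        # Output is roughly as long as the source, but never above max_tokens.
        total_out += min(doc_tokens, max_tokens)
    # About 2 FLOPs per active parameter per token, for prefill and decode alike.
    total_flops = 2 * active_params * (total_in + total_out)
    return total_in, total_out, total_flops


# Shortest, median, and longest documents from the length statistics above.
tokens_in, tokens_out, flops = estimate_strategy_flops([947, 4734, 29874])
print(f"Input tokens:  {tokens_in:,}")
print(f"Output tokens: {tokens_out:,}")
print(f"Total FLOPs:   {flops:.2e}")
```

Passing `seed_df["text"].str.len()` instead of the three sample lengths gives the estimate for the full dataset. The cap also bears on step 4: a source of roughly 7,000 tokens cannot be fully preserved in a 2,048-token rewrite.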

In [19]:
# Build a single-strategy rephrasing pipeline.
# Each pipeline gets its own config builder so we can compare outputs later.

from data_designer.config import DataDesignerConfigBuilder, LLMTextColumnConfig

single_strategy = DataDesignerConfigBuilder(model_configs=model_configs)
single_strategy = single_strategy.with_seed_dataset(seed_source)

In [None]:
# Add the rephrasing column. The prompt instructs the model to rewrite the source
# document as a clean Wikipedia-style encyclopedia entry, preserving facts while
# improving clarity and structure.

wiki_rephrase_prompt = """\
Rewrite the following document as a clean, well-structured Wikipedia-style encyclopedia entry. \
Preserve all factual information from the original text. Improve clarity, organization, and \
readability. Use a neutral, encyclopedic tone. Add section headings where appropriate. \
Do not add information that is not in the original document. Output only the rewritten article.

Title: {{ title }}

Original document:
{{ text }}
"""

single_strategy = single_strategy.add_column(
    LLMTextColumnConfig(
        name="wiki_rephrase",
        prompt=wiki_rephrase_prompt,
        model_alias="rephraser",
    )
)
print(single_strategy)

DataDesignerConfigBuilder(
    seed_dataset: local seed
    llm_text_columns: ['wiki_rephrase']
)


In [None]:
# Preview generates a small sample so we can inspect quality before scaling up.

single_preview = data_designer.preview(single_strategy, num_records=3)
single_preview_df = single_preview.dataset
print(f"Preview shape: {single_preview_df.shape}")
print(f"Columns: {list(single_preview_df.columns)}")
single_preview_df.head()

[09:43:04] [INFO] 🖼️ Preview generation in progress
[09:43:05] [INFO] ✅ Validation passed
[09:43:05] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[09:43:05] [INFO] 🩺 Running health checks for models...
[09:43:05] [INFO]   |-- 👀 Checking 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16' in provider named 'local-vllm' for model alias 'rephraser'...
[09:43:05] [INFO]   |-- ✅ Passed!
[09:43:06] [INFO] 🌱 Sampling 3 records from seed dataset
[09:43:06] [INFO]   |-- seed dataset size: 18 records
[09:43:06] [INFO]   |-- sampling strategy: ordered
[09:43:06] [INFO] 📝 llm-text model config for column 'wiki_rephrase'
[09:43:06] [INFO]   |-- model: 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'
[09:43:06] [INFO]   |-- model alias: 'rephraser'
[09:43:06] [INFO]   |-- model provider: 'local-vllm'
[09:43:06] [INFO]   |-- inference parameters:
[09:43:06] [INFO]   |  |-- generation_type=chat-completion
[09:43:06] [INFO]   |  |-- max_parallel_requests=4
[09:43:06] [IN

Preview shape: (3, 9)
Columns: ['text', 'title', 'id', 'url', 'language', 'source_id', 'file_name', 'categories', 'wiki_rephrase']


| | text | title | id | url | language | source_id | file_name | categories | wiki_rephrase |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Interoperability is a characteristic of a prod... | Interoperability | 41285 | https://en.wikipedia.org/wiki/Interoperability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Interoperability, Computing terminology, Tele... | `Interoperability \n\nInteroperability is a pr...` |
| 1 | In communications and computing, a machine-rea... | Machine-readable medium and data | 41341 | https://en.wikipedia.org/wiki/Machine-readable... | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Machine-readable medium and data, Computing t... | `**Machine-readable medium and data**\n\nA mach...` |
| 2 | Maintainability is the ease of maintaining or ... | Maintainability | 41347 | https://en.wikipedia.org/wiki/Maintainability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Maintainability, Telecommunications engineeri... | `**Maintainability**\n\nMaintainability is the ...` |


In [None]:
# Compare original vs. rephrased text for the first two documents

for i in range(min(2, len(single_preview_df))):
    row = single_preview_df.iloc[i]
    print(f"{'=' * 80}")
    print(f"Document {i}: {row['title']}")
    print(f"\n--- ORIGINAL (first 500 chars) ---")
    print(row["text"][:500])
    print(f"\n--- REPHRASED (first 500 chars) ---")
    print(row["wiki_rephrase"][:500])
    print()

Document 0: Interoperability

--- ORIGINAL (first 500 chars) ---
Interoperability is a characteristic of a product or system to work with other products or systems. While the term was initially defined for information technology or systems engineering services to allow for information exchange, a broader definition takes into account social, political, and organizational factors that impact system-to-system performance.

Types of interoperability include syntactic interoperability, where two systems can communicate with each other, and cross-domain interopera

--- REPHRASED (first 500 chars) ---
Interoperability  

Interoperability is a property of a product or system that enables it to work seamlessly with other products or systems. Initially defined in the context of information technology and systems engineering to facilitate information exchange, the concept has broadened to include social, political, and organizational dimensions that influence system-to-system performance.  

Int

In [23]:
# Estimate FLOPs for single-strategy rephrasing of the full dataset.
# Token counts come from the seed corpus (approx_tokens / number of documents);
# the rephrased output is assumed to be about as long as the input.

avg_input_tokens = int(approx_tokens / len(seed_df))
avg_output_tokens = avg_input_tokens  # rephrased output roughly matches input length
tokens_per_doc = avg_input_tokens + avg_output_tokens

single_total_flops = len(seed_df) * tokens_per_doc * moe_flops_per_token
single_gpu_hours = single_total_flops / effective_flops_per_sec / 3600

print(f"Avg input tokens/doc:   ~{avg_input_tokens}")
print(f"Avg output tokens/doc:  ~{avg_output_tokens}")
print(f"Total tokens (all docs): ~{len(seed_df) * tokens_per_doc:,}")
print(f"FLOPs/token (3.2B MoE):  {moe_flops_per_token:.2e}")
print(f"Total FLOPs:             {single_total_flops:.2e}")
print(f"GPU-hours (4xH100):      {single_gpu_hours:.6f}")
print()
print("Information is largely preserved but restructured: the rephrase adds")
print("section headings, smooths prose, and removes informal/redundant phrasing.")
print("Some granular detail may be lost (e.g., niche examples), while the overall")
print("structure becomes more encyclopedic and readable.")

Avg input tokens/doc:   ~2009
Avg output tokens/doc:  ~2009
Total tokens (all docs): ~72,324
FLOPs/token (3.2B MoE):  6.40e+09
Total FLOPs:             4.63e+14
GPU-hours (4xH100):      0.000108

Information is largely preserved but restructured: the rephrase adds
section headings, smooths prose, and removes informal/redundant phrasing.
Some granular detail may be lost (e.g., niche examples), while the overall
structure becomes more encyclopedic and readable.


Explore: How does the rephrased text differ from the original? Is information
preserved? Lost? Added?
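
The estimate above can be reproduced outside the notebook with a small helper. This is a minimal sketch, not part of NeMo Data Designer: `estimate_generation_cost` and `CostEstimate` are hypothetical names, the rule of 2 FLOPs per active parameter per token is chosen to match the 6.40e+09 FLOPs/token printed above, and the 989 TFLOP/s peak per H100 is an assumption (dense BF16) that happens to reproduce the printed result at 30% utilization. The sketch also reports wall-clock hours and GPU-hours separately, since the two differ by the GPU count.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    total_tokens: int
    total_flops: float
    wall_clock_hours: float  # elapsed time on the whole set of GPUs
    gpu_hours: float         # wall_clock_hours * num_gpus


def estimate_generation_cost(
    num_docs: int,
    input_tokens_per_doc: int,
    output_tokens_per_doc: int,
    active_params: float = 3.2e9,        # active parameters per token (MoE)
    num_gpus: int = 4,
    peak_flops_per_gpu: float = 989e12,  # ASSUMPTION: H100 dense BF16 peak
    mfu: float = 0.30,                   # assumed model FLOPs utilization
) -> CostEstimate:
    """Forward-pass cost of generation: ~2 FLOPs per active parameter per token."""
    flops_per_token = 2 * active_params
    total_tokens = num_docs * (input_tokens_per_doc + output_tokens_per_doc)
    total_flops = total_tokens * flops_per_token
    aggregate_flops_per_sec = num_gpus * peak_flops_per_gpu * mfu
    wall_clock_hours = total_flops / aggregate_flops_per_sec / 3600
    return CostEstimate(
        total_tokens, total_flops, wall_clock_hours, wall_clock_hours * num_gpus
    )


# Numbers taken from the cell output above: 18 documents, ~2009 tokens in and out.
est = estimate_generation_cost(
    num_docs=18, input_tokens_per_doc=2009, output_tokens_per_doc=2009
)
print(f"Total tokens:      {est.total_tokens:,}")
print(f"Total FLOPs:       {est.total_flops:.2e}")
print(f"Wall-clock hours:  {est.wall_clock_hours:.6f}")
print(f"GPU-hours:         {est.gpu_hours:.6f}")
```

Under these assumptions the wall-clock figure matches the value printed by the notebook, which suggests the notebook's number is elapsed time on four GPUs; multiply by the GPU count if you need GPU-hours for budgeting.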

---

## Diverse-QA Generation (Scaffolded, 20 pts)

Nemotron-CC (Parmar et al., 2024) showed that diverse QA pairs derived from source
documents are effective for pre-training data augmentation. Instead of a single
rephrasing style, we generate multiple question types (yes/no, open-ended, and
multiple-choice), all grounded in the source material.

This section introduces **structured outputs**: NeMo Data Designer can enforce a Pydantic
schema on the LLM output, guaranteeing that the generated JSON conforms to a
well-defined structure. This is essential for QA pairs where we need typed fields
(question, answer, question type, optional choices).

### Your Task

1. Define a Pydantic schema for structured QA output. Your schema must support at least 3 question types (yes/no, open-ended, multiple-choice) with typed fields for question, answer, question type, and optional choices.
2. Build a Diverse-QA pipeline using `LLMStructuredColumnConfig` with your schema.
3. Add an `ExpressionColumnConfig` that formats the structured QA into a training-ready text string.
4. Run `preview()` on 3 records. Inspect the generated QA pairs: Are the questions grounded in the source text? Is the question type distribution diverse?
5. Estimate the per-document and total FLOPs for this strategy and compare to single-strategy rephrasing.

In [24]:
# Define the Pydantic schema for structured QA output.
# NeMo Data Designer will enforce this schema on the LLM response, parsing and validating
# the JSON automatically.

from pydantic import BaseModel, Field
from typing import Literal

class QAPair(BaseModel):
    question: str = Field(description="A question grounded in the source document.")
    answer: str = Field(description="The answer to the question, based on the source document.")
    question_type: Literal["yes_no", "open_ended", "multiple_choice"] = Field(
        description="The type of question."
    )
    choices: list[str] | None = Field(
        default=None,
        description="Answer choices for multiple-choice questions. None for other types.",
    )

class DiverseQAOutput(BaseModel):
    qa_pairs: list[QAPair] = Field(
        description="A list of diverse QA pairs derived from the source document.",
        min_length=3,
    )

print("Schema defined:")
print(DiverseQAOutput.model_json_schema())

Schema defined:
{'$defs': {'QAPair': {'properties': {'question': {'description': 'A question grounded in the source document.', 'title': 'Question', 'type': 'string'}, 'answer': {'description': 'The answer to the question, based on the source document.', 'title': 'Answer', 'type': 'string'}, 'question_type': {'description': 'The type of question.', 'enum': ['yes_no', 'open_ended', 'multiple_choice'], 'title': 'Question Type', 'type': 'string'}, 'choices': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'description': 'Answer choices for multiple-choice questions. None for other types.', 'title': 'Choices'}}, 'required': ['question', 'answer', 'question_type'], 'title': 'QAPair', 'type': 'object'}}, 'properties': {'qa_pairs': {'description': 'A list of diverse QA pairs derived from the source document.', 'items': {'$ref': '#/$defs/QAPair'}, 'minItems': 3, 'title': 'Qa Pairs', 'type': 'array'}}, 'required': ['qa_pairs'], 'title': 'DiverseQAO

In [25]:
# Build the Diverse QA pipeline.

from data_designer.config import LLMStructuredColumnConfig, ExpressionColumnConfig

diverse_qa_prompt = """\
You are given a source document. Generate exactly 3 diverse question-answer pairs grounded \
in the document. You MUST include at least one of each type:
- A yes/no question (question_type: "yes_no")
- An open-ended question (question_type: "open_ended")
- A multiple-choice question with 4 choices (question_type: "multiple_choice", include choices)

All answers must be directly supported by the source text. Do not invent facts.

Title: {{ title }}

Source document:
{{ text }}
"""

qa_strategy = DataDesignerConfigBuilder(model_configs=model_configs)
qa_strategy = qa_strategy.with_seed_dataset(seed_source)
qa_strategy = qa_strategy.add_column(
    LLMStructuredColumnConfig(
        name="diverse_qa",
        prompt=diverse_qa_prompt,
        model_alias="rephraser",
        output_format=DiverseQAOutput,
    )
)
print(qa_strategy)

DataDesignerConfigBuilder(
    seed_dataset: local seed
    llm_structured_columns: ['diverse_qa']
)


In [26]:
# Format the structured QA output into a training-ready text format.
# Each QA pair becomes a passage that could appear in a pre-training corpus.

qa_format_expr = """\
{% for pair in diverse_qa.qa_pairs %}\
Question ({{ pair.question_type }}): {{ pair.question }}
Answer: {{ pair.answer }}\
{% if pair.choices %}
Choices: {{ pair.choices }}\
{% endif %}

{% endfor %}\
"""

qa_strategy = qa_strategy.add_column(
    ExpressionColumnConfig(
        name="qa_training_text",
        expr=qa_format_expr,
        dtype="str",
    )
)
print(qa_strategy)

DataDesignerConfigBuilder(
    seed_dataset: local seed
    llm_structured_columns: ['diverse_qa']
    expression_columns: ['qa_training_text']
)
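
To sanity-check what the expression column should produce, it can help to write the same formatting in plain Python and compare it with `qa_training_text` after the preview. The sketch below is not Data Designer code: `format_qa_pairs` is a hypothetical helper, and it assumes each QA pair is available as a dict with the schema's four fields. The two example pairs are copied from this notebook's inspection output for the Interoperability article.

```python
def format_qa_pairs(qa_pairs: list[dict]) -> str:
    """Render QA pairs in roughly the layout of the Jinja expression above."""
    blocks = []
    for pair in qa_pairs:
        lines = [
            f"Question ({pair['question_type']}): {pair['question']}",
            f"Answer: {pair['answer']}",
        ]
        # Only multiple-choice questions carry a non-empty choices list.
        if pair.get("choices"):
            lines.append(f"Choices: {pair['choices']}")
        blocks.append("\n".join(lines))
    # Each pair is followed by a blank line, as in the template.
    return "".join(block + "\n\n" for block in blocks)


example_pairs = [
    {
        "question_type": "open_ended",
        "question": "What is required for semantic interoperability between systems?",
        "answer": (
            "Both sides must refer to a common information exchange reference model "
            "and the content of the information exchange requests must be "
            "unambiguously defined."
        ),
        "choices": None,
    },
    {
        "question_type": "multiple_choice",
        "question": (
            "Which of the following is an example of a common data format "
            "contributing to syntactic interoperability?"
        ),
        "answer": "XML",
        "choices": ["XML", "HTML", "JSON", "FTP"],
    },
]

training_text = format_qa_pairs(example_pairs)
print(training_text)
```

Whitespace handling in Jinja depends on the environment's trimming settings, so treat any difference in blank lines between this helper and the real column as something to inspect rather than as an error.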


In [27]:
# Preview the Diverse QA pipeline.

qa_preview = data_designer.preview(qa_strategy, num_records=3)
qa_preview_df = qa_preview.dataset
print(f"Preview shape: {qa_preview_df.shape}")
print(f"Columns: {list(qa_preview_df.columns)}")
qa_preview_df.head()

[09:43:29] [INFO] 👀 Preview generation in progress
[09:43:29] [INFO] ✅ Validation passed
[09:43:29] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[09:43:29] [INFO] 🩺 Running health checks for models...
[09:43:29] [INFO]   |-- 👀 Checking 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16' in provider named 'local-vllm' for model alias 'rephraser'...
[09:43:29] [INFO]   |-- ✅ Passed!
[09:43:29] [INFO] 🌱 Sampling 3 records from seed dataset
[09:43:29] [INFO]   |-- seed dataset size: 18 records
[09:43:29] [INFO]   |-- sampling strategy: ordered
[09:43:29] [INFO] 🗂️ llm-structured model config for column 'diverse_qa'
[09:43:29] [INFO]   |-- model: 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'
[09:43:29] [INFO]   |-- model alias: 'rephraser'
[09:43:29] [INFO]   |-- model provider: 'local-vllm'
[09:43:29] [INFO]   |-- inference parameters:
[09:43:29] [INFO]   |  |-- generation_type=chat-completion
[09:43:29] [INFO]   |  |-- max_parallel_requests=4
[09:43:29] 

Preview shape: (3, 10)
Columns: ['text', 'title', 'id', 'url', 'language', 'source_id', 'file_name', 'categories', 'diverse_qa', 'qa_training_text']


| | text | title | id | url | language | source_id | file_name | categories | diverse_qa | qa_training_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Interoperability is a characteristic of a prod... | Interoperability | 41285 | https://en.wikipedia.org/wiki/Interoperability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Interoperability, Computing terminology, Tele... | `{'qa_pairs': [{'question': 'Is interoperabilit...` | Question (yes_no): Is interoperability defined... |
| 1 | In communications and computing, a machine-rea... | Machine-readable medium and data | 41341 | https://en.wikipedia.org/wiki/Machine-readable... | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Machine-readable medium and data, Computing t... | `{'qa_pairs': [{'question': 'Is a machine-reada...` | Question (yes_no): Is a machine-readable mediu... |
| 2 | Maintainability is the ease of maintaining or ... | Maintainability | 41347 | https://en.wikipedia.org/wiki/Maintainability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Maintainability, Telecommunications engineeri... | `{'qa_pairs': [{'question': 'Is maintainability...` | Question (yes_no): Is maintainability defined ... |


In [28]:
# Inspect the generated QA pairs for the first two documents

for i in range(min(2, len(qa_preview_df))):
    row = qa_preview_df.iloc[i]
    print(f"{'=' * 80}")
    print(f"Document {i}: {row['title']}")
    print(f"\n--- Structured QA output ---")
    qa_data = row["diverse_qa"]
    if isinstance(qa_data, dict) and "qa_pairs" in qa_data:
        for j, pair in enumerate(qa_data["qa_pairs"]):
            print(f"\n  QA Pair {j+1} ({pair['question_type']}):")
            print(f"    Q: {pair['question']}")
            print(f"    A: {pair['answer'][:200]}")
            if pair.get("choices"):
                print(f"    Choices: {pair['choices']}")
    print(f"\n--- Training-ready text ---")
    print(row["qa_training_text"][:600])
    print()

Document 0: Interoperability

--- Structured QA output ---

  QA Pair 1 (yes_no):
    Q: Is interoperability defined only in terms of technical communication between systems?
    A: No, the source states that interoperability also considers social, political, and organizational factors, not just technical communication.

  QA Pair 2 (open_ended):
    Q: What is required for semantic interoperability between systems?
    A: Both sides must refer to a common information exchange reference model and the content of the information exchange requests must be unambiguously defined.

  QA Pair 3 (multiple_choice):
    Q: Which of the following is an example of a common data format contributing to syntactic interoperability?
    A: XML
    Choices: ['XML', 'HTML', 'JSON', 'FTP']

--- Training-ready text ---
Question (yes_no): Is interoperability defined only in terms of technical communication between systems?
Answer: No, the source states that interoperability also considers social, political,
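
Reading a few QA pairs by eye does not scale, so a small programmatic check is useful for the "is the type distribution diverse?" question. The sketch below is an illustration under assumptions: `check_qa_output` is a hypothetical helper, it expects the same dict shape that the inspection cell handles (`{"qa_pairs": [...]}`), and the "answer must appear among the choices" rule is a heuristic that the Pydantic schema itself does not enforce.

```python
from collections import Counter

EXPECTED_TYPES = {"yes_no", "open_ended", "multiple_choice"}


def check_qa_output(qa_data: dict) -> dict:
    """Summarize question-type coverage and flag pairs the schema would still accept."""
    pairs = qa_data.get("qa_pairs", [])
    type_counts = Counter(pair["question_type"] for pair in pairs)
    problems = []
    for idx, pair in enumerate(pairs):
        choices = pair.get("choices")
        if pair["question_type"] == "multiple_choice":
            if not choices:
                problems.append((idx, "multiple_choice without choices"))
            elif pair["answer"] not in choices:
                # Heuristic: fails if the model answers with a letter such as "A".
                problems.append((idx, "answer is not one of the choices"))
        elif choices:
            problems.append((idx, "choices given for a non-multiple-choice question"))
    return {
        "type_counts": dict(type_counts),
        "missing_types": sorted(EXPECTED_TYPES - set(type_counts)),
        "problems": problems,
    }


# Shapes mirror the structured output shown above; question text is abbreviated.
good = {
    "qa_pairs": [
        {"question_type": "yes_no", "question": "Q1", "answer": "No", "choices": None},
        {"question_type": "open_ended", "question": "Q2", "answer": "A2", "choices": None},
        {
            "question_type": "multiple_choice",
            "question": "Q3",
            "answer": "XML",
            "choices": ["XML", "HTML", "JSON", "FTP"],
        },
    ]
}
bad = {
    "qa_pairs": [
        {"question_type": "multiple_choice", "question": "Q", "answer": "XML", "choices": None},
    ]
}

good_report = check_qa_output(good)
bad_report = check_qa_output(bad)
print(good_report)
print(bad_report)
```

Applying `check_qa_output` to every row of `qa_preview_df["diverse_qa"]` and aggregating `type_counts` gives a quick view of the type distribution across documents.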

In [29]:
# Diverse QA generates fewer output tokens per document than a full rewrite.
# Estimate: output is ~0.4x the input length (~800 tokens for a ~2,000-token source).

qa_avg_input = avg_input_tokens
qa_avg_output = int(avg_input_tokens * 0.4)  # QA output is shorter but structured (~800 tokens)
qa_tokens_per_doc = qa_avg_input + qa_avg_output
qa_total_flops = len(seed_df) * qa_tokens_per_doc * moe_flops_per_token

print(f"Diverse QA cost estimate:")
print(f"  Avg input tokens/doc:    ~{qa_avg_input}")
print(f"  Avg output tokens/doc:   ~{qa_avg_output}")
print(f"  Tokens per doc:          ~{qa_tokens_per_doc}")
print(f"  Total FLOPs:             {qa_total_flops:.2e}")
print()
print(f"Comparison to single-strategy rephrasing:")
print(f"  Single-strategy FLOPs:   {single_total_flops:.2e}")
print(f"  Diverse QA FLOPs:        {qa_total_flops:.2e}")
print(f"  Ratio:                   {qa_total_flops / single_total_flops:.2f}x")
print()
print("Diverse QA is cheaper per document because the output is shorter (3 QA pairs")
print("vs. a full article rewrite), but the structured JSON format adds overhead.")
print("The information density per output token is higher: each QA pair isolates")
print("a specific fact, while rephrasing spreads the same facts across more prose.")

Diverse QA cost estimate:
  Avg input tokens/doc:    ~2009
  Avg output tokens/doc:   ~803
  Tokens per doc:          ~2812
  Total FLOPs:             3.24e+14

Comparison to single-strategy rephrasing:
  Single-strategy FLOPs:   4.63e+14
  Diverse QA FLOPs:        3.24e+14
  Ratio:                   0.70x

Diverse QA is cheaper per document because the output is shorter (3 QA pairs
vs. a full article rewrite), but the structured JSON format adds overhead.
The information density per output token is higher: each QA pair isolates
a specific fact, while rephrasing spreads the same facts across more prose.


Explore: Compare the QA training text with the single-strategy rephrased text.
Which format carries more information per token?

---

## Multi-Strategy Pipeline (Open-Ended, 25 pts)

BeyondWeb (Xu et al., 2025) showed that **multi-strategy diversity prevents
saturation** during pre-training. A single rephrasing strategy, no matter how good,
eventually stops providing new signal. The paper reports +3.7 percentage points over
single-strategy rephrasing by combining multiple approaches.

The insight: each strategy surfaces different aspects of the source material. QA pairs
exercise retrieval and reasoning, distilled passages compress information, knowledge
extracts restructure for pedagogy, and organized lists surface atomic facts.

In this section, implement a multi-strategy pipeline combining **at least 3**
rephrasing approaches. Estimate total FLOPs and justify your compute allocation
across strategies. The reference solution implements 4 strategies.

### Your Task

1. Design and implement a multi-strategy pipeline with **at least 3** distinct rephrasing strategies. Each strategy should be a separate column in the `DataDesignerConfigBuilder`. The reference solution uses 4: Diverse QA, Distill, Knowledge Extract, and Knowledge List.
2. Run `preview()` on 2 records and compare all strategies side-by-side for at least one document.
3. Compute the per-strategy and total FLOPs. Present a breakdown showing the compute allocation across your strategies.
4. In 3-5 sentences: Justify your strategy choices. Why these strategies? How do they complement each other? Where is there redundancy?

In [30]:
# Build a multi-strategy pipeline with 4 rephrasing columns.
# Each column applies a different transformation to the same source documents.

multi_strategy = DataDesignerConfigBuilder(model_configs=model_configs)
multi_strategy = multi_strategy.with_seed_dataset(seed_source)

# Strategy 1: Diverse QA (reuse prompt from earlier)
multi_strategy = multi_strategy.add_column(
    LLMStructuredColumnConfig(
        name="diverse_qa",
        prompt=diverse_qa_prompt,
        model_alias="rephraser",
        output_format=DiverseQAOutput,
    )
)

# Strategy 2 (Distill): compress the document into a concise, information-dense passage
distill_prompt = """\
Distill the following document into a concise, information-dense passage of 2-3 paragraphs. \
Preserve all key facts, definitions, and relationships. Remove redundancy, examples, \
and verbose explanations. The result should read as a self-contained summary that \
captures the essence of the original. Output only the distilled passage.

Title: {{ title }}

Source document:
{{ text }}
"""
multi_strategy = multi_strategy.add_column(
    LLMTextColumnConfig(
        name="distill",
        prompt=distill_prompt,
        model_alias="rephraser",
    )
)

# Strategy 3 (Knowledge Extract): restructure as a pedagogical explanation
knowledge_extract_prompt = """\
Rewrite the following document as a clear, pedagogical explanation suitable for a \
textbook or educational reference. Organize the content logically with section headings. \
Define technical terms when they first appear. Use a teaching tone that builds concepts \
progressively. Output only the rewritten educational text.

Title: {{ title }}

Source document:
{{ text }}
"""
multi_strategy = multi_strategy.add_column(
    LLMTextColumnConfig(
        name="knowledge_extract",
        prompt=knowledge_extract_prompt,
        model_alias="rephraser",
    )
)

# Strategy 4 (Knowledge List): extract atomic facts as a structured list
knowledge_list_prompt = """\
Extract all important facts, definitions, and relationships from the following document \
as a numbered list of concise, self-contained statements. Each item should express exactly \
one fact. Be comprehensive; aim for at least 10 items. Output only the numbered list.

Title: {{ title }}

Source document:
{{ text }}
"""
multi_strategy = multi_strategy.add_column(
    LLMTextColumnConfig(
        name="knowledge_list",
        prompt=knowledge_list_prompt,
        model_alias="rephraser",
    )
)

print(multi_strategy)

DataDesignerConfigBuilder(
    seed_dataset: local seed
    llm_text_columns: ['distill', 'knowledge_extract', 'knowledge_list']
    llm_structured_columns: ['diverse_qa']
)


In [31]:
# Preview the multi-strategy pipeline.

multi_preview = data_designer.preview(multi_strategy, num_records=2)
multi_preview_df = multi_preview.dataset
print(f"Preview shape: {multi_preview_df.shape}")
print(f"Columns: {list(multi_preview_df.columns)}")
multi_preview_df.head()

[09:43:45] [INFO] 👁️ Preview generation in progress
[09:43:46] [INFO] ✅ Validation passed
[09:43:46] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[09:43:46] [INFO] 🩺 Running health checks for models...
[09:43:46] [INFO]   |-- 👀 Checking 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16' in provider named 'local-vllm' for model alias 'rephraser'...
[09:43:46] [INFO]   |-- ✅ Passed!
[09:43:46] [INFO] 🌱 Sampling 2 records from seed dataset
[09:43:46] [INFO]   |-- seed dataset size: 18 records
[09:43:46] [INFO]   |-- sampling strategy: ordered
[09:43:46] [INFO] 🗂️ llm-structured model config for column 'diverse_qa'
[09:43:46] [INFO]   |-- model: 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'
[09:43:46] [INFO]   |-- model alias: 'rephraser'
[09:43:46] [INFO]   |-- model provider: 'local-vllm'
[09:43:46] [INFO]   |-- inference parameters:
[09:43:46] [INFO]   |  |-- generation_type=chat-completion
[09:43:46] [INFO]   |  |-- max_parallel_requests=4
[09:43:4

Preview shape: (2, 12)
Columns: ['text', 'title', 'id', 'url', 'language', 'source_id', 'file_name', 'categories', 'diverse_qa', 'distill', 'knowledge_extract', 'knowledge_list']


| | text | title | id | url | language | source_id | file_name | categories | diverse_qa | distill | knowledge_extract | knowledge_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Interoperability is a characteristic of a prod... | Interoperability | 41285 | https://en.wikipedia.org/wiki/Interoperability | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Interoperability, Computing terminology, Tele... | `{'qa_pairs': [{'question': 'Does interoperabil...` | Interoperability is the capacity of products, ... | `**Interoperability: A Textbook-Style Explanati...` | 1. Interoperability is a characteristic that e... |
| 1 | In communications and computing, a machine-rea... | Machine-readable medium and data | 41341 | https://en.wikipedia.org/wiki/Machine-readable... | en | enwiki-20260201-enwiki-20260201-pages-articles... | enwiki-20260201-enwiki-20260201-pages-articles... | [Machine-readable medium and data, Computing t... | `{'qa_pairs': [{'question': 'Can machine-readab...` | A machine-readable medium is a storage format ... | `**Machine-Readable Medium and Data: A Pedagogi...` | 1. A machine-readable medium is a storage form... |


In [32]:
# Inspect all strategies side-by-side for one document.

row = multi_preview_df.iloc[0]
print(f"Document: {row['title']}")
print(f"Original length: {len(row['text'])} chars")

strategy_cols = ["distill", "knowledge_extract", "knowledge_list"]
for col in strategy_cols:
    print(f"\n{'=' * 80}")
    print(f"Strategy: {col}")
    print(f"Length: {len(str(row[col]))} chars")
    print(f"{'-' * 80}")
    print(str(row[col])[:500])
    if len(str(row[col])) > 500:
        print("...")

print(f"\n{'=' * 80}")
print(f"Strategy: diverse_qa")
qa_data = row["diverse_qa"]
if isinstance(qa_data, dict) and "qa_pairs" in qa_data:
    print(f"Number of QA pairs: {len(qa_data['qa_pairs'])}")
    for j, pair in enumerate(qa_data["qa_pairs"]):
        print(f"  QA {j+1} ({pair['question_type']}): {pair['question'][:80]}...")


Document: Interoperability
Original length: 29874 chars

Strategy: distill
Length: 2051 chars
--------------------------------------------------------------------------------
Interoperability is the capacity of products, systems, or organizations to work together effectively, encompassing syntactic, semantic, and organizational dimensions. Syntactic interoperability requires shared data formats and protocols (e.g., XML, SQL), while semantic interoperability demands a common reference model to ensure exchanged information is unambiguously interpreted and used meaningfully by end users. Cross-domain interoperability extends this to social, political, or organizational 
...

Strategy: knowledge_extract
Length: 9495 chars
--------------------------------------------------------------------------------
**Interoperability: A Textbook-Style Explanation**

---

### 1. What Is Interoperability?

**Interoperability** describes the ability of two or more products, systems, or organizations to w

### Compute Allocation

Each strategy requires a full LLM pass over every document. With 4 strategies and a
fixed compute budget, the question becomes: what is the best allocation?

The BeyondWeb paper found that **equal allocation across strategies** works well in
practice. The diversity benefit comes from having *different* perspectives on the
same source material, not from over-investing in any single strategy. If
compute is limited, prioritize the strategies that produce the most distinct output
(QA and Knowledge List are typically more distinct from each other than Distill and
Knowledge Extract).
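
One way to make the allocation question concrete is to fix a FLOPs budget, give each strategy an equal share, and ask how many documents each strategy can then cover. The sketch below takes that reading of "equal allocation" as an assumption (the paper may define it differently); `docs_per_strategy` is a hypothetical helper, the 1e18 FLOPs budget is an arbitrary illustration, and the output ratios and 6.4e9 FLOPs/token are the values this notebook uses in its cost estimates.

```python
def docs_per_strategy(
    budget_flops: float,
    input_tokens_per_doc: int,
    output_ratios: dict[str, float],
    flops_per_token: float = 6.4e9,
) -> dict[str, int]:
    """Split a FLOPs budget equally and count the documents each strategy can process."""
    share = budget_flops / len(output_ratios)
    plan = {}
    for name, ratio in output_ratios.items():
        # Every strategy reads the full document and writes ratio * input tokens.
        tokens_per_doc = input_tokens_per_doc * (1 + ratio)
        plan[name] = int(share // (tokens_per_doc * flops_per_token))
    return plan


output_ratios = {
    "diverse_qa": 0.4,
    "distill": 0.3,
    "knowledge_extract": 1.0,
    "knowledge_list": 0.5,
}
plan = docs_per_strategy(
    budget_flops=1e18,  # ASSUMPTION: illustrative budget, not from the lab
    input_tokens_per_doc=2009,
    output_ratios=output_ratios,
)
for name, n_docs in plan.items():
    print(f"{name:<18} {n_docs:>8,} documents")
```

With equal FLOPs per strategy, the strategies with shorter outputs cover more documents, so the resulting corpus is not equally split by document count. Whether to equalize FLOPs, documents, or output tokens is a design choice worth stating explicitly in your justification.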

In [33]:
# Each strategy requires a full LLM pass over every document.
# Estimate output tokens per strategy:

strategies = {
    "diverse_qa":        {"output_ratio": 0.4, "desc": "Structured QA pairs (shorter, denser)"},
    "distill":           {"output_ratio": 0.3, "desc": "Compressed summary (2-3 paragraphs)"},
    "knowledge_extract": {"output_ratio": 1.0, "desc": "Full pedagogical rewrite"},
    "knowledge_list":    {"output_ratio": 0.5, "desc": "Numbered atomic facts"},
}

total_multi_flops = 0
print(f"{'Strategy':<22} {'Out tok/doc':>12} {'Total tokens':>14} {'FLOPs':>12}")
print("-" * 65)

for name, info in strategies.items():
    out_tokens = int(avg_input_tokens * info["output_ratio"])
    strat_tokens = len(seed_df) * (avg_input_tokens + out_tokens)
    strat_flops = strat_tokens * moe_flops_per_token
    total_multi_flops += strat_flops
    print(f"{name:<22} {out_tokens:>12,} {strat_tokens:>14,} {strat_flops:>12.2e}")

print("-" * 65)
total_multi_gpu_hours = total_multi_flops / effective_flops_per_sec / 3600
print(f"{'TOTAL':<22} {'':>12} {'':>14} {total_multi_flops:>12.2e}")
print(f"\nTotal GPU-hours (4xH100, 30% MFU): {total_multi_gpu_hours:.6f}")
print(f"Ratio vs single-strategy:          {total_multi_flops / single_total_flops:.2f}x")
print()
print("Strategy justification:")
print("These 4 strategies are chosen to maximize output diversity (BeyondWeb's key finding).")
print("Diverse QA exercises retrieval and reasoning; Distill compresses and denoises;")
print("Knowledge Extract restructures for pedagogy; Knowledge List surfaces atomic facts.")
print("QA and Knowledge List are the most distinct from each other: one asks questions,")
print("the other states facts. Distill and Knowledge Extract overlap somewhat (both produce")
print("prose), but differ in tone (concise vs. pedagogical) and structure.")

Strategy                Out tok/doc   Total tokens        FLOPs
-----------------------------------------------------------------
diverse_qa                      803         50,616     3.24e+14
distill                         602         46,998     3.01e+14
knowledge_extract             2,009         72,324     4.63e+14
knowledge_list                1,004         54,234     3.47e+14
-----------------------------------------------------------------
TOTAL                                                  1.43e+15

Total GPU-hours (4xH100, 30% MFU): 0.000336
Ratio vs single-strategy:          3.10x

Strategy justification:
These 4 strategies are chosen to maximize output diversity (BeyondWeb's key finding).
Diverse QA exercises retrieval and reasoning; Distill compresses and denoises;
Knowledge Extract restructures for pedagogy; Knowledge List surfaces atomic facts.
QA and Knowledge List are the most distinct from each other: one asks questions,
the other states facts. Distill and Knowl

### Design Takeaways

The BeyondWeb paper found that 3B rephrasers produce equivalent quality to 8B at
~2.7x less compute. The choice of Nemotron-3 Nano (3.2B active parameters) is
well-calibrated for this task: large enough to produce high-quality rephrasings,
small enough to keep per-document costs low.

Multi-strategy diversity matters more than model size. In the estimate above, the 4
strategies cost ~3.1x single-strategy compute (less than 4x because three of the four
produce shorter outputs than a full rewrite), and the quality improvement (+3.7pp over
single-strategy in BeyondWeb) is typically worth it. The key insight: each additional
strategy shows diminishing returns individually, but the *combination* of diverse
formats prevents the saturation that limits single-strategy approaches.

When scaling to millions of documents, the compute allocation decision becomes a
real engineering constraint. The FLOPs framework from Section 4 lets you estimate
costs before committing GPU hours.
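
A cost estimate is only as good as its assumed output lengths, so it is worth checking the assumed ratios against what the preview actually produced. The sketch below is a rough calibration under stated assumptions: `observed_output_ratios` is a hypothetical helper, character counts stand in for token counts, and the numbers are the lengths printed by the side-by-side inspection cell for a single document (Interoperability), which is much longer than the average seed document.

```python
def observed_output_ratios(source_chars: int, output_chars: dict[str, int]) -> dict[str, float]:
    """Output/input length ratio per strategy, using characters as a proxy for tokens."""
    if source_chars <= 0:
        raise ValueError("source_chars must be positive")
    return {name: chars / source_chars for name, chars in output_chars.items()}


# Lengths printed by the inspection cell for the "Interoperability" document.
ratios = observed_output_ratios(
    source_chars=29874,
    output_chars={"distill": 2051, "knowledge_extract": 9495},
)
# Ratios assumed in the notebook's per-strategy cost table.
assumed = {"distill": 0.3, "knowledge_extract": 1.0}

for name, ratio in ratios.items():
    print(f"{name:<18} observed {ratio:.2f}   assumed {assumed[name]:.2f}")
```

One document is not a representative sample, and long inputs may hit generation length limits, so measure across the full preview (ideally in tokens) before replacing the assumed ratios.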

Explore: Compare outputs across all 4 strategies. Which strategies produce the most
distinct content? Where is there redundancy?
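
A simple way to put a number on redundancy is pairwise n-gram overlap between strategy outputs for the same document. The sketch below uses word-trigram Jaccard similarity, which is one possible measure and not something the lab prescribes; `word_ngrams`, `jaccard`, and `pairwise_overlap` are hypothetical helpers, and the three strings are toy stand-ins for real strategy outputs.

```python
import re
from itertools import combinations


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; punctuation and markup are ignored."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_overlap(outputs: dict[str, str], n: int = 3) -> dict[tuple[str, str], float]:
    """Jaccard similarity of n-gram sets for every pair of strategies."""
    grams = {name: word_ngrams(text, n) for name, text in outputs.items()}
    return {(x, y): jaccard(grams[x], grams[y]) for x, y in combinations(grams, 2)}


# Toy texts for illustration only; in the notebook these would be one row's columns.
toy_outputs = {
    "distill": "Interoperability requires shared data formats and protocols",
    "knowledge_extract": "Interoperability requires shared data formats and common models",
    "knowledge_list": "A machine readable medium stores data",
}
overlap = pairwise_overlap(toy_outputs)
for pair, score in overlap.items():
    print(f"{pair[0]} vs {pair[1]}: {score:.2f}")
```

High overlap between two strategies suggests they are spending compute on near-duplicate text. Lexical overlap misses paraphrase, so low scores do not prove the outputs carry different information.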

---

## Evaluation (10 pts)

Two complementary quality signals help us iterate on synthetic data before committing
to a full pre-training run. **[GPT-2](https://huggingface.co/openai-community/gpt2) perplexity** measures statistical fit to a
pre-trained distribution (how "predictable" the text is under a reference model).
**LLM-as-judge** scores semantic quality dimensions: coherence, faithfulness, and
information density.

These measure fundamentally different things. Perplexity captures fluency and
predictability; the judge captures whether the text is well-structured, accurate,
and informationally rich. The tension between them is informative: a text can be
highly coherent but have high perplexity if it uses formatting or structure that
GPT-2 has not seen. Conversely, low-perplexity text may be fluent but vacuous.
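
To make "predictability" concrete: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities in plain Python; `perplexity` is a hypothetical helper and the probabilities are toy values, not GPT-2 outputs. Exponentiating a causal language model's mean cross-entropy loss yields the same quantity.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: every token gets probability 0.5 in the "predictable" text
# and probability 0.05 in the "surprising" text.
predictable = [math.log(0.5)] * 10
surprising = [math.log(0.05)] * 10

ppl_predictable = perplexity(predictable)
ppl_surprising = perplexity(surprising)
print(f"predictable: {ppl_predictable:.1f}")
print(f"surprising:  {ppl_surprising:.1f}")
```

A perplexity of k can be read as the model being as uncertain as if it were choosing uniformly among k tokens at each step, which is why unfamiliar formatting (headings, lists, QA templates) can raise perplexity even when the content is coherent.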

GPT-2 perplexity is one convenient measure of dataset similarity to a pre-training
distribution, not the only one. You could substitute a different reference model
(e.g., a domain-specific LM) or use other distributional similarity measures
(n-gram overlap, embedding distance). We use GPT-2 here because it is small, fast,
and provides a baseline comparison to the Curator lab.

Keep in mind that these are **proxy metrics** for iterating before Part B. The
real test of synthetic data quality is pre-training performance on downstream
benchmarks, which you will measure in Part B.

### Your Task

1. Load GPT-2 and implement a `compute_perplexity()` function that scores a list of texts.
2. Compute perplexity for both original seed data and your single-strategy rephrased output. Report per-document and mean scores.
3. Build an LLM-as-judge pipeline using `LLMJudgeColumnConfig` with at least 3 scoring dimensions (e.g., Coherence, Faithfulness, Information Density) on a 1-5 scale.
4. Run the judge on your rephrased output and display scores alongside perplexity.
5. In 3-5 sentences: How do the two metrics relate? Where do they agree vs. diverge? What does this tell you about proxy metrics for pre-training data quality?

In [34]:
# Load GPT-2 for perplexity evaluation.
# We use the same reference model as the Curator lab for continuity.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

device = "cuda" if torch.cuda.is_available() else "cpu"
gpt2_model = gpt2_model.to(device)

def compute_perplexity(texts: list[str], max_length: int = 1024) -> list[float]:
    """Compute GPT-2 perplexity for each text. Truncates to max_length tokens."""
    perplexities = []
    for text in texts:
        encodings = gpt2_tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        input_ids = encodings.input_ids.to(device)
        with torch.no_grad():
            outputs = gpt2_model(input_ids, labels=input_ids)
        perplexities.append(torch.exp(outputs.loss).item())
    return perplexities

print(f"GPT-2 loaded on {device}")
print("compute_perplexity() ready")

GPT-2 loaded on cuda
compute_perplexity() ready


In [35]:
# Compute perplexity for original seed data and single-strategy rephrased output

import statistics

original_texts = single_preview_df["text"].tolist()
rephrased_texts = single_preview_df["wiki_rephrase"].tolist()

original_ppl = compute_perplexity(original_texts)
rephrased_ppl = compute_perplexity(rephrased_texts)

print(f"{'Document':<40} {'Original PPL':>14} {'Rephrased PPL':>14}")
print("-" * 70)
for i in range(len(single_preview_df)):
    title = single_preview_df.iloc[i]["title"][:38]
    print(f"{title:<40} {original_ppl[i]:>14.1f} {rephrased_ppl[i]:>14.1f}")
print("-" * 70)
print(f"{'Mean':<40} {statistics.mean(original_ppl):>14.1f} {statistics.mean(rephrased_ppl):>14.1f}")

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Document                                   Original PPL  Rephrased PPL
----------------------------------------------------------------------
Interoperability                                   23.3           25.1
Machine-readable medium and data                   37.2           29.2
Maintainability                                    63.2           22.9
----------------------------------------------------------------------
Mean                                               41.2           25.7


### Interpreting Perplexity

Lower perplexity means the text is more predictable under the GPT-2 distribution.
We expect rephrased text to have *different* perplexity, not necessarily lower.
Rephrasing adds structure and formatting that GPT-2 may not have seen in its
training data. In the preview above, two of the three documents became more
predictable after rephrasing (mean perplexity fell from 41.2 to 25.7), while
Interoperability rose slightly (23.3 to 25.1). Moderate perplexity is actually
desirable: too low means the text is trivially predictable or redundant, too high
means it is incoherent or uses highly unfamiliar patterns.
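One way to act on the "moderate is desirable" idea is to flag documents whose perplexity falls outside a band. The sketch below is a minimal illustration: the function name `flag_by_perplexity` and the thresholds `low=10.0` and `high=100.0` are assumptions, not values prescribed by this lab, and sensible cutoffs depend on your corpus and reference model.

```python
def flag_by_perplexity(ppl_by_title: dict[str, float], low: float, high: float) -> dict[str, str]:
    """Label each document 'too_low', 'ok', or 'too_high' relative to [low, high]."""
    labels = {}
    for title, ppl in ppl_by_title.items():
        if ppl < low:
            labels[title] = "too_low"    # possibly repetitive or trivially predictable
        elif ppl > high:
            labels[title] = "too_high"   # possibly incoherent or unusually formatted
        else:
            labels[title] = "ok"
    return labels

# Rephrased perplexities from the preview table above.
rephrased = {
    "Interoperability": 25.1,
    "Machine-readable medium and data": 29.2,
    "Maintainability": 22.9,
}

# Hypothetical thresholds; tune them on your own corpus.
labels = flag_by_perplexity(rephrased, low=10.0, high=100.0)
print(labels)
```

Flagged documents are candidates for manual inspection rather than automatic removal, since perplexity alone cannot tell unfamiliar formatting apart from genuinely poor text.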

In [36]:
# Use NeMo Data Designer's LLM judge to score rephrased output.
# We build a small pipeline: seed with the preview results, then judge.

from data_designer.config import LLMJudgeColumnConfig
from data_designer.config.column_configs import Score

judge_prompt = """\
You are evaluating a rephrased version of a source document. \
Score the rephrased text on each dimension below.

Original title: {{ title }}

Original document (first 500 chars):
{{ text[:500] }}

Rephrased document:
{{ wiki_rephrase }}
"""

scores = [
    Score(
        name="coherence",
        description="How well-organized, logical, and readable is the rephrased text?",
        options={
            1: "Incoherent or disorganized",
            2: "Poorly structured with frequent issues",
            3: "Adequate structure, some rough transitions",
            4: "Well-organized with clear flow",
            5: "Excellent structure, highly readable",
        },
    ),
    Score(
        name="faithfulness",
        description="Does the rephrased text preserve the factual content of the original without hallucinating new information?",
        options={
            1: "Mostly fabricated or contradicts the original",
            2: "Significant factual errors or omissions",
            3: "Generally accurate with minor errors",
            4: "Accurate with very minor omissions",
            5: "Fully faithful to the original",
        },
    ),
    Score(
        name="information_density",
        description="How efficiently does the text convey information? High density means more useful content per token.",
        options={
            1: "Extremely verbose or vacuous",
            2: "Mostly filler with some useful content",
            3: "Average information density",
            4: "Dense and informative with little waste",
            5: "Maximally dense, every sentence adds value",
        },
    ),
]

# Seed the judge pipeline with single_preview_df (which has text, title, wiki_rephrase)
judge_seed_path = "wiki_curated_domain_data/judge_seed.jsonl"
single_preview_df.to_json(judge_seed_path, orient="records", lines=True)
judge_seed = LocalFileSeedSource(path=judge_seed_path)

judge_pipeline = DataDesignerConfigBuilder(model_configs=model_configs)
judge_pipeline = judge_pipeline.with_seed_dataset(judge_seed)
judge_pipeline = judge_pipeline.add_column(
    LLMJudgeColumnConfig(
        name="quality_judge",
        prompt=judge_prompt,
        model_alias="rephraser",
        scores=scores,
    )
)

judge_results = data_designer.preview(judge_pipeline, num_records=len(single_preview_df))
judge_df = judge_results.dataset
print(f"Judge results shape: {judge_df.shape}")
print(f"Columns: {list(judge_df.columns)}")

[09:45:35] [INFO] 🔭 Preview generation in progress
[09:45:35] [INFO] ✅ Validation passed
[09:45:35] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[09:45:35] [INFO] 🩺 Running health checks for models...
[09:45:35] [INFO]   |-- 👀 Checking 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16' in provider named 'local-vllm' for model alias 'rephraser'...
[09:45:35] [INFO]   |-- ✅ Passed!
[09:45:35] [INFO] 🌱 Sampling 3 records from seed dataset
[09:45:35] [INFO]   |-- seed dataset size: 3 records
[09:45:35] [INFO]   |-- sampling strategy: ordered
[09:45:35] [INFO] ⚖️ llm-judge model config for column 'quality_judge'
[09:45:35] [INFO]   |-- model: 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'
[09:45:35] [INFO]   |-- model alias: 'rephraser'
[09:45:35] [INFO]   |-- model provider: 'local-vllm'
[09:45:35] [INFO]   |-- inference parameters:
[09:45:35] [INFO]   |  |-- generation_type=chat-completion
[09:45:35] [INFO]   |  |-- max_parallel_requests=4
[09:45:35] [INF

Judge results shape: (3, 10)
Columns: ['text', 'title', 'id', 'url', 'language', 'source_id', 'file_name', 'categories', 'wiki_rephrase', 'quality_judge']


In [39]:
# Display the judge scores alongside perplexity

score_names = ["coherence", "faithfulness", "information_density"]

def extract_score(judge_val, score_name):
    """Extract numeric score from the quality_judge structured output."""
    if isinstance(judge_val, dict) and score_name in judge_val:
        entry = judge_val[score_name]
        if isinstance(entry, dict) and "score" in entry:
            return entry["score"]
    return None

header = f"{'Document':<30} {'Orig PPL':>9} {'Reph PPL':>9}"
for s in score_names:
    header += f" {s[:11]:>11}"
print(header)
print("-" * len(header))

all_scores = {s: [] for s in score_names}
for i in range(len(judge_df)):
    title = judge_df.iloc[i]["title"][:28]
    row_str = f"{title:<30} {original_ppl[i]:>9.1f} {rephrased_ppl[i]:>9.1f}"
    judge_val = judge_df.iloc[i]["quality_judge"]
    for s in score_names:
        val = extract_score(judge_val, s)
        all_scores[s].append(val)
        row_str += f" {val:>11}" if val is not None else f" {'N/A':>11}"
    print(row_str)

print("-" * len(header))
mean_row = f"{'Mean':<30} {statistics.mean(original_ppl):>9.1f} {statistics.mean(rephrased_ppl):>9.1f}"
for s in score_names:
    vals = [v for v in all_scores[s] if v is not None]
    mean_row += f" {statistics.mean(vals):>11.1f}" if vals else f" {'N/A':>11}"
print(mean_row)

print()
print("Perplexity and judge scores measure fundamentally different things. Perplexity")
print("captures how predictable text is under GPT-2's distribution; rephrased text")
print("can have higher perplexity due to added markdown formatting and structural")
print("elements GPT-2 rarely saw. Meanwhile, the judge scores semantic quality:")
print("coherence and faithfulness can be high even when perplexity is elevated.")
print("This divergence shows why a single proxy metric is insufficient: perplexity")
print("penalizes novel structure while the judge rewards it. For pre-training data")
print("quality, the real test remains downstream benchmark performance in Part B.")

Document                        Orig PPL  Reph PPL   coherence faithfulnes information
--------------------------------------------------------------------------------------
Interoperability                    23.3      25.1           5           5           5
Machine-readable medium and         37.2      29.2           4           5           4
Maintainability                     63.2      22.9           5           5           5
--------------------------------------------------------------------------------------
Mean                                41.2      25.7         4.7         5.0         4.7

Perplexity and judge scores measure fundamentally different things. Perplexity
captures how predictable text is under GPT-2's distribution; rephrased text
can have higher perplexity due to added markdown formatting and structural
elements GPT-2 rarely saw. Meanwhile, the judge scores semantic quality:
coherence and faithfulness can be high even when perplexity is elevated.
This diver

### Perplexity vs. Judge: What Do They Tell Us?

How do the LLM-as-judge scores relate to GPT-2 perplexity? Do high-coherence texts
have lower perplexity? Where do the metrics agree vs. diverge?

Perplexity measures statistical fit to GPT-2's training distribution and captures
fluency and predictability. The LLM judge captures semantic quality: coherence,
faithfulness, information density. These are fundamentally different measurements:
a text can be highly coherent but have high perplexity if it uses unfamiliar
formatting or structure. The tension between these metrics is informative for
understanding what makes good pre-training data.
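One way to quantify how the two metrics relate is a rank correlation between perplexity and each judge dimension. The sketch below is a minimal pure-Python Spearman correlation, using the three rephrased perplexities and judge scores from the table above; the helper names are hypothetical. With only three documents the result is illustrative at best, so treat it as a template for a larger sample rather than evidence of a trend.

```python
import math

def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks.

    Returns NaN when either input is constant (the correlation is undefined).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Values copied from the results table above.
rephrased_ppl = [25.1, 29.2, 22.9]
coherence = [5, 4, 5]
faithfulness = [5, 5, 5]

rho_coherence = spearman(rephrased_ppl, coherence)
rho_faithfulness = spearman(rephrased_ppl, faithfulness)
print(f"PPL vs coherence: {rho_coherence:.3f}")
print(f"PPL vs faithfulness: {rho_faithfulness}")
```

Note that a judge dimension with identical scores for every document (faithfulness in this run) has no rank variation, so its correlation with perplexity is undefined; this is itself a sign that the 1-5 scale may be saturating on a small preview.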

Explore: If you have multi-strategy outputs, compute perplexity and judge scores
for each strategy. How do the strategies compare? Which produces the most
"surprising" (high perplexity) yet high-quality (high judge score) text?

---

## Scale Up & Conclusion (10 pts)

Time to scale up. Use `create()` instead of `preview()` to generate a full dataset
sized for Part B's pre-training run, where you will pre-train Nemotron-Nano-V3 (the
hybrid Mamba + MoE + Attention model from Section 4) on this synthetic data.

### Your Task

1. Run `data_designer.create()` with your multi-strategy pipeline to generate the full dataset. Set `num_records` to match your seed dataset size.
2. Inspect the output: report shape, column names, mean character lengths per strategy column.
3. Export the dataset to parquet for use in Part B.
4. Compute the actual FLOPs consumed (estimate from output token counts).
5. In 3-5 sentences: What would change at 10x, 100x, 1000x scale? Which bottlenecks shift?

In [None]:
# Scale up: run full generation with the multi-strategy pipeline.
# Adjust num_records based on your seed dataset size and compute budget.

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Inspect the generated dataset

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Final cost estimate for the scaled generation.
# Use actual token counts from the create() run if available,
# otherwise use our per-strategy estimates from Section 7.

# YOUR CODE HERE
raise NotImplementedError()

### Scaling Reflection

What would change at 10x, 100x, 1000x scale?

At **10x**, the pipeline is the same but you might want to increase
`max_parallel_requests` in inference parameters and use tensor parallelism across
more GPUs.

At **100x**, you need multi-node serving and careful attention to data loading,
since the seed dataset itself becomes large.

At **1000x** (millions of documents), the cost framework from Section 4 becomes
essential: you must choose strategies carefully, as each strategy costs a full pass.
The BeyondWeb finding that 3B rephrasers match 8B quality means you can process
~2.7x more data for the same compute budget.
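The ~2.7x figure follows from simple arithmetic if generation cost scales linearly with rephraser parameter count. The sketch below assumes the common approximation of roughly 2 × parameters FLOPs per processed token for a dense model, which may differ in detail from the Section 4 framework; the budget and tokens-per-document values are hypothetical placeholders.

```python
def docs_per_budget(flops_budget: float, params: float, tokens_per_doc: float) -> float:
    """Documents processed under the rough rule FLOPs ~= 2 * params * tokens."""
    flops_per_doc = 2 * params * tokens_per_doc
    return flops_budget / flops_per_doc

# Hypothetical budget and document length, for illustration only.
budget = 1e21
tokens_per_doc = 2_000

docs_8b = docs_per_budget(budget, params=8e9, tokens_per_doc=tokens_per_doc)
docs_3b = docs_per_budget(budget, params=3e9, tokens_per_doc=tokens_per_doc)
ratio = docs_3b / docs_8b

print(f"8B rephraser: {docs_8b:,.0f} docs")
print(f"3B rephraser: {docs_3b:,.0f} docs")
print(f"ratio: {ratio:.2f}x")
```

The ratio is 8/3 regardless of the budget or document length chosen, because both cancel out. Running several strategies multiplies the per-document cost by the number of strategies, so the same function can be reused with a larger `tokens_per_doc` to compare strategy mixes.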

### Conclusion

In this notebook, you built a complete synthetic pre-training data pipeline using
NeMo Data Designer. Starting from curated Wikipedia articles (Part 1a), you
learned to:

- Boot an inference server and measure throughput
- Estimate generation costs using a FLOPs framework
- Implement single-strategy and multi-strategy rephrasing pipelines
- Generate structured outputs (diverse QA pairs) with Pydantic schemas
- Evaluate output quality using perplexity and LLM-as-judge metrics
- Scale up for production with `create()`

In Part B, you will use this synthetic data to pre-train Nemotron-Nano-V3 and
measure its impact on downstream benchmarks. The quality of your synthetic data
directly affects training outcomes. The iterative design loop
(preview -> inspect -> revise -> create) is how you close that feedback loop.

In [None]:
# Clean up
print("Done! Your synthetic dataset is ready for Part B.")
print("Output file: wiki_synthetic_multistrategy.parquet")

## References

- Parmar, J., Patel, J., Adler, B., et al. "[Nemotron-CC: Transforming Web Data into
  High-Quality Synthetic Data for Language Models.](https://arxiv.org/abs/2412.02595)" *arXiv:2412.02595*, 2024.

- Xu, C., Liu, Z., Shen, Y., et al. "[BeyondWeb: Multi-Strategy Rephrasing for
  Synthetic Pre-Training Data that Goes Beyond the Web.](https://arxiv.org/abs/2506.10426)" *arXiv:2506.10426*, 2025.