In this notebook, we will prepare the model and dataset for the subsequent steps.

Let's define the paths to the model and the dataset.

In [None]:
# NOTE: Change to the model you want to use
HF_MODEL_NAME_OR_PATH = "Qwen/Qwen3-8B"

ROOT_DIR = "/workspace"
NEMO_OUTPUT_PATH = f"{ROOT_DIR}/Qwen3-8B-nemo"
DATA_PATH = f"{ROOT_DIR}/wikitext-data"

### Step 1: Convert the Hugging Face model to NeMo checkpoint format

You can skip this step if you already have the model in NeMo 2.0 checkpoint format.

In [None]:
# NOTE: Change to the model you want to use. For Llama, use llm.LlamaModel(llm.Llama31Config8B()) instead of llm.Qwen3Model(llm.Qwen3Config8B())
!python -c 'from nemo.collections import llm; llm.import_ckpt(llm.Qwen3Model(llm.Qwen3Config8B()), source="hf://{HF_MODEL_NAME_OR_PATH}", output_path="{NEMO_OUTPUT_PATH}")'

This is an example of what the NeMo checkpoint should look like:

```
Qwen3-8B-nemo/
├── context/
│   ├── artifacts/
│   │   └── generation_config.json
│   ├── nemo_tokenizer/
│   │   ├── added_tokens.json
│   │   ├── chat_template.jinja
│   │   ├── merges.txt
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.json
│   ├── io.json
│   └── model.yaml
└── weights/
    ├── .metadata
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── common.pt
    └── metadata.json
```
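
As a quick sanity check after conversion, the layout above can be verified programmatically. The following is a minimal sketch, not part of NeMo: the helper name `missing_checkpoint_entries` is hypothetical, and the list of expected entries is an assumption taken from the example tree above, so it may need adjusting for other models or NeMo versions.

```python
from pathlib import Path


def missing_checkpoint_entries(ckpt_dir):
    """Return entries from the example tree that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    # Assumption: these entries mirror the example layout shown above.
    expected = [
        "context/io.json",
        "context/model.yaml",
        "context/nemo_tokenizer",
        "weights/.metadata",
        "weights/common.pt",
        "weights/metadata.json",
    ]
    missing = [rel for rel in expected if not (root / rel).exists()]
    # Weight shards are named like __0_0.distcp; only require at least one.
    if not list((root / "weights").glob("*.distcp")):
        missing.append("weights/*.distcp")
    return missing
```

Calling `missing_checkpoint_entries(NEMO_OUTPUT_PATH)` should return an empty list if the checkpoint matches the example layout.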


`NOTE:` To convert a NeMo model back to Hugging Face format after pruning and distillation, use the following command:

```bash
python -c 'from nemo.collections import llm; llm.export_ckpt(path="<NEMO_MODEL_PATH>", target="hf", output_path="<HF_OUTPUT_PATH>")'
```

### Step 2: Prepare the dataset

**Obtain the dataset**: Generate the `wikitext-train.jsonl` split from the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset.

> `NOTE:` This notebook uses the `wikitext` dataset because it is the easiest to get started with. In practice, we recommend larger, more recent, and much higher-quality datasets such as [ClimbMix](https://huggingface.co/datasets/OptimalScale/ClimbMix) or [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1). These datasets are often split into multiple partitions, so you would need to tokenize each partition separately.

In [None]:
import json
import os

from datasets import load_dataset

# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")

# Define the destination folder
os.makedirs(DATA_PATH, exist_ok=True)

# Save the train split as a JSONL file (one JSON record per line)
with open(f"{DATA_PATH}/wikitext-train.jsonl", "w") as file:
    for item in dataset:
        file.write(json.dumps(item) + "\n")

print(f"Raw dataset saved to {DATA_PATH}/wikitext-train.jsonl")
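
Before tokenizing, it can be useful to confirm that the JSONL file was written as expected. The following is a minimal sketch under stated assumptions: the helper name `summarize_jsonl` is hypothetical, and it assumes each line is a JSON object whose text lives under the `text` key, as written by the cell above. It counts the records and how many of them have an empty or whitespace-only text field, which a raw text dataset may contain.

```python
import json


def summarize_jsonl(path, key="text"):
    """Count records in a JSONL file and how many have an empty `key` field."""
    total = 0
    empty = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip stray blank lines between records
            record = json.loads(line)
            total += 1
            if not record.get(key, "").strip():
                empty += 1
    return {"records": total, "empty": empty}
```

For example, `summarize_jsonl(f"{DATA_PATH}/wikitext-train.jsonl")` returns a dictionary with the two counts.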

**Tokenize the dataset**: Tokenize the JSONL file with the model's tokenizer to convert the data into a memory-mapped format.

In [None]:
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path=f"{DATA_PATH}/wikitext-train.jsonl",
    output_dir=DATA_PATH,
    tokenizer_name_or_path=HF_MODEL_NAME_OR_PATH,
    json_keys=["text"],
    workers=32,
    log_interval=100000
)

After running the above cells, you will see the tokenized `<ROOT_DIR>/wikitext-data/wikitext-train_text_document.{bin,idx}` files. As the log shows, the tokenized dataset has only 125M tokens. In practice, for distillation of a pruned model, we recommend using at least 50B tokens.
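
For larger datasets that ship as multiple partitions, as mentioned in the note earlier, the tokenization call can be repeated per partition. The following is a minimal sketch under stated assumptions: the helper name `tokenize_partitions` is hypothetical, it assumes the partitions are `*.jsonl` files in a single directory, and writing each partition to its own output subdirectory is a choice of this sketch rather than a requirement. The preprocessing function is passed in as an argument so the loop itself carries no dependency.

```python
from pathlib import Path


def tokenize_partitions(partition_dir, output_root, preprocess_fn, **kwargs):
    """Call preprocess_fn once per *.jsonl partition, each into its own directory."""
    outputs = []
    for part in sorted(Path(partition_dir).glob("*.jsonl")):
        # Assumption: one output subdirectory per partition, named after the file.
        out_dir = Path(output_root) / part.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        preprocess_fn(input_path=str(part), output_dir=str(out_dir), **kwargs)
        outputs.append(out_dir)
    return outputs
```

In this notebook's setting, `preprocess_fn` would be `megatron_preprocess_data`, with the remaining keyword arguments (`tokenizer_name_or_path`, `json_keys`, `workers`, `log_interval`) passed through exactly as in the tokenization cell above.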