# NVIDIA Nemotron Post-Training Datasets

This notebook downloads and explores NVIDIA's Nemotron post-training datasets, which are used for fine-tuning large language models.

## 📚 Available Datasets

1. **Nemotron-Post-Training-Dataset-v1**: Original version with chat, code, math, stem, and tool_calling splits
2. **Nemotron-Post-Training-Dataset-v2**: Updated version with stem, chat, math, and code splits plus five multilingual splits (ja, de, it, es, fr)
3. **Llama-Nemotron-Post-Training-Dataset**: Comprehensive dataset with SFT and RL subsets

---


## 🔧 Setup

First, install required packages and set up the environment.


In [1]:
# Install required packages
%pip install datasets huggingface_hub ipywidgets -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset
import os
from IPython.display import display, Markdown

# Create datasets folder if it doesn't exist
os.makedirs("datasets", exist_ok=True)

---

## 📥 Dataset Downloads

### 1️⃣ Nemotron Post-Training Dataset v1

Download the original Nemotron dataset with multiple domain-specific splits.


In [3]:
# https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1
print("🔽 Downloading Nemotron-Post-Training-Dataset-v1...\n")

dataset_v1 = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    cache_dir="./datasets/nemotron-v1"
)

print("\n✅ Download of v1 completed!")
print(f"   Dataset splits: {list(dataset_v1.keys())}")
print(f"   Total samples: {sum(len(dataset_v1[split]) for split in dataset_v1.keys()):,}")
print(f"   Location: {os.path.abspath('datasets/nemotron-v1')}\n")

🔽 Downloading Nemotron-Post-Training-Dataset-v1...



Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/159 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/660 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/159 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/660 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/175 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/152 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/649 [00:00<?, ?it/s]


✅ Download of v1 completed!
   Dataset splits: ['chat', 'code', 'math', 'stem', 'tool_calling']
   Total samples: 25,659,642
   Location: /localhome/local-tranminhq/llm-analysis/datasets/nemotron-v1



### 2️⃣ Nemotron Post-Training Dataset v2

Download the updated version, which adds multilingual splits (Japanese, German, Italian, Spanish, French) alongside stem, chat, math, and code.


In [4]:
# https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2
print("🔽 Downloading Nemotron-Post-Training-Dataset-v2...\n")

dataset_v2 = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    cache_dir="./datasets/nemotron-v2"
)

print("\n✅ Download of v2 completed!")
print(f"   Dataset splits: {list(dataset_v2.keys())}")
print(f"   Total samples: {sum(len(dataset_v2[split]) for split in dataset_v2.keys()):,}")
print(f"   Location: {os.path.abspath('datasets/nemotron-v2')}\n")

🔽 Downloading Nemotron-Post-Training-Dataset-v2...



Resolving data files:   0%|          | 0/37 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/38 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/38 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/33 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/37 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/37 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/38 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/38 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/33 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/37 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/36 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/37 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/37 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/32 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/36 [00:00<?, ?it/s]


✅ Download of v2 completed!
   Dataset splits: ['stem', 'chat', 'math', 'code', 'multilingual_ja', 'multilingual_de', 'multilingual_it', 'multilingual_es', 'multilingual_fr']
   Total samples: 6,341,414
   Location: /localhome/local-tranminhq/llm-analysis/datasets/nemotron-v2
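When only a handful of records are needed for inspection, the `datasets` library's streaming mode can read them without first materializing every shard on disk. The sketch below is one possible way to do that, not part of the original notebook: `peek` and `stream_v2_chat` are hypothetical helper names, and the choice of the `chat` split is only an illustration.

```python
from itertools import islice
from typing import Any, Iterable, List


def peek(records: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))


def stream_v2_chat(n: int = 3) -> List[Any]:
    """Hypothetical helper: fetch the first n records of the v2 chat split.

    Requires network access; the import is local so that peek() stays
    usable even where the datasets package is not installed.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "nvidia/Nemotron-Post-Training-Dataset-v2",
        split="chat",
        streaming=True,  # yields records lazily instead of downloading all shards
    )
    return peek(stream, n)
```

Calling `stream_v2_chat(2)` would return the first two records as dictionaries, assuming the dataset is reachable from the current environment.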



### 3️⃣ Llama-Nemotron Post-Training Dataset

Download the comprehensive Llama-Nemotron dataset including both **SFT** (Supervised Fine-Tuning) and **RL** (Reinforcement Learning) subsets.

**Note:** This is a large dataset (the SFT subset alone reported 32,955,418 samples in the run recorded here), so the download may take significant time and disk space.
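Because disk space is the main risk here, it can help to check free space before starting the download. The following is a minimal sketch using only the standard library; `has_free_space` is a hypothetical helper, and the 500 GiB threshold is a placeholder assumption, not a measured size of this dataset.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# Placeholder threshold (assumption): adjust to the size you expect to download.
if not has_free_space(".", required_gb=500):
    print("Warning: less free disk space than the chosen threshold.")
```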


In [5]:
# https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset

# Download SFT subset
print("🔽 Starting download of Llama-Nemotron-Post-Training-Dataset (SFT subset)...")
print("   Splits: math, code, science, chat, safety\n")

dataset_sft = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset",
    "SFT",
    cache_dir="./datasets/llama-nemotron"
)

print("\n✅ Download of SFT subset completed!")
print(f"   Dataset splits: {list(dataset_sft.keys())}")
print(f"   Total samples: {sum(len(dataset_sft[split]) for split in dataset_sft.keys()):,}\n")

# Download RL subset
print("🔽 Starting download of Llama-Nemotron-Post-Training-Dataset (RL subset)...\n")

dataset_rl = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset",
    "RL",
    cache_dir="./datasets/llama-nemotron"
)

print("\n✅ Download of RL subset completed!")
print(f"   Dataset splits: {list(dataset_rl.keys())}")
print(f"   Total samples: {sum(len(dataset_rl[split]) for split in dataset_rl.keys()):,}\n")

display(Markdown("### ✅ Llama-Nemotron dataset downloaded successfully!"))
display(Markdown(f"**Location:** `{os.path.abspath('datasets/llama-nemotron')}`"))


🔽 Starting download of Llama-Nemotron-Post-Training-Dataset (SFT subset)...
   Splits: math, code, science, chat, safety



Loading dataset shards:   0%|          | 0/91 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/141 [00:00<?, ?it/s]


✅ Download of SFT subset completed!
   Dataset splits: ['code', 'math', 'science', 'chat', 'safety']
   Total samples: 32,955,418

🔽 Starting download of Llama-Nemotron-Post-Training-Dataset (RL subset)...



Generating instruction_following split: 0 examples [00:00, ? examples/s]

DatasetGenerationError: An error occurred while generating the dataset
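The RL download above stopped with a `DatasetGenerationError`, which aborts the cell and leaves `dataset_rl` undefined for later cells. One way to keep the notebook running is to wrap each load in a guard that reports the failure and returns `None`. This is a sketch under that assumption: `try_load` is a hypothetical helper, not part of the `datasets` library, and it does not address the cause of the error.

```python
from typing import Any, Callable, Optional


def try_load(loader: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Call `loader` and return its result, or None if it raises.

    Catching Exception broadly is a deliberate choice for an exploratory
    notebook; the exception type and message are printed so the failure
    is still visible.
    """
    try:
        return loader(*args, **kwargs)
    except Exception as exc:
        print(f"Load failed ({type(exc).__name__}): {exc}")
        return None
```

In the cell above, the RL load could then be written as `dataset_rl = try_load(load_dataset, "nvidia/Llama-Nemotron-Post-Training-Dataset", "RL", cache_dir="./datasets/llama-nemotron")`, with the completion messages printed only when the result is not `None`.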

---

## 📊 Dataset Exploration

Explore the structure and content of the downloaded datasets.


In [None]:
# Display summary of all datasets
print("=" * 80)
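print("📋 DATASET SUMMARY")
print("=" * 80)

# The RL download above failed, so dataset_rl may be undefined.
# globals().get() yields None for a missing name instead of raising NameError.
datasets_info = [
    ("Nemotron v1", globals().get("dataset_v1")),
    ("Nemotron v2", globals().get("dataset_v2")),
    ("Llama-Nemotron SFT", globals().get("dataset_sft")),
    ("Llama-Nemotron RL", globals().get("dataset_rl")),
]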

for name, dataset in datasets_info:
    if dataset:
        print(f"\n{name}:")
        print(f"  Splits: {', '.join(dataset.keys())}")
        print(f"  Total samples: {sum(len(dataset[split]) for split in dataset.keys()):,}")
        for split in dataset.keys():
            print(f"    - {split}: {len(dataset[split]):,} samples")

print("\n" + "=" * 80)


In [None]:
# Examine a sample from the SFT dataset
print("🔍 Sample from Llama-Nemotron SFT (Math split):\n")
print("=" * 80)

if dataset_sft and 'math' in dataset_sft:
    sample = dataset_sft['math'][0]
    for key, value in sample.items():
        print(f"\n{key}:")
        print("-" * 40)
        if isinstance(value, str) and len(value) > 500:
            print(value[:500] + "...")
        else:
            print(value)

print("\n" + "=" * 80)


In [None]:
# Additional exploration and analysis can be done here
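As one possible starting point for further analysis, the sketch below ranks the splits of a loaded dataset by size and reports each split's share of the total. `split_shares` is a hypothetical helper; it only assumes a mapping from split name to an object supporting `len()`, which the loaded dataset objects in this notebook satisfy. The `toy` dictionary is made-up stand-in data, not real split counts.

```python
from typing import List, Mapping, Sized, Tuple


def split_shares(dataset: Mapping[str, Sized]) -> List[Tuple[str, int, float]]:
    """Return (split, count, percent_of_total) rows, largest split first."""
    counts = {name: len(split) for name, split in dataset.items()}
    total = sum(counts.values())
    rows = [
        (name, count, 100.0 * count / total if total else 0.0)
        for name, count in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy stand-in for a loaded dataset (illustrative values only).
toy = {"chat": [0] * 1, "code": [0] * 3}
for name, count, pct in split_shares(toy):
    print(f"{name}: {count:,} samples ({pct:.1f}%)")
```

With the real objects, `split_shares(dataset_v2)` would produce the same kind of table for the v2 splits.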
