# Nemotron SFT dataset analysis
As part of the Llama-Nemotron release, NVIDIA published a post-training dataset:
 - https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1/viewer


This notebook explores the dataset to see what the SFT subset has to offer.

In [1]:
from datasets import load_dataset

dataset = load_dataset("nvidia/Llama-Nemotron-Post-Training-Dataset-v1")
dataset

DatasetDict({
    code: Dataset({
        features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training'],
        num_rows: 9612677
    })
    math: Dataset({
        features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training'],
        num_rows: 19840970
    })
    science: Dataset({
        features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training'],
        num_rows: 708920
    })
    chat: Dataset({
        features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training'],
        num_rows: 39792
    })
    safety: Dataset({
        features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training'],
        num_rows: 31426
    })
})

## Merge data

The Nemotron dataset is divided into several categories. Let's merge them into a single dataset and keep an `origin` column that records which split each row originated from. Because the input data is large, we map in batches and run the garbage collector after each split.

In [3]:
from datasets import concatenate_datasets
import gc 

tagged_splits = []

for key, ds in dataset.items():
    tagged_ds = ds.map(
        lambda examples: {"origin": [key] * len(examples["input"])},
        batched=True
    )    
    tagged_splits.append(tagged_ds)
    gc.collect()

merged_dataset = concatenate_datasets(tagged_splits)

In [4]:
merged_dataset

Dataset({
    features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training', 'origin'],
    num_rows: 30233785
})
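
As a quick sanity check, the merged row count should equal the sum of the per-split counts printed when the dataset was loaded. The sketch below hard-codes those counts from the outputs in this notebook, so it runs without touching the data; the names `split_rows` and `merged_rows` are only illustrative.

```python
# Row counts copied from the DatasetDict output earlier in this notebook.
split_rows = {
    "code": 9_612_677,
    "math": 19_840_970,
    "science": 708_920,
    "chat": 39_792,
    "safety": 31_426,
}

# num_rows reported for merged_dataset.
merged_rows = 30_233_785

total = sum(split_rows.values())
print(f"sum of splits:  {total:,}")
print(f"matches merged: {total == merged_rows}")
```

If the two numbers ever disagree after a re-run, a split was probably skipped or duplicated during the merge.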

## Sample data

Let's sample a few rows to see what values they contain:

In [8]:
sample = merged_dataset.select(range(5))
sample.to_pandas()


index,input,output,category,license,reasoning,generator,used_in_training,origin
0,<|begin_of_text|><|start_header_id|>system<|en...,"<think>\nOkay, I need to write a Python functi...",code,cc-by-4.0,on,DeepSeek-R1,yes,code
1,<|begin_of_text|><|start_header_id|>system<|en...,"<think>\nOkay, let's see. The problem is to fi...",code,cc-by-4.0,on,DeepSeek-R1,yes,code
2,<|begin_of_text|><|start_header_id|>system<|en...,"<think>\nOkay, I need to write a Python functi...",code,cc-by-4.0,on,DeepSeek-R1,yes,code
3,<|begin_of_text|><|start_header_id|>system<|en...,"<think>\nOkay, let's see. The problem is to re...",code,cc-by-4.0,on,DeepSeek-R1,yes,code
4,<|begin_of_text|><|start_header_id|>system<|en...,"<think>\nOkay, I need to sort an array in asce...",code,cc-by-4.0,on,DeepSeek-R1,yes,code
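
The `output` values in the sample start with a `<think>` block. A minimal sketch for separating the reasoning trace from the final answer is shown below. It assumes the trace is closed by a matching `</think>` tag, which the truncated sample does not confirm, and the helper name `split_reasoning` and the example string are hypothetical.

```python
def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) for one `output` string.

    Assumption: the trace is wrapped in <think>...</think>. If either tag
    is missing, the trace is empty and the whole string is the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    start = output.find(open_tag)
    end = output.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", output.strip()
    trace = output[start + len(open_tag):end].strip()
    answer = output[end + len(close_tag):].strip()
    return trace, answer


# Hypothetical string shaped like the samples above, not taken from the dataset.
example = "<think>\nOkay, I need to add two numbers.\n</think>\n\ndef add(a, b):\n    return a + b"

trace, answer = split_reasoning(example)
print("trace: ", trace)
print("answer:", answer)
```

Applied to a real row, this would be called as `split_reasoning(sample[0]["output"])`, provided the closing-tag assumption holds.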


## License

The license column is interesting to explore. Let's look at its unique values and how many rows carry each one.

In [9]:
from collections import Counter

license_counts = Counter(merged_dataset["license"])

for license_name, count in license_counts.items():
    print(f"{license_name}: {count}")

cc-by-4.0: 30051023
cc-by-sa: 172514
odc-by: 10248
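
Raw counts at this scale are hard to compare by eye, so it can help to turn them into shares of the total. The sketch below reuses the three counts printed above as literals; `printed_counts` is an illustrative name chosen to avoid shadowing the `license_counts` counter.

```python
# Counts copied from the printed output above.
printed_counts = {
    "cc-by-4.0": 30_051_023,
    "cc-by-sa": 172_514,
    "odc-by": 10_248,
}

total = sum(printed_counts.values())
shares = {name: count / total for name, count in printed_counts.items()}

# Largest share first.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {share:8.4%}")
```

The total equals the merged row count, and `cc-by-4.0` covers more than 99% of the rows.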


## Category

In [12]:
from collections import Counter

category_counts = Counter(merged_dataset["category"])

for category_name, count in category_counts.items():
    print(f"{category_name}: {count}")

code: 9612677
math: 19840970
science: 708920
chat: 39792
safety: 31426


## Reasoning

The Nemotron models were trained to toggle reasoning on and off, so plenty of entries lack the typical `<think>` tokens. Let's filter out all non-reasoning rows and keep only those whose `reasoning` value is `on`.

In [10]:
filtered = merged_dataset.filter(lambda x: "on" in x["reasoning"])
filtered

Dataset({
    features: ['input', 'output', 'category', 'license', 'reasoning', 'generator', 'used_in_training', 'origin'],
    num_rows: 1228707
})
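
The filter above uses a substring test (`"on" in ...`). For flag values of `on` and `off` this behaves the same as an equality check, but it would also match any other string that happens to contain `on`. The sketch below contrasts the two on a few hypothetical rows; the value `"none"` is invented purely for illustration and is not known to occur in the dataset.

```python
def is_reasoning_on(row: dict) -> bool:
    """True only when the row's `reasoning` flag is exactly "on"."""
    return row["reasoning"] == "on"


# Hypothetical rows; "none" is an invented value used to show the difference.
rows = [
    {"reasoning": "on"},
    {"reasoning": "off"},
    {"reasoning": "none"},
]

substring_hits = [r["reasoning"] for r in rows if "on" in r["reasoning"]]
exact_hits = [r["reasoning"] for r in rows if is_reasoning_on(r)]

print("substring test keeps:", substring_hits)
print("equality test keeps: ", exact_hits)
```

The same predicate should be usable as `merged_dataset.filter(is_reasoning_on)`, since `filter` passes each example to the function as a dict-like row.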