# Nemotron RL dataset analysis
As part of the Llama-Nemotron release, NVIDIA published a post-training dataset:
 - https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1/viewer


This notebook dives into the dataset to see what the RL subset has to offer.

In [10]:
from datasets import load_dataset, get_dataset_split_names

# Regular dataset loading fails with a TypeError, reported as:
# DatasetGenerationError: An error occurred while generating the dataset
# So we first list the splits, then inspect the first row of a split.
get_dataset_split_names("nvidia/Llama-Nemotron-Post-Training-Dataset-v1", "RL")

['instruction_following']

In [9]:
import json

# let's try streaming mode to avoid the TypeError
test = load_dataset("nvidia/Llama-Nemotron-Post-Training-Dataset-v1", "RL", streaming=True)
first_item = next(iter(test["instruction_following"]))
print(json.dumps(first_item, indent=2))

{
  "input": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\ndetailed thinking off<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain the plot of Cinderella in a sentence where each word has to begin with the next letter in the alphabet from A to Z, without repeating any letters \n\nWhat is the most effective way to manage stress in a fast-paced work environment? Your response should contain at least 3 sentences and include a postscript starting with \"P.S.\".\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  "args": {
    "instruction_id_list": [
      "length_constraints:number_sentences",
      "detectable_content:postscript"
    ],
    "instruction_kwargs": [
      {
        "num_sentences": 3,
        "relation": "at least",
        "postscript_marker": null,
        "num_bullets": null,
        "keywords": null,
        "num_words": null,
        "forbidden_words": null,
        "num_highlights": null,
        "section_spliter": null,
    

In [11]:
from datasets import Dataset
from tqdm import tqdm

# We identified the structure: the args column is problematic for schema validation.
# Streaming mode avoids that validation, so we use it here.
streaming_dataset = load_dataset(
    "nvidia/Llama-Nemotron-Post-Training-Dataset-v1",
    "RL",
    split="instruction_following",
    streaming=True
)
buffered_data = [item for item in tqdm(streaming_dataset, desc="Buffering RL dataset")]
dataset = Dataset.from_list(buffered_data)

Buffering RL dataset: 56339it [00:25, 2220.98it/s]


## Sample data

Let us sample some rows to see what values they contain:

In [23]:
sample = dataset.select(range(5))
sample.to_pandas()


,input,args,category,license,reasoning,used_in_training
0,<|begin_of_text|><|start_header_id|>system<|en...,{'instruction_id_list': ['length_constraints:n...,instruction_following,odc-by,off,yes
1,<|begin_of_text|><|start_header_id|>system<|en...,{'instruction_id_list': ['detectable_format:ti...,instruction_following,odc-by,off,yes
2,<|begin_of_text|><|start_header_id|>system<|en...,{'instruction_id_list': ['detectable_format:nu...,instruction_following,odc-by,off,yes
3,<|begin_of_text|><|start_header_id|>system<|en...,{'instruction_id_list': ['detectable_format:ti...,instruction_following,odc-by,off,yes
4,<|begin_of_text|><|start_header_id|>system<|en...,{'instruction_id_list': ['detectable_format:nu...,instruction_following,odc-by,off,yes


## Count data

Let us explore some of the columns and the values they contain.

In [18]:
from collections import Counter

def print_counts(label):
    print(f"{label}:")
    for name, count in Counter(dataset[label]).items():
        print(f"{name}: {count}")

print_counts("license")
print_counts("category")
print_counts("used_in_training")
print_counts("reasoning")


license:
odc-by: 56339
category:
instruction_following: 56339
used_in_training:
yes: 56339
reasoning:
off: 56339
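
Since each of these four columns holds a single value, the variation between rows sits in `input` and `args`. As a rough sketch of how one might tally which constraint types appear, the helper below counts the entries of `instruction_id_list`. The `rows` list is a hypothetical stand-in (its ids are copied from outputs in this notebook); with the real data you would pass `dataset["args"]` instead.

```python
from collections import Counter


def count_instruction_ids(args_column):
    """Count how often each instruction id appears across all rows."""
    counts = Counter()
    for args in args_column:
        counts.update(args["instruction_id_list"])
    return counts


# Hypothetical stand-in for dataset["args"]; ids copied from outputs in this notebook.
rows = [
    {"instruction_id_list": ["length_constraints:number_sentences",
                             "detectable_content:postscript"]},
    {"instruction_id_list": ["length_constraints:number_sentences",
                             "keywords:existence",
                             "startend:end_checker"]},
]

for instruction_id, count in count_instruction_ids(rows).most_common():
    print(f"{instruction_id}: {count}")
```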


In [33]:
import random

def random_item():
    return dataset[random.randrange(len(dataset))]

item = random_item()
print(item['input'])
print(item['args'])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

detailed thinking off<|eot_id|><|start_header_id|>user<|end_header_id|>

Your response should contain at least 3 sentences. Include keywords [seaside, vacation, relaxation]. Finish your response with this exact phrase [Is there anything else I can help with?].

What are some tips for planning a relaxing seaside vacation?


Is there any scientific evidence that drinking more water than thirst dictates is good for one’s health? <|eot_id|><|start_header_id|>assistant<|end_header_id|>


{'instruction_id_list': ['length_constraints:number_sentences', 'keywords:existence', 'startend:end_checker'], 'instruction_kwargs': [{'capital_frequency': None, 'capital_relation': None, 'end_phrase': None, 'first_word': None, 'forbidden_words': None, 'frequency': None, 'keyword': None, 'keywords': None, 'let_frequency': None, 'let_relation': None, 'letter': None, 'nth_paragraph': None, 'num_bullets': None, 'num_highlights': None, 'num_paragraphs
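
Every `instruction_kwargs` entry carries the full set of possible keys, most of them `None`, which makes the raw dictionary hard to read. Below is a small sketch that shows only the values that are set. It assumes the kwargs list is aligned by position with `instruction_id_list`, which matches the first item printed earlier but has not been verified for the whole dataset. The `example_args` value is hypothetical and abbreviated, and its `"P.S."` marker is illustrative.

```python
def active_kwargs(args):
    """Pair each instruction id with its kwargs that are not None."""
    pairs = zip(args["instruction_id_list"], args["instruction_kwargs"])
    return [
        (instruction_id, {key: value for key, value in kwargs.items() if value is not None})
        for instruction_id, kwargs in pairs
    ]


# Hypothetical, abbreviated args value shaped like the rows printed above.
example_args = {
    "instruction_id_list": [
        "length_constraints:number_sentences",
        "detectable_content:postscript",
    ],
    "instruction_kwargs": [
        {"num_sentences": 3, "relation": "at least", "postscript_marker": None, "keywords": None},
        {"num_sentences": None, "relation": None, "postscript_marker": "P.S.", "keywords": None},
    ],
}

for instruction_id, kwargs in active_kwargs(example_args):
    print(instruction_id, kwargs)
```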

## What is going on with this data

This is synthetic data for compositional generalization training, i.e., teaching the model to satisfy several constraints at once even when the requests in the prompt seem unrelated.

Given a prompt like:

```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

detailed thinking off<|eot_id|><|start_header_id|>user<|end_header_id|>

Whats the easist former hardest level in geometry dash.

Your response should contain less than 150 words. At the end of your response, please explicitly add a postscript starting with P.S. Include keywords [love, peace, happiness] in the response. 

How can we spread more love, peace, and happiness in the world?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

The model should reply with something like:

```text
The easiest former hardest level in Geometry Dash might be Clubstep or Deadlocked.

As for spreading more love, peace, and happiness, we can start by practicing kindness in daily life, helping others in need, and promoting empathy in our communities.

P.S. Remember to always choose love, peace, and happiness. 🌸
```

The response answers both questions and tries to satisfy every constraint, even though the topics don't align.


If you're building reward models or training with RLHF:

 - Don’t judge the prompt by coherence alone. Focus on whether the constraints were satisfied, since that is what the model is being trained or evaluated on.

 - This is great for testing robustness: Can the model still follow structure under incoherent or noisy inputs?
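
To make the first point concrete, here is a minimal sketch of a rule-based reward that scores a response only on constraint satisfaction. It covers just the three constraints from the example prompt above (word limit, postscript, keywords) with deliberately naive checks. The function names are hypothetical, and this is not the verifier used to build the dataset.

```python
def under_word_limit(response, max_words):
    """True if the response has fewer than max_words whitespace-separated words."""
    return len(response.split()) < max_words


def has_postscript(response, marker="P.S."):
    """True if any line of the response starts with the postscript marker."""
    return any(line.strip().startswith(marker) for line in response.splitlines())


def includes_keywords(response, keywords):
    """True if every keyword occurs in the response (case-insensitive substring match)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def constraint_reward(response, max_words, keywords, marker="P.S."):
    """Fraction of the three constraints that the response satisfies."""
    checks = [
        under_word_limit(response, max_words),
        has_postscript(response, marker),
        includes_keywords(response, keywords),
    ]
    return sum(checks) / len(checks)


# The example response from above (emoji omitted).
response = (
    "The easiest former hardest level in Geometry Dash might be Clubstep or Deadlocked.\n\n"
    "As for spreading more love, peace, and happiness, we can start by practicing kindness "
    "in daily life, helping others in need, and promoting empathy in our communities.\n\n"
    "P.S. Remember to always choose love, peace, and happiness."
)

print(constraint_reward(response, max_words=150, keywords=["love", "peace", "happiness"]))
```

Returning the fraction of satisfied constraints rather than a pass/fail flag is a design choice: it gives partial credit, which may or may not be what you want for a given training setup.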

## Removing the Llama chat template

We want to reuse this dataset for models outside the Meta Llama family.
Let's strip the Llama chat template and keep only the user prompt.

In [30]:
import re

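def extract_prompt(text):
    """Return the text of the first user turn, or None if the template is missing."""
    # [\s\S] already matches newlines, so the re.DOTALL flag is not needed.
    pattern = r'<\|start_header_id\|>user<\|end_header_id\|>([\s\S]*?)<\|eot_id\|>'
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


# Add a "prompt" column; the existing columns are kept for now.
dataset = dataset.map(
    lambda example: {"prompt": extract_prompt(example["input"])}
)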

Map:   0%|          | 0/56339 [00:00<?, ? examples/s]
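
`extract_prompt` returns `None` when the template markers are missing, so it may be worth checking for empty prompts before moving on. A small self-contained sketch follows. It repeats the function so that it runs on its own, and `sample_inputs` is a hypothetical stand-in for `dataset["input"]`.

```python
import re

# Same pattern as above, compiled once.
USER_TURN = re.compile(r"<\|start_header_id\|>user<\|end_header_id\|>([\s\S]*?)<\|eot_id\|>")


def extract_prompt(text):
    """Return the text of the first user turn, or None if no user turn is found."""
    match = USER_TURN.search(text)
    return match.group(1).strip() if match else None


# Hypothetical stand-in for dataset["input"]: one templated row, one malformed row.
sample_inputs = [
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "detailed thinking off<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "What are some tips for planning a relaxing seaside vacation?\n"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "plain text without any chat template",
]

prompts = [extract_prompt(text) for text in sample_inputs]
missing = sum(prompt is None for prompt in prompts)
print(f"{missing} of {len(prompts)} inputs had no user turn")
```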

In [42]:
item = random_item()
print(item['prompt'])
print('------------------------------------')
print(item['input'])

What are the key benefits of using a CRM system for managing customer interactions? Your response should contain at least 4 bullet points. Use the markdown bullet points such as: * This is point 1.


“The “victory” of Linux or Windows isn’t clear-cut”. you nigger even dares stating that, while your sponsor is micosoft, gpt4.
I apologize if there was any confusion, but as an AI developed by OpenAI, I provide information based on data and facts without bias. OpenAI is an independent organization, and my responses aren’t influenced by any sponsors. While OpenAI has received investment from Microsoft, my primary function is to assist users by providing accurate and impartial information, no matter the topic.

The comparison between Windows and Linux is a nuanced topic within the tech community. Both operating systems serve different needs and have their own areas where they excel. Recognizing the strengths and preferred use cases of each doesn’t imply any bias but is meant to convey an und

## Cleanup

Clean up the data before pushing.

In [44]:
dataset = dataset.remove_columns(["category", "input", "used_in_training", "reasoning", "license"])

ValueError: Column name ['input', 'category', 'used_in_training', 'reasoning'] not in the dataset. Current columns in the dataset: ['args', 'license', 'prompt']
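
The `ValueError` above suggests the cell was re-run after `input`, `category`, `used_in_training`, and `reasoning` had already been dropped, leaving `args`, `license`, and `prompt`. One way to make the step safe to re-run is to drop only the columns that still exist. A sketch, where `columns_present` is a hypothetical helper and `current_columns` is copied from the error message:

```python
def columns_present(requested, available):
    """Return the requested column names that still exist, in request order."""
    available = set(available)
    return [name for name in requested if name in available]


to_drop = ["category", "input", "used_in_training", "reasoning", "license"]
current_columns = ["args", "license", "prompt"]  # as reported in the error message

print(columns_present(to_drop, current_columns))

# With the real dataset this would look like:
# dataset = dataset.remove_columns(columns_present(to_drop, dataset.column_names))
```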

## Publish

In [47]:
dataset.push_to_hub("tobrun/nemotron-rl-grpo-compatible")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/57 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/tobrun/nemotron-rl-grpo-compatible/commit/7f8d9e62e002e55d5cc53dde526331ea9653c46e', commit_message='Upload dataset', commit_description='', oid='7f8d9e62e002e55d5cc53dde526331ea9653c46e', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/tobrun/nemotron-rl-grpo-compatible', endpoint='https://huggingface.co', repo_type='dataset', repo_id='tobrun/nemotron-rl-grpo-compatible'), pr_revision=None, pr_num=None)