# Synthetic Reasoning Trace Generator  

This notebook automatically generates synthetic data points that add RAG tokens (search queries and simulated search results) to the reasoning traces of Large Reasoning Models (LRMs).

## Imports & Environment Setup 

We import all required libraries in this section to ensure the rest of the notebook runs smoothly and reproducibly.

In [2]:
# Imports 
import os 
import re
import json 
import requests 
from tqdm import tqdm 
from openai import OpenAI
from datetime import datetime
from dotenv import load_dotenv
from datasets import load_dataset
from openai.types.chat import ChatCompletion

# Environment Setup 
load_dotenv()

OPENAI_SK = os.getenv("OPENAI_SK")
DEEPSEEK_SK = os.getenv("DEEPSEEK_SK")
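
If either key is missing from the `.env` file, `os.getenv` returns `None` and the problem only surfaces at the first API call. A minimal sketch of a fail-fast check; `require_env` is a hypothetical helper and not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

With this in place, the assignments above could read `OPENAI_SK = require_env("OPENAI_SK")`, so a missing key is reported before any request is sent.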

## Utility Function Definitions

This section defines the utility functions used throughout the notebook to keep the main workflow clean and modular.

In [2]:
def makeChatRequestOpenAI(prompt: str, system_prompt: str, model: str = "gpt-4o-mini-2024-07-18", temperature: float = 0.7) -> str:
    """
    Sends a chat completion request to the OpenAI API using the specified model and prompts.

    Args:
        prompt (str): The user message or task prompt.
        system_prompt (str): The system message that sets the behavior or context for the model.
        model (str, optional): The OpenAI model to use. Defaults to "gpt-4o-mini-2024-07-18".
        temperature (float, optional): Sampling temperature for response creativity. Defaults to 0.7.

    Returns:
        str: The generated response from the model.
    """

    # Initialise the OpenAI client with the OpenAI API key 
    client = OpenAI(api_key=OPENAI_SK)

    # Send a chat completion request to the specified model with system and user messages
    response = client.chat.completions.create(
        model=model, 
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature
    )

    return response.choices[0].message.content


def makeChatRequestDeepSeek(prompt: str, system_prompt: str, model: str = "deepseek-reasoner", temperature: float = 0.7) -> ChatCompletion:
    """
    Sends a chat completion request to the DeepSeek-R1 model via the OpenAI-compatible API.

    Args:
        prompt (str): The user's message or question to be sent to the model.
        system_prompt (str): A message that sets the assistant's behavior or role.
        model (str, optional): The identifier of the model to use. Defaults to "deepseek-reasoner".
        temperature (float, optional): Sampling temperature that controls the randomness of output. 
                                       Lower values make output more deterministic. Defaults to 0.7.

    Returns:
        ChatCompletion: The raw response object returned by the DeepSeek API, containing model output,
                        reasoning content, token usage, and other metadata.
    """
    
    # Initialise the OpenAI client with the DeepSeek API key and base URL
    client = OpenAI(api_key = DEEPSEEK_SK, base_url = "https://api.deepseek.com")

    # Send a chat completion request to the specified model with system and user messages
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        temperature=temperature, 
        stream=False
    )

    return response 
    

def generateCoT(question: str) -> tuple[str, str]: 
    """
    Generates a Chain-of-Thought (CoT) reasoning trace for a given question using the DeepSeek-R1 model.

    Args:
        question (str): The input question to send to the model.

    Returns:
        tuple[str, str]: A tuple containing the model's main answer (`content`) and its reasoning trace (`reasoning_content`).
    """

    # Set a general system-level prompt so that the LRM answers the question as it normally would 
    system_prompt = "You are a helpful AI assistant."
     
    # Make a chat request to the DeepSeek-R1 model
    response = makeChatRequestDeepSeek(question, system_prompt)

    # Extract the model's final answer and its reasoning trace, and return both
    return response.choices[0].message.content, response.choices[0].message.reasoning_content 


def updateCoTRAG(question: str, cot: str) -> str: 
    """
    Revises a given Chain of Thought (CoT) to include explicit calls to external retrieval 
    whenever external information is likely needed to justify reasoning steps.

    Args:
        question (str): The input question being reasoned through.
        cot (str): The original Chain of Thought for the given question.

    Returns:
        str: The revised Chain of Thought, including embedded retrieval prompts and simulated results.
    """

    system_prompt = "You are an AI assistant specializing in synthetic data generation. Given original data, you can modify it to create a synthetic version by applying perturbations exactly as specified in the provided instructions."
    prompt = f""" 
    Given a question and its corresponding Chain of Thought, revise the reasoning to include explicit calls to external retrieval **whenever needed**. The revised Chain of Thought should remain fluent, natural, and faithful to the original logic and answer, while clearly reflecting moments where external information is required.

    The model should **not assume it has seen the content before**. Even if a term or concept appears straightforward, the model should verify or clarify it through retrieval if it is not broadly known, general-purpose knowledge. Internal knowledge should only be used when the information is clearly common-sense, canonical, or widely taught.

    Retrieval should be triggered when the reasoning depends on:

    - Obscure or specialized terms, concepts, or entities
    - Background knowledge or definitions not guaranteed to be in the model’s pretraining
    - Facts that are time-sensitive, dynamic, or uncertain
    - Multi-hop questions that require chaining together facts

    When retrieval is needed, the model should **briefly explain why**, then issue a search using the following tokenized format:

    <search_query> your search query here </search_query>

    <search_result> simulated search result here </search_result>

    Multiple retrievals may be included in a single Chain of Thought, including multiple lookups for the same topic (e.g., clarification followed by fact checking). The model should incorporate each result naturally into the reasoning process.

    The final Chain of Thought should remain coherent and lead to the original answer. Output only the revised Chain of Thought and nothing else.

    Here are a few such examples:
    1. 
    Question: What was the annual inflation rate in the U.S. for January 2025, and how does it compare to the rate in December 2024?
    Chain of Thought: Okay, so I need to figure out what the annual inflation rate was in the U.S. for January 2025 and compare it to December 2024. Inflation rates are usually reported monthly by the Bureau of Labor Statistics (BLS) in the U.S., and they typically release these figures after the month has ended. \n\nSince today's date is March 7, 2025, the data for January 2025 should already be available on the BLS website. Looking at the reported figures, the annual inflation rate in December 2024 was 2.9%, while the annual inflation rate in January 2025 was 3.0%. This means that inflation saw a slight increase of 0.1% from December to January. \n\nIn summary, the U.S. inflation rate in January 2025 was 3.0%, up from 2.9% in December 2024. This indicates a slight uptick, possibly due to various economic factors affecting prices during that period.
    
    Output: "To determine the annual inflation rate in the U.S. for January 2025 and compare it to December 2024, we need the most recent data published by the Bureau of Labor Statistics (BLS). Since inflation rates are updated monthly and it's currently March 2025, the relevant figures should be available. <search_query> U.S. annual inflation rate January 2025 site:bls.gov </search_query> <search_result> According to the Bureau of Labor Statistics, the annual inflation rate for the U.S. in January 2025 was 3.0%. In comparison, the rate for December 2024 was 2.9%. </search_result> The annual inflation rate in the U.S. for January 2025 was 3.0%, which is a slight increase from the 2.9% recorded in December 2024. This indicates a modest upward trend in inflation between the two months."
    
    2. 
    Question: A 45-year-old man presents with symptoms including a wide-based gait, a blank facial expression, hallucinations, memory issues, a resting tremor that resolves with movement, and bradykinesia. Based on these clinical findings, what is most likely to be observed in the histological specimen of his brain?
    Chain of Thought: Okay, so here's a 45-year-old guy who's experiencing some pretty concerning symptoms. Let's break this down.\n\nHe's got a wide-based gait. Right off the bat, that makes me think about balance and coordination issues; something affecting his motor control could be at play here.\n\nThen there's the blank facial expression. That's often referred to as facial masking, commonly seen in Parkinson's disease due to a lack of facial muscle movement.\n\nHallucinations and memory issues, though, those are interesting. Hallucinations make me think about some sort of cognitive or psychiatric involvement. Memory issues definitely point towards a possibility of cognitive decline.\n\nNow, a resting tremor that stops with movement is a classic sign of Parkinson's. It's one of those distinguishing features. Paired with bradykinesia (slowness of movement), these two symptoms are screaming Parkinson’s disease to me.\n\nThe wide-based gait might initially make one consider cerebellar issues, but in the context of Parkinsonism, it can also be due to balance problems stemming from basal ganglia dysfunction or even Parkinson's-related gait disturbances.\n\nWhen I think about all these symptoms coming together, it certainly makes Parkinson's disease a strong contender, especially considering his age and the symptom profile.\n\nBut here's the thing—I can't ignore the cognitive and psychiatric symptoms. These could suggest something like Lewy body dementia (LBD), which is related but incorporates more psychiatric features alongside Parkinsonian symptoms.\n\nCellular Level Insight:\nParkinson's disease and dementia with Lewy bodies share a common pathological hallmark: the presence of Lewy bodies. 
These are clumps of proteins, mainly alpha-synuclein, that accumulate in brain cells and cause both movement and cognitive problems.\n\nFinal Thought:\nSo, taking all of this into account, when they look at the brain tissue under a microscope, I would expect them to find Lewy bodies. Those would explain the whole picture—movement issues, cognitive decline, and hallucinations. It ties everything together nicely.

    Output: Okay, so here's a 45-year-old man with a mix of motor, cognitive, and psychiatric symptoms. Let's break it down. He has a wide-based gait, which often suggests balance and coordination problems. That could point to cerebellar dysfunction, but it can also be seen in Parkinsonian syndromes due to postural instability. Next, the blank facial expression—that’s facial masking, which is classic in Parkinson’s disease, caused by decreased facial muscle movement. Then we have hallucinations and memory issues. These aren’t as common in early Parkinson’s disease but are more characteristic of cognitive syndromes like Lewy body dementia. The presence of hallucinations, in particular, suggests we might be dealing with more than just a straightforward Parkinson’s diagnosis. I need to check whether early-onset hallucinations are more typical of Lewy body dementia than Parkinson’s. <search_query> early hallucinations and cognitive decline in Lewy body dementia vs Parkinson's disease </search_query> <search_result> In dementia with Lewy bodies (DLB), visual hallucinations and cognitive impairment tend to occur early and are core diagnostic features. In Parkinson's disease dementia, these symptoms typically develop later in the disease course. </search_result> Okay, so based on that, the early presence of hallucinations and memory problems in this patient fits better with dementia with Lewy bodies (DLB) than Parkinson’s disease. That helps clarify the diagnostic direction. Now, the resting tremor that resolves with movement and bradykinesia are hallmark features of Parkinsonian syndromes. But since DLB includes Parkinsonian motor symptoms, this still fits. So now, what’s going on at the cellular level? To answer that, I need to confirm what pathological finding is shared between Parkinson’s disease and Lewy body dementia, since the diagnosis could fall into either category. 
<search_query> histological hallmark of Lewy body dementia and Parkinson's disease </search_query> <search_result> Both Parkinson’s disease and dementia with Lewy bodies are characterized by the presence of Lewy bodies—intracellular aggregates composed mainly of alpha-synuclein. These are found in neurons throughout the brain, particularly in the substantia nigra and cortex. </search_result> Perfect. So both conditions share the presence of Lewy bodies, which are made up of alpha-synuclein aggregates in neurons. This aligns well with the clinical presentation of combined cognitive and motor symptoms. Final Thought: Given the patient's combination of early hallucinations, memory issues, and Parkinsonian motor signs, the most likely diagnosis is dementia with Lewy bodies. Therefore, on histological examination of the brain, we would expect to see Lewy bodies—intracellular inclusions composed primarily of alpha-synuclein.

    Using these examples as references and keeping the detailed instructions in mind, refine and update the following Chain of Thought: 
    Question: {question}
    Chain of Thought: {cot}
    """

    return makeChatRequestOpenAI(prompt, system_prompt)


def hasUnclosedSearchResult(text: str) -> bool:
    """
    Checks whether the input text contains unclosed <search_result> tags.

    This function scans the text for properly matched <search_result> and 
    </search_result> tags. It returns True if there are unmatched opening or 
    closing tags, and False otherwise.

    Args:
        text (str): The input text to check for unclosed search result tags.

    Returns:
        bool: True if unclosed or mismatched tags are found, False if all tags are properly closed.
    """

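    # Compile a regex pattern to match either opening or closing search result tags.
    # (The earlier pattern `</?end?_?search_result>` required a literal "en" before
    # "search_result", so it never matched the real tags.)
    pattern = re.compile(r'</?search_result>')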

    # Initialise a stack to keep track of opening tags
    stack = []

    # Iterate over each tag match in the text
    for match in pattern.finditer(text):
        tag = match.group()

        # If we find an opening tag, push it onto the stack
        if tag == '<search_result>':
            stack.append(tag)

        # If we find a closing tag
        elif tag == '</search_result>':
            # If there’s no matching opening tag on the stack, it's an error
            if not stack:
                return True  # closing tag with no corresponding opening
            
            # Otherwise, pop the corresponding opening tag
            stack.pop()
            
    # After processing, if the stack isn’t empty, there are unmatched opening tags
    return len(stack) > 0  
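
The same pairing check can be applied to the `<search_query>` tags that the revision prompt also emits. A minimal sketch, assuming tags of a given kind are never nested (so a second opening tag before a close counts as an error); `has_unbalanced_tag` is a hypothetical name, not a function the notebook defines:

```python
import re


def has_unbalanced_tag(text: str, tag: str) -> bool:
    """Return True if <tag> / </tag> markers in `text` are not properly paired.

    Hypothetical generalisation for illustration. Assumes tags are never
    nested, so an opening tag while another is still open is an error.
    """
    pattern = re.compile(rf"<(/?){re.escape(tag)}>")
    is_open = False
    for match in pattern.finditer(text):
        closing = match.group(1) == "/"
        if closing:
            if not is_open:
                return True  # closing tag without a preceding opening tag
            is_open = False
        else:
            if is_open:
                return True  # second opening tag before the first was closed
            is_open = True
    return is_open  # True if the last opening tag was never closed
```

A trace could then be checked for both tag kinds with `any(has_unbalanced_tag(cot, t) for t in ("search_query", "search_result"))`.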


## Data Preparation

In this section, we download and preprocess the datasets that will serve as the foundation for synthetic sample generation.

In [12]:
# Initialise variable to denote which dataset you're using for synthesis 
dataset_no = 0
dataset_lower_bound = 0
dataset_upper_bound = 500
dataset_names = ["FreedomIntelligence/medical-o1-reasoning-SFT", "cais/mmlu"] 

if dataset_no == 0:
    # Download the dataset from HuggingFace
    dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

    # Get samples within the specified range 
    dataset_train = dataset["train"].select(range(dataset_lower_bound, dataset_upper_bound))

elif dataset_no == 1: 
    # Download the dataset from HuggingFace 
    dataset = load_dataset("cais/mmlu", "all")

    # Define the list of subjects we want to keep 
    allowed_subjects = ["college_physics", "college_computer_science", "college_medicine", "high_school_government_and_politics", "high_school_macroeconomics", "high_school_microeconomics", "elementary_mathematics", "high_school_european_history", "high_school_computer_science", "high_school_chemistry", "high_school_biology", "high_school_geography", "high_school_us_history", "anatomy", "astronomy", "international_law", "machine_learning", "nutrition", "prehistory", "sociology", "us_foreign_policy", "computer_security"]

    # Filter the dataset to only include rows with subjects in the allowed list
    dataset_filtered = dataset["test"].filter(lambda example: example["subject"] in allowed_subjects)

    # Get samples within the specified range
    dataset_train = dataset_filtered.select(range(dataset_lower_bound, dataset_upper_bound))
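
The `select(range(...))` calls above assume the chosen split has at least `dataset_upper_bound` rows, which may not hold after filtering by subject. A small sketch of a guard, where `clamp_bounds` is a hypothetical helper rather than part of the notebook:

```python
def clamp_bounds(lower: int, upper: int, size: int) -> range:
    """Clamp a [lower, upper) window so it never exceeds a dataset of `size` rows.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if lower < 0 or upper < lower:
        raise ValueError(f"Invalid bounds: lower={lower}, upper={upper}")
    # Both ends are capped at `size`, so an out-of-range window becomes empty
    return range(min(lower, size), min(upper, size))
```

For example, `dataset_filtered.select(clamp_bounds(dataset_lower_bound, dataset_upper_bound, len(dataset_filtered)))` would return fewer rows instead of failing when the filtered split is shorter than expected.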

## Synthetic Data Generation Loop 

### Non-RAG Samples

The script presented here generates reasoning traces that answer questions without using any RAG tokens to access external information. We use the `deepseek-reasoner` model, which corresponds to `DeepSeek-R1:671B`, and take its reasoning trace as the final trace without further modification.
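
Each prompt in this section combines an MMLU question with its answer options. A sketch of one way to render that pairing, assuming `choices` is a list of strings as in `cais/mmlu`; `format_mmlu_question` is hypothetical and produces a lettered layout rather than the plain string concatenation used in the loop:

```python
def format_mmlu_question(question: str, choices: list[str]) -> str:
    """Render a question and its options as a lettered multiple-choice prompt.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    lines = [question.strip(), "Options:"]
    for index, choice in enumerate(choices):
        letter = chr(ord("A") + index)  # 0 -> "A", 1 -> "B", ...
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)
```

Lettered options keep the question text and the choices visually separate, which may make the resulting traces easier to read.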


In [None]:
# Initialize a list to store synthetic samples
synthetic_dataset = []

# Initialise a string to keep track of the indices at which errors occur during synthesis
error_str = ""

print("Beginning synthesis process...")
for i in tqdm(range(len(dataset_train)), desc="Synthesizing samples"):
    # Extract the current sample from the dataset 
    sample = dataset_train[i]

    # Build the prompt from the question and its answer options (MMLU field names)
    question = sample["question"] + " Options: " + str(sample["choices"])

    # Try to generate Chain of Thought with unlimited retries 
    retry_count = 0
    success = False  

    while not success: 
        try: 
            response, cot = generateCoT(question)
            success = True 
        except Exception as e: 
            retry_count += 1
            print(f"Retry #{retry_count} for sample {i + dataset_lower_bound}. Error: {str(e)}")
         
 
    # Create a dictionary to create a synthetic sample
    sample_dict = {
        "Question": question,
        "CoT": cot,
        "Response": response
    }

    # Store the synthetic sample in a list  
    synthetic_dataset.append(sample_dict)
    
print("Synthesis process complete!")

# Write to JSON file
print("Writing synthetic samples to JSON...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
sanitised_dataset_name = dataset_names[dataset_no].replace('/', '_')
with open(f"synthetic_dataset_{sanitised_dataset_name}_{dataset_lower_bound}:{dataset_upper_bound}_{error_str}_{timestamp}.json", "w", encoding="utf-8") as f:
    json.dump(synthetic_dataset, f, ensure_ascii=False, indent=4)
print("Writing complete!")
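
The loop above retries without limit, so a persistent failure such as an invalid API key would keep it spinning indefinitely. A minimal sketch of a bounded alternative with exponential backoff; `call_with_retries` and its parameters are hypothetical, not part of the notebook:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds or `max_attempts` is reached.

    Hypothetical helper for illustration. The wait doubles after each
    failure (base_delay, 2 * base_delay, 4 * base_delay, ...), and the last
    exception is re-raised once the attempts are exhausted.
    """
    attempt = 1
    while True:
        try:
            return fn()
        except Exception as error:
            if attempt >= max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)
            attempt += 1
```

Inside the loop this could be used as `response, cot = call_with_retries(lambda: generateCoT(question))`, with the caller deciding whether to skip or abort when the exception finally propagates.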


### RAG Samples

The following script generates synthetic samples that demonstrate the use of RAG tokens to retrieve external information. Since the dataset we have selected for perturbation already contains reasoning traces, we only need to update them with our engineered prompt. This modification is handled by the `gpt-4o-mini-2024-07-18` model, which we found to strike a good balance between instruction-following capability and cost-efficiency.
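
The revised traces embed each retrieval as a `<search_query>` / `<search_result>` pair, as specified in the prompt inside `updateCoTRAG`. A small sketch for pulling those pairs out of a trace for inspection; `extract_retrievals` is a hypothetical helper, and it assumes every query is immediately followed by its result:

```python
import re

# A query block followed (after optional whitespace) by its result block
RETRIEVAL_PATTERN = re.compile(
    r"<search_query>(.*?)</search_query>\s*<search_result>(.*?)</search_result>",
    re.DOTALL,
)


def extract_retrievals(cot: str) -> list[tuple[str, str]]:
    """Return (query, result) pairs found in a revised Chain of Thought.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [(query.strip(), result.strip()) for query, result in RETRIEVAL_PATTERN.findall(cot)]
```

A trace that yields an empty list contains no complete retrieval pair, which can be a quick signal that the model ignored the tokenized format.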

In [13]:
# Initialise a list to store synthetic samples
synthetic_dataset = []

print("Beginning synthesis process...")
for i in tqdm(range(len(dataset_train)), desc="Synthesizing samples"):
    # Extract the current sample from the dataset 
    sample = dataset_train[i]

    # Extract the question, Chain of Thought, and response from the current sample 
    question = sample["Question"]
    cot = sample["Complex_CoT"]
    response = sample["Response"]

    # Try to update Chain of Thought with unlimited retries 
    retry_count = 0
    success = False 

    while not success: 
        try: 
            # Update the Chain of Thought of the original sample 
            updated_cot = updateCoTRAG(question, cot)
            success = True 
        except Exception as e: 
            retry_count += 1
            print(f"Retry #{retry_count} for sample {i + dataset_lower_bound}. Error: {str(e)}")

    # Create a dictionary to create a synthetic sample
    sample_dict = {
        "Question": question,
        "CoT": updated_cot,
        "Response": response
    }

    # Store the synthetic sample in a list  
    synthetic_dataset.append(sample_dict)
print("Synthesis process complete!")

# Write to JSON file
print("Writing synthetic samples to JSON...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
sanitised_dataset_name = dataset_names[dataset_no].replace('/', '_')
with open(f"synthetic_dataset_{sanitised_dataset_name}_{dataset_lower_bound}:{dataset_upper_bound}_{timestamp}.json", "w", encoding="utf-8") as f:
    json.dump(synthetic_dataset, f, ensure_ascii=False, indent=4)
print("Writing complete!")


Beginning synthesis process...


Synthesizing samples: 100%|██████████| 500/500 [1:17:29<00:00,  9.30s/it]

Synthesis process complete!
Writing synthetic samples to JSON...
Writing complete!
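
The run above took roughly 77 minutes for 500 samples, and the results are written only once at the end, so an interruption would lose all progress. A sketch of incremental checkpointing to a JSON Lines file; `append_jsonl` and `load_jsonl` are hypothetical helpers, not part of the notebook:

```python
import json
from pathlib import Path


def append_jsonl(path: str, record: dict) -> None:
    """Append one sample as a single JSON line (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> list[dict]:
    """Load all samples written so far; returns [] if the file does not exist."""
    file_path = Path(path)
    if not file_path.exists():
        return []
    with file_path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing
        return [json.loads(line) for line in f if line.strip()]
```

Calling `append_jsonl` right after each `sample_dict` is built, and resuming from `len(load_jsonl(path))` on restart, would be one way to make long runs restartable.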





## Dataset Cleaning

The script below detects improperly closed `<search_result>` tags in the dataset. During fine-tuning, we found that even a few malformed examples could significantly affect results. To avoid this, we use the script to flag the affected entries, which we then fix by hand.



In [None]:
# Load the dataset
with open('dataset.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Scan dataset
bad_entries = []
for i, entry in enumerate(data):
    if 'CoT' in entry and hasUnclosedSearchResult(entry['CoT']):
        bad_entries.append((i, entry['Question']))

# Report
if bad_entries:
    print(f"Found {len(bad_entries)} entries with mismatched <search_result> / </search_result> tags:")
    for idx, question in bad_entries:
        print(f" - Entry #{idx}: {question}")
else:
    print("All entries have properly matched <search_result> and </search_result> tags.")
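
Since the flagged entries are fixed by hand, it can help to see where in the trace the first problem occurs. A minimal sketch that returns a short text window around the first unmatched `<search_result>` tag; `first_mismatch_context` is a hypothetical helper and assumes the tags are never nested:

```python
import re
from typing import Optional

TAG_PATTERN = re.compile(r"</?search_result>")


def first_mismatch_context(text: str, width: int = 40) -> Optional[str]:
    """Return text around the first unmatched <search_result> tag, or None.

    Hypothetical helper for illustration. Assumes tags are never nested.
    """

    def snippet(position: int) -> str:
        return text[max(0, position - width): position + width]

    open_position = None  # offset of the currently open tag, if any
    for match in TAG_PATTERN.finditer(text):
        if match.group() == "<search_result>":
            if open_position is not None:
                return snippet(open_position)  # earlier opening tag never closed
            open_position = match.start()
        else:
            if open_position is None:
                return snippet(match.start())  # closing tag with no opening tag
            open_position = None
    # An opening tag left open at the end of the text is also a mismatch
    return snippet(open_position) if open_position is not None else None
```

Printing `first_mismatch_context(entry['CoT'])` next to each flagged question would point directly at the span that needs editing.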