# Synthetic Data Generation

This guide provides a quickstart for creating a synthetic QA and Retrieval-Augmented Generation (RAG) dataset using your own documents.

## Setup

Install the required packages if they are not already available in your environment:

In [None]:
!pip install langchain-community==0.3.2
!pip install langchain-core==0.3.10
!pip install langchain-experimental==0.3.2
!pip install "unstructured[local-inference]"
!pip install python-poppler==0.4.1

To use [SambaNova Cloud](https://cloud.sambanova.ai) models, you need an API key. The following code prompts for your [SambaNova Cloud API Key](https://cloud.sambanova.ai/apis) without echoing it, unless the `SAMBANOVA_API_KEY` environment variable is already set.

In [1]:
import getpass
import os
if not os.getenv("SAMBANOVA_API_KEY"):
    os.environ["SAMBANOVA_API_KEY"] = getpass.getpass(
        "Enter your SambaNova Cloud API key: "
    )

## Load data

First, we will load the data from your source files using the [Unstructured](https://docs.unstructured.io/open-source/introduction/quick-start) library.

In [None]:
from unstructured.partition.pdf import partition_pdf

def extract_pdf(file_path):
    """Extract text and tables from a PDF file."""
    raw_pdf_elements = partition_pdf(
        filename=file_path,
        strategy='hi_res',
        hi_res_model_name='yolox',
        infer_table_structure=True,
        chunking_strategy='by_title',
        max_characters=4096,
        combine_text_under_n_chars=500,
    )

    return raw_pdf_elements

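Next, we collect the text of each extracted element in a list. Table elements are kept in a separate list and wrapped as LangChain `Document` objects that hold their HTML representation.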

In [5]:
from langchain_core.documents import Document

text_documents = []
table_documents = []
for document in extract_pdf("./data/SambaNova_Dataflow.pdf"):
    if document.category == 'Table':
        # wrap table elements as LangChain documents, keeping their HTML representation
        table_documents.append(Document(page_content=document.metadata.text_as_html))
    else:
        text_documents.append(document.text)

text_documents

['Break through the limits of your GPU\n\nWHITEPAPER\n\nAccelerated Computing with a Reconfigurable Dataflow Architecture\n\nTrends Driving New Processing Architectures\n\nWith the rapid expansion of applications that can be characterized by dataflow processing, such as natural-language processing and recommendation engines, the performance and efficiency challenges of traditional, instruction set architectures have become apparent. To address this and enable the next generation of scientific and machine-learning applications, SambaNova Systems has developed the Reconfigurable Dataflow ArchitectureTM, a unique vertically integrated platform that is optimized from algorithm to silicon. Three key long-term trends infused SambaNova’s effort to develop this new accelerated computing architecture.\n\nFirst, the sizable, generation-to-generation performance gains for multi- core processors have tapered off. As a result, developers can no longer depend on traditional performance improvements 

## Split Documents

The Unstructured library has already partitioned the document with the `by_title` chunking strategy, which starts a new section at each title. We can refine these sections with a second pass based on semantic similarity, using the [Semantic Chunker](https://python.langchain.com/docs/how_to/semantic-chunker/) from LangChain Experimental. It embeds the text sentence by sentence, keeps neighbouring sentences with similar vectors (high semantic similarity) together, and splits where the vectors differ sharply (low semantic similarity).

The goal is for each chunk used to generate synthetic questions and answers to carry enough context to produce high-quality Q&A pairs.
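To make the idea of percentile-based breakpoints concrete, here is a simplified sketch that uses only the standard library. It is not LangChain's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `cosine_distance`, `percentile`, and `toy_semantic_split` are illustrative names. A new chunk starts wherever the distance between two neighbouring sentences exceeds the chosen percentile of all neighbour distances.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (assumption): a bag-of-words count vector, not a real model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def toy_semantic_split(text, breakpoint_percentile=95):
    """Split text where the distance between neighbouring sentences is unusually large."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    if len(sentences) < 2:
        return sentences
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # the gap to the previous sentence is above the threshold: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Lowering `breakpoint_percentile` makes more gaps exceed the threshold, so the text is cut into more, smaller chunks.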

In [8]:
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Set an embedding model to embed the sentences and measure the similarity between them.
embedding_model = HuggingFaceInstructEmbeddings(
    model_name='intfloat/e5-large-v2',
    query_instruction='Represent this sentence for searching relevant passages: ',
    encode_kwargs={'normalize_embeddings': True},
)

# Split documents into chunks based on semantic similarity, then filter by length.
def split_documents(documents, breakpoint_threshold_amount=95, min_doc_length=None):
    if isinstance(documents, str):
        documents = [documents]
    text_splitter = SemanticChunker(
        embeddings=embedding_model,
        breakpoint_threshold_type='percentile',
        breakpoint_threshold_amount=breakpoint_threshold_amount,
        sentence_split_regex=r'(?<=[.?!])\s+',
    )
    new_docs = text_splitter.create_documents(documents)
    if min_doc_length is not None:
        # remove short documents assuming there is not enough information to generate qa pairs from them
        new_docs = [doc for doc in new_docs if len(doc.page_content) >= min_doc_length]
    return new_docs

# Join the split text documents with the table documents, if any exist.
documents = split_documents(text_documents, 95, 200) + table_documents

documents

[Document(metadata={}, page_content='Break through the limits of your GPU\n\nWHITEPAPER\n\nAccelerated Computing with a Reconfigurable Dataflow Architecture\n\nTrends Driving New Processing Architectures\n\nWith the rapid expansion of applications that can be characterized by dataflow processing, such as natural-language processing and recommendation engines, the performance and efficiency challenges of traditional, instruction set architectures have become apparent. To address this and enable the next generation of scientific and machine-learning applications, SambaNova Systems has developed the Reconfigurable Dataflow ArchitectureTM, a unique vertically integrated platform that is optimized from algorithm to silicon. Three key long-term trends infused SambaNova’s effort to develop this new accelerated computing architecture. First, the sizable, generation-to-generation performance gains for multi- core processors have tapered off.'),
 Document(metadata={}, page_content='As a result, 

## Generate QA pairs

With our granular documents ready, we can use a Large Language Model (LLM) to create QA pairs. Consider the following:

- Depending on the dataset's purpose, you may want the model to include the references it used to generate the answer.
- You might want the model to include the reasoning steps that lead from the context to the answer. A good strategy for this is [Chain of Thought (CoT)](https://www.promptingguide.ai/techniques/cot).
- The model should generate a structured output from which we can extract the question, the thought process, the answer, and the references.

First, we'll initialize our LLM and define the schema for the QA data.

In [9]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatSambaNovaCloud 
from pydantic import BaseModel, Field
import json

class SyntheticDatum(BaseModel):
    """Model of a synthetic generated datum"""
    question: str = Field(description='generated question')
    answer: str = Field(description='generated answer')
    references: list[str] = Field(description='references for generated answer')
    thought: str = Field(description='thought for answer generation')


class SyntheticData(BaseModel):
    """Model of a synthetic data generation"""
    data: list[SyntheticDatum] = Field(description='synthetic data pairs')    
    
# Set the LLM to generate the QA pairs
llm = ChatSambaNovaCloud(
    model="Meta-Llama-3.1-405B-Instruct",
    temperature=0.7,
    max_tokens=1024,
    top_p=0.9,
    top_k=50
)

We will define a prompt instructing the model to generate QA pairs using the provided document and the specified number of QA pairs. The prompt will ask the model to generate a list of JSON objects containing the question, thought process, answer, and references.

In [10]:
prompt = ChatPromptTemplate([
        ("system", "You are a JSON generator who generates machine-readable JSON"),
        ("human", """
            Based on the following document, follow the instruction below
            Document:
            {document}
            Instruction:
            Generate {amount} of unique question, thought, answer, and references from the above document in the following JSON format. 
            The answers must avoid words that are not specific (e.g., "many", "several", "few", etc.). 
            The answers must contain specific, verbose, self-contained, grammatically correct sentences that answer the question comprehensively. 
            The answers must strictly contain content from the document and no content from outside the document. 
            There may be multiple references that contain verbatim text from the document to support the answers.             
            JSON format:
            [
                {{
                    "question": "<generated question>",            
                    "thought": "<generated thought on what is needed to answer the question. Start with 'To answer the question, I need'>",
                    "answer": "<generated answer>",
                    "references": [
                        "<verbatim text from document that supports the answer>",
                        "<verbatim text from document that supports the answer>"
                    ]
                }}
            ]
            The first character of the response must be '[' and the last character must be ']'. No header text should be included.
            """
        )
    ]
)
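Note the doubled braces in the JSON example inside the prompt. Assuming the template uses LangChain's default f-string format, `{document}` and `{amount}` are placeholders, while `{{` and `}}` are escapes that render as literal braces. Python's built-in `str.format` follows the same escaping rule, so the effect can be illustrated without LangChain (the shortened template below is hypothetical):

```python
# "{amount}" is a placeholder; "{{" and "}}" render as literal "{" and "}".
template = 'Generate {amount} items as JSON: [{{"question": "<generated question>"}}]'
rendered = template.format(amount=2)
print(rendered)
```

Without the doubling, the braces of the JSON example would be read as placeholders and formatting would fail.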


With the prompt defined, we can write a function that builds a LangChain chain, passes in the input arguments (the context document and the number of QA pairs to generate), and parses the model's JSON response into a list of QA dictionaries.

In [11]:
def generate_qa_pairs(context, amount, include_context=True, include_thoughts=True, include_references=True):
    synthetic_datum_parser = JsonOutputParser(pydantic_object=SyntheticData)
    qa_generate_chain = prompt | llm | synthetic_datum_parser
    qa_pairs = []
    generation = qa_generate_chain.invoke({'document': context, 'amount': amount})
    for datum in generation:
        qa_pair = {
            'question': datum['question'],
            'context': context if include_context else None,
            'answer': datum['answer'],
            'thought': datum['thought'] if include_thoughts else None,
            'references': datum['references'] if include_references else None,
        }
        qa_pair = {k: v for k, v in qa_pair.items() if v is not None}
        qa_pairs.append(qa_pair)
    return qa_pairs

Here is an example where we create a series of synthetic data pairs, including the original context (useful for training models for Retrieval-Augmented Generation (RAG) applications).

In [12]:
sample_doc="""Elephants are the largest living land animals. 
Three living species are currently recognised:
the African bush elephant (Loxodonta africana),
the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). 
They are the only surviving members of the family Elephantidae and the order Proboscidea;
extinct relatives include mammoths and mastodons."""

generate_qa_pairs(sample_doc, 5)

[{'question': 'What are the largest living land animals?',
  'context': 'Elephants are the largest living land animals. \nThree living species are currently recognised:\nthe African bush elephant (Loxodonta africana),\nthe African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). \nThey are the only surviving members of the family Elephantidae and the order Proboscidea;\nextinct relatives include mammoths and mastodons.',
  'answer': 'Elephants are the largest living land animals.',
  'thought': 'To answer the question, I need to identify the specific animal mentioned in the document as the largest living land animal.',
  'references': ['Elephants are the largest living land animals.']},
 {'question': 'How many living species of elephants are currently recognised?',
  'context': 'Elephants are the largest living land animals. \nThree living species are currently recognised:\nthe African bush elephant (Loxodonta africana),\nthe African forest elephant (L. cyclotis
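The prompt asks for references that are verbatim text from the document, but a model may not always comply. One possible safeguard, sketched below, is to keep only the pairs whose references all appear in the context. The helpers `is_grounded` and `filter_grounded` are hypothetical names, not part of the original notebook, and the whitespace normalization is an assumption that tolerates references spanning line breaks.

```python
def is_grounded(qa_pair, context):
    """Return True when the pair has references and each one appears verbatim in the context."""
    references = qa_pair.get("references") or []
    # collapse whitespace so that line breaks in the context do not cause false negatives
    normalized_context = " ".join(context.split())
    return bool(references) and all(
        " ".join(ref.split()) in normalized_context for ref in references
    )


def filter_grounded(qa_pairs, context):
    """Keep only the QA pairs whose references can be found in the context."""
    return [pair for pair in qa_pairs if is_grounded(pair, context)]
```

It could be applied to the output of `generate_qa_pairs`, for example `filter_grounded(generate_qa_pairs(sample_doc, 5), sample_doc)`. Pairs generated with `include_references=False` carry no references and would all be dropped by this check.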

## Generate full dataset

We will define a simple function that converts each QA pair dictionary into a JSON string with `prompt` and `completion` fields, the format expected by the fine-tuning process. Then, we will iterate over each chunk of our source data and generate QA pairs for it.

In [13]:
def qa_pairs_to_prompt_completion(qa_pairs):
    """Convert the QA pairs into a format suitable for fine-tuning."""
    if isinstance(qa_pairs, dict):
        qa_pairs = [qa_pairs]
    lines = []
    for pair in qa_pairs:
        line = {'prompt': f'You are a helpful assistant for question-answering tasks.{pair["question"]}', 'completion': ''}
        if pair.get('context'):
            line['prompt'] += f'\nContext: {pair["context"]}\n'
        if pair.get('thought'):
            line['completion'] += f'Thought: {pair["thought"]}\n'
        line['completion'] += f'Answer: {pair["answer"]}\n'
        if pair.get('references'):
            line['completion'] += f'References: {pair["references"]}\n'

        lines.append(json.dumps(line))
    return lines

In [15]:
lines = []
for document in documents:
    try: 
        qa_pairs = generate_qa_pairs(
            context=document.page_content,
            amount=5,
            include_context=True,
            include_thoughts=True,
            include_references=True,
        )
        lines.extend(qa_pairs_to_prompt_completion(qa_pairs))
    except Exception as e:
        print(f"Error generating Q&A pairs for document: {document.page_content}")
        print(e)
lines

['{"prompt": "You are a helpful assistant for question-answering tasks.What are the performance and efficiency challenges associated with traditional instruction set architectures?\\nContext: Break through the limits of your GPU\\n\\nWHITEPAPER\\n\\nAccelerated Computing with a Reconfigurable Dataflow Architecture\\n\\nTrends Driving New Processing Architectures\\n\\nWith the rapid expansion of applications that can be characterized by dataflow processing, such as natural-language processing and recommendation engines, the performance and efficiency challenges of traditional, instruction set architectures have become apparent. To address this and enable the next generation of scientific and machine-learning applications, SambaNova Systems has developed the Reconfigurable Dataflow ArchitectureTM, a unique vertically integrated platform that is optimized from algorithm to silicon. Three key long-term trends infused SambaNova\\u2019s effort to develop this new accelerated computing archit
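Each entry in `lines` is already a JSON string, so the dataset can be persisted as a JSON Lines file with one record per line. The sketch below shows one way to do this; `save_jsonl`, `load_jsonl`, and the output path in the usage note are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def save_jsonl(lines, path):
    """Write pre-serialized JSON strings to a file, one record per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for line in lines:
            # json.dumps escapes newlines, so each record stays on a single line
            f.write(line + "\n")
    return path


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(row) for row in f if row.strip()]
```

For example, `save_jsonl(lines, "./data/synthetic_qa.jsonl")` would write the generated dataset, and `load_jsonl` can be used to spot-check a few records before fine-tuning.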