# Create a Synthetic Test Set with Ragas

This notebook demonstrates how to generate synthetic question/ground-truth pairs using Ragas. Synthetic data generation is useful for testing and evaluating retrieval systems.

We leverage the LangChain document loader and integration with OpenAI to generate realistic question/answer sets based on provided documents. 

This example uses a simple text document that was generated with ChatGPT.

Benefits of Synthetic Data Generation:
- Improve model training
- Test retrieval systems in varied contexts
- Generate diverse question-answer pairs efficiently

Install required dependencies

In [14]:
!pip install -qU langchain-core langchain-community langchain-openai
!pip install -qU ragas

Input the OpenAI API key

In [15]:
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OpenAI API Key:")

Pull the .txt document and load it using the LangChain TextLoader

Note: other document formats are supported

In [16]:
from langchain_community.document_loaders import TextLoader

file_path = "./ReasonsForSDG.txt"

loader = TextLoader(file_path)
documents = loader.load()
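Since other document formats are supported, one way to handle several of them is to pick the loader from the file extension. The sketch below is illustrative only: `LOADER_BY_SUFFIX`, `loader_name_for`, and `load_documents` are hypothetical helpers, and the mapping assumes the `TextLoader`, `CSVLoader`, and `PyPDFLoader` classes from `langchain_community.document_loaders` (the PDF loader also needs the `pypdf` package installed).

```python
from pathlib import Path

# Hypothetical mapping from file extension to a LangChain loader class name.
LOADER_BY_SUFFIX = {
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
    ".pdf": "PyPDFLoader",
}


def loader_name_for(file_path):
    """Return the loader class name configured for this file's extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader configured for {suffix!r}")
    return LOADER_BY_SUFFIX[suffix]


def load_documents(file_path):
    """Load a file with the loader chosen by loader_name_for."""
    # Imported lazily so the mapping can be inspected without LangChain installed.
    import langchain_community.document_loaders as loaders

    loader_cls = getattr(loaders, loader_name_for(file_path))
    return loader_cls(file_path).load()


print(loader_name_for("./ReasonsForSDG.txt"))  # TextLoader
```

With this in place, `load_documents("./ReasonsForSDG.txt")` would be equivalent to the `TextLoader` cell above.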

Set up the generator:
- generator model: creates the questions
- critic model: reviews each question and either accepts it or suggests changes
- embeddings: the standard OpenAI embeddings
- generator: built from the two models and the embeddings
- distribution of question types: 50% simple, 40% multi-context, 10% reasoning

In [17]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
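Before generating, it can help to confirm that the weights add up to 1 and to see roughly how many questions of each type to expect. The following is a minimal sketch using plain string keys as stand-ins for the Ragas evolution objects; `check_distribution` is a hypothetical helper, and the counts are only an estimate of what the generator may produce.

```python
import math

# Stand-in keys: in the notebook the keys are the Ragas evolution objects.
distribution_weights = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}


def check_distribution(weights, test_size):
    """Validate that the weights sum to 1 and estimate per-type question counts."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"Weights must sum to 1.0, got {total}")
    return {name: round(weight * test_size) for name, weight in weights.items()}


expected_counts = check_distribution(distribution_weights, 10)
print(expected_counts)  # {'simple': 5, 'multi_context': 4, 'reasoning': 1}
```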

Generate the questions using Ragas.

Note: set max_retries and max_wait in the RunConfig so that failed requests are retried and all of the questions can complete successfully.

In [19]:
from ragas.run_config import RunConfig

testset = generator.generate_with_langchain_docs(
    documents, 10, distributions,  # 10 = number of questions to generate
    run_config=RunConfig(max_retries=50, max_wait=100),
)
# testset.to_pandas()

Filename and doc_id are the same for all nodes.               
Generating: 100%|██████████| 10/10 [00:13<00:00,  1.33s/it]
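To give a feel for what retry settings such as max_retries and max_wait typically control, here is a generic sketch of capped exponential backoff. It is not Ragas's implementation: `call_with_retries`, `base_delay`, and `flaky_call` are hypothetical names used only for illustration.

```python
import time


def call_with_retries(func, max_retries=5, max_wait=8.0, base_delay=1.0, sleep=time.sleep):
    """Call func until it succeeds, attempting at most max_retries times in total."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            # Double the delay after each failure, but never wait longer than max_wait.
            sleep(min(max_wait, base_delay * 2 ** (attempt - 1)))


# Demo: a hypothetical call that fails twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


recorded_delays = []  # capture the delays instead of actually sleeping
result = call_with_retries(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)  # ok [1.0, 2.0]
```

A higher retry limit gives transient failures, such as rate limits, more chances to clear, while the wait cap keeps any single pause bounded.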


Display each generated row with the following fields:
- question
- contexts
- ground truth
- evolution type
- metadata

In [13]:
for data_row in testset.test_data:
    question = data_row.question
    contexts = data_row.contexts
    ground_truth = data_row.ground_truth
    evolution_type = data_row.evolution_type
    metadata = data_row.metadata
    
    # Process each element as needed
    print(f"Question: {question}")
    print(f"Contexts: {contexts}")
    print(f"Ground Truth: {ground_truth}")
    print(f"Evolution Type: {evolution_type}")
    print(f"Metadata: {metadata}")
    print("\n")  # For better readability

Question: What are the primary reasons for utilizing synthetic data generation in various industries and applications?
Contexts: ['Reasons for Synthetic Data Generation\nSynthetic data generation is the process of artificially creating data rather than collecting it from real-world events. This technique has gained traction across various industries and applications due to several key benefits. Below are the primary reasons for generating synthetic data:\n\n1. Privacy and Security Concerns\nData Anonymization: In industries like healthcare and finance, handling sensitive personal information is subject to strict privacy regulations (e.g., GDPR, HIPAA). Synthetic data eliminates concerns around exposing personal information by generating datasets that mirror the statistical properties of real data without revealing sensitive details.\nRisk Mitigation: It allows companies to share and use data across departments, teams, or partners without the risk of leaking confidential information.\n2