### Generating a Synthetic Test Set for RAG-Based Question Answering with Ragas  

In this tutorial, we will explore the **test set generation module in Ragas** to create a **synthetic test set** for a **Retrieval-Augmented Generation (RAG)-based question-answering bot**. Our hypothetical use case is a **Ragas Airline assistant**, designed to answer customer queries on a range of topics, including:  

- Flight booking  
- Flight changes and cancellations  
- Baggage policies  
- Viewing reservations  
- Flight delays  
- In-flight services  
- Special assistance  

To make our synthetic dataset as **realistic and diverse** as possible, we will create **several customer personas**, each representing a different type of traveler and a different pattern of behavior. This approach helps build a **comprehensive and representative test set**, allowing us to evaluate the robustness of our RAG system effectively.  

Let's dive in!

### Understanding Data Structure for Effective Test Set Generation in Ragas  

Before generating a synthetic test set in Ragas, it's essential to have a clear understanding of your data's structure. This knowledge enables you to determine the necessary transformations for your knowledge graph, allowing better control over the test set generation process.

Ragas operates by converting your data into a knowledge graph, enriching it through various transformations, and then generating diverse and representative questions based on this structured information. By leveraging these transformations, you can ensure that your test set aligns with the real-world scenarios your Retrieval-Augmented Generation (RAG) system will encounter.

For a deeper dive into the mechanics of synthetic test set generation in Ragas, refer to the [documentation](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/).

In this example, the data is stored as highly **structured Markdown**, where **procedures and steps are numbered** and **arranged into well-defined sections and subsections**.  

This structure points to the **transformations** worth applying: extract the headlines, split each document along them, and then link related chunks. Structured data of this kind makes it easier to **control the test set's direction and relevance**.  
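The idea of splitting structured Markdown along its headings and keeping the pieces as linked nodes can be sketched in plain Python. This is only an illustration of the concept, not how Ragas implements it: `ToyNode`, `split_by_headings`, and the sample text are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a knowledge-graph node (not the Ragas class)."""
    kind: str
    properties: dict = field(default_factory=dict)


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, one per Markdown heading."""
    sections = []
    heading, body = None, []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            # A new heading closes the section collected so far.
            if heading is not None:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        sections.append((heading, "\n".join(body).strip()))
    return sections


# Made-up sample in the style described above: numbered steps under headings.
sample = """# Baggage Policies
## Carry-on
1. One bag per passenger.
## Checked baggage
1. Weigh the bag at the counter.
2. Pay any excess fee.
"""

document = ToyNode("document", {"page_content": sample})
chunks = [
    ToyNode("chunk", {"headline": h, "page_content": b})
    for h, b in split_by_headings(sample)
]
# Each chunk stays linked to the document it came from.
relationships = [(document, chunk, "child") for chunk in chunks]

for chunk in chunks:
    print(chunk.properties["headline"], "->", repr(chunk.properties["page_content"]))
```

Because every chunk keeps its headline and a link to its parent document, questions generated later can be traced back to a specific section.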

In [1]:
from langchain_community.document_loaders import DirectoryLoader

path = "dummy_data"
loader = DirectoryLoader(path, glob="**/*.md")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
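The `glob="**/*.md"` pattern passed to the loader above is a recursive pattern. As a rough illustration of what such a pattern selects, the sketch below applies the same pattern with `pathlib` to a temporary directory. The file names are hypothetical, and the sketch assumes the loader's matching behaves like `Path.glob`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "policies").mkdir()
    # Hypothetical files: two Markdown files at different depths, one text file.
    (root / "baggage.md").write_text("# Baggage\n")
    (root / "policies" / "delays.md").write_text("# Delays\n")
    (root / "notes.txt").write_text("not markdown\n")

    # "**" matches the top-level directory and every subdirectory.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))

print(matched)  # ['baggage.md', 'policies/delays.md']
```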


In [2]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [3]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
    
kg

KnowledgeGraph(nodes: 8, relationships: 0)

In [4]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embedding = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [5]:
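from ragas.testset.transforms import apply_transforms
from ragas.testset.transforms import HeadlinesExtractor, HeadlineSplitter, EmbeddingExtractor

# Use the LLM to extract up to 20 headlines per document.
headline_extractor = HeadlinesExtractor(llm=generator_llm, max_num=20)
# Compute an embedding for each node (used later to link similar nodes).
embedding_extractor = EmbeddingExtractor()
# Split documents along the extracted headlines, bounded by min_tokens and max_tokens.
headline_splitter = HeadlineSplitter(min_tokens=300, max_tokens=1000)

# Transforms are applied in the order listed.
transforms = [
    headline_extractor,
    embedding_extractor,
    headline_splitter,
]

apply_transforms(kg, transforms=transforms)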

Applying HeadlinesExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying EmbeddingExtractor:   0%|          | 0/8 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/8 [00:00<?, ?it/s]

In [6]:
kg

KnowledgeGraph(nodes: 25, relationships: 26)

In [7]:
kg.nodes

[Node(id: 447c81, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: be5162, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 3272d9, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 002062, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 494ba4, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 179054, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: c233a2, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: b79240, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: d4b0c0, type: NodeType

In [8]:
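# The splitter created new chunk nodes after the embedding step had already run,
# so compute embeddings for every non-document node as well.
for node in kg.nodes:
    if node.type != NodeType.DOCUMENT:
        property_name, embedding = await embedding_extractor.extract(node)
        node.properties[property_name] = embedding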

In [9]:
kg.nodes

[Node(id: 447c81, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: be5162, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 3272d9, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 002062, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 494ba4, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: 179054, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: c233a2, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: b79240, type: NodeType.DOCUMENT, properties: ['page_content', 'document_metadata', 'headlines', 'embedding']),
 Node(id: d4b0c0, type: NodeType

In [10]:
from ragas.testset.transforms import CosineSimilarityBuilder

cosine_similarity_builder = CosineSimilarityBuilder()

transforms = [
    cosine_similarity_builder,
]

apply_transforms(kg, transforms=transforms)

Applying CosineSimilarityBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

In [11]:
kg

KnowledgeGraph(nodes: 25, relationships: 92)
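The relationship count rises from 26 to 92 once the similarity step has run, because nodes with similar embeddings are now linked. The sketch below shows the underlying idea in plain Python. The three-dimensional vectors, the node names, and the 0.9 threshold are assumptions made up for illustration and are not taken from Ragas.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
embeddings = {
    "baggage_fees": [0.9, 0.1, 0.0],
    "baggage_limits": [0.8, 0.2, 0.1],
    "flight_delays": [0.0, 0.1, 0.9],
}
THRESHOLD = 0.9  # assumed cut-off for this sketch

# Compare every pair once and keep the pairs that are similar enough.
edges = []
for a, b in combinations(embeddings, 2):
    score = cosine_similarity(embeddings[a], embeddings[b])
    if score >= THRESHOLD:
        edges.append((a, b, round(score, 3)))

print(edges)
```

Only the two baggage nodes end up connected here; the delay node points in a different direction in embedding space and stays unlinked.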

In [12]:
kg.save("transformed_kg.json")

In [13]:
from ragas.testset.synthesizers import SingleHopSpecificQuerySynthesizer

# Each tuple pairs a synthesizer with the share of queries it should produce.
# Both entries use the same single-hop synthesizer.
query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.3),
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.7),
]
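To see how weights such as 0.3 and 0.7 could translate into sample counts for a given test set size, here is a small sketch using largest-remainder allocation. `split_counts` is a hypothetical helper, and this is not necessarily the allocation logic Ragas uses internally.

```python
def split_counts(weights, total):
    """Allocate `total` items across `weights` using largest remainders."""
    weight_sum = sum(weights)
    raw = [total * w / weight_sum for w in weights]
    counts = [int(r) for r in raw]
    # Hand out any items lost to rounding, largest fractional part first.
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True
    )
    for i in by_remainder[: total - sum(counts)]:
        counts[i] += 1
    return counts


print(split_counts([0.3, 0.7], 10))  # [3, 7]
```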

In [14]:
from ragas.testset.persona import Persona

persona_first_time_flier = Persona(
    name="First Time Flier",
    role_description="Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.",
)

persona_frequent_flier = Persona(
    name="Frequent Flier",
    role_description="Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.",
)

persona_angry_business_flier = Persona(
    name="Angry Business Class Flier",
    role_description="Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.",
)

personas = [persona_first_time_flier, persona_frequent_flier, persona_angry_business_flier]
personas

[Persona(name='First Time Flier', role_description='Is flying for the first time and may feel anxious. Needs clear guidance on flight procedures, safety protocols, and what to expect throughout the journey.'),
 Persona(name='Frequent Flier', role_description='Travels regularly and values efficiency and comfort. Interested in loyalty programs, express services, and a seamless travel experience.'),
 Persona(name='Angry Business Class Flier', role_description='Demands top-tier service and is easily irritated by any delays or issues. Expects immediate resolutions and is quick to express frustration if standards are not met.')]

In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=generator_embedding,
    knowledge_graph=kg,
    persona_list=personas,
)

# Generate 10 samples using the query distribution defined above.
testset = generator.generate(
    testset_size=10,
    query_distribution=query_distribution,
    with_debugging_logs=False,
)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples: 0it [00:00, ?it/s]

In [16]:
testset

Testset(samples=[])
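The empty `Testset(samples=[])` above means no samples were produced. One thing worth checking, offered here as an assumption rather than a confirmed diagnosis, is whether the nodes actually carry the property that the synthesizer looks for when it builds scenarios. The helper below is a hypothetical sketch that tallies property names per node type. It only relies on each node having `type` and `properties` attributes, so it is shown with stand-in objects.

```python
from collections import Counter
from types import SimpleNamespace


def property_coverage(nodes):
    """Count how many nodes of each type carry each property name."""
    coverage = Counter()
    for node in nodes:
        for name in node.properties:
            coverage[(str(node.type), name)] += 1
    return coverage


# Hypothetical stand-ins; with the real graph, pass `kg.nodes` instead.
toy_nodes = [
    SimpleNamespace(type="DOCUMENT", properties={"page_content": "...", "headlines": []}),
    SimpleNamespace(type="CHUNK", properties={"page_content": "..."}),
    SimpleNamespace(type="CHUNK", properties={"page_content": "...", "embedding": [0.1]}),
]

for (node_type, name), count in sorted(property_coverage(toy_nodes).items()):
    print(f"{node_type:<10} {name:<15} {count}")
```

If the property a synthesizer depends on never appears for the node type it draws from, adding the matching extractor to the transforms, or pointing the synthesizer at a property that does exist, would be the next thing to try.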