# Generate synthetic test dataset (with RAGAS)

- Author: [Yoonji](https://github.com/samdaseuss)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

### Welcome Back!

Hi everyone! Welcome to our first lecture in the evaluation section.  
We're going to try something special today!  
While we've been building RAG systems, we haven't really talked about how to test if they're working well.  
To properly evaluate a RAG system, we need good test data—and that's exactly what we'll be creating in this tutorial!  
We'll learn how to build datasets that will help us measure our RAG pipeline's performance.  


### Today, what we are going to learn...
In this session, we'll focus on using RAGAS to create evaluation datasets for RAG systems. Our main tasks will include:
- Preprocessing documents for evaluation.
- Defining evaluation objects.
- Defining Knowledge Graphs, creating Nodes, and establishing relationships between nodes.
- Understanding the concept of Extractors.
- Configuring data distributions to generate various types of test questions.

We'll explore these concepts through hands-on practice, giving you a practical foundation for building evaluation datasets.

### Why this matters...
The goal is to craft datasets that objectively assess the performance of your RAG system. A well-designed test can highlight how your system handles diverse questions and scenarios, revealing both strengths and areas needing improvement.

By the end of this tutorial, you'll have the skills to build robust datasets for comprehensive evaluation. 

Without further ado, let's get started!

### Table of Contents
- 🌟 **[Overview](#overview)**  
- 🛠️ **[Environment Setup](#environment-setup)**  
- 🔙 **[Looking Back at What We've Learned](#looking-back-at-what-weve-learned)**  
- 📥 **[Installation](#installation)**  
- ❓ **[What is RAGAS?](#what-is-ragas)**  
- 🐍 **[RAGAS in Python](#ragas-in-python)**  
- 📄 **[Document Preprocessing](#document-preprocessing)**  
- 🧩 **[Dataset Generation](#dataset-generation)**  
- 📊 **[Distribution of Question Types](#distribution-of-question-types)**  
- 🚀 **[Summary: Moving Forward with Generated and Prepared Datasets](#summary-moving-forward-with-generated-and-prepared-datasets)**  
- 🎉 **[Bonus: Refactoring Section](#bonus-refactoring-section)** 

### References

- [Testset Generation for RAG](https://docs.ragas.io/en/stable/getstarted/rag_testset_generation/)
- [Testset Generation for RAG : 📚 Core Concepts > Test Data Generation > RAG](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Generate synthetic test dataset (with RAGAS)",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Looking Back at What We've Learned

### We Have Learned About RAG

LLMs are powerful, but because their knowledge is fixed at training time, they cannot reflect real-time information.

For example, let's say NASA discovered a new planet yesterday, making the total number of planets in the solar system nine. What would happen if we asked an LLM about the number of planets in the solar system? Because an LLM responds based on its training data, it would still say there are eight planets. We call this phenomenon **hallucination** (here, a confident answer that is out of date), and without access to external knowledge the only fix is to wait for a newer **model version**.

RAG emerged to overcome these limitations. Instead of immediately responding to user questions, the RAG pipeline first searches for the latest information from external knowledge repositories and then generates responses based on this information. This enables the system to provide answers that reflect the most **up-to-date** information.
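
As a rough illustration of the retrieve-then-generate flow described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the three-entry `KNOWLEDGE_BASE`, the word-overlap `retrieve` function, and the `build_prompt` helper are hypothetical stand-ins, not how a production retriever or any particular library works.

```python
import re
from typing import List, Set

# Hypothetical in-memory knowledge repository (contents invented for illustration).
KNOWLEDGE_BASE = [
    "Textbook (2006): The solar system has eight planets.",
    "News (yesterday): NASA confirmed a ninth planet in the solar system.",
    "Cooking blog: Pasta should be boiled in salted water.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Place the retrieved contexts in front of the question for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


question = "How many planets are in the solar system?"
contexts = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, contexts)
print(prompt)
```

In this toy setup the prompt carries the recent announcement, so a model could answer from it rather than from stale training data. The retriever also returns the old textbook entry, which is exactly the kind of behavior that evaluation is meant to surface.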

### Is Our RAG Design Effective?

You have learned various techniques for implementing RAG. Some of you may have already built your own RAG systems and applied them to your work.

However, we need to ask an important question: Is our RAG system truly a 'good' RAG? How can we judge the quality of RAG?

Simply saying "this RAG doesn't perform well" is not enough. We need to be able to measure and verify RAG's performance through objective evaluation metrics.

### Why Use Synthetic Test Dataset?

Evaluating the performance of RAG systems is a crucial process. However, manually creating hundreds of question-answer pairs requires enormous time and effort.

Moreover, manually written questions often remain at a simple and superficial level, making it difficult to thoroughly evaluate the performance of RAG systems.

By utilizing synthetic data to solve these problems, we can reduce developer time spent on building test datasets by up to 90%. Additionally, it enables more thorough performance evaluation by automatically generating test cases of various difficulty levels and types.

## Installation

To proceed with this tutorial, you need to install the `ragas` and `pdfplumber` packages. The command below installs both. Right after that, we'll explore **the concept of RAGAS** and then look at Python's **RAGAS package** in detail.

In [5]:
%pip install -qU ragas pdfplumber

Note: you may need to restart the kernel to use updated packages.


## What is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed to assess the performance of RAG systems. It helps developers and researchers measure how well their RAG implementations are working through various metrics and evaluation methods.

Let's revisit the example we saw earlier.

Let's say NASA discovered a new planet yesterday, making the total number of planets in our solar system nine. To evaluate the performance of a RAG system, let's ask the test question "How many planets are in our solar system?" RAGAS evaluates the system's response using these key metrics:

1. `Answer Relevancy`: Checks if the answer directly addresses the question about the number of planets
2. `Context Relevancy`: Checks if the system retrieved the recent NASA announcement instead of old astronomy textbooks
3. `Faithfulness`: Checks if the answer about nine planets is based on the NASA announcement and not on outdated data
4. `Context Precision`: Checks if the system used the NASA announcement efficiently without including unnecessary space information

For example, if the RAG system responds with **outdated information** saying there are eight planets, RAGAS will give it a low context relevancy score. Or if it makes claims about the new planet that aren't in the NASA announcement, it will receive a low faithfulness score.
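
To make the idea behind `Faithfulness` concrete, here is a toy sketch. It assumes the score is the fraction of claims in the answer that the retrieved context supports, and it replaces LLM-based claim extraction and verification with a hand-written claim list and a naive substring check. The function name `toy_faithfulness` and the sample strings are hypothetical; this is not the RAGAS implementation.

```python
from typing import List


def toy_faithfulness(claims: List[str], context: str) -> float:
    """Return the share of claims that appear verbatim in the context.

    Assumption: a claim counts as "supported" only if it is a
    case-insensitive substring of the context. A real evaluator would
    use an LLM to judge support instead of string matching.
    """
    if not claims:
        return 0.0
    context_lower = context.lower()
    supported = [claim for claim in claims if claim.lower() in context_lower]
    return len(supported) / len(claims)


# Invented example data for the nine-planets scenario.
context = "NASA announced a ninth planet yesterday. The solar system now has nine planets."
claims = [
    "the solar system now has nine planets",  # stated in the context
    "the new planet has three moons",         # not in the context
]

score = toy_faithfulness(claims, context)
print(score)
```

One of the two claims is backed by the context, so the unsupported statement about moons pulls the score down, mirroring the low faithfulness score described above.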

## RAGAS in Python
You can easily use `RAGAS` with Python libraries.

Ragas is a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications. It is designed to help you evaluate your LLM applications with ease and confidence.

## Document Processing

### Document
While the official RAGAS documentation demonstrates its tutorials with markdown files, in this tutorial we'll be working with a **PDF file**. Please use the files located in the **data folder**.

In [6]:
file_path = 'data/Newwhitepaper_Agents2.pdf'

### Document Preprocessing

In [7]:
from langchain_community.document_loaders import PDFPlumberLoader

# Create document loader
loader = PDFPlumberLoader(file_path)

# Load documents
docs = loader.load()

# Exclude table of contents and last page
docs = docs[3:-1]

# Get the number of document pages
len(docs)

38

Each document object includes a **metadata** dictionary that can store additional information about the document.

Please check that the metadata dictionary contains a key called **filename**.

This key is used in the **test dataset generation process**: the **filename** attribute identifies chunks that belong to the same document.
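
To see why a shared key matters, here is a small sketch that groups pages by `filename`. It uses plain dictionaries as stand-ins for LangChain `Document` objects, and the file names and page texts are invented for illustration.

```python
from collections import defaultdict

# Plain dicts standing in for Document objects (hypothetical sample data).
pages = [
    {"page_content": "page A1", "metadata": {"source": "data/a.pdf", "page": 0}},
    {"page_content": "page A2", "metadata": {"source": "data/a.pdf", "page": 1}},
    {"page_content": "page B1", "metadata": {"source": "data/b.pdf", "page": 0}},
]

# Copy 'source' into 'filename' when the key is missing.
for page in pages:
    page["metadata"].setdefault("filename", page["metadata"]["source"])

# Pages that share a filename are treated as parts of one document.
chunks_by_file = defaultdict(list)
for page in pages:
    chunks_by_file[page["metadata"]["filename"]].append(page["page_content"])

print(dict(chunks_by_file))
```

Without a common `filename`, there would be no simple way to tell that the first two pages come from the same source file.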

In [8]:
# Set metadata ('filename' must exist)
for doc in docs:
    doc.metadata["filename"] = doc.metadata["source"]

In [9]:
docs

[Document(metadata={'source': 'data/Newwhitepaper_Agents2.pdf', 'file_path': 'data/Newwhitepaper_Agents2.pdf', 'page': 3, 'total_pages': 42, 'CreationDate': "D:20241113100853-07'00'", 'Creator': 'Adobe InDesign 20.0 (Macintosh)', 'ModDate': "D:20241113100858-07'00'", 'Producer': 'Adobe PDF Library 17.0', 'Trapped': 'False', 'filename': 'data/Newwhitepaper_Agents2.pdf'}, page_content="Agents\nThis combination of reasoning,\nlogic, and access to external\ninformation that are all connected\nto a Generative AI model invokes\nthe concept of an agent.\nIntroduction\nHumans are fantastic at messy pattern recognition tasks. However, they often rely on tools\n- like books, Google Search, or a calculator - to supplement their prior knowledge before\narriving at a conclusion. Just like humans, Generative AI models can be trained to use tools\nto access real-time information or suggest a real-world action. For example, a model can\nleverage a database retrieval tool to access specific information

## Dataset Generation
We'll create datasets using ChatOpenAI. Before writing the code, let's define the roles of our objects:
- Dataset Generator: `generator_llm`
- Document Embeddings: `embeddings`

In [11]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType
from ragas.embeddings.base import embedding_factory

# Dataset Generator
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Document Embeddings
embeddings = embedding_factory()

First, let's build the knowledge graph. We wrap the custom LLM and embeddings for Ragas, then add each page as a `DOCUMENT` node.

In [12]:
# Wrap LangChain's ChatOpenAI model with LangchainLLMWrapper to make it compatible with Ragas
langchain_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Create ragas_embeddings
ragas_embeddings = LangchainEmbeddingsWrapper(embeddings)

# Create a KnowledgeGraph object
kg = KnowledgeGraph()

for doc in docs:
   kg.nodes.append(
       Node(
           type=NodeType.DOCUMENT,
           properties={
               "page_content": doc.page_content,
               "document_metadata": doc.metadata
           }
       )
   )

### Self Check

```python
print(len(kg.nodes))
```
Run this code to verify that knowledge graph nodes have been created. If no nodes were created, subsequent code may fail.

```python
for node in kg.nodes:
    print(node.properties)
```

In [13]:
# check relationships
print("Total number of nodes:", len(kg.nodes))
print("Total number of relationships:", len(kg.relationships))

Total number of nodes: 38
Total number of relationships: 0


Now we will establish relationships between nodes in the knowledge graph.

### Extractor
The information produced by extractors is used to establish relationships between nodes. Before generating relationships, we will look at three of the main extractors:
1. `KeyphrasesExtractor`
2. `SummaryExtractor`
3. `HeadlinesExtractor`

First, I will import all the necessary modules.

In [14]:
from ragas.testset.transforms.extractors import (
    KeyphrasesExtractor,
    SummaryExtractor,
    HeadlinesExtractor
)

from ragas.testset.transforms import (
    OverlapScoreBuilder
)

#### 1. Keyphrases Extractor

In [15]:
# [1] Initial version (before refactoring)
# Please run this first to see the original implementation.
keyphrase_extractor = KeyphrasesExtractor()
output = [await keyphrase_extractor.extract(node) for node in kg.nodes]
_ = [node.properties.update({key:val}) for (key,val), node in zip(output, kg.nodes)]
kg.nodes[0].properties

{'page_content': "Agents\nThis combination of reasoning,\nlogic, and access to external\ninformation that are all connected\nto a Generative AI model invokes\nthe concept of an agent.\nIntroduction\nHumans are fantastic at messy pattern recognition tasks. However, they often rely on tools\n- like books, Google Search, or a calculator - to supplement their prior knowledge before\narriving at a conclusion. Just like humans, Generative AI models can be trained to use tools\nto access real-time information or suggest a real-world action. For example, a model can\nleverage a database retrieval tool to access specific information, like a customer's purchase\nhistory, so it can generate tailored shopping recommendations. Alternatively, based on a\nuser's query, a model can make various API calls to send an email response to a colleague\nor complete a financial transaction on your behalf. To do so, the model must not only have\naccess to a set of external tools, it needs the ability to plan an

In [16]:
# [2] Optimized version (faster execution)
# Comment out the code above and run this version instead to compare execution times.
import asyncio
from multiprocessing.pool import ThreadPool

async def process_batch(batch):
    keyphrase_extractor = KeyphrasesExtractor()
    batch_output = await asyncio.gather(*[keyphrase_extractor.extract(node) for node in batch])
    return batch_output

def process_batch_in_thread(batch):
    return asyncio.run(process_batch(batch))

def process_with_thread_and_async(nodes, batch_size=5, num_threads=4):
    batches = [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
    
    with ThreadPool(processes=num_threads) as pool:
        all_outputs = pool.map(process_batch_in_thread, batches)
    
    outputs = []
    for batch_output in all_outputs:
        outputs.extend(batch_output)
    
    _ = [node.properties.update({key:val}) for (key,val), node in zip(outputs, nodes)]
    
    return nodes[0].properties

_ = process_with_thread_and_async(kg.nodes)
kg.nodes[0].properties

{'page_content': "Agents\nThis combination of reasoning,\nlogic, and access to external\ninformation that are all connected\nto a Generative AI model invokes\nthe concept of an agent.\nIntroduction\nHumans are fantastic at messy pattern recognition tasks. However, they often rely on tools\n- like books, Google Search, or a calculator - to supplement their prior knowledge before\narriving at a conclusion. Just like humans, Generative AI models can be trained to use tools\nto access real-time information or suggest a real-world action. For example, a model can\nleverage a database retrieval tool to access specific information, like a customer's purchase\nhistory, so it can generate tailored shopping recommendations. Alternatively, based on a\nuser's query, a model can make various API calls to send an email response to a colleague\nor complete a financial transaction on your behalf. To do so, the model must not only have\naccess to a set of external tools, it needs the ability to plan an


---

**[Note] Refactoring for Performance Improvement!**

In the bonus section of this tutorial, we optimized the code to significantly reduce execution time, from roughly **45-60 seconds** down to just **3-8 seconds**!  
If you're familiar with **parallel processing** and **asynchronous processing**, you can combine these techniques to further enhance performance.  
We used the `asyncio` module for asynchronous processing and a thread pool from the `multiprocessing` module (`multiprocessing.pool.ThreadPool`) for parallel processing.  
(**Tested on an M1 CPU with 4 performance cores and 4 efficiency cores.**)

Check out the details in the [Bonus Refactoring Section](#bonus-refactoring-section)!

---
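
The pattern used in the optimized cell can be tried in isolation without calling an LLM. The sketch below swaps the extractor for a hypothetical `fake_extract` coroutine that only sleeps, so the function names, batch size, and sleep time are all assumptions. Real speedups depend on network latency and API rate limits.

```python
import asyncio
import time
from multiprocessing.pool import ThreadPool


async def fake_extract(node: int):
    """Stand-in for an extractor call: wait briefly, then return (key, value)."""
    await asyncio.sleep(0.02)
    return "keyphrases", [f"phrase-{node}"]


async def extract_sequentially(nodes):
    """Await one node at a time, like a list comprehension with `await`."""
    return [await fake_extract(node) for node in nodes]


async def extract_batch(batch):
    """Start every call in the batch at once and wait for all of them."""
    return await asyncio.gather(*(fake_extract(node) for node in batch))


def run_async(async_fn, arg):
    """Run an async function on a fresh event loop inside a worker thread."""
    return asyncio.run(async_fn(arg))


def sequential(nodes):
    # A single worker thread keeps this safe to call from a notebook too.
    with ThreadPool(processes=1) as pool:
        return pool.apply(run_async, (extract_sequentially, nodes))


def concurrent(nodes, batch_size=5, num_threads=4):
    # Split into batches, run batches in threads, gather calls within a batch.
    batches = [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
    with ThreadPool(processes=num_threads) as pool:
        per_batch = pool.starmap(run_async, [(extract_batch, b) for b in batches])
    return [item for batch in per_batch for item in batch]


def timed(fn, *args):
    """Return the function result together with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


nodes = list(range(20))
sequential_result, sequential_time = timed(sequential, nodes)
concurrent_result, concurrent_time = timed(concurrent, nodes)
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

Both paths return the same results in the same order; only the waiting overlaps. Because the work is I/O-bound waiting rather than CPU work, threads plus `asyncio.gather` are enough here and separate processes are not needed.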

#### 2. Summary Extractor

In [17]:
summary_extractor = SummaryExtractor()
output = [await summary_extractor.extract(node) for node in kg.nodes]
_ = [node.properties.update({key:val}) for (key, val), node in zip(output, kg.nodes)]
kg.nodes[0].properties

{'page_content': "Agents\nThis combination of reasoning,\nlogic, and access to external\ninformation that are all connected\nto a Generative AI model invokes\nthe concept of an agent.\nIntroduction\nHumans are fantastic at messy pattern recognition tasks. However, they often rely on tools\n- like books, Google Search, or a calculator - to supplement their prior knowledge before\narriving at a conclusion. Just like humans, Generative AI models can be trained to use tools\nto access real-time information or suggest a real-world action. For example, a model can\nleverage a database retrieval tool to access specific information, like a customer's purchase\nhistory, so it can generate tailored shopping recommendations. Alternatively, based on a\nuser's query, a model can make various API calls to send an email response to a colleague\nor complete a financial transaction on your behalf. To do so, the model must not only have\naccess to a set of external tools, it needs the ability to plan an



---

[Note] **Refactoring for Performance Improvement!** 

You can refactor it the same way as the Keyphrases Extractor.

Check out the details in the [Bonus Refactoring Section](#bonus-refactoring-section)!

---

#### 3. Headlines Extractor

In [18]:
headline_extractor = HeadlinesExtractor()
output = [await headline_extractor.extract(node) for node in kg.nodes]
_ = [node.properties.update({key:val}) for (key,val), node in zip(output, kg.nodes)]
kg.nodes[0].properties

{'page_content': "Agents\nThis combination of reasoning,\nlogic, and access to external\ninformation that are all connected\nto a Generative AI model invokes\nthe concept of an agent.\nIntroduction\nHumans are fantastic at messy pattern recognition tasks. However, they often rely on tools\n- like books, Google Search, or a calculator - to supplement their prior knowledge before\narriving at a conclusion. Just like humans, Generative AI models can be trained to use tools\nto access real-time information or suggest a real-world action. For example, a model can\nleverage a database retrieval tool to access specific information, like a customer's purchase\nhistory, so it can generate tailored shopping recommendations. Alternatively, based on a\nuser's query, a model can make various API calls to send an email response to a colleague\nor complete a financial transaction on your behalf. To do so, the model must not only have\naccess to a set of external tools, it needs the ability to plan an


---

[Note] **Refactoring for Performance Improvement!**

You can refactor it the same way as the Keyphrases Extractor.

Check out the details in the [Bonus Refactoring Section](#bonus-refactoring-section)!

---

### Relationship builder
We will define relationships using the information extracted earlier.
For technical documents like this one, a relationship can be established between nodes based on the entities (here, keyphrases) they contain.

When two nodes share the same or similar entities, a relationship is created between them based on that similarity. Let's take a look at the keyphrases of the first five nodes.

In [19]:
print(kg.nodes[0].properties['keyphrases'])
print(kg.nodes[1].properties['keyphrases'])
print(kg.nodes[2].properties['keyphrases'])
print(kg.nodes[3].properties['keyphrases'])
print(kg.nodes[4].properties['keyphrases'])

['Generative AI model', 'reasoning', 'external information', 'tailored shopping recommendations', 'self-directed fashion']
['Generative AI agent', 'autonomous agents', 'cognitive architecture', 'goal achievement', 'decision making']
['agent architecture', 'language model', 'instruction based reasoning', 'cognitive architecture', 'specific tools']
['foundational models', 'tools', 'agents', 'orchestration layer', 'retrieval augmented generation']
['agents vs. models', 'knowledge is limited', 'multi turn inference', 'native cognitive architecture', 'planning execution adjustment']


Relationships can be formed using the builder.
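
To make "forming relationships from overlapping keyphrases" concrete, here is a standard-library sketch using the keyphrases of nodes 0 to 2 printed above. It is not how `OverlapScoreBuilder` is implemented: `difflib.SequenceMatcher` stands in for a fuzzy string matcher, and the `0.9` threshold and helper names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Keyphrases of the first three nodes, copied from the output above.
keyphrases = {
    0: ["Generative AI model", "reasoning", "external information",
        "tailored shopping recommendations", "self-directed fashion"],
    1: ["Generative AI agent", "autonomous agents", "cognitive architecture",
        "goal achievement", "decision making"],
    2: ["agent architecture", "language model", "instruction based reasoning",
        "cognitive architecture", "specific tools"],
}


def is_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two phrases as the same entity if their similarity ratio is high."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def overlapped_items(left, right, threshold: float = 0.9):
    """List every (left_phrase, right_phrase) pair that matches."""
    return [(a, b) for a in left for b in right if is_similar(a, b, threshold)]


# Compare every pair of nodes and keep the pairs that share a phrase.
relationships = []
for (i, phrases_i), (j, phrases_j) in combinations(keyphrases.items(), 2):
    items = overlapped_items(phrases_i, phrases_j)
    if items:
        relationships.append((i, j, items))

print(relationships)
```

Only nodes 1 and 2 are linked here, through the shared phrase `cognitive architecture`. Near misses such as `Generative AI model` and `Generative AI agent` stay below the assumed threshold, which shows how the threshold controls how dense the graph becomes.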

In [20]:
%pip install -qU rapidfuzz

Note: you may need to restart the kernel to use updated packages.


In [21]:
from ragas.testset.transforms import apply_transforms

relation_builder = OverlapScoreBuilder(
    property_name="keyphrases",
    new_property_name="overlap_score",
)

transforms = [
    keyphrase_extractor,
    relation_builder
]

apply_transforms(kg,transforms)

Applying KeyphrasesExtractor:   0%|          | 0/38 [00:00<?, ?it/s]Property 'keyphrases' already exists in node '7a5209'. Skipping!
Applying KeyphrasesExtractor:   3%|▎         | 1/38 [00:01<00:59,  1.60s/it]Property 'keyphrases' already exists in node '39e9bb'. Skipping!
Property 'keyphrases' already exists in node '3a4217'. Skipping!
Property 'keyphrases' already exists in node '3fe9c8'. Skipping!
Property 'keyphrases' already exists in node '4c2d0a'. Skipping!
Property 'keyphrases' already exists in node 'f48395'. Skipping!
Applying KeyphrasesExtractor:  16%|█▌        | 6/38 [00:01<00:06,  4.60it/s]Property 'keyphrases' already exists in node '939cf8'. Skipping!
Property 'keyphrases' already exists in node '399e66'. Skipping!
Property 'keyphrases' already exists in node 'e9062c'. Skipping!
Applying KeyphrasesExtractor:  24%|██▎       | 9/38 [00:02<00:05,  4.90it/s]Property 'keyphrases' already exists in node 'f2153b'. Skipping!
Property 'keyphrases' already exists in node '5601ec'.

In [22]:
from ragas.testset import TestsetGenerator
clusters = kg.find_indirect_clusters()
generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=ragas_embeddings,
    knowledge_graph=kg, # the graph with newly created relationships will be entered.
)

In [23]:
# check relationships
print("Total number of nodes:", len(kg.nodes))
print("Total number of relationships:", len(kg.relationships))

Total number of nodes: 38
Total number of relationships: 51


## Distribution of Question Types
Before we begin generating questions, let's first define the distribution (frequency) of questions by type. Using the **SingleHopSpecificQuerySynthesizer**, **MultiHopAbstractQuerySynthesizer**, **MultiHopSpecificQuerySynthesizer** and **MultiHopQuerySynthesizer**, we aim to create a test set with the following distribution of question types:

- `simple`: Basic questions (40%): **SingleHopSpecificQuerySynthesizer**
- `reasoning`: Questions requiring reasoning (20%): **MultiHopAbstractQuerySynthesizer**
- `multi_context`: Questions requiring consideration of multiple contexts (20%): **MultiHopSpecificQuerySynthesizer**
- `conditional`: Conditional questions (20%): **MultiHopQuerySynthesizer**
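
As a quick sanity check on what these percentages mean in practice, the sketch below turns a list of weights into per-type question counts. How Ragas rounds internally is not shown in this tutorial, so the largest-remainder rounding used here, along with the function name `split_counts`, is an assumption.

```python
from typing import List


def split_counts(testset_size: int, weights: List[float]) -> List[int]:
    """Split `testset_size` questions across types in proportion to `weights`.

    Assumption: leftovers from rounding down go to the types with the
    largest fractional parts (largest-remainder method).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    exact = [weight * testset_size for weight in weights]
    counts = [int(value) for value in exact]
    leftover = testset_size - sum(counts)

    # Indices ordered by how much each type lost when rounding down.
    by_fraction = sorted(
        range(len(weights)),
        key=lambda i: exact[i] - counts[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


question_types = ["simple", "reasoning", "multi_context", "conditional"]
counts = split_counts(10, [0.4, 0.2, 0.2, 0.2])
print(dict(zip(question_types, counts)))
```

With a test set of 10 questions this gives four `simple` questions and two of each remaining type, which is consistent with the 4 single-hop and 6 multi-hop rows in the generated dataset shown later.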

### Role of the synthesizers Module
The synthesizers module in Ragas is a core module responsible for Query Synthesis. It provides functionality to generate various types of questions based on documents stored in the Knowledge Graph. This module is used to automatically generate test sets for evaluating RAG (Retrieval-Augmented Generation) systems.

In [24]:
from ragas.testset.synthesizers.multi_hop import (
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)
from ragas.testset.synthesizers.multi_hop.base import (
    MultiHopQuerySynthesizer,
)
from ragas.testset.synthesizers.base import BaseSynthesizer

In [25]:
from dataclasses import dataclass
import typing as t
from ragas.testset.synthesizers.multi_hop.base import (
    MultiHopScenario,
)
from ragas.testset.synthesizers.prompts import (
    ThemesPersonasInput,
    ThemesPersonasMatchingPrompt,
)

@dataclass
class NewMultiHopQuery(MultiHopQuerySynthesizer):

    theme_persona_matching_prompt = ThemesPersonasMatchingPrompt()

    async def _generate_scenarios(
        self,
        n: int,
        knowledge_graph,
        persona_list,
        callbacks,
    ) -> t.List[MultiHopScenario]:

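        # query and get (node_a, rel, node_b) to create multi-hop queries
        # NOTE: this reads the notebook-level `kg` built above,
        # not the `knowledge_graph` argument
        results = kg.find_two_nodes_single_rel(
            relationship_condition=lambda rel: rel.type == "keyphrases_overlap"
        )

        # no matching relationships: nothing to build (also avoids dividing by zero)
        if not results:
            return []

        num_sample_per_triplet = max(1, n // len(results))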

        scenarios = []
        for triplet in results:
            if len(scenarios) < n:
                node_a, node_b = triplet[0], triplet[-1]
                overlapped_keywords = triplet[1].properties["overlapped_items"]
                if overlapped_keywords:

                    # match the keyword with a persona for query creation
                    themes = list(dict(overlapped_keywords).keys())
                    prompt_input = ThemesPersonasInput(
                        themes=themes, personas=persona_list
                    )
                    persona_concepts = (
                        await self.theme_persona_matching_prompt.generate(
                            data=prompt_input, llm=self.llm, callbacks=callbacks
                        )
                    )

                    overlapped_keywords = [list(item) for item in overlapped_keywords]

                    # prepare and sample possible combinations
                    base_scenarios = self.prepare_combinations(
                        [node_a, node_b],
                        overlapped_keywords,
                        personas=persona_list,
                        persona_item_mapping=persona_concepts.mapping,
                        property_name="keyphrases",
                    )

                    # get number of required samples from this triplet
                    base_scenarios = self.sample_diverse_combinations(
                        base_scenarios, num_sample_per_triplet
                    )

                    scenarios.extend(base_scenarios)

        return scenarios

In [26]:
query = NewMultiHopQuery(llm=generator_llm)
query

NewMultiHopQuery(name='NewMultiHopQuery', llm=LangchainLLMWrapper(langchain_llm=ChatOpenAI(...)), generate_query_reference_prompt=QueryAnswerGenerationPrompt(instruction=Generate a multi-hop query and answer based on the specified conditions (persona, themes, style, length) and the provided context. The themes represent a set of phrases either extracted or generated from the context, which highlight the suitability of the selected context for multi-hop query creation. Ensure the query explicitly incorporates these themes.### Instructions:
1. **Generate a Multi-Hop Query**: Use the provided context segments and themes to form a query that requires combining information from multiple segments (e.g., `<1-hop>` and `<2-hop>`). Ensure the query explicitly incorporates one or more themes and reflects their relevance to the context.
2. **Generate an Answer**: Use only the content from the provided context to create a detailed and faithful answer to the query. Avoid adding information that is 

### Implementation of Custom Distribution
I've revamped the distribution setup to make it more flexible. Now it features four query types: simple, reasoning, multi_context, and conditional. Users can freely adjust the frequency of each type according to their needs.

In [27]:
import typing as t
from ragas.llms import BaseRagasLLM

QueryDistribution = t.List[t.Tuple[BaseSynthesizer, float]]

Due to insufficient cluster size, we were unable to use the built-in multi-hop synthesizers such as `MultiHopAbstractQuerySynthesizer(llm=llm)`. We therefore keep `SingleHopSpecificQuerySynthesizer` for the `simple` type and use `NewMultiHopQuery` for the remaining three types.

In [28]:
simple_synthesizer = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
reasoning_synthesizer = NewMultiHopQuery(llm=generator_llm)
multi_context_synthesizer = NewMultiHopQuery(llm=generator_llm)
conditional_synthesizer = NewMultiHopQuery(llm=generator_llm)

In [29]:
def custom_query_distribution(
   llm: BaseRagasLLM,
   distributions: t.List[float],
   kg: t.Optional[KnowledgeGraph] = None
) -> QueryDistribution:
   # `llm` is unused here: the synthesizers created above already hold the LLM.
   default_queries = [
       simple_synthesizer,
       reasoning_synthesizer,
       multi_context_synthesizer,
       conditional_synthesizer
   ]

   # Pair each synthesizer with its weight before filtering,
   # so a dropped synthesizer takes its own weight with it.
   pairs = list(zip(default_queries, distributions))

   if kg is not None:
       pairs = [(q, w) for q, w in pairs if q.get_node_clusters(kg)]

   return pairs
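
When a knowledge graph is passed and some synthesizers are filtered out, the weights that remain need not sum to 1. If the downstream generator expects normalized weights (an assumption here, not something this tutorial verifies), one option is to rescale them. The sketch below uses strings as stand-ins for synthesizer objects, and the `available` set and `renormalize` helper are hypothetical.

```python
from typing import List, Tuple


def renormalize(pairs: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rescale the weights so that they sum to 1, keeping their ratios."""
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    return [(name, weight / total) for name, weight in pairs]


# Strings stand in for synthesizer instances (illustration only).
pairs = [
    ("simple", 0.4),
    ("reasoning", 0.2),
    ("multi_context", 0.2),
    ("conditional", 0.2),
]

# Pretend only these two synthesizers found usable node clusters.
available = {"simple", "conditional"}
kept = [(name, weight) for name, weight in pairs if name in available]

normalized = renormalize(kept)
print(normalized)
```

The ratio between the surviving types is preserved (`simple` stays twice as likely as `conditional`), while the total goes back to 1.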

In [30]:
distributions = [0.4, 0.2, 0.2, 0.2]

query_distribution = custom_query_distribution(generator_llm, distributions)
query_distribution

[(SingleHopSpecificQuerySynthesizer(name='single_hop_specifc_query_synthesizer', llm=LangchainLLMWrapper(langchain_llm=ChatOpenAI(...)), generate_query_reference_prompt=QueryAnswerGenerationPrompt(instruction=Generate a single-hop query and answer based on the specified conditions (persona, term, style, length) and the provided context. Ensure the answer is entirely faithful to the context, using only the information directly from the provided context.### Instructions:
  1. **Generate a Query**: Based on the context, persona, term, style, and length, create a question that aligns with the persona's perspective and incorporates the term.
  2. **Generate an Answer**: Using only the content from the provided context, construct a detailed answer to the query. Do not add any information not included in or inferable from the context.
  , examples=[(QueryCondition(persona=Persona(name='Software Engineer', role_description='Focuses on coding best practices and system design.'), term='microserv

In [32]:
dataset = generator.generate_with_langchain_docs(
   documents=docs, # document data
   testset_size=10, # number of questions to generate
   query_distribution=query_distribution, # distribution by question type 
   with_debugging_logs=True # output debugging logs
)

Applying SummaryExtractor:   0%|          | 0/36 [00:00<?, ?it/s]

Applying CustomNodeFilter:  24%|██▎       | 9/38 [00:01<00:02,  9.70it/s] Node c2855407-9922-456c-b369-34965aebdcaf does not have a summary. Skipping filtering.
Applying CustomNodeFilter:  50%|█████     | 19/38 [00:02<00:01, 10.46it/s]Node a6843a7c-c3ef-45a3-a2f7-cad5f845d7fa does not have a summary. Skipping filtering.
Generating personas: 100%|██████████| 3/3 [00:02<00:00,  1.13it/s]                                             
Generating Scenarios: 100%|██████████| 4/4 [00:06<00:00,  1.66s/it]
Generating Samples: 100%|██████████| 10/10 [00:07<00:00,  1.29it/s]


In [33]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What Generative AI do? | [Agents\nThis combination of reasoning,\nlogic... | Generative AI models can be trained to use too... | single_hop_specifc_query_synthesizer |
| 1 | What is the significance of September 2024 in ... | [Agents\nWhat is an agent?\nIn its most fundam... | The context does not provide specific informat... | single_hop_specifc_query_synthesizer |
| 2 | How does the ReAct framework integrate with la... | [Agents\nFigure 1. General agent architecture ... | In the scope of an agent, a model refers to th... | single_hop_specifc_query_synthesizer |
| 3 | What DELETE do in tools for agents, how it hel... | [Agents\nThe tools\nFoundational models, despi... | DELETE is a common web API method that tools c... | single_hop_specifc_query_synthesizer |
| 4 | How does a Generative AI model utilize reasoni... | [<1-hop>\n\nAgents\nWhat is an agent?\nIn its ... | A Generative AI model functions as an agent by... | NewMultiHopQuery |
| 5 | How does the orchestration layer enhance the c... | [<1-hop>\n\nAgents\nSummary\nIn this whitepape... | The orchestration layer enhances the capabilit... | NewMultiHopQuery |
| 6 | How does the ReAct framework enable agents to ... | [<1-hop>\n\nAgents\n• Chain-of-Thought (CoT), ... | The ReAct framework enables agents to effectiv... | NewMultiHopQuery |
| 7 | How does a generative AI model utilize reasoni... | [<1-hop>\n\nAgents\nWhat is an agent?\nIn its ... | A generative AI model functions as a generativ... | NewMultiHopQuery |
| 8 | How fine-tuning based learning help chef learn... | [<1-hop>\n\nAgents\n• Imagine a chef has recei... | Fine-tuning based learning helps the chef lear... | NewMultiHopQuery |
| 9 | How does a Generative AI model, like the one u... | [<1-hop>\n\nAgents\nPython\nfrom vertexai.gene... | A Generative AI model, such as the one used in... | NewMultiHopQuery |


In [34]:
dataset.to_pandas().to_csv("data/ragas_synthetic_dataset.csv", index=False)
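
One thing to keep in mind when saving to CSV: each `reference_contexts` cell holds a list, and CSV can only store text, so the list is written as its string representation. The sketch below illustrates the round trip with the standard library only. The row contents and the file name `demo_testset.csv` are made up for illustration, and it assumes the cell was written as a Python list literal.

```python
import ast
import csv
import os
import tempfile

# A made-up row shaped like the generated testset (illustrative values only).
row = {
    "user_input": "What is an agent?",
    "reference_contexts": ["Agents\nWhat is an agent?", "Agents\nThe tools"],
}

# Write to a temporary location so the sketch does not touch the real data folder.
path = os.path.join(tempfile.mkdtemp(), "demo_testset.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    # CSV cells are text, so the list is stored as its string representation.
    writer.writerow({**row, "reference_contexts": str(row["reference_contexts"])})

with open(path, newline="", encoding="utf-8") as f:
    loaded = next(csv.DictReader(f))

raw_cell = loaded["reference_contexts"]  # a str, not a list
contexts = ast.literal_eval(raw_cell)    # parsed back into a real list
print(type(raw_cell).__name__, "->", type(contexts).__name__)
```

If you reload the saved dataset later for evaluation, a similar `ast.literal_eval` step on the `reference_contexts` column is likely needed before passing the contexts on as lists.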

## Summary: Moving Forward with Generated and Prepared Datasets
Now that we have a dataset, either freshly generated or loaded from the prepared files in the data folder, let's move on to the next section: evaluation using RAGAS.

## Bonus: Refactoring Section

This bonus section shows how to cut code execution time from about 1 minute to 3-8 seconds.

If you're familiar with parallel and asynchronous processing, you can combine them to improve response time.
We'll use the `asyncio` module for asynchronous processing and `multiprocessing.pool.ThreadPool` for thread-based parallel processing.

Original code takes at least 50 seconds:
```python
keyphrase_extractor = KeyphrasesExtractor()
output = [await keyphrase_extractor.extract(node) for node in kg.nodes]
_ = [node.properties.update({key:val}) for (key,val), node in zip(output, kg.nodes)]
kg.nodes[0].properties
```
* `output = [await keyphrase_extractor.extract(node) for node in kg.nodes]` processes nodes sequentially, waiting for each `extract` call to finish before starting the next one.

Let's improve using ThreadPool:
```python
import asyncio
from multiprocessing.pool import ThreadPool

def process_node(node):
    # Each worker thread runs its own event loop via asyncio.run
    keyphrase_extractor = KeyphrasesExtractor()
    return asyncio.run(keyphrase_extractor.extract(node))

def update_nodes_pool(kg_nodes, num_threads=4):
    with ThreadPool(processes=num_threads) as pool:
        outputs = pool.map(process_node, kg_nodes)
    _ = [node.properties.update({key:val}) for (key,val), node in zip(outputs, kg_nodes)]
    return kg_nodes[0].properties

_ = update_nodes_pool(kg.nodes)
kg.nodes[0].properties
```
Improved to approximately 14-15 seconds (14.6s, 15.2s, 14.3s).

Now let's improve using async processing:
```python
import asyncio

keyphrase_extractor = KeyphrasesExtractor()

# Run extract() concurrently within each batch, one batch after another
async def process_keyphrase_batch(nodes, batch_size=5):
    outputs = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        batch_output = await asyncio.gather(*[keyphrase_extractor.extract(node) for node in batch])
        outputs.extend(batch_output)
    return outputs
    
outputs = await process_keyphrase_batch(kg.nodes)
_ = [node.properties.update({key:val}) for (key,val), node in zip(outputs, kg.nodes)]
kg.nodes[0].properties
```
This version takes approximately 16 seconds.
Here, nodes are processed concurrently in batches of 5 using `asyncio.gather`.
The key improvement comes from `asyncio.gather`, which runs multiple coroutines concurrently and waits for all of their results. This helps because `extract` spends most of its time on I/O operations (API calls).
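
To see why batching with `asyncio.gather` helps for I/O-bound work, here is a minimal, self-contained sketch. `fake_extract` is a hypothetical stand-in for `keyphrase_extractor.extract` that simulates an API call with `asyncio.sleep`. The 0.05 second delay and the node names are made up for illustration and are not measurements from RAGAS.

```python
import asyncio
import time


async def fake_extract(node, delay=0.05):
    """Hypothetical stand-in for an extractor call; sleeps to mimic API latency."""
    await asyncio.sleep(delay)
    return ("keyphrases", [f"phrase-of-{node}"])


async def run_sequential(nodes):
    # One call at a time: total time is roughly len(nodes) * delay.
    return [await fake_extract(node) for node in nodes]


async def run_batched(nodes, batch_size=5):
    # Calls within a batch overlap: total time is roughly number_of_batches * delay.
    outputs = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        outputs.extend(await asyncio.gather(*(fake_extract(n) for n in batch)))
    return outputs


def timed(coro_fn, nodes):
    # asyncio.run works in a plain script; in a notebook, await the coroutine instead.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(nodes))
    return result, time.perf_counter() - start


demo_nodes = [f"node-{i}" for i in range(10)]
seq_out, seq_time = timed(run_sequential, demo_nodes)
batch_out, batch_time = timed(run_batched, demo_nodes)
print(f"sequential: {seq_time:.2f}s, batched: {batch_time:.2f}s")
```

Both versions return results in the same order as the input nodes, so the later `zip(outputs, kg.nodes)` pairing stays valid. Only the waiting time overlaps.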

What happens when we combine both approaches?
```python
import asyncio
from multiprocessing.pool import ThreadPool

# Async function to process single batch
async def process_batch(batch):
    keyphrase_extractor = KeyphrasesExtractor()
    batch_output = await asyncio.gather(*[keyphrase_extractor.extract(node) for node in batch])
    return batch_output

# Function to run in thread
def process_batch_in_thread(batch):
    return asyncio.run(process_batch(batch))

def process_with_thread_and_async(nodes, batch_size=5, num_threads=4):
    # Divide data into batches
    batches = [nodes[i:i + batch_size] 
              for i in range(0, len(nodes), batch_size)]
    
    # Process batches using thread pool
    with ThreadPool(processes=num_threads) as pool:
        all_outputs = pool.map(process_batch_in_thread, batches)
    
    outputs = []
    for batch_output in all_outputs:
        outputs.extend(batch_output)
    
    # Update results
    _ = [node.properties.update({key:val}) 
         for (key,val), node in zip(outputs, nodes)]
    
    return nodes[0].properties

_ = process_with_thread_and_async(kg.nodes)
kg.nodes[0].properties
```
By effectively combining parallel and asynchronous processing, we can reduce execution time from 1 minute to approximately 3-8 seconds.
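
The combined approach relies on results coming back in input order, since they are paired with nodes through `zip`. The sketch below checks that property in isolation. It assumes a hypothetical `fake_extract` coroutine in place of the real extractor and uses plain dictionaries instead of RAGAS nodes, so it runs without any API calls.

```python
import asyncio
from multiprocessing.pool import ThreadPool


async def fake_extract(node):
    """Hypothetical stand-in for an extractor call; not part of RAGAS."""
    await asyncio.sleep(0.01)
    return ("keyphrases", [f"phrase-of-{node['id']}"])


async def process_batch(batch):
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fake_extract(node) for node in batch))


def process_batch_in_thread(batch):
    # Each worker thread starts its own event loop.
    return asyncio.run(process_batch(batch))


def extract_all(nodes, batch_size=5, num_threads=4):
    batches = [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
    with ThreadPool(processes=num_threads) as pool:
        # pool.map also preserves the order of its input.
        per_batch = pool.map(process_batch_in_thread, batches)
    return [item for batch_output in per_batch for item in batch_output]


# Plain dicts standing in for knowledge graph nodes (illustrative only).
demo_nodes = [{"id": i, "properties": {}} for i in range(12)]
for (key, val), node in zip(extract_all(demo_nodes), demo_nodes):
    node["properties"][key] = val

print(demo_nodes[0]["properties"])
```

Because both `pool.map` and `asyncio.gather` keep input order, each node receives its own keyphrases even though batches finish at different times.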