This notebook demonstrates how to create a comprehensive test dataset using the [RAGAS test set generator](https://docs.ragas.io/en/stable/getstarted/rag_testset_generation).

## Overview

We'll generate synthetic test queries, their reference contexts, and ground-truth responses from our source documents. The process includes:

- Loading documents from the `test_data_files` directory
- Using GPT-4o-mini and OpenAI embeddings (`text-embedding-3-small`) to generate diverse test cases
- Exporting the results as a CSV file in the `generated_test_data` directory for later evaluation

## Sample Data

For demonstration purposes, this notebook uses the first chapter of two classic works (Dracula and Sherlock Holmes), stored as `.txt` files in the `test_data_files` folder.

**To use your own dataset:** Simply replace the files in the `test_data_files` directory with your documents.
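
Before running the loader on your own files, it can help to confirm what is actually sitting in the directory. The following is a minimal sketch using only the standard library, assuming the documents are plain `.txt` files as in the sample data. `list_source_files` is a hypothetical helper, not part of RAGAS or LangChain, and it does not mirror the exact matching rules of `DirectoryLoader`.

```python
from pathlib import Path


def list_source_files(data_dir: str, suffix: str = ".txt") -> list[Path]:
    """Return the files under data_dir with the given suffix, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        # A missing directory simply yields no files.
        return []
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())


# Print each candidate file with its size so empty files stand out.
for path in list_source_files("test_data_files"):
    print(path.name, path.stat().st_size, "bytes")
```

An empty listing means the generator would have nothing to work with, so it is worth checking before spending API calls.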

In [1]:
from dotenv import load_dotenv
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import DirectoryLoader
from ragas.llms.base import llm_factory
from ragas.embeddings import OpenAIEmbeddings
import openai
import os

In [2]:
load_dotenv("configs.env")

True
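
`load_dotenv` returning `True` only indicates that the file was found and at least one variable was set; it does not say which ones. Since `openai.AsyncOpenAI()` reads the `OPENAI_API_KEY` environment variable by default, a quick check along these lines can catch a missing key early. This is a sketch under the assumption that `configs.env` is meant to define `OPENAI_API_KEY`; `missing_env_vars` is a hypothetical helper.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumption: the OpenAI client is configured through OPENAI_API_KEY.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```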

In [3]:
TEST_DATA_DIR_PATH = "test_data_files"
os.makedirs(TEST_DATA_DIR_PATH, exist_ok=True)

OUTPUT_TEST_DATA_DIR = "generated_test_data"
os.makedirs(OUTPUT_TEST_DATA_DIR, exist_ok=True)

In [4]:
generator_llm = llm_factory("gpt-4o-mini")
openai_client = openai.AsyncOpenAI()
generator_embeddings = OpenAIEmbeddings(
    client=openai_client, model="text-embedding-3-small"
)

In [5]:
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

In [6]:
doc_files = DirectoryLoader(TEST_DATA_DIR_PATH).load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


In [7]:
len(doc_files)

2

In [8]:
# Generate a total of 10 test samples
dataset = generator.generate_with_langchain_docs(doc_files, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/21 [00:00<?, ?it/s]

Applying EmbeddingExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying ThemesExtractor:   0%|          | 0/20 [00:00<?, ?it/s]

Applying NERExtractor:   0%|          | 0/20 [00:00<?, ?it/s]

Applying CosineSimilarityBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Skipping multi_hop_abstract_query_synthesizer due to unexpected error: No relationships match the provided condition. Cannot form clusters.


Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [9]:
df = dataset.to_pandas()

In [10]:
df.shape

(10, 4)

In [11]:
df.head()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What was the author's experience upon arriving... | [3 May. Bistritz._--Left Munich at 8:35 P. M.,... | The author arrived in Vienna early the next mo... | single_hop_specific_query_synthesizer |
| 1 | Wot kind of experiences can a traveler expect ... | [such as we see in old missals; sometimes we r... | In London, a traveler can expect to encounter ... | single_hop_specific_query_synthesizer |
| 2 | What does the Count's letter indicate about th... | [4 May._--I found that my landlord had got a l... | The Count's letter directed the narrator's lan... | single_hop_specific_query_synthesizer |
| 3 | In the context of travel narratives, how is th... | [5 May. The Castle._--The grey of the morning ... | The term 'vrolok' is significant as it represe... | single_hop_specific_query_synthesizer |
| 4 | What is the significance of the Mittel Land in... | [the Hospadars would not repair them, lest the... | The Mittel Land is described as a beautiful re... | single_hop_specific_query_synthesizer |
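
Because the log above reports that one synthesizer was skipped, it is worth checking how the generated samples are distributed across synthesizers. The sketch below counts values with the standard library; the input list is made-up illustration data, and `synthesizer_counts` is a hypothetical helper. In the notebook, the same function could be called on `df["synthesizer_name"]`.

```python
from collections import Counter


def synthesizer_counts(names):
    """Count how many samples each synthesizer produced."""
    return dict(Counter(names))


# Illustration data only, not the notebook's actual output.
example_names = [
    "single_hop_specific_query_synthesizer",
    "single_hop_specific_query_synthesizer",
    "multi_hop_specific_query_synthesizer",
]
print(synthesizer_counts(example_names))
```

A heavily skewed count suggests the test set covers fewer query types than intended.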


In [12]:
os.makedirs(OUTPUT_TEST_DATA_DIR, exist_ok=True)

In [13]:
# Save the generated test data as a CSV file
df.to_csv(os.path.join(OUTPUT_TEST_DATA_DIR, "test_data.csv"), index=False)
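
One detail to keep in mind when reloading this file: each `reference_contexts` cell holds a list of strings, and CSV stores it as the text form of that list. After reading the file back, the cell is a plain string until it is parsed. The sketch below reproduces this with the standard library on a hypothetical one-row sample (not real generator output) and parses the cell with `ast.literal_eval`; `parse_contexts` is a hypothetical helper.

```python
import ast
import csv
import io

# Hypothetical one-row sample mimicking the exported columns.
rows = [
    {
        "user_input": "Who wrote the letter?",
        "reference_contexts": str(["first chunk", "second chunk"]),
        "reference": "The Count.",
        "synthesizer_name": "single_hop_specific_query_synthesizer",
    }
]

# Write to an in-memory CSV, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)
loaded = list(csv.DictReader(buffer))


def parse_contexts(cell: str) -> list[str]:
    """Turn the stored string form of a list back into a real list."""
    return ast.literal_eval(cell)


contexts = parse_contexts(loaded[0]["reference_contexts"])
print(type(loaded[0]["reference_contexts"]).__name__, "->", type(contexts).__name__)
# prints: str -> list
```

With pandas, the equivalent step after `pd.read_csv` would be `df["reference_contexts"].apply(ast.literal_eval)`.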

In [14]:
print("Finito")

Finito
