<a href="https://colab.research.google.com/github/revirevy/llm-tutorials/blob/main/RAG_synthetic_test_data_with_Unstructured_GPT_4o_and_Ragas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Synthetic Test Dataset for your RAG system with Unstructured, GPT-4o, Ragas, and LangChain in 5 easy steps

In this quick 5-step tutorial, we'll create a test dataset for your RAG system from your PDFs. The dataset includes different types of questions about your PDF, along with ground truth answers and the context used to create them.

We will demonstrate how easily your PDFs can be transformed, via Unstructured's API, into meaningfully chunked text segments from which to generate data. Combined with the Ragas evaluation framework, this synthetic test dataset will let you evaluate your RAG system's performance across key metrics such as context precision, faithfulness, answer relevancy, and context recall.

Evaluating RAG systems comprehensively is challenging because it requires many custom questions and answers per document on which to measure performance. Rather than having human labelers pore over long documents, these can be created synthetically with a powerful, cost-effective model like GPT-4o. GPT-4 has long been the standard for challenging tasks like creating synthetic test datasets for RAG, and the recent release of GPT-4o halves the cost and doubles the speed, in addition to bringing improvements across 50 languages!

However, even GPT-4o is not good at creating diverse samples by default, as it tends to follow common paths. Ragas addresses this by employing an evolutionary generation paradigm, where questions with different characteristics such as reasoning, conditioning, multi-context, and more are systematically crafted from the provided set of documents.

We'll use [Unstructured API](https://unstructured.io/) for preprocessing PDF files, [Ragas](https://docs.ragas.io/en/latest/getstarted/testset_generation.html) for the test set generation framework, [OpenAI's GPT-4o](https://platform.openai.com/docs/models) to do the Q & A data generation, and [LangChain](https://www.langchain.com/) for integration with Ragas and OpenAI.

For this demo, we download one of the top papers from NeurIPS 2023, [Scaling Data-Constrained Models (Muennighoff et al. 2023)](https://proceedings.neurips.cc/paper_files/paper/2023/file/9d89448b63ce1e2e8dc7af72c984c196-Paper-Conference.pdf). But you can alternatively reduce your own data constraints by making your PDFs and other unstructured documents machine readable with Unstructured's API or Platform :)
_________________________________________

1. To get started, install all the libraries, get your [free Unstructured API key](https://unstructured.io/api-key-free), enter your OpenAI API key, and instantiate the Unstructured client to preprocess your PDF file:

In [None]:
!pip install -q unstructured-client unstructured[all-docs] langchain ragas

In [None]:
import os

os.environ["UNSTRUCTURED_API_KEY"] = "YOUR_UNSTRUCTURED_API_KEY"
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
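
With placeholder keys left in place, the failure only surfaces at the first API call. A small guard, sketched below, can catch that earlier. Note that `require_env` and the `YOUR_` placeholder convention are assumptions made for illustration; neither is part of the Unstructured or OpenAI SDKs.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset
    or still looks like a placeholder (hypothetical 'YOUR_' convention)."""
    value = os.environ.get(name, "").strip()
    if not value or value.startswith("YOUR_"):
        raise RuntimeError(f"Set {name} to a real API key before continuing.")
    return value
```

It could then be used in place of a bare lookup, for example `unstructured_api_key = require_env("UNSTRUCTURED_API_KEY")`.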

In [None]:
from unstructured_client import UnstructuredClient

unstructured_api_key = os.environ.get("UNSTRUCTURED_API_KEY")

client = UnstructuredClient(
    api_key_auth=unstructured_api_key,
    # if using paid API, provide your unique API URL:
    # server_url="YOUR_API_URL",
)

2. Download, partition, and chunk your file so that the logical structure of the document is preserved for better question generation and RAG results. Note that this takes about 2 minutes for the linked PDF. While Ragas and GPT-4o can be used with any document dataset to generate synthetic test data, Unstructured enables this generation on top of your unstructured data across connectors, file types, and languages.



In [None]:
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements
import requests
import tempfile

path_to_pdf = 'https://proceedings.neurips.cc/paper_files/paper/2023/file/9d89448b63ce1e2e8dc7af72c984c196-Paper-Conference.pdf'

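# Function to download file from URL
def download_file(url):
    try:
        response = requests.get(url, timeout=60)
        # Treat HTTP error codes (e.g. 404) as failures instead of saving an error page
        response.raise_for_status()
        print("Download succeeded")
        return response.content
    except requests.RequestException as e:
        print("Download failed:", e)
        return None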

# Download the PDF file
pdf_content = download_file(path_to_pdf)

# Check if download was successful
if pdf_content:
    # Create a temporary file to save the PDF content
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
        tmp_file.write(pdf_content)
        tmp_file_path = tmp_file.name

    # Preprocess with Unstructured
    with open(tmp_file_path, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=tmp_file.name,
        )
        req = shared.PartitionParameters(
            files=files,
            chunking_strategy="by_title",
            max_characters=512,
        )
        try:
            resp = client.general.partition(req)
            elements = dict_to_elements(resp.elements)
            # Use 'elements' as needed
        except SDKError as e:
            print(e)

    # Clean up: Remove temporary file
    os.remove(tmp_file_path)

else:
    print("File download failed.")

Download succeeded
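
To make the effect of `chunking_strategy="by_title"` and `max_characters=512` more concrete, here is a toy sketch of title-based chunking. It is not Unstructured's implementation: the `(kind, text)` tuples, the `chunk_by_title` function, and the sample text are hypothetical stand-ins, and the real service handles many more cases. The idea it illustrates is that a new chunk starts at each title, or when adding the next element would exceed the character limit.

```python
from typing import List, Tuple


def chunk_by_title(elements: List[Tuple[str, str]], max_characters: int = 512) -> List[str]:
    """Group (kind, text) pairs into chunks. A new chunk starts at each
    'Title' element or when the next text would push the chunk past
    max_characters. A single oversized text is kept whole in this sketch."""
    chunks: List[str] = []
    current: List[str] = []
    for kind, text in elements:
        candidate_len = len("\n".join(current + [text]))
        if current and (kind == "Title" or candidate_len > max_characters):
            chunks.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Made-up elements, only to show the grouping behaviour
toy_elements = [
    ("Title", "1 Introduction"),
    ("NarrativeText", "First section body."),
    ("Title", "2 Method"),
    ("NarrativeText", "Second section body."),
]
for chunk in chunk_by_title(toy_elements, max_characters=512):
    print(repr(chunk))
```

Keeping each section's heading together with its body is what lets a generated question stay anchored to one coherent part of the paper.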


3. Convert the chunks into LangChain documents from which to generate Q&A text:


In [None]:
from langchain_core.documents import Document

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    documents.append(Document(page_content=element.text, metadata=metadata))

# Shorten the document for a quick demo:
documents_test = documents[1:30]

4. Import and combine Ragas and OpenAI's GPT-4o for test set generation. In this section, we define which model generates the questions and answers ('generator_llm') and which model evaluates the quality of the answers ('critic_llm'). We have chosen a roughly even distribution across question types since we do not know which kinds of questions a user would ask about this data.

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-4o") # "gpt-3.5-turbo-16k" is another option
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.33,
    multi_context: 0.33,
    reasoning: 0.34
}
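
Before launching a run that takes minutes and costs API calls, it can help to sanity-check the distribution. The sketch below assumes the weights are meant to be proportions that sum to 1, and reports the approximate share of questions per type. `expected_counts` is a hypothetical helper, and the string keys stand in for the Ragas evolution objects used above; how Ragas itself rounds these shares is not covered here.

```python
import math


def expected_counts(distribution: dict, test_size: int) -> dict:
    """Check that the weights sum to 1 and return the approximate
    number of questions each type would receive."""
    total = sum(distribution.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"Distribution weights sum to {total}, expected 1.0")
    return {name: weight * test_size for name, weight in distribution.items()}


# String keys are stand-ins for simple, multi_context, and reasoning
demo_distribution = {"simple": 0.33, "multi_context": 0.33, "reasoning": 0.34}
print(expected_counts(demo_distribution, 6))
```

With 6 questions and three roughly equal weights, each type should receive about two questions.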

5. Run the generation with your documents, set the number of questions, and pass in the question type distribution defined above.

Note that this takes about 2 minutes to run for the 6 questions in this example.

In [None]:
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
testset = generator.generate_with_langchain_docs(documents_test, 6, distributions)

embedding nodes:   0%|          | 0/58 [00:00<?, ?it/s]

Generating:   0%|          | 0/6 [00:00<?, ?it/s]

Voilà, here are the questions and answers with which to evaluate your RAG system:

In [None]:
import pandas as pd

test_df = testset.to_pandas()

for index, row in test_df[test_df['ground_truth'] != 'nan'].iterrows(): # Remove nan
    print(f"Question: {row['question']}")
    print(f"Ground Truth: {row['ground_truth']}")
    print("-" * 30)  # Adding a separator for better readability

Question: What is the role of cross-entropy in quantifying a model's progress?
Ground Truth: Cross-entropy is used to quantify a model's progress by measuring the model’s loss on held-out data, which reflects the ability to predict the underlying data.
------------------------------
Question: What is recorded to quantify the impact of multiple epochs in LLM training?
Ground Truth: Final test loss is recorded to quantify the impact of multiple epochs in LLM training.
------------------------------
Question: How many models (10M-9B params) were trained to study repeated data use in LLM scaling?
Ground Truth: More than 400 models ranging from 10 million to 9 billion parameters were trained to study repeated data use in LLM scaling.
------------------------------
Question: How does Chinchilla manage resources and predict LLM scaling?
Ground Truth: Chinchilla manages resources by dividing them roughly equally between scaling of parameters and data. It uses three methods for making scaling p
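
The filter above compares `ground_truth` against the string `'nan'`. Depending on how the DataFrame was produced (for example, after a round trip through CSV), a missing answer may instead show up as a float NaN, `None`, or an empty string. A more defensive check could look like the sketch below; `is_missing` is a hypothetical helper, not part of Ragas or pandas.

```python
import math


def is_missing(value) -> bool:
    """Return True for None, float NaN, empty strings, or the literal string 'nan'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() in ("", "nan")


# Plain-Python demonstration with made-up values
samples = ["Final test loss is recorded.", "nan", float("nan"), None, ""]
print([is_missing(s) for s in samples])
```

With pandas, the same helper could be applied as `test_df[~test_df['ground_truth'].map(is_missing)]`.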

In [None]:
test_df.head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the role of cross-entropy in quantifyi... | [steps.1 The metric used to quantify progress ... | Cross-entropy is used to quantify a model's pr... | simple | [{'filetype': 'application/pdf', 'languages': ... | True |
| 1 | What is recorded to quantify the impact of mul... | [decreasing validation loss and improving down... | Final test loss is recorded to quantify the im... | simple | [{'filetype': 'application/pdf', 'languages': ... | True |
| 2 | How does cross-entropy loss on held-out data, ... | [and hence N and D should be scaled proportion... | | multi_context | [{'filetype': 'application/pdf', 'languages': ... | True |
| 3 | How many models (10M-9B params) were trained t... | [Our main focus is to quantify the impact of m... | More than 400 models ranging from 10 million t... | multi_context | [{'filetype': 'application/pdf', 'languages': ... | True |
| 4 | How does Chinchilla manage resources and predi... | [Currently, there are established best practic... | Chinchilla manages resources by dividing them ... | reasoning | [{'filetype': 'application/pdf', 'languages': ... | True |


With your test data ready to go, you can evaluate your RAG system by following Ragas's [evaluation documentation](https://docs.ragas.io/en/latest/getstarted/evaluation.html). You can also check out our previous notebook to [build a RAG system with Llama3](https://t.co/3qNcPuxhSy) to help get started with RAG. Feel free to copy and scale up this demo for your RAG evaluation needs!    