# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!


  1. Use RAGAS to Generate Synthetic Data
  2. Load them into a LangSmith Dataset
  3. Evaluate our RAG chain against the synthetic test data
  4. Make changes to our pipeline
  5. Evaluate the modified pipeline

SDG is a critical piece of the puzzle, especially for early iteration! Without it, it would not be nearly as easy to get high quality early signal for our application's performance.

Let's dive in!

## Task 1: Dependencies and API Keys

We'll need to install a number of API keys and dependencies, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. QDrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [1]:
#!pip install -qU ragas==0.2.10

In [2]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [2]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DadaV\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DadaV\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [4]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"Philippines AI Bills RAG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")


In [29]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefull be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [6]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

path = "bills/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [7]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
# generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [8]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [9]:
from ragas.testset.graph import Node, NodeType

### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
for doc in docs[:20]:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 20, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.

In [10]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

default_transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, default_transforms)
kg

Applying SummaryExtractor:   0%|          | 0/16 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/20 [00:00<?, ?it/s]

Node 1d683277-4947-4e86-92e9-6852810f7514 does not have a summary. Skipping filtering.
Node 997e4588-63e7-4bb6-b03f-dc185b06a1ce does not have a summary. Skipping filtering.
Node 249c7c95-d982-4e7d-83f1-9496dad1496f does not have a summary. Skipping filtering.
Node 26160b66-9a9f-4915-bd6d-ca424ad30ad2 does not have a summary. Skipping filtering.


Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/56 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 20, relationships: 137)

We can save and load our knowledge graphs as follows.

In [11]:
kg.save("bills/ai_law.json")
bills_data_kg = KnowledgeGraph.load("bills/ai_law.json")
bills_data_kg

KnowledgeGraph(nodes: 20, relationships: 137)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [12]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=bills_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthetsizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [13]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

Finally, we can use our `TestSetGenerator` to generate our testset!

In [14]:
testset = generator.generate(testset_size=20, query_distribution=query_distribution)
testset.to_pandas()
testset.to_jsonl("bills/golden_dataset.json")  # Save for reuse

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/20 [00:00<?, ?it/s]

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [15]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying SummaryExtractor:   0%|          | 0/16 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/20 [00:00<?, ?it/s]

Node 91d45b63-134f-4a2c-8b5f-42646c369283 does not have a summary. Skipping filtering.
Node fe0b4b07-9395-4043-9735-35520124fe06 does not have a summary. Skipping filtering.
Node 3eb8450b-1d07-4756-9bac-c59c58e67535 does not have a summary. Skipping filtering.
Node f7a87d43-4c37-46ff-a937-f55f3e4a18d1 does not have a summary. Skipping filtering.


Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/56 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [16]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Who is PIA S. CAYETANO in the context of the P...,[TWENTIETH CONGRESS OF THE \nREPUBLIC OF THE P...,PIA S. CAYETANO is the senator who introduced ...,single_hop_specifc_query_synthesizer
1,Why Georgetown University is important for AI ...,[AI presents enormous opportunities for the Ph...,The context mentions Georgetown University in ...,single_hop_specifc_query_synthesizer
2,What is the significance of the REPUBLIC OF TH...,[TWENTIETH CONGRESS OF THE \nREPUBLIC OF THE P...,The context discusses the TWENTIETH CONGRESS O...,single_hop_specifc_query_synthesizer
3,AI is like what do it do for Philippines?,"[1 \na) Promote innovation, technological adva...",AI refers to systems that allow machines to th...,single_hop_specifc_query_synthesizer
4,How does the creation of an AI Ethics Review B...,[<1-hop>\n\nAI presents enormous opportunities...,"The creation of an AI Ethics Review Board, as ...",multi_hop_abstract_query_synthesizer
5,How do the registration and licensing requirem...,[<1-hop>\n\n1 \nground its responses in verifi...,The Philippine AI regulations specify that any...,multi_hop_abstract_query_synthesizer
6,How do coordnaton with gov agencies and mainte...,[<1-hop>\n\n1 \nSec. 8. NAICSecretariat. - The...,"Coordnaton with gov agencies, local units, and...",multi_hop_abstract_query_synthesizer
7,How does the Philippines' AI regulation framew...,[<1-hop>\n\nAI presents enormous opportunities...,The Philippines' AI regulation framework aims ...,multi_hop_abstract_query_synthesizer
8,How does DepEd relate to AI regulation under t...,[<1-hop>\n\n1 \nSec. 6. Jurisdiction of the NA...,DepEd is listed as one of the government agenc...,multi_hop_specific_query_synthesizer
9,How does the bill address ASI risks and promot...,[<1-hop>\n\nAI presents enormous opportunities...,The bill recognizes the potential rise of Arti...,multi_hop_specific_query_synthesizer


We'll need to provide our LangSmith API key, and set tracing to "true".

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [30]:
from langsmith import Client

client = Client()

dataset_name = "Philippines AI Bills x5"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Philippines AI Bills"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [31]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

## Basic RAG Chain

Time for some RAG!


In [32]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [33]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [34]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

To get the "A" in RAG, we'll provide a prompt.

In [24]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful and informative assistant. Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context, or you are unsure, you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using TogetherAI's endpoints as well!

We're going to be using Meta Llama 3.1 70B Instruct Turbo - a powerful model which should get us powerful results!

In [25]:
from langchain_openai import ChatOpenAI

# llm = ChatOpenAI(model="gpt-4.1-mini")
chat_model = ChatOpenAI(model="gpt-4.1-nano")

As usual, we will power our RAG application with Qdrant!

In [35]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

semantic_documents = semantic_chunker.split_documents(rag_documents[:20])

# semantic chunking OFF - use rag_documents, else semantic_documents

from langchain_community.vectorstores import Qdrant

text_vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Bills RAG")

semantic_vectorstore = Qdrant.from_documents(
    documents=semantic_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Bills RAG")

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [40]:
eval_llm = ChatOpenAI(model="gpt-4.1")

from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={
    "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["result"],      # student's answer (model output)
        "reference": example.outputs["answer"],   # gold / true answer
        "input": example.inputs["query"],         # the question
    }
    )

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["result"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm" : eval_llm
    }
)

In [None]:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

from langchain.retrievers.multi_query import MultiQueryRetriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models
from langchain_qdrant import QdrantVectorStore

from langchain.retrievers import EnsembleRetriever

# Naive retriever
# retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
retriever_names = [
    "Naive", "BM25", "Multi-Query", "Parent-Document", "Contextual Compression", "Ensemble"
]
results = []

import time

for sc in [False, True]:  # semantic chunking off/on
    for retriever_name in retriever_names: # loop through each retriever
    
        start_time = time.time()
        if sc:
            vectorstore = semantic_vectorstore
        else:
            vectorstore = text_vectorstore

        match retriever_name:
        
            case 'Naive':
                retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
                naive_retriever = retriever
            
            case 'BM25':
                retriever = BM25Retriever.from_documents(rag_documents) 
                bm25_retriever = retriever
            
            case 'Contextual Compression':
                compressor = CohereRerank(model="rerank-v3.5")
                retriever = ContextualCompressionRetriever(
                            base_compressor=compressor, base_retriever=naive_retriever)
                compression_retriever = retriever

            case 'Multi-Query':
                retriever = MultiQueryRetriever.from_llm(retriever=naive_retriever, llm=chat_model)
                multi_query_retriever = retriever
                
            case 'Parent-Document':
                parent_docs = rag_documents
                child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

                client = QdrantClient(location=":memory:")

                client.create_collection(
                    collection_name="full_documents",
                    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
                )

                parent_document_vectorstore = QdrantVectorStore(
                    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
                )
                store = InMemoryStore()
                retriever = ParentDocumentRetriever(
                                vectorstore = parent_document_vectorstore,
                                docstore=store,
                                child_splitter=child_splitter)
                retriever.add_documents(parent_docs, ids=None)
                parent_document_retriever = retriever

            case 'Ensemble':
                retriever_list = [naive_retriever, bm25_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
                equal_weighting = [1/len(retriever_list)] * len(retriever_list)
                retriever = EnsembleRetriever(retrievers=retriever_list, weights=equal_weighting)
            
            case ' ':
                retriever = "Invalid option"
        
        from operator import itemgetter
        from langchain_core.runnables import RunnablePassthrough, RunnableParallel
        from langchain.schema import StrOutputParser

        rag_chain = (
            # {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
            # | rag_prompt | llm | StrOutputParser()
            {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
            | RunnablePassthrough.assign(context=itemgetter("context"))
            | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
        )

        rag_chain.invoke({"question" : "How does the Philippines regulate the development of AI in the country?"})
        # ["response"].content

        latency = time.time() - start_time
        evaluate(
            rag_chain.invoke,
            data=dataset_name,
            evaluators=[
                qa_evaluator,
                labeled_helpfulness_evaluator,
                empathy_evaluator],
            metadata={
                "retriever": retriever_name,
                "semantic chunking": sc
                # "latency": latency
                }
        )

View the evaluation results for experiment: 'bold-peace-15' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=c4bf537c-6c37-4149-8046-d3a09d7c4ab8




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 1a4b593d-03a3-46f8-9b74-3c552f633759: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'new-root-96' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=5e1912ac-6a60-4505-bdeb-a423df366be1




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 3b89702f-9fe5-4da7-aa9e-6fdcb78e7ff9: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'drab-nut-90' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=e1ec8a3d-03b6-48d3-b1d1-beb4d5c2ba83




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run afb48638-bb71-46bb-8d56-10b9b7d978e5: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'loyal-invention-66' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=70dcf428-5210-482d-875b-3acf28062187




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 0bde4a67-09ab-4d52-b439-4a622a79082e: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'dear-stomach-69' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=f0d49c8f-370f-4471-aef9-82986b9946c8




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run e131ec2e-a21a-4711-b89a-62e50259a2ad: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'roasted-pleasure-32' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=916a0593-126f-4e24-be10-920656b1a272




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run d6daa772-8be2-4af7-83ae-4338967a4b8b: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'virtual-coffee-57' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=de8c26bc-9c6e-4b0a-bbbc-1f31e71ef2e4




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run e4a4ab13-7b4d-4356-939b-42d22de6d0c5: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'elderly-acknowledgment-44' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=0cb85e1b-fcfd-4369-92df-fb650eb7cf2e




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 0ff8ca69-b52c-49d1-a3a3-08eb73f3b8e4: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'enchanted-town-24' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=e6d46e6a-a1f9-41a0-87c8-f9e67e320f7f




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 580a7cec-0518-4617-8d86-304e22803a60: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'complicated-rabbit-64' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=7e594197-ab67-4793-bb20-90ff7d3d111f




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 22bb948c-e4a9-460d-84fd-1aac32a062fc: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'crushing-team-83' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=6da8d800-14fd-4aac-818f-570445b333cf




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 5567d2f2-59f2-4bc3-bf07-a17bceeeec8a: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

View the evaluation results for experiment: 'ample-animal-5' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/4f35f17d-671b-460c-9fd0-ae389801030a/compare?selectedSessions=b26a64cd-90ef-4bfb-ae4f-f1948349bb81




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run ff689085-d4b3-4fa0-b2c2-e8de30e94170: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin!\\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(c

****** EVALUATION ENDS HERE ********

Please ignore the rest down below


## LangSmith Evaluation

In [30]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'giving-trick-63' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/ef297373-b111-4973-ab34-102fa9ead893/compare?selectedSessions=59c23679-3fb3-463c-ae54-7bb8705f7490




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.empathy,execution_time,example_id,id
0,"How does the AI registration process, includin...","Based on the provided context, the AI registra...",,The AI registration process requires detailed ...,1,1,0,5.208305,5224b263-77a3-4a2b-b3e2-165729d7b4f0,a39166aa-51d4-48d9-ab83-6218f233cbec
1,"How does the NAIC's jurisdiction over AI, as o...",I don't know.,,"The NAIC's jurisdiction over AI, as specified ...",0,0,0,1.3004,5f49dd4e-5938-41f3-86a7-0134d10e9cfd,d2900b56-f564-445e-b917-bd0088a69d7d
2,how does the national AI commission work with ...,The National AI Commission (NAIC) coordinates ...,,the national AI commission develops policies a...,1,1,0,3.11134,703738e2-269e-49d9-92b3-76f70563ebdd,f68bcf18-e913-4b13-a85b-4f33e1540732
3,How do the laws and regulations regarding AI i...,The laws and regulations regarding AI in the P...,,The laws and regulations in the Philippines pr...,0,0,0,2.560005,6557c129-dcb4-465f-b1d3-3201a8b2470b,e4f4b466-ff97-47cf-8c68-ca368d407850
4,How do government agencies like DOST and DICT ...,I don't know.,,The context specifies that government agencies...,0,0,0,0.866825,262b891c-be05-4f04-b079-4407068bc933,903a9571-f41c-40b7-a4bb-33cecda1b736
5,whats the repleal of laws and AI hallusinations?,I don't know.,,The context discusses the repeal of laws that ...,0,0,0,1.990699,e2584e77-e0b0-4af4-b348-26650406ec39,5f9e6b05-9b7a-4801-af1b-b7ad6a4f29e0
6,How does the deployment of AI systems relate t...,"Based on the provided context, the deployment ...",,The deployment of AI systems is closely connec...,1,1,0,5.766852,0846ba3a-7e0c-499b-80cb-061874e55ac9,4c32fdc6-d8ad-4635-9e86-8e13155b4ae7
7,"How does the use of AI for manipulatn, disinfo...","Based on the provided context, the NAIC (Natio...",,The context highlights that AI used for manipu...,1,1,0,6.190494,3be306f4-3391-41f6-a9df-23d150ba847c,4cf4a9da-98ec-4578-8480-091332fa6fdc
8,How does the Philippines regulate the developm...,"According to the provided context, the Philipp...",,The Philippines regulates the development and ...,1,1,0,3.235278,1985fb8a-385b-4602-afc4-dfa241eead63,c93ec0d0-edd0-4370-9d45-c162f4c40cbb
9,What does the P4:56 refer to in the context of...,I don't know.,,The P4:56 appears to be a reference code or ti...,0,0,0,0.814284,f56ccff2-ab8d-4b7e-b993-e2c5236f6930,d1d62e0d-7d1e-4cc6-be27-71bca1498c95


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Improve the retriever model to: `text-embedding-3-large`

Let's see how this changes our evaluation!

In [32]:
EMPATHY_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the question using empathy and kindness, and make sure the user feels heard.

Context: {context}
Question: {question}
"""

empathy_rag_prompt = ChatPromptTemplate.from_template(EMPATHY_RAG_PROMPT)

In [33]:
rag_documents = docs

In [34]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

In [35]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [36]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Bills RAG 2"
)

In [37]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [38]:
empathy_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | empathy_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same output that we saw before.

In [39]:
empathy_rag_chain.invoke({"question" : "Why is the Philippines AI Bill important?"})

'Thank you for your thoughtful question. Based on the context provided, the Philippines AI Bill is important because it aims to create a national framework that balances the encouragement of technological innovation with ensuring that AI systems are safe, ethical, transparent, and under meaningful human oversight. It recognizes the transformative impact of AI on industries and society and seeks to promote responsible and lawful AI development that supports Filipino ingenuity and addresses national development challenges.\n\nMoreover, the bill emphasizes protecting the rights and welfare of every citizen by preventing AI from being used to commit crimes, abuse rights, or cause harm, whether intentionally or accidentally. This shows a deep concern for the well-being of the people while fostering progress and innovation in technology.\n\nIt’s clear that the bill strives to create a secure, inclusive, and ethical digital future for the Philippines, thoughtfully anticipating both the opport

Finally, we can evaluate the new chain on the same test set!

In [40]:
evaluate(
    empathy_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        empathy_evaluator
    ],
    metadata={"revision_id": "empathy_rag_chain"},
)

View the evaluation results for experiment: 'complicated-station-33' at:
https://smith.langchain.com/o/92af5c49-0a9e-4f85-beea-085fbd240cd1/datasets/861faf01-85b7-44d1-8a8c-72463fffb4fb/compare?selectedSessions=112f58bd-e760-499e-9456-900ca391a6b3




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.empathy,execution_time,example_id,id
0,How does the DOST support AI policies and init...,Thank you for your thoughtful question. From t...,,The DOST supports AI policies and initiatives ...,1,1,1,5.70564,0768fd5a-970b-482f-aa5c-ec1ef5a0d906,c5f0d260-aa2b-4ab3-bdc3-508429074e06
1,How does TESDA collaborate with the NAIC to su...,Thank you for your thoughtful question. Based ...,,TESDA collaborates with the NAIC by supporting...,1,1,1,5.433222,7baaaa67-bf6a-48fd-9255-d6d474fc30d3,2dea2471-7639-479e-8b25-dc6a1333f59b
2,How does TESDA's role in AI regulation relate ...,Thank you for your thoughtful question. From t...,,"TESDA is listed as a member of the NAIC, which...",1,1,1,5.652752,a2141d56-0bcd-4a44-83d0-ce71916c01d0,a8a0436f-3524-477c-8046-67d06d6d5fcb
3,how DOST and NAIC work together in AI regulati...,Thank you for your thoughtful question. From t...,,"The NAIC, which is attached to the DOST, has j...",1,1,1,9.33319,38c5205d-984d-4c7f-96b3-2001897dc841,57be7d9d-48b4-4781-8139-4d9421ecf097
4,How does the regulation of AI responsibility f...,Thank you for your thoughtful question. Based ...,,The regulation of AI responsibility for harms ...,1,1,1,4.489557,db67c379-1fa4-4938-a726-861cd73c7e19,47964008-2975-4263-97db-ba3c15d056ea
5,How do the United Nations' recommendations for...,Thank you for your thoughtful question. From t...,,The context highlights that international effo...,1,1,1,5.487013,33a985ea-db0d-49d3-a7b7-22d7cd6567a1,2abf25db-1383-477d-b5d0-dece432f7ad8
6,How does the need for regulation and oversight...,Thank you for your thoughtful question. It's c...,,The context highlights that while AI offers si...,1,1,1,10.912735,cdcf2c9c-9950-4642-b234-36eec086d17f,951709d5-a3cd-4d29-a8de-5c65a50c48e1
7,How can the development of multimodal models e...,Thank you for your thoughtful question. From t...,,The development of multimodal models can enhan...,1,1,1,7.673174,b8033ddb-c877-49df-acac-bebfecc89e36,29c0abb1-dead-4a86-8f72-48abd4795d36
8,How does the regulation of Artificial General ...,Thank you for your thoughtful question. Based ...,,The context states that policies should promot...,1,1,1,5.787368,fdb75978-1d94-4b8d-b2c2-54e67e231d59,42463c9f-beca-4058-bd5d-51a25fec38db
9,What does the PHILIPPINE CONSTITUTION say abou...,Thank you for your question about what the Phi...,,"The context references Article XIV, Section 10...",0,0,1,2.799059,1f7ddec1-711c-4e11-bca1-351189c11a8d,b2306d1c-746d-4483-ab9b-2167cbc9a440
