# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging several great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent OS-dependent errors, we'll import NLTK and download the packages needed to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/manikaranam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/manikaranam/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps RAGAS build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using a `DirectoryLoader` that reads each file with the `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N nodes connected by M relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
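To make the relationship-building step more concrete, here is a minimal sketch of linking nodes by cosine similarity. It is not Ragas' implementation: the node names, the toy 3-dimensional vectors, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold):
    """Link every pair of nodes whose similarity meets the threshold."""
    edges = []
    for left, right in combinations(embeddings, 2):
        score = cosine_similarity(embeddings[left], embeddings[right])
        if score >= threshold:
            edges.append((left, right, score))
    return edges


# Toy vectors standing in for real embeddings (illustrative values only).
toy_embeddings = {
    "sleep_summary": [0.9, 0.1, 0.0],
    "stress_summary": [0.8, 0.3, 0.1],
    "nutrition_summary": [0.0, 0.2, 0.9],
}

edges = build_relationships(toy_embeddings, threshold=0.8)  # hypothetical threshold
for left, right, score in edges:
    print(f"{left} <-> {right}: {score:.3f}")
```

Only the sleep and stress nodes point in a similar direction, so they are the only pair that gets an edge. Raising or lowering the threshold trades off graph density against relationship quality.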

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

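# Use a distinct name so the imported `default_transforms` function is not shadowed
# (shadowing it would make this cell fail when re-run).
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)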
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 23)

We can save and load our knowledge graphs as follows.

In [15]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 23)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [None]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
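The weights in a query distribution are proportions of the final test set. As a rough sanity check, the sketch below converts weights into expected sample counts. The helper `expected_counts` and the short labels are hypothetical and not part of Ragas; the library has to produce whole-number counts, so real results can differ slightly from these proportions.

```python
def expected_counts(weights, testset_size):
    """Return the expected number of samples per query type."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights should sum to 1.0, got {total}")
    return {name: weight * testset_size for name, weight in weights.items()}


# Same proportions as the query_distribution above (labels are illustrative).
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}

counts = expected_counts(weights, testset_size=10)
for name, count in counts.items():
    print(f"{name}: ~{count} samples")
```

With a test set of 10, two of the three types land on a fractional expectation (2.5), which is why the generated set cannot match the weights exactly.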

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
1. SingleHopSpecificQuerySynthesizer - It looks at one document and asks a direct question about a detail found there.
2. MultiHopSpecificQuerySynthesizer - It requires the system to find two or more related facts across different documents to get the answer. It links specific details together and asks a connected question.
3. MultiHopAbstractQuerySynthesizer - It looks at multiple pieces of information, but instead of asking for a fact, it asks for a high-level takeaway.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [18]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What role does mental health play in overall w...,[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,Who is Jon Kabat-Zinn in the context of mindfu...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Jon Kabat-Zinn developed Mindfulness-Based Str...,single_hop_specifc_query_synthesizer
2,How does Vitamin D influence mental health acc...,[Write letters to or from your future self Jou...,"Vitamin D, obtained from sunlight and fortifie...",single_hop_specifc_query_synthesizer
3,How does FOMO influence social media behaviors...,[social interactions How to set and maintain b...,Fear of missing out (FOMO) drives compulsive c...,single_hop_specifc_query_synthesizer
4,What is the purpose of the Cat-Cow Stretch in ...,[The Personal Wellness Guide A Comprehensive R...,The Cat-Cow Stretch is recommended for lower b...,single_hop_specifc_query_synthesizer
5,how building morning and evening routines impa...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Building healthy routines like morning wake-up...,multi_hop_abstract_query_synthesizer
6,How can incorporating self-care practices like...,[<1-hop>\n\nWrite letters to or from your futu...,The context highlights that engaging in self-c...,multi_hop_abstract_query_synthesizer
7,How can natural headache remedies and sleep im...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Natural headache remedies such as drinking wat...,multi_hop_abstract_query_synthesizer
8,How does Chapter 16's discussion on managing d...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Chapter 16 emphasizes the importance of mindfu...,multi_hop_specific_query_synthesizer
9,how CBT and CBT-I help mental health and stress,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,The context explains that CBT (Cognitive Behav...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [19]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [20]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"like, what is the psychology handbook thingy f...",[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
1,What is Chapter 12 about?,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,"Chapter 12 discusses sleep and mental health, ...",single_hop_specifc_query_synthesizer
2,How does Dialectical Behavior Therapy (DBT) in...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Dialectical Behavior Therapy (DBT) combines CB...,single_hop_specifc_query_synthesizer
3,What role do B vitamins play in mental health?,[- Sleep restriction: Limiting time in bed to ...,"B vitamins are found in whole grains, eggs, an...",single_hop_specifc_query_synthesizer
4,How can managing stress through sleep and mind...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Managing stress through sleep and mindfulness ...,multi_hop_abstract_query_synthesizer
5,how can setting and maintaining healthy bounda...,[<1-hop>\n\n- Sleep restriction: Limiting time...,"Setting and maintaining healthy boundaries, su...",multi_hop_abstract_query_synthesizer
6,How does engaging in regular exercise and move...,[<1-hop>\n\nThe Mental Health and Psychology H...,"Engaging in regular exercise and movement, as ...",multi_hop_abstract_query_synthesizer
7,How can setting and maintaining healthy bounda...,[<1-hop>\n\n- Sleep restriction: Limiting time...,"Setting and maintaining healthy boundaries, in...",multi_hop_abstract_query_synthesizer
8,Considering the importance of developing emoti...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,"Practicing emotional regulation strategies, as...",multi_hop_specific_query_synthesizer
9,"How does Cognitive Behavioral Therapy, particu...",[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,"Cognitive Behavioral Therapy (CBT), especially...",multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
Abstracted approach:
- Faster: a single function call. Uses standard defaults for chunking and extraction. Limited customization. Higher cost when re-running on the same data, since the knowledge graph is rebuilt each time.

Unrolled KG approach:
- Slower to set up, with manual overhead for creating nodes and applying transformations. Gives more control over nodes, relationships, and metadata, plus more customization options. The graph can be saved with `kg.save` and reloaded to avoid re-running costs.

When to choose the abstracted approach: prototyping, simple and straightforward document sets, or one-off generation.

When to use the unrolled approach: production workflows, domain-specific data, and cases that need a balanced distribution of tests.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [23]:
# Define a custom query distribution with different weights
custom_query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.2),  # 20% Simple
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.4),  # 40% Reasoning
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.4),  # 40% Linking
]

# Generate a new test set and compare with the default
custom_testset = generator.generate(
    testset_size=10, 
    query_distribution=custom_query_distribution
)

custom_testset.to_pandas()

# 3. Compare and 4. Explain
# I chose a 20/40/40 split because complex RAG systems often struggle more 
# with multi-step reasoning than simple retrieval. This distribution 
# forces the system to prove it can "hop" between documents effectively.

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Whaat is the impportance of mental health in h...,[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,What is the significance of Stanford in the co...,[PART 3: BUILDING RESILIENCE Chapter 7: What I...,The context discusses Stanford psychologist Ca...,single_hop_specifc_query_synthesizer
2,how sleep importance for health and cognition ...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,"Sleep is crucial for physical health, mental w...",multi_hop_abstract_query_synthesizer
3,How can managing digital wellness and setting ...,[<1-hop>\n\n- Sleep restriction: Limiting time...,Managing digital wellness by setting intention...,multi_hop_abstract_query_synthesizer
4,How does engaging in regular exercise and move...,[<1-hop>\n\nThe Mental Health and Psychology H...,Engaging in regular exercise and movement can ...,multi_hop_abstract_query_synthesizer
5,How can I build moring routines for wellness a...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,To build morning routines for wellness and pro...,multi_hop_abstract_query_synthesizer
6,how cognitive behavioral therapy and cognitive...,[<1-hop>\n\n- Sleep restriction: Limiting time...,The context discusses sleep strategies such as...,multi_hop_specific_query_synthesizer
7,how chapter 12 and chapter 8 relate to buildin...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,"chapter 12 talks about sleep and recovery, exp...",multi_hop_specific_query_synthesizer
8,how CBT for insomnia help sleep and how CBT he...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,CBT for insomnia (CBT-I) helps sleep by changi...,multi_hop_specific_query_synthesizer
9,How do Chapters 11 and 17 together inform a ho...,[<1-hop>\n\nPART 3: BUILDING RESILIENCE Chapte...,Chapter 11 discusses building resilience by de...,multi_hop_specific_query_synthesizer
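One way to handle the comparison requirement is to tally the `synthesizer_name` column of each test set. The sketch below hard-codes the names from the two runs shown in this notebook (the unrolled run with the 0.5/0.25/0.25 distribution and the custom 20/40/40 run); the helper `share_by_synthesizer` is a hypothetical name introduced for illustration.

```python
from collections import Counter

SINGLE = "single_hop_specifc_query_synthesizer"  # spelled as it appears in the output
ABSTRACT = "multi_hop_abstract_query_synthesizer"
SPECIFIC = "multi_hop_specific_query_synthesizer"

# Synthesizer names copied from the two result tables in this notebook.
baseline_names = [SINGLE] * 5 + [ABSTRACT] * 3 + [SPECIFIC] * 2
custom_names = [SINGLE] * 2 + [ABSTRACT] * 4 + [SPECIFIC] * 4


def share_by_synthesizer(names):
    """Return the fraction of samples produced by each synthesizer."""
    counts = Counter(names)
    return {name: count / len(names) for name, count in counts.items()}


baseline_share = share_by_synthesizer(baseline_names)
custom_share = share_by_synthesizer(custom_names)

for name in (SINGLE, ABSTRACT, SPECIFIC):
    print(f"{name}: {baseline_share[name]:.0%} -> {custom_share[name]:.0%}")
```

On a live test set the same tally can be read from the DataFrame with `custom_testset.to_pandas()["synthesizer_name"].value_counts(normalize=True)`.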


The next section relies on the LangSmith API key and the tracing flag (set to "true") that we configured in Task 1.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [24]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [25]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
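The mapping from a Ragas row to a LangSmith example can be isolated in a small pure function, which makes it easy to check before any API call is made. This is a sketch only: `to_langsmith_example` is a hypothetical helper, and the sample row uses placeholder text rather than real generated data.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto the question/answer shape used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Placeholder row with the same column names as the Ragas DataFrame.
sample_row = {
    "user_input": "What is Chapter 12 about?",
    "reference": "<reference answer text>",
    "reference_contexts": ["<reference context text>"],
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"])
print(example["outputs"])
```

The resulting dictionary has the same three keys passed to `client.create_example` in the loop above, so it could be unpacked into that call alongside `dataset_id`.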

## Basic RAG Chain

Time for some RAG!


In [27]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [28]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [29]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [30]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [31]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [32]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [33]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [34]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
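For readers new to LCEL, the data flow of the chain above can be sketched in plain Python: fetch context for the question, fill the prompt, call the model, return a string. The retriever and model below are hypothetical stubs (`stub_retriever`, `stub_llm`) so the sketch runs without API keys; it mirrors the flow, not LangChain's internals.

```python
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\n"


def stub_retriever(question):
    """Hypothetical stand-in for the vector store retriever."""
    return ["Cat-Cow Stretch: alternate between arching and sagging the back."]


def stub_llm(prompt):
    """Hypothetical stand-in for the chat model: echoes the prompt it receives."""
    return f"[model saw]\n{prompt}"


def run_rag(inputs, retriever, llm, template=PROMPT_TEMPLATE):
    """Retrieve context, build the prompt, and return the model's text."""
    question = inputs["question"]
    context = retriever(question)  # same role as itemgetter("question") | retriever
    prompt = template.format(context=context, question=question)
    return llm(prompt)


answer = run_rag({"question": "What helps lower back pain?"}, stub_retriever, stub_llm)
print(answer)
```

Note that the question is used twice: once to drive retrieval and once verbatim in the prompt, which is why the LCEL dictionary has two `itemgetter("question")` entries.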

In [35]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

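We'll instantiate OpenAI's GPT-4.1 as an evaluation LLM. Note that the evaluators defined below are configured with the `openai:gpt-4o` model string, so they do not use `eval_llm`.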

In [37]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of LLM-as-judge evaluators built with `openevals`: two that compare against the reference answer, and one custom criterion!

In [38]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`:
> - `labeled_helpfulness_evaluator`:
> - `dopeness_evaluator`:
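Not every evaluator needs an LLM judge. As a sketch, the function below flags runs where the chain abstained with "I don't know". It assumes that LangSmith's `evaluate` accepts plain functions taking an `outputs` dictionary and returning a `key`/`score` dictionary, and that the chain's string result is stored under the `output` key (as the `outputs.output` column in the results suggests). The name `abstention_evaluator` is hypothetical.

```python
def abstention_evaluator(outputs: dict) -> dict:
    """Score True when the answer is exactly an "I don't know" abstention."""
    answer = str(outputs.get("output", "")).strip().lower()
    abstained = answer.rstrip(".") == "i don't know"
    return {"key": "abstained", "score": abstained}


# Quick local checks before wiring the function into an evaluation run.
print(abstention_evaluator({"output": "I don't know."}))
print(abstention_evaluator({"output": "B vitamins are essential."}))
```

A deterministic check like this is cheap to run and helps separate retrieval misses (the chain abstains) from wrong answers (the chain answers incorrectly).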

## LangSmith Evaluation

In [39]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'tart-fruit-17' at:
https://smith.langchain.com/o/ef4b4a58-573b-47d9-880a-61c634d77ebe/datasets/571db3ce-5804-46e5-b971-0d246cf8e45c/compare?selectedSessions=97d1eff6-2116-429a-bd7e-eac679ebc344




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,Chapter 11 and 14 how they help mental health ...,Based on the provided context:\n\n- Chapter 11...,,Chapter 11 explains that exercise improves men...,False,True,False,2.689079,7262d585-7465-413f-aca6-4cbf8038b3d5,019c6eb3-3ed3-7c41-bd1e-192e3f97c180
1,Tell me about chapter 8 and 14 how they help w...,Chapter 8 focuses on improving sleep quality b...,,"Chapter 8 talks about building healthy habits,...",False,True,True,3.783401,0084e3d8-ad86-40b5-b600-288339a0f563,019c6eb3-8e17-7cc0-b789-dbdf79cf39cd
2,"How does Cognitive Behavioral Therapy, particu...",I don't know.,,"Cognitive Behavioral Therapy (CBT), especially...",False,False,False,0.712362,a320e259-a42a-44f9-9098-5c523b229412,019c6eb3-da14-79c1-aaa4-70e625b55ab0
3,Considering the importance of developing emoti...,Practicing emotional regulation strategies con...,,"Practicing emotional regulation strategies, as...",True,True,True,3.00842,78f4a1cd-09b9-467d-a3f6-4ff86e0fb635,019c6eb4-05ae-73b3-ba48-6d3a5cd352c1
4,How can setting and maintaining healthy bounda...,Setting and maintaining healthy boundaries—emo...,,"Setting and maintaining healthy boundaries, in...",True,True,True,4.334897,4203b368-91ac-4c23-a929-37525403dfd2,019c6eb4-4695-7252-86e8-41d25270c8ec
5,How does engaging in regular exercise and move...,Engaging in regular exercise positively influe...,,"Engaging in regular exercise and movement, as ...",True,True,True,3.700917,43e17369-2b7c-4c54-9d41-794657e928c4,019c6eb4-7469-7403-9c6d-61b61aea4abe
6,how can setting and maintaining healthy bounda...,"Setting and maintaining healthy boundaries, in...",,"Setting and maintaining healthy boundaries, su...",True,True,True,2.776386,6844d19f-0fd2-4b02-a224-6a55df2e8468,019c6eb4-b404-7a50-842d-fee6a686a8f3
7,How can managing stress through sleep and mind...,Managing stress through sleep and mindfulness ...,,Managing stress through sleep and mindfulness ...,True,True,False,2.614425,8b7ee516-18c5-46ff-a44a-5f2bab37ee9a,019c6eb4-db0f-7100-8067-47e1ff0d0dfd
8,What role do B vitamins play in mental health?,B vitamins are essential for neurotransmitter ...,,"B vitamins are found in whole grains, eggs, an...",True,True,False,2.399834,b6c2fa82-cfc2-4d8c-8b74-94fe1c66a3de,019c6eb5-1348-76b0-8c11-9b35024e3f3f
9,How does Dialectical Behavior Therapy (DBT) in...,Dialectical Behavior Therapy (DBT) incorporate...,,Dialectical Behavior Therapy (DBT) combines CB...,True,True,False,1.969878,13a20ac3-c1fe-4637-9f4c-83c82e6d1d01,019c6eb5-41bc-7b62-99a8-36026c5b7574
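To compare runs, it helps to reduce each feedback column to a pass rate. The sketch below uses the boolean values copied from the ten rows above; `pass_rate` is a hypothetical helper name.

```python
# Feedback values copied row by row from the results table above.
feedback = {
    "qa": [False, False, False, True, True, True, True, True, True, True],
    "helpfulness": [True, True, False, True, True, True, True, True, True, True],
    "dopeness": [False, True, False, True, True, True, True, False, False, False],
}


def pass_rate(values):
    """Fraction of examples the judge marked as passing."""
    return sum(values) / len(values)


rates = {name: pass_rate(values) for name, values in feedback.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

With only ten examples, each example moves a rate by ten percentage points, so these numbers are best read as directional signals rather than precise measurements.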


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to: `text-embedding-3-large`

Let's see how this changes our evaluation!

In [40]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [41]:
rag_documents = docs

In [42]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
- Small chunks: each chunk usually represents a single, clear idea, making it "sharper" in the vector space. This lets the retriever pinpoint very specific facts.
- Large chunks: provide more surrounding context, which helps with the multi-hop abstract queries we are synthesizing.
- Large chunks: blend several ideas into one embedding, which can make it harder for the retriever to match the exact snippet; relevant details can also get "lost in the middle" of a long context.
- Large chunks: increase API costs and response time, because more tokens are sent to the LLM.
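The effect of chunk size on the number of chunks can be seen with a simple fixed-window splitter. This is a simplified sketch, not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to split on separators such as paragraphs); `sliding_window_chunks` and the 5,000-character sample text are assumptions for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample_text = "x" * 5000  # stand-in for a document of 5,000 characters

small_chunks = sliding_window_chunks(sample_text, chunk_size=500, chunk_overlap=50)
large_chunks = sliding_window_chunks(sample_text, chunk_size=1000, chunk_overlap=50)

print(f"chunk_size=500  -> {len(small_chunks)} chunks")
print(f"chunk_size=1000 -> {len(large_chunks)} chunks")
```

Doubling the chunk size roughly halves the number of vectors in the store, while each retrieved chunk carries about twice as much text into the prompt.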

In [43]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
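- Generic vs. domain-specific: how well the model's training matches our domain
- Retrieval quality: different models rank the same chunks differently for a given query
- Embedding dimensions: the size of the vectors the model produces
- Semantic meaning: how well the model captures the meaning of the text it embeds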

In [49]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [45]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [46]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a sample question.

In [47]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, ready to level up your sleep game? Here’s the ultimate playbook straight from the sleep sages of the Health and Mental Wellness realm:\n\n**1. Lock in a consistent sleep schedule** — Yes, even on weekends. Your body loves rhythm like a DJ loves beats. Sync up!\n\n**2. Craft a chill bedtime routine** — Picture yourself reading a dope book, stretching those muscles gently, or soaking in a warm bath. All vibes, no stress.\n\n**3. Environment is king** — Set your throne (bedroom) cool (65-68°F/18-20°C), dark (blackout curtains or sleep masks for the win), and quiet (white noise machines or earplugs are your knights).\n\n**4. Screen detox** — Power down those glowing screens at least 1–2 hours before bed. Your brain’s gotta switch from hustle mode to rest mode.\n\n**5. Caffeine cutoff** — No caffeine after 2 PM, or you’ll be wired like a live wire.\n\n**6. Move that body, but not too late** — Exercise fuels good sleep, but don’t make your muscles party righ

Finally, we can evaluate the new chain on the same test set!

In [48]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'brief-moon-19' at:
https://smith.langchain.com/o/ef4b4a58-573b-47d9-880a-61c634d77ebe/datasets/571db3ce-5804-46e5-b971-0d246cf8e45c/compare?selectedSessions=762e1fde-1cd4-464a-a053-c4e1e9325982





| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chapter 11 and 14 how they help mental health ... | Alright, let’s crank this up to eleven! Chapte... | | Chapter 11 explains that exercise improves men... | True | True | True | 6.425646 | 7262d585-7465-413f-aca6-4cbf8038b3d5 | 019c6ec5-5bcc-75f3-9a6e-62a51531f837 |
| 1 | Tell me about chapter 8 and 14 how they help w... | Yo, let’s dive into the magic of Chapters 8 an... | | Chapter 8 talks about building healthy habits,... | True | True | True | 4.775073 | 0084e3d8-ad86-40b5-b600-288339a0f563 | 019c6ec5-979e-7e83-911a-023cbae87461 |
| 2 | How does Cognitive Behavioral Therapy, particu... | Alright, let’s dive into this brainwave cockta... | | Cognitive Behavioral Therapy (CBT), especially... | True | True | True | 5.703913 | a320e259-a42a-44f9-9098-5c523b229412 | 019c6ec5-ce33-79f0-bf08-3ffc7cd70c55 |
| 3 | Considering the importance of developing emoti... | Oh, this is pure mental mastery territory - list... | | Practicing emotional regulation strategies, as... | True | True | True | 7.190113 | 78f4a1cd-09b9-467d-a3f6-4ff86e0fb635 | 019c6ec5-fe67-7931-aad6-2ce665a929f3 |
| 4 | How can setting and maintaining healthy bounda... | Alright, let’s dive into the mind-melting magi... | | Setting and maintaining healthy boundaries, in... | True | True | True | 6.014535 | 4203b368-91ac-4c23-a929-37525403dfd2 | 019c6ec6-41b6-7382-8f73-0fd194d442e9 |
| 5 | How does engaging in regular exercise and move... | Alright, strap in for a mind-body blast! 🌟\n\n... | | Engaging in regular exercise and movement, as ... | True | True | True | 3.991839 | 43e17369-2b7c-4c54-9d41-794657e928c4 | 019c6ec6-78e4-7293-8308-543f45e9c140 |
| 6 | how can setting and maintaining healthy bounda... | Alright, let’s crank this up to max dopeness! ... | | Setting and maintaining healthy boundaries, su... | True | True | True | 5.425452 | 6844d19f-0fd2-4b02-a224-6a55df2e8468 | 019c6ec6-a281-7ab2-9c7a-5b4ab137e8c9 |
| 7 | How can managing stress through sleep and mind... | Oh, you just tapped into the ultimate stress-b... | | Managing stress through sleep and mindfulness ... | True | True | True | 4.916185 | 8b7ee516-18c5-46ff-a44a-5f2bab37ee9a | 019c6ec6-d81a-7622-88bf-c51bb3991661 |
| 8 | What role do B vitamins play in mental health? | Alright, buckle up for this vitamin-powered me... | | B vitamins are found in whole grains, eggs, an... | True | True | True | 2.663302 | b6c2fa82-cfc2-4d8c-8b74-94fe1c66a3de | 019c6ec7-05d0-7970-91c7-374d1f0d06bd |
| 9 | How does Dialectical Behavior Therapy (DBT) in... | Alright, buckle up for this mental health ride... | | Dialectical Behavior Therapy (DBT) combines CB... | True | True | True | 3.040607 | 13a20ac3-c1fe-4637-9f4c-83c82e6d1d01 | 019c6ec7-294f-77f1-bc37-ca2e4e5a1efd |
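
Per-row feedback is easier to compare across experiments once it is reduced to a few numbers. The sketch below is one way to do that with the standard library, using the three feedback columns and `execution_time` values copied from the ten rows shown above. `rows`, `pass_rate`, and `summary` are hypothetical names, not part of the LangSmith API.

```python
from statistics import mean

# (feedback.qa, feedback.helpfulness, feedback.dopeness, execution_time) per row,
# copied from the ten result rows shown above.
rows = [
    (True, True, True, 6.425646),
    (True, True, True, 4.775073),
    (True, True, True, 5.703913),
    (True, True, True, 7.190113),
    (True, True, True, 6.014535),
    (True, True, True, 3.991839),
    (True, True, True, 5.425452),
    (True, True, True, 4.916185),
    (True, True, True, 2.663302),
    (True, True, True, 3.040607),
]

def pass_rate(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)

summary = {
    "qa": pass_rate(r[0] for r in rows),
    "helpfulness": pass_rate(r[1] for r in rows),
    "dopeness": pass_rate(r[2] for r in rows),
    "mean_execution_time": mean(r[3] for r in rows),
}
print(summary)
```

When every displayed row passes every evaluator, the pass rates alone cannot separate two chains on these rows, so latency and a direct read of the outputs carry more of the comparison.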


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
/data/brief-moon-19 - LangSmith.pdf

/data/tart-fruit-17 - Datasets & Experiments - LangSmith.pdf

tart-fruit-17 (chunk size: 500 characters)
- Smaller, more granular chunks
- Responses are direct, factual, and neutral
- Row 11 shows an "I don't know" response followed by a factual statement

brief-moon-19 (chunk size: 1,000 characters)
- Prompt optimised for dopeness
- Responses are styled and conversational
- The model is more verbose when it refuses or explains. The larger chunk size in this second chain likely let the retriever capture the specific mention of "B vitamins" more effectively, or the revised prompt encouraged the model to "dig deeper" into the provided context rather than default to a quick refusal.
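
To make the chunk-size reasoning concrete, the sketch below splits a placeholder document at both sizes and compares how much text a fixed number of retrieved chunks can carry. It assumes a naive fixed-width splitter (`split_fixed`, a hypothetical helper, not the splitter used in the notebook), a made-up 3,000-character document, and `k = 4` retrieved chunks.

```python
def split_fixed(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Hypothetical 3,000-character document (placeholder content only).
document = "x" * 3000

small_chunks = split_fixed(document, 500)
large_chunks = split_fixed(document, 1000)

k = 4  # assumed number of chunks the retriever returns per query

# Upper bound on the characters of context the prompt can receive.
small_context = sum(len(chunk) for chunk in small_chunks[:k])
large_context = sum(len(chunk) for chunk in large_chunks[:k])

print(len(small_chunks), len(large_chunks))
print(small_context, large_context)
```

Under these assumptions, the 500-character setting yields more, finer-grained chunks but less total context per query. That is consistent with a chain that misses a detail at one size and finds it at the other.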


---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores