# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use case for Ragas' synthetic test set generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging several great technologies for this pipeline!

1. OpenAI's endpoints to handle the synthetic data generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent OS-dependent errors, we'll import NLTK and download the packages it needs to handle our data correctly.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/mmacek/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/mmacek/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set for measuring how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using a `DirectoryLoader` that reads each `.txt` file with `TextLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/HealthWellnessGuide.txt', 'data/MentalHealthGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [7]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [8]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
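
To make the relationship-building step more concrete, here is a minimal sketch of thresholded cosine similarity between node embeddings. It is not the Ragas implementation: the function names, the toy 3-dimensional vectors, and the `0.7` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(embeddings, threshold):
    """Return (node_a, node_b, score) for every pair scoring at or above threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return edges


# Toy vectors standing in for real embeddings (hypothetical values).
toy_embeddings = {
    "sleep_summary": [1.0, 0.0, 0.0],
    "stress_summary": [1.0, 1.0, 0.0],
    "nutrition_summary": [0.0, 0.0, 1.0],
}

edges = build_edges(toy_embeddings, threshold=0.7)
print(edges)
```

Only the sleep/stress pair clears the threshold here, so the toy graph ends up with a single relationship. Raising or lowering the threshold is one way such a graph could become sparser or denser.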

In [9]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so we don't shadow the imported `default_transforms` function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 19)

We can save and load our knowledge graphs as follows.

In [10]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 19)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [11]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [12]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
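
As a rough illustration of what the weights mean, the sketch below turns a list of weights into per-synthesizer question counts for a given `testset_size`, using a largest-remainder rule. This is an assumption for illustration only and is not necessarily how Ragas allocates samples internally; `allocate_counts` is a hypothetical helper.

```python
import math


def allocate_counts(weights, testset_size):
    """Split testset_size across weights using the largest-remainder rule."""
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    # Round away tiny floating-point noise before taking the floor.
    exact = [round(w * testset_size, 9) for w in weights]
    counts = [math.floor(x) for x in exact]
    leftover = testset_size - sum(counts)
    # Hand leftover slots to the largest fractional parts; ties go to earlier entries.
    order = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Weights in the order: single-hop specific, multi-hop abstract, multi-hop specific.
print(allocate_counts([0.5, 0.25, 0.25], 10))
print(allocate_counts([0.30, 0.20, 0.50], 10))
```

With ten questions, a 0.25 weight cannot be honored exactly, so one of the two multi-hop types has to receive the extra question. Expect generated test sets to match the requested ratios only approximately.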

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

#####  Answer:
I think that SingleHopSpecific makes straightforward, fairly specific questions that can be answered from one chunk. MultiHopAbstract makes questions that require combining multiple pieces of info and are more high-level, and MultiHopSpecific also needs multiple hops but aims for a precise answer.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [13]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is NUTRITION AND DIET?,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,NUTRITION AND DIET refers to the fundamentals ...,single_hop_specifc_query_synthesizer
1,How does understanding the habit loop contribu...,[13: The Science of Habit Formation Habits are...,Understanding how habits form can help you bui...,single_hop_specifc_query_synthesizer
2,Tuesday what should I do for mental health?,[The Personal Wellness Guide A Comprehensive R...,The context does not provide specific informat...,single_hop_specifc_query_synthesizer
3,Whaat is the significance of the United States...,[The Mental Health and Psychology Handbook A P...,The context mentions that approximately 40 mil...,single_hop_specifc_query_synthesizer
4,What is COGNITIV BEHAVIORAL THERAPY?,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy is one of the mos...,single_hop_specifc_query_synthesizer
5,How can understanding the signs and strategies...,[<1-hop>\n\n13: The Science of Habit Formation...,Understanding the signs of poor work-life bala...,multi_hop_abstract_query_synthesizer
6,How can building and maintaining social relati...,[<1-hop>\n\n13: The Science of Habit Formation...,Building and maintaining social relationships ...,multi_hop_abstract_query_synthesizer
7,How do the foundational concepts of mental hea...,[<1-hop>\n\nThe Mental Health and Psychology H...,The handbook's foundational concepts of mental...,multi_hop_abstract_query_synthesizer
8,"How can understanding mental health resources,...",[<1-hop>\n\nWrite letters to or from your futu...,Understanding mental health resources like sle...,multi_hop_specific_query_synthesizer
9,how mental health and mental health are connec...,[<1-hop>\n\nWrite letters to or from your futu...,"mental health is about emotional, psychologica...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are some recommended exercises to relieve...,[The Personal Wellness Guide A Comprehensive R...,The guide suggests exercises such as neck roll...,single_hop_specifc_query_synthesizer
1,What does Stage 3 refer to in the context of s...,[PART 3: SLEEP AND RECOVERY Chapter 7: The Sci...,Stage 3 is the deep sleep stage during which t...,single_hop_specifc_query_synthesizer
2,What information does Chapter 19 cover regardi...,[PART 5: BUILDING HEALTHY HABITS Chapter 13: T...,Chapter 19 discusses building healthy habits b...,single_hop_specifc_query_synthesizer
3,In United States how mental health help us?,[The Mental Health and Psychology Handbook A P...,The context explains that mental health in the...,single_hop_specifc_query_synthesizer
4,"How does maintaining proper hydration, as emph...",[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Maintaining proper hydration is crucial becaus...,multi_hop_abstract_query_synthesizer
5,"So like, if I wanna eat balanced meals and stu...",[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The guide explains that a balanced diet includ...,multi_hop_abstract_query_synthesizer
6,Considering the importance of a healthy diet r...,[<1-hop>\n\nhour before bed - No caffeine afte...,Integrating a healthy diet that includes 5 or ...,multi_hop_abstract_query_synthesizer
7,How do sleep deprivation and sleep disorders i...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Sleep deprivation and sleep disorders can nega...,multi_hop_abstract_query_synthesizer
8,How do mental health and stress management rel...,[<1-hop>\n\nThe Mental Health and Psychology H...,The context explains that mental health encomp...,multi_hop_specific_query_synthesizer
9,How can Cognitive Behavioral Therapy for Insom...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Cognitive Behavioral Therapy for Insomnia (CBT...,multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
I would say that the "unrolled" path gives me more control and makes debugging easier, but it's more work and more code to maintain. The "abstracted" path is faster to use and cleaner, but we give up some control over what's generated and why.

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [16]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights

from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.30),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.20),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.50),
]

# Generate a new test set and compare with the default

testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()


Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the Pelvic area in relation to lower b...,[The Personal Wellness Guide A Comprehensive R...,The context discusses exercises for lower back...,single_hop_specifc_query_synthesizer
1,What are some effective ways to improve sleep ...,[PART 3: SLEEP AND RECOVERY Chapter 7: The Sci...,Improving sleep quality involves practicing go...,single_hop_specifc_query_synthesizer
2,Can you tell me about Chapter 15 in the contex...,[PART 5: BUILDING HEALTHY HABITS Chapter 13: T...,Chapter 15: Evening Wind-Down Routines discuss...,single_hop_specifc_query_synthesizer
3,What types of exercise are recommended for hea...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,The Personal Wellness Guide states that the fo...,multi_hop_abstract_query_synthesizer
4,Whay is sleep hygeine and how it helps with sl...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Sleep hygeine refers to habits and practices t...,multi_hop_abstract_query_synthesizer
5,How do chaper 7 and chaper 13 help build healt...,[<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch...,Chaper 13 explains that habits are behaviors t...,multi_hop_specific_query_synthesizer
6,How does mental health influence physical well...,[<1-hop>\n\nThe Mental Health and Psychology H...,Mental health affects physical well-being thro...,multi_hop_specific_query_synthesizer
7,How does Cognitive Behavioral Therapy for Inso...,[<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter...,Cognitive Behavioral Therapy for Insomnia (CBT...,multi_hop_specific_query_synthesizer
8,How can setting healthy boundaries support men...,[<1-hop>\n\nsocial interactions How to set and...,Setting healthy boundaries helps protect emoti...,multi_hop_specific_query_synthesizer
9,How can CBT and mindfulness-based therapies he...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Cognitive Behavioral Therapy (CBT) focuses on ...,multi_hop_specific_query_synthesizer


I bumped MultiHopSpecific up to 0.50 because it stresses the retriever + reasoning more (it's a tougher, more realistic RAG check).
I lowered SingleHopSpecific to 0.30 so the set isn't mostly "easy wins," and kept MultiHopAbstract at 0.20 to still include some high-level synthesis questions without letting them dominate.


Our LangSmith API key and tracing flag were already set in Task 1, so we can move straight on to evaluation.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [17]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-created DataFrame and add each row as an example in our new dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [18]:
# iterrows() yields (index, row) pairs; we only need the row
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
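
The field mapping in the cell above can be checked offline before anything is uploaded. The sketch below is a plain-Python illustration of that mapping: `to_langsmith_example` is a hypothetical helper (not part of the LangSmith SDK) and the sample row values are made up.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row onto the question/answer shape used above."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row, shaped like one record of `dataset.to_pandas()`.
sample_row = {
    "user_input": "What does Stage 3 refer to in the context of sleep?",
    "reference_contexts": ["PART 3: SLEEP AND RECOVERY ..."],
    "reference": "Stage 3 is the deep sleep stage.",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

Keeping the mapping in one small function makes it easy to confirm that every example carries a `question` input and an `answer` output, which the evaluator prompts later rely on.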

## Basic RAG Chain

Time for some RAG!


In [19]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
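
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` and the synthetic 2,000-character text are assumptions for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 2,000-character document (hypothetical stand-in for a guide).
text = "0123456789" * 200

small_chunks = chunk_text(text, chunk_size=500, chunk_overlap=50)
large_chunks = chunk_text(text, chunk_size=1000, chunk_overlap=50)
print(len(small_chunks), len(large_chunks))
```

Smaller chunks give the retriever more, narrower candidates, while larger chunks give it fewer, broader ones. The overlap means the tail of one chunk reappears at the head of the next, which reduces the chance that a sentence is cut in half at a boundary.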

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [21]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [22]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [23]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [24]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [25]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [26]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
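
The LCEL pipe above can be read as four ordinary steps: fetch context, fill the prompt, call the model, extract the text. The sketch below spells those steps out in plain Python with stub components. `stub_retriever`, `stub_llm`, and `run_rag` are hypothetical stand-ins, not LangChain APIs, and the prompt is a shortened copy of the one above.

```python
RAG_PROMPT = (
    "Given a provided context and question, you must answer the question "
    "based only on context.\n\n"
    "Context: {context}\nQuestion: {question}\n"
)

seen_prompts = []


def stub_retriever(question):
    # Stand-in for the Qdrant retriever: always returns the same two chunks.
    return ["Cat-Cow Stretch: 10-15 repetitions.", "Bird Dog: 10 repetitions per side."]


def stub_llm(prompt):
    # Stand-in for the chat model: records the prompt and returns a canned message.
    seen_prompts.append(prompt)
    return {"content": "Try the Cat-Cow Stretch and Bird Dog."}


def run_rag(inputs, retriever, llm):
    question = inputs["question"]                                    # itemgetter("question")
    context = retriever(question)                                    # ... | retriever
    prompt = RAG_PROMPT.format(context=context, question=question)   # rag_prompt
    message = llm(prompt)                                            # llm
    return message["content"]                                        # StrOutputParser()


answer = run_rag(
    {"question": "What are some recommended exercises for lower back pain?"},
    stub_retriever,
    stub_llm,
)
print(answer)
```

Swapping the stubs for a real retriever and model would change only the two callables. The order of operations is the part that mirrors the chain.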

In [27]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

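We'll define OpenAI's GPT-4.1 as an evaluation LLM. Note that the judge evaluators defined below are configured with the `openai:gpt-4o` model string, so `eval_llm` is not what actually scores our runs.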

In [29]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [31]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # "provider:model" string; the `eval_llm` object above is not used here
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

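> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: whether the prediction is correct given the input and the reference answer (1 if correct, 0 if not).
> - `labeled_helpfulness_evaluator`: whether the submission is helpful to the user, taking the reference answer into account.
> - `dopeness_evaluator`: whether the response is "dope, lit, cool" rather than generic. No reference answer is used.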

## LangSmith Evaluation

In [32]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'timely-jelly-44' at:
https://smith.langchain.com/o/7114d29f-9ab4-452b-a0e9-9884160ec1b8/datasets/afad71bc-0e3f-4b29-9398-88d959722e53/compare?selectedSessions=51816892-90a5-4522-bda9-08a5e1e6be2c




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can setting boundaries and managing digita...,Setting boundaries and managing digital mental...,,"Setting healthy boundaries, such as clearly co...",True,True,True,4.954854,8b9dfb2e-5dc1-446e-aa80-ae5cf77be9a3,019c674d-4385-7522-97f2-9dc8f2d3614b
1,How do chapters 7 and 16 connect about sleep a...,"Based on the provided context, Chapter 7 is no...",,Chapter 7 explains that sleep is crucial for h...,False,False,False,3.295085,594d1333-d272-448c-a77e-fd62f029f229,019c674d-7f50-7af1-bd4c-ac15749c513c
2,How can Cognitive Behavioral Therapy for Insom...,Cognitive Behavioral Therapy for Insomnia (CBT...,,Cognitive Behavioral Therapy for Insomnia (CBT...,True,True,False,3.868764,e36d0fed-3696-4fab-a2e3-259cd124c8fd,019c674d-c07f-7b31-be0a-7cf9ab9f6c73
3,How do mental health and stress management rel...,Mental health and stress management are closel...,,The context explains that mental health encomp...,True,True,True,2.859952,4128a3df-a508-43cd-a108-282734963f38,019c674d-ffb4-7a83-8465-9d5ce2f250b6
4,How do sleep deprivation and sleep disorders i...,Based on the provided context:\n\n**Impact of ...,,Sleep deprivation and sleep disorders can nega...,True,True,True,6.006572,6917ff5b-8f6f-4b31-8fee-e0a0a00577b4,019c674e-2ab9-78d2-b916-072f43217ecf
5,Considering the importance of a healthy diet r...,Integrating a healthy diet rich in fruits and ...,,Integrating a healthy diet that includes 5 or ...,True,True,False,4.667634,5c25eebb-8d10-4a9e-bfde-b0b6d127b496,019c674e-7b91-7982-946e-8db7dcd9e393
6,"So like, if I wanna eat balanced meals and stu...","Based on the context provided, the sample day ...",,The guide explains that a balanced diet includ...,True,True,False,4.738214,add9d10b-a56f-4536-9224-b5941ece8987,019c674e-cf83-7cb3-a76d-03fe39593619
7,"How does maintaining proper hydration, as emph...",Maintaining proper hydration supports virtuall...,,Maintaining proper hydration is crucial becaus...,True,True,False,2.066828,ed677987-fe84-4fd2-8440-97219a671b4c,019c674f-09d6-77c3-8b5a-4ff4007aa360
8,In United States how mental health help us?,I don't know.,,The context explains that mental health in the...,False,False,False,1.008523,76c57057-d15f-4879-b5b7-5ef4cf4f61ba,019c674f-4d88-7602-970b-e1f39af780fd
9,What information does Chapter 19 cover regardi...,Chapter 19 covers maintaining balance between ...,,Chapter 19 discusses building healthy habits b...,False,False,False,1.889404,80387308-cb20-4fcf-9d2a-cf66acba40f7,019c674f-82f7-77a1-a3e4-eb1ece90e550
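
To summarize the table above, the sketch below recomputes per-evaluator pass rates from the ten `feedback.*` columns, copied by hand in row order. `pass_rates` is a hypothetical helper, not a LangSmith function.

```python
# (qa, helpfulness, dopeness) for rows 0-9 of the baseline run above.
baseline_feedback = [
    (True, True, True),
    (False, False, False),
    (True, True, False),
    (True, True, True),
    (True, True, True),
    (True, True, False),
    (True, True, False),
    (True, True, False),
    (False, False, False),
    (False, False, False),
]
FEEDBACK_KEYS = ("qa", "helpfulness", "dopeness")


def pass_rates(rows, keys):
    """Fraction of rows where each evaluator returned True."""
    totals = {key: 0 for key in keys}
    for row in rows:
        for key, passed in zip(keys, row):
            totals[key] += int(passed)
    return {key: totals[key] / len(rows) for key in keys}


rates = pass_rates(baseline_feedback, FEEDBACK_KEYS)
print(rates)
```

With only ten examples, a single flipped judgment moves a rate by ten percentage points, so these numbers are best read as directional signals rather than precise scores.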


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

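- Include a "dope" prompt augmentation
- Use larger chunks (`chunk_size` of 1000 instead of 500)
- Upgrade the embedding model to `text-embedding-3-large`
- Note that the new retriever is also created without `search_kwargs={"k": 10}`, so it falls back to its default `k`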

Let's see how this changes our evaluation!

In [33]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [35]:
rag_documents = docs

In [36]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
I think that chunk size changes what the retriever can see: small chunks can miss important context, and big chunks can add noise. That directly affects how relevant my retrieved context is, so my answers and eval scores shift.

In [37]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
Different embedding models "understand" meaning differently, so they may search my documents more or less accurately. I think that better semantic embeddings usually retrieve better context, which then improves answer quality and evaluation metrics.

In [38]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [39]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [40]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a sample question, as we did with the first chain.

In [41]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Alright, champ, let’s turn your sleep game from meh to legendary! Based on the dopest tips from the health scrolls:\n\n1. **Lock down a sleep schedule like clockwork.** Hit the sack and rise around the same time every day - even on weekends - so your body vibes with a consistent rhythm.\n\n2. **Create a chill bedtime ritual.** Think reading a killer book, gentle stretches, or soaking in a warm bath to kick stress to the curb.\n\n3. **Set the scene like a sleep sanctuary:** Keep your room cool (65-68°F, aka 18-20°C), dark as a cave (blackout curtains or sleep mask FTW), and quiet (white noise machines or earplugs are your sidekicks).\n\n4. **Cut the screens way before lights out.** Ditch that blue light 1-2 hours ahead to let your melatonin flow like a boss.\n\n5. **Avoid caffeine post 2 PM.** Otherwise, Joe’s energy hits might keep your brain buzzing at bedtime.\n\n6. **Move your body regularly but not too late.** Exercise is like magic dust for sleep, just don’t torch yoursel

Finally, we can evaluate the new chain on the same test set!

In [42]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'flowery-shake-3' at:
https://smith.langchain.com/o/7114d29f-9ab4-452b-a0e9-9884160ec1b8/datasets/afad71bc-0e3f-4b29-9398-88d959722e53/compare?selectedSessions=f0d9ef29-5e1a-404a-8b7e-060489a2b25d




0it [00:00, ?it/s]

| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can setting boundaries and managing digita... | Yo, here’s the lowdown on how setting boundari... | | Setting healthy boundaries, such as clearly co... | True | True | True | 5.227669 | 8b9dfb2e-5dc1-446e-aa80-ae5cf77be9a3 | 019c6757-90f2-7d90-b4c1-3facb3518056 |
| 1 | How do chapters 7 and 16 connect about sleep a... | Alright, let’s break down this brain-buzzing s... | | Chapter 7 explains that sleep is crucial for h... | True | True | True | 4.739577 | 594d1333-d272-448c-a77e-fd62f029f229 | 019c6757-da2e-7b43-b084-613b3d0be09f |
| 2 | How can Cognitive Behavioral Therapy for Insom... | Oh, buckle up because CBT-I is like the ultima... | | Cognitive Behavioral Therapy for Insomnia (CBT... | True | True | True | 7.989878 | e36d0fed-3696-4fab-a2e3-259cd124c8fd | 019c6758-2122-76c1-b2ad-69ee99c24ae7 |
| 3 | How do mental health and stress management rel... | Alright, strap in because mental health and st... | | The context explains that mental health encomp... | True | True | True | 5.3372 | 4128a3df-a508-43cd-a108-282734963f38 | 019c6758-735a-7980-a2f9-42919aceee5c |
| 4 | How do sleep deprivation and sleep disorders i... | Alright, buckle up for the ultimate crash cour... | | Sleep deprivation and sleep disorders can nega... | True | True | True | 7.403042 | 6917ff5b-8f6f-4b31-8fee-e0a0a00577b4 | 019c6758-bd87-7b31-b665-7550a113383e |
| 5 | Considering the importance of a healthy diet r... | Alright, here’s the ultimate wellness remix fo... | | Integrating a healthy diet that includes 5 or ... | True | True | True | 5.504969 | 5c25eebb-8d10-4a9e-bfde-b0b6d127b496 | 019c6759-15a7-7ac1-860f-99d92ec21823 |
| 6 | So like, if I wanna eat balanced meals and stu... | Alright, let’s crank that balanced meal game t... | | The guide explains that a balanced diet includ... | True | True | True | 7.137865 | add9d10b-a56f-4536-9224-b5941ece8987 | 019c6759-551f-76c1-8c92-8787ad0007dc |
| 7 | How does maintaining proper hydration, as emph... | Yo, hydration is straight-up the VIP pass for ... | | Maintaining proper hydration is crucial becaus... | True | True | True | 4.602548 | ed677987-fe84-4fd2-8440-97219a671b4c | 019c6759-9627-7932-a4e1-122b70c50128 |
| 8 | In United States how mental health help us? | Alright, let’s dial into the mental health vib... | | The context explains that mental health in the... | True | True | True | 4.398407 | 76c57057-d15f-4879-b5b7-5ef4cf4f61ba | 019c6759-e590-7960-8a4b-229f9a8fdf26 |
| 9 | What information does Chapter 19 cover regardi... | Alright, here’s the 411 on Chapter 19 from the... | | Chapter 19 discusses building healthy habits b... | False | False | True | 5.687709 | 80387308-cb20-4fcf-9d2a-cf66acba40f7 | 019c675a-289d-7202-9beb-321be6fe9197 |
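
The per-example results above are easier to compare across experiments once they are reduced to a few numbers. The following is a minimal sketch in plain Python: the rows are copied by hand from the `feedback.*` and `execution_time` columns above, and the helper name `summarize` is an assumption made for illustration, not part of LangSmith.

```python
from statistics import mean

# (qa, helpfulness, dopeness, execution_time) copied from the results above,
# one tuple per example, in the same order (0 through 9).
rows = [
    (True, True, True, 5.227669),
    (True, True, True, 4.739577),
    (True, True, True, 7.989878),
    (True, True, True, 5.3372),
    (True, True, True, 7.403042),
    (True, True, True, 5.504969),
    (True, True, True, 7.137865),
    (True, True, True, 4.602548),
    (True, True, True, 4.398407),
    (False, False, True, 5.687709),
]


def summarize(rows):
    """Return the pass rate per evaluator and the mean execution time in seconds."""
    n = len(rows)
    return {
        "qa": sum(r[0] for r in rows) / n,
        "helpfulness": sum(r[1] for r in rows) / n,
        "dopeness": sum(r[2] for r in rows) / n,
        "mean_execution_time": mean(r[3] for r in rows),
    }


summary = summarize(rows)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

On this run, only example 9 (the Chapter 19 question) fails the QA and helpfulness evaluators, so those pass rates come out to 0.9 while dopeness is 1.0.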


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
![Screenshot](./screenshot.png)


The new chain clearly improved quality scores like dopeness, helpfulness, and QA because the answers are more detailed and confident. However, it also increased latency, token usage, and cost since the responses are longer and more elaborate. Overall, it’s a simple tradeoff: better answers, but slightly more expensive and slower.
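
One way to back up an explanation like this with numbers is to compute per-metric deltas between the two experiments. The sketch below is only an illustration: `compare_experiments` is a hypothetical helper, and the two dictionaries passed to it hold placeholder values, not results from this notebook. Substitute the summaries of your own baseline and dopeness runs.

```python
def compare_experiments(baseline, candidate):
    """Return candidate minus baseline for every metric present in both summaries."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {metric: candidate[metric] - baseline[metric] for metric in shared}


# Placeholder summaries (hypothetical numbers, NOT taken from this notebook).
baseline_summary = {"qa": 0.5, "helpfulness": 0.5, "mean_execution_time": 2.0}
candidate_summary = {
    "qa": 0.75,
    "helpfulness": 0.5,
    "dopeness": 1.0,  # only scored in the candidate run, so it is skipped
    "mean_execution_time": 3.5,
}

deltas = compare_experiments(baseline_summary, candidate_summary)
for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.2f}")
```

Metrics that were only scored in one experiment are left out of the comparison, since there is nothing to subtract them from.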

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores