# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In this notebook, we'll explore a use case for Ragas' synthetic test set generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vector store
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/stevegoodman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/stevegoodman/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [4]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps Ragas build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`, which applies a `TextLoader` to every `.txt` file in `data/`.

In [7]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [9]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [10]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
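
To make the relationship-building idea concrete, here is a minimal sketch of cosine similarity between embedding vectors, with an edge added whenever the score clears a threshold. The toy vectors, the node names, the `0.9` threshold, and the `build_relationships` helper are all assumptions for illustration; this is not Ragas' actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" for three nodes (hypothetical values, not real model output).
node_embeddings = {
    "sleep_summary": [0.9, 0.1, 0.0],
    "stress_summary": [0.8, 0.3, 0.1],
    "nutrition_summary": [0.0, 0.2, 0.9],
}

SIMILARITY_THRESHOLD = 0.9  # hypothetical cut-off for creating an edge


def build_relationships(embeddings, threshold):
    """Return (left, right, score) for every node pair at or above the threshold."""
    names = sorted(embeddings)
    edges = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = cosine_similarity(embeddings[left], embeddings[right])
            if score >= threshold:
                edges.append((left, right, round(score, 3)))
    return edges


relationships = build_relationships(node_embeddings, SIMILARITY_THRESHOLD)
print(relationships)
```

Only the two nodes whose vectors point in nearly the same direction end up connected, which is the intuition behind similarity-based edges in the graph.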

In [11]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a new name so we don't shadow the imported default_transforms function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 9, relationships: 17)

We can save and load our knowledge graphs as follows.

In [12]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 9, relationships: 17)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [14]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
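
As a quick sanity check on a distribution like the one above, the sketch below verifies that the weights sum to 1 and works out the expected number of samples per query type for a given test set size. It assumes the weights are meant as fractions of the test set; the short dictionary keys and the `expected_counts` helper are hypothetical names, and nothing here calls Ragas.

```python
import math

# Weights mirroring the distribution above (keys are shortened, hypothetical labels).
weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}


def expected_counts(weights, testset_size):
    """Return the expected number of samples per query type."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return {name: weight * testset_size for name, weight in weights.items()}


counts = expected_counts(weights, testset_size=10)
print(counts)
```

Fractional expectations such as 2.5 have to be resolved to whole samples somehow, so the counts in a generated test set can differ slightly from these numbers.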

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
`SingleHopSpecificQuerySynthesizer` pulls a simple question-answer pair from a single resource. The question may be factual in nature.

The multi-hop synthesizers require further reasoning, where multiple resources need to be combined to generate an answer to a question. `MultiHopSpecificQuerySynthesizer` means there is a specific or direct question to be answered, whereas `MultiHopAbstractQuerySynthesizer` is for questions that are less direct and require more reasoning, or opinion forming, gathered from multiple resources.

The second argument is the fraction of test cases created by each of the above synthesizers, so I guess it depends on how hard you want to make the dataset and whether you want straightforward verifiable facts versus more subjective abstract reasoning cases.

Finally, we can use our `TestsetGenerator` to generate our test set!

In [18]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wha is the Psyhology Handbook?,[The Mental Health and Psychology Handbook A P...,The Mental Health and Psychology Handbook is a...,single_hop_specifc_query_synthesizer
1,What is Cognitive Behavioral Therapy and how d...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Cognitive Behavioral Therapy is one of the mos...,single_hop_specifc_query_synthesizer
2,What is CBT-I used for?,[Write letters to or from your future self Jou...,CBT-I is the recommended first-line treatment ...,single_hop_specifc_query_synthesizer
3,What role do Licensed Clinical Social Workers ...,[social interactions How to set and maintain b...,Licensed Clinical Social Workers provide thera...,single_hop_specifc_query_synthesizer
4,What are Proteins and why are they important i...,[PART 2: NUTRITION AND DIET Chapter 4: Fundame...,Proteins are essential for muscle repair and i...,single_hop_specifc_query_synthesizer
5,How can I use evening wind-down routines for b...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,To improve sleep quality and wake up refreshed...,multi_hop_abstract_query_synthesizer
6,How can incorporating daily wellness practices...,[<1-hop>\n\nThe Personal Wellness Guide A Comp...,Incorporating daily wellness practices such as...,multi_hop_abstract_query_synthesizer
7,H0w can settin and maintaing boundariez help i...,[<1-hop>\n\nsocial interactions How to set and...,"Setting and maintaining boundaries, such as id...",multi_hop_abstract_query_synthesizer
8,how do cbt and cbt-i help with mental health a...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Cognitive Behavioral Therapy (CBT) is an effec...,multi_hop_specific_query_synthesizer
9,How can practicing CBT-I and mindfulness-based...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,"Practicing CBT-I, which is the gold standard t...",multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [19]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [20]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"As a health and wellness coach, could you expl...",[The Mental Health and Psychology Handbook A P...,"Mental health encompasses our emotional, psych...",single_hop_specifc_query_synthesizer
1,What is DBT and how does it differ from other ...,[PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog...,Dialectical Behavior Therapy (DBT) was origina...,single_hop_specifc_query_synthesizer
2,How does sleep impact mental health according ...,[Write letters to or from your future self Jou...,Sleep and mental health have a bidirectional r...,single_hop_specifc_query_synthesizer
3,What are some strategies for managing digital ...,[social interactions How to set and maintain b...,Strategies for digital mental health include s...,single_hop_specifc_query_synthesizer
4,Based on the research findings on exercise ben...,[<1-hop>\n\nWrite letters to or from your futu...,Research findings indicate that regular exerci...,multi_hop_abstract_query_synthesizer
5,How can stress and mood tracking combined with...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Stress and mood tracking helps identify patter...,multi_hop_abstract_query_synthesizer
6,How can incorporating both balanced nutrition ...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"In the provided health and wellness context, a...",multi_hop_abstract_query_synthesizer
7,how do micronutrients like vitamins and minera...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,The context explains that micronutrients such ...,multi_hop_abstract_query_synthesizer
8,How can understanding the habit loop and estab...,[<1-hop>\n\n13: The Science of Habit Formation...,By understanding the habit loop—comprising cue...,multi_hop_specific_query_synthesizer
9,How can understanding the habit loop and estab...,[<1-hop>\n\n13: The Science of Habit Formation...,"Understanding the habit loop, which includes c...",multi_hop_specific_query_synthesizer


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
*Your answer here*

---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did
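
For the comparison step, one option is to tally the `synthesizer_name` column of each test set and look at the difference in shares. The sketch below hard-codes the names from the two test sets shown earlier in this notebook (5/3/2 for the unrolled run and 4/4/2 for the abstracted run); the `type_shares` and `compare_shares` helpers are hypothetical. With live data you would pass the column from `to_pandas()` instead.

```python
from collections import Counter

# synthesizer_name values copied from the two test sets shown earlier.
unrolled_names = (
    ["single_hop_specifc_query_synthesizer"] * 5
    + ["multi_hop_abstract_query_synthesizer"] * 3
    + ["multi_hop_specific_query_synthesizer"] * 2
)
abstracted_names = (
    ["single_hop_specifc_query_synthesizer"] * 4
    + ["multi_hop_abstract_query_synthesizer"] * 4
    + ["multi_hop_specific_query_synthesizer"] * 2
)


def type_shares(names):
    """Fraction of the test set produced by each synthesizer."""
    counts = Counter(names)
    total = len(names)
    return {name: count / total for name, count in counts.items()}


def compare_shares(baseline, candidate):
    """Share difference (candidate minus baseline) per query type."""
    base = type_shares(baseline)
    cand = type_shares(candidate)
    return {
        name: round(cand.get(name, 0.0) - base.get(name, 0.0), 2)
        for name in sorted(set(base) | set(cand))
    }


diff = compare_shares(unrolled_names, abstracted_names)
for name, delta in diff.items():
    print(f"{name}: {delta:+.2f}")
```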

In [None]:
### YOUR CODE HERE ###

# Define a custom query distribution with different weights
# Generate a new test set and compare with the default






I weighted the distribution more towards the multi-hop questions because these seem the most challenging to answer well (this plays out in the Activity #2 results) and will therefore be harder to optimise our app for.


In [27]:
dataset.to_pandas().query("synthesizer_name=='multi_hop_abstract_query_synthesizer'")

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
4,Based on the research findings on exercise ben...,[<1-hop>\n\nWrite letters to or from your futu...,Research findings indicate that regular exerci...,multi_hop_abstract_query_synthesizer
5,How can stress and mood tracking combined with...,[<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha...,Stress and mood tracking helps identify patter...,multi_hop_abstract_query_synthesizer
6,How can incorporating both balanced nutrition ...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,"In the provided health and wellness context, a...",multi_hop_abstract_query_synthesizer
7,how do micronutrients like vitamins and minera...,[<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter...,The context explains that micronutrients such ...,multi_hop_abstract_query_synthesizer


We already provided our LangSmith API key and set tracing to "true" in Task 1, so LangSmith is ready to use.

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [6]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the Ragas-created DataFrame and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.
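
The mapping itself can be sketched without touching LangSmith at all. The helper below (a hypothetical `to_langsmith_example`, with placeholder row values) shows how one Ragas-style row is reshaped into the `question` / `answer` layout, with the reference contexts kept as metadata.

```python
def to_langsmith_example(row):
    """Map one Ragas-style row to LangSmith-style inputs, outputs, and metadata."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Hypothetical row: the question comes from the test set above, the rest are placeholders.
sample_row = {
    "user_input": "What is CBT-I used for?",
    "reference_contexts": ["<reference context text>"],
    "reference": "<reference answer text>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"])
print(example["outputs"])
```

Keeping the reshaping in a small function like this makes it easy to check the keys before any examples are uploaded.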

In [21]:
# iterrows() yields (index, row) pairs; we only need the row
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )

## Basic RAG Chain

Time for some RAG!


In [28]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [29]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
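
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` splits on separators and then merges pieces, so its chunks will not line up exactly with these fixed windows. The `chunk_text` helper and the placeholder text are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 1,200 placeholder characters standing in for a document.
text = "".join(str(i % 10) for i in range(1200))

chunks = chunk_text(text, chunk_size=500, chunk_overlap=50)
print([len(chunk) for chunk in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means the tail of one chunk is repeated at the head of the next, which reduces the chance that a sentence relevant to a query is cut in half at a boundary.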

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [30]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [31]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [32]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [33]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [34]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set up our RAG LCEL chain!

In [35]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [36]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

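We'll define OpenAI's GPT-4.1 as `eval_llm`, our evaluation LLM. Note that the evaluators in the next cell select their judge by model string (`openai:gpt-4o`), so `eval_llm` is not wired into them as written.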

In [37]:
eval_llm = ChatOpenAI(model="gpt-4.1")

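We'll be using three LLM-as-a-judge evaluators built with `openevals`: QA correctness, labeled helpfulness, and a custom "dopeness" criterion! Each one stands in for an evaluator previously available through `LangChainStringEvaluator`.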

In [38]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o",
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o",
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`:
> - `labeled_helpfulness_evaluator`:
> - `dopeness_evaluator`:
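
Not every evaluator has to call an LLM. As a minimal sketch of a custom, code-only evaluator, the function below flags runs where the chain answered "I don't know" even though a reference answer exists. It assumes a LangSmith version whose `evaluate` accepts plain functions taking `outputs` and `reference_outputs`, and that the chain's string result arrives under the `output` key (as the `outputs.output` column in the results suggests). The `abstained_evaluator` name and the `abstained` feedback key are hypothetical.

```python
def abstained_evaluator(outputs, reference_outputs):
    """Score 1 when the chain abstained although a reference answer was available."""
    answer = str(outputs.get("output", "")).strip().lower().rstrip(".")
    has_reference = bool(reference_outputs.get("answer"))
    abstained = answer == "i don't know"
    return {"key": "abstained", "score": int(abstained and has_reference)}


# Quick local check with hand-written values (no LangSmith call involved).
print(abstained_evaluator({"output": "I don't know."}, {"answer": "A reference answer."}))
print(abstained_evaluator({"output": "Sleep affects mood."}, {"answer": "A reference answer."}))
```

A check like this separates "the retriever found nothing usable" from "the model answered but the judge disagreed", which the LLM-based scores alone do not distinguish.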

## LangSmith Evaluation

In [40]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'formal-plastic-46' at:
https://smith.langchain.com/o/dd799170-e56a-56cc-b3a5-c34310fb0969/datasets/3b244119-045c-424c-8af2-5f37bec4f547/compare?selectedSessions=6141b63f-2c67-4972-8627-c6d2ecf269ed




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do Chapters 4 and 16 together inform a hol...,I don't know.,,Chapter 4 emphasizes the importance of fundame...,False,False,False,0.712377,2c0ac3b9-bb66-4d6c-8922-b6ceeb4d51c1,019c519d-6c0a-7121-8d4f-3be5c2881185
1,How can understanding the fundamentals of heal...,I don't know.,,Understanding the fundamentals of healthy eati...,False,False,False,0.815102,d5a90058-19ea-400e-9dcd-7f8433d1a57b,019c519d-a336-7b53-ad0a-9a774d4d49cc
2,How can understanding the habit loop and estab...,Understanding the habit loop—which consists of...,,"Understanding the habit loop, which includes c...",True,True,True,2.572155,a8fe36d2-2de8-4f8a-ac1f-583c3cf8587d,019c519d-d324-76a1-935c-744634d27e25
3,How can understanding the habit loop and estab...,Understanding the habit loop—which consists of...,,By understanding the habit loop—comprising cue...,True,True,True,2.555839,20b23fe0-63cb-4c9f-a559-d9f77e9743c4,019c519e-16d0-77d0-888b-0e5179c82b1d
4,how do micronutrients like vitamins and minera...,"Based on the provided context, micronutrients ...",,The context explains that micronutrients such ...,True,True,False,2.94982,37f0168d-3100-4f41-be2e-9570cb6c7170,019c519e-5b87-7313-8bb2-0da986ba4297
5,How can incorporating both balanced nutrition ...,Incorporating both balanced nutrition and regu...,,"In the provided health and wellness context, a...",True,True,True,6.19051,f8704981-25fe-476f-86bf-bc27d8c15dca,019c519e-96cb-7873-81ab-8c93efff9da3
6,How can stress and mood tracking combined with...,I don't know.,,Stress and mood tracking helps identify patter...,False,False,False,0.707236,514284d0-be2f-46a1-963c-489cb10ff498,019c519e-d8d0-7d10-8129-ac1040b65607
7,Based on the research findings on exercise ben...,Regular physical activity improves mental heal...,,Research findings indicate that regular exerci...,True,True,False,2.353746,9cfdef71-3d9c-4b44-8210-73e8ccb424f8,019c519f-03ff-7c72-99e0-63ea709ecf64
8,What are some strategies for managing digital ...,Some strategies for managing digital mental he...,,Strategies for digital mental health include s...,True,True,True,2.251114,29ec7283-c075-4c8e-af99-e1e3345a927f,019c519f-6201-71d1-86af-6c9d7400bfb5
9,How does sleep impact mental health according ...,"According to the context, sleep impacts mental...",,Sleep and mental health have a bidirectional r...,True,True,True,2.289578,5de2c7e6-0b28-41a3-899c-6b9fae32795c,019c519f-a114-73e1-af9e-9ca1f04cddf1


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model used for retrieval to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [41]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [42]:
rag_documents = docs

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Chunk size affects retrieval. Bigger chunks increase the likelihood that more of the relevant information lands in a single chunk, and that this chunk ranks higher in retrieval, so the context provided to the LLM is more relevant.
This would probably be beneficial in **some** multi-hop questions where the answer can be constructed from the same (larger) chunk.

Increasing chunk size will only work up to a point (after which context rot and lost-in-the-middle effects start to kick in). Smaller chunks could be more effective for answering more direct factual questions, as the retrieved chunk is less likely to contain additional context that is irrelevant to the question.

In [44]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
Embeddings determine how well we can retrieve relevant chunks, and it's the retrieval that creates the context required to generate quality answers.

In [45]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [46]:
retriever = vectorstore.as_retriever()
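
One detail worth keeping in mind when comparing the two pipelines: the baseline retriever was created with `k=10`, while this one passes no `search_kwargs`. Assuming `as_retriever()` then falls back to LangChain's default of `k=4`, the rough upper bound on retrieved context changes as sketched below. The `max_context_chars` helper is hypothetical, and real chunks are often shorter than `chunk_size`.

```python
def max_context_chars(k, chunk_size):
    """Upper bound on the number of retrieved characters passed to the prompt."""
    return k * chunk_size


baseline_budget = max_context_chars(k=10, chunk_size=500)
# k=4 is an assumption: the default used when no search_kwargs are given.
modified_budget = max_context_chars(k=4, chunk_size=1000)

print(f"baseline: up to {baseline_budget} characters of context")
print(f"modified: up to {modified_budget} characters of context")
```

Under that assumption, the modified pipeline retrieves fewer but larger chunks, so the number of retrieved documents is a fourth variable alongside the prompt, chunk size, and embedding model.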

Setting up our new and improved DOPE RAG CHAIN.

In [47]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)

Let's test it on a sample question (a different one from the question we used before).

In [48]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Yo, wanna level up your sleep game like a true legend? Here’s the dope blueprint straight from the sleep sages:\n\n1. **Lock in that rhythm:** Hit the sheets and rise *consistently*—yes, even on those wild weekends. Your body’s internal clock loves predictability.\n\n2. **Ritualize your chill time:** Create a calming pre-sleep vibe—think gentle stretches, a warm bath, or getting lost in a good book. Ditch the screens 1-2 hours before bedtime because that blue light is the ultimate sleep saboteur.\n\n3. **Craft your sleep fortress:** Keep the bedroom cool (65-68°F or 18-20°C), pitch-black with blackout curtains or a slick sleep mask, and whisper-quiet—white noise machines or earplugs are your allies. Invest in a mattress and pillows worthy of a sleep deity.\n\n4. **Mind your intake:** No caffeine past 2 PM and steer clear of heavy meals or booze before bed—your body needs to chill, not scramble.\n\n5. **Move smart:** Get your sweat on regularly but not right before bedtim

Finally, we can evaluate the new chain on the same test set!

In [49]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'new-property-42' at:
https://smith.langchain.com/o/dd799170-e56a-56cc-b3a5-c34310fb0969/datasets/3b244119-045c-424c-8af2-5f37bec4f547/compare?selectedSessions=6a1a65c2-ed55-48a1-8165-5ae24e4d2d90




0it [00:00, ?it/s]

Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do Chapters 4 and 16 together inform a hol...,"Yo, here’s the ultimate mashup of Chapters 4 a...",,Chapter 4 emphasizes the importance of fundame...,False,True,True,5.744376,2c0ac3b9-bb66-4d6c-8922-b6ceeb4d51c1,019c51a0-eeaa-78f1-8535-d31d19d78a0d
1,How can understanding the fundamentals of heal...,"Yo, here’s the lowdown straight from the Healt...",,Understanding the fundamentals of healthy eati...,True,True,True,4.822495,d5a90058-19ea-400e-9dcd-7f8433d1a57b,019c51a1-4927-74b3-b6c8-b6950081c022
2,How can understanding the habit loop and estab...,"Yo, here’s the rad lowdown straight from the w...",,"Understanding the habit loop, which includes c...",True,True,True,5.186441,a8fe36d2-2de8-4f8a-ac1f-583c3cf8587d,019c51a1-8827-71e2-b270-b96fcaee8929
3,How can understanding the habit loop and estab...,"Oh, now we’re cooking with some next-level wel...",,By understanding the habit loop—comprising cue...,True,True,True,5.114058,20b23fe0-63cb-4c9f-a559-d9f77e9743c4,019c51a1-d406-7dc0-874d-0036bc2d310d
4,how do micronutrients like vitamins and minera...,"Yo, let’s dive deep into the science of micron...",,The context explains that micronutrients such ...,True,True,True,4.239875,37f0168d-3100-4f41-be2e-9570cb6c7170,019c51a2-1c6a-70d3-90df-3102c9a259e5
5,How can incorporating both balanced nutrition ...,"Alright, buckle up for a health and mental wel...",,"In the provided health and wellness context, a...",True,True,True,5.576539,f8704981-25fe-476f-86bf-bc27d8c15dca,019c51a2-6230-7931-87b9-1831afa6e328
6,How can stress and mood tracking combined with...,"Alright, let's crank this up to max dopeness! ...",,Stress and mood tracking helps identify patter...,True,True,True,4.502214,514284d0-be2f-46a1-963c-489cb10ff498,019c51a2-aacf-77c2-a29f-5683ccea9e95
7,Based on the research findings on exercise ben...,"Alright, let‚Äôs dive into the rad science of ho...",,Research findings indicate that regular exerci...,True,True,True,5.183686,9cfdef71-3d9c-4b44-8210-73e8ccb424f8,019c51a2-de09-74a3-b509-58591e63bf54
8,What are some strategies for managing digital ...,"Alright, here‚Äôs the dopest lowdown on managing...",,Strategies for digital mental health include s...,True,True,True,3.941026,29ec7283-c075-4c8e-af99-e1e3345a927f,019c51a3-1b34-7981-be01-c2cf6fbded17
9,How does sleep impact mental health according ...,"Yo, listen up ‚Äî sleep isn‚Äôt just downtime for ...",,Sleep and mental health have a bidirectional r...,True,True,True,2.712017,5de2c7e6-0b28-41a3-899c-6b9fae32795c,019c51a3-4fcd-7aa2-8b23-5f77fa17ef04
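
To compare runs by number rather than by eye, the boolean feedback columns in the table above can be reduced to pass rates. The following is a minimal sketch in plain Python, with the values transcribed by hand from the ten rows shown; `pass_rate` is a hypothetical helper for illustration and not part of the LangSmith SDK.

```python
from statistics import mean

# Feedback values transcribed by hand from the ten result rows above:
# (feedback.qa, feedback.helpfulness, feedback.dopeness, execution_time)
raw_rows = [
    (False, True, True, 5.744376),
    (True, True, True, 4.822495),
    (True, True, True, 5.186441),
    (True, True, True, 5.114058),
    (True, True, True, 4.239875),
    (True, True, True, 5.576539),
    (True, True, True, 4.502214),
    (True, True, True, 5.183686),
    (True, True, True, 3.941026),
    (True, True, True, 2.712017),
]
columns = ("qa", "helpfulness", "dopeness", "execution_time")
rows = [dict(zip(columns, values)) for values in raw_rows]


def pass_rate(rows, metric):
    """Fraction of rows whose boolean feedback for `metric` is True (hypothetical helper)."""
    return sum(1 for row in rows if row[metric]) / len(rows)


summary = {metric: pass_rate(rows, metric) for metric in ("qa", "helpfulness", "dopeness")}
mean_latency = mean(row["execution_time"] for row in rows)

print(summary)
print(f"mean execution time: {mean_latency:.2f}s")
```

Running the same reduction over the baseline experiment's rows would give a per-metric difference between the two chains, which is roughly what the LangSmith comparison view shows graphically.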


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

##### Answer:
The baseline is our original run, and the comparison (turquoise) is the dope-ified run.
Dopeness is far better, as expected, because we added a specific prompt instruction to change the response style.


Helpfulness and QA, as scored by the LLM-as-a-judge evaluators, are likely unaffected by the dopeness instruction. Human feedback on helpfulness might be affected, though, depending on how an individual reads a more street-style register (generation and English as a second language both play a part).

So we can assume that helpfulness and QA are more a function of chunk size and higher-fidelity embeddings.
Larger chunks might help with more complex questions where the pieces of the answer are at least somewhat adjacent in the source text: they are retrieved as one chunk, which is likely to be among the top-k retrieved chunks, rather than being pieced together from smaller chunks that may not all be retrieved. On the other hand, larger chunks potentially carry more information that is irrelevant to the task at hand and pollutes the context window, leading to unfocused answers or irrelevant information (and thus degraded helpfulness).
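
The chunk-size argument can be illustrated with a toy example. This is a minimal sketch under stated assumptions: `chunk_text` is a simplified fixed-width character splitter written for illustration (not necessarily the splitter used in the notebook), `chunks_containing_all` is a hypothetical helper, and the passage is made up.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into fixed-size character windows with optional overlap."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def chunks_containing_all(chunks, terms):
    """Return the chunks that mention every term (case-insensitive)."""
    return [c for c in chunks if all(t.lower() in c.lower() for t in terms)]


# Made-up passage: two related facts separated by unrelated filler text.
document = (
    "Regular exercise improves sleep quality. "
    + "Filler sentence about something unrelated. " * 3
    + "Good sleep supports emotional regulation."
)

# An AND-style question needs both of these facts in the retrieved context.
terms = ["exercise", "emotional regulation"]

for size in (80, 400):
    chunks = chunk_text(document, chunk_size=size, overlap=10)
    hits = chunks_containing_all(chunks, terms)
    print(f"chunk_size={size}: {len(chunks)} chunks, {len(hits)} contain both facts")
```

With the small window the two facts land in different chunks, so both chunks must be retrieved to answer an AND-style question; with the large window a single chunk carries both, at the cost of also carrying the filler.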

Better embedding models measure semantic similarity more accurately, particularly when two semantically similar concepts are expressed very differently in the text. This improves retrieval and, by extension, makes the context passed to the LLM more relevant.
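
As a rough illustration of how similarity scores drive retrieval, here is a minimal sketch of top-k selection by cosine similarity. The 3-dimensional vectors and chunk names are invented for the example and do not come from any real embedding model; `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Invented toy "embeddings": real models produce hundreds or thousands of dimensions.
chunk_vecs = {
    "sleep_hygiene": [0.9, 0.1, 0.0],
    "nutrition": [0.1, 0.9, 0.1],
    "exercise": [0.3, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for a question about sleep

print(top_k(query_vec, chunk_vecs, k=2))
```

A model that places a differently worded but related chunk closer to the query vector moves that chunk up this ranking, which is how embedding quality translates into more relevant context.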


The example below is a small dataset, but the 4 examples where the original RAG pipeline performed worse on helpfulness and QA were all "don't know" answers: given the prompt, the information in the context was not relevant enough to the question. These are AND-style questions (Chapter 4 AND Chapter 16; Chapter 4 AND Chapters 5 & 8), so the relevant content is unlikely to sit in the same short chunk.

![image](ragas_dopeness_comparison.png)


---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter**: chunk size, embedding model, and prompt modifications can significantly affect evaluation scores