# Session 9: Synthetic Data Generation and RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, and use it to evaluate and iterate on a RAG pipeline with LangSmith!

**Learning Objectives:**
- Understand Ragas' knowledge graph-based synthetic data generation workflow
- Generate synthetic test sets with different query synthesizer types
- Load synthetic data into LangSmith for evaluation
- Evaluate a RAG chain using LangSmith evaluators
- Iterate on RAG pipeline parameters and measure the impact

## Table of Contents:

- **Breakout Room #1:** Synthetic Data Generation with Ragas
  - Task 1: Dependencies and API Keys
  - Task 2: Data Preparation and Knowledge Graph Construction
  - Task 3: Generating Synthetic Test Data
  - Question #1 & Question #2
  - 🏗️ Activity #1: Custom Query Distribution

- **Breakout Room #2:** RAG Evaluation with LangSmith
  - Task 4: LangSmith Dataset Setup
  - Task 5: Building a Basic RAG Chain
  - Task 6: Evaluating with LangSmith
  - Task 7: Modifying the Pipeline and Re-Evaluating
  - Question #3 & Question #4
  - 🏗️ Activity #2: Analyze Evaluation Results

---
# 🤝 Breakout Room #1
## Synthetic Data Generation with Ragas

## Task 1: Dependencies and API Keys

We'll need to install some dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [6]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ryan.lustig/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ryan.lustig/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
import os
import getpass

# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

def get_api_key(env_var: str, prompt: str) -> str:
    """Get API key from environment or prompt user."""
    value = os.environ.get(env_var, "")
    if not value:
        value = getpass.getpass(prompt)
        if value:
            os.environ[env_var] = value
    return value

# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

OpenAI's API Key!

In [8]:
openai_key = get_api_key("OPENAI_API_KEY", "OpenAI API Key (press Enter to skip): ")
if openai_key:
    print("OpenAI API key set")
else:
    print("OpenAI API key not configured (optional)")

OpenAI API key set


We'll also want to set a project name to make things easier for ourselves.

In [9]:
from uuid import uuid4

langsmith_key = get_api_key("LANGCHAIN_API_KEY", "LangSmith API Key (press Enter to skip): ")

if langsmith_key:
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"
    print(f"LangSmith tracing enabled. Project: {os.environ['LANGCHAIN_PROJECT']}")
else:
    os.environ["LANGCHAIN_TRACING_V2"] = "false"
    print("LangSmith tracing disabled")

LangSmith tracing enabled. Project: AIM - SDG - 6bc4eb78


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data using two complementary guides: a Health & Wellness Guide covering exercise, nutrition, sleep, and stress management, and a Mental Health & Psychology Handbook covering mental health conditions, therapeutic approaches, resilience, and daily mental health practices. The topical overlap between documents helps RAGAS build rich cross-document relationships in the knowledge graph.

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`, with `TextLoader` handling each file.

In [10]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("data/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
print(f"Loaded {len(docs)} documents: {[d.metadata['source'] for d in docs]}")

Loaded 2 documents: ['data/MentalHealthGuide.txt', 'data/HealthWellnessGuide.txt']


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [11]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



Next, we're going to instantiate our Knowledge Graph.

This graph will contain N number of nodes that have M number of relationships. These nodes and relationships (AKA "edges") will define our knowledge graph and be used later to construct relevant questions and responses.

In [12]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [None]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

These default transformations are dependent on the corpus length, in our case:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finding the overall headline for the document
- Theme Extractor -> extracts broad themes about the documents

It then uses cosine-similarity and heuristics between the embeddings of the above transformations to construct relationships between the nodes.
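
To make the relationship-building step more concrete, here is a minimal pure-Python sketch of linking nodes whose embeddings are similar. This is an illustration only, not Ragas' implementation: the `build_relationships` helper, the toy three-dimensional vectors, and the `0.9` threshold are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Hypothetical helper: link every pair of nodes scoring at or above the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Toy embeddings (made up): two nodes point in a similar direction, one does not.
toy_embeddings = {
    "sleep_summary": [0.9, 0.1, 0.0],
    "stress_summary": [0.8, 0.2, 0.1],
    "nutrition_summary": [0.0, 0.2, 0.9],
}

edges = build_relationships(toy_embeddings, threshold=0.9)
print(edges)
```

Only the sleep/stress pair clears the threshold here, so the toy graph ends up with a single relationship. Raising or lowering the threshold trades off between a sparse graph and a densely connected one.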

In [14]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a separate name so we don't shadow the imported default_transforms function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 10, relationships: 20)

We can save and load our knowledge graphs as follows.

In [15]:
kg.save("usecase_data_kg.json")
usecase_data_kg = KnowledgeGraph.load("usecase_data_kg.json")
usecase_data_kg

KnowledgeGraph(nodes: 10, relationships: 20)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=usecase_data_kg)

However, we'd like to be able to define the kinds of queries we're generating - which is made simple by Ragas having pre-created a number of different "QuerySynthesizer"s.

Each of these Synthesizers is going to tackle a separate kind of query, which will be generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [17]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
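
As a rough intuition for what the weights mean, the sketch below splits a test set size across synthesizers in proportion to their weights. It is an assumption-laden illustration, not Ragas' actual sampling logic: `allocate_samples` is a hypothetical helper, and largest-remainder rounding is just one reasonable way to turn fractional shares into whole questions.

```python
def allocate_samples(weights, testset_size):
    """Hypothetical helper: split testset_size across weights (largest-remainder rounding)."""
    total = sum(weights.values())
    # Exact (possibly fractional) share of the test set for each synthesizer.
    exact = {name: testset_size * w / total for name, w in weights.items()}
    # Start from the floor of each share...
    counts = {name: int(value) for name, value in exact.items()}
    # ...then hand the leftover questions to the largest fractional remainders.
    leftover = testset_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
counts = allocate_samples(weights, testset_size=10)
print(counts)
```

With a test set of 10, the 0.5 weight maps cleanly to 5 questions, while the two 0.25 weights each want 2.5, so one of them has to round up. Expect small deviations like this from the nominal ratios whenever the test set is small.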

## ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.

##### Answer:
These query synthesizers vary in two areas: the number of "hops" of retrieval for the query and how specific or abstract the question is.

- SingleHopSpecific - A single-source, single-fact retrieval: a concrete question that can be answered from one hop.
- MultiHopSpecific - Requires multiple sources to extract a specific factual answer. It differs from the above because the answer must be combined from multiple sources.
- MultiHopAbstract - Also requires multiple sources, but produces a higher-level conceptual question requiring reasoning and synthesis across them.


Finally, we can use our `TestsetGenerator` to generate our testset!

In [18]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the United States address mental heal... | [The Mental Health and Psychology Handbook A P... | The context discusses mental health in general... | single_hop_specifc_query_synthesizer |
| 1 | What is DBT and how does it incorporate mindfu... | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Dialectical Behavior Therapy (DBT) was origina... | single_hop_specifc_query_synthesizer |
| 2 | What does norepinephrine do in the body? | [Write letters to or from your future self Jou... | Norepinephrine is involved in increasing neuro... | single_hop_specifc_query_synthesizer |
| 3 | What is an appendix in mental health resources? | [social interactions How to set and maintain b... | The context does not explicitly define 'APPEND... | single_hop_specifc_query_synthesizer |
| 4 | Monday what do I need to do for health and wel... | [The Personal Wellness Guide A Comprehensive R... | The context states that starting a new exercis... | single_hop_specifc_query_synthesizer |
| 5 | How can supporting digestive health through di... | [<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch... | Supporting digestive health by eating fiber-ri... | multi_hop_abstract_query_synthesizer |
| 6 | hoW can i improve my sleep hygiene and envirom... | [<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter... | To improve sleep hygiene and environment, you ... | multi_hop_abstract_query_synthesizer |
| 7 | how sleep hygiene and stress management techni... | [<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter... | Sleep hygiene practices such as maintaining a ... | multi_hop_abstract_query_synthesizer |
| 8 | how chapter 16 about digital mental health con... | [<1-hop>\n\nPART 3: SLEEP AND RECOVERY Chapter... | chapter 16 discusses managing digital mental h... | multi_hop_specific_query_synthesizer |
| 9 | how chapter 16 and 20 help with building healt... | [<1-hop>\n\nPART 5: BUILDING HEALTHY HABITS Ch... | chapter 16 talks about managing headaches natu... | multi_hop_specific_query_synthesizer |


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [19]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/7 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/16 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [20]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the mental health situation in the Un... | [The Mental Health and Psychology Handbook A P... | The Mental Health and Psychology Handbook expl... | single_hop_specifc_query_synthesizer |
| 1 | What is CBT? | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Cognitive Behavioral Therapy (CBT) is one of t... | single_hop_specifc_query_synthesizer |
| 2 | How can mental health be affected by issues wi... | [Write letters to or from your future self Jou... | Sleep and mental health have a bidirectional r... | single_hop_specifc_query_synthesizer |
| 3 | Could you please explain the significance of C... | [social interactions How to set and maintain b... | Chapter 16: Managing Digital Mental Health dis... | single_hop_specifc_query_synthesizer |
| 4 | Considering the importance of dietary patterns... | [<1-hop>\n\nWrite letters to or from your futu... | Incorporating key nutrients such as Omega-3 fa... | multi_hop_abstract_query_synthesizer |
| 5 | How can self-compassion and communication stra... | [<1-hop>\n\nsocial interactions How to set and... | Practicing self-compassion when setting bounda... | multi_hop_abstract_query_synthesizer |
| 6 | How can practicing good sleep hygiene practice... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Practicing good sleep hygiene practices and ro... | multi_hop_abstract_query_synthesizer |
| 7 | How does exercise impact mental health and phy... | [<1-hop>\n\nThe Mental Health and Psychology H... | Exercise plays a crucial role in improving men... | multi_hop_abstract_query_synthesizer |
| 8 | How does adequate vitamin D intake, as part of... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | The context highlights that vitamins are essen... | multi_hop_specific_query_synthesizer |
| 9 | H0w c4n sleep b3nfits b3 r3lated to m3ntal h3a... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Sleep is crucial for physical health, mental w... | multi_hop_specific_query_synthesizer |


## ❓ Question #2:

Ragas offers both an "unrolled" (manual) approach and an "abstracted" (automatic) approach to synthetic data generation. What are the trade-offs between these two approaches? When would you choose one over the other?

##### Answer:
With the "manual" approach, there are two primary benefits. The first is control: as seen above, the user chooses the node types (`NodeType.DOCUMENT` vs `NodeType.CHUNK`) as well as the transforms to apply to the knowledge graph. The second is reuse: in a production system the knowledge graph may be large, so building it once, saving it, and generating many different test sets from it avoids repeating that work.

With the "automatic" approach, there is more convenience (less code) and faster setup (no wiring required), but also less control (no direct access to the knowledge graph or transforms to modify them). According to the docs, this approach also rebuilds the knowledge graph on every call.

With those trade-offs known, the choice of when to use one over the other becomes clearer.
- Use the "manual" approach if there is a need to modify the pipeline, reuse the knowledge graph, or use custom transforms. This approach is the better fit for generating data for production systems because of the control it offers.
- Use the "automatic" approach for quick tests and/or analysis on a pipeline. Additionally, use this approach if reuse of the knowledge graph is not needed.



---
## 🏗️ Activity #1: Custom Query Distribution

Modify the `query_distribution` to experiment with different ratios of query types.

### Requirements:
1. Create a custom query distribution with different weights than the default
2. Generate a new test set using your custom distribution
3. Compare the types of questions generated with the default distribution
4. Explain why you chose the weights you did

In [22]:
# Define a custom query distribution with different weights
query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.20),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.60),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.20),
]

# Generate a new test set and compare with the default
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()


Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Whaat is the importnce of mental health in our... | [The Mental Health and Psychology Handbook A P... | Mental health encompasses our emotional, psych... | single_hop_specifc_query_synthesizer |
| 1 | What is Mindfulness-Based Cognitive Therapy? | [PART 2: THERAPEUTIC APPROACHES Chapter 4: Cog... | Mindfulness-Based Cognitive Therapy (MBCT) com... | single_hop_specifc_query_synthesizer |
| 2 | How does the connection between stress and men... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | The context emphasizes that stress significant... | multi_hop_abstract_query_synthesizer |
| 3 | How can incorporating self-care practices and ... | [<1-hop>\n\nPART 2: THERAPEUTIC APPROACHES Cha... | Incorporating self-care practices and practici... | multi_hop_abstract_query_synthesizer |
| 4 | Hwo can I buid a weekely exercise skedule that... | [<1-hop>\n\nThe Personal Wellness Guide A Comp... | Based on the Personal Wellness Guide, to build... | multi_hop_abstract_query_synthesizer |
| 5 | how stress and mental wellness relate to stres... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | The context explains that stress is a body's r... | multi_hop_abstract_query_synthesizer |
| 6 | Considering the comprehensive understanding of... | [<1-hop>\n\nThe Mental Health and Psychology H... | The handbook explains that mental health condi... | multi_hop_abstract_query_synthesizer |
| 7 | How can stress management techniques, such as ... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | Stress management techniques like deep breathi... | multi_hop_abstract_query_synthesizer |
| 8 | How do B vitamins, as discussed in the context... | [<1-hop>\n\nPART 2: NUTRITION AND DIET Chapter... | The context highlights that vitamins, includin... | multi_hop_specific_query_synthesizer |
| 9 | How can understanding Chapter 16 on managing d... | [<1-hop>\n\nsocial interactions How to set and... | Understanding Chapter 16 on managing digital m... | multi_hop_specific_query_synthesizer |


In [29]:
df = testset.to_pandas()
for i in range(len(df)):
    print(f"Question: {df['user_input'].iloc[i]}")
    print(f"  Synthesizer: {df['synthesizer_name'].iloc[i]}")

Question: Whaat is the importnce of mental health in our overall well-being and how can it be maintaned despite challenges?
  Synthesizer: single_hop_specifc_query_synthesizer
Question: What is Mindfulness-Based Cognitive Therapy?
  Synthesizer: single_hop_specifc_query_synthesizer
Question: How does the connection between stress and mental wellness, as well as the mind-body connection highlighted in the context, inform strategies for improving emotional resilience and overall mental health?
  Synthesizer: multi_hop_abstract_query_synthesizer
Question: How can incorporating self-care practices and self-compassion in boundary management, as discussed in mental health strategies, enhance emotional resilience and support the development of emotional intelligence, especially considering the role of daily mental health practices like journaling and managing digital interactions?
  Synthesizer: multi_hop_abstract_query_synthesizer
Question: Hwo can I buid a weekely exercise skedule that incl

#### Compare the types of questions generated with the default distribution
The distribution I chose is 20% single-hop specific, 60% multi-hop abstract, and 20% multi-hop specific. I chose these weights because I wanted to see how well RAGAS handles a more complex query distribution.

Compared to the default distribution, the questions are more abstract and general in nature, requiring more reasoning to answer well. This makes sense, as multi-hop abstract questions require retrieving from multiple sources and then synthesizing and analyzing what was found. Additionally, these questions are more likely to be answered correctly by a more powerful model. For example, one question is:

```How do B vitamins, as discussed in the context of nutrition, relate to mental health strategies like sleep improvement and stress management, considering their role as essential vitamins for emotional well-being?```

This type of question requires an understanding of B vitamins, which is covered under nutrition, and then relating that understanding to mental health strategies, which is a separate topic.
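
To put numbers on the comparison, here is a small sketch that tallies the share of each synthesizer in the two runs. The lists are transcribed from the `synthesizer_name` columns of the two result tables shown earlier (the first unrolled run and the custom-distribution run); `synthesizer_shares` is a hypothetical helper name.

```python
from collections import Counter

# Synthesizer names as they appear in the output (the "specifc" spelling comes from the output itself).
SINGLE = "single_hop_specifc_query_synthesizer"
ABSTRACT = "multi_hop_abstract_query_synthesizer"
SPECIFIC = "multi_hop_specific_query_synthesizer"

# Transcribed from the two result tables in this notebook.
first_run = [SINGLE] * 5 + [ABSTRACT] * 3 + [SPECIFIC] * 2
custom_run = [SINGLE] * 2 + [ABSTRACT] * 6 + [SPECIFIC] * 2


def synthesizer_shares(names):
    """Hypothetical helper: fraction of the test set produced by each synthesizer."""
    counts = Counter(names)
    return {name: count / len(names) for name, count in counts.items()}


first_shares = synthesizer_shares(first_run)
custom_shares = synthesizer_shares(custom_run)

for name in (SINGLE, ABSTRACT, SPECIFIC):
    print(f"{name}: {first_shares[name]:.0%} -> {custom_shares[name]:.0%}")
```

On a live test set the same tally can be read straight off the DataFrame with `df["synthesizer_name"].value_counts(normalize=True)`.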

---
# 🤝 Breakout Room #2
## RAG Evaluation with LangSmith

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [30]:
from langsmith import Client
import uuid

client = Client()

dataset_name = f"Use Case Synthetic Data - AIE9 - {uuid.uuid4()}"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Synthetic Data for Use Cases"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [31]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )
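
The mapping from a Ragas row to a LangSmith example can also be pulled out into a small function, which makes it easy to check without calling the API. This is a sketch under assumptions: `to_langsmith_example` is a hypothetical helper, and the sample row is made up for illustration.

```python
def to_langsmith_example(row):
    """Hypothetical helper: map one Ragas testset row onto LangSmith example fields."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# Made-up row with the same column names as the Ragas DataFrame.
sample_row = {
    "user_input": "What is CBT?",
    "reference_contexts": ["PART 2: THERAPEUTIC APPROACHES ..."],
    "reference": "Cognitive Behavioral Therapy (CBT) is ...",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])

# In the loop above, this could then be used as:
# client.create_example(**to_langsmith_example(row), dataset_id=langsmith_dataset.id)
```

Keeping the key names (`question`, `answer`) in one place also makes it harder for the dataset and the chain's expected input to drift apart.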

## Basic RAG Chain

Time for some RAG!


In [32]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [33]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [34]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [35]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="use_case_rag"
)

In [36]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [37]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

As is usual: We'll be using `gpt-4.1-mini` for our RAG!

In [38]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [39]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [40]:
rag_chain.invoke({"question" : "What are some recommended exercises for lower back pain?"})

'Recommended exercises for lower back pain include:\n\n- Cat-Cow Stretch: Start on hands and knees, alternate between arching your back up (cat) and letting it sag down (cow). Do 10-15 repetitions.\n- Bird Dog: From hands and knees, extend opposite arm and leg while keeping your core engaged. Hold for 5 seconds, then switch sides. Do 10 repetitions per side.\n- Partial Crunches: Lie on your back with knees bent, cross arms over chest, tighten stomach muscles and raise shoulders off floor. Hold briefly, then lower. Do 8-12 repetitions.\n- Knee-to-Chest Stretch: Lie on your back, pull one knee toward your chest while keeping the other foot flat. Hold for 15-30 seconds, then switch legs.\n- Pelvic Tilts: Lie on your back with knees bent, flatten your back against the floor by tightening abs and tilting pelvis up slightly. Hold for 10 seconds, repeat 8-12 times.'

## LangSmith Evaluation Set-up

We'll define GPT-4.1 as `eval_llm`. Note that the evaluators defined below set their judge via `model="openai:gpt-4o"`, so as written they do not use `eval_llm`.

In [41]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [42]:
from openevals.llm import create_llm_as_judge
from langsmith.evaluation import evaluate

# 1. QA Correctness (replaces LangChainStringEvaluator("qa"))
qa_evaluator = create_llm_as_judge(
    prompt="You are evaluating a QA system. Given the input, assess whether the prediction is correct.\n\nInput: {inputs}\nPrediction: {outputs}\nReference answer: {reference_outputs}\n\nIs the prediction correct? Return 1 if correct, 0 if incorrect.",
    feedback_key="qa",
    model="openai:gpt-4o",  # judge model, given as a "provider:model" string
)

# 2. Labeled Helpfulness (replaces LangChainStringEvaluator("labeled_criteria"))
labeled_helpfulness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "helpfulness: Is this submission helpful to the user, "
        "taking into account the correct reference answer?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n"
        "Reference answer: {reference_outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="helpfulness",
    model="openai:gpt-4o" ,
)

# 3. Dopeness (replaces LangChainStringEvaluator("criteria"))
dopeness_evaluator = create_llm_as_judge(
    prompt=(
        "You are assessing a submission based on the following criterion:\n\n"
        "dopeness: Is this response dope, lit, cool, or is it just a generic response?\n\n"
        "Input: {inputs}\n"
        "Submission: {outputs}\n\n"
        "Does the submission meet the criterion? Return 1 if yes, 0 if no."
    ),
    feedback_key="dopeness",
    model="openai:gpt-4o" ,
)

> **Describe what each evaluator is evaluating:**
>
> - `qa_evaluator`: Simple evaluation for correctness - is the answer right or wrong.
> - `labeled_helpfulness_evaluator`: The helpfulness of the answer based on the context in the reference answer. The answer could be right, but may not be helpful to the user. Ex: "What time is it?" --> Ans: "It's time for lunch!"
> - `dopeness_evaluator`: The "coolness" of the answer vs a generic, "expected" answer.
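
Alongside the LLM-as-judge evaluators, a deterministic custom evaluator can be a useful sanity check. The sketch below flags answers where the chain fell back to "I don't know". It rests on two assumptions: that the installed LangSmith SDK accepts evaluator functions with `inputs`, `outputs`, and `reference_outputs` arguments returning a `key`/`score` dict, and that the chain's string output arrives under the `"output"` key (as the `outputs.output` column in the results suggests). `answered_evaluator` is a hypothetical name.

```python
def answered_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Hypothetical evaluator: score 1 if the chain attempted an answer, 0 if it declined."""
    # Assumption: a plain-string chain result is exposed under the "output" key.
    answer = str(outputs.get("output", "")).strip().lower()
    declined = answer.startswith("i don't know")
    return {"key": "answered", "score": 0 if declined else 1}


# Quick local check with made-up runs; no LangSmith call is needed.
declined_result = answered_evaluator(
    inputs={"question": "What is the capital of Mars?"},
    outputs={"output": "I don't know."},
    reference_outputs={"answer": "Not covered by the guides."},
)
answered_result = answered_evaluator(
    inputs={"question": "What is CBT?"},
    outputs={"output": "CBT is a structured form of talk therapy."},
    reference_outputs={"answer": "Cognitive Behavioral Therapy (CBT) is ..."},
)
print(declined_result, answered_result)
```

If those assumptions hold for your SDK version, the function could be appended to the `evaluators` list next to the judge-based ones.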

## LangSmith Evaluation

In [43]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'warm-oil-75' at:
https://smith.langchain.com/o/bda06b5c-4a44-46dc-a4f6-a5c0d55dbe59/datasets/e1150e7d-50a3-48c0-96c4-ee7ee1dc0fd8/compare?selectedSessions=27380228-23ae-44ca-a32c-32938e70dd2a




0it [00:00, ?it/s]

| | inputs.question | outputs.output | error | reference.answer | feedback.qa | feedback.helpfulness | feedback.dopeness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | How can setting and maintaing boundaries (chap... | Setting and maintaining boundaries can improve... | | Setting and maintaining boundaries, as discuss... | True | True | False | 2.763121 | 2c9e6e85-54ab-4e49-8a32-9bbb80f7426f | 019c50b2-2263-7293-8d5b-2d0431ad39cb |
| 1 | How can managing digital mental health, as dis... | Based on the provided context, managing digita... | | Managing digital mental health by setting inte... | True | True | True | 4.003999 | 985e3326-c809-4a8b-ad97-1c9729858604 | 019c50b2-836e-7d62-9319-91aed35ee1b7 |
| 2 | H0w c4n sleep b3nfits b3 r3lated to m3ntal h3a... | Sleep benefits mental health through multiple ... | | Sleep is crucial for physical health, mental w... | True | True | True | 9.703878 | 909e7213-6d02-45b5-ae5b-94a5ac0573ff | 019c50b2-ca44-7402-82d1-717246f22973 |
| 3 | How does adequate vitamin D intake, as part of... | According to the context, adequate vitamin D i... | | The context highlights that vitamins are essen... | True | True | False | 2.464779 | eddb3c85-9923-4367-9eb9-7eb27b20b9b1 | 019c50b3-4ae6-72c0-a2b2-ce43f477d51c |
| 4 | How does exercise impact mental health and phy... | Exercise positively impacts mental health by r... | | Exercise plays a crucial role in improving men... | True | True | True | 3.900259 | e627a957-8670-4b3b-9b6f-9204c1053591 | 019c50b3-93fc-7972-9f10-3ed730b9873a |
| 5 | How can practicing good sleep hygiene practice... | Practicing good sleep hygiene practices and ro... | | Practicing good sleep hygiene practices and ro... | True | True | False | 5.060091 | 4a05de4a-d877-43ad-ac20-1449fb824497 | 019c50b3-d2a0-7503-b164-675e6764d847 |
| 6 | How can self-compassion and communication stra... | Based on the context provided, self-compassion... | | Practicing self-compassion when setting bounda... | True | True | True | 2.854041 | 25f06def-9fac-4d0b-b6f9-0acc95be20fb | 019c50b4-0cb0-7253-9b45-5b6c2f782b0d |
| 7 | Considering the importance of dietary patterns... | Incorporating key nutrients such as Omega-3 fa... | | Incorporating key nutrients such as Omega-3 fa... | True | True | True | 5.0062 | 1e0b395d-d313-4472-abdb-54f3a0c4463e | 019c50b4-3fe5-7013-94f9-ebca4c991298 |
| 8 | Could you please explain the significance of C... | Chapter 16, titled "Managing Digital Mental He... | | Chapter 16: Managing Digital Mental Health dis... | True | True | True | 3.64893 | 3312d29a-cf05-4d26-ad5b-dc96fcf27cf4 | 019c50b4-83e7-70a2-b51f-aabff37d8109 |
| 9 | How can mental health be affected by issues wi... | Mental health can be affected by issues with s... | | Sleep and mental health have a bidirectional r... | True | True | False | 4.684834 | 5f2855dc-7721-44f0-9fd0-3f0e17bef709 | 019c50b4-b9c7-73f0-acee-c040ade15808 |
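
To summarize the run at a glance, here is a small sketch that turns the three feedback columns into pass rates. The boolean values are transcribed from the results above, in row order; `pass_rates` is a hypothetical helper name.

```python
# Feedback values transcribed from the evaluation results above (rows 0 through 9).
feedback = {
    "qa": [True] * 10,
    "helpfulness": [True] * 10,
    "dopeness": [False, True, True, False, True, False, True, True, True, False],
}


def pass_rates(feedback_columns):
    """Hypothetical helper: fraction of examples that passed each evaluator."""
    return {key: sum(values) / len(values) for key, values in feedback_columns.items()}


baseline_rates = pass_rates(feedback)
for key, rate in baseline_rates.items():
    print(f"{key}: {rate:.0%}")
```

Correctness and helpfulness are already saturated on this test set, so dopeness is the only metric with room to move. That is worth keeping in mind when reading the next experiment, since these scores are most useful as directional signals.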


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks (`chunk_size` of 1000 instead of 500)
- Upgrade the embedding model used by the retriever to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [44]:
DOPENESS_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Make your answer rad, ensure high levels of dopeness. Do not be generic, or give generic responses.

Context: {context}
Question: {question}
"""

dopeness_rag_prompt = ChatPromptTemplate.from_template(DOPENESS_RAG_PROMPT)

In [45]:
rag_documents = docs

In [46]:
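from langchain_text_splitters import RecursiveCharacterTextSplitter

# Use 1000-character chunks (the baseline chain used 500) so each chunk
# carries more surrounding context; keep a small overlap between neighbors.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)

# Re-split the source documents with the new settings.
rag_documents = text_splitter.split_documents(rag_documents)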

## ❓ Question #3:

Why would modifying our chunk size modify the performance of our application?

##### Answer:
Chunk size is the length of each text segment that a larger document is split into before embedding.

With smaller chunks, each segment is more granular and specific, but it loses surrounding context, which can hurt retrieval. For example, a small chunk might separate a header from its body text, so the body could appear to belong to any section of the document.

With larger chunks, each segment carries more surrounding context. However, if the chunks are too large, they introduce "noise": unrelated passages get embedded and retrieved along with the relevant one, which dilutes both the similarity match and the context passed to the LLM.

Overall, chunk size is a trade-off that has to be tuned for each application; there is no one-size-fits-all value.
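
To make the trade-off concrete, here is a minimal sketch using a naive fixed-width splitter. It is a simplified stand-in for `RecursiveCharacterTextSplitter`, not its actual algorithm (the real splitter prefers to break on separators such as paragraph boundaries). The `split_fixed` helper and the sample text are hypothetical and exist only for illustration.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Naive fixed-width splitter (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Hypothetical sample document: a header followed by its body text.
doc = (
    "Sleep Hygiene\n"
    "Keep a consistent bedtime, limit screens before bed, "
    "and keep the bedroom cool and dark."
)

small_chunks = split_fixed(doc, chunk_size=40, chunk_overlap=5)
large_chunks = split_fixed(doc, chunk_size=200, chunk_overlap=5)

# Count how many small chunks still contain the header that gives them meaning.
small_with_header = [c for c in small_chunks if "Sleep Hygiene" in c]

print(f"small: {len(small_chunks)} chunks, {len(small_with_header)} with the header")
print(f"large: {len(large_chunks)} chunk(s)")
```

With the small chunk size, only the first chunk keeps the header, so the later chunks read like advice that could belong to any section. With the large chunk size, the header and body stay together in one chunk.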

In [51]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## ❓ Question #4:

Why would modifying our embedding model modify the performance of our application?

##### Answer:
An embedding model turns text into embedding vectors. Retrieval then compares the query vector against the chunk vectors produced by that same model.

First, changing the embedding model changes which chunks are retrieved (i.e., which chunk embeddings are closest to the query embedding). Second, because the retrieved chunks form the context given to the LLM, different chunks lead to different answers.

So the better the embedding model, the better the similarity match, and the better the similarity match, the better the context the LLM has for producing a good answer.
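
As a toy illustration of this point, the sketch below ranks the same two chunks with two different "embedding models". Both embedders (`embed_chars` and `embed_words`) are hypothetical bag-of-counts stand-ins, nothing like the OpenAI models used in this notebook; the only goal is to show that swapping the embedder changes which chunk is retrieved.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed_chars(text: str) -> Counter:
    """Crude 'model': counts letters, ignoring word boundaries."""
    return Counter(c for c in text.lower() if c.isalpha())


def embed_words(text: str) -> Counter:
    """Slightly better 'model': counts whole words."""
    return Counter(text.lower().split())


def top_chunk(query: str, chunks: list[str], embed) -> str:
    """Return the chunk whose embedding is most similar to the query's."""
    query_vec = embed(query)
    return max(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)))


query = "sleep"
chunks = ["peels", "sleep well"]

print("char embedder picks:", top_chunk(query, chunks, embed_chars))
print("word embedder picks:", top_chunk(query, chunks, embed_words))
```

The letter-count embedder cannot tell "sleep" from its anagram "peels", so it retrieves the wrong chunk, while the word-count embedder retrieves the chunk that actually mentions sleep. The retriever code is identical in both cases; only the embedding function changed.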

In [52]:
from langchain_qdrant import QdrantVectorStore

vectorstore = QdrantVectorStore.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Use Case RAG Docs"
)

In [53]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [54]:
dopeness_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dopeness_rag_prompt | llm | StrOutputParser()
)
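
If the pipe syntax is unfamiliar, here is a rough plain-Python analogue of the data flow in the chain above. It is a sketch only: `fake_retriever`, `fake_llm`, `PROMPT`, and `run_chain` are hypothetical stand-ins, not LangChain objects, and the real chain passes richer objects (documents, prompt values, chat messages) between stages.

```python
from operator import itemgetter


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for `retriever`: returns canned context chunks."""
    return ["Keep a consistent sleep schedule.", "Avoid caffeine after 2 PM."]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for `llm`: echoes the prompt it received."""
    return "ECHO: " + prompt


# Simplified template with the same two variables as the notebook's prompt.
PROMPT = "Context: {context}\nQuestion: {question}\n"


def run_chain(inputs: dict) -> str:
    # Stage 1: the dict stage builds both prompt variables from the same input.
    question = itemgetter("question")(inputs)
    prompt_vars = {
        "context": fake_retriever(question),
        "question": question,
    }
    # Stage 2: fill the prompt template.
    prompt = PROMPT.format(**prompt_vars)
    # Stage 3: call the model; returning a plain string here plays the role
    # that StrOutputParser plays in the real chain.
    return fake_llm(prompt)


answer = run_chain({"question": "How can I improve my sleep quality?"})
print(answer)
```

The key detail is that the question is used twice: once to fetch context from the retriever and once as the question shown to the model.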

Let's test it on the same input that we saw before.

In [55]:
dopeness_rag_chain.invoke({"question" : "How can I improve my sleep quality?"})

'Yo, sleep warrior, here’s the ultimate cheat code to level up your sleep quality straight from the guru’s playbook:\n\n1. **Stick to your sleep schedule like it’s a high-stakes mission** — same bedtime and wake-up time ALL WEEK, no slacking on weekends.\n2. **Craft a chill bedtime ritual** — dive into a book, stretch those muscles gently, or soak in a warm bath to melt away the day’s grind.\n3. **Turn your bedroom into a sleep fortress** — keep it cool at 65-68°F (18-20°C), pitch-black with blackout curtains or a sleep mask, and quiet like a secret cave (white noise machines or earplugs can be your allies).\n4. **Kill screen time 1-2 hours before lights out** — screens mess with your melatonin mojo, so give your eyeballs a break.\n5. **Ditch caffeine after 2 PM** — no sneaky coffee jolts keeping you wired.\n6. **Exercise like a beast, but not right before bed** — your body loves movement, but save those power hours for earlier.\n7. **Say no to alcohol and heavy m

Finally, we can evaluate the new chain on the same test set!

In [56]:
evaluate(
    dopeness_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dopeness_evaluator
    ],
    metadata={"revision_id": "dopeness_rag_chain"},
)

View the evaluation results for experiment: 'yellow-crate-41' at:
https://smith.langchain.com/o/bda06b5c-4a44-46dc-a4f6-a5c0d55dbe59/datasets/e1150e7d-50a3-48c0-96c4-ee7ee1dc0fd8/compare?selectedSessions=1d776477-ee38-4128-a71e-99027044d98e





Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.qa,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How can setting and maintaing boundaries (chap...,"Alright, let’s crank this up to max dopeness—h...",,"Setting and maintaining boundaries, as discuss...",True,True,True,4.534081,2c9e6e85-54ab-4e49-8a32-9bbb80f7426f,019c5ab9-2f85-77c0-bbf5-8060283cc6ac
1,"How can managing digital mental health, as dis...","Alright, let’s light this up with some serious...",,Managing digital mental health by setting inte...,True,True,True,5.222825,985e3326-c809-4a8b-ad97-1c9729858604,019c5ab9-6058-7800-868e-3686b1080792
2,H0w c4n sleep b3nfits b3 r3lated to m3ntal h3a...,"Yo, buckle up for this sleep-mental health syn...",,"Sleep is crucial for physical health, mental w...",True,True,True,4.948796,909e7213-6d02-45b5-ae5b-94a5ac0573ff,019c5ab9-99e3-7ba2-9d4c-5b62cd6c1b47
3,"How does adequate vitamin D intake, as part of...","Alright, let’s drop some serious knowledge on ...",,The context highlights that vitamins are essen...,True,True,True,2.901406,eddb3c85-9923-4367-9eb9-7eb27b20b9b1,019c5ab9-eab1-7573-8038-b6c22c68dfe7
4,How does exercise impact mental health and phy...,"Alright, here’s the ultra-rad lowdown on how e...",,Exercise plays a crucial role in improving men...,True,True,True,6.041785,e627a957-8670-4b3b-9b6f-9204c1053591,019c5aba-0ea5-7ec1-aede-773eabb61bbd
5,How can practicing good sleep hygiene practice...,"Alright, buckle up for some next-level sleep w...",,Practicing good sleep hygiene practices and ro...,True,True,True,4.379497,4a05de4a-d877-43ad-ac20-1449fb824497,019c5aba-45c1-70c1-bd1e-f7b5509b8426
6,How can self-compassion and communication stra...,"Alright, let's turbocharge your mental game in...",,Practicing self-compassion when setting bounda...,True,True,True,3.010305,25f06def-9fac-4d0b-b6f9-0acc95be20fb,019c5aba-6e72-70b3-8fdb-b76e58bbb62a
7,Considering the importance of dietary patterns...,"Alright, buckle up for some next-level mental ...",,Incorporating key nutrients such as Omega-3 fa...,True,True,True,6.730179,1e0b395d-d313-4472-abdb-54f3a0c4463e,019c5aba-9397-7ed0-85e1-ee7a337894cc
8,Could you please explain the significance of C...,"Oh, hell yes—let’s dive into the electric vibe...",,Chapter 16: Managing Digital Mental Health dis...,True,True,True,4.293797,3312d29a-cf05-4d26-ad5b-dc96fcf27cf4,019c5aba-cccc-76c2-a98c-34422b718c5a
9,How can mental health be affected by issues wi...,"Yo, here’s the lowdown straight from the menta...",,Sleep and mental health have a bidirectional r...,True,True,True,4.891675,5f2855dc-7721-44f0-9fd0-3f0e17bef709,019c5abb-0295-75a2-8efc-0ae46d95b76b


---
## 🏗️ Activity #2: Analyze Evaluation Results

Provide a screenshot of the difference between the two chains in LangSmith, and explain why you believe certain metrics changed in certain ways.

### `rag_chain` metrics (non-dope version)

![Embedding small metrics](embedding_small_metrics.png)

### `dopeness_rag_chain` metrics (dope version)

![Embedding large metrics](embedding_large_metrics.png)

| | **rag_chain** | **dopeness_rag_chain** |
|---|---|---|
| **Chunk size** | 500 | 1000 |
| **Embedding model** | `text-embedding-3-small` | `text-embedding-3-large` |

The table above shows the retrieval-side differences between the two chains; the dope chain also uses a different prompt. The dope RAG chain scores higher overall, most likely because of the larger chunk size and the larger embedding model. As explained in Questions #3 and #4, chunk size and embedding model both play an important role in overall performance. Because all three changes were made at once, this comparison cannot isolate how much each one contributed.

With a larger chunk size, more context is kept per retrieved segment. For multi-hop questions, an answer may span several sentences or paragraphs, and a larger chunk is more likely to capture all of it.

A larger embedding model (in this case `text-embedding-3-large`, which produces 3072-dimensional vectors by default versus 1536 for `text-embedding-3-small`) can capture more detail and nuance, so retrieval tends to return more relevant chunks. This gives the LLM better context, which should improve the helpfulness and QA scores.

Of course, the `dopeness_rag_chain` performs significantly better on the dopeness metric because its RAG prompt explicitly asks for a rad, dope answer. Without that instruction, responses tend to be generic and lack personality.
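
To put numbers on this comparison, here is a small sketch that computes the pass rate per metric from the feedback columns in the result tables above. Two assumptions to flag: it assumes the baseline table uses the same `(qa, helpfulness, dopeness)` column order as the dope table, and the baseline list below covers only rows 2 through 9, so add the remaining rows before drawing conclusions. The `pass_rates` helper is hypothetical, not a LangSmith function.

```python
# Feedback tuples are (qa, helpfulness, dopeness), copied from the result tables.
# ASSUMPTION: the baseline table has the same column order as the dope table.
METRICS = ("qa", "helpfulness", "dopeness")

# Baseline `rag_chain`: rows 2 through 9 only (partial sample).
baseline_rows = [
    (True, True, True),   # row 2
    (True, True, False),  # row 3
    (True, True, True),   # row 4
    (True, True, False),  # row 5
    (True, True, True),   # row 6
    (True, True, True),   # row 7
    (True, True, True),   # row 8
    (True, True, False),  # row 9
]

# `dopeness_rag_chain`: rows 0 through 9, all three metrics True in every row.
dopeness_rows = [(True, True, True)] * 10


def pass_rates(rows: list[tuple[bool, bool, bool]]) -> dict[str, float]:
    """Fraction of rows scored True for each metric (hypothetical helper)."""
    return {
        metric: sum(row[i] for row in rows) / len(rows)
        for i, metric in enumerate(METRICS)
    }


baseline_rates = pass_rates(baseline_rows)
dopeness_rates = pass_rates(dopeness_rows)

for metric in METRICS:
    print(f"{metric:12s} baseline={baseline_rates[metric]:.3f}  dope={dopeness_rates[metric]:.3f}")
```

On these rows, the only metric that differs is dopeness, which is consistent with the prompt change having the most visible effect. The LangSmith comparison view remains the source of truth for the full experiments.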

---
## Summary

In this session, we:

1. **Generated synthetic test data** using Ragas' knowledge graph-based approach
2. **Explored query synthesizers** for creating diverse question types
3. **Loaded synthetic data** into a LangSmith dataset for evaluation
4. **Built and evaluated a RAG chain** using LangSmith evaluators
5. **Iterated on the pipeline** by modifying chunk size, embedding model, and prompt, then measured the impact

### Key Takeaways:

- **Synthetic data generation** is critical for early iteration: it provides high-quality signal without manually creating test data
- **LangSmith evaluators** enable systematic comparison of pipeline versions
- **Small changes matter** ‚Äî chunk size, embedding model, and prompt modifications can significantly affect evaluation scores