# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow!



- 🤝 BREAKOUT ROOM #1
  1. Use RAGAS to Generate Synthetic Data

- 🤝 BREAKOUT ROOM #2
  1. Load them into a LangSmith Dataset
  2. Evaluate our RAG chain against the synthetic test data
  3. Make changes to our pipeline
  4. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

# 🤝 BREAKOUT ROOM #1

## Task 1: Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging several great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. OpenAI's endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith for our evaluation coordinator!

Let's install and provide all the required information below!

## Dependencies and API Keys:

> NOTE: DO NOT RUN THESE CELLS IF YOU ARE RUNNING THIS NOTEBOOK LOCALLY

In [None]:
#!pip install -qU ragas==0.2.10

In [None]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

### NLTK Import

To prevent errors that may occur based on OS - we'll import NLTK and download the needed packages to ensure correct handling of data.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ugurcekmez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ugurcekmez/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

We'll also want to set a project name to make things easier for ourselves.

In [3]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

OpenAI's API Key!

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
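
Before moving on, it can help to confirm that every key the later cells rely on is actually set. The helper below is a minimal sketch: `missing_env_vars` is a hypothetical name (not part of any SDK), and the list of required variables simply mirrors the ones set in the cells above.

```python
import os

# Variables set in the cells above; adjust the list if your setup differs.
REQUIRED_VARS = [
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
    "OPENAI_API_KEY",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dict as `env` makes the helper easy to try without touching your real environment.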

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set for measuring how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data by downloading the webpages we'll be working with today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir -p data


In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31493    0 31493    0     0  37610      0 --:--:-- --:--:-- --:--:-- 37581


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70519    0 70519    0     0   162k      0 --:--:-- --:--:-- --:--:--  162k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [8]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Unrolled SDG

In [9]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



Next, we're going to instantiate our Knowledge Graph.

This graph will contain N nodes connected by M relationships. These nodes and relationships (AKA "edges") define our knowledge graph and will be used later to construct relevant questions and responses.

In [10]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step we're going to take is to simply insert each of our full documents into the graph. This will provide a base that we can apply transformations to.

In [11]:
from ragas.testset.graph import Node, NodeType

for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 2, relationships: 0)

Now, we'll apply the *default* transformations to our knowledge graph. This will take the nodes currently on the graph and transform them based on a set of [default transformations](https://docs.ragas.io/en/latest/references/transforms/#ragas.testset.transforms.default_transforms).

The default transformations depend on the corpus length. In our case, they include:

- Producing Summaries -> produces summaries of the documents
- Extracting Headlines -> finds the headlines in each document
- Theme Extractor -> extracts broad themes from the documents

Ragas then uses cosine similarity and heuristics over the embeddings of the outputs above to construct relationships between the nodes.
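
To make the relationship-building idea concrete, here is a minimal pure-Python sketch of linking nodes by cosine similarity. It is not Ragas' implementation: the function names, the toy 3-dimensional vectors, and the `0.9` threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose embeddings are at least `threshold` similar."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Toy "summary embeddings" (made up for illustration only).
toy_embeddings = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}

print(build_relationships(toy_embeddings))
```

Only the pair pointing in nearly the same direction gets an edge; the orthogonal vector stays unconnected. Real embeddings have far more dimensions, so a sensible threshold has to be found empirically.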

In [12]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so we don't shadow the imported `default_transforms` function
transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

KnowledgeGraph(nodes: 14, relationships: 60)

We can save and load our knowledge graphs as follows.

In [13]:
kg.save("ai_across_years_kg.json")
ai_across_years_kg = KnowledgeGraph.load("ai_across_years_kg.json")
ai_across_years_kg

KnowledgeGraph(nodes: 14, relationships: 60)

Using our knowledge graph, we can construct a "test set generator" - which will allow us to create queries.

In [14]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=ai_across_years_kg)

However, we'd like to control the kinds of queries we generate, which Ragas makes simple by providing a number of pre-built query synthesizers.

Each of these synthesizers tackles a separate kind of query, which is generated from a scenario and a persona.

In essence, Ragas will use an LLM to generate a persona of someone who would interact with the data - and then use a scenario to construct a question from that data and persona.

In [15]:
from ragas.testset.synthesizers import default_query_distribution, SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]
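
The weights in `query_distribution` describe what share of the testset each synthesizer should produce. As a rough sketch of how such weights could translate into sample counts, the snippet below uses largest-remainder rounding. This is an assumption for illustration, not Ragas' actual allocation logic, and `allocate_samples` is a hypothetical helper.

```python
import math


def allocate_samples(weights, testset_size):
    """Split `testset_size` across named weights using largest-remainder rounding."""
    total = sum(weights.values())
    exact = {name: testset_size * w / total for name, w in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts (ties keep dict order).
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "single_hop_specific": 0.5,
    "multi_hop_abstract": 0.25,
    "multi_hop_specific": 0.25,
}
print(allocate_samples(weights, testset_size=10))
```

With a testset of 10, the two 0.25 weights cannot both receive exactly 2.5 samples, so one of them has to get the extra sample; which one depends on the rounding rule.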

#### ❓ Question #1:

What are the three types of query synthesizers doing? Describe each one in simple terms.


- SingleHopSpecificQuerySynthesizer : What is the definition of X mentioned in document Y?

- MultiHopAbstractQuerySynthesizer : What are the general themes discussed across documents X and Y regarding topic Z?

- MultiHopSpecificQuerySynthesizer : Compare the specific features of product A mentioned in document X with the limitations of product B discussed in document Y.

Finally, we can use our `TestsetGenerator` to generate our testset!

In [16]:
testset = generator.generate(testset_size=10, query_distribution=query_distribution)
testset.to_pandas()

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/11 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,2023 how did the LLMs get so easy to build and...,[My blog in 2023 Here’s the sequel to this pos...,"In 2023, it was found that LLMs are quite easy...",single_hop_specifc_query_synthesizer
1,"What are LLMs, and what challenges do they pre...",[openly licensed ones are still the world’s mo...,LLMs are large language models that are still ...,single_hop_specifc_query_synthesizer
2,How does the weblog describe the significance ...,[Simon Willison’s Weblog Subscribe Stuff we fi...,The weblog states that 2023 was the breakthrou...,single_hop_specifc_query_synthesizer
3,How does the concept of Stable Diffusion relat...,"[of what LLMs are, how they work and how they ...",The provided context does not explicitly menti...,single_hop_specifc_query_synthesizer
4,What is the significance of Alibaba in the con...,[Things we learned about LLMs in 2024 31st Dec...,"According to the provided context, Alibaba is ...",single_hop_specifc_query_synthesizer
5,How does the open licensing and accessibility ...,"[<1-hop>\n\non inference. The sequel to o1, o3...",The open licensing and accessibility of large ...,multi_hop_abstract_query_synthesizer
6,How do the advantages of synthetic data over o...,[<1-hop>\n\nways we should not be using genera...,The context explains that synthetic data offer...,multi_hop_abstract_query_synthesizer
7,How have emerging use-cases and applications o...,[<1-hop>\n\nThings we learned about LLMs in 20...,"In 2024, the field of large language models ha...",multi_hop_abstract_query_synthesizer
8,How do Meta's efforts in synthetic data and LL...,[<1-hop>\n\nways we should not be using genera...,Meta has emphasized the importance of syntheti...,multi_hop_specific_query_synthesizer
9,How do the recent advancements in large langua...,"[<1-hop>\n\nof what LLMs are, how they work an...",The 2024 review highlights significant advance...,multi_hop_specific_query_synthesizer


### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [17]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/18 [00:00<?, ?it/s]

unable to apply transformation: 'StringIO' object has no attribute 'output'


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [19]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How has OpenAI contributed to the development ...,[My blog in 2023 Here’s the sequel to this pos...,"According to the context, OpenAI was the organ...",single_hop_specifc_query_synthesizer
1,How is JavaScript relevant to large language m...,[openly licensed ones are still the world’s mo...,"The context mentions that writing code, includ...",single_hop_specifc_query_synthesizer
2,What are the recent developments and challenge...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, it was a breakthrough year for Large ...",single_hop_specifc_query_synthesizer
3,How does ChatGPT function as a large language ...,"[of what LLMs are, how they work and how they ...",ChatGPT is a large language model (LLM) that h...,single_hop_specifc_query_synthesizer
4,how LLMs are easy to build and their accessibi...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Simon Willison’s weblog states that LLMs are q...,multi_hop_abstract_query_synthesizer
5,how does code gen and exec by LLMs relate to c...,[<1-hop>\n\nopenly licensed ones are still the...,The context explains that writing code is one ...,multi_hop_abstract_query_synthesizer
6,Considering the advancements in Large Language...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,Simon Willison’s weblog notes that while 2023 ...,multi_hop_abstract_query_synthesizer
7,how AI research and competition is changing wi...,[<1-hop>\n\nThe rise of inference-scaling “rea...,The context shows that AI research and competi...,multi_hop_abstract_query_synthesizer
8,Wht is GPT-4 and how it is impcted in 2024?,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2024, GPT-4 was a major breakthrough in the...",multi_hop_specific_query_synthesizer
9,How do the themes of ChatGPT's multimodal capa...,[<1-hop>\n\nyou talk to me exclusively in Span...,The context highlights ChatGPT's multimodal fe...,multi_hop_specific_query_synthesizer


In [21]:
res = dataset.to_pandas()

In [26]:
print(res.values[:1])

[['How has OpenAI contributed to the development and accessibility of large language models, according to the context?'
  list(['My blog in 2023 Here’s the sequel to this post: Things we learned about LLMs in 2024. Large Language Models In the past 24-36 months, our species has discovered that you can take a GIANT corpus of text, run it through a pile of GPUs, and use it to create a fascinating new kind of software. LLMs can do a lot of things. They can answer questions, summarize documents, translate from one language to another, extract information and even write surprisingly competent code. They can also help you cheat at your homework, generate unlimited streams of fake content and be used for all manner of nefarious purposes. So far, I think they’re a net positive. I’ve used them on a personal level to improve my productivity (and entertain myself) in all sorts of different ways. I think people who learn how to use them effectively can gain a significant boost to their quality of 

As a reminder, we already provided our LangSmith API key and set tracing to "true" in Task 1, so we're ready to use LangSmith below.

# 🤝 BREAKOUT ROOM #2

## Task 4: LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [28]:
from langsmith import Client

client = Client()

dataset_name = "State of AI Across the Years!"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="State of AI Across the Years!"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [29]:
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )
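
The column-to-field mapping used above can also be isolated in a small function, which makes it easy to check without calling LangSmith. This is a sketch: `to_langsmith_example` is a hypothetical helper, and the sample row is made up but uses the same column names as the Ragas dataframe.

```python
def to_langsmith_example(row):
    """Map one Ragas testset row (dict-like) onto inputs/outputs/metadata."""
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {"context": row["reference_contexts"]},
    }


# A made-up row with the same column names as the Ragas dataframe.
sample_row = {
    "user_input": "What are Agents?",
    "reference_contexts": ["<placeholder context>"],
    "reference": "<placeholder reference answer>",
    "synthesizer_name": "single_hop_specifc_query_synthesizer",
}

example = to_langsmith_example(sample_row)
print(example["inputs"], example["outputs"])
```

The keys `question` and `answer` matter because the chain and the evaluators later read exactly those fields.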

## Basic RAG Chain

Time for some RAG!


In [30]:
rag_documents = docs

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [31]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
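
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines before falling back to raw characters); `sliding_window_chunks` is a hypothetical helper that only illustrates the two parameters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "x" * 1200
chunks = sliding_window_chunks(text, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

Each new window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share a short run of text and a sentence cut at a boundary still appears whole in at least one chunk.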

We'll create our vectorstore using OpenAI's [`text-embedding-3-small`](https://platform.openai.com/docs/guides/embeddings/embedding-models) embedding model.

In [32]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

As usual, we will power our RAG application with Qdrant!

In [33]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="State of AI"
)



In [34]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

To get the "A" in RAG, we'll provide a prompt.

In [35]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using OpenAI's endpoints as well!

We're going to be using OpenAI's `gpt-4.1-mini` to generate our answers.

In [36]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

Finally, we can set-up our RAG LCEL chain!

In [37]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [38]:
rag_chain.invoke({"question" : "What are Agents?"})

'Based on the provided context, "agents" is an infuriatingly vague and poorly defined term in AI. It generally refers to AI systems that can go away and act on your behalf, such as digital assistants or travel agents. There are two main interpretations: one sees agents as systems that perform actions on behalf of users (the "travel agent" model), and another sees them as large language models given access to tools which they can run iteratively to solve problems. However, the term lacks a single, clear, and widely understood meaning, and despite much discussion and excitement, fully functional AI agents have not yet been realized in production, partly due to challenges like the AI\'s gullibility.'
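
If the `|` pipeline syntax feels opaque, the sketch below mirrors the same data flow with plain Python functions. Everything here is a stand-in: `fake_retriever` and `fake_llm` are hypothetical stubs, not LangChain components, and they exist only to show how the input dict moves through each step.

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever: returns canned context strings."""
    return ["Agents are a vaguely defined term.", "LLMs can call tools in a loop."]


def format_prompt(inputs):
    """Fill a prompt template with the retrieved context and the question."""
    context = "\n".join(inputs["context"])
    return f"Context: {context}\nQuestion: {inputs['question']}\n"


def fake_llm(prompt):
    """Stand-in for the chat model: reports how much prompt text it received."""
    return {"content": f"(answer based on {len(prompt)} prompt characters)"}


def parse_output(message):
    """Mimic the output parser by pulling out the text content."""
    return message["content"]


def toy_rag_chain(inputs):
    # Step 1: build the context/question mapping from the same input dict.
    mapped = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Steps 2-4: prompt -> model -> string, mirroring the `|` pipeline.
    return parse_output(fake_llm(format_prompt(mapped)))


print(toy_rag_chain({"question": "What are Agents?"}))
```

The first dict in the real chain plays the role of `mapped`: both of its values are computed from the same input, then passed together to the prompt.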

## LangSmith Evaluation Set-up

We'll use OpenAI's GPT-4.1 as our evaluation LLM for our base Evaluators.

In [39]:
eval_llm = ChatOpenAI(model="gpt-4.1")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [40]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        },
        "llm" : eval_llm
    }
)
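
Beyond the LLM-judged evaluators above, a custom evaluator can be an ordinary function. The sketch below assumes the function-style evaluator convention (a callable taking `run` and `example` and returning a dict with `key` and `score`); check the LangSmith documentation for your installed version before relying on it. `answered_evaluator` is a hypothetical name, and the `SimpleNamespace` objects only imitate the real run/example objects for a local try-out.

```python
from types import SimpleNamespace


def answered_evaluator(run, example):
    """Score 1 for a non-empty answer, 0 for an empty one or an "I don't know" fallback."""
    prediction = (run.outputs or {}).get("output", "")
    answered = bool(prediction.strip()) and "i don't know" not in prediction.lower()
    return {"key": "answered", "score": 1 if answered else 0}


# Stand-ins for the run/example objects, used here only to exercise the function.
fake_run = SimpleNamespace(outputs={"output": "I don't know."})
fake_example = SimpleNamespace(
    inputs={"question": "What are Agents?"},
    outputs={"answer": "..."},
)
print(answered_evaluator(fake_run, fake_example))
```

A deterministic check like this costs nothing to run and can complement the LLM-judged criteria, for instance by showing how often the chain refuses to answer.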

#### 🏗️ Activity #2:

Highlight what each evaluator is evaluating.

- `qa_evaluator`:
- `labeled_helpfulness_evaluator`:
- `dope_or_nope_evaluator`:

## LangSmith Evaluation

In [41]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain_init"},
)

View the evaluation results for experiment: 'unique-glass-94' at:
https://smith.langchain.com/o/cf498312-9555-5a00-b1e2-6ef06fc7402f/datasets/21b33d2f-7c27-43d8-b252-a81d579b3f71/compare?selectedSessions=42accd94-3fa5-4b38-aadc-a1e22f25d183





Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do Microsoft Research and Microsoft contri...,"Based on the provided context, Microsoft Resea...",,Microsoft Research has played a significant ro...,0,0,0,2.297552,af8e8e73-f18c-4a74-94f8-d70e30af77ed,0c0941d4-3333-434a-a881-ac7245892474
1,How does Google’s integration of multimodal au...,Google’s integration of multimodal audio and v...,,Google’s integration of multimodal audio and v...,1,1,0,4.129476,8cac15f3-0379-44cc-b517-986808a43bfa,1ae268de-3069-46b5-8f79-d466899402a3
2,How do the themes of ChatGPT's multimodal capa...,"Based on the provided context, ChatGPT's multi...",,The context highlights ChatGPT's multimodal fe...,1,1,0,8.844527,11860319-3490-4a58-9bba-4f0d6487dc03,dfa11d61-bd85-4ff1-b1d3-e8a8110aeb8f
3,Wht is GPT-4 and how it is impcted in 2024?,Based on the context provided:\n\nGPT-4 is Ope...,,"In 2024, GPT-4 was a major breakthrough in the...",1,1,0,8.582754,524a6a0a-783b-4583-8d2d-4f672c6336ac,ebb02b5b-229f-47ec-9f49-143b26c85117
4,how AI research and competition is changing wi...,"Based on the provided context, here is what ca...",,The context shows that AI research and competi...,1,1,0,12.201042,4d5452b4-e148-4f98-a644-6b678ad4a8d7,7bd64c8b-adeb-4103-8df2-09fcbeefe8c8
5,Considering the advancements in Large Language...,Based on the provided context from Simon Willi...,,Simon Willison’s weblog notes that while 2023 ...,1,1,0,10.995598,84802d5f-3e86-4c23-ab09-f81d1e6e6665,abac11d9-6740-4d07-ab9f-4d4ce8c8390b
6,how does code gen and exec by LLMs relate to c...,"Based on the context, code generation and exec...",,The context explains that writing code is one ...,1,1,0,4.10554,a0572a83-269f-4be5-9be3-83eefa61fea1,7a7ace50-b4a7-4df4-bebf-4fe804b4f062
7,how LLMs are easy to build and their accessibi...,"According to the context, LLMs (Large Language...",,Simon Willison’s weblog states that LLMs are q...,1,0,0,2.897226,1a471055-cb33-470b-8ae5-559e175e415f,5c03888e-8c4a-4489-ab1c-5db33d822f6a
8,How does ChatGPT function as a large language ...,Based on the provided context:\n\nChatGPT func...,,ChatGPT is a large language model (LLM) that h...,1,0,0,7.077579,c8c58f74-2aef-49d5-b810-96a14c993608,ab95f32e-38d9-4676-877c-06773ec9b898
9,What are the recent developments and challenge...,"Based on the provided context, the recent deve...",,"In 2023, it was a breakthrough year for Large ...",1,1,0,10.518278,ce0d5f63-6985-48e1-a490-ed384a90afb4,c72336a3-5030-44f1-865f-7fac49087110


## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Upgrade the embedding model to `text-embedding-3-large`

Let's see how this changes our evaluation!

In [45]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [46]:
rag_documents = docs

In [47]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

#### ❓ Question #2:

Why would modifying our chunk size modify the performance of our application?

In [48]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

#### ❓ Question #3:

Why would modifying our embedding model modify the performance of our application?

In [49]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="AI Across Years (Augmented)"
)

In [50]:
retriever = vectorstore.as_retriever()

Setting up our new and improved DOPE RAG CHAIN.

In [51]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

Let's test it on the same question that we asked before.

In [52]:
dope_rag_chain.invoke({"question" : "what are Agents?"})

'Alright, here’s the lowdown on “Agents” straight from the vibe of the context:\n\nAgents are this kinda slippery, buzzwordy concept in AI that folks hype up but don’t really nail down. Some see agents like travel agents—they act on your behalf, doing stuff for you. Others think of them as large language models (LLMs) hooked up to tools, running loops to solve problems. But here’s the kicker: nobody’s got a solid, clear definition that everyone agrees on. It’s kinda like chasing a mirage — agents *feel* like they’re “coming soon,” but the reality is they’re still struggling with big issues like gullibility (they believe everything, even if it’s fake). Without overcoming that, real, trustworthy agents that can go off and *act* autonomously? They’re still a ways out.\n\nSo in short: Agents = AI doers with blurry boundaries, promising but still hampered by trust issues and fuzzy definitions. Cool concept, still cooking though.'

Finally, we can evaluate the new chain on the same test set!

In [53]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)

View the evaluation results for experiment: 'reflecting-store-38' at:
https://smith.langchain.com/o/cf498312-9555-5a00-b1e2-6ef06fc7402f/datasets/21b33d2f-7c27-43d8-b252-a81d579b3f71/compare?selectedSessions=59c9cac1-4077-49fb-96d6-da27b76d5df0





Unnamed: 0,inputs.question,outputs.output,error,reference.answer,feedback.correctness,feedback.helpfulness,feedback.dopeness,execution_time,example_id,id
0,How do Microsoft Research and Microsoft contri...,"Yo, based on the fresh context you dropped, th...",,Microsoft Research has played a significant ro...,0,0,1,2.943974,af8e8e73-f18c-4a74-94f8-d70e30af77ed,757d343a-c1fe-47f8-a4de-e3d5ab4bb9dc
1,How does Google’s integration of multimodal au...,"Yo, here’s the lowdown straight from the conte...",,Google’s integration of multimodal audio and v...,1,1,1,4.555548,8cac15f3-0379-44cc-b517-986808a43bfa,af66b803-1635-4041-aeab-6c03676ab4da
2,How do the themes of ChatGPT's multimodal capa...,"Alright, here’s the lowdown, fresh and fly: Ch...",,The context highlights ChatGPT's multimodal fe...,1,1,1,10.413519,11860319-3490-4a58-9bba-4f0d6487dc03,175e2f46-1b6b-411b-9270-b64c7530dcad
3,Wht is GPT-4 and how it is impcted in 2024?,"Alright, here’s the lowdown on GPT-4 and its 2...",,"In 2024, GPT-4 was a major breakthrough in the...",1,1,1,6.612632,524a6a0a-783b-4583-8d2d-4f672c6336ac,24dc84fd-9c84-45d4-801c-00bacc9e97df
4,how AI research and competition is changing wi...,"Yo, here’s the lowdown on how AI research and ...",,The context shows that AI research and competi...,1,1,1,13.81795,4d5452b4-e148-4f98-a644-6b678ad4a8d7,40b65c50-b020-4916-929c-71e777d7cc56
5,Considering the advancements in Large Language...,"Yo, here’s the lowdown straight from Simon Wil...",,Simon Willison’s weblog notes that while 2023 ...,1,1,1,5.031738,84802d5f-3e86-4c23-ab09-f81d1e6e6665,2cd99f42-96ce-4015-8ca9-dd5ad99ec2c5
6,how does code gen and exec by LLMs relate to c...,"Alright, here’s the lowdown—when LLMs whip up ...",,The context explains that writing code is one ...,1,1,1,5.718597,a0572a83-269f-4be5-9be3-83eefa61fea1,c0b71147-2993-4a32-924d-a86de64d1278
7,how LLMs are easy to build and their accessibi...,"Yo, check this out—LLMs ain’t the beasts you m...",,Simon Willison’s weblog states that LLMs are q...,1,1,1,4.127296,1a471055-cb33-470b-8ae5-559e175e415f,07a7feb1-694f-4a55-b8dd-e206d03cd453
8,How does ChatGPT function as a large language ...,"Alright, here’s the lowdown, fresh and fly: Ch...",,ChatGPT is a large language model (LLM) that h...,1,0,1,6.065333,c8c58f74-2aef-49d5-b810-96a14c993608,3c8fdb55-2e15-43f9-bec3-1e7c5b211494
9,What are the recent developments and challenge...,"Yo, check it out—2023 was the big breakout yea...",,"In 2023, it was a breakthrough year for Large ...",1,1,1,5.110234,ce0d5f63-6985-48e1-a490-ed384a90afb4,c07dcd5a-56b2-4683-a424-5ff8de3d5706
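
To compare the two experiments numerically, we can average each feedback column. The sketch below hard-codes the `(correctness, helpfulness, dopeness)` scores copied from the two result tables above; `metric_means` is a hypothetical helper, and in practice you would read these values from the LangSmith results instead.

```python
from statistics import mean

# (correctness, helpfulness, dopeness) per example, copied from the two tables above.
baseline = [(0, 0, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0),
            (1, 1, 0), (1, 1, 0), (1, 0, 0), (1, 0, 0), (1, 1, 0)]
dope = [(0, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
        (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 1, 1)]

METRICS = ("correctness", "helpfulness", "dopeness")


def metric_means(rows):
    """Average each feedback column across all examples."""
    return {name: mean(row[i] for row in rows) for i, name in enumerate(METRICS)}


before, after = metric_means(baseline), metric_means(dope)
for name in METRICS:
    delta = after[name] - before[name]
    print(f"{name}: {before[name]:.1f} -> {after[name]:.1f} ({delta:+.1f})")
```

With only 10 examples, a single flipped judgment moves a metric by 0.1, so small differences should be read as directional hints rather than firm conclusions.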


#### 🏗️ Activity #3:

Provide a screenshot of the difference between the two chains, and explain why you believe certain metrics changed in certain ways.