# RAG - Evaluation

LangChain offers several built-in evaluators that you can use to test the effectiveness of your RAG pipeline. Because this RAG pipeline answers questions, the QA evaluators are a good fit.
Remember that LLMs are probabilistic: responses will not be identical for each invocation. Evaluation results will therefore differ between runs, and they may be imperfect. Use these metrics as one part of a larger, holistic testing strategy for your RAG application.
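
Because scores vary between runs, it can help to repeat an evaluation several times and look at the spread rather than a single number. The following is a minimal sketch, not part of LangChain: `summarize_repeated_runs` and the replayed scores are hypothetical stand-ins for a real evaluation call.

```python
from statistics import mean, pstdev
from typing import Callable, List


def summarize_repeated_runs(run_eval: Callable[[], float], n_runs: int = 5) -> dict:
    """Call a (possibly non-deterministic) evaluation n_runs times and summarize the scores."""
    scores: List[float] = [run_eval() for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical stand-in for a real evaluation run: it replays fixed scores
# so the example is reproducible without calling an LLM.
_fake_scores = iter([0.75, 1.0, 0.75, 0.5, 1.0])
summary = summarize_repeated_runs(lambda: next(_fake_scores), n_runs=5)
print(summary["mean"], summary["min"], summary["max"])
```

In practice `run_eval` would wrap one full evaluation pass and return its fraction of correct answers; a wide gap between `min` and `max` suggests that a single run should not be trusted on its own.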

In [28]:
import os
import getpass
import urllib.request
from dotenv import dotenv_values
from langchain_community.document_loaders import TextLoader
from langchain_core.messages import HumanMessage
from langchain_openai import AzureChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.globals import set_llm_cache
from langchain.cache import SQLiteCache
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough

In [2]:
os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass(prompt="Enter your Azure OpenAI API Key: ")

In [3]:
os.environ["AZURE_OPENAI_ENDPOINT"] = getpass.getpass(prompt="Enter your Azure OpenAI Endpoint: ")

In [30]:
filename = '../data/twenty-thousand-leagues-under-the-sea.txt'
loader = TextLoader(filename)
documents = loader.load()

In [31]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 1076, which is longer than the specified 1000
Created a chunk of size 1042, which is longer than the specified 1000
Created a chunk of size 1378, which is longer than the specified 1000
Created a chunk of size 1094, which is longer than the specified 1000
Created a chunk of size 1127, which is longer than the specified 1000
Created a chunk of size 1057, which is longer than the specified 1000
Created a chunk of size 1240, which is longer than the specified 1000
Created a chunk of size 1034, which is longer than the specified 1000
Created a chunk of size 1084, which is longer than the specified 1000
Created a chunk of size 1048, which is longer than the specified 1000
Created a chunk of size 1149, which is longer than the specified 1000
Created a chunk of size 1289, which is longer than the specified 1000
Created a chunk of size 1091, which is longer than the specified 1000
Created a chunk of size 1016, which is longer than the specified 1000
Created a chunk of s
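
The warnings above are expected with `CharacterTextSplitter`: it splits on a single separator (a blank line by default) and then merges pieces up to `chunk_size`, so a paragraph that is longer than `chunk_size` is kept whole. The sketch below is a simplified, hypothetical re-implementation of that idea in plain Python, not LangChain's actual code; `split_on_separator` is an illustrative name.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece that is longer than chunk_size cannot be split further,
    so it becomes an oversized chunk on its own.
    """
    chunks: List[str] = []
    current: List[str] = []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk first.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "short one\n\n" + "x" * 30 + "\n\nshort two"
chunks = split_on_separator(text, chunk_size=20)
print([len(c) for c in chunks])  # the middle chunk is 30 characters, above the limit of 20
```

If strict size limits matter, a splitter that falls back to finer separators, such as LangChain's `RecursiveCharacterTextSplitter`, is one option to consider.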

### Configure embeddings model and the vector store

In [32]:
embeddings_model = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",
    openai_api_version="2023-05-15",
)

In [34]:
vector_db = Chroma.from_documents(
    documents=docs,
    embedding=embeddings_model,
    persist_directory="./chroma_db"
)

### Query the database

In [35]:
query = "What is the Nautilus?"
data = vector_db.similarity_search(query)
data[0].page_content

'The platform was only three feet out of water. The front and back of\nthe _Nautilus_ was of that spindle-shape which caused it justly to be\ncompared to a cigar. I noticed that its iron plates, slightly\noverlaying each other, resembled the shell which clothes the bodies of\nour large terrestrial reptiles. It explained to me how natural it was,\nin spite of all glasses, that this boat should have been taken for a\nmarine animal.\n\nToward the middle of the platform the long-boat, half buried in the\nhull of the vessel, formed a slight excrescence. Fore and aft rose two\ncages of medium height with inclined sides, and partly closed by thick\nlenticular glasses; one destined for the steersman who directed the\n_Nautilus_, the other containing a brilliant lantern to give light on\nthe road.'

### Basic retrieval

In [36]:
retriever = vector_db.as_retriever(search_kwargs={"k": 3})

In [37]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc)

page_content='The platform was only three feet out of water. The front and back of\nthe _Nautilus_ was of that spindle-shape which caused it justly to be\ncompared to a cigar. I noticed that its iron plates, slightly\noverlaying each other, resembled the shell which clothes the bodies of\nour large terrestrial reptiles. It explained to me how natural it was,\nin spite of all glasses, that this boat should have been taken for a\nmarine animal.\n\nToward the middle of the platform the long-boat, half buried in the\nhull of the vessel, formed a slight excrescence. Fore and aft rose two\ncages of medium height with inclined sides, and partly closed by thick\nlenticular glasses; one destined for the steersman who directed the\n_Nautilus_, the other containing a brilliant lantern to give light on\nthe road.' metadata={'source': '../data/twenty-thousand-leagues-under-the-sea.txt'}
page_content='During several hours the _Nautilus_ floated in these brilliant waves,\nand our admiration increased

### Maximum marginal relevance retrieval 
By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.
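
To make the idea concrete, here is a minimal sketch of maximum marginal relevance in plain Python. It greedily picks the document that is most relevant to the query while penalizing similarity to documents already selected. The function names and the toy 2-D vectors are hypothetical; only the `lambda_mult` name mirrors the parameter LangChain exposes through `search_kwargs`.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec: Sequence[float], doc_vecs: List[Sequence[float]],
               k: int = 2, lambda_mult: float = 0.5) -> List[int]:
    """Return indices of k documents chosen by maximum marginal relevance."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0.0 for the first pick).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],   # 0: most relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.7, -0.70],  # 2: less relevant, but different from document 0
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # diversity-aware: [0, 2]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance: [0, 1]
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, which is why the near-duplicate is returned; lower values trade relevance for diversity.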

In [38]:
retriever = vector_db.as_retriever(search_type="mmr")

In [39]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc)

page_content='The platform was only three feet out of water. The front and back of\nthe _Nautilus_ was of that spindle-shape which caused it justly to be\ncompared to a cigar. I noticed that its iron plates, slightly\noverlaying each other, resembled the shell which clothes the bodies of\nour large terrestrial reptiles. It explained to me how natural it was,\nin spite of all glasses, that this boat should have been taken for a\nmarine animal.\n\nToward the middle of the platform the long-boat, half buried in the\nhull of the vessel, formed a slight excrescence. Fore and aft rose two\ncages of medium height with inclined sides, and partly closed by thick\nlenticular glasses; one destined for the steersman who directed the\n_Nautilus_, the other containing a brilliant lantern to give light on\nthe road.' metadata={'source': '../data/twenty-thousand-leagues-under-the-sea.txt'}
page_content='But what has become of the _Nautilus?_ Did it resist the pressure of\nthe maelstrom? Does Captain N

### Similarity score threshold retrieval
You can also configure the retriever with a similarity score threshold, so that it returns only documents whose score is above that threshold.
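
Conceptually, threshold retrieval is a filter over scored results. The sketch below illustrates this in plain Python under the assumption that scores are relevance values between 0 and 1 where higher is better; `retrieve_above_threshold`, the chunk labels, and the scores are all hypothetical.

```python
from typing import List, Tuple


def retrieve_above_threshold(scored_chunks: List[Tuple[str, float]],
                             score_threshold: float, k: int = 4) -> List[Tuple[str, float]]:
    """Keep chunks scoring at least score_threshold, best first, at most k of them."""
    kept = [(text, score) for text, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Hypothetical (chunk, relevance score) pairs, as a vector store might return them.
scored_chunks = [
    ("chunk B", 0.61),
    ("chunk A", 0.82),
    ("chunk C", 0.34),
]
results = retrieve_above_threshold(scored_chunks, score_threshold=0.5)
print(results)  # chunk C is dropped because 0.34 is below the threshold
```

Note that a high threshold can legitimately return no documents at all, so downstream code should handle an empty context.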

In [40]:
retriever = vector_db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)

In [41]:
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc)

page_content='The platform was only three feet out of water. The front and back of\nthe _Nautilus_ was of that spindle-shape which caused it justly to be\ncompared to a cigar. I noticed that its iron plates, slightly\noverlaying each other, resembled the shell which clothes the bodies of\nour large terrestrial reptiles. It explained to me how natural it was,\nin spite of all glasses, that this boat should have been taken for a\nmarine animal.\n\nToward the middle of the platform the long-boat, half buried in the\nhull of the vessel, formed a slight excrescence. Fore and aft rose two\ncages of medium height with inclined sides, and partly closed by thick\nlenticular glasses; one destined for the steersman who directed the\n_Nautilus_, the other containing a brilliant lantern to give light on\nthe road.' metadata={'source': '../data/twenty-thousand-leagues-under-the-sea.txt'}
page_content='During several hours the _Nautilus_ floated in these brilliant waves,\nand our admiration increased

In [42]:
prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

In [54]:
llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="gpt-4",
)

In [44]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke(
    "In the given context, what is the Nautilus"
)

'The Nautilus is a vessel or boat that is able to travel under water.'

### Evaluation

In [45]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

langsmith_api_key = "LANGCHAIN_API_KEY"
if langsmith_api_key not in os.environ:
    os.environ[langsmith_api_key] = getpass.getpass(f"Enter {langsmith_api_key}: ")

os.environ["LANGCHAIN_PROJECT"] = input(
    "Project: "
)  # if not specified, defaults to "default"

In [68]:
eval_questions = [
    "What is the name of Captain Nemo's submarine?",
    "What passengers are there on Captain Nemo's submarine?",
    "How can Professor Aronnax, Ned Land and Conseil escape?",
    "What do the passengers eat for dinner?",
]

eval_answers = [
    "The name of Captain Nemo's submarine is Nautilus. It's featured in Jules Verne's novels Twenty Thousand Leagues Under the Sea and The Mysterious Island", 
    "The passengers aboard the Nautilus includes Professor Pierre Aronnax, a french marine biologist and the story's narrator, Conseil the loyal servant and Ned Land, a canadian harpooner.", 
    "They sneak onto a separate boat and make their escape from the submarine",
    "The passengers enjoy a variety of exotic seafood dishes like sea cucumber, seaweed, and other undersea plants and creatures, prepared in a sophisticated manner",
]

#examples = zip(eval_questions, eval_answers)
examples = [{"query": q, "ground_truths": [eval_answers[i]]} 
           for i, q in enumerate(eval_questions)]

In [69]:
list(examples)

[{'query': "What is the name of Captain Nemo's submarine?",
  'ground_truths': ["The name of Captain Nemo's submarine is Nautilus. It's featured in Jules Verne's novels Twenty Thousand Leagues Under the Sea and The Mysterious Island"]},
 {'query': "What passengers are there on Captain Nemo's submarine?",
  'ground_truths': ["The passengers aboard the Nautilus includes Professor Pierre Aronnax, a french marine biologist and the story's narrator, Conseil the loyal servant and Ned Land, a canadian harpooner."]},
 {'query': 'How can Professor Aronnax, Ned Land and Conseil escape?',
  'ground_truths': ['They sneak onto a separate boat and make their escape from the submarine']},
 {'query': 'What do the passengers eat for dinner?',
  'ground_truths': ['The passengers enjoy a variety of exotic seafood dishes like sea cucumber, seaweed, and other undersea plants and creatures, prepared in a sophisticated manner']}]

In [70]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
        llm,
        retriever=vector_db.as_retriever(),
        return_source_documents=True,
)

In [81]:
examples[1]["ground_truths"]

["The passengers aboard the Nautilus includes Professor Pierre Aronnax, a french marine biologist and the story's narrator, Conseil the loyal servant and Ned Land, a canadian harpooner."]

In [82]:
# `run` is not supported here because the chain returns two output keys ('result' and
# 'source_documents'); call the chain with the question (not the reference answer) instead.
qa.invoke({"query": examples[1]["query"]})["result"]

In [61]:
# Create your dataset in LangSmith
from langsmith import Client
from langsmith.utils import LangSmithError

client = Client()
dataset_name = "test_eval_dataset"

try:
    # Check if dataset exists
    dataset = client.read_dataset(dataset_name=dataset_name)
    print("using existing dataset: ", dataset.name)
except LangSmithError:
    # If not, create a new one with the eval questions
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="sample evaluation dataset",
    )
    # Each example is a dict, so read the question and reference answer by key
    for example in examples:
        client.create_example(
            inputs={"input": example["query"]},
            outputs={"answer": example["ground_truths"][0]},
            dataset_id=dataset.id,
        )

    print("Created a new dataset: ", dataset.name)

using existing dataset:  test_eval_dataset


In [62]:
from langchain.chains import RetrievalQA

# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
# This is so any state in the chain is not reused when evaluating individual examples.
def create_qa_chain(llm, vector_store, return_context=True):
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vector_store.as_retriever(),
        return_source_documents=return_context,
    )
    return qa_chain

In [63]:
from langsmith import Client
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    # LangChain offers several QA evaluator types
    evaluators=[
        "qa", # grades a response as correct or incorrect based on the reference answer
        "context_qa", # uses reference context to determine correctness
        # "cot_qa", # similar to context_qa, but uses chain-of-thought
    ],
    prediction_key="result",
    # Use the Azure OpenAI model as the grader. Without eval_llm, the evaluators try to
    # create a default 'gpt-4' OpenAI model and fail with:
    # ValueError: Evaluation with the <class 'langchain.evaluation.qa.eval_chain.QAEvalChain'>
    # requires a language model to function. Failed to create the default 'gpt-4' model.
    eval_llm=llm,
)

client = Client()
run_on_dataset(
    dataset_name=dataset_name,
    # Pass a factory (not a chain instance) so each example gets a fresh chain
    llm_or_chain_factory=lambda: create_qa_chain(llm=llm, vector_store=vector_db),
    client=client,
    evaluation=evaluation_config,
    verbose=True,
)

View the evaluation results for project 'cooked-judge-39' at:
https://smith.langchain.com/o/93114361-1b4f-5f52-9a6b-e22e0e8fa9ec/datasets/f5b150f6-a2aa-4fd8-aab9-343a08ae245a/compare?selectedSessions=68b143ab-ec88-454e-8998-d6c907deffd1

View all tests for Dataset test_eval_dataset at:
https://smith.langchain.com/o/93114361-1b4f-5f52-9a6b-e22e0e8fa9ec/datasets/f5b150f6-a2aa-4fd8-aab9-343a08ae245a