# Lab | Langchain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all steps involve and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debuging)
* LLM-assisted evaluation

In [4]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["PINECONE_KEY"] = userdata.get("PINECONE_KEY")

In [5]:
!pip install langchain_community



In [6]:
!pip install langchain_openai langchain_huggingface langchain_classic



### Example 1

#### Create our QandA application

In [7]:
# from langchain.chains import RetrievalQA
# from langchain_openai import ChatOpenAI
# from langchain.llms import OpenAI
# from langchain_huggingface import HuggingFaceEmbeddings
# from langchain.document_loaders import CSVLoader, TextLoader
# from langchain.indexes import VectorstoreIndexCreator
# from langchain.vectorstores import DocArrayInMemorySearch
# from langchain.chains import LLMChain

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain.agents import create_agent
from langchain_classic.tools.retriever import create_retriever_tool
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [8]:
from langchain_community.document_loaders import CSVLoader, TextLoader

In [9]:
file = '/content/data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [10]:
# !pip install --upgrade --force-reinstall sentence-transformers

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
doc_splits = text_splitter.split_documents(data)

# Create embeddings with HuggingFace
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)

# Create vector store and add documents
vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits,
    embedding=embeddings
)

In [12]:
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}  # Return top 5 most similar documents
)

In [13]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Create a simple RAG chain
template = """Use the following context to answer the question:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)


llm = ChatOpenAI(temperature = 0.0,api_key=os.environ["OPENAI_API_KEY"])
# qa = RetrievalQA.from_chain_type(
#     llm=llm,
#     chain_type="stuff",
#     retriever=index.vectorstore.as_retriever(),
#     verbose=True,
#     chain_type_kwargs = {
#         "document_separator": "<<<<>>>>>"
#     }
# )
chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
result = chain.invoke("Do the Cozy Comfort Pullover Set")
print(result)

and the Cozy Cuddles Knit Pullover Set have the same fabric composition?


#### Coming up with test datapoints

In [14]:
data[10]

Document(metadata={'source': '/content/data/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [15]:
data[11]

Document(metadata={'source': '/content/data/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

#### Hard-coded examples

In [16]:
from langchain_core.prompts import PromptTemplate


In [17]:
from langchain_core.output_parsers import BaseOutputParser

from pydantic import BaseModel, Field

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = ChatPromptTemplate.from_template(
    """Examples:
1. Query: Do the Cozy Comfort Pullover Set have side pockets?
   Answer: Yes
2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
   Answer: The DownTek collection

Query: {query}
Answer:"""
)
# prompt_template = ChatPromptTemplate.from_template(
#     "Question: {question}\nAnswer:"
# )

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
# llm = ChatOpenAI()

# Create the LLMChain
# llm_chain = LLMChain(
#     llm=llm,
#     prompt=prompt_template,
#     output_parser=AnswerOutputParser()
# )
llm_chain=prompt | llm | AnswerOutputParser()
# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
# result = llm_chain.invoke({"query": query})
result = llm_chain.invoke({
    "question": query,       # matches {question}
    "context": ""            # or actual context string that your prompt expects
})

# Print the result
print(result)


answer='Yes, the Cozy Comfort Pullover Set is available in multiple colors.'


#### LLM-Generated examples

In [18]:
from langchain_classic.evaluation.qa import QAGenerateChain

In [19]:
example_gen_chain = QAGenerateChain.from_llm(llm)

In [20]:
llm_chain_qa = prompt_template | llm

In [21]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

  new_examples = example_gen_chain.apply_and_parse(


In [22]:
new_examples[0]

{'qa_pairs': {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."}}

In [23]:
data[0]

Document(metadata={'source': '/content/data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [24]:
d_flattened = [data['qa_pairs'] for data in new_examples]
d_flattened

[{'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What features does the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece offer?",
  'answer': 'The swimsuit offers bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage. It is recommended to machine wash and line dry for best results.'},
 {'query': 'What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?',
  'answer': 'Th

#### Combine examples

In [25]:
# examples += new_example
examples += d_flattened

In [26]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'answer': 'Yes'}

In [27]:
llm_chain_qa.invoke(examples[0]["query"])

AIMessage(content='Yes, the Cozy Comfort Pullover Set does have side pockets.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 76, 'total_tokens': 90, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-Cg7Ke8Zp7nnGlEuQRAnEblo8DBplq', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--c57f49d9-1a29-4a03-8ae5-a76d505d6f23-0', usage_metadata={'input_tokens': 76, 'output_tokens': 14, 'total_tokens': 90, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

### Manual Evaluation - Fun part

In [28]:
import langchain
langchain.debug = True

In [29]:
llm_chain_qa.invoke(examples[0]["query"])

AIMessage(content='Yes, the Cozy Comfort Pullover Set does have side pockets.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 76, 'total_tokens': 90, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-Cg7KehSTyApPYxmwGkoY01DXRjAxS', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--0af3f30b-1d16-4dd9-b436-7bb39c4fe36c-0', usage_metadata={'input_tokens': 76, 'output_tokens': 14, 'total_tokens': 90, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

In [30]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [31]:
examples += d_flattened

In [32]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What features does the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece offer?",
  'answer': 'The swimsuit offers bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully li

In [33]:
predictions = llm_chain_qa.batch(examples)

In [34]:
predictions

[AIMessage(content='Yes, the Cozy Comfort Pullover Set does have side pockets.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 76, 'total_tokens': 90, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-Cg7KenfyeW7ypo5Pzuv6By8KgLJVn', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--1319accf-058d-40d2-bb7a-1dc1f05afad5-0', usage_metadata={'input_tokens': 76, 'output_tokens': 14, 'total_tokens': 90, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}),
 AIMessage(content='The DownTek collection', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'comple

In [35]:
from langchain_classic.evaluation.qa import QAEvalChain

In [36]:
llm = ChatOpenAI(temperature=0,api_key=os.environ["OPENAI_API_KEY"])
eval_chain = QAEvalChain.from_llm(llm)

In [37]:
print(type(predictions[0]))
print(predictions[0])

<class 'langchain_core.messages.ai.AIMessage'>
content='Yes, the Cozy Comfort Pullover Set does have side pockets.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 14, 'prompt_tokens': 76, 'total_tokens': 90, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-Cg7KenfyeW7ypo5Pzuv6By8KgLJVn', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--1319accf-058d-40d2-bb7a-1dc1f05afad5-0' usage_metadata={'input_tokens': 76, 'output_tokens': 14, 'total_tokens': 90, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}


In [38]:
predictions_fixed = [{"prediction": p.content} for p in predictions]

In [39]:
print("example[0]:", examples[0])
print("prediction[0]:", predictions_fixed[0])

example[0]: {'query': 'Do the Cozy Comfort Pullover Set        have side pockets?', 'answer': 'Yes'}
prediction[0]: {'prediction': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}


In [40]:
# graded_outputs = eval_chain.evaluate(examples, predictions)
graded_outputs = eval_chain.evaluate(
    examples,
    predictions_fixed,
    question_key="query",    # key name in your examples
    answer_key="answer",        # key name in your examples
    prediction_key="prediction" # key used in predictions_fixed
)

In [41]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]

In [42]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + examples[i]['query'])
    print("Real Answer: " + examples[i]['answer'])
    print("Predicted Answer: " + predictions_fixed[i]['prediction'])
    # print("Predicted Grade: " + graded_outputs[i]['text'])


Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The DownTek collection
Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb 6 oz.
Example 3:
Question: What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The small size is 18" x 27" and the medium size is 27" x 36".
Ex

### Example 2
One can also easily evaluate your QA chains with the metrics offered in ragas

In [43]:
pip install docarray



In [71]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.docarray import DocArrayInMemorySearch
loader = TextLoader("/content/data/nyc_text.txt")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
docs = loader.load()
doc_splits = text_splitter.split_documents(docs)

# Create embeddings with HuggingFace
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)

# Create vector store and add documents
vectorstore = DocArrayInMemorySearch.from_documents(doc_splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k":5})

llm = llm
# qa_chain = RetrievalQA.from_chain_type(
#     llm,
#     retriever=index.vectorstore.as_retriever(),
#     return_source_documents=True,
# )
print(prompt)
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

qa_chain = RunnableParallel(
    {
        "question": RunnablePassthrough(),
        "docs": retriever,
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
    }
).assign(
    answer=lambda x: (prompt | llm | StrOutputParser()).invoke({
        "context": x["context"],
        "question": x["question"]
    })
)



input_variables=['context', 'question'] input_types={} partial_variables={} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Use the following context to answer the question:\n\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})]


In [75]:
# testing it out

question = "How did New York City get its name?"
answer = qa_chain.invoke(question)
print("Answer:", answer["answer"])

Answer: New York City was originally named New Amsterdam by Dutch colonists in 1626. It was later renamed New York after King Charles II of England granted the lands to his brother, the Duke of York, when the city came under British control in 1664.


In [76]:
result

{'question': 'Your question',
 'docs': [Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== English rule ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Public safety ===\n\n\n==== Police and law enforcement ===='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Sexual orientation and gender identity ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Wall Street ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Accent and dialect ===')],
 'context': '=== English rule ===\n\n=== Public safety ===\n\n\n==== Police and law enforcement ====\n\n=== Sexual orientation and gender identity ===\n\n=== Wall Street ===\n\n=== Accent and dialect ===',
 'answer': 'What are some areas or topics that may fall under the English rule?'}

Now in order to evaluate the qa system we generated a few relevant questions. We've generated a few question for you but feel free to add any you want.

In [77]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [78]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluation with langchain and langsmith.

The evaluator chain has the following APIs

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: method implemented that is called by langsmith evaluators to evaluate langsmith datasets.

lets see each of them in action to learn more.

In [79]:
question=eval_questions[1]

answer = qa_chain.invoke(question)
print("Answer:", answer)

Answer: {'question': 'Which borough of New York City has the highest population?', 'docs': [Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York City's population is about 44% of New York State's population, and about 39% of the population of the New York metropolitan area. The majority of New Yo

In [84]:
result

{'question': 'Your question',
 'docs': [Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== English rule ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Public safety ===\n\n\n==== Police and law enforcement ===='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Sexual orientation and gender identity ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Wall Street ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Accent and dialect ===')],
 'context': '=== English rule ===\n\n=== Public safety ===\n\n\n==== Police and law enforcement ====\n\n=== Sexual orientation and gender identity ===\n\n=== Wall Street ===\n\n=== Accent and dialect ===',
 'answer': 'What are some areas or topics that may fall under the English rule?'}

In [85]:
key_mapping = {
    "question": "question",
    "answer": "answer",
    "docs": "docs"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [86]:
result_updated

{'question': 'Your question',
 'answer': 'What are some areas or topics that may fall under the English rule?',
 'docs': [Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== English rule ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Public safety ===\n\n\n==== Police and law enforcement ===='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Sexual orientation and gender identity ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Wall Street ==='),
  Document(metadata={'source': '/content/data/nyc_text.txt'}, page_content='=== Accent and dialect ===')]}

In [54]:
# !pip install --no-cache-dir recordclass

In [55]:
!pip install ragas



In [58]:
!pip install ragas langchain



In [97]:
os.environ["LANGSMITH_API_KEY"] = userdata.get("LANGSMITH_API_KEY")

In [127]:

# from ragas import evaluate
from langsmith import Client, evaluate
os.environ["LANGSMITH_PROJECT"] = "default"
client = Client(api_key=os.environ["LANGSMITH_API_KEY"])
# Create dataset with examples in one call
# data = [
#     {
#         'query': 'What is the population of New York City as of 2020?',
#         'ground_truths': ['8,804,190']
#     },
#     {
#         'query': 'Which borough of New York City has the highest population?',
#         'ground_truths': ['Brooklyn']
#     },
#     {
#         'query': 'What is the economic significance of New York City?',
#         'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]
#     },
#     {
#         'query': 'How did New York City get its name?',
#         'ground_truths': ['New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.']
#     },
#     {
#         'query': 'What is the significance of the Statue of Liberty in New York City?',
#         'ground_truths': ['The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.']
#     }
# ]

# # 1. Create dataset
# dataset = client.create_dataset(
#     dataset_name="nyc-qa-evaluation",
#     description="NYC Q&A dataset for RAG evaluation"
# )
# examples = [
#     {
#         "inputs": {"question": item["query"]},
#         "outputs": {"ground_truth": item["ground_truths"][0]}  # Use first ground truth
#     }
#     for item in data
# ]

# client.create_examples(
#     dataset_id=dataset.id,
#     examples=examples
# )

# 1. Define target function (what you're evaluating)
# ✅ FIXED: Simple working target function
def target(inputs: dict) -> dict:
    """Evaluation target - returns only answer"""
    question = inputs["question"]

    try:
        result = qa_chain.invoke(question)
        return {"answer": result["answer"]}
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
        return {"answer": ""}

def answer_relevancy_eval(outputs: dict, reference_outputs: dict) -> bool:
    output = outputs.get("output") or outputs.get("answer")
    return output is not None and len(str(output)) > 0

def faithfulness_eval(outputs: dict, reference_outputs: dict) -> bool:
    output = outputs.get("output") or outputs.get("answer")
    ground_truth = reference_outputs.get("ground_truth", "")

    if not output:
        return False

    # Check if keywords from ground truth are in output
    keywords = str(ground_truth).split()[:5]
    return any(keyword.lower() in str(output).lower() for keyword in keywords)
# Add this evaluator to your existing code
def context_relevancy(outputs: dict, reference_outputs: dict) -> bool:
    """Simple context relevancy check"""
    docs = outputs.get("docs", [])
    question = outputs.get("question", "")

    if not docs or not question:
        return False

    # Check if docs have relevant content
    question_words = question.lower().split()
    for doc in docs:
        if any(word in doc.page_content.lower() for word in question_words if len(word) > 3):
            return True
    return False

# 3. Run evaluation
results = client.evaluate(
    target,
    data="nyc-qa-evaluation",  # Or use dataset object
    evaluators=[faithfulness_eval, answer_relevancy_eval,context_relevancy],
    experiment_prefix="my-rag-evaluation",
    max_concurrency=2  # Optional
)

# 4. Analyze results
print(results)
# Access as dataframe if needed:
# df = results.to_pandas()
# # create evaluation chains
# faithfulness_chain   = EvaluatorChain(metric=faithfulness)
# answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
# context_rel_chain    = EvaluatorChain(metric=context_relevancy)
# context_recall_chain = EvaluatorChain(metric=context_recall)

View the evaluation results for experiment: 'my-rag-evaluation-86d15348' at:
https://smith.langchain.com/o/02c87376-067d-4ecd-9171-3170aec6a5ad/datasets/74eb3478-cb50-4c1b-8a76-d2fd5a16a258/compare?selectedSessions=4ed28bdc-8333-4b1c-af1a-7253a935ffd5




0it [00:00, ?it/s]

<ExperimentResults my-rag-evaluation-86d15348>


1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [128]:
# Recheck the result that we are going to validate.
results

Unnamed: 0,inputs.question,outputs.answer,error,reference.ground_truth,feedback.faithfulness_eval,feedback.answer_relevancy_eval,feedback.context_relevancy,execution_time,example_id,id
0,Which borough of New York City has the highest...,Manhattan (New York County) has the highest po...,,Brooklyn,False,True,False,0.97457,060e233e-372c-4d7f-b97d-18173e44dd02,019ac046-6f61-7641-a146-10640214beb5
1,What is the population of New York City as of ...,The population of New York City as of 2020 is ...,,8804190,True,True,False,1.292398,1c0f6017-cc17-4cf3-a9d1-1aed67ba7120,019ac046-6f67-76dc-aab5-a9e396c9bc2d
2,How did New York City get its name?,New York City was originally named New Amsterd...,,New York City got its name when it came under ...,True,True,False,2.096449,d33d01f4-49e7-4695-902a-fdaff6746408,019ac046-7474-70f2-8dc3-45d6414b2856
3,What is the economic significance of New York ...,The economic significance of New York City is ...,,"New York City's economic significance is vast,...",True,True,False,2.913655,36a5e4d2-ed07-4903-b0fc-6f32245b05dc,019ac046-7330-7343-a47a-7c66a0a2b0fa
4,What is the significance of the Statue of Libe...,The Statue of Liberty in New York City is a sy...,,The Statue of Liberty in New York City holds g...,True,True,False,1.046146,fc73b7a8-a42c-4e66-8f81-77883170309c,019ac046-7ca5-7387-874a-5b6b19236148


**Faithfulness**

In [130]:
# Create evaluation dataframe
import pandas as pd
evaluation_data = []
for result in results:
    evaluation_data.append({
        "question": result['example'].inputs['question'],
        "generated_answer": result['run'].outputs['answer'],
        "ground_truth": result['example'].outputs['ground_truth'],
        "answer_relevancy": result['evaluation_results']['results'][0].score if result['evaluation_results']['results'] else None,
        "faithfulness": result['evaluation_results']['results'][1].score if len(result['evaluation_results']['results']) > 1 else None,
        "context_relevancy": result['evaluation_results']['results'][1].score if len(result['evaluation_results']['results']) > 1 else None,

        })

df = pd.DataFrame(evaluation_data)
print(df)

# Summary statistics
print("\n=== EVALUATION SUMMARY ===")
print(f"Total Questions: {len(df)}")
print(f"Average Answer Relevancy: {df['answer_relevancy'].mean():.2%}")
print(f"Average Faithfulness: {df['faithfulness'].mean():.2%}")
print(f"Questions with both metrics passing: {len(df[(df['answer_relevancy'] == True) & (df['faithfulness'] == True)])}")

                                            question  \
0  Which borough of New York City has the highest...   
1  What is the population of New York City as of ...   
2                How did New York City get its name?   
3  What is the economic significance of New York ...   
4  What is the significance of the Statue of Libe...   

                                    generated_answer  \
0  Manhattan (New York County) has the highest po...   
1  The population of New York City as of 2020 is ...   
2  New York City was originally named New Amsterd...   
3  The economic significance of New York City is ...   
4  The Statue of Liberty in New York City is a sy...   

                                        ground_truth  answer_relevancy  \
0                                           Brooklyn             False   
1                                          8,804,190              True   
2  New York City got its name when it came under ...              True   
3  New York City's economic si

In [133]:
from langsmith.evaluation import EvaluationResult

def faithfulness_chain(outputs: dict, reference_outputs: dict = None) -> EvaluationResult:
    """Evaluate faithfulness with EvaluationResult"""
    if reference_outputs is None:
        reference_outputs = {}

    answer = str(outputs.get("answer", "")).lower()
    ground_truth = str(reference_outputs.get("ground_truth", "")).lower()

    # Calculate faithfulness
    words = ground_truth.split()[:5]
    is_faithful = any(w in answer for w in words if len(w) > 3)
    score = 1.0 if is_faithful else 0.0

    return EvaluationResult(
        key="faithfulness",
        score=score,
        comment="Faithful to ground truth" if is_faithful else "Does not match ground truth"
    )

High faithfulness_score means that there are exact consistency between the source documents and the answer.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.

In [135]:
fake_result = result.copy()
fake_result["result"] = "we are the champions"
eval_result = faithfulness_chain(fake_result)
print(eval_result)

key='faithfulness' score=0.0 value=None comment='Does not match ground truth' correction=None evaluator_info={} feedback_config=None source_run_id=None target_run_id=None extra=None


**Context Relevancy**

In [138]:
eval_result = context_recall_chain(result)
eval_result["context_recall_score"]

NameError: name 'context_recall_chain' is not defined

High context_recall_score means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.

In [None]:
from langchain.schema import Document
fake_result = result.copy()
fake_result["source_documents"] = [Document(page_content="I love christmas")]
eval_result = context_recall_chain(fake_result)
eval_result["context_recall_score"]

2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [137]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# evaluate
print("evaluating...")
r = faithfulness_chain.evaluate(examples, predictions)
r

AttributeError: 'dict' object has no attribute 'replace'

In [None]:
# evaluate context recall
print("evaluating...")
r = context_recall_chain.evaluate(examples, predictions)
r