# Lab | LangChain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
!pip install numpy==1.26.4 --force-reinstall

Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.
Successfully installed numpy-1.26.4


In [2]:
!pip install python-dotenv



In [3]:
!pip install docarray



In [4]:
!pip install langchain_openai



In [5]:
!pip install langchain_community



In [6]:
!pip uninstall -y torch torchvision torchaudio
!pip install torch==2.1.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install sentence-transformers


Found existing installation: torch 2.1.2+cpu
Uninstalling torch-2.1.2+cpu:
  Successfully uninstalled torch-2.1.2+cpu
Found existing installation: torchvision 0.16.2+cpu
Uninstalling torchvision-0.16.2+cpu:
  Successfully uninstalled torchvision-0.16.2+cpu
Found existing installation: torchaudio 2.1.2+cpu
Uninstalling torchaudio-2.1.2+cpu:
  Successfully uninstalled torchaudio-2.1.2+cpu
Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch==2.1.2
  Using cached https://download.pytorch.org/whl/cpu/torch-2.1.2%2Bcpu-cp311-cp311-linux_x86_64.whl (184.9 MB)
Collecting torchvision
  Using cached https://download.pytorch.org/whl/cpu/torchvision-0.22.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Collecting torchaudio
  Using cached https://download.pytorch.org/whl/cpu/torchaudio-2.7.0%2Bcpu-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.6 kB)
INFO: pip is looking at multiple versions of torchvision to determine which version is compatible with other r

In [7]:
!pip install langchain_huggingface



In [153]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
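
`os.getenv` returns `None` when a variable is missing, so a missing key would only surface later, when the first OpenAI call fails. A minimal sketch of an early check follows; the helper name `require_env` and the `DEMO_KEY` variable are hypothetical and used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value


# Demonstration with a throwaway variable (hypothetical name).
os.environ["DEMO_KEY"] = "sk-demo"
demo_value = require_env("DEMO_KEY")
print(demo_value)
```

In the notebook you would call `require_env("OPENAI_API_KEY")` right after `load_dotenv`.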

### Example 1

#### Create our QandA application

In [9]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAI
from langchain_huggingface import HuggingFaceEmbeddings
# Loaders and vector stores now live in langchain_community (installed above)
from langchain_community.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain.chains import LLMChain


In [156]:
file = '/content/Amazon_product.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [157]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})
).from_loaders([loader])

In [158]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
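
The `"stuff"` chain type places all retrieved documents into a single prompt, joined by `document_separator`. The sketch below illustrates only that joining step in plain Python; `stuff_documents` is a hypothetical helper and the document texts are made up, so treat it as an illustration rather than LangChain's implementation.

```python
def stuff_documents(page_contents, separator="<<<<>>>>>"):
    """Join retrieved document texts into one context string (illustrative only)."""
    return separator.join(page_contents)


# Made-up document texts for illustration.
docs = ["TITLE: Product A", "TITLE: Product B"]
context = stuff_documents(docs)
print(context)
```

A distinctive separator makes it easier to see where one retrieved document ends and the next begins when you inspect the prompt in debug mode.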

#### Coming up with test datapoints

In [159]:
data[10]

Document(metadata={'source': '/content/Amazon_product.csv', 'row': 10}, page_content='\ufeffPRODUCT_ID: 2857066\nTITLE: 3NH® Glasses Goggles Anti Fog Antis Windproof Anti Dust Resistant\nBULLET_POINTS: [Good quality and Suitable to use.,This Product comes in a proper Packaging.,Delivery within 3-5 weeks,In case of any query or issue. Feel free to reach out to us.,Contains: Pack of 1]\nDESCRIPTION: 3NH Glasses Goggles Anti Fog Antis Windproof Anti Dust Resistant\nPRODUCT_TYPE_ID: 10359\nPRODUCT_LENGTH: 590.5511805\n: ')

In [160]:
data[11]

Document(metadata={'source': '/content/Amazon_product.csv', 'row': 11}, page_content='\ufeffPRODUCT_ID: 833712\nTITLE: La Mure / Valbonnais gps\nBULLET_POINTS: \nDESCRIPTION: \nPRODUCT_TYPE_ID: 1\nPRODUCT_LENGTH: 433.07\n: ')

#### Hard-coded examples

In [161]:
from langchain.prompts import PromptTemplate

In [173]:
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field

examples = [
    {
        "query": "Do the Cotton Ankel Leggings Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What is the color of the PRIKNIK Horn Red Electric Air Horn?",
        "answer": "The PRIKNIK Horn Red Electric Air Horn is vibrant red in color."
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cotton Ankel Leggings Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What is the color of the PRIKNIK Horn Red Electric Air Horn?\n"
             "   Answer: The PRIKNIK Horn Red Electric Air Horn is vibrant red in color.\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Pullover Set machine washable?"

# Run the chain
result = llm_chain.run({"query": query})

# Print the result
print(result)


answer='Yes, the Pullover Set is machine washable.'
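
The parser above keeps whatever follows the last `Answer:` marker in the model's reply. A small standalone sketch of the same string logic, without LangChain or pydantic, shows how it behaves on a few inputs; `extract_answer` is a hypothetical name and the sample replies are made up.

```python
def extract_answer(text: str) -> str:
    """Return the text after the last 'Answer:' marker, or the whole text if absent."""
    return text.strip().split("Answer:")[-1].strip()


# Made-up model replies for illustration.
with_marker = extract_answer("Answer: Yes, it is.")
without_marker = extract_answer("  Yes, it is.  ")
two_markers = extract_answer("Query: x\nAnswer: first\nAnswer: second")

print(with_marker)
print(without_marker)
print(two_markers)
```

Because only the last marker counts, a reply that echoes the few-shot examples before answering still yields the final answer.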


#### LLM-Generated examples

In [164]:
from langchain.evaluation.qa import QAGenerateChain

In [165]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [166]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [174]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [175]:
new_examples[0]

{'qa_pairs': {'query': 'What is the material used for the ArtzFolio Tulip Flowers Blackout Curtain?',
  'answer': 'The ArtzFolio Tulip Flowers Blackout Curtain is made of premium cotton canvas fabric.'}}

In [176]:
data[0]

Document(metadata={'source': '/content/Amazon_product.csv', 'row': 0}, page_content='\ufeffPRODUCT_ID: 1925202\nTITLE: ArtzFolio Tulip Flowers Blackout Curtain for Door, Window & Room | Eyelets & Tie Back | Canvas Fabric | Width 4.5feet (54inch) Height 5 feet (60 inch); Set of 2 PCS\nBULLET_POINTS: [LUXURIOUS & APPEALING: Beautiful custom-made curtains to decorate any home or office | Includes inbuilt tieback to hold the curtain | Completely finished and ready to hang on walls & windows,MATERIAL: Luxurious & versatile fabric with a natural finish | High colour fastness | State-of-the-art digital printing ensures colour consistency and prevents any fading | Eyelets; Cotton Canvas; Width 4.5feet (54inch) | Multicolour | PACKAGE: 2 Room Curtains Eyelets | SIZE: Height 5 feet (60 inch); SET OF 2 PCS,BLACKOUT CURTAIN: 100% opaque & heavy premium cotton canvas fabric | Tight knitted, long life & durable fabric | Printing only on front side with a plain colour back side,MADE TO PERFECTION: La

In [177]:
d_flattened = [ex['qa_pairs'] for ex in new_examples]  # avoid reusing the name `data`
d_flattened

[{'query': 'What is the material used for the ArtzFolio Tulip Flowers Blackout Curtain?',
  'answer': 'The ArtzFolio Tulip Flowers Blackout Curtain is made of premium cotton canvas fabric.'},
 {'query': "What is the product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds?",
  'answer': "The product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds is 2673191."},
 {'query': 'What are the specifications of the PRIKNIK Horn Red Electric Air Horn mentioned in the document?',
  'answer': 'The specifications of the PRIKNIK Horn Red Electric Air Horn are as follows: Color - Red, Material - Aluminium, Voltage - 12V, dB - 130 dB, Material - Aluminum Pump Head + Steel Pump Body + ABS Shell and Parts, DB output - 130db, Voltage - 12v, Sound Type - Dual Tone, Application - 12V Voltage Vehicles With Battery Above 20A.'},
 {'query': "According to the document, what is the composition of the ALISHAH Women's Cotton Ankle Length Leggings Combo of 

#### Combine examples

In [178]:
# examples += new_example
examples += d_flattened

In [179]:
examples[0]

{'query': 'Do the Cotton Ankel Leggings Set        have side pockets?',
 'answer': 'Yes'}
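
The run of spaces inside the query comes from the backslash line continuation inside the string literal in the hard-coded examples: the indentation of the second line becomes part of the string. One possible cleanup, sketched here with a hypothetical helper `normalize_query`, is to collapse whitespace before sending the query to the chain.

```python
def normalize_query(query: str) -> str:
    """Collapse any run of whitespace into a single space."""
    return " ".join(query.split())


raw = "Do the Cotton Ankel Leggings Set        have side pockets?"
clean = normalize_query(raw)
print(clean)
```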

In [180]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cotton Ankel Leggings Set        have side pockets?',
 'result': "No, the ALISHAH Women's Cotton Ankle Length Leggings Combo of 2 does not mention having side pockets in the provided context."}

### Manual Evaluation - Fun part

In [181]:
import langchain
langchain.debug = True

In [182]:
qa.invoke(examples[0]["query"])

[chain/start] [chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Do the Cotton Ankel Leggings Set        have side pockets?"
}
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "Do the Cotton Ankel Leggings Set        have side pockets?",
  "context": "﻿PRODUCT_ID: 1011020\nTITLE: Maevn Women's Core Full Elastic Band Cargo Pants(Royal Blue, XX-Small)\nBULLET_POINTS: [Straight leg pant,Full elastic waistband,Two front slash pockets,One cargo pocket and inner cell pocket,Side vents]\nDESCRIPTION: Core by Maevn features the Women's Full Elastic Band Cargo Scrub Pant, style 9016. No job is too hard for Maevn's Core collection. Made from a high-caliber poplin fabric, these scrubs are designed to hold up in even the toughest envi

{'query': 'Do the Cotton Ankel Leggings Set        have side pockets?',
 'result': "No, the ALISHAH Women's Cotton Ankle Length Leggings Combo of 2 does not mention having side pockets in the provided information."}

In [183]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [184]:
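# Note: d_flattened was already appended under "Combine examples",
# so this second append duplicates the generated examples (hence 12 entries below).
examples += d_flattened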

In [185]:
examples

[{'query': 'Do the Cotton Ankel Leggings Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What is the color of the PRIKNIK Horn Red Electric Air Horn?',
  'answer': 'The PRIKNIK Horn Red Electric Air Horn is vibrant red in color.'},
 {'query': 'What is the material used for the ArtzFolio Tulip Flowers Blackout Curtain?',
  'answer': 'The ArtzFolio Tulip Flowers Blackout Curtain is made of premium cotton canvas fabric.'},
 {'query': "What is the product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds?",
  'answer': "The product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds is 2673191."},
 {'query': 'What are the specifications of the PRIKNIK Horn Red Electric Air Horn mentioned in the document?',
  'answer': 'The specifications of the PRIKNIK Horn Red Electric Air Horn are as follows: Color - Red, Material - Aluminium, Voltage - 12V, dB - 130 dB, Material - Aluminum Pump Head + Steel Pump Body + ABS Shell and Par
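
Since `d_flattened` was appended to `examples` twice, each generated example appears in the list twice and is evaluated twice. A minimal sketch of one way to drop repeats while keeping the original order; the helper name `dedupe_examples` is hypothetical and the sample data is made up.

```python
def dedupe_examples(examples):
    """Keep the first example for each distinct query, preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        if ex["query"] not in seen:
            seen.add(ex["query"])
            unique.append(ex)
    return unique


# Made-up examples for illustration.
sample = [
    {"query": "q1", "answer": "a1"},
    {"query": "q2", "answer": "a2"},
    {"query": "q1", "answer": "a1"},
]
unique_sample = dedupe_examples(sample)
print(len(sample), "->", len(unique_sample))
```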

In [186]:
predictions = qa.batch(examples)



> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.


In [187]:
predictions

[{'query': 'Do the Cotton Ankel Leggings Set        have side pockets?',
  'answer': 'Yes',
  'result': "No, the ALISHAH Women's Cotton Ankle Length Leggings Combo of 2 does not have side pockets."},
 {'query': 'What is the color of the PRIKNIK Horn Red Electric Air Horn?',
  'answer': 'The PRIKNIK Horn Red Electric Air Horn is vibrant red in color.',
  'result': 'The color of the PRIKNIK Horn Red Electric Air Horn is red.'},
 {'query': 'What is the material used for the ArtzFolio Tulip Flowers Blackout Curtain?',
  'answer': 'The ArtzFolio Tulip Flowers Blackout Curtain is made of premium cotton canvas fabric.',
  'result': 'The material used for the ArtzFolio Tulip Flowers Blackout Curtain is canvas fabric.'},
 {'query': "What is the product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds?",
  'answer': "The product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds is 2673191.",
  'result': "The product ID for the Marks & Spence

In [188]:
from langchain.evaluation.qa import QAEvalChain

In [189]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [190]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [191]:
graded_outputs

[{'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
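
A list of grades is easier to compare across runs once it is reduced to a single number. The sketch below counts the grades shown above and computes the fraction marked `CORRECT`; the list is re-typed by hand from that output so the block runs on its own.

```python
from collections import Counter

# Grades copied from the output above: one INCORRECT followed by eleven CORRECT.
graded_outputs = [{"results": "INCORRECT"}] + [{"results": "CORRECT"}] * 11

counts = Counter(g["results"].strip().upper() for g in graded_outputs)
accuracy = counts["CORRECT"] / len(graded_outputs)

print(dict(counts))
print(f"accuracy = {accuracy:.2%}")
```

Keep in mind that the grader is itself an LLM, so it is worth spot-checking a few grades by hand, as in the manual evaluation step.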

In [192]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cotton Ankel Leggings Set        have side pockets?
Real Answer: Yes
Predicted Answer: No, the ALISHAH Women's Cotton Ankle Length Leggings Combo of 2 does not have side pockets.

Example 1:
Question: What is the color of the PRIKNIK Horn Red Electric Air Horn?
Real Answer: The PRIKNIK Horn Red Electric Air Horn is vibrant red in color.
Predicted Answer: The color of the PRIKNIK Horn Red Electric Air Horn is red.

Example 2:
Question: What is the material used for the ArtzFolio Tulip Flowers Blackout Curtain?
Real Answer: The ArtzFolio Tulip Flowers Blackout Curtain is made of premium cotton canvas fabric.
Predicted Answer: The material used for the ArtzFolio Tulip Flowers Blackout Curtain is canvas fabric.

Example 3:
Question: What is the product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds?
Real Answer: The product ID for the Marks & Spencer Girls' Pyjama Sets in Navy Mix for 9-10 year olds is 2673191.
Predicted Answer: The

### Example 2
You can also evaluate your QA chains with the metrics offered by ragas.

In [39]:
#!pip install torch
#!pip install langchain langchain-community


In [130]:
from langchain_huggingface import HuggingFaceEmbeddings
loader = TextLoader("/content/AI.txt")
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})).from_loaders([loader])


llm = ChatOpenAI(temperature= 0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)



In [133]:
# testing it out

question = "What is AI?"
result = qa_chain.invoke({"query": question})
result["result"]

'Artificial intelligence (AI) involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity. AI has become deeply integrated into various aspects of our daily lives, from smart assistants to medical applications, scientific research, self-driving cars, security systems, and even art and creativity.'

In [134]:
result

{'query': 'What is AI?',
 'result': 'Artificial intelligence (AI) involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity. AI has become deeply integrated into various aspects of our daily lives, from smart assistants to medical applications, scientific research, self-driving cars, security systems, and even art and creativity.',
 'source_documents': [Document(id='fcb516e2-3f52-4bf6-9b91-e6c00db4a6ba', metadata={'source': '/content/AI.txt'}, page_content='Artificial Intelligence and Its Impact on the Future\n\nArtificial intelligence (AI) is one of the most fascinating and debated fields in technology. It involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, pla

To evaluate the QA system, we need a few relevant questions with reference answers. We've written some for you, but feel free to add your own.

In [135]:
eval_questions = [
    "What is Artificial Intelligence (AI)?",
    "How has AI impacted various industries?",
    "What are the potential risks associated with AI?",
    "What is the future outlook for AI technology?",
    "How is AI used in education?",
]

eval_answers = [
    "Artificial Intelligence (AI) is the simulation of human intelligence in machines designed to think, learn, and perform tasks typically requiring human cognition, such as problem-solving, learning, decision-making, and language understanding.",
    "AI has significantly impacted various industries such as healthcare, entertainment, finance, and education. It is used to enhance productivity, make accurate predictions, improve customer experiences, and optimize operations. For example, AI systems assist in disease diagnosis, improve user recommendations in streaming services, and enhance financial modeling and analysis.",
    "The potential risks associated with AI include security concerns, such as the use of AI for malicious purposes (e.g., cyberattacks or autonomous weapons). There are also concerns about job displacement, increased inequalities, and the ethical dilemmas surrounding AI decision-making, particularly in areas like autonomous vehicles and surveillance.",
    "The future of AI looks promising, with expected advancements in cognitive simulation, enhancing its role in scientific research, healthcare, business, and more. However, there are challenges related to governance, ethics, and regulation that need to be addressed to ensure AI's responsible and safe development.",
    "AI is used in education to personalize learning experiences, help track student progress, and offer customized resources. It also supports remote learning, assists teachers in managing virtual classrooms, and provides accurate evaluations, creating more efficient and tailored educational systems.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]


In [136]:
examples

[{'query': 'What is Artificial Intelligence (AI)?',
  'ground_truths': ['Artificial Intelligence (AI) is the simulation of human intelligence in machines designed to think, learn, and perform tasks typically requiring human cognition, such as problem-solving, learning, decision-making, and language understanding.']},
 {'query': 'How has AI impacted various industries?',
  'ground_truths': ['AI has significantly impacted various industries such as healthcare, entertainment, finance, and education. It is used to enhance productivity, make accurate predictions, improve customer experiences, and optimize operations. For example, AI systems assist in disease diagnosis, improve user recommendations in streaming services, and enhance financial modeling and analysis.']},
 {'query': 'What are the potential risks associated with AI?',
  'ground_truths': ['The potential risks associated with AI include security concerns, such as the use of AI for malicious purposes (e.g., cyberattacks or autonomo

#### Introducing RagasEvaluatorChain

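`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with LangChain and LangSmith. In the ragas version installed below (0.1.9), the class is imported as `EvaluatorChain` from `ragas.integrations.langchain`.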

The evaluator chain has the following APIs

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: the method that LangSmith evaluators call to evaluate LangSmith datasets.

Let's see each of them in action to learn more.

In [137]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

'AI has had a significant impact on various industries, transforming the way businesses operate and improving efficiency. Here are some examples of how AI has impacted different sectors:\n\n1. Healthcare: AI is being used in healthcare for tasks like early disease diagnosis, personalized treatment plans, drug discovery, and medical imaging analysis. This has led to improved patient outcomes and more efficient healthcare delivery.\n\n2. Finance: In the finance industry, AI is used for fraud detection, algorithmic trading, risk assessment, and customer service chatbots. AI has helped financial institutions make faster and more accurate decisions, leading to better customer service and reduced risks.\n\n3. Retail: AI is used in retail for personalized recommendations, inventory management, demand forecasting, and customer service. This has improved the overall shopping experience for customers and helped retailers optimize their operations.\n\n4. Manufacturing: AI is used in manufacturing

In [138]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]
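
The same renaming is needed every time a QA result is passed to a ragas chain, so it can be wrapped in a small function. The sketch below is one way to do that; `to_ragas_input` and the stand-in `FakeDoc` class are hypothetical (in the notebook the documents are LangChain `Document` objects), and the key names follow the mapping above.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is needed here."""
    page_content: str


def to_ragas_input(qa_result, ground_truth=None):
    """Rename RetrievalQA output keys and flatten documents to plain strings."""
    row = {
        "question": qa_result["query"],
        "answer": qa_result["result"],
        "contexts": [d.page_content for d in qa_result.get("source_documents", [])],
    }
    if ground_truth is not None:
        row["ground_truth"] = ground_truth
    return row


# Made-up QA result for illustration.
demo_result = {
    "query": "What is AI?",
    "result": "AI is software that performs tasks requiring intelligence.",
    "source_documents": [FakeDoc("AI involves the design of systems and software.")],
}
row = to_ragas_input(demo_result, ground_truth="AI simulates human intelligence.")
print(row)
```

Converting the documents to strings at this point also covers the separate `page_content` step that the faithfulness example performs later.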


In [139]:
result_updated

{'question': 'How has AI impacted various industries?',
 'answer': 'AI has had a significant impact on various industries, transforming the way businesses operate and improving efficiency. Here are some examples of how AI has impacted different sectors:\n\n1. Healthcare: AI is being used in healthcare for tasks like early disease diagnosis, personalized treatment plans, drug discovery, and medical imaging analysis. This has led to improved patient outcomes and more efficient healthcare delivery.\n\n2. Finance: In the finance industry, AI is used for fraud detection, algorithmic trading, risk assessment, and customer service chatbots. AI has helped financial institutions make faster and more accurate decisions, leading to better customer service and reduced risks.\n\n3. Retail: AI is used in retail for personalized recommendations, inventory management, demand forecasting, and customer service. This has improved the overall shopping experience for customers and helped retailers optimize

In [51]:
!pip install --no-cache-dir recordclass

Collecting recordclass
  Downloading recordclass-0.23.1.tar.gz (1.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: recordclass
  Building wheel for recordclass (pyproject.toml) ... done
  Created wheel for recordclass: filename=recordclass-0.23.1-cp311-cp311-linux_x86_64.whl size=453255 sha256=4a79007a367b315502c9e1d6b41fe4bae40158800d61403f518fc97d5bd9ea53
  Stored in directory: /tmp/pip-ephem-wheel-cache-wvgvmjx9/wheels/23/43/c9/edc2de30980

In [52]:
!pip install ragas==0.1.9

Collecting ragas==0.1.9
  Downloading ragas-0.1.9-py3-none-any.whl.metadata (5.2 kB)
Collecting datasets (from ragas==0.1.9)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting pysbd>=0.3.4 (from ragas==0.1.9)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting appdirs (from ragas==0.1.9)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->ragas==0.1.9)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->ragas==0.1.9)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets->ragas==0.1.9)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets->ragas==0.1.9)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading ragas-0.1.9-py

In [140]:
from ragas.integrations.langchain import EvaluatorChain
# from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

# create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_rel_chain    = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [141]:
# Recheck the result that we are going to validate.
result

{'query': 'How has AI impacted various industries?',
 'result': 'AI has had a significant impact on various industries, transforming the way businesses operate and improving efficiency. Here are some examples of how AI has impacted different sectors:\n\n1. Healthcare: AI is being used in healthcare for tasks like early disease diagnosis, personalized treatment plans, drug discovery, and medical imaging analysis. This has led to improved patient outcomes and more efficient healthcare delivery.\n\n2. Finance: In the finance industry, AI is used for fraud detection, algorithmic trading, risk assessment, and customer service chatbots. AI has helped financial institutions make faster and more accurate decisions, leading to better customer service and reduced risks.\n\n3. Retail: AI is used in retail for personalized recommendations, inventory management, demand forecasting, and customer service. This has improved the overall shopping experience for customers and helped retailers optimize th

**Faithfulness**

In [142]:
result_updated["contexts"] = [doc.page_content for doc in result_updated["contexts"]]


In [143]:
eval_result = faithfulness_chain(result_updated)
eval_result["faithfulness"]

0.85

A high faithfulness score means that the claims made in the answer are consistent with the source documents.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.

In [146]:
fake_result = result.copy()
fake_result["result"] = "what is economic impact of AI"

# Create a new dictionary with the expected keys
updated_fake_result = {
    "question": fake_result.get("query"),
    "answer": fake_result.get("result"),
    "contexts": [doc.page_content for doc in fake_result.get("source_documents", [])]
}

# Pass the updated dictionary to faithfulness_chain
eval_result = faithfulness_chain(updated_fake_result)
eval_result["faithfulness"]  # the score is stored under the "faithfulness" key



nan
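
The ragas documentation describes faithfulness as the number of claims in the answer that are supported by the retrieved context, divided by the total number of claims in the answer. The sketch below reproduces only that arithmetic on hand-labelled verdicts; in ragas the claims and verdicts come from an LLM. Returning NaN for an answer with no claims is an assumption made here, chosen because it would explain the `nan` above, where the fake "answer" is itself a question.

```python
import math


def faithfulness_score(verdicts):
    """verdicts: one boolean per claim in the answer, True if the context supports it."""
    if not verdicts:
        # Assumption: no claims to check means the score is undefined.
        return math.nan
    return sum(verdicts) / len(verdicts)


# Hand-labelled verdicts for illustration.
partly_supported = faithfulness_score([True, True, True, False])
no_claims = faithfulness_score([])

print(partly_supported)
print(no_claims)
```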

**Context Recall**

In [147]:
# For context_recall_chain:
eval_result = context_recall_chain({
    "question": result["query"],
    "contexts": [doc.page_content for doc in result["source_documents"]],
    "ground_truth": eval_answers[eval_questions.index(result["query"])] # Assuming eval_answers contains ground truths for eval_questions
})
eval_result["context_recall"]


0.3333333333333333

A high context recall score means that the content of the ground truth can be found in the source documents.

You can check lower context recall scores by changing the source_documents to something else.

In [149]:
from langchain.schema import Document

fake_result = result.copy()
fake_result["source_documents"] = [Document(page_content="I love AI")]

updated_fake_result = {
    "question": fake_result.get("query"),
    "contexts": [doc.page_content for doc in fake_result.get("source_documents", [])],
    "ground_truth": eval_answers[eval_questions.index(fake_result.get("query", ""))]
}

eval_result = context_recall_chain(updated_fake_result)
eval_result["context_recall"]

0.0
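
Context recall can be read as the share of ground-truth sentences that can be attributed to the retrieved context. The sketch below shows that ratio with a deliberately crude stand-in for the judgement step: `substring_match` is a hypothetical exact-match check, whereas ragas asks an LLM to decide. The texts are made up.

```python
import re


def context_recall(ground_truth, contexts, is_supported):
    """Fraction of ground-truth sentences that is_supported() finds in the contexts."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", ground_truth) if s.strip()]
    if not sentences:
        return float("nan")
    joined = " ".join(contexts)
    hits = sum(1 for s in sentences if is_supported(s, joined))
    return hits / len(sentences)


def substring_match(sentence, context):
    """Crude stand-in for the LLM judgement: case-insensitive exact match."""
    return sentence.lower() in context.lower()


# Made-up texts for illustration.
gt = "AI helps doctors. AI writes poems. AI drives cars."
good_ctx = ["Reports say AI helps doctors. Nothing else is known."]
bad_ctx = ["I love AI"]

recall_good = context_recall(gt, good_ctx, substring_match)
recall_bad = context_recall(gt, bad_ctx, substring_match)
print(recall_good, recall_bad)
```

Under this reading, a score of one third corresponds to one of three ground-truth sentences being found in the context, and swapping in an unrelated context drives the score to zero.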

2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [150]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# Update keys to match expected input format
updated_predictions = []
for prediction in predictions:
    updated_prediction = {
        "question": prediction.get("query"),
        "answer": prediction.get("result"),
        "contexts": [doc.page_content for doc in prediction.get("source_documents", [])]
    }
    updated_predictions.append(updated_prediction)

print("evaluating...")
r = [faithfulness_chain(prediction) for prediction in updated_predictions]
r

evaluating...


[{'question': 'What is Artificial Intelligence (AI)?',
  'answer': 'Artificial Intelligence (AI) involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity. AI has evolved significantly since its inception and is now deeply integrated into various aspects of our daily lives.',
  'contexts': ['Artificial Intelligence and Its Impact on the Future\n\nArtificial intelligence (AI) is one of the most fascinating and debated fields in technology. It involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity.\n\nAI has come a long way since it first emerged as an idea in the minds of scientists 

In [152]:
# evaluate context recall
print("evaluating...")
# r = context_recall_chain.evaluate(examples, predictions) # This line is causing the error
r = []
for i, prediction in enumerate(updated_predictions):
    # Add the ground truth to the prediction dictionary
    prediction["ground_truth"] = examples[i]["ground_truths"][0]
    # Now call context_recall_chain with the complete prediction
    r.append(context_recall_chain(prediction))
r

evaluating...


[{'question': 'What is Artificial Intelligence (AI)?',
  'answer': 'Artificial Intelligence (AI) involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity. AI has evolved significantly since its inception and is now deeply integrated into various aspects of our daily lives, from smart assistants to medical applications and scientific research.',
  'contexts': ['Artificial Intelligence and Its Impact on the Future\n\nArtificial intelligence (AI) is one of the most fascinating and debated fields in technology. It involves the design of systems and software capable of performing tasks that usually require human intelligence. These tasks include learning, logical thinking, problem-solving, planning, recognizing speech and images, translating languages, and even creativity.\n\nAI has come a
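
Each dictionary in `r` carries the inputs plus the metric score, so a run can be summarised by averaging that score across examples. A minimal sketch, assuming the score sits under the metric's name as in the single-example calls above; `mean_score` is a hypothetical helper, the scores are made up, and skipping NaN values is a design choice you may want to change.

```python
import math


def mean_score(results, key):
    """Average results[i][key], skipping entries that are missing or NaN."""
    values = [row[key] for row in results if key in row and not math.isnan(row[key])]
    return sum(values) / len(values) if values else math.nan


# Made-up scores for illustration.
demo_results = [
    {"question": "q1", "context_recall": 1.0},
    {"question": "q2", "context_recall": 0.5},
    {"question": "q3", "context_recall": math.nan},
]
average = mean_score(demo_results, "context_recall")
print(average)
```

Reporting how many examples were skipped alongside the mean helps avoid hiding failed evaluations.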