# Automated, Effective Evaluation of RAG Systems with LangChain and RAGAS

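This notebook focuses on one important component of an LLM-based RAG question-answering system:

- Evaluation

We build the RAG question-answering system with LangChain and evaluate it with the RAGAS framework, which is steadily becoming a standard way to evaluate RAG systems.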

### Install the dependencies first

In [2]:
pip install -U -q langchain langchain-community langchain-openai openai ragas datasets dashscope arxiv pymupdf chromadb wandb tiktoken pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import openai
from getpass import getpass

# The same key is reused below for DashScope (embeddings and the OpenAI-compatible chat endpoint).
openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Data preparation

We use an arXiv paper as the example: `ArxivLoader` loads the paper that serves as the RAG context.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)


In [4]:
from langchain.document_loaders import ArxivLoader
# ArxivLoader is a LangChain document loader that fetches papers from arXiv.
paper_docs = ArxivLoader(query="2309.15217", load_max_docs=1).load()
len(paper_docs)

1

In [5]:
for doc in paper_docs:
  print(doc.metadata)

{'Published': '2025-04-28', 'Title': 'Ragas: Automated Evaluation of Retrieval Augmented Generation', 'Authors': 'Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert', 'Summary': 'We introduce Ragas (Retrieval Augmented Generation Assessment), a framework\nfor reference-free evaluation of Retrieval Augmented Generation (RAG)\npipelines. RAG systems are composed of a retrieval and an LLM based generation\nmodule, and provide LLMs with knowledge from a reference textual database,\nwhich enables them to act as a natural language layer between a user and\ntextual databases, reducing the risk of hallucinations. Evaluating RAG\narchitectures is, however, challenging because there are several dimensions to\nconsider: the ability of the retrieval system to identify relevant and focused\ncontext passages, the ability of the LLM to exploit such passages in a faithful\nway, or the quality of the generation itself. With Ragas, we put forward a\nsuite of metrics which can be used to eval

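### Text splitting, embedding model, and vector store for RAG

We split the text with `RecursiveCharacterTextSplitter`, encode the chunks with an embedding model (the code below uses `DashScopeEmbeddings`; `OpenAIEmbeddings` is imported but not used), and store the vectors in a `VectorStore`.

- `RecursiveCharacterTextSplitter()`
- `DashScopeEmbeddings()` (used here in place of `OpenAIEmbeddings()`)
- `Chroma`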

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import DashScopeEmbeddings
embeddings_model = DashScopeEmbeddings(
        model="text-embedding-v2",
        dashscope_api_key=openai.api_key,
    )
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(paper_docs)

vectorstore = Chroma.from_documents(docs, embeddings_model)

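By default, the Chroma vector store keeps its data in memory, so nothing is retained after the program exits.
Chroma also supports persistent storage: if you specify a path, the data is saved to disk, survives a program shutdown, and is loaded again on the next start.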

In [7]:
len(docs)

107

In [8]:
print(max([len(chunk.page_content) for chunk in docs]))

497
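The longest chunk (497 characters) stays under `chunk_size=500` because the splitter tries coarse separators first and only falls back to finer ones for pieces that are still too long. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not LangChain's implementation: the function name `split_recursive` and the separator list are assumptions, and chunk overlap is left out entirely.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Simplified illustration: coarse separators are tried first, finer
    ones are used only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("alpha beta gamma delta. " * 10 + "\n\n") * 3
chunks = split_recursive(sample, chunk_size=50)
print(len(chunks), max(len(c) for c in chunks))
```

Separators at chunk boundaries are dropped in this sketch, so the words survive but the exact whitespace does not.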


We can now retrieve with the `.as_retriever()` method of the `Chroma` vector store; the main parameter to control is `k`.

In [9]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 3})

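- `vectorstore.as_retriever()`: converts a vector store instance (`vectorstore`) into a LangChain retriever (`Retriever`) object. The retriever is the core LangChain component that fetches relevant documents from a data source in response to a user query.
- `"k"`: the number of most similar documents to retrieve. Here `"k": 3` means that for each query the retriever returns the 3 documents in the vector store that are most similar to it. This is common in RAG (retrieval-augmented generation) systems, where it limits how much context is passed to the large language model and so helps efficiency and relevance.

The role of the retriever
A retriever is a core component: given a query, it retrieves the relevant documents or information from a data source such as a vector database or a document loader.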
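To make the effect of `k` concrete, here is a minimal top-k similarity search over toy vectors in plain Python. It is a sketch under stated assumptions: cosine similarity is used for illustration (the distance a real vector store uses depends on its configuration), and `retrieve_top_k` and the toy vectors are hypothetical names and data, not part of LangChain or Chroma.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
top = retrieve_top_k([1.0, 0.0], toy_vectors, k=3)
print(top)
```

A larger `k` returns more context at the cost of a longer prompt; a smaller `k` risks missing the passage that contains the answer.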

In [10]:
relevant_docs = base_retriever.invoke("What is Retrieval Augmented Generation?")


In [11]:
len(relevant_docs)

3

### Create the prompt for generating answers
We use the `LLM` to generate an `answer` to each question from the retrieved `Context`.


In [12]:
from langchain.prompts import PromptTemplate

template = """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

Question: {question} 

Context: {context} 

Answer:
"""

prompt = PromptTemplate(
    template=template, 
    input_variables=["context","question"]
  )

print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template="You are an assistant for question-answering tasks. \nUse the following pieces of retrieved context to answer the question. \nIf you don't know the answer, just say that you don't know. \n\nQuestion: {question} \n\nContext: {context} \n\nAnswer:\n"


### Generate the `answer` with the LLM
We use `Runnable` components to define a `chain` that implements the full RAG pipeline.

In [13]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model_name="qwen-plus-2025-04-28", 
    temperature=0,
    api_key=os.environ["OPENAI_API_KEY"],  # read the key from the environment instead of hard-coding it
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
    )
# RunnablePassthrough passes its input through to the output unchanged.
# StrOutputParser() is the last step of the RAG chain; it ensures the final answer is returned as a string.
rag_chain = (
    {"context": base_retriever,  "question": RunnablePassthrough()} 
    | prompt 
    | llm
    | StrOutputParser() 
)
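In the chain above, the retriever output (a list of documents) fills `{context}` directly, so the prompt most likely receives the string representation of that list, metadata included. A common alternative is to join only the document text before it reaches the prompt. The sketch below shows the idea in plain Python: `Doc` is a stand-in for a retrieved document, and `format_docs` is a hypothetical helper name, neither comes from the notebook.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(d.page_content for d in docs)


template = "Question: {question}\n\nContext: {context}\n\nAnswer:"
retrieved = [
    Doc("RAG combines retrieval with generation."),
    Doc("Ragas evaluates RAG pipelines."),
]
filled = template.format(question="What is RAG?", context=format_docs(retrieved))
print(filled)
```

In a LangChain chain, a helper like this is typically piped after the retriever (for example `base_retriever | format_docs`) so that the prompt sees clean text only.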

In [14]:
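# The prompt is Chinese for "What model are you?"; the reply below identifies itself as Tongyi Qianwen (Qwen).
print(llm.invoke("你是什么模型"))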

content='我是通义千问，由通义实验室自主研发的超大规模语言模型。我能够回答问题、创作文字，比如写故事、写公文、写邮件、写剧本、逻辑推理、编程等等，还能表达观点，玩游戏等。我在多国语言上都有很好的掌握，能为你提供多样化的帮助。如果你有任何问题或需要帮助，欢迎随时告诉我！' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 79, 'prompt_tokens': 15, 'total_tokens': 94, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'qwen-plus-2025-04-28', 'system_fingerprint': None, 'id': 'chatcmpl-e98b6bd5-1878-93c9-9019-e4722ccee050', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None} id='run--89d164d0-9d47-4335-a799-980889ef14ee-0' usage_metadata={'input_tokens': 15, 'output_tokens': 79, 'total_tokens': 94, 'input_token_details': {}, 'output_token_details': {}}


#### Build the data RAGAs requires
Columns: `question`, `answer`, `contexts`, `ground_truths`

In [None]:
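# Ragas dataset format: ['question', 'answer', 'contexts', 'ground_truths']
# Note: the data dict built below stores the reference answers under the key "reference".
'''
{
    "question": [], <-- questions grounded in the context
    "answer": [], <-- answers generated by the LLM
    "contexts": [], <-- retrieved context
    "ground_truths": [] <-- reference (gold) answers
}
'''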

from datasets import Dataset
# Build the questions and reference answers (the gold dataset)
questions = ["What is faithfulness ?", 
             "How many pages are included in the WikiEval dataset, and which years do they cover information from?",
             "Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?",
            ]
ground_truths = ["Faithfulness refers to the idea that the answer should be grounded in the given context.",
                  " To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",
                "Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself."]              
answers = []
contexts = []

# Generate the answers
for query in questions:
    answers.append(rag_chain.invoke(query))
    # invoke() replaces the deprecated get_relevant_documents()
    contexts.append([doc.page_content for doc in base_retriever.invoke(query)])

# Build the data
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "reference": ground_truths
}
dataset = Dataset.from_dict(data)



In [16]:
dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'reference'],
    num_rows: 3
})

### Evaluate with RAGAs

In [25]:
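# Column names used for the Ragas evaluation dataset in this cell:
# ['user_input', 'response', 'retrieved_contexts', 'reference']
'''
{
    "user_input": [], <-- questions grounded in the context
    "response": [], <-- answers generated by the LLM
    "retrieved_contexts": [], <-- retrieved context
    "reference": [] <-- reference (gold) answers
}
'''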

from datasets import Dataset
# Build the questions and reference answers (the gold dataset)
questions = ["What is faithfulness ?", 
             "How many pages are included in the WikiEval dataset, and which years do they cover information from?",
             "Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?",
            ]
ground_truths = ["Faithfulness refers to the idea that the answer should be grounded in the given context.",
                  " To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",
                "Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself."]              
answers = []
contexts = []

# Generate the answers
for query in questions:
    answers.append(rag_chain.invoke(query))
    # invoke() replaces the deprecated get_relevant_documents()
    contexts.append([doc.page_content for doc in base_retriever.invoke(query)])

# Build the data
data = {
    "user_input": questions,
    "response": answers,
    "retrieved_contexts": contexts,
    "reference": ground_truths
}
dataset = Dataset.from_dict(data)

In [None]:
# Convert the evaluation data into the format the Ragas framework expects.
from ragas import EvaluationDataset
evaluation_dataset = EvaluationDataset.from_list(dataset)

In [30]:
evaluation_dataset

EvaluationDataset(features=['user_input', 'retrieved_contexts', 'response', 'reference'], len=3)
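The notebook builds the same data twice, once with the keys `question`/`answer`/`contexts` and once with `user_input`/`response`/`retrieved_contexts`. As a sketch of an alternative to rebuilding everything, the columns of an existing dict can be renamed. The mapping below is taken from the two cells in this notebook; the helper name `rename_columns` is hypothetical, and the sample values are placeholders.

```python
# Mapping from the first cell's keys to the keys shown in the
# EvaluationDataset output above.
LEGACY_TO_CURRENT = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truths": "reference",
}


def rename_columns(columns):
    """Return a new dict with legacy keys renamed; unknown keys are kept as-is."""
    return {LEGACY_TO_CURRENT.get(name, name): values for name, values in columns.items()}


legacy = {
    "question": ["What is faithfulness ?"],
    "answer": ["placeholder answer"],
    "contexts": [["placeholder context"]],
    "reference": ["placeholder reference"],
}
renamed = rename_columns(legacy)
print(list(renamed))
```

Renaming avoids calling the RAG chain a second time, so the answers being evaluated are exactly the ones generated earlier.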

We can evaluate our RAG system on the collected dataset with a set of commonly used RAG metrics. Any model can serve as the evaluator LLM.
By default, Ragas uses the OpenAI API.

In [27]:
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(llm)

In [28]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness
from ragas import evaluate

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
    llm=evaluator_llm,
)
result

Evaluating: 100%|██████████| 9/9 [01:05<00:00,  7.26s/it]


{'context_recall': 1.0000, 'faithfulness': 0.8333, 'factual_correctness(mode=f1)': 0.8733}
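The three scores are all ratios over claims that the evaluator LLM extracts and checks. As a rough sketch of the final arithmetic only (claim extraction and verification are done by the LLM and are not modeled here): faithfulness is the share of answer claims supported by the retrieved context, context recall is the share of reference claims attributable to the retrieved context, and the `f1` mode of factual correctness combines matched and unmatched claims into an F1 score. The function names and claim counts below are hypothetical, and the exact Ragas implementation may differ in detail.

```python
def supported_ratio(supported, total):
    """Share of claims that passed verification; NaN when there are no claims."""
    return supported / total if total else float("nan")


def f1_score(tp, fp, fn):
    """F1 from claim counts.

    tp: response claims supported by the reference
    fp: response claims not supported by the reference
    fn: reference claims missing from the response
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts, not taken from the run above.
faithfulness_score = supported_ratio(supported=3, total=6)
factual_f1 = f1_score(tp=4, fp=1, fn=1)
print(faithfulness_score, factual_f1)
```

Because every score depends on LLM judgments, repeated runs can give slightly different values even with `temperature=0`.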

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)


result = evaluate(
    dataset = dataset, 
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy
    ],
    llm=evaluator_llm
)

result

Evaluating:  50%|█████     | 6/12 [00:23<00:17,  2.85s/it]Exception raised in Job[11]: AuthenticationError(Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-ba2dd***********************1d90. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}})
Evaluating:  67%|██████▋   | 8/12 [00:31<00:13,  3.26s/it]Exception raised in Job[3]: AuthenticationError(Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-ba2dd***********************1d90. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}})
Evaluating:  75%|███████▌  | 9/12 [00:32<00:08,  2.68s/it]Exception raised in Job[7]: AuthenticationError(Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-ba2dd***********************1d90. You can find your API key at https://platform.openai.co

{'context_precision': 1.0000, 'context_recall': 1.0000, 'faithfulness': 0.8571, 'answer_relevancy': nan}
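The `nan` for `answer_relevancy` is consistent with the 401 errors in the log: the failing jobs are 3, 7 and 11, which is every fourth job and matches the position of `answer_relevancy` in the metric list. Unlike the other three metrics, answer relevancy needs an embedding model, and the error message shows the request going to OpenAI with a key that OpenAI rejects. The sketch below shows, with toy vectors, why embeddings are involved: following the Ragas paper, questions are generated from the answer and compared with the original question by cosine similarity. The function names and vectors are illustrative assumptions, not the Ragas implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the question and questions generated from the answer.

    Returns NaN when no embeddings are available, which mirrors what
    happens when every embedding request fails.
    """
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else float("nan")


score = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
failed = answer_relevancy_score([1.0, 0.0], [])
print(score, failed)
```

Passing a working embedding model, such as the `embeddings_model` defined earlier, through the `embeddings` argument of `evaluate` would likely avoid the fallback to OpenAI embeddings.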

In [29]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

df = result.to_pandas()
df

Unnamed: 0,user_input,retrieved_contexts,response,reference,context_recall,faithfulness,factual_correctness(mode=f1)
0,What is faithfulness ?,"[First, Faithfulness refers to the idea that the an-\nswer should be grounded in the given context. This\nis important to avoid hallucinations, and to ensure\nthat the retrieved context can act as a justification\nfor the generated answer. Indeed, RAG systems are\noften used in applications where the factual con-\nsistency of the generated text w.r.t. the grounded\nsources is highly important, e.g. in domains such as\nlaw, where information is constantly evolving. Sec-, Faithfulness measures the information\nconsistency of the answer against the\ngiven context. Any claims that are made\nin the answer that cannot be deduced\nfrom context should be penalized.\nGiven an answer and context, assign a\nscore for faithfulness in the range 0-10.\ncontext: [context]\nanswer: [answer]\nTies, where the same score is assigned by the LLM\nto both answer candidates, were broken randomly.\nThe second baseline, shown as GPT Ranking, in-, considered quality dimensions. For faithfulness\nand context relevance, the two annotators agreed in\naround 95% of cases. For answer relevance, they\nagreed in around 90% of the cases. Disagreements\nwere resolved after a discussion between the anno-\ntators.\nFaithfulness\nTo obtain human judgements about\nfaithfulness, we first used ChatGPT to answer the\nquestion without access to any additional context.\nWe then asked the annotators to judge which of the]","Faithfulness, in the context of Retrieval Augmented Generation (RAG) systems, refers to the degree to which the generated answer is grounded in and consistent with the provided context. It ensures that the claims made in the answer can be logically deduced from the given context, thereby avoiding hallucinations or unsupported statements. 
Faithfulness is critical in applications where factual accuracy and consistency with reference sources are essential, such as legal or medical domains.",Faithfulness refers to the idea that the answer should be grounded in the given context.,1.0,1.0,1.0
1,"How many pages are included in the WikiEval dataset, and which years do they cover information from?","[which we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised\nthose with recent edits. For each of the 50 pages,\nwe then asked ChatGPT to suggest a question that\ncan be answered based on the introductory section\nof the page, using the following prompt:\nYour task is to formulate a question from\ngiven context satisfying the rules given\nbelow:, which are annotated with human judgments. We\ncan then verify to what extent our metrics agree\nwith human assessments of faithfulness, answer\nrelevance and context relevance. Since we are not\naware of any publicly available datasets that could\nbe used for this purpose, we created a new dataset,\nwhich we refer to as WikiEval4. To construct the\ndataset, we first selected 50 Wikipedia pages cov-\nering events that have happened since the start of\n20225. In selecting these pages, we prioritised, tems with valuable insights, even in the absence\nof any ground truth. Our evaluation on WikiEval\nhas shown that the predictions from Ragas are\nclosely aligned with human predictions, especially\nfor faithfulness and answer relevance.\nReferences\nAmos Azaria and Tom M. Mitchell. 2023. The inter-\nnal state of an LLM knows when its lying. CoRR,\nabs/2304.13734.\nSebastian Borgeaud, Arthur Mensch, Jordan Hoffmann,\nTrevor Cai, Eliza Rutherford, Katie Millican, George]","The WikiEval dataset includes 50 Wikipedia pages, and the information covered in these pages pertains to events that have happened since the start of 2022.","To construct the dataset, we first selected 50 Wikipedia pages covering events that have happened since the start of 2022.",1.0,1.0,1.0
2,Why is evaluating Retrieval Augmented Generation (RAG) systems challenging?,"[Abstract\nWe introduce Ragas (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG\nsystems are composed of a retrieval and an\nLLM based generation module, and provide\nLLMs with knowledge from a reference textual\ndatabase, which enables them to act as a natu-\nral language layer between a user and textual\ndatabases, reducing the risk of hallucinations.\nEvaluating RAG architectures is, however, chal-, Ragas: Automated Evaluation of Retrieval Augmented Generation\nShahul Es†, Jithin James†, Luis Espinosa-Anke∗♢, Steven Schockaert∗\n†Exploding Gradients\n∗CardiffNLP, Cardiff University, United Kingdom\n♢AMPLYFI, United Kingdom\nshahules786@gmail.com,jamesjithin97@gmail.com\n{espinosa-ankel,schockaerts1}@cardiff.ac.uk\nAbstract\nWe introduce Ragas (Retrieval Augmented\nGeneration Assessment), a framework for\nreference-free evaluation of Retrieval Aug-\nmented Generation (RAG) pipelines.\nRAG, retrieval-augmented systems is thus paramount. In\npractice, RAG systems are often evaluated in terms\nof the language modelling task itself, i.e. by mea-\nsuring perplexity on some reference corpus. How-\never, such evaluations are not always predictive\nof downstream performance (Wang et al., 2023c).\nMoreover, this evaluation strategy relies on the LM\nprobabilities, which are not accessible for some\nclosed models (e.g. ChatGPT and GPT-4). Ques-\ntion answering is another common evaluation task,]","Evaluating Retrieval Augmented Generation (RAG) systems is challenging because there are multiple dimensions that need to be assessed. These include:\n\n1. **Retrieval Effectiveness**: The ability of the retrieval system to identify relevant and focused context passages from the reference textual database.\n2. 
**Faithful Exploitation by LLMs**: The ability of the large language model (LLM) to effectively and faithfully use the retrieved passages when generating responses.\n3. **Generation Quality**: The quality, coherence, and relevance of the final generated output.\n\nTraditional evaluation methods, such as measuring perplexity or relying on question-answering tasks, may not always predict downstream performance or may not be feasible for closed models. Additionally, evaluations often rely on ground truth human annotations, which can be time-consuming and expensive. This complexity necessitates a more robust and automated evaluation framework like Ragas, which can assess these different dimensions without requiring human annotations.","Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself.",1.0,0.5,0.62
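The per-row scores in this table account for the aggregate result printed earlier: each summary value is the mean of its column. The short check below copies the three score columns from the table by hand and reproduces the summary; only the variable names are new.

```python
from statistics import mean

# Per-row scores copied from the table above (rows 0, 1, 2).
per_row = {
    "context_recall": [1.0, 1.0, 1.0],
    "faithfulness": [1.0, 1.0, 0.5],
    "factual_correctness(mode=f1)": [1.0, 1.0, 0.62],
}

summary = {name: round(mean(scores), 4) for name, scores in per_row.items()}
print(summary)
```

With only three questions, a single weak row (row 2 here) moves the average a lot, so the per-row view is the more informative one for debugging.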
