# Chapter 6: Question Answering

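 - [1. Introduction](#1.-Introduction)
 - [2. Environment Setup](#2.-Environment-Setup)
 - [3. Loading the Vector Database](#3.-Loading-the-Vector-Database)
 - [4. Building a Retrieval QA Chain](#4.-Building-a-Retrieval-QA-Chain)
 - [5. A Closer Look at Retrieval QA Chains](#5.-A-Closer-Look-at-Retrieval-QA-Chains)
     - [5.1 Template-Based Retrieval QA Chain](#5.1-Template-Based-Retrieval-QA-Chain)
     - [5.2 MapReduce-Based Retrieval QA Chain](#5.2-MapReduce-Based-Retrieval-QA-Chain)
     - [5.3 Refine-Based Retrieval QA Chain](#5.3-Refine-Based-Retrieval-QA-Chain)
 - [6. Experiment: Tracking State](#6.-Experiment:-Tracking-State)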


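## 1. Introduction


In the previous chapter we discussed how to retrieve documents relevant to a given question. The next step is to take those documents together with the original question, pass both to a language model, and ask it to answer the question. This chapter walks through that process and several different ways of carrying it out.

Storage and retrieval are now done. Once we have the relevant document chunks, we need to pass them to the language model to get an answer. The general flow is: a question is asked, we look up the relevant documents, we pass those chunks together with a system prompt to the language model, and we get an answer back.

By default, all document chunks are passed into the same context window, that is, into a single language model call. There are alternative approaches, each with its own trade-offs. Their main advantage is that they still work when there are too many documents to fit into one context window. MapReduce, Refine, and MapRerank are three methods for working around a short context window. This chapter briefly introduces the first two.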

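## 2. Environment Setup

The environment is configured the same way as in the previous chapters, so the details are not repeated here.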

In [26]:
import os

from dotenv import load_dotenv
from langchain_community.chat_models.tongyi import ChatTongyi
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_core.messages import HumanMessage

# Load API keys from a local .env file
env = load_dotenv("/home/xinyi/.env")

# Chat model with streaming enabled
llm = ChatTongyi(
    streaming=True,
)

# Smoke test: stream a short reply chunk by chunk
res = llm.stream([HumanMessage(content="hi")], streaming=True)
for r in res:
    print("chat resp:", r.content)

# Embedding model used to query the vector database
embedding = DashScopeEmbeddings(
    model="text-embedding-v1",
)


chat resp: Hello
chat resp: !
chat resp:  How
chat resp:  can I assist you today
chat resp: ?


The GPT-3.5 API was updated after September 2, 2023, so a date check is needed here to pick the model name.

In [2]:
import datetime
current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9, 2):
    llm_name = "gpt-3.5-turbo-0301"
else:
    llm_name = "gpt-3.5-turbo"
print(llm_name)

gpt-3.5-turbo-0301


## 3. Loading the Vector Database

In [2]:
# Load the vector database that was persisted earlier
from langchain.vectorstores import Chroma
# from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/cs229_lectures/'
# embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [3]:
# The store contains the 209 document chunks we split earlier
print(vectordb._collection.count())

209


We can test vector retrieval for a question. The code below searches the vector database by similarity and returns k documents.

In [4]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

In [5]:
question = "这节课的主要话题是什么"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3
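The idea behind a top-k similarity search can be sketched in plain Python. This is only an illustration, not how Chroma is implemented: the two-dimensional vectors are made up, `cosine_similarity` and `top_k` are hypothetical helper names, and cosine similarity is an assumption (the metric a real store uses depends on its configuration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up embeddings standing in for four document chunks
toy_doc_vecs = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.7, 0.7],
]
toy_query_vec = [1.0, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=3)
print(nearest)  # [0, 1, 3]
```

As with `similarity_search(question, k=3)`, the result always has at most k entries, ordered from most to least similar.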

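## 4. Building a Retrieval QA Chain

With LangChain we can build a retrieval QA chain that uses GPT-3.5 to answer questions, that is, question answering backed by a retrieval step. We create it by passing in a language model and a vector database that serves as the retriever. We can then call it with a question as the query and get an answer.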

In [9]:
# Use GPT-3.5 with the temperature set to 0
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)

In [6]:
# Import the retrieval QA chain
from langchain.chains import RetrievalQA

In [7]:
# Create a retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [8]:
# Run a retrieval-backed query like this
question = "What are major topics for this class?"
result = qa_chain({"query": question})


In [9]:
result["result"]

"The major topics for this class seem to include:\n\n1. Big O notation: You mentioned that everyone is assumed to know about it, which suggests it's a foundational concept.\n2. Data structures: Queues, stacks, and binary trees are likely discussed as part of computer science fundamentals.\n3. Programming skills: Writing simple computer programs is mentioned as a prerequisite.\n4. Matrix-vector notation: It appears to be a topic that was covered relatively quickly, possibly related to linear algebra.\n5. Linear algebra: You specifically mention that the instructor will delve into it and suggest there might be a refresher session for those who need it.\n\nThese are the main topics that are directly addressed in the provided context."

In [10]:
# Run a retrieval-backed query like this (Chinese question)
question = "这节课的主要话题是什么"
result = qa_chain({"query": question})

In [11]:
print(result["result"])

这节课的主要话题包括线性代数的复习和使用矩阵向量 notation，以及可能的证明内容。如果线性代数部分讲得很快，可以去这周的讨论课上获取更详细的解释。


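## 5. A Closer Look at Retrieval QA Chains

The code above gives us a simple retrieval QA chain. Next, let's look at the details and see what LangChain actually does inside this chain.


### 5.1 Template-Based Retrieval QA Chain


We first define a prompt template. It contains instructions on how to use the context passages that follow, and then a placeholder for the context variable.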

In [12]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


In [13]:
# Chinese version
from langchain.prompts import PromptTemplate

# Build prompt
template = """使用以下上下文片段来回答最后的问题。如果你不知道答案，只需说不知道，不要试图编造答案。答案最多使用三个句子。尽量简明扼要地回答。在回答的最后一定要说"感谢您的提问！"
{context}
问题：{question}
有用的回答："""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


In [14]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [15]:
question = "Is probability a class topic?"

In [16]:
result = qa_chain({"query": question})

In [17]:
result["result"]

'Yes, the class assumes familiarity with basic probability and statistics, with topics like random variables, expectation, and variance being part of the prerequisites.'

In [19]:
# Chinese version
question = "机器学习是其中一节的话题吗"

In [20]:
result = qa_chain({"query": question})

In [21]:
result["result"]

'是的，这是一节关于机器学习的课程，由Andrew Ng教授介绍。'

In [22]:
result["source_documents"][0]

Document(page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we'll start to  talk a bit about machine learning.  \nBy way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so \nI personally work in machine learning, and I' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I'm actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learni

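This approach works well because it involves only a single call to the language model. Its limitation is that if there are too many documents, they may not all fit into the context window. Another technique for question answering over documents is MapReduce.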
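To make the single-call approach concrete, here is a minimal sketch of how retrieved chunks could be combined into one prompt. It is not LangChain's implementation: `build_stuff_prompt`, the toy template, the blank-line separator, and the character budget `max_chars` are all assumptions made for illustration.

```python
def build_stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill the template once."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Toy template with the same two placeholders as QA_CHAIN_PROMPT
toy_template = (
    "Answer using only this context.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)
toy_chunks = ["Chunk A about prerequisites.", "Chunk B about probability."]

stuff_prompt = build_stuff_prompt(
    toy_template, toy_chunks, "Is probability a class topic?"
)
print(stuff_prompt)

# Hypothetical budget: a real limit is measured in tokens, not characters
max_chars = 200
fits_in_window = len(stuff_prompt) <= max_chars
print(fits_in_window)
```

Every additional chunk makes the prompt longer, which is why this approach eventually runs into the context window limit.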

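### 5.2 MapReduce-Based Retrieval QA Chain

In the MapReduce technique, each document is first sent to the language model on its own to obtain an initial answer. These answers are then combined into a final answer through one last call to the language model. This involves more calls to the language model, but it has the advantage of being able to handle any number of documents.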


In [31]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

In [32]:
question = "Is probability a class topic?"
result = qa_chain_mr({"query": question})

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/vocab.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1133)')))"), '(Request ID: c81264c8-baad-4fb9-9b04-88aa89c66ce0)')

In [30]:
result["result"]

"Refined Answer: Probability plays a central role in this class, particularly in the context of machine learning and statistical inference. In the class, you'll encounter the concept of conditional probabilities, such as P(Y|X), where Y represents the output variable and X represents the input features. The notation P(Y|X; θ) introduces the idea of a parameter θ, often referred to from a frequentist perspective, which is not considered a random variable but represents the true underlying structure generating the data. This distinction between fixed parameters and random variables is crucial when distinguishing between generative and discriminative learning algorithms, which you'll explore in a later lecture.\n\nThe instructor emphasizes that while assumptions made in modeling are not absolutely true, they provide a practical framework for developing algorithms. As you progress through the course, you'll learn how to evaluate when these assumptions benefit or hinder your learning algori

In [27]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
# Chinese version
question = "概率论是其中一节的话题吗"
result = qa_chain_mr({"query": question})
result["result"]

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/tokenizer_config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1133)')))"), '(Request ID: 8a495217-da37-4e45-b8f9-766978970a6c)')

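When we run the earlier question through this chain, two problems with the approach become apparent. First, it is much slower. Second, the result is actually worse: the chain reports that, based on the given part of the document, there is no clear answer to the question. This likely happens because each document is answered in isolation, so if the relevant information is spread across two documents, the model never sees all of it in the same context.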

In [None]:
#import os
#os.environ["LANGCHAIN_TRACING_V2"] = "true"
#os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
#os.environ["LANGCHAIN_API_KEY"] = "..." # replace dots with your api key

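Setting the environment variables above (uncommented, with your own API key) enables tracing, which lets us inspect the details of the MapReduce document chain. In the demo above, for example, there are actually four separate calls to the language model. After each document has been processed, the individual answers are combined in a final chain, the Stuffed Documents chain, which merges all of them into the final call.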
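The call pattern can be sketched without any API access. This is an illustrative sketch, not LangChain's code: `map_reduce_answer` and `CountingLLM` are hypothetical names, the prompts are simplified, and the stub "model" only counts how often it is called.

```python
def map_reduce_answer(llm, chunks, question):
    """Answer per chunk (map), then combine the answers in one call (reduce)."""
    # Map step: one independent call per chunk
    partial_answers = [
        llm(f"Context:\n{chunk}\nQuestion: {question}") for chunk in chunks
    ]
    # Reduce step: a single final call that sees only the partial answers
    summaries = "\n".join(partial_answers)
    return llm(f"Combine these partial answers:\n{summaries}\nQuestion: {question}")


class CountingLLM:
    """Stub standing in for a language model; it counts its invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer#{self.calls}"


fake_llm = CountingLLM()
mr_result = map_reduce_answer(
    fake_llm, ["doc 1", "doc 2", "doc 3"], "Is probability a class topic?"
)
print(fake_llm.calls)  # 4: three map calls plus one reduce call
```

In this sketch the number of calls is always the number of chunks plus one, and no map call ever sees another chunk, which is the isolation problem described above.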

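### 5.3 Refine-Based Retrieval QA Chain

We can set the chain type to Refine in the same way. This is a new chain type. The Refine document chain resembles the MapReduce chain in that it calls the LLM once per document. The difference is that the prompt sent to the LLM at each step combines the previous response with the new document and asks for an improved response. Conceptually this is similar to an RNN: the context accumulates from step to step, which addresses the problem of information being spread across different documents.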

In [28]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
question = "Is probability a class topic?"
result = qa_chain_mr({"query": question})
result["result"]

"Refined Answer: Probability plays a central role in this class, particularly in the context of machine learning and statistical inference. In the class, you'll encounter the concept of conditional probabilities, such as P(Y|X), where Y represents the output variable and X represents the input features. The notation P(Y|X; θ) introduces the idea of a parameter θ, often referred to from a frequentist perspective, which is not considered a random variable but represents the true underlying structure generating the data. This distinction between fixed parameters and random variables is crucial when distinguishing between generative and discriminative learning algorithms, which you'll explore in a later lecture.\n\nThe instructor emphasizes that while assumptions made in modeling are not absolutely true, they provide practical approximations that enhance our learning algorithms. As the course progresses, you'll learn how to evaluate the impact of these assumptions on algorithm performance 

In [57]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
question = "概率论是其中一节的话题吗"
result = qa_chain_mr({"query": question})
result["result"]

'Based on the additional context provided, the instructor mentions that they will cover statistics and algebra in the discussion sections as a refresher, and will also use the discussion sections to go over extensions of the material taught in the main lectures. However, there is no explicit mention of probability theory being covered in the course. Therefore, the original answer still stands.'

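You will notice that this result is better than the one from the MapReduce chain. The Refine chain combines information one document at a time, which encourages more information to be carried forward than the MapReduce chain does.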
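The sequential hand-off can be sketched with a stub model. As before, this is an assumption-laden illustration rather than LangChain's implementation: `refine_answer` and `recording_llm` are hypothetical names and the prompts are simplified.

```python
def refine_answer(llm, chunks, question):
    """Build an answer from the first chunk, then refine it chunk by chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    # Initial answer from the first chunk only
    answer = llm(f"Context:\n{chunks[0]}\nQuestion: {question}")
    # Each later call sees the previous answer plus one new chunk
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer


seen_prompts = []


def recording_llm(prompt):
    """Stub standing in for a language model; it records every prompt."""
    seen_prompts.append(prompt)
    return f"draft {len(seen_prompts)}"


refined = refine_answer(
    recording_llm, ["doc 1", "doc 2", "doc 3"], "Is probability a class topic?"
)
print(refined)            # draft 3
print(len(seen_prompts))  # 3: one call per chunk
```

The second and third prompts contain the previous draft, which is how information from earlier documents reaches later steps.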

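## 6. Experiment: Tracking State

Let's run an experiment here.

We create a QA chain with the default chain type, stuff, and ask: is probability a class topic? The chain answers that probability is a prerequisite. We then follow up: why are those prerequisites needed? The answer we get says that the prerequisites assume basic knowledge of computer science and basic computer skills and principles, which has nothing to do with the earlier question about probability.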

In [33]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [34]:
question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

'Yes, the context mentions "basic probability and statistics" as a prerequisite for the class, suggesting that probability is a topic covered in the course.'

In [35]:
question = "why are those prerequesites needed?"
result = qa_chain({"query": question})
result["result"]

'The prerequisites, such as knowledge of big O notation, data structures like queues, stacks, and binary trees, as well as basic programming skills, are necessary for understanding the fundamental concepts and efficiency in computer science topics that underpin machine learning. These topics form the foundation for algorithms and data structures, which are crucial for designing and analyzing machine learning models. Additionally, being able to write simple programs helps in grasping how these concepts work practically. The professor wants to ensure students have a solid base before diving into machine learning to prevent wasting time on basic concepts.'

In [36]:
question = "概率论是这节课的一个内容吗"
result = qa_chain({"query": question})
result["result"]

'是的，概率论在这节课中被提及，它是机器学习的一部分，并且可能会在讨论部分作为复习内容出现。'

In [37]:
question = "为什么需要具备这些知识"
result = qa_chain({"query": question})
result["result"]

'这些知识，包括大O记号、数据结构（如队列、栈和二叉树）以及基本编程技能，是计算机科学和算法课程的基础。它们对于理解程序效率、设计和分析算法至关重要。掌握这些能够帮助你更有效地解决问题，写出运行时间复杂度可控的代码，从而在编程和计算机科学领域进行有效的学习和工作。'

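In short, the chain we are using has no notion of state. It does not remember previous questions or previous answers. To get that behavior we need to introduce memory, which is the topic of the next chapter.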
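The difference between a stateless query and one that carries history can be sketched as follows. This only illustrates the idea and is not LangChain's memory mechanism: `stateless_query` and `query_with_history` are hypothetical helpers, and the recorded answer is a shortened paraphrase used as toy data.

```python
def stateless_query(question):
    """What the chain above effectively sees: the new question alone."""
    return question


def query_with_history(history, question):
    """Prepend earlier question/answer pairs so a follow-up can be resolved."""
    lines = [f"Q: {q}\nA: {a}" for q, a in history]
    lines.append(f"Follow-up question: {question}")
    return "\n".join(lines)


# Toy history; the answer text is a shortened paraphrase
toy_history = [
    ("Is probability a class topic?", "Yes, basic probability is a prerequisite."),
]
followup = "why are those prerequisites needed?"

print(stateless_query(followup))
print(query_with_history(toy_history, followup))
```

Only the second form tells the model what "those prerequisites" refers to.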