## 2. What Is Retrieval-Augmented Generation (RAG)?


### 2.1 Inherent limitations of LLMs

1. An LLM's knowledge is not up to date: it stops at the training cutoff
2. An LLM may not know your private domain or business knowledge

<img src="gpt-llama2.png" style="margin-left: 0px" width="600px">


### 2.2 Retrieval-augmented generation

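The natural question: we have our own product knowledge base, service manuals, and other vertical-domain information. Can we get a large model to learn it?
Two approaches come to mind:

1. Retrain the model on this domain data so that it learns from it. This is fine-tuning.
2. Attach an external knowledge base to the model, and have the model answer user questions in combination with that knowledge base.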

<div class="alert alert-success">
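<b>Analogy:</b>
    <li>You can think of this process as an open-book exam: the LLM looks things up first and then answers. The model itself does not learn the knowledge in the process.</li>
    <li>Fine-tuning is a closed-book exam: you have to learn all the knowledge first before you can answer.</li>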
</div>

RAG (Retrieval-Augmented Generation), as the name suggests, strengthens a **generative model** by means of **retrieval**.

<video src="RAG.mp4" controls="controls" width=800px style="margin-left: 0px"></video>


## 3. Basic Workflow for Building a RAG System

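Steps:

1. Load the documents and **split** them into chunks according to some criterion
2. Load the chunks into a **search engine**
3. Wrap a **retrieval interface** that finds the chunks relevant to a query
4. Build the **call flow**: Query -> retrieval -> Prompt -> LLM -> response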



### 3.1 Loading and splitting documents


In [2]:
!pip install --upgrade openai


In [3]:
# Install a PDF parsing library
!pip install pdfminer.six


In [1]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

In [2]:
def extract_text_from_pdf(filename, page_numbers=None, min_line_length=1):
    '''Extract text from a PDF file (optionally restricted to the given page numbers)'''
    paragraphs = []
    buffer = ''
    full_text = ''
    # Extract all of the text
    for i, page_layout in enumerate(extract_pages(filename)):
        # If page numbers were given, skip pages outside that set (pages are 0-indexed)
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'
                
    # Regroup lines into paragraphs: a line shorter than min_line_length (e.g. a blank line) ends the current paragraph
    lines = full_text.split('\n')
    for text in lines:
        if len(text) >= min_line_length:
            buffer += (' '+text) if not text.endswith('-') else text.strip('-')
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs

In [3]:
paragraphs = extract_text_from_pdf("llama2.pdf", min_line_length=10)

In [4]:
for para in paragraphs[:3]:
    print(para+"\n")

 Llama 2: Open Foundation and Fine-Tuned Chat Models

 Hugo Touvron∗ Louis Martin† Kevin Stone† Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edu
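
The paragraph-regrouping step inside `extract_text_from_pdf` can be tried without a PDF. The sketch below copies only that loop into a standalone helper (the name `group_lines_into_paragraphs` and the sample string are made up for illustration) and runs it on plain text, assuming the same rule: lines at least `min_line_length` long are appended to the current paragraph, and any shorter line closes it.

```python
def group_lines_into_paragraphs(full_text, min_line_length=1):
    """Merge consecutive lines into paragraphs; a short line closes the paragraph."""
    paragraphs = []
    buffer = ''
    for line in full_text.split('\n'):
        if len(line) >= min_line_length:
            # Same joining rule as in extract_text_from_pdf
            buffer += (' ' + line) if not line.endswith('-') else line.strip('-')
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs


# Hypothetical sample text: two paragraphs separated by a blank line
sample = (
    "Llama 2 is a collection of\n"
    "pretrained language models.\n"
    "\n"
    "It ranges from 7B to 70B."
)
result = group_lines_into_paragraphs(sample, min_line_length=10)
for para in result:
    print(repr(para))
```

Each paragraph starts with a space because every line is joined with a leading `' '`; this matches the leading space visible in the printed output of the notebook cells.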

### 3.2 Search engine


Here we use Elasticsearch, an open-source search engine that supports search in a wide range of scenarios.

Official site: https://www.elastic.co/cn/elasticsearch (for those who want to learn more)

### Installing the ES server

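Installation guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html
(You can use Cursor as a learning aid while following it.)

After installation, check that ES is running with your system's service-status command; on my CentOS machine this is `service elasticsearch status`.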

### Installing the ES client

In [8]:
!pip install elasticsearch7  


### Installing NLTK (a text-processing library)

In [9]:
!pip install nltk


In [5]:
from elasticsearch7 import Elasticsearch, helpers
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import re

import warnings
warnings.simplefilter("ignore")  # Suppress some of the ES warnings

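# Download the tokenizer models and the stop-word list
nltk.download('punkt')  # Tokenizer models for English word and sentence splitting
nltk.download('stopwords')  # English stop-word list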

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
def to_keywords(input_string):
    '''Keep only the keywords of an (English) text'''
    # Replace every character that is not a letter, digit, or whitespace with a space
    no_symbols = re.sub(r'[^a-zA-Z0-9\s]', ' ', input_string)
    word_tokens = word_tokenize(no_symbols)
    # Load the stop-word list
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    # Drop stop words and stem the remaining words
    filtered_sentence = [ps.stem(w)
                         for w in word_tokens if w.lower() not in stop_words]
    return ' '.join(filtered_sentence)

In [7]:
to_keywords('how many parameters does llama 2 have?')

'mani paramet llama 2'

<div class="alert alert-info">
The to_keywords function here is an English-only implementation; for a Chinese implementation, see chinese_utils.py
</div>

Load the text into the search engine


In [8]:
# 1. Create the Elasticsearch connection
es = Elasticsearch(
    hosts=['http://localhost:9200'],  # Server address and port
    # http_auth=("elastic", "FKaB1Jpz0Rlw0l6G"),  # Username, password
)

# 2. Define the index name
index_name = "teacher_demo_index_tmp"

# 3. Delete the index if it already exists (for demo purposes only; not needed in a real application)
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

# 4. Create the index
es.indices.create(index=index_name)

# 5. Build the bulk-indexing actions, one document per paragraph
actions = [
    {
        "_index": index_name,
        "_source": {
            "keywords": to_keywords(para),
            "text": para
        }
    }
    for para in paragraphs
]

# 6. Bulk-load the text into the index
helpers.bulk(es, actions)

(983, [])

Implement keyword search


In [9]:
def search(query_string, top_n=3):
    # Query written in the ES query DSL
    search_query = {
        "match": {
            "keywords": to_keywords(query_string)
        }
    }
    res = es.search(index=index_name, query=search_query, size=top_n)
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]

In [10]:
results = search("how many parameters does llama 2 have?", 2)
for r in results:
    print(r+"\n")

 Llama 2 comes in a range of parameter sizes—7B, 13B, and 70B—as well as pretrained and fine-tuned variations.

 1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.§
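
To give an intuition for how a keyword `match` query ranks paragraphs: Elasticsearch scores text fields with BM25 by default, which rewards paragraphs that contain rare query terms and penalizes long paragraphs. The sketch below is a simplified textbook BM25 in pure Python, not the Lucene implementation; the function name `bm25_scores`, the hand-written toy keyword lists, and the whitespace tokenization are all assumptions made for illustration, so the numbers will not match real ES scores.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with a textbook BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms missing from the document contribute nothing
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents get a larger denominator
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy, hand-written keyword lists standing in for the indexed "keywords" field
docs = [
    "llama 2 come rang paramet size 7b 13b 70b".split(),
    "train new mix publicli avail data".split(),
    "releas variant llama 2 7b 13b 70b paramet".split(),
]
query = "mani paramet llama 2".split()

scores = bm25_scores(query, docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The second toy document shares no term with the query, so its score is zero and it ranks last. The other two match the same three terms, and the shorter one ranks first because of the length normalization.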



### 3.3 Wrapping the LLM API


In [11]:
from openai import OpenAI
import os
# Load environment variables
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # Read the local .env file, which defines OPENAI_API_KEY

client = OpenAI()

In [12]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    '''Wrap the OpenAI chat completions API'''
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # Randomness of the model output; 0 means the least random
    )
    return response.choices[0].message.content

### 3.4 Prompt template


In [13]:
def build_prompt(prompt_template, **kwargs):
    '''Fill the placeholders of a prompt template (__KEY__ is replaced by the keyword argument key)'''
    prompt = prompt_template
    for k, v in kwargs.items():
        if isinstance(v, str):
            val = v
        elif isinstance(v, list) and all(isinstance(elem, str) for elem in v):
            val = '\n'.join(v)
        else:
            val = str(v)
        prompt = prompt.replace(f"__{k.upper()}__", val)
    return prompt

In [20]:
prompt_template = """
你是一个问答机器人。
你的任务是根据下述给定的已知信息回答用户问题。
确保你的回复完全依据下述已知信息。不要编造答案。
如果下述已知信息不足以回答用户的问题，请直接回复"我无法回答您的问题"。

已知信息:
__INFO__

用户问：
__QUERY__

请用中文回答用户问题。
"""

In [19]:
prompt = build_prompt(prompt_template, info="a", query="b", key="c")
print(prompt)


你是一个问答机器人。
你的任务是根据下述给定的已知信息回答用户问题。
确保你的回复完全依据下述已知信息。不要编造答案。
如果下述已知信息不足以回答用户的问题，请直接回复"我无法回答您的问题"。

已知信息:
a

用户问：
b

请用中文回答用户问题。



### 3.5 A first look at the RAG pipeline


<video src="RAG.mp4" controls="controls" width=800px style="margin-left: 0px"></video>



In [21]:
user_query = "how many parameters does llama 2 have?"

# 1. Retrieve
search_results = search(user_query, 2)

# 2. Build the prompt
prompt = build_prompt(prompt_template, info=search_results, query=user_query)
print("===Prompt===")
print(prompt)

# 3. Call the LLM
response = get_completion(prompt)

print("===回复===")
print(response)

===Prompt===

你是一个问答机器人。
你的任务是根据下述给定的已知信息回答用户问题。
确保你的回复完全依据下述已知信息。不要编造答案。
如果下述已知信息不足以回答用户的问题，请直接回复"我无法回答您的问题"。

已知信息:
 Llama 2 comes in a range of parameter sizes—7B, 13B, and 70B—as well as pretrained and fine-tuned variations.
 1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.§

用户问：
how many parameters does llama 2 have?

请用中文回答用户问题。

===回复===
Llama 2有7B、13B和70B三种参数大小。
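
The three steps above (retrieve, build the prompt, call the LLM) can be bundled behind a single method so that the retriever or the model can be swapped independently. The sketch below is one possible way to do this, not part of the original notebook: the class name `RAGBot`, its `chat` method, and the stub functions are hypothetical, and the stubs stand in for `search` and `get_completion` so that the example runs without Elasticsearch or an API key.

```python
class RAGBot:
    """Bundle retrieval, prompt building, and the LLM call behind one method."""

    def __init__(self, search_fn, llm_fn, prompt_template, top_n=2):
        self.search_fn = search_fn        # callable: (query, top_n) -> list of str
        self.llm_fn = llm_fn              # callable: (prompt) -> str
        self.prompt_template = prompt_template
        self.top_n = top_n

    def chat(self, user_query):
        # 1. Retrieve the most relevant chunks
        chunks = self.search_fn(user_query, self.top_n)
        # 2. Fill the __INFO__ and __QUERY__ placeholders of the template
        prompt = (self.prompt_template
                  .replace("__INFO__", "\n".join(chunks))
                  .replace("__QUERY__", user_query))
        # 3. Call the LLM
        return self.llm_fn(prompt)


# Stubs for illustration only: a fixed "index" and an LLM that echoes its prompt
def fake_search(query, top_n):
    corpus = ["Llama 2 comes in 7B, 13B, and 70B sizes.", "Unrelated paragraph."]
    return corpus[:top_n]


def fake_llm(prompt):
    return prompt


template = "Known information:\n__INFO__\n\nUser question:\n__QUERY__\n"
bot = RAGBot(fake_search, fake_llm, template, top_n=1)
answer = bot.chat("how many parameters does llama 2 have?")
print(answer)
```

In the notebook itself, the same class could presumably be constructed as `RAGBot(search, get_completion, prompt_template)`, reusing the functions defined in the earlier cells.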
