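The previous article turned text into vectors; now we need to put those vectors to use. A vector store is not just a container for vectors, it is also the retrieval engine. Popular vector stores include FAISS, Chroma, Pinecone, and Milvus, along with several cloud databases. This article uses FAISS. First install the packages: `pip install faiss-cpu langchain-community`, then prepare some data: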

In [5]:
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

# Embedding model; its output dimension (384) is checked further below
embedding = HuggingFaceEmbeddings(model_name='BAAI/bge-small-en-v1.5')

# Load the essay from a path relative to the current working directory
root = os.getcwd()
loader = TextLoader(os.path.join(root, 'data/paul_graham/paul_graham_essay.txt'))
data = loader.load()

# Split into chunks of at most 500 characters with a 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(data)


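Creating a vector store instance is straightforward. One option is to create an empty vector store first and add data afterwards. A note on terminology: the store that holds the vectors is the vector store proper, but vectors alone are of little use. You also need a doc store that holds the documents, plus a mapping between documents and vectors. When creating an empty vector store, you have to supply an empty doc store and an empty mapping as well, and the data itself is added in a separate step later.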

In [9]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

dim = len(embedding.embed_query("hello")) # 384
index = faiss.IndexFlatL2(dim)

vector_store = FAISS(embedding_function=embedding,
                     index=index,
                     docstore=InMemoryDocstore(),
                     index_to_docstore_id={})
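
To make the three pieces concrete, here is a minimal pure-Python sketch of how a vector list, a doc store, and a position-to-id mapping work together. `ToyVectorStore` and its methods are hypothetical names invented for illustration, not part of FAISS or LangChain; the search is a brute-force scan by squared L2 distance, which is the same kind of exact search a flat L2 index performs.

```python
import uuid


class ToyVectorStore:
    """Hypothetical stand-in showing the three parts: vectors, doc store, mapping."""

    def __init__(self):
        self.vectors = []                # plays the role of the vector index
        self.docstore = {}               # doc id -> document text
        self.index_to_docstore_id = {}   # vector position -> doc id

    def add(self, vector, text):
        # Store the vector, the document, and the link between the two
        doc_id = str(uuid.uuid4())
        position = len(self.vectors)
        self.vectors.append(vector)
        self.docstore[doc_id] = text
        self.index_to_docstore_id[position] = doc_id
        return doc_id

    def search(self, query_vector, k=1):
        # Exact search: compute the squared L2 distance to every stored vector
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(vec, query_vector)), pos)
            for pos, vec in enumerate(self.vectors)
        )
        # Translate vector positions back into documents via the mapping
        return [self.docstore[self.index_to_docstore_id[pos]] for _, pos in scored[:k]]


store = ToyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "diagonal")
store.add([5.0, 0.0], "far right")

nearest = store.search([0.9, 1.2], k=2)
print(nearest)  # ['diagonal', 'origin']
```

The index only ever returns positions; without the mapping and the doc store there would be no way to get from a position back to the text.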

Alternatively, you can build a vector store directly from documents that carry metadata, from plain texts, or from precomputed embeddings. The corresponding functions are
- `from_documents`
- `from_texts`
- `from_embeddings`

With these methods, the doc store and the mapping are created automatically.

In [13]:
vector_store = FAISS.from_documents(docs, embedding)
print(vector_store.index_to_docstore_id)
print(vector_store.docstore)

{0: '73b9bc80-edfc-4384-9585-bcb87b2555ce', 1: 'a7f86181-b1c5-41d7-9874-a675a5b1808c', 2: '71eb9323-6b1e-4ff7-a723-3fdf008c3618', 3: '9e03d34a-9f30-4595-bf67-d147a69c279e', 4: 'c9718e5e-e40c-49c5-b848-7d9f01d412b5', 5: '37dfe88a-02b6-4f41-8c5d-008155224e45', 6: 'c6e3898c-5e31-4408-87a9-490d8ad8b63a', 7: 'c29f63d6-a518-4609-a314-320977547302', 8: '29826f26-be5d-4dbf-ad36-78d63eb621d4', 9: 'fb6b5518-3ca6-4ee3-9dba-4008d05d6b9b', 10: '0475cdce-bfb6-4485-baf7-160203bdbf8b', 11: '40149006-e293-4ae3-a552-4d65a1bf2f04', 12: '66cfacb3-1574-4caa-8af4-2d9a9528f035', 13: '6d296582-37d1-43ec-9ba9-b6a073381030', 14: 'a2dd7399-0235-49cf-ad3f-fb276f3780fa', 15: '60a6789c-08da-42d6-bd2f-0a09c87f29d1', 16: '461d8399-8cca-4171-a31b-2bf635d31346', 17: '21c77f72-944d-4855-90c8-231eefdac9c6', 18: '6d646ffc-d292-4bdf-8cf5-bcde4d21b95a', 19: '72fdbad4-4c18-435a-bc66-ee3b4350b037', 20: '0b4edf4c-0bfd-41c6-b672-a69ba7c9015a', 21: 'd25afcec-5e3d-401f-89aa-8776e72e9b69', 22: '42f6c6b8-b72f-4ef3-acbf-517a69365da7

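Of course, sooner or later you will need to add more data. There are three ways to do so: with documents that carry metadata, with plain texts, or with precomputed embeddings. The corresponding functions are
- `add_documents`
- `add_texts`
- `add_embeddings`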

When adding data you do not need to worry about the doc store or the mapping; both are updated automatically. Taking `add_documents` as an example:

In [None]:
vector_store.add_documents(docs)

What can be added can also be removed. Deletion goes by the `id` under which a document is kept in the doc store.

In [19]:
vector_store.delete(['73b9bc80-edfc-4384-9585-bcb87b2555ce'])

True
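
Deleting is slightly more involved than adding: once a vector is removed, the positions of the vectors after it shift, so the position-to-id mapping has to be rebuilt. The following is a rough sketch of that bookkeeping in plain Python. `delete_by_id` is a hypothetical helper written for this article and the data is a toy example; it is not how any particular library implements deletion.

```python
def delete_by_id(vectors, docstore, index_to_docstore_id, ids):
    """Remove the given doc ids and renumber the remaining positions from 0."""
    missing = set(ids) - set(index_to_docstore_id.values())
    if missing:
        raise ValueError(f"ids not found: {sorted(missing)}")

    to_remove = set(ids)
    kept_vectors, new_mapping = [], {}
    for position in sorted(index_to_docstore_id):
        doc_id = index_to_docstore_id[position]
        if doc_id in to_remove:
            del docstore[doc_id]      # drop the document itself
            continue                  # and skip its vector
        new_mapping[len(kept_vectors)] = doc_id
        kept_vectors.append(vectors[position])
    return kept_vectors, new_mapping


# Toy data: three vectors with short, made-up ids
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 0.0]]
docstore = {"a": "origin", "b": "diagonal", "c": "far right"}
mapping = {0: "a", 1: "b", 2: "c"}

vectors, mapping = delete_by_id(vectors, docstore, mapping, ["b"])
print(mapping)   # {0: 'a', 1: 'c'}
print(docstore)  # {'a': 'origin', 'c': 'far right'}
```

Note that document `c` moved from position 2 to position 1, which is why ids rather than positions are the stable handle for deletion.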

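There are two ways to retrieve. The first is `vector_store.similarity_search`, whose main parameters are:
- `query`: the text to search for
- `k`: the number of results to return
- `filter`: a filter condition, for example to search only texts on a given topic or from a given time range
- `fetch_k`: the number of candidates to fetch before `filter` is applied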
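
The interplay between `k`, `filter`, and `fetch_k` is easier to see in a small sketch. The code below assumes the order described above: fetch the `fetch_k` nearest candidates, apply the filter, then keep at most `k`. `filtered_search`, the distances, and the metadata are all made up for illustration.

```python
def filtered_search(candidates, k, fetch_k, keep):
    """candidates: list of (distance, metadata, text) tuples.

    Take the fetch_k nearest candidates, apply the filter, return at most k texts.
    """
    nearest = sorted(candidates, key=lambda c: c[0])[:fetch_k]
    passed = [c for c in nearest if keep(c[1])]
    return [text for _, _, text in passed[:k]]


# Toy candidates with invented distances and topics
candidates = [
    (0.10, {"topic": "lisp"}, "chunk 1"),
    (0.20, {"topic": "painting"}, "chunk 2"),
    (0.30, {"topic": "painting"}, "chunk 3"),
    (0.40, {"topic": "lisp"}, "chunk 4"),
    (0.50, {"topic": "lisp"}, "chunk 5"),
]


def is_lisp(metadata):
    return metadata["topic"] == "lisp"


small = filtered_search(candidates, k=2, fetch_k=3, keep=is_lisp)
large = filtered_search(candidates, k=2, fetch_k=5, keep=is_lisp)
print(small)  # ['chunk 1']
print(large)  # ['chunk 1', 'chunk 4']
```

With `fetch_k=3` only one candidate survives the filter, so fewer than `k` results come back. If a filter is selective, `fetch_k` needs to be comfortably larger than `k`.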

In [20]:
res = vector_store.similarity_search('What did the author do growing up?', k=3)
print(res)

[Document(metadata={'source': '/Users/wenjiazhai/Documents/GitHub/RAG_zero_to_hero/data/paul_graham/paul_graham_essay.txt'}, page_content="book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school."), Document(metadata={'source': '/Users/wenjiazhai/Documents/GitHub/RAG_zero_to_hero/data/paul_graham/paul_graham_essay.txt'}, page_content="Over the next several years I wrote lots of essays about all kinds of different topics. O'Reilly reprinted a collection of them as a book, called Hackers & Painters after one of the essays in it. I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office."), Document(metadata={'source': '/Users/wenjiazhai/Documents/G

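The second is `vector_store.as_retriever`, which takes keyword arguments: `search_type` (one of `similarity`, `mmr`, or `similarity_score_threshold`) and `search_kwargs`, a dict that may contain `k`, `score_threshold`, `fetch_k`, or `filter`. It returns a retriever, which is then invoked with the query.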
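
Of the three search types, `mmr` (maximal marginal relevance) is the least self-explanatory: it trades relevance to the query against redundancy among the results already picked. Below is a small sketch of the idea in plain Python, using cosine similarity. The function names and toy vectors are my own; the name `lambda_mult` follows the parameter LangChain uses for this trade-off, but the code is an illustration of the general technique, not the library's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, vectors, k, lambda_mult=0.5):
    """Return the indices of k vectors chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(vectors)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, vectors[i])
            # Similarity to the closest already-selected vector (0 if none yet)
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [1.0, 0.1],    # 0: very close to the query
    [1.0, 0.11],   # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but different
]

picked_similarity = mmr(query, vectors, k=2, lambda_mult=1.0)
picked_diverse = mmr(query, vectors, k=2, lambda_mult=0.5)
print(picked_similarity)  # [0, 1]
print(picked_diverse)     # [0, 2]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is plain similarity ranking; with `0.5` the near-duplicate is skipped in favor of a different vector.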

In [21]:
engine = vector_store.as_retriever(search_kwargs={"k": 3})
res = engine.invoke('What did the author do growing up?')
print(res)

[Document(metadata={'source': '/Users/wenjiazhai/Documents/GitHub/RAG_zero_to_hero/data/paul_graham/paul_graham_essay.txt'}, page_content="book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school."), Document(metadata={'source': '/Users/wenjiazhai/Documents/GitHub/RAG_zero_to_hero/data/paul_graham/paul_graham_essay.txt'}, page_content="Over the next several years I wrote lots of essays about all kinds of different topics. O'Reilly reprinted a collection of them as a book, called Hackers & Painters after one of the essays in it. I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office."), Document(metadata={'source': '/Users/wenjiazhai/Documents/G

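As you can see, the two methods return exactly the same results. This step only finds the relevant documents; one last step remains: generation.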