# Weaviate

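This notebook covers how to get started with the Weaviate vector store in LangChain, using the `langchain-weaviate` package.

> [Weaviate](https://weaviate.io/) is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.

To use this integration, you need a running Weaviate database instance.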

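## Minimum versions

This module requires Weaviate `1.23.7` or higher. However, we recommend that you use the latest version of Weaviate.

## Connecting to Weaviate

In this notebook, we assume that you have a local instance of Weaviate running on `http://localhost:8080`, with port 50051 open for [gRPC traffic](https://weaviate.io/blog/grpc-performance-improvements). So, we will connect to Weaviate with: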

```python
import weaviate
weaviate_client = weaviate.connect_to_local()
```

### Other deployment options

Weaviate can be [deployed in many different ways](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate), such as with [Weaviate Cloud Services (WCS)](https://console.weaviate.cloud), [Docker](https://weaviate.io/developers/weaviate/installation/docker-compose), or [Kubernetes](https://weaviate.io/developers/weaviate/installation/kubernetes).

If your Weaviate instance is deployed in another way, [read more here](https://weaviate.io/developers/weaviate/client-libraries/python#instantiate-a-client) about the different ways to connect to Weaviate. You can use one of the [helper functions](https://weaviate.io/developers/weaviate/client-libraries/python#python-client-v4-helper-functions) or [create a custom instance](https://weaviate.io/developers/weaviate/client-libraries/python#python-client-v4-explicit-connection).

> Note that you need the `v4` client API, which creates a `weaviate.WeaviateClient` object.

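### Authentication

Some Weaviate instances, such as those running on WCS, have authentication enabled, for example API key and/or username+password authentication.

Read the [client authentication guide](https://weaviate.io/developers/weaviate/client-libraries/python#authentication) for more information, as well as the [in-depth authentication configuration page](https://weaviate.io/developers/weaviate/configuration/authentication).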
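
As a rough sketch, connecting to an API-key-protected cloud instance could look like the following. The environment variable names `WCS_URL` and `WCS_API_KEY` are hypothetical choices for this example. It also assumes that the installed v4 client provides `weaviate.connect_to_wcs` and `weaviate.auth.AuthApiKey`; newer client releases may expose the same helper as `connect_to_weaviate_cloud`, so check the authentication guide for your client version.

```python
import os


def cloud_settings(env=os.environ):
    """Read connection settings from the environment.

    WCS_URL and WCS_API_KEY are hypothetical names chosen for this sketch.
    """
    url = env.get("WCS_URL", "").strip()
    api_key = env.get("WCS_API_KEY", "").strip()
    if not url or not api_key:
        raise RuntimeError("Set WCS_URL and WCS_API_KEY before connecting.")
    return {"cluster_url": url, "api_key": api_key}


def connect_to_cloud(settings):
    """Open a v4 client against an API-key-protected cloud instance."""
    # Imported lazily so the settings helper works without the client installed.
    import weaviate

    # Assumption: connect_to_wcs is available in the installed client version.
    return weaviate.connect_to_wcs(
        cluster_url=settings["cluster_url"],
        auth_credentials=weaviate.auth.AuthApiKey(settings["api_key"]),
    )
```

A typical use would be `client = connect_to_cloud(cloud_settings())`, followed by `client.close()` once you are done so that the connection is released.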

## Installation

In [1]:
# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain

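## Environment setup

This notebook uses the OpenAI API through `OpenAIEmbeddings`. We suggest obtaining an OpenAI API key and exporting it as an environment variable named `OPENAI_API_KEY`.

Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read more about them in the [Python `os.environ` documentation](https://docs.python.org/3/library/os.html#os.environ) or in [this guide](https://www.twilio.com/en-us/blog/environment-variables-python).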
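
If the variable has not been exported in the shell, one common pattern is to prompt for it once inside the notebook. The helper below is a minimal sketch; the name `ensure_env` is made up for this example and is not part of LangChain or the OpenAI SDK.

```python
import getpass
import os


def ensure_env(name, prompt=getpass.getpass):
    """Return the value of an environment variable, asking for it once if it is missing."""
    if not os.environ.get(name):
        # getpass hides the typed value so the key does not end up in notebook output.
        os.environ[name] = prompt(f"{name}: ")
    return os.environ[name]
```

Calling `ensure_env("OPENAI_API_KEY")` before creating `OpenAIEmbeddings()` leaves the key in `os.environ`, where it is picked up automatically.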

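# Usage

## Find objects by similarity

Here is an example of how to find objects by similarity to a query, covering the whole process from data import to querying the Weaviate instance.

### Step 1: Data import

First, we will create the data to add to `Weaviate` by loading and chunking the contents of a long text file.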

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

In [3]:
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
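
To make the chunking step more concrete, here is a simplified, standard-library sketch of the idea: split on a separator, then pack the pieces greedily into chunks of at most `chunk_size` characters. It is an illustration only, not LangChain's implementation, and it ignores overlap. The function name `split_on_paragraphs` is invented for this example.

```python
def split_on_paragraphs(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, so a chunk can exceed the limit.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow the chunk, so start a new one.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_on_paragraphs(sample, chunk_size=40)
```

With a 40-character limit, the first two paragraphs fit into one chunk and the third starts a new one.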


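Now, we can import the data.

To do so, connect to the Weaviate instance and use the resulting `weaviate_client` object. For example, we can import the documents as shown below: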

In [4]:
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

In [5]:
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)


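### Step 2: Perform the search

We can now perform a similarity search. This returns the documents most similar to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.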

In [6]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# Print the first 100 characters of each result
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:100] + "...")


Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...

Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...

Document 3:
Vice President Harris and I ran for office with a new economic vision for America. 

Invest in Ameri...

Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
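
The vector side of such a search can be pictured as ranking stored embeddings by their similarity to the query embedding. The sketch below assumes cosine similarity as the metric and uses tiny made-up vectors; it is a toy illustration of the idea, not what Weaviate runs internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
stored = [
    ("court nomination", [0.9, 0.1, 0.0]),
    ("cost of living", [0.1, 0.9, 0.1]),
    ("voting rights", [0.6, 0.2, 0.4]),
]
results = top_k([1.0, 0.0, 0.0], stored, k=2)
```

Here `results` holds the two stored texts whose vectors point most nearly in the same direction as the query vector.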


You can also add filters, which include or exclude results based on the filter conditions. (See [more filter examples](https://weaviate.io/developers/weaviate/search/filters).)

In [7]:
from weaviate.classes.query import Filter

for filter_str in ["blah.txt", "state_of_the_union.txt"]:
    search_filter = Filter.by_property("source").equal(filter_str)
    filtered_search_results = db.similarity_search(query, filters=search_filter)
    print(len(filtered_search_results))
    if filter_str == "state_of_the_union.txt":
        assert len(filtered_search_results) > 0  # There should be at least one result
    else:
        assert len(filtered_search_results) == 0  # There should be no results

0
4


It is also possible to provide `k`, which is the upper limit on the number of results to return.

In [8]:
search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3

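### Quantify result similarity

You can optionally retrieve a relevance "score". This is a relative score that indicates how good a particular search result is compared with the rest of the result pool.

Because the score is relative, it should not be used to set a relevance threshold. It can, however, be used to compare the relevance of different results within the same result set.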

In [9]:
docs = db.similarity_search_with_score("country", k=5)

for doc in docs:
    print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")

0.935 : For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to prot...
0.500 : And built the strongest, freest, and most prosperous nation the world has ever known. 

Now is the h...
0.462 : If you travel 20 miles east of Columbus, Ohio, you’ll find 1,000 empty acres of land. 

It won’t loo...
0.450 : And my report is this: the State of the Union is strong—because you, the American people, are strong...
0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
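
To see why a relative score makes a poor threshold, consider min-max normalization, which is one way such a score can arise. This is an assumption made for illustration; the exact formula Weaviate applies depends on its configuration and is not shown in this notebook.

```python
def min_max_normalize(raw_scores):
    """Rescale scores to [0, 1] relative to the best and worst score in this result set."""
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        return [1.0 for _ in raw_scores]
    return [(s - low) / (high - low) for s in raw_scores]


strong_pool = [0.9, 0.8, 0.7]  # hypothetical raw scores: every result is a close match
weak_pool = [0.3, 0.2, 0.1]    # hypothetical raw scores: every result is a poor match

strong_relative = min_max_normalize(strong_pool)
weak_relative = min_max_normalize(weak_pool)
```

Both pools produce essentially the same relative scores, so a cutoff such as "keep everything above 0.5" says nothing about absolute quality.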


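## Search mechanism

`similarity_search` uses Weaviate's [hybrid search](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid).

A hybrid search combines a vector search and a keyword search, with `alpha` as the weight of the vector search. The `similarity_search` function lets you pass additional arguments as kwargs. See this [reference doc](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid) for the available arguments.

So, you can perform a pure keyword search by adding `alpha=0`, as shown below: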

In [10]:
docs = db.similarity_search(query, alpha=0)
docs[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
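
The role of `alpha` can be sketched as a weighted blend of a vector score and a keyword score. The code below is a simplified model that assumes both scores are already normalized to the range 0 to 1; Weaviate's actual fusion step may combine results differently. The candidate names and scores are made up.

```python
def hybrid_score(vector_score, keyword_score, alpha=0.5):
    """Blend two scores already normalized to [0, 1]; alpha is the vector weight."""
    return alpha * vector_score + (1 - alpha) * keyword_score


def rank(candidates, alpha):
    """Sort (name, vector_score, keyword_score) tuples by blended score, best first."""
    ordered = sorted(
        candidates, key=lambda c: hybrid_score(c[1], c[2], alpha), reverse=True
    )
    return [name for name, _, _ in ordered]


candidates = [
    ("doc_a", 0.9, 0.1),  # semantically close, few matching keywords
    ("doc_b", 0.2, 1.0),  # many matching keywords, semantically distant
    ("doc_c", 0.6, 0.5),
]

keyword_only = rank(candidates, alpha=0)  # ['doc_b', 'doc_c', 'doc_a']
vector_only = rank(candidates, alpha=1)   # ['doc_a', 'doc_c', 'doc_b']
```

With `alpha=0` only the keyword score matters, and with `alpha=1` only the vector score does; values in between trade one off against the other.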

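## Persistence

Any data added through `langchain-weaviate` will persist in Weaviate according to its configuration.

WCS instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about [Weaviate's persistence](https://weaviate.io/developers/weaviate/configuration/persistence).

## Multi-tenancy

[Multi-tenancy](https://weaviate.io/developers/weaviate/concepts/data#multi-tenancy) allows you to have a high number of isolated collections of data, all sharing the same collection configuration, in a single Weaviate instance. This is useful for multi-user environments such as a SaaS app, where each end user has their own isolated data collection.

To use multi-tenancy, the vector store needs to be aware of the `tenant` parameter.

So, when adding any data, provide the `tenant` parameter as shown below.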

In [11]:
db_with_mt = WeaviateVectorStore.from_documents(
    docs, embeddings, client=weaviate_client, tenant="Foo"
)

2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.


When performing queries, provide the `tenant` parameter as well.

In [12]:
db_with_mt.similarity_search(query, tenant="Foo")

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),
 Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, ga
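
The reason `tenant` is needed on both writes and reads can be pictured with a toy in-memory store in which every operation is scoped to one tenant's partition. The class `ToyMultiTenantStore` is invented for this illustration and has nothing to do with how Weaviate stores data.

```python
from collections import defaultdict


class ToyMultiTenantStore:
    """In-memory stand-in for one collection whose data is partitioned by tenant."""

    def __init__(self):
        self._by_tenant = defaultdict(list)

    def add_texts(self, texts, tenant):
        """Append texts to the named tenant's partition only."""
        self._by_tenant[tenant].extend(texts)

    def search(self, keyword, tenant):
        """Case-insensitive substring search over the named tenant's partition only."""
        texts = self._by_tenant.get(tenant, [])
        return [t for t in texts if keyword.lower() in t.lower()]


store = ToyMultiTenantStore()
store.add_texts(["Quarterly report for Foo"], tenant="Foo")
store.add_texts(["Quarterly report for Bar"], tenant="Bar")
foo_hits = store.search("report", tenant="Foo")
```

A query for tenant `"Foo"` never sees the data that was added for tenant `"Bar"`, even though both live in the same store.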

## Retriever options

Weaviate can also be used as a retriever.

### Maximal marginal relevance search (MMR)

In addition to using `similarity_search` in the retriever object, you can also use `mmr`.

In [13]:
retriever = db.as_retriever(search_type="mmr")
retriever.invoke(query)[0]


Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
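
For intuition, MMR picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The following standard-library sketch shows that general selection rule on toy vectors. It is not the integration's own code, and the parameter name `lambda_mult` is used here only as the weight on relevance.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing is picked yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


toy_vectors = [
    [0.95, 0.10],   # 0: closest to the query
    [0.94, 0.12],   # 1: near-duplicate of 0
    [0.70, -0.60],  # 2: less relevant, but different from 0
]
toy_query = [1.0, 0.0]

diverse = mmr(toy_query, toy_vectors, k=2, lambda_mult=0.5)         # [0, 2]
relevance_only = mmr(toy_query, toy_vectors, k=2, lambda_mult=1.0)  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of a more diverse result, while `lambda_mult=1.0` reduces to plain relevance ranking.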

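# Use with LangChain

A known limitation of large language models (LLMs) is that their training data can be outdated, or may not include the specific domain knowledge that you require.

Take a look at the example below: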

In [14]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.invoke("What did the president say about Justice Breyer").content


"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021."

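Vector databases complement LLMs by providing a way to store and retrieve relevant information. This lets you combine the strengths of both: the reasoning and linguistic capabilities of LLMs, and the ability of vector databases to retrieve relevant information.

Two well-known applications that combine LLMs and vector databases are:
- Question answering
- Retrieval-augmented generation (RAG)

### Question answering with sources

Question answering in LangChain can be enhanced by the use of vector stores. Let's see how this can be done.

This section uses the `RetrievalQAWithSourcesChain`, which looks up documents from an index.

First, we will chunk the text again and import the chunks into the Weaviate vector store.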

In [15]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import OpenAI

In [16]:
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

In [17]:
docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)

Now we can construct the chain, with the retriever specified:

In [18]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
)


Then run the chain to ask the question:

In [19]:
chain.invoke(
    {"question": "What did the president say about Justice Breyer"},
    return_only_outputs=True,
)


{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\n',
 'sources': '31-pl'}

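### Retrieval-augmented generation

Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This technique uses a retriever to find relevant information in a vector store, and then uses an LLM to generate an output based on the retrieved data and a prompt.

We begin with a similar setup: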

In [20]:
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)

In [21]:
docsearch = WeaviateVectorStore.from_texts(
    texts,
    embeddings,
    client=weaviate_client,
    metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)

retriever = docsearch.as_retriever()

We need to construct a template for the RAG model so that the retrieved information can be populated into it.

In [22]:
from langchain_core.prompts import ChatPromptTemplate

template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)

print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\nAnswer:\n"))]


In [23]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

Running the cell below, we get an output very similar to the previous one.

In [24]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What did the president say about Justice Breyer")


"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court."

But note that since the template is yours to construct, you can customize it to your needs.
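
As a plain-Python illustration of what customizing the template means, the sketch below fills a modified template with a question and some retrieved text using `str.format`. The helper `build_prompt` and the wording of `custom_template` are hypothetical. In the notebook itself the prompt object handles this step, and the way retrieved documents are rendered into the context is simplified here.

```python
def build_prompt(template, question, retrieved_chunks):
    """Fill a RAG template with the question and the retrieved text chunks."""
    context = "\n\n".join(retrieved_chunks)
    return template.format(question=question, context=context)


# Hypothetical variation on the template: it also asks the model to quote its evidence.
custom_template = (
    "Answer the question using only the context below. "
    "Quote the sentence that supports your answer.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)

filled = build_prompt(
    custom_template,
    "What did the president say about Justice Breyer",
    ["Justice Breyer, thank you for your service."],
)
```

Any template works as long as it keeps the `{question}` and `{context}` placeholders that the chain fills in.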

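### Wrap-up and resources

Weaviate is a scalable, production-ready vector database.

This integration allows Weaviate to be used with LangChain to enhance the capabilities of large language models with a robust data store. Its scalability and production-readiness make it a strong choice as the vector database for your LangChain applications, and it can reduce your time to production.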