<a href="https://colab.research.google.com/github/wojiushigexiaobai/Myblog/blob/main/docs/docs/tutorials/retrievers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a semantic search engine

This tutorial will familiarize you with LangChain's [document loader](/docs/concepts/document_loaders), [embedding](/docs/concepts/embedding_models), and [vector store](/docs/concepts/vectorstores) abstractions. These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or [RAG](/docs/concepts/rag) (see our RAG tutorial [here](/docs/tutorials/rag)).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

## Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

- Documents and document loaders;
- Text splitters;
- Embeddings;
- Vector stores and retrievers.

## Setup

### Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See [here](https://jupyter.org/install) for instructions on how to install it.

### Installation

This tutorial requires the `langchain-community` and `pypdf` packages:

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from "@theme/CodeBlock";

<Tabs>
  <TabItem value="pip" label="Pip" default>
    <CodeBlock language="bash">pip install langchain-community pypdf</CodeBlock>
  </TabItem>
  <TabItem value="conda" label="Conda">
    <CodeBlock language="bash">conda install langchain-community pypdf -c conda-forge</CodeBlock>
  </TabItem>
</Tabs>


For more details, see our [Installation guide](/docs/how_to/installation).

### LangSmith

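Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications become more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with [LangSmith](https://smith.langchain.com).

After you sign up at the link above, make sure to set your environment variables to start logging traces: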

```shell
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
```

Or, if in a notebook, you can set them with:

```python
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
```


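## Documents and Document Loaders

LangChain implements a [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata;
- `id`: (optional) a string identifier for the document.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual `Document` object often represents a chunk of a larger document.

We can generate sample documents when desired: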
```python
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
```

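When running in Google Colab, the LangSmith API key can instead be read from Colab's stored secrets:

```python
import os

from google.colab import userdata  # only available inside Google Colab

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = userdata.get("LANGSMITH_API_KEY")
```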

Rather than constructing `Document` objects by hand, we can rely on the LangChain ecosystem, which implements [document loaders](/docs/concepts/document_loaders) that [integrate with hundreds of common sources](/docs/integrations/document_loaders/). This makes it easy to incorporate data from these sources into your AI application.

### Loading documents

Let's load a PDF into a sequence of `Document` objects. There is a sample PDF in the LangChain repo [here](https://github.com/langchain-ai/langchain/tree/master/docs/docs/example_data): a 10-K filing for Nike from 2023. We can consult the LangChain documentation for [available PDF document loaders](/docs/integrations/document_loaders/#pdfs). Let's select [PyPDFLoader](/docs/integrations/document_loaders/pypdfloader/), which is fairly lightweight.

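```python
from langchain_community.document_loaders import PyPDFLoader

# Path to the sample PDF as used in this Colab run; adjust it for your environment.
file_path = "/content/sample_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

# Load the PDF; the result is a list of Document objects.
docs = loader.load()

print(len(docs))
```

```text
107
```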


:::tip

See [this guide](/docs/how_to/document_loader_pdf/) for more detail on PDF document loaders.

:::

`PyPDFLoader` loads one `Document` object per PDF page. For each, we can easily access:

- The string content of the page;
- Metadata containing the file name and page number.

```python
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
```

```text
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '/content/sample_data/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}
```


### Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve `Document` objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We can use [text splitters](/docs/concepts/text_splitters) for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the [RecursiveCharacterTextSplitter](/docs/how_to/recursive_text_splitter), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index where each split `Document` starts within the initial `Document` is preserved as the metadata attribute `start_index`.

See [this guide](/docs/how_to/document_loader_pdf/) for more detail about working with PDFs, including how to extract text from specific sections and images.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)
```

```text
516
```
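To make the roles of chunk size, overlap, and start index concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification and not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as new lines. The helper name `split_with_overlap` and the synthetic sample text are assumptions made for illustration.

```python
from typing import Dict, List, Union


def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[Dict[str, Union[int, str]]]:
    """Naive fixed-window splitter (hypothetical helper, not LangChain's algorithm)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"start_index": start, "page_content": text[start : start + chunk_size]}
        )
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2500-character "document" so the window arithmetic is easy to follow.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
for chunk in chunks:
    print(chunk["start_index"], len(chunk["page_content"]))
```

With a 1000-character window and 200 characters of overlap, each window advances by 800 characters, so the last 200 characters of one chunk are repeated at the start of the next.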

## Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can [embed](/docs/concepts/embedding_models) it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
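As a rough illustration of the similarity step, the sketch below computes cosine similarity between a query vector and two stored vectors and ranks the stored items by it. The three-dimensional vectors and the names `close` and `far` are made up for the example; real embedding vectors have hundreds or thousands of dimensions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embedded texts.
query = [1.0, 0.0, 1.0]
texts = {"close": [0.9, 0.1, 1.1], "far": [-1.0, 1.0, 0.0]}

# Rank stored items from most to least similar to the query.
ranked = sorted(
    texts, key=lambda name: cosine_similarity(query, texts[name]), reverse=True
)
print(ranked)
```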

LangChain supports embeddings from [dozens of providers](/docs/integrations/text_embedding/). These models specify how text should be converted into a numeric vector. Let's select a model:

import EmbeddingTabs from "@theme/EmbeddingTabs";

<EmbeddingTabs customVarName="embeddings" />

```shell
pip install -qU langchain-huggingface
```

```python
# Use the embeddings provided by Hugging Face: they are free and commonly used.
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
```

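Alternatively, OpenAI embeddings can be used. This cell requires an OpenAI API key and was not executed in this run; the outputs below come from the Hugging Face model above.

```python
import getpass
import os

# An OpenAI API key is required.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
```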

```python
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
```

```text
Generated vectors of length 768

[0.04747221991419792, 0.02167578786611557, -0.009018078446388245, 0.005356694106012583, 0.025557683780789375, -0.010230249725282192, -0.008414017967879772, 0.039303917437791824, 0.021570557728409767, -0.02409544587135315]
```


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

## Vector stores

LangChain [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) objects contain methods for adding text and `Document` objects to the store, and querying them using various similarity metrics. They are often initialized with [embedding](/docs/how_to/embed_text) models, which determine how text data is translated to numeric vectors.
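The shape of that interface can be sketched with a toy store in plain Python: it is initialized with an embedding function, stores one vector per added text, and answers queries by ranking on cosine similarity. Everything here is hypothetical and for illustration only: `ToyVectorStore` is not a LangChain class, and `toy_embed` (a letter-frequency vector) is a crude stand-in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: a 26-dim letter-frequency vector."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Minimal in-memory store: keeps (vector, text) pairs and ranks by cosine similarity."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.rows: List[Tuple[List[float], str]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["zzz zzz", "distribution centers", "revenue"])
print(store.similarity_search("distribution center", k=1))
```

A production store differs mainly in using a real embedding model and an index that avoids scanning every stored vector on each query.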

LangChain includes a suite of [integrations](/docs/integrations/vectorstores) with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as [Postgres](/docs/integrations/vectorstores/pgvector)) run in separate infrastructure that can be run locally or via a third party; others can run in-memory for lightweight workloads. Let's select a vector store:

import VectorStoreTabs from "@theme/VectorStoreTabs";

<VectorStoreTabs/>

```shell
pip install -qU langchain-core
```

```python
from langchain_core.vectorstores import InMemoryVectorStore

# Stored in memory
vector_store = InMemoryVectorStore(embeddings)
```

Alternatively, to store vectors in MongoDB Atlas (this cell was not executed in this run):

```shell
pip install -qU langchain-mongodb
```

```python
from langchain_mongodb import MongoDBAtlasVectorSearch

# Stored in a MongoDB database.
# MONGODB_COLLECTION and ATLAS_VECTOR_SEARCH_INDEX_NAME are not defined in this
# notebook; they must be set up for your own Atlas deployment beforehand.
vector_store = MongoDBAtlasVectorSearch(
    embedding=embeddings,
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn="cosine",
)
```

Or, to store vectors in Chroma:

```shell
pip install -qU langchain-chroma
```

```python
from langchain_chroma import Chroma

# Stored in Chroma
vector_store = Chroma(embedding_function=embeddings)
```

Having instantiated our vector store, we can now index the documents.

```python
ids = vector_store.add_documents(documents=all_splits)
```

Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific [integration](/docs/integrations/vectorstores) for more detail.

Once we've instantiated a `VectorStore` that contains documents, we can query it. [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) includes methods for querying:

- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and [maximum marginal relevance](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html#langchain_core.vectorstores.base.VectorStore.max_marginal_relevance_search) (to balance similarity to the query with diversity in the retrieved results).
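The trade-off behind maximum marginal relevance can be sketched as a greedy selection over pre-computed vectors: each step picks the candidate that is relevant to the query but not redundant with what has already been selected. This is an illustrative plain-Python version, not LangChain's implementation; the parameter name `lambda_mult` mirrors the one used by `max_marginal_relevance_search`, and the two-dimensional vectors are made up.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query_vec: List[float],
    doc_vecs: List[List[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedy MMR: returns indices of selected documents, in selection order."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different.
docs = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]

print(mmr(query, docs, k=2))                   # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure similarity ranking
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more diverse document, while `lambda_mult=1.0` reduces to plain similarity ranking.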

The methods will generally include a list of [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects in their outputs.

### Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key terms used in the document.

Return documents based on similarity to a string query:

```python
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])
```

```text
page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'author': 'EDGAR Online, a division of Donnelley Fi
```

Async query:

```python
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])
```

```text
page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'creationdate': '2023-07-
```

Return scores:

```python
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
```

```text
Score: 0.3725222945213318

page_content='Table of Contents
YEAR ENDED MAY 31,
(Dollars in millions) 2023 2022 2021
REVENUES
North America $ 21,608 $ 18,353 $ 17,179 
Europe, Middle East & Africa 13,418 12,479 11,456 
Greater China 7,248 7,547 8,290 
Asia Pacific & Latin America 6,431 5,955 5,343 
Global Brand Divisions 58 102 25 
Total NIKE Brand 48,763 44,436 42,293 
Converse 2,427 2,346 2,205 
Corporate 27 (72) 40 
TOTAL NIKE, INC. REVENUES $ 51,217 $ 46,710 $ 44,538 
EARNINGS BEFORE INTEREST AND TAXES
North America $ 5,454 $ 5,114 $ 5,089 
Europe, Middle East & Africa 3,531 3,293 2,435 
Greater China 2,283 2,365 3,243 
Asia Pacific & Latin America 1,932 1,896 1,530 
Global Brand Divisions (4,841) (4,262) (3,656)
Converse 676 669 543 
Corporate (2,840) (2,219) (2,261)
Interest expense (income), net (6) 205 262 
TOTAL NIKE, INC. INCOME BEFORE INCOME TAXES $ 6,201 $ 6,651 $ 6,661 
ADDITIONS TO PROPERTY, PLANT AND EQUIPMENT
North America $ 283 $ 146 $ 98 
Europe, Middle East & Africa 21
```
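To see why a distance-type score varies inversely with similarity, the sketch below compares cosine similarity with Euclidean distance on toy unit vectors. Which metric a given store actually reports is provider-specific, so this only illustrates the general relationship; the vectors and the `normalize` helper are assumptions made for the example.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vectors: "near" points almost the same way as the query.
query = normalize([1.0, 0.0])
near = normalize([0.9, 0.1])
far = normalize([0.1, 0.9])

for name, vec in [("near", near), ("far", far)]:
    similarity = cosine_similarity(query, vec)
    distance = euclidean_distance(query, vec)
    print(f"{name}: similarity={similarity:.3f} distance={distance:.3f}")
```

For unit-length vectors the two are tied together by $d^2 = 2 - 2\cos\theta$, so a smaller distance always corresponds to a higher cosine similarity.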

Return documents based on similarity to an embedded query:

```python
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
```

```text
page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
• Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
• Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
• Unfavorable changes in net foreign currency exchange rates, including hedges; and
• Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:'
```

Learn more:

- [API reference](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html)
- [How-to guide](/docs/how_to/vectorstores)
- [Integration-specific docs](/docs/integrations/vectorstores)

## Retrievers

LangChain `VectorStore` objects do not subclass [Runnable](https://python.langchain.com/api_reference/core/index.html#langchain-core-runnables). LangChain [Retrievers](https://python.langchain.com/api_reference/core/index.html#langchain-core-retrievers) are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous `invoke` and `batch` operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data (such as external APIs).
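The idea of a standard method set can be illustrated without LangChain at all. The sketch below wraps an arbitrary lookup function in a small class exposing `invoke` and `batch`; `ToyRunnable` and `keyword_lookup` are hypothetical names, this is not how `Runnable` is implemented, and the data source is a plain dictionary rather than a vector store, to stress that a retriever need not be backed by one.

```python
from typing import Callable, Dict, List


class ToyRunnable:
    """Hypothetical stand-in that mimics the invoke/batch shape; not LangChain's Runnable."""

    def __init__(self, func: Callable[[str], List[str]]):
        self.func = func

    def invoke(self, query: str) -> List[str]:
        return self.func(query)

    def batch(self, queries: List[str]) -> List[List[str]]:
        # Simplest possible batching: run the queries one after another.
        return [self.invoke(query) for query in queries]


# A non-vector data source: facts keyed by a trigger keyword.
FACTS: Dict[str, str] = {
    "incorporated": "NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon.",
    "distribution": "In the United States, NIKE has eight significant distribution centers.",
}


def keyword_lookup(query: str) -> List[str]:
    """Return every fact whose keyword appears in the query."""
    return [text for keyword, text in FACTS.items() if keyword in query.lower()]


toy_retriever = ToyRunnable(keyword_lookup)
answers = toy_retriever.batch(
    [
        "When was Nike incorporated?",
        "How many distribution centers does Nike have in the US?",
    ]
)
print(answers)
```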

We can create a simple version of this ourselves, without subclassing `Retriever`. If we choose what method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the `similarity_search` method:

```python
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


# The @chain decorator turns this plain function into a Runnable.
@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
```

```text
[[Document(id='32877fe3-d585-49c3-b113-331bfbeb18b3', metadata={'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'creationdate': '2023-07-20T16:22:00-04:00', 'creator': 'EDGAR Filing HTML Converter', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'page': 26, 'page_label': '27', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'source': '/content/sample_data/nke-10k-2023.pdf', 'start_index': 804, 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'title': '0000320187-23-000039', 'total_pages': 107}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are own
```

Vector stores implement an `as_retriever` method that will generate a Retriever, specifically a [VectorStoreRetriever](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStoreRetriever.html). These retrievers include specific `search_type` and `search_kwargs` attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

```python
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
```

```text
[[Document(id='32877fe3-d585-49c3-b113-331bfbeb18b3', metadata={'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'creationdate': '2023-07-20T16:22:00-04:00', 'creator': 'EDGAR Filing HTML Converter', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'page': 26, 'page_label': '27', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'source': '/content/sample_data/nke-10k-2023.pdf', 'start_index': 804, 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'title': '0000320187-23-000039', 'total_pages': 107}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are own
```

`VectorStoreRetriever` supports search types of `"similarity"` (default), `"mmr"` (maximum marginal relevance, described above), and `"similarity_score_threshold"`. We can use the last of these to filter the documents returned by the retriever by similarity score.
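Conceptually, score thresholding just drops results whose score falls below a cutoff. The plain-Python sketch below assumes scores where higher means more similar; the helper `filter_by_score`, the chunk labels, and the score values are all made up for illustration and do not come from the run above.

```python
from typing import List, Tuple


def filter_by_score(
    scored: List[Tuple[str, float]], score_threshold: float
) -> List[Tuple[str, float]]:
    """Keep (document, score) pairs at or above the threshold, best first."""
    kept = [(doc, score) for doc, score in scored if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical search results with relevance scores in [0, 1].
scored_results = [
    ("chunk about revenue", 0.82),
    ("chunk about footwear", 0.41),
    ("chunk about leases", 0.67),
]

print(filter_by_score(scored_results, score_threshold=0.5))
```

Unlike a fixed `k`, a threshold can return fewer results, or none, when nothing in the store is sufficiently close to the query.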

Retrievers can easily be incorporated into more complex applications, such as [retrieval-augmented generation (RAG)](/docs/concepts/rag) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the [RAG tutorial](/docs/tutorials/rag).

### Learn more:

Retrieval strategies can be rich and complex. For example:

- We can [infer hard rules and filters](/docs/how_to/self_query/) from a query (e.g., "using documents published after 2020");
- We can [return documents that are linked](/docs/how_to/parent_document_retriever/) to the retrieved context in some way (e.g., via some document taxonomy);
- We can generate [multiple embeddings](/docs/how_to/multi_vector) for each unit of context;
- We can [ensemble results](/docs/how_to/ensemble_retriever) from multiple retrievers;
- We can assign weights to documents, e.g., to weigh [recent documents](/docs/how_to/time_weighted_vectorstore/) higher.
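As one concrete example from the list above, results from several retrievers can be combined by fusing their ranked lists. The sketch below uses reciprocal rank fusion, a common choice for this; it is shown as an assumption about how such ensembling can work, not as a description of what the linked guide does. The document ids and the constant `c=60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Fuse ranked lists: each appearance contributes 1 / (c + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked results from two different retrievers.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of several lists rise in the fused ranking, even if no single retriever placed them first.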

The [retrievers](/docs/how_to#retrievers) section of the how-to guides covers these and other built-in retrieval strategies.

It is also straightforward to extend the [BaseRetriever](https://python.langchain.com/api_reference/core/retrievers/langchain_core.retrievers.BaseRetriever.html) class in order to implement custom retrievers. See our how-to guide [here](/docs/how_to/custom_retriever).

## Next steps

You've now seen how to build a semantic search engine over a PDF document.

For more on document loaders:

- [Conceptual guide](/docs/concepts/document_loaders)
- [How-to guides](/docs/how_to/#document-loaders)
- [Available integrations](/docs/integrations/document_loaders/)

For more on embeddings:

- [Conceptual guide](/docs/concepts/embedding_models/)
- [How-to guides](/docs/how_to/#embedding-models)
- [Available integrations](/docs/integrations/text_embedding/)

For more on vector stores:

- [Conceptual guide](/docs/concepts/vectorstores/)
- [How-to guides](/docs/how_to/#vector-stores)
- [Available integrations](/docs/integrations/vectorstores/)

For more on RAG, see:

- [Build a Retrieval Augmented Generation (RAG) App](/docs/tutorials/rag/)
- [Related how-to guides](/docs/how_to/#qa-with-rag)