# ArxivRetriever

>[arXiv](https://arxiv.org/) 是一个开放获取的档案库，收录了物理学、数学、计算机科学、定量生物学、定量金融学、统计学、电气工程与系统科学以及经济学领域的 200 万篇学术文章。

本笔记本展示了如何从 Arxiv.org 中检索科学文章并将其转换为下游使用的 [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) 格式。

有关 `ArxivRetriever` 所有功能和配置的详细文档，请访问 [API 参考](https://python.langchain.com/api_reference/community/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html)。

### 集成详情

import {ItemTable} from "@theme/FeatureTables";

<ItemTable category="external_retrievers" item="ArxivRetriever" />

## 设置

如果您想从单个查询中获得自动跟踪，还可以通过取消注释下面的行来设置您的 [LangSmith](https://docs.smith.langchain.com/) API 密钥：

In [None]:
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### 安装

此检索器位于 `langchain-community` 包中。我们还需要 [arxiv](https://pypi.org/project/arxiv/) 依赖项：

In [None]:
%pip install -qU langchain-community arxiv

## 实例化

`ArxivRetriever` 参数包括：
- 可选 `load_max_docs`: 默认为 100。用于限制下载文档的数量。下载全部 100 篇文档需要时间，因此请在实验中使用较小的数字。目前有一个硬性限制是 300。
- 可选 `load_all_available_meta`: 默认为 False。默认情况下，只下载最重要的字段：`Published` (文档发布/最后更新日期)、`Title`、`Authors`、`Summary`。如果设置为 True，则也会下载其他字段。
- `get_full_documents`: 布尔值，默认为 False。决定是否获取文档的全文。

更多详情请参阅 [API 参考](https://python.langchain.com/api_reference/community/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html)。

In [1]:
from langchain_community.retrievers import ArxivRetriever

retriever = ArxivRetriever(
    load_max_docs=2,
    get_ful_documents=True,
)

## 用法

`ArxivRetriever` 支持通过文章标识符进行检索：

In [2]:
docs = retriever.invoke("1605.08386")

In [3]:
docs[0].metadata  # meta-information of the Document

{'Entry ID': 'http://arxiv.org/abs/1605.08386v1',
 'Published': datetime.date(2016, 5, 26),
 'Title': 'Heat-bath random walks with Markov bases',
 'Authors': 'Caprice Stanley, Tobias Windisch'}

In [4]:
docs[0].page_content[:400]  # a content of the Document

'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a ge'

`ArxivRetriever` 还支持基于自然语言文本的检索：

In [5]:
docs = retriever.invoke("What is the ImageBind model?")

In [6]:
docs[0].metadata

{'Entry ID': 'http://arxiv.org/abs/2305.05665v2',
 'Published': datetime.date(2023, 5, 31),
 'Title': 'ImageBind: One Embedding Space To Bind Them All',
 'Authors': 'Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra'}

## 在链中使用

与其他检索器一样，`ArxivRetriever` 可以通过[链](/docs/how_to/sequence/)集成到 LLM 应用中。

我们将需要一个 LLM 或聊天模型：

import ChatModelTabs from "@theme/ChatModelTabs";

<ChatModelTabs customVarName="llm" />

In [7]:
# | output: false
# | echo: false

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

In [8]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [9]:
chain.invoke("What is the ImageBind model?")

'The ImageBind model is an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It shows that only image-paired data is sufficient to bind the modalities together and can leverage large scale vision-language models for zero-shot capabilities and emergent applications such as cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.'

## API 参考

如需了解 `ArxivRetriever` 的所有功能和配置的详细文档，请前往 [API 参考](https://python.langchain.com/api_reference/community/retrievers/langchain_community.retrievers.arxiv.ArxivRetriever.html)。