# LLM Sherpa

This notebook covers how to use `LLM Sherpa` to load files of many types. `LLM Sherpa` supports several file formats, including DOCX, PPTX, HTML, TXT, and XML.

`LLMSherpaFileLoader` uses `LayoutPDFReader`, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which most PDF-to-text parsers lose.

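Here are some key features of `LayoutPDFReader`:

* It can identify and extract sections and subsections along with their levels.
* It can combine lines to form paragraphs.
* It can identify links between sections and paragraphs.
* It can extract tables along with the section each table is found in.
* It can identify and extract lists and nested lists.
* It can join content spread across pages.
* It can remove repeating headers and footers.
* It can remove watermarks.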

See the [llmsherpa](https://llmsherpa.readthedocs.io/en/latest/) documentation for details.

`INFO: This library fails on some PDF files, so use it with caution.`

In [None]:
# Install package
# !pip install --upgrade --quiet llmsherpa

## LLMSherpaFileLoader

Under the hood, `LLMSherpaFileLoader` defines several strategies for loading file content: ["sections", "chunks", "html", "text"]. Set up [nlm-ingestor](https://github.com/nlmatics/nlm-ingestor) to obtain an `llmsherpa_api_url`, or use the default.
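A small pre-flight check can catch a mistyped strategy or a malformed URL before any network call is made. The following is a minimal sketch under stated assumptions: `check_loader_args` is a hypothetical helper, not part of the loader, and the requirement that the URL carry `renderFormat=all` is inferred only from the URLs used in this notebook.

```python
from urllib.parse import parse_qs, urlparse

# Strategy names taken from the list above.
SUPPORTED_STRATEGIES = ("sections", "chunks", "html", "text")


def check_loader_args(strategy: str, llmsherpa_api_url: str) -> None:
    """Hypothetical pre-flight check; raises ValueError on bad input."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {SUPPORTED_STRATEGIES}"
        )

    parsed = urlparse(llmsherpa_api_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute HTTP(S) URL: {llmsherpa_api_url!r}")

    # Assumption: the URLs in this notebook all pass renderFormat=all,
    # so the sketch treats a missing or different value as a mistake.
    query = parse_qs(parsed.query)
    if query.get("renderFormat") != ["all"]:
        raise ValueError("Expected 'renderFormat=all' in the query string")


# The self-hosted URL used throughout this notebook passes the check.
check_loader_args(
    "sections", "http://localhost:5010/api/parseDocument?renderFormat=all"
)
```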

### sections strategy: return the file parsed into sections

In [5]:
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="sections",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

In [6]:
docs[1]

Document(page_content='Abstract\nWe study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.\nThis underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing.\nWe propose STORM, a writing system for the Synthesis of Topic Outlines through\nReferences\nFull-length Article\nTopic\nOutline\n2022 Winter Olympics\nOpening Ceremony\nResearch via Question Asking\nRetrieval and Multi-perspective Question Asking.\nSTORM models the pre-writing stage by\nLLM\n(1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.\nFor evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia a

In [7]:
len(docs)

79
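With dozens of section documents, a quick outline helps to see what was parsed. The sketch below takes the first line of each `page_content` as a label. This is an assumption based on the sample output above, where the content begins with `Abstract` followed by a newline; it may not hold for every section. `section_outline` is a hypothetical helper, and the stand-in objects only mimic the `page_content` attribute so the example runs without a parsing server.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def section_outline(docs: Iterable[Any], max_width: int = 60) -> List[str]:
    """Hypothetical helper: return the first line of each document, truncated."""
    outline = []
    for doc in docs:
        first_line = doc.page_content.split("\n", 1)[0].strip()
        outline.append(first_line[:max_width])
    return outline


# Stand-in documents; with the real loader, pass the `docs` list instead.
example_docs = [
    SimpleNamespace(page_content="Abstract\nWe study how to apply large language models"),
    SimpleNamespace(page_content="Outline\n2022 Winter Olympics"),
]

for number, label in enumerate(section_outline(example_docs)):
    print(number, label)
```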

### chunks strategy: return the file parsed into chunks

In [8]:
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="chunks",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

In [9]:
docs[1]

Document(page_content='Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu', metadata={'source': 'https://arxiv.org/pdf/2402.14207.pdf', 'chunk_number': 1, 'chunk_type': 'para'})

In [10]:
len(docs)

306
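Each chunk carries a `chunk_type` entry in its metadata, as the sample output above shows with `'para'`. Tallying those values gives a rough picture of what the parser found. This is a minimal sketch: `count_chunk_types` is a hypothetical helper, the stand-in documents replace a real `loader.load()` result, and the `"table"` value is an assumed example of another type rather than something shown in this notebook.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_chunk_types(docs: Iterable[Any]) -> Counter:
    """Hypothetical helper: tally the 'chunk_type' metadata value of each document."""
    return Counter(doc.metadata.get("chunk_type", "unknown") for doc in docs)


# Stand-in documents with made-up content; "table" is an assumed chunk type.
example_docs = [
    SimpleNamespace(page_content="first paragraph", metadata={"chunk_number": 0, "chunk_type": "para"}),
    SimpleNamespace(page_content="second paragraph", metadata={"chunk_number": 1, "chunk_type": "para"}),
    SimpleNamespace(page_content="a table", metadata={"chunk_number": 2, "chunk_type": "table"}),
]

counts = count_chunk_types(example_docs)
print(counts.most_common())

# Keep only paragraph chunks, for example before embedding them.
paragraphs = [d for d in example_docs if d.metadata.get("chunk_type") == "para"]
print(len(paragraphs))
```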

### html strategy: return the file as one HTML document

In [10]:
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="html",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

In [12]:
docs[0].page_content[:400]

'<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td><td colSpan=1>Yucheng Jiang</td><td colSpan=1>Theodore A. Kanell</td><td colSpan=1>Peter Xu</td></th><tr><td colSpan=1></td><td colSpan=1>Omar Khattab</td><td colSpan=1>Monica S. Lam</td><td colSpan=1></td></tr></table><p>Stanford University {shaoyj, yuchengj, '

In [13]:
len(docs)

1
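Since the whole file comes back as one HTML string, a common next step is to pull structure out of it. The sketch below collects heading text with the standard library's `html.parser`. `HeadingCollector` and `extract_headings` are hypothetical names, and the sample string is a shortened copy of the output shown above; with the real loader, `docs[0].page_content` would be passed instead.

```python
from html.parser import HTMLParser
from typing import List, Optional


class HeadingCollector(HTMLParser):
    """Collect the text of h1..h6 elements in document order."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[str] = []
        self._buffer: Optional[List[str]] = None  # None while outside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._buffer is not None:
            self.headings.append("".join(self._buffer).strip())
            self._buffer = None


def extract_headings(html: str) -> List[str]:
    """Hypothetical helper: return all heading texts found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = (
    "<html><h1>Assisting in Writing Wikipedia-like Articles From Scratch "
    "with Large Language Models</h1><table><th><td colSpan=1>Yijia Shao</td>"
    "</th></table>"
)
print(extract_headings(sample))
```

The standard-library parser tolerates the unquoted attributes and unclosed tags in this output; a dedicated HTML library would be a reasonable swap if more than headings are needed.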

### text strategy: return the file as one text document

In [1]:
from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

loader = LLMSherpaFileLoader(
    file_path="https://arxiv.org/pdf/2402.14207.pdf",
    new_indent_parser=True,
    apply_ocr=True,
    strategy="text",
    llmsherpa_api_url="http://localhost:5010/api/parseDocument?renderFormat=all",
)
docs = loader.load()

In [3]:
docs[0].page_content[:400]

'Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n | --- | --- | --- | ---\n |  | Omar Khattab | Monica S. Lam | \n\nStanford University {shaoyj, yuchengj, tkanell, peterxu, okhattab}@stanford.edu lam@cs.stanford.edu\nAbstract\nWe study how to apply large language models to write grounded and organized long'

In [4]:
len(docs)

1
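In the text output above, the author table is rendered as pipe-delimited rows with a `---` separator row. The sketch below recovers those rows from plain text. It assumes that every table line starts with a pipe after optional whitespace, which matches the sample but is not a documented guarantee. `extract_table_rows` is a hypothetical helper, and the sample string is copied from the output above.

```python
from typing import List


def extract_table_rows(text: str) -> List[List[str]]:
    """Hypothetical helper: return the cells of each pipe-delimited row."""
    rows = []
    for line in text.split("\n"):
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue
        # Drop the leading pipe, then split the remainder into cells.
        cells = [cell.strip() for cell in stripped[1:].split("|")]
        # Skip the separator row, whose cells contain only dashes.
        if all(cell and set(cell) == {"-"} for cell in cells):
            continue
        rows.append(cells)
    return rows


sample = (
    "Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models\n"
    " | Yijia Shao | Yucheng Jiang | Theodore A. Kanell | Peter Xu\n"
    " | --- | --- | --- | ---\n"
    " |  | Omar Khattab | Monica S. Lam | \n"
    "\n"
    "Stanford University"
)

rows = extract_table_rows(sample)
# Flatten the rows and drop the empty padding cells.
names = [cell for row in rows for cell in row if cell]
print(names)
```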