# 06 Document Loaders
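Many LLM applications need user-specific data that is not part of the model's training set.

The primary way to supply it is Retrieval-Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM during the generation step.

LangChain provides all the building blocks for RAG applications, from simple to complex. This part of the documentation covers everything related to the retrieval step, such as fetching the data.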


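Use document loaders to load data from a source in the form of Document objects.

A Document is a piece of text and its associated metadata.

For example, there are document loaders for loading a simple .txt file, for loading the text content of any web page, and even for loading the transcript of a YouTube video.

Document loaders expose a "load" method that loads data as documents from a configured source. They can also optionally implement a "lazy load" so that data is brought into memory only when it is needed.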
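
The difference between an eager "load" and a "lazy load" can be sketched with the standard library alone. The `SimpleDocument` and `LineLoader` names below are hypothetical stand-ins invented for this illustration; they are not LangChain classes, and the sketch does not claim to mirror LangChain's implementation.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that produces one document per non-empty line."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Generator: a line is read and wrapped only when the caller asks for it.
        with open(self.file_path, encoding="utf-8") as handle:
            for number, line in enumerate(handle):
                text = line.strip()
                if text:
                    yield SimpleDocument(text, {"source": self.file_path, "line": number})

    def load(self) -> List[SimpleDocument]:
        # Eager variant: consume the generator and keep everything in a list.
        return list(self.lazy_load())


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("first line\n\nsecond line\n", encoding="utf-8")
    loader = LineLoader(str(path))

    docs = loader.load()        # all documents at once
    lazy = loader.lazy_load()   # nothing has been read yet
    first = next(lazy)          # reads only as far as the first document
    lazy.close()                # release the file handle

print(len(docs), first.page_content)
```

The eager call is convenient for small sources, while the generator keeps memory use flat when a source is large, because only one document exists at a time.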

In [1]:
from langchain.document_loaders import TextLoader

loader = TextLoader("documentstore/index.md")
loader.load()

[Document(page_content='## test index.md', metadata={'source': 'documentstore/index.md'})]

In [2]:
# Load CSV data, one document per row.
from langchain.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='documentstore/index.csv')
data = loader.load()

In [3]:
print(data)

[Document(page_content='title: red\ncontext: is a color', metadata={'source': 'documentstore/index.csv', 'row': 0}), Document(page_content='title: watermelon\ncontext: is a fruit', metadata={'source': 'documentstore/index.csv', 'row': 1}), Document(page_content='title: bike\ncontext: is a vehicle', metadata={'source': 'documentstore/index.csv', 'row': 2})]
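
The shape of the output above (one document per row, with `column: value` lines as content) can be reproduced with Python's `csv` module. This is only a sketch of the idea: `SAMPLE` is an assumed reconstruction of `index.csv` inferred from the printed output, `rows_to_documents` is a hypothetical helper, and plain dictionaries stand in for Document objects.

```python
import csv
import io

# Assumed file contents, inferred from the output above.
SAMPLE = "title,context\nred,is a color\nwatermelon,is a fruit\nbike,is a vehicle\n"


def rows_to_documents(text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Each column becomes a "name: value" line in the content.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": index}}
        )
    return documents


documents = rows_to_documents(SAMPLE, "documentstore/index.csv")
for document in documents:
    print(document)
```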


In [8]:
# Customize CSV parsing and loading
loader = CSVLoader(file_path='documentstore/index.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['title','context']
})

data = loader.load()
print(data)

[Document(page_content='title: title\ncontext: context', metadata={'source': 'documentstore/index.csv', 'row': 0}), Document(page_content='title: red\ncontext: is a color', metadata={'source': 'documentstore/index.csv', 'row': 1}), Document(page_content='title: watermelon\ncontext: is a fruit', metadata={'source': 'documentstore/index.csv', 'row': 2}), Document(page_content='title: bike\ncontext: is a vehicle', metadata={'source': 'documentstore/index.csv', 'row': 3})]
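
Note that the header line now shows up as row 0 (`title: title`). That matches how Python's `csv.DictReader` behaves when `fieldnames` is given explicitly: the first line is no longer consumed as a header and is treated as data. The sketch below shows that behaviour in isolation, assuming that the `csv_args` are forwarded to `csv.DictReader`; `SAMPLE` is a shortened, assumed copy of the file.

```python
import csv
import io

# Shortened, assumed copy of the CSV file: a header line plus one data row.
SAMPLE = "title,context\nred,is a color\n"

# Without fieldnames, the first line supplies the column names.
inferred = list(csv.DictReader(io.StringIO(SAMPLE)))

# With explicit fieldnames, the first line is returned as an ordinary row.
explicit = list(csv.DictReader(io.StringIO(SAMPLE), fieldnames=["title", "context"]))

print(len(inferred), len(explicit))
print(explicit[0])
```

If the file already has a header line, leaving `fieldnames` out avoids the extra header document.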


In [9]:
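# Use the source_column argument to specify the source for the document created from each row.
# Otherwise file_path is used as the source for all documents created from the CSV file.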
loader = CSVLoader(file_path='documentstore/index.csv', source_column="context")

data = loader.load()

print(data)

[Document(page_content='title: red\ncontext: is a color', metadata={'source': 'is a color', 'row': 0}), Document(page_content='title: watermelon\ncontext: is a fruit', metadata={'source': 'is a fruit', 'row': 1}), Document(page_content='title: bike\ncontext: is a vehicle', metadata={'source': 'is a vehicle', 'row': 2})]


In [14]:
! pip install unstructured[md]

Collecting markdown
  Downloading Markdown-3.4.4-py3-none-any.whl (94 kB)
     -------------------------------------- 94.2/94.2 kB 335.7 kB/s eta 0:00:00
Installing collected packages: markdown
Successfully installed markdown-3.4.4





In [5]:
# Load all documents from a folder
from langchain.document_loaders import DirectoryLoader
# The glob parameter controls which files are loaded. Note that here it does not load .rst or .html files.
loader = DirectoryLoader(r'D:\langchainstudy\langchain-robot\documentstore', glob='**/*.md')
docs = loader.load()
len(docs)

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


1
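
To see which files a pattern such as `**/*.md` selects, the same pattern can be tried with `pathlib` on a throwaway directory. This is a sketch under the assumption that the loader's `glob` argument follows `pathlib`-style semantics; the file layout below is hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: Markdown files at two depths plus other file types.
    for relative in ["index.md", "guide/intro.md", "guide/page.html", "notes.rst"]:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder", encoding="utf-8")

    # "**" matches zero or more directory levels, so both .md files are found.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))

print(matched)
```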

In [6]:
# Show a progress bar
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader(r'D:\langchainstudy\langchain-robot\documentstore', glob="**/*.md", show_progress=True)
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 160.05it/s]


In [7]:
# Use multithreading
loader = DirectoryLoader(r'D:\langchainstudy\langchain-robot\documentstore', glob="**/*.md", use_multithreading=True)
docs = loader.load()
len(docs)

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


1
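
The general idea behind loading files on several threads can be sketched with `concurrent.futures`. This illustrates the technique only and is not DirectoryLoader's actual implementation; `read_file`, the file names, and the worker count are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path):
    """Read one file and wrap it in a dict that stands in for a Document."""
    return {
        "page_content": path.read_text(encoding="utf-8"),
        "metadata": {"source": path.name},
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a few hypothetical Markdown files to load.
    for index in range(4):
        (root / f"doc{index}.md").write_text(f"# document {index}", encoding="utf-8")

    paths = sorted(root.glob("*.md"))
    # Each file is read on a worker thread; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        docs = list(pool.map(read_file, paths))

print(len(docs))
```

Threads help mostly when loading is dominated by file or network I/O; for a single small file, as in the cell above, the benefit is negligible.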

In [5]:
# Change the loader class
from langchain.document_loaders import DirectoryLoader, PythonLoader
loader = DirectoryLoader(r'D:\langchainstudy\langchain-robot', glob="*.py", loader_cls=PythonLoader)
docs = loader.load()
len(docs)

1

In [9]:
# Load HTML
from langchain.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("documentstore/fake-content.html")
data = loader.load()
data

[Document(page_content='html\n\ntest', metadata={'source': 'documentstore/fake-content.html'})]

In [10]:
# Load HTML with BeautifulSoup4
from langchain.document_loaders import BSHTMLLoader
loader = BSHTMLLoader("documentstore/fake-content.html")
data = loader.load()
data

[Document(page_content='\n\nhtml\ntest\n\n', metadata={'source': 'documentstore/fake-content.html', 'title': ''})]

In [1]:
from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint


file_path='./documentstore/examples.json'
data = json.loads(Path(file_path).read_text())
pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Hi! Im interested in your bag. Im offering $50. Let '
                          'me know if you are interested. Thanks!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675549022673}],
 'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
 'thread_path': 'inbox/User 1 and User 2 chat',
 'title': 'User 1 and User 2 chat'}
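
Given the structure printed above, one document per chat message can be built with plain Python. This sketch deliberately does not use JSONLoader; `data` is a trimmed copy of the printed structure, and the metadata keys (`sender_name`, `seq_num`) are choices made for this illustration only.

```python
# Trimmed copy of the structure printed above.
data = {
    "messages": [
        {
            "content": "Bye!",
            "sender_name": "User 2",
            "timestamp_ms": 1675597571851,
        },
        {
            "content": "Hi! Im interested in your bag. Im offering $50. Let "
                       "me know if you are interested. Thanks!",
            "sender_name": "User 1",
            "timestamp_ms": 1675549022673,
        },
    ],
    "title": "User 1 and User 2 chat",
}

# One dict per message: the text becomes the content, the rest goes into metadata.
documents = [
    {
        "page_content": message["content"],
        "metadata": {"sender_name": message["sender_name"], "seq_num": number},
    }
    for number, message in enumerate(data["messages"], start=1)
]

for document in documents:
    print(document)
```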
