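# 1.1 Text file loader `TextLoader`
Plain text files carry no font or layout formatting, which makes them the simplest type of text document. As mentioned in the introduction, LangChain provides a number of document loaders for reading text documents in various formats. They share a set of basic parameters and functionality (all of them inherit from `langchain_community.document_loaders.base.BaseLoader`), so the plain text loader `TextLoader` is a good place to start.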
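## Main parameters
- `file_path`: path to the file
- `encoding`: file encoding; defaults to `None`, in which case Python's default encoding for the platform is used (pass `'utf-8'` explicitly for UTF-8 files)
- `autodetect_encoding`: whether to detect the file encoding automatically when reading with the specified encoding fails; defaults to `False`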
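
To make the interplay between `encoding` and `autodetect_encoding` more concrete, here is a minimal standard-library sketch of the general idea: try the requested encoding first, then fall back to other encodings if decoding fails. The function name `read_text_with_fallback` and the fixed `candidates` list are assumptions made for illustration; the detection strategy `TextLoader` actually uses may differ.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(file_path, encoding=None,
                            candidates=("utf-8", "gbk", "latin-1")):
    """Read a text file, trying fallback encodings if the first attempt fails.

    `candidates` is a hypothetical fixed list used only for illustration;
    it is not how LangChain chooses encodings.
    """
    try:
        # First attempt: the encoding requested by the caller.
        return Path(file_path).read_text(encoding=encoding)
    except UnicodeDecodeError:
        # Fallback: try each candidate encoding in order.
        for candidate in candidates:
            try:
                return Path(file_path).read_text(encoding=candidate)
            except UnicodeDecodeError:
                continue
        # No candidate worked, so re-raise the original error.
        raise


# Demo: write a GBK-encoded file, then ask for UTF-8 and let the fallback recover.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_bytes("你好，世界".encode("gbk"))
    text = read_text_with_fallback(path, encoding="utf-8")
    print(text)
```

Without a fallback, the same read would raise `UnicodeDecodeError`, which is roughly the situation `autodetect_encoding=True` is meant to handle.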

## Example

In [2]:
import os
from pathlib import Path
from langchain_community.document_loaders import TextLoader

root = os.getcwd()
loader = TextLoader(os.path.join(root, 'data/paul_graham/paul_graham_essay.txt'))
data = loader.load()

print(data[0].dict().keys())
print(len(data[0].page_content))

dict_keys(['id', 'metadata', 'page_content', 'type'])
75014


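## Methods
1. `load()` and `aload()`: load the file and return a list of `Document` objects. `aload()` is the asynchronous version of `load()`.
2. `lazy_load()` and `alazy_load()`: lazy loading, i.e. documents are produced one at a time, only when the returned iterator is consumed. `alazy_load()` is the asynchronous version of `lazy_load()`.

Each `Document` returned by the methods above has the following structure: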
```python
Document(page_content=text, metadata={"source": str(self.file_path)})
```
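
The difference between eager and lazy loading can be sketched with the standard library alone. The `Document` dataclass and `SketchTextLoader` class below are hypothetical stand-ins written for illustration, not LangChain's implementation: `lazy_load()` is a generator that does no work until it is iterated, and `load()` simply collects its results into a list.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchTextLoader:
    """Illustrative loader only; not LangChain's TextLoader."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def lazy_load(self) -> Iterator[Document]:
        # Generator body: the file is not read until the first item is requested.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        yield Document(page_content=text,
                       metadata={"source": str(self.file_path)})

    def load(self) -> List[Document]:
        # Eager version: consume the generator and return a list.
        return list(self.lazy_load())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "essay.txt"
    path.write_text("hello loader", encoding="utf-8")

    loader = SketchTextLoader(path)
    lazy_docs = loader.lazy_load()  # nothing has been read yet
    first = next(lazy_docs)         # the file is read here
    docs = loader.load()            # reads the file and returns a list

    print(first.page_content)
    print(len(docs))
```

Lazy loading matters most for loaders that produce many documents (for example one per row or per page), since each document can be processed and discarded before the next is created.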

# 1.2 CSV file loader `CSVLoader`

In [3]:
import pandas as pd
df = pd.read_csv(os.path.join(root, 'data/weather_forecast_data.csv')) # credit to https://www.kaggle.com/datasets/zeeshier/weather-forecast-dataset
print(df.head())

   Temperature   Humidity  Wind_Speed  Cloud_Cover     Pressure     Rain
0    23.720338  89.592641    7.335604    50.501694  1032.378759     rain
1    27.879734  46.489704    5.952484     4.990053   992.614190  no rain
2    25.069084  83.072843    1.371992    14.855784  1007.231620  no rain
3    23.622080  74.367758    7.050551    67.255282   982.632013     rain
4    20.591370  96.858822    4.643921    47.676444   980.825142  no rain


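After `CSVLoader` reads the file, it produces one `Document` per row. Each `Document` contains:
- `page_content`: the row rendered as `column: value` pairs, one pair per line
- `metadata`: metadata, including the file path and the row number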
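## Main parameters
- `file_path`: path of the file to read
- `source_column`: name of the column whose value is used as the `source` in each document's metadata; if not set, the file path is used
- `metadata_columns`: columns whose values are stored in the metadata
- `csv_args`: a dictionary of arguments passed to `csv.DictReader`
- `encoding`: file encoding
- `autodetect_encoding`: whether to detect the file encoding automatically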
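
The row-to-document conversion described above can be approximated with `csv.DictReader` directly. This is a rough sketch under stated assumptions: the `Document` dataclass and the `rows_to_documents` helper are hypothetical, the sample rows are made up for illustration, and the real `CSVLoader` may differ in details such as whitespace handling and error checking.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_file, source, source_column=None,
                      metadata_columns=(), csv_args=None):
    """Turn each CSV row into one Document (illustrative helper)."""
    reader = csv.DictReader(csv_file, **(csv_args or {}))
    docs = []
    for i, row in enumerate(reader):
        # page_content: "column: value" lines, skipping metadata columns.
        content = "\n".join(
            f"{key}: {value}" for key, value in row.items()
            if key not in metadata_columns
        )
        # metadata: source (file path or a chosen column) plus the row index.
        metadata = {
            "source": row[source_column] if source_column else source,
            "row": i,
        }
        for column in metadata_columns:
            metadata[column] = row[column]
        docs.append(Document(page_content=content, metadata=metadata))
    return docs


# Made-up sample data shaped like the weather dataset.
sample = io.StringIO(
    "Temperature,Humidity,Rain\n"
    "23.7,89.6,rain\n"
    "27.9,46.5,no rain\n"
)
docs = rows_to_documents(sample, source="weather.csv",
                         metadata_columns=("Rain",))
print(docs[0].page_content)
print(docs[0].metadata)
```

Moving a label column such as `Rain` into the metadata keeps it out of the text that gets embedded while leaving it available for filtering.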
## Example

In [4]:
from langchain_community.document_loaders import CSVLoader

docs = CSVLoader(os.path.join(root, 'data/weather_forecast_data.csv')).load()
print(docs[0])

page_content='Temperature: 23.720337598183118
Humidity: 89.59264065174611
Wind_Speed: 7.335604391040214
Cloud_Cover: 50.501693832913155
Pressure: 1032.378758690279
Rain: rain' metadata={'source': '/Users/wenjiazhai/Documents/GitHub/RAG_zero_to_hero/data/weather_forecast_data.csv', 'row': 0}


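## Methods
1. `load()` and `aload()`: load the file and return a list of `Document` objects, one per row. `aload()` is the asynchronous version of `load()`.
2. `lazy_load()` and `alazy_load()`: lazy loading, i.e. documents are produced one at a time, only when the returned iterator is consumed. `alazy_load()` is the asynchronous version of `lazy_load()`.

Each `Document` returned by the methods above has the following structure: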
```python
Document(page_content=text, metadata={"source": str(self.file_path), "row": row_num})
```

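# 1.3 JSON file loader `JSONLoader`
## Main parameters
- `file_path`: path to the file
- `jq_schema`: the jq schema used to extract data or text from the JSON
- `content_key`: the key whose value is extracted from the JSON as the text content
- `text_content`: boolean flag indicating whether the content is in string format; defaults to `True`
- `json_lines`: boolean flag indicating whether the input is in JSON Lines format; defaults to `False`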
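
How `text_content`, `content_key`, and `json_lines` interact can be illustrated without jq. The sketch below assumes the identity schema `.` (jq itself is not evaluated), and the helpers `iter_records` and `to_page_content` are hypothetical names written for this example, not part of LangChain. The assumed behaviour is that string content is used as-is, non-string content is rejected when `text_content=True`, and dicts or lists are serialized back to JSON when `text_content=False`.

```python
import json


def iter_records(raw, json_lines=False):
    """Yield parsed records: one per line for JSON Lines, else the whole document."""
    if json_lines:
        for line in raw.splitlines():
            if line.strip():
                yield json.loads(line)
    else:
        yield json.loads(raw)


def to_page_content(content, text_content=True):
    """Convert extracted content into a page_content string (illustrative)."""
    if text_content and not isinstance(content, str):
        raise ValueError("expected string content; pass text_content=False")
    if isinstance(content, str):
        return content
    if isinstance(content, (dict, list)):
        # Non-string content is serialized back to a JSON string.
        return json.dumps(content) if content else ""
    return str(content)


# Whole-document case: the content is a dict, so text_content must be False.
raw = '{"conversations": [["Hello", "Hi"]]}'
page = to_page_content(next(iter_records(raw)), text_content=False)
print(page)

# JSON Lines case: one record per line, with the text taken from the "text" key
# (playing the role of content_key).
lines = '{"text": "a"}\n{"text": "b"}\n'
contents = [to_page_content(record["text"])
            for record in iter_records(lines, json_lines=True)]
print(contents)
```

This is why a whole-document schema like `.` is typically paired with `text_content=False`: the selected content is a dict rather than a string.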
## Example

In [5]:
pip install jq -Uq

Note: you may need to restart the kernel to use updated packages.


In [6]:
from langchain_community.document_loaders import JSONLoader

docs = JSONLoader(os.path.join(root, 'data/conversation.json'), jq_schema='.', text_content=False).load() # https://www.kaggle.com/datasets/vaibhavgeek/conversation-json
print(docs[0])

page_content='{"conversations": [["Good morning, how are you?", "I am doing well, how about you?", "I'm also good.", "That's good to hear.", "Yes it is."], ["Hello", "Hi", "How are you doing?", "I am doing well.", "That is good to hear", "Yes it is.", "Can I help you with anything?", "Yes, I have a question.", "What is your question?", "Could I borrow a cup of sugar?", "I'm sorry, but I don't have any.", "Thank you anyway", "No problem"], ["How are you doing?", "I am doing well, how about you?", "I am also good.", "That's good."], ["Have you heard the news?", "What good news?"], ["What is your favorite book?", "I can't read.", "So what's your favorite color?", "Blue"], ["Who are you?", "Who? Who is but a form following the function of what", "What are you then?", "A man in a mask.", "I can see that.", "It's not your powers of observation I doubt, but merely the paradoxical nature of asking a masked man who is. But tell me, do you like music?", "I like seeing movies.", "What kind of mov

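# 1.4 PDF document loader
Because PDF files are so widely used, there are many solutions for loading them; only `PyPDFLoader` is covered here. For other loaders, see https://python.langchain.com/docs/integrations/document_loaders/.
## Main parameters
- `file_path`: path to the file
- `password`: the password for opening the PDF file, if one is required
- `extract_images`: whether to extract images; defaults to `False`
## Example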

In [7]:
!pip install pypdf

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple


In [8]:
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader(os.path.join(root, 'data/Qwen2.pdf')).load()
print(docs[0])

page_content='QWEN 2 TECHNICAL REPORT
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang,
Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren
Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei
Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie
Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng,
Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan,
Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang
Guo, and Zhihao Fan
Qwen Team, Alibaba Group∗
ABSTRACT
This report introduces the Qwen2 series, the latest addition to our large lan-
guage models and large multimodal models. We release a comprehensive suite of
foundational and instruc

Beyond the loaders covered above, LangChain provides dozens of loaders for other document types; see https://python.langchain.com/docs/integrations/document_loaders/ if you need one.