## 1. CSV
A CSV file is a plain-text file whose values are separated by commas; each line holds one record.
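
As a quick illustration of the format itself, the sketch below parses a small in-memory CSV with Python's standard `csv` module. The `sample` string is a hypothetical excerpt that only mirrors a few columns of the example file; it is not the file itself.

```python
import csv
import io

# Hypothetical in-memory CSV that mirrors some columns of the example file.
sample = (
    "Book ID,Title,Author\n"
    "1,The Great Adventure,Alex Reed\n"
    "2,Mystery of the Ancients,Maria Lopez\n"
)

# DictReader uses the first line as the header and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(sample)))

print(len(rows))         # number of data records (the header is not counted)
print(rows[0]["Title"])  # value of the "Title" column in the first record
```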

In [6]:
import pandas as pd

df = pd.read_csv("./example data/2.1 Document Loader.csv")
df.head()

Unnamed: 0,Book ID,Title,Author,Publication Year,Genre
0,1,The Great Adventure,Alex Reed,2010,Adventure
1,2,Mystery of the Ancients,Maria Lopez,2015,Mystery
2,3,The Lost City,John Doe,2018,Fantasy
3,4,Journey Through Time,Emily Clarke,2020,Science Fiction
4,5,Secrets of the Ocean,David Smith,2021,Documentary


In [22]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="./example data/2.1 Document Loader.csv")
data = loader.load()

In [23]:
# load() returns a list of Documents, one per CSV row
# By default, metadata['source'] is the file path and metadata['row'] is the row index
print(data[0])

page_content='Book ID: 1\nTitle: The Great Adventure\nAuthor: Alex Reed\nPublication Year: 2010\nGenre: Adventure' metadata={'source': './example data/2.1 Document Loader.csv', 'row': 0}


### Using a column as the source

In [24]:
loader = CSVLoader(file_path="./example data/2.1 Document Loader.csv",
                   source_column="Title")
data = loader.load()
print(data[0])

page_content='Book ID: 1\nTitle: The Great Adventure\nAuthor: Alex Reed\nPublication Year: 2010\nGenre: Adventure' metadata={'source': 'The Great Adventure', 'row': 0}
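
To make the two outputs above easier to compare, here is a minimal standard-library sketch of the row-to-Document mapping they suggest. It is not the `CSVLoader` implementation: `rows_to_documents`, the plain-dict "document", and the `books.csv` path are all hypothetical names chosen for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, file_path, source_column=None):
    """Illustrative only: map each CSV row to a dict shaped like a Document."""
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Every column becomes a "key: value" line in page_content.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Without source_column the source is the file path;
        # with it, the source is that column's value for this row.
        source = row[source_column] if source_column else file_path
        documents.append(
            {"page_content": page_content, "metadata": {"source": source, "row": i}}
        )
    return documents


sample = "Book ID,Title,Author\n1,The Great Adventure,Alex Reed\n"
default_doc = rows_to_documents(sample, "books.csv")[0]
titled_doc = rows_to_documents(sample, "books.csv", source_column="Title")[0]

print(default_doc["metadata"])
print(titled_doc["metadata"])
```

In this sketch, as in the outputs above, the column used as the source still appears in `page_content`; only `metadata['source']` changes.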


## 2. File directory

In [2]:
from langchain_community.document_loaders import DirectoryLoader

# Uses UnstructuredFileLoader by default
loader = DirectoryLoader("./example data/", glob="*.csv")
docs = loader.load()
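
The `glob` argument decides which files in the directory are picked up. The sketch below uses `pathlib` in a temporary directory to show what a pattern such as `*.csv` matches; the file names are hypothetical, and the assumption that `DirectoryLoader` applies the pattern in the same way as `Path.glob` should be checked against its documentation.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files: one CSV at the top level, one text file, one nested CSV.
    (root / "books.csv").write_text("Book ID,Title\n1,The Great Adventure\n")
    (root / "notes.txt").write_text("not a CSV file\n")
    (root / "nested").mkdir()
    (root / "nested" / "more.csv").write_text("Book ID,Title\n3,The Lost City\n")

    # "*.csv" matches only files directly inside the directory.
    top_level = sorted(p.name for p in root.glob("*.csv"))
    # rglob applies the same pattern to every subdirectory as well.
    recursive = sorted(p.name for p in root.rglob("*.csv"))

print(top_level)
print(recursive)
```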

### Showing a progress bar

In [6]:
loader = DirectoryLoader("./example data/", glob="*.csv", show_progress=True)
docs = loader.load()

100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 108.48it/s]


### Using a different loader

In [10]:
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = DirectoryLoader("./example data/", glob="*.csv", loader_cls=CSVLoader)
docs = loader.load()
print(docs[0])

page_content='Book ID: 1\nTitle: The Great Adventure\nAuthor: Alex Reed\nPublication Year: 2010\nGenre: Adventure' metadata={'source': 'example data/2.1 Document Loader.csv', 'row': 0}


## 3. HTML

In [14]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("./example data/2.1 Langchain.html")
data = loader.load()

data

[Document(page_content='\n\nModules\n\nRetrieval\n\nDocument loaders\n\nHTML\n\nHTML\n\nThe HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.\n\nThis covers how to load HTML documents into a document format that we can use downstream.\n\nfrom\n\nlangchain_community\n\ndocument_loaders\n\nimport\n\nUnstructuredHTMLLoader\n\nloader\n\nUnstructuredHTMLLoader\n\n"example_data/fake-content.html"\n\ndata\n\nloader\n\nload\n\ndata\n\n[Document(page_content=\'My First Heading\\n\\nMy first paragraph.\', lookup_str=\'\', metadata={\'source\': \'example_data/fake-content.html\'}, lookup_index=0)]\n\nLoading HTML with BeautifulSoup4\u200b\n\nWe can also use BeautifulSoup4 to load HTML documents using the BSHTMLLoader.  This will extract the text from the HTML into page_content, and the page title as title into metadata.\n\nfrom\n\nlangchain_community\n\ndocument_loaders\n\nimport\n\nBSHTMLLoader\n\nloader\n\nBSHTMLLoader\n\n

### Loading HTML with BeautifulSoup4
BSHTMLLoader extracts the text of the HTML into page_content and stores the page title in metadata under title.

In [20]:
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("./example data/2.1 Langchain.html")
data = loader.load()
data[0].metadata

{'source': './example data/2.1 Langchain.html',
 'title': 'HTML | 🦜️🔗 Langchain'}
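
The idea of separating the page title from the body text can be sketched with the standard library alone. This example swaps in Python's `html.parser` for BeautifulSoup4, so it is not how `BSHTMLLoader` works internally; the `TitleAndText` class, the inline HTML, and the `demo.html` source name are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Illustrative parser: collects the <title> text separately from other text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only nodes
        if self.in_title:
            self.title += text
        else:
            self.chunks.append(text)


html = (
    "<html><head><title>Demo Page</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)

parser = TitleAndText()
parser.feed(html)
parser.close()

page_content = "\n".join(parser.chunks)
metadata = {"source": "demo.html", "title": parser.title}

print(page_content)
print(metadata)
```

Unlike this simplified sketch, a real loader may also keep the title text inside `page_content`.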

## 4. PDF

### PyPDF
Each page of the PDF is loaded as a separate Document.

In [25]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./example data/2417262.pdf")
data = loader.load()
data[0]

Document(page_content='Problem Chosen\nD2024\nMCM/ICM\nSummary SheetTeam Control Number\n2417262\nGL-PaD: Great Lakes’ Predictor-and-Decision maker\nSummary\nThis paper focuses on optimizing water level in the Great Lakes. It aims to coordi-\nnate the interests of stakeholders and develop a better water level management plan by\ncombining environmental factors and historical data. The superiority of the algorithm is\nverified through simulation.\nFor requirement 1 , we propose a model to evaluate the interests of each stakeholder,\nusing historical water level data to categorize them into primary and secondary groups.\nAdditionally, the lakes are classified based on water level fluctuation magnitude, lead-\ning to tailored optimization strategies for each. The established water levels for the five\nlakes at 183.35, 176.33, 175.10, 174.28, and 74.82 meters serve as benchmarks for future\noptimization control strategies.\nFor requirement 2 , we presents a joint differential predictor and
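
The page-per-Document structure can be pictured with the small sketch below. It does not parse a PDF: the `Doc` dataclass, `pages_to_documents`, the placeholder page strings, and `example.pdf` are hypothetical stand-ins, and the zero-based `page` entry in the metadata is an assumption about what the loader records.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    # One Doc per page; the page index starts at 0 (assumed convention).
    return [
        Doc(page_content=text, metadata={"source": source, "page": number})
        for number, text in enumerate(page_texts)
    ]


# Placeholder strings standing in for text already extracted from each page.
page_texts = ["text of page one", "text of page two", "text of page three"]
docs = pages_to_documents(page_texts, "example.pdf")

print(len(docs))
print(docs[0].metadata)
```

With this layout, indexing the list at `n` returns page `n + 1` of the file.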

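#### Extracting images with PyPDF
With extract_images=True, text recognized in images embedded in the PDF is added to page_content as well. This relies on an OCR package being installed (rapidocr-onnxruntime, per the LangChain documentation).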

In [29]:
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
print(pages[4].page_content)

LayoutParser : A Uniﬁed Toolkit for DL-Based DIA 5
Table 1: Current layout detection models in the LayoutParser model zoo
Dataset Base Model1Large Model Notes
PubLayNet [38] F / M M Layouts of modern scientiﬁc documents
PRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports
Newspaper [17] F - Layouts of scanned US newspapers from the 20th century
TableBank [18] F F Table region on modern scientiﬁc and business document
HJDataset [31] F / M - Layouts of history Japanese documents
1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [ 13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [ 28] (F) and Mask
R-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 backbone. The platform is maintained 