# WebBaseLoader

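This covers how to use `WebBaseLoader` to load all text from `HTML` webpages into a document format that can be used downstream. For more custom logic for loading webpages, look at some subclass examples such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`.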

If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support|
| :--- | :--- | :---: | :---: |  :---: |
| [WebBaseLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| WebBaseLoader | ✅ | ✅ |

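## Setup

### Credentials

`WebBaseLoader` does not require any credentials.

### Installation

To use `WebBaseLoader`, you first need to install the `langchain-community` Python package. The examples below also rely on `beautifulsoup4` for HTML parsing.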

In [None]:
%pip install -qU langchain_community beautifulsoup4

## Initialization

Now we can instantiate our loader object and load documents:

In [2]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.example.com/")

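To bypass SSL verification errors while fetching, you can set the `verify` option. Note that this disables certificate checking for the loader's requests: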

`loader.requests_kwargs = {'verify':False}`

### Initialization with multiple pages

You can also pass in a list of pages to load from.

In [3]:
loader_multiple_pages = WebBaseLoader(
    ["https://www.example.com/", "https://google.com"]
)

## Load

In [4]:
docs = loader.load()

docs[0]

Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n')

In [5]:
print(docs[0].metadata)

{'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}
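The metadata above carries a `source`, a `title`, and a `language` field. As a rough illustration of where such fields can come from in an HTML page, here is a minimal standard-library sketch. It is not `WebBaseLoader`'s own code (which builds on BeautifulSoup and may differ in detail); `MetadataSketch` and `sketch_metadata` are hypothetical names, and the fallback string is simply copied from the output shown above.

```python
from html.parser import HTMLParser


class MetadataSketch(HTMLParser):
    """Collect <title>, <meta name="description"> and <html lang=...>."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.metadata["language"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.metadata["description"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def sketch_metadata(html: str, url: str) -> dict[str, str]:
    parser = MetadataSketch()
    parser.feed(html)
    parser.close()
    # Assumed fallback, mirroring the 'language' value in the output above.
    return {"source": url, "language": "No language found.", **parser.metadata}


sample = "<html><head><title>Example Domain</title></head><body><p>hi</p></body></html>"
print(sketch_metadata(sample, "https://www.example.com/"))
```

A page that declares `<html lang="en">` or a description `<meta>` tag would yield the extra fields, which matches the richer metadata seen for some sites.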


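### Load multiple URLs concurrently

You can speed up the scraping process by fetching and parsing multiple URLs concurrently.

Concurrent requests are limited to a reasonable default of 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can raise the `requests_per_second` parameter to increase the maximum number of concurrent requests. Note that while this speeds up scraping, it may cause the server to block you. Be careful!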

In [6]:
%pip install -qU  nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

Note: you may need to restart the kernel to use updated packages.


In [8]:
loader = WebBaseLoader(["https://www.example.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs

Fetching pages: 100%|###########################################################################| 2/2 [00:00<00:00,  8.28it/s]


[Document(metadata={'source': 'https://www.example.com/', 'title': 'Example Domain', 'language': 'No language found.'}, page_content='\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'),
 Document(metadata={'source': 'https://google.com', 'title': 'Google', 'description': "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.", 'language': 'en'}, page_content='GoogleSearch Images Maps Play YouTube News Gmail Drive More »Web History | Settings | Sign in\xa0Advanced search5 ways Gemini can help during the HolidaysAdvertisingBusiness SolutionsAbout Google© 2024 - Privacy - Terms  ')]
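To make the idea of capping concurrent requests concrete, here is a generic `asyncio` sketch that bounds the number of in-flight "requests" with a semaphore. It is an illustration only, not `WebBaseLoader`'s internal code: `fake_fetch` and `fetch_all` are hypothetical names, and the fetch sleeps instead of touching the network. A semaphore caps simultaneous requests, which is related to but not the same as a strict per-second rate limit.

```python
import asyncio
import time


async def fake_fetch(url: str, semaphore: asyncio.Semaphore, delay: float = 0.05) -> str:
    """Stand-in for an HTTP request; sleeps instead of using the network."""
    async with semaphore:  # at most `max_concurrent` coroutines run this body at once
        await asyncio.sleep(delay)
        return f"<html>{url}</html>"


async def fetch_all(urls: list[str], max_concurrent: int = 2) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fake_fetch(u, semaphore) for u in urls))


urls = [f"https://www.example.com/page{i}" for i in range(4)]
start = time.perf_counter()
pages = asyncio.run(fetch_all(urls, max_concurrent=2))
elapsed = time.perf_counter() - start
print(len(pages), f"{elapsed:.2f}s")
```

With four URLs and a cap of two, the work happens in roughly two batches, so the total time is about twice the per-request delay. In a notebook, call `await fetch_all(urls)` directly rather than `asyncio.run(...)`.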

### Loading an XML file, or using a different BeautifulSoup parser

You can also look at `SitemapLoader`, which loads sitemap files and is an example of this feature in use.

In [9]:
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs

[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\nÂ§ 431.86\nSection Â§ 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by condu
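The `page_content` above is essentially the text nodes of the XML document joined by newlines. The following sketch reproduces that kind of flattening with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, purely to show what "extracting text from XML" means. The sample document and its tag names are made up for illustration, and `xml_to_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample; tag names are illustrative, not the govinfo.gov schema.
SAMPLE_XML = """
<section>
  <number>431.86</number>
  <heading>Uniform test method</heading>
  <p>(a) Scope.</p>
</section>
"""


def xml_to_text(xml_string: str) -> str:
    root = ET.fromstring(xml_string.strip())
    # itertext() yields every text node in document order;
    # whitespace-only nodes (indentation between tags) are dropped.
    pieces = (text.strip() for text in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


print(xml_to_text(SAMPLE_XML))
```

Unlike this sketch, the loader output above keeps the blank lines between elements, so expect extra `\n` characters in real `page_content`.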

## Lazy Load

You can use lazy loading to load only one page at a time, which minimizes memory requirements.

In [10]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
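The memory benefit of lazy loading comes from generator semantics: a page is fetched only when the consumer asks for it. Here is a minimal sketch of that behavior, using a hypothetical `pretend_fetch` that records which URLs were requested rather than calling the network.

```python
from typing import Iterator

fetched: list[str] = []  # records which URLs were actually "fetched"


def pretend_fetch(url: str) -> str:
    """Hypothetical stand-in for downloading and parsing one page."""
    fetched.append(url)
    return f"content of {url}"


def lazy_pages(urls: list[str]) -> Iterator[str]:
    # Nothing is fetched until the caller iterates.
    for url in urls:
        yield pretend_fetch(url)


urls = [
    "https://www.example.com/a",
    "https://www.example.com/b",
    "https://www.example.com/c",
]
first = next(lazy_pages(urls))
print(first)
print(fetched)  # only the first URL has been touched
```

Note that appending every document to a list, as in the cell above, gives up the memory advantage; to keep it, process each document inside the loop and discard it.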


### Async

In [12]:
pages = []
async for doc in loader.alazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)

Fetching pages: 100%|###########################################################################| 1/1 [00:00<00:00, 10.51it/s]



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
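The cell above uses a top-level `async for`, which works in a notebook but is a syntax error in a plain Python script. Outside a notebook, the loop has to live inside a coroutine that is driven by `asyncio.run`. The sketch below shows the pattern with a hypothetical async generator, `fake_alazy_load`, standing in for the loader so that it runs without network access.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load(urls: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async lazy loader."""
    for url in urls:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield f"content of {url}"


async def collect(urls: list[str]) -> list[str]:
    pages = []
    async for page in fake_alazy_load(urls):
        pages.append(page)
    return pages


pages = asyncio.run(collect(["https://www.example.com/"]))
print(pages)
```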





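## Using proxies

Sometimes you might need to use proxies to get around IP blocks. You can pass a dictionary of proxies to the loader (and to `requests` underneath) to use them.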

In [None]:
loader = WebBaseLoader(
    "https://www.walmart.com/search?q=parrots",
    proxies={
        # Replace {username} and {password} with your proxy credentials.
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)
docs = loader.load()
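Proxy credentials often contain characters such as `@` or `:` that would break the URL if pasted in verbatim. The sketch below builds a `requests`-style proxies dictionary with percent-encoded credentials. `build_proxies` is a hypothetical helper, the credentials are made up, and it assumes the proxy itself is reached over plain HTTP for both schemes; adjust the scheme if your provider says otherwise.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict[str, str]:
    """Hypothetical helper: build a requests-style proxies dict."""
    # Percent-encode credentials so "@" or ":" inside them cannot be
    # mistaken for URL delimiters.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}/"
    # Assumption: the same HTTP proxy endpoint handles both target schemes.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("alice", "p@ss:word", "proxy.service.com", 6666)
print(proxies["https"])
```

The resulting dictionary can be passed as the `proxies` argument shown in the cell above.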

## API reference

For detailed documentation of all `WebBaseLoader` features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html