# Diffbot

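>[Diffbot](https://docs.diffbot.com/docs/getting-started-with-diffbot) is a suite of ML-based products that make it easy to structure web data.

>Diffbot's [Extract API](https://docs.diffbot.com/reference/extract-introduction) is a service that structures and normalizes data from web pages.

>Unlike traditional web scraping tools, `Diffbot Extract` doesn't require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms the raw HTML markup into JSON. The resulting structured JSON follows a consistent [type-based ontology](https://docs.diffbot.com/docs/ontology), which makes it easy to extract data from multiple different web sources with the same schema.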

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/document_loaders/diffbot.ipynb)

## Overview
This guide covers how to use the [Diffbot Extract API](https://www.diffbot.com/products/extract/) to extract data from a list of URLs and turn it into structured JSON that we can use downstream.
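
To make the "URL in, structured JSON out" idea concrete, here is a minimal sketch of how a single Extract request could be assembled. It assumes the v3 Article endpoint with `token` and `url` query parameters. The helper name `build_extract_request` is hypothetical, and the sketch only builds the request URL without sending anything.

```python
from urllib.parse import urlencode

# Assumption: the v3 Article endpoint; check the Diffbot reference for the
# endpoint that matches your page type.
DIFFBOT_ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_extract_request(page_url, api_token):
    """Build the GET URL for one page (hypothetical helper, for illustration).

    urlencode percent-escapes the target URL so it survives as a query value.
    """
    query = urlencode({"token": api_token, "url": page_url})
    return f"{DIFFBOT_ARTICLE_ENDPOINT}?{query}"


request_url = build_extract_request("https://python.langchain.com/", "YOUR_TOKEN")
print(request_url)
```

The loader used later in this guide takes care of this step for you, so the sketch is only meant to show what kind of request sits underneath.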

## Setting up

Start by installing the required packages.

In [6]:
%pip install --upgrade --quiet langchain-community

Diffbot's Extract API requires an API token. Follow these instructions to [get a free API token](/docs/integrations/providers/diffbot#installation-and-setup), then set it as an environment variable.

In [None]:
%env DIFFBOT_API_TOKEN REPLACE_WITH_YOUR_TOKEN
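
If the variable is missing or still holds the placeholder, the problem would likely only show up later as a failed API call. A minimal sketch of an early check follows. The helper name `require_diffbot_token` is hypothetical and not part of LangChain or Diffbot.

```python
import os


def require_diffbot_token(env=os.environ):
    """Return the Diffbot token from the environment or fail with a clear message.

    Hypothetical helper for illustration; `env` is any mapping, which keeps
    the function easy to test without touching the real environment.
    """
    token = (env.get("DIFFBOT_API_TOKEN") or "").strip()
    if not token or token == "REPLACE_WITH_YOUR_TOKEN":
        raise RuntimeError("Set DIFFBOT_API_TOKEN to a valid Diffbot API token first.")
    return token
```

Calling `require_diffbot_token()` before constructing the loader surfaces a configuration mistake immediately instead of at request time.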

## Using the Document Loader

Import the `DiffbotLoader` class and instantiate it with a list of URLs and your Diffbot token.

In [10]:
import os

from langchain_community.document_loaders import DiffbotLoader

urls = [
    "https://python.langchain.com/",
]

loader = DiffbotLoader(urls=urls, api_token=os.environ.get("DIFFBOT_API_TOKEN"))

Call the `.load()` method to fetch the pages and view the loaded documents.

In [11]:
loader.load()

[Document(page_content="LangChain is a framework for developing applications powered by large language models (LLMs).\nLangChain simplifies every stage of the LLM application lifecycle:\nDevelopment: Build your applications using LangChain's open-source building blocks and components. Hit the ground running using third-party integrations and Templates.\nProductionization: Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence.\nDeployment: Turn any chain into an API with LangServe.\nlangchain-core: Base abstractions and LangChain Expression Language.\nlangchain-community: Third party integrations.\nPartner packages (e.g. langchain-openai, langchain-anthropic, etc.): Some integrations have been further split into their own lightweight packages that only depend on langchain-core.\nlangchain: Chains, agents, and retrieval strategies that make up an application's cognitive architecture.\nlanggraph: Build robust and state
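
The raw `Document` repr is hard to scan, so a short preview per page can help. The sketch below assumes each document exposes `page_content` and a `metadata` dict, and that the page URL is stored under a `source` key. The helper name `preview_documents` is hypothetical, and a stand-in object replaces the real loader output so the example runs without an API call.

```python
from types import SimpleNamespace


def preview_documents(docs, max_chars=80):
    """Return (source, preview) pairs for objects with page_content and metadata."""
    previews = []
    for doc in docs:
        # Assumption: the loader records the page URL under the "source" key.
        source = doc.metadata.get("source", "<unknown>")
        # Collapse newlines and repeated spaces into single spaces.
        text = " ".join(doc.page_content.split())
        previews.append((source, text[:max_chars]))
    return previews


# Stand-in for a loaded document; with a real loader, pass loader.load() instead.
sample = SimpleNamespace(
    page_content="LangChain is a framework\nfor developing applications.",
    metadata={"source": "https://python.langchain.com/"},
)
print(preview_documents([sample], max_chars=24))
```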

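## Transform Extracted Text to a Graph Document

The structured page content can be further processed with `DiffbotGraphTransformer` to extract entities and relationships into a graph.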

In [None]:
%pip install --upgrade --quiet langchain-experimental

In [13]:
from langchain_experimental.graph_transformers.diffbot import DiffbotGraphTransformer

diffbot_nlp = DiffbotGraphTransformer(
    diffbot_api_key=os.environ.get("DIFFBOT_API_TOKEN")
)
graph_documents = diffbot_nlp.convert_to_graph_documents(loader.load())
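
To get a quick sense of what was extracted, you can count node and relationship types across the resulting graph documents. This is a minimal sketch that assumes each graph document exposes `nodes` and `relationships`, and that each node and relationship has a `type` attribute. The helper name `summarize_graph_documents` is hypothetical, and stand-in objects are used so the example runs without calling the API.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_graph_documents(graph_documents):
    """Count node types and relationship types across graph documents."""
    node_types = Counter()
    relationship_types = Counter()
    for graph_doc in graph_documents:
        node_types.update(node.type for node in graph_doc.nodes)
        relationship_types.update(rel.type for rel in graph_doc.relationships)
    return node_types, relationship_types


# Stand-in data for illustration only; pass the real graph_documents instead.
sample_graph = SimpleNamespace(
    nodes=[
        SimpleNamespace(id="LangChain", type="Organization"),
        SimpleNamespace(id="LangSmith", type="Product"),
        SimpleNamespace(id="LangServe", type="Product"),
    ],
    relationships=[SimpleNamespace(type="DEVELOPS"), SimpleNamespace(type="DEVELOPS")],
)
node_counts, relationship_counts = summarize_graph_documents([sample_graph])
print(dict(node_counts), dict(relationship_counts))
```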

To continue and load the data into a knowledge graph, follow the [`DiffbotGraphTransformer` guide](/docs/integrations/graphs/diffbot/#loading-the-data-into-a-knowledge-graph).