# Amazon Textract 

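> [Amazon Textract](https://docs.aws.amazon.com/managedservices/latest/userguide/textract.html) is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents.
>
> It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from tables and forms. Today, many companies extract data from scanned documents (such as PDFs, images, tables, and forms) by hand, or with simple OCR software that needs manual configuration and often has to be updated when a form changes. To overcome these manual and expensive processes, `Textract` uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.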

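This sample demonstrates using `Amazon Textract` together with LangChain as a DocumentLoader.

`Textract` supports the `PDF`, `TIFF`, `PNG`, and `JPEG` formats.

`Textract` supports these [document sizes, languages, and characters](https://docs.aws.amazon.com/textract/latest/dg/limits-document.html).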
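
Before sending a file to the loader, it can help to check that it is in one of the formats listed above. The following is a minimal sketch, not part of LangChain or Textract: the helper name `has_supported_extension` and the set of extension spellings are assumptions made for illustration, and the check looks only at the file name, not at the file's actual content.

```python
from pathlib import Path

# Extensions for the formats listed above (PDF, TIFF, PNG, JPEG).
# The exact spellings accepted here are an assumption for this sketch.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def has_supported_extension(path: str) -> bool:
    """Return True if the file name ends with a supported extension."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(has_supported_extension("example_data/alejandro_rosalez_sample-small.jpeg"))
print(has_supported_extension("notes.docx"))
```

A name-based check like this only catches obvious mistakes early; the size and language limits linked above are still enforced by the service itself.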

In [1]:
%pip install --upgrade --quiet  boto3 langchain-openai tiktoken python-dotenv

In [2]:
%pip install --upgrade --quiet  "amazon-textract-caller>=0.2.0"

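## Example 1

The first example uses a local file, which is sent internally to the Amazon Textract synchronous API [DetectDocumentText](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html).

Per Textract requirements, local files and URL endpoints such as HTTP:// are limited to single-page documents.
Multi-page documents must reside on S3. This sample file is a JPEG.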

In [None]:
from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader("example_data/alejandro_rosalez_sample-small.jpeg")
documents = loader.load()

Output from the file:

In [10]:
documents

[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emer
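
The source rule described in this example (local files and HTTP(S) URLs are single-page only, multi-page documents must be on S3) can be sketched as a small pre-check. This is an illustration under stated assumptions: `classify_source`, `allows_multi_page`, and `split_s3_uri` are hypothetical helper names, not functions of the loader, and the classification relies only on the URL scheme.

```python
from urllib.parse import urlparse


def classify_source(path: str) -> str:
    """Return 's3', 'http', or 'local' based on the scheme of the path."""
    scheme = urlparse(path).scheme.lower()
    if scheme == "s3":
        return "s3"
    if scheme in ("http", "https"):
        return "http"
    # No scheme (or an unrelated one such as a drive letter): treat as local.
    return "local"


def allows_multi_page(path: str) -> bool:
    """Only S3 sources may hold multi-page documents."""
    return classify_source(path) == "s3"


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme.lower() != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


print(classify_source("example_data/alejandro_rosalez_sample-small.jpeg"))
print(allows_multi_page("s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"))
```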

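## Example 2
The next example loads a file from an HTTPS endpoint.
It must be a single page, because Amazon Textract requires all multi-page documents to be stored on S3.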

In [7]:
from langchain_community.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader(
    "https://amazon-textract-public-content.s3.us-east-2.amazonaws.com/langchain/alejandro_rosalez_sample_1.jpg"
)
documents = loader.load()

In [11]:
documents

[Document(page_content='Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emergency Contact 1: First Name: CARLOS Last Name: SALAZAR Phone: 212-555-0150 Relationship to Patient: BROTHER Emergency Contact 2: First Name: JANE Last Name: DOE Phone: 650-555-0123 Relationship FRIEND to Patient: Did you feel fever or feverish lately? Yes No Are you having shortness of breath? Yes No Do you have a cough? Yes No Did you experience loss of taste or smell? Yes No Where you in contact with any confirmed COVID-19 positive patients? Yes No Did you travel in the past 14 days to any regions affected by COVID-19? Yes No Patient Information First Name: ALEJANDRO Last Name: ROSALEZ Date of Birth: 10/10/1982 Sex: M Marital Status: MARRIED Email Address: Address: 123 ANY STREET City: ANYTOWN State: CA Zip Code: 12345 Phone: 646-555-0111 Emer

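## Example 3

Processing a multi-page document requires the document to be on S3. The sample document resides in a bucket in the us-east-2 region, and Textract must be called in that same region to succeed, so we set region_name on the client and pass the client to the loader to ensure Textract is called from us-east-2. Alternatively, you can run your notebook in us-east-2, set AWS_DEFAULT_REGION to us-east-2, or, when running in a different environment, pass in a boto3 Textract client with that region name, as in the cell below.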

In [12]:
import boto3

textract_client = boto3.client("textract", region_name="us-east-2")

file_path = "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf"
loader = AmazonTextractPDFLoader(file_path, client=textract_client)
documents = loader.load()
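
The region options described above can be made explicit with a small helper that prefers a value passed in code and falls back to `AWS_DEFAULT_REGION`. This is only a sketch: `pick_region` is a hypothetical helper, and it deliberately ignores other places where boto3 can find a region, such as the shared AWS config file.

```python
import os
from typing import Mapping, Optional


def pick_region(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit region if given, else AWS_DEFAULT_REGION."""
    region = explicit or env.get("AWS_DEFAULT_REGION")
    if not region:
        raise ValueError(
            "No region given and AWS_DEFAULT_REGION is not set; "
            "the sample bucket requires us-east-2."
        )
    return region


print(pick_region("us-east-2"))

# The result can then be used when creating the client, for example:
# textract_client = boto3.client("textract", region_name=pick_region("us-east-2"))
```

Failing early with a clear message is a design choice of this sketch; it avoids a harder-to-read error from the service when the client and the bucket are in different regions.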

Now get the number of pages to validate the response (printing the full response would be quite long...). We expect 16 pages.

In [13]:
len(documents)

16
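
Beyond counting pages, a short per-page summary is a cheap way to inspect a long result without printing it in full. The sketch below uses a stand-in `PageDoc` class so that it runs without AWS access; it assumes each loaded document exposes `page_content` and a `metadata` dict, and the `page` metadata key is an assumption (the code falls back to the list position if it is missing).

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a loaded document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_pages(docs, preview_chars: int = 40) -> list:
    """Return one line per page with its length and a short preview."""
    lines = []
    for index, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", index)  # fall back to list position
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"page {page}: {len(doc.page_content)} chars | {preview}")
    return lines


# Placeholder pages, not real Textract output.
sample = [PageDoc("first page text", {"page": 1}), PageDoc("second page text")]
for line in summarize_pages(sample):
    print(line)
```

With the real result, `summarize_pages(documents)` would be called in place of the placeholder list.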

## Example 4

You can optionally pass an additional parameter named `linearization_config` to AmazonTextractPDFLoader; it determines how the parser linearizes the text output after Textract runs.

In [None]:
from langchain_community.document_loaders import AmazonTextractPDFLoader
from textractor.data.text_linearization_config import TextLinearizationConfig

loader = AmazonTextractPDFLoader(
    "s3://amazon-textract-public-content/langchain/layout-parser-paper.pdf",
    linearization_config=TextLinearizationConfig(
        hide_header_layout=True,
        hide_footer_layout=True,
        hide_figure_layout=True,
    ),
)
documents = loader.load()
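
To make the idea of linearization concrete, the following conceptual sketch shows what hiding header, footer, and figure layout elements means for the final text. It is not the textractor implementation: `LayoutBlock`, `SketchConfig`, and `linearize` are hypothetical names, the block kinds are invented for illustration, and only the three flag names mirror the ones used in the cell above.

```python
from dataclasses import dataclass


@dataclass
class LayoutBlock:
    kind: str  # illustrative kinds: "header", "text", "figure", "footer"
    text: str


@dataclass
class SketchConfig:
    hide_header_layout: bool = False
    hide_footer_layout: bool = False
    hide_figure_layout: bool = False


def linearize(blocks, config: SketchConfig) -> str:
    """Join block texts in reading order, skipping the hidden kinds."""
    hidden = set()
    if config.hide_header_layout:
        hidden.add("header")
    if config.hide_footer_layout:
        hidden.add("footer")
    if config.hide_figure_layout:
        hidden.add("figure")
    return "\n".join(b.text for b in blocks if b.kind not in hidden)


# Placeholder page content, not real Textract output.
page = [
    LayoutBlock("header", "Running header"),
    LayoutBlock("text", "Body paragraph."),
    LayoutBlock("figure", "Figure text"),
    LayoutBlock("footer", "Page footer"),
]
config = SketchConfig(
    hide_header_layout=True, hide_footer_layout=True, hide_figure_layout=True
)
print(linearize(page, config))
```

Dropping these elements before the text reaches a language model keeps repeated page furniture out of the context.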

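## Using the AmazonTextractPDFLoader in a LangChain chain (e.g. OpenAI)

The AmazonTextractPDFLoader can be used in a chain the same way as the other loaders.
Textract itself has a [Queries feature](https://docs.aws.amazon.com/textract/latest/dg/API_Query.html), which offers functionality similar to the QA chain in this sample and is worth checking out as well.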
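
As a rough sketch of the Queries feature mentioned above, the snippet below builds the arguments for the synchronous `analyze_document` call with the `QUERIES` feature type and pairs each `QUERY` block in a response with its `QUERY_RESULT` answers. The helper names `build_query_request` and `extract_answers` are hypothetical, and `fake_response` is a hand-written stand-in shaped like a Textract response, not real service output.

```python
def build_query_request(bucket: str, key: str, questions: list) -> dict:
    """Build keyword arguments for analyze_document with the QUERIES feature."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": q} for q in questions]},
    }


def extract_answers(response: dict) -> dict:
    """Map each question text to the list of answer texts found for it."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        found = answers.setdefault(block["Query"]["Text"], [])
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                found.extend(blocks[i]["Text"] for i in rel["Ids"] if i in blocks)
    return answers


# Hand-written stand-in for a response, for illustration only.
fake_response = {
    "Blocks": [
        {
            "Id": "q1",
            "BlockType": "QUERY",
            "Query": {"Text": "What is the example question?"},
            "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
        },
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "example answer"},
    ]
}
print(extract_answers(fake_response))

# With AWS credentials and a boto3 Textract client, the request could be sent as:
# response = textract_client.analyze_document(
#     **build_query_request(
#         "amazon-textract-public-content",
#         "langchain/alejandro_rosalez_sample_1.jpg",
#         ["What is the example question?"],
#     )
# )
```

Unlike the QA chain, this approach asks the questions during extraction, so no separate language model call is involved.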

In [None]:
# You can also store your OPENAI_API_KEY in a .env file
# import os
# from dotenv import load_dotenv

# load_dotenv()

In [None]:
# Or set the OpenAI key directly in the environment
import os

os.environ["OPENAI_API_KEY"] = "your-OpenAI-API-key"

In [16]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

chain = load_qa_chain(llm=OpenAI(), chain_type="map_reduce")
query = "Who are the authors?"  # a plain string question (typo "autors" fixed)

chain.run(input_documents=documents, question=query)

' The authors are Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li, Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L., Lukasz Garncarek, Powalski, R., Stanislawek, T., Topolski, B., Halama, P., Gralinski, F., Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., Harley, A.W., Ufkes, A., Derpanis, K.G., He, K., Gkioxari, G., Dollár, P., Girshick, R., He, K., Zhang, X., Ren, S., Sun, J., Kay, A., Lamiroy, B., Lopresti, D., Mears, J., Jakeway, E., Ferriter, M., Adams, C., Yarasavage, N., Thomas, D., Zwaard, K., Li, M., Cui, L., Huang,'