Extract information from unstructured documents at scale with Amazon Bedrock
Converting documents into a structured database is a recurring business task. Common use cases include creating a product feature table from article descriptions, extracting metadata from legal contracts, and analyzing customer reviews.
This repo provides an AWS CDK solution for intelligent document processing in seconds using generative AI.
Key features:
- Extract different types of information, including:
  - Well-defined entities (name, title, etc.)
  - Numeric scores (sentiment, urgency, etc.)
  - Free-form content (summary, suggested response, etc.)
- Simply describe the attributes to be extracted without costly data annotation or model training
- Leverage Amazon Bedrock Data Automation and multi-modal LLMs on Amazon Bedrock
- Use Python API or demo frontend to process PDFs, MS Office, images, and other formats
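As an illustration of the three kinds of attributes listed above, a feature list could be assembled and sanity-checked along these lines. The `name`/`description` keys follow the example API call in this README; the specific feature names and the `validate_features` helper are hypothetical and not part of the repo.

```python
from typing import Dict, List

Feature = Dict[str, str]

# Hypothetical feature definitions, one per kind of attribute
features: List[Feature] = [
    # Well-defined entity
    {"name": "customer_name", "description": "full name of the customer"},
    # Numeric score
    {"name": "urgency", "description": "urgency of the request from 1 to 5"},
    # Free-form content
    {"name": "summary", "description": "one-sentence summary of the document"},
]


def validate_features(features: List[Feature]) -> None:
    """Raise ValueError if a feature lacks a name or description, or if names repeat."""
    seen = set()
    for feature in features:
        name = feature.get("name", "").strip()
        description = feature.get("description", "").strip()
        if not name or not description:
            raise ValueError(f"Feature needs a non-empty name and description: {feature}")
        if name in seen:
            raise ValueError(f"Duplicate feature name: {name}")
        seen.add(name)


validate_features(features)
```

Checking the list before submitting a job is a cheap way to catch typos, since each feature description acts as the extraction instruction.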
Example API Call
Refer to the demo notebook for the API implementation and usage examples:
docs = ['doc1', 'doc2']
features = [
{"name": "delay", "description": "delay of the shipment in days"},
{"name": "shipment_id", "description": "unique shipment identifier"},
{"name": "summary", "description": "one-sentence summary of the doc"},
]
run_idp_bedrock_api(
documents=docs,
features=features,
)
# [{'delay': 2, 'shipment_id': '123890', 'summary': 'summary1'},
# {'delay': 3, 'shipment_id': '678623', 'summary': 'summary2'}]
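Since the API returns one dictionary per document, the results can be loaded into a database table with little extra work. The following is a minimal sketch using Python's built-in `sqlite3` module; the `store_results` helper and the `shipments` table name are hypothetical, and the rows are copied from the example output above.

```python
import sqlite3
from typing import Dict, List


def store_results(rows: List[Dict], conn: sqlite3.Connection, table: str = "shipments") -> int:
    """Create a table from the keys of the first row and insert all rows."""
    if not rows:
        return 0
    columns = list(rows[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row.get(c) for c in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


# Rows copied from the example output above
results = [
    {"delay": 2, "shipment_id": "123890", "summary": "summary1"},
    {"delay": 3, "shipment_id": "678623", "summary": "summary2"},
]

conn = sqlite3.connect(":memory:")
store_results(results, conn)

# Query the extracted attributes like any other table
late = conn.execute("SELECT shipment_id FROM shipments WHERE delay > 2").fetchall()
print(late)  # [('678623',)]
```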
Web UI Video
idp_demo.mp4
This diagram depicts a high-level architecture of the solution:
To deploy the app to your AWS account, you can use a local IDE or create a SageMaker Notebook instance.
We recommend using SageMaker to avoid installing extra requirements. Set up an ml.t3.large instance and make sure the IAM role attached to the notebook has sufficient permissions for deploying CloudFormation stacks.
Clone the repo to a location of your choice:
git clone https://github.com/aws-samples/intelligent-document-processing-with-amazon-bedrock.git
When working from a SageMaker Notebook instance, run this script to install all missing requirements:
cd intelligent-document-processing-with-amazon-bedrock
sh install_deps.sh
When working locally, make sure you have the following installed and have access to the target AWS account:
- AWS CLI
- AWS Account: configure an AWS account with a profile
$ aws configure --profile [profile-name]
- Node.js
- AWS CDK Toolkit
- Python 3.9+
- uv - Fast Python package installer and resolver
- Docker Desktop
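Before continuing with a local setup, it can help to confirm that the command-line tools from the list above are on your `PATH`. This is a small sketch using only the standard library; the executable names in `REQUIRED_TOOLS` are assumptions and may differ on your system (for example, `python` instead of `python3`).

```python
import shutil
from typing import Iterable, List

# Executable names are assumptions; adjust to how the tools are installed on your system
REQUIRED_TOOLS = ("aws", "node", "cdk", "python3", "uv", "docker")


def find_missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```

Note that finding `docker` on the `PATH` does not guarantee the Docker daemon is running.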
Navigate to the repo folder and execute the following script to create a virtual environment on macOS or Linux:
sh install_env.sh
source .venv/bin/activate
Copy config-example.yml to config.yml and specify your project name and the modules you would like to deploy (e.g., whether to deploy a UI). Make sure you add your user email to the Amazon Cognito users list.
stack_name: idp-test # Used as stack name and prefix for resources (<16 chars, cannot start with "aws")
...
frontend:
  deploy_ecs: True # Whether to deploy demo frontend on ECS
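The constraints on `stack_name` noted in the config comment can be checked before deployment. The helper below is a hypothetical sketch, not part of the repo; treating the "aws" prefix check as case-insensitive is an assumption.

```python
from typing import List


def check_stack_name(stack_name: str) -> List[str]:
    """Return a list of problems with the stack name (empty if it looks fine)."""
    problems = []
    if not stack_name:
        problems.append("stack_name is empty")
    if len(stack_name) >= 16:
        problems.append("stack_name must be shorter than 16 characters")
    # Assumption: the prefix restriction applies regardless of letter case
    if stack_name.lower().startswith("aws"):
        problems.append('stack_name cannot start with "aws"')
    return problems


print(check_stack_name("idp-test"))  # []
print(check_stack_name("aws-document-processing"))
```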
- Open the target AWS account
- Open the Amazon Bedrock console and navigate to the region specified in config.yml
- Select "Model Access" in the left sidebar and browse through the list of available LLMs
- Make sure to request and enable access for the model IDs specified in config.yml
Bootstrap CDK in your account. When working locally, use the profile name you used in the aws configure step. When working from a SageMaker Notebook instance, no profile specification is required.
cdk bootstrap --profile [PROFILE_NAME]
Make sure the Docker daemon is running. On Mac, you can open Docker Desktop. On SageMaker, Docker daemon is already running.
cdk deploy --profile [PROFILE_NAME]
You can delete the CDK stack from your AWS account by running:
cdk destroy --profile [PROFILE_NAME]
or manually delete the CloudFormation stack from the AWS console.
Deploying CDK / CloudFormation stacks requires near-admin permissions. Make sure you have the necessary IAM permissions before running cdk deploy. Here is a detailed list of minimal required permissions to deploy a stack.
Deleting the stack may remove everything except the created S3 bucket, which contains the documents uploaded by the user and their processed versions. To delete this S3 bucket, you may need to empty it first. This is expected behavior, as S3 buckets may contain sensitive user data.
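Emptying the leftover bucket can be scripted with boto3. The sketch below assumes you pass in a boto3 `s3.Bucket` resource; the `empty_bucket` helper and the bucket name in the usage comment are hypothetical, and the operation is irreversible, so double-check the bucket before running it.

```python
def empty_bucket(bucket) -> None:
    """Delete every object (and, for versioned buckets, every version) in the bucket.

    `bucket` is expected to be a boto3 `s3.Bucket` resource.
    """
    bucket.object_versions.delete()  # all object versions and delete markers
    bucket.objects.all().delete()    # any remaining current objects


# Usage (requires boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   empty_bucket(boto3.resource("s3").Bucket("my-idp-bucket-name"))
```

Once the bucket is empty, it can be deleted from the S3 console or with `bucket.delete()`.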
This happens due to a wrong Python path. Change python3 in cdk.json to your Python alias.
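The change can also be made programmatically. This sketch assumes that the interpreter is named in the `"app"` command of cdk.json, as is conventional for CDK Python projects; the example file content and the `set_cdk_python` helper are hypothetical.

```python
import json


def set_cdk_python(cdk_json_text: str, python_alias: str) -> str:
    """Return cdk.json content with `python3` in the "app" command replaced by `python_alias`."""
    config = json.loads(cdk_json_text)
    words = config["app"].split()
    config["app"] = " ".join(python_alias if word == "python3" else word for word in words)
    return json.dumps(config, indent=2)


# Hypothetical cdk.json content; the real file will contain more keys
example = '{"app": "python3 app.py"}'
print(set_cdk_python(example, "python"))
```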
Follow steps in this notebook to run a job via an API call. You will need to:
- provide input document(s)
- provide a list of features to be extracted
- The URL to access the frontend appears as output at the end of the CDK deployment under "CloudfrontDistributionName"
or
- Open the AWS console, and go to CloudFront
- Copy the Domain name of the created distribution
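If you would rather look up the URL programmatically, the stack outputs can be read from CloudFormation. The sketch below parses a `describe_stacks` response; the `find_output` helper is hypothetical, and matching on a key fragment is an assumption, since CDK may add a suffix to output keys.

```python
from typing import Optional


def find_output(describe_stacks_response: dict, key_fragment: str) -> Optional[str]:
    """Return the value of the first stack output whose key contains `key_fragment`."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if key_fragment in output["OutputKey"]:
                return output["OutputValue"]
    return None


# Usage (requires boto3 and AWS credentials; "idp-test" is the example stack name):
#   import boto3
#   response = boto3.client("cloudformation").describe_stacks(StackName="idp-test")
#   print(find_output(response, "CloudfrontDistributionName"))
```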
Login credentials are available from:
- User name: email from the list of Cognito user emails in the authentication section of config.yml
- Password: temporary password received by email from no-reply@verificationemail.com after deployment
You can run the demo frontend locally for testing and development by following these steps:
- Deploy the CDK stack once
- Go to src/ecs/.env and set STACK_NAME to your stack name in the config.yml
- Provide AWS credentials
  - You can add AWS credentials to the src/ecs/.env file
  - Or simply export credentials in your terminal, e.g. export AWS_PROFILE=<profile>
- Navigate to the frontend folder, create environment and install dependencies:
cd src/ecs
uv venv
source .venv/bin/activate
uv sync --extra dev
- Start frontend on localhost:
streamlit run src/Home.py
- Copy the local URL from the terminal output and paste it into the address bar of your browser
- Make sure that the local URL you use is http://localhost:8501. The frontend will not work with any other URL
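The step of setting `STACK_NAME` in src/ecs/.env can be scripted if you switch between stacks often. This is a minimal sketch that works on the file content as text; the `set_env_var` helper and the `AWS_REGION` entry in the example are hypothetical.

```python
def set_env_var(env_text: str, key: str, value: str) -> str:
    """Return .env content with `key` set to `value`, replacing an existing entry or appending one."""
    lines = env_text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        # Key not present yet: add it at the end
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical .env content
print(set_env_var("STACK_NAME=\nAWS_REGION=us-east-1\n", "STACK_NAME", "idp-test"))
```

To apply it, read src/ecs/.env, pass its content through `set_env_var`, and write the result back.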
Core team:
Nikita Kozodoi, Nuno Castro
Contributors:
Romain Besombes, Zainab Afolabi, Egor Krasheninnikov, Huong Vu, Aiham Taleb, Elizaveta Zinovyeva, Babs Khalidson, Ennio Pastore
Acknowledgements:
See CONTRIBUTING for more information.
Note: this asset represents a proof-of-value for the services included and is not intended as a production-ready solution. You must determine how the AWS Shared Responsibility Model applies to your specific use case and implement the controls needed to achieve your desired security outcomes. AWS offers a broad set of security tools and configurations to enable our customers.
- Input data:
  - Note that the solution is not scoped for processing regulated data.
- Network & Delivery:
  - CloudFront:
    - Use geography-aware rules to block or allow access to CloudFront distributions where required.
    - Use AWS WAF on public CloudFront distributions.
    - Ensure that solution CloudFront distributions use a security policy with minimum TLSv1.1 or TLSv1.2 and appropriate security ciphers for HTTPS viewer connections. Currently, the CloudFront distribution allows for SSLv3 or TLSv1 for HTTPS viewer connections and uses SSLv3 or TLSv1 for communication to the origin.
  - API Gateway:
    - Activate request validation on API Gateway endpoints to do first-pass input validation.
    - Use AWS WAF on public-facing API Gateway endpoints.
- Machine Learning and AI:
  - Bedrock
    - Enable model invocation logging and set alerts to ensure adherence to any responsible AI policies. Model invocation logging is disabled by default. See https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation-logging.html
    - Consider enabling Bedrock Guardrails to add baseline protections against analyzing documents or extracting attributes covering certain protected topics.
  - Comprehend
    - Consider using Amazon Comprehend for detecting and masking PII data in the user-uploaded inputs.
- Security & Compliance:
  - Cognito
    - Implement multi-factor authentication (MFA) in each Cognito User Pool.
    - Consider setting AdvancedSecurityMode to ENFORCED in Cognito User Pools.
  - KMS
    - Implement KMS key rotation for regulatory compliance or other specific cases.
    - Configure, monitor, and alert on KMS events according to lifecycle policies.
- Serverless:
  - Lambda
    - Periodically scan all AWS Lambda container images for vulnerabilities according to lifecycle policies. Amazon Inspector can be used for that.
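As one example of acting on the recommendations above, Bedrock model invocation logging can be switched on through the `put_model_invocation_logging_configuration` call of the boto3 `bedrock` client. The sketch below only builds the `loggingConfig` payload for CloudWatch Logs delivery; the `build_logging_config` helper is hypothetical, the log group and IAM role must already exist, and the choice of which data types to log is an assumption you should adapt to your own policies.

```python
def build_logging_config(log_group_name: str, role_arn: str, log_images: bool = False) -> dict:
    """Build a `loggingConfig` payload that delivers invocation logs to CloudWatch Logs."""
    return {
        "cloudWatchConfig": {
            "logGroupName": log_group_name,  # existing CloudWatch Logs group
            "roleArn": role_arn,             # role that Bedrock assumes to write logs
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": log_images,
        "embeddingDataDeliveryEnabled": False,
    }


# Usage (requires boto3 and AWS credentials; the names below are placeholders):
#   import boto3
#   config = build_logging_config("<log-group-name>", "<role-arn>")
#   boto3.client("bedrock").put_model_invocation_logging_configuration(loggingConfig=config)
```

Keep in mind that invocation logs may contain document content, so the log group should be protected with the same care as the input bucket.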
To keep coding standards and formatting consistent, we use pre-commit. This can be run from the terminal via uv run pre-commit run -a.
This library is licensed under the MIT-0 License. See the LICENSE file.