---

## What This Notebook Does

1. **Scans an S3 path** recursively for source documents
2. **Analyzes folder hierarchy** to determine document context (e.g., `/controls/` vs `/policies/`)
3. **Uses Amazon Bedrock** to intelligently extract metadata from document content
4. **Generates `.metadata.json` files** in the correct format for Amazon Bedrock Knowledge Bases
5. **Uploads metadata files** to S3 alongside source documents

---



# Generating Metadata Files for Amazon Bedrock Knowledge Bases with Auto-Generated Query Filters

This notebook demonstrates how to automatically generate metadata JSON files for documents stored in Amazon S3, enabling the **auto-generated query filters** feature in Amazon Bedrock Knowledge Bases for improved retrieval accuracy.

## Overview

Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval-Augmented Generation (RAG) workflows for building accurate, low-latency, secure, and custom generative AI applications that incorporate contextual information from your data sources [[1]](https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/). This notebook focuses on preparing your documents with metadata so that you can use the **auto-generated query filters** capability.

## What Are Auto-Generated Query Filters?

Auto-generated query filters extend the existing capability of manual metadata filtering by allowing you to narrow down search results **without the need to manually construct complex filter expressions** [[1]](https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/). This feature improves retrieval accuracy by ensuring the documents retrieved are relevant to the query.

For example, for a query like "How to file a claim in Washington", the state "Washington" will be automatically applied as a filter to retrieve only those documents pertaining to that particular state [[1]](https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/).
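For comparison, the manual filter that the auto-generated one replaces can be sketched as below. The dictionary has the shape of the `retrievalConfiguration` argument accepted by the `retrieve` call of the `bedrock-agent-runtime` client. The attribute name `state`, the helper name `build_manual_filter`, and the result count are assumptions made for illustration; your metadata files define the real attribute names.

```python
import json


def build_manual_filter(key: str, value: str, number_of_results: int = 5) -> dict:
    """Build a retrieval configuration with an explicit `equals` metadata filter."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": number_of_results,
            # Only chunks whose metadata attribute `key` equals `value` are returned
            "filter": {"equals": {"key": key, "value": value}},
        }
    }


# Hypothetical attribute name "state", matching the Washington example above
config = build_manual_filter("state", "Washington")
print(json.dumps(config, indent=2))
```

With auto-generated query filters, this expression does not have to be written by hand; the service derives it from the query text and the metadata attributes you describe.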

## Metadata File Requirements

For metadata filtering to work, you must create JSON metadata files that follow these conventions:

- The metadata file must be named `<document>.<extension>.metadata.json` [[2]](https://docs.aws.amazon.com/kendra/latest/dg/s3-metadata.html)
- For Amazon Bedrock Knowledge Bases, the metadata file must be stored in the same S3 folder as the document it describes, so its key is the document's key suffixed with `.metadata.json`. This is the layout this notebook writes. Amazon Kendra and Amazon Q Business additionally allow a separate metadata prefix, to which the document's key is appended [[3]](https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/s3-metadata.html); that layout is not used here.
- The metadata file must be a UTF-8 text file without a BOM marker [[2]](https://docs.aws.amazon.com/kendra/latest/dg/s3-metadata.html)

### Example File Mapping

```
Bucket name:
     s3://bucketName
Document path:
     documents/legal
Metadata path:
     documents/legal (same folder as the document)
File mapping:
     s3://bucketName/documents/legal/file.txt -> 
        s3://bucketName/documents/legal/file.txt.metadata.json
```
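The naming and encoding rules can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the helper names `metadata_key_for` and `encode_metadata` are hypothetical, and the key follows the side-by-side layout that the notebook's processing code uses (document key plus `.metadata.json`). The `metadataAttributes` wrapper is the one the notebook writes.

```python
import codecs
import json


def metadata_key_for(document_key: str) -> str:
    """Return the S3 key of the metadata file stored next to the document."""
    return f"{document_key}.metadata.json"


def encode_metadata(attributes: dict) -> bytes:
    """Serialize attributes as UTF-8 JSON without a byte order mark."""
    body = json.dumps({"metadataAttributes": attributes}, indent=2, ensure_ascii=False)
    # The "utf-8" codec never emits a BOM; "utf-8-sig" would add one
    return body.encode("utf-8")


key = metadata_key_for("documents/legal/file.txt")
payload = encode_metadata({"document_type": "policy"})
print(key)
print(payload.startswith(codecs.BOM_UTF8))
```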

## Why This Approach?

RAG applications process user queries by searching across a large set of documents. However, in many situations, you might need to retrieve documents with specific attributes or content. You can use metadata filtering to narrow down search results by specifying inclusion and exclusion criteria [[4]](https://aws.amazon.com/blogs/machine-learning/from-concept-to-reality-navigating-the-journey-of-rag-from-proof-of-concept-to-production/). By pre-generating comprehensive metadata files, you enable the auto-generated query filters to automatically apply relevant filters based on the document's metadata without requiring manual filter construction [[1]](https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/).

## Regional Availability

The auto-generated query filters capability is available in US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Europe (Frankfurt), Europe (Zurich), and AWS GovCloud (US-West) [[1]](https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/).

## Sources:

[1] Title: "Amazon Bedrock Knowledge Bases now provides auto-generated query filters for improved retrieval"
URL: https://aws.amazon.com/about-aws/whats-new/2024/12/amazon-bedrock-knowledge-bases-auto-generated-query-filters-improved-retrieval/
Document Type: AWS Announcement

[2] Title: "Amazon S3 document metadata - Amazon Kendra"
URL: https://docs.aws.amazon.com/kendra/latest/dg/s3-metadata.html
Document Type: AWS Documentation

[3] Title: "Adding document metadata in Amazon S3 - Amazon Q Business"
URL: https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/s3-metadata.html
Document Type: AWS Documentation

[4] Title: "From concept to reality: Navigating the Journey of RAG from proof of concept to production"
URL: https://aws.amazon.com/blogs/machine-learning/from-concept-to-reality-navigating-the-journey-of-rag-from-proof-of-concept-to-production/
Document Type: AWS Machine Learning Blog


In [1]:
!pip install blackcellmagic
%load_ext blackcellmagic

Collecting blackcellmagic
  Using cached blackcellmagic-0.0.3-py3-none-any.whl.metadata (3.7 kB)
Collecting black<22.0,>=21.9b0 (from blackcellmagic)
  Using cached black-21.12b0-py3-none-any.whl.metadata (39 kB)
Collecting tomli<2.0.0,>=0.2.6 (from black<22.0,>=21.9b0->blackcellmagic)
  Using cached tomli-1.2.3-py3-none-any.whl.metadata (9.1 kB)
Using cached blackcellmagic-0.0.3-py3-none-any.whl (4.2 kB)
Using cached black-21.12b0-py3-none-any.whl (156 kB)
Using cached tomli-1.2.3-py3-none-any.whl (12 kB)
Installing collected packages: tomli, black, blackcellmagic
  Attempting uninstall: tomli
    Found existing installation: tomli 2.3.0
    Uninstalling tomli-2.3.0:
      Successfully uninstalled tomli-2.3.0
  Attempting uninstall: black
    Found existing installation: black 25.1.0
    Uninstalling black-25.1.0:
      Successfully uninstalled black-25.1.0
Successfully installed black-21.12b0 blackcellmagic-0.0.3 tomli-1.2.3


In [2]:
import boto3
import json
import os
from typing import Dict, Optional

In [3]:
# Initialize AWS clients
s3_client = boto3.client('s3')
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

In [4]:
# ============================================================================
# CONFIGURATION SECTION - Update these values
# ============================================================================

# S3 Configuration
INPUT_BUCKET = "183023889407-us-east-1-compliance-rule-generator"
INPUT_PREFIX = "kb-data-source/"  # Folder path in S3 where knowledge base is stored

# AWS Region
AWS_REGION = "us-east-1"

# Bedrock Model Configuration

MODELS = {
    "premium": "us.anthropic.claude-opus-4-5-20251101-v1:0",  # not available
    "good": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",  # times out
    "balanced": "us.anthropic.claude-sonnet-4-20250514-v1:0",
    "fast_cheap": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
    "aws_native": "amazon.nova-premier-v1:0",
}
MODEL_ID = MODELS["balanced"]

MAX_TOKENS = 4096

TEMPERATURE = 0.1
# If you find the metadata generation is too rigid or missing nuanced interpretations, you could increase
# the temperature slightly (to 0.2-0.3). However, for most metadata extraction tasks, keeping temperature
# low ensures the model focuses on key information and maintains fidelity to the original document content.

In [5]:
def parse_s3_path(s3_path: str) -> tuple:
    """Parse an S3 path into bucket name and prefix."""
    if s3_path.startswith('s3://'):
        s3_path = s3_path[5:]
    parts = s3_path.split('/', 1)
    return parts[0], parts[1] if len(parts) > 1 else ''

In [6]:
def get_document_content(bucket_name: str, object_key: str) -> Optional[str]:
    """Retrieve text content from a document in S3."""
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=object_key)
        body = response['Body'].read()
        
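        # Text-based files must decode cleanly; a decoding failure here is
        # reported by the outer handler and the function returns None
        if any(object_key.endswith(ext) for ext in ['.txt', '.md', '.json', '.csv']):
            return body.decode('utf-8')
        else:
            # Other formats (e.g., PDF) typically are not valid UTF-8. In that
            # case return a placeholder, so the model only sees the path and
            # filename rather than the document text.
            try:
                return body.decode('utf-8')
            except UnicodeDecodeError:
                return f"[Binary Document: {os.path.basename(object_key)}]"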
    except Exception as e:
        print(f"Error reading document {object_key}: {e}")
        return None
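
`get_document_content` returns the whole document, and the prompt in the next cell embeds it in full. For large files that may exceed the model's context window. One possible guard is sketched below; `truncate_for_prompt` is a hypothetical helper that is not part of this notebook, and the 50,000-character default is an arbitrary assumption to tune for the chosen model.

```python
def truncate_for_prompt(text: str, max_chars: int = 50_000) -> str:
    """Keep the start and end of a long document and mark the omitted middle."""
    if len(text) <= max_chars:
        return text
    head = max_chars * 3 // 4  # most identifying details tend to sit near the top
    tail = max_chars - head
    marker = "\n[... content truncated ...]\n"
    return text[:head] + marker + text[len(text) - tail:]


sample = "a" * 200 + "z" * 50
shortened = truncate_for_prompt(sample, max_chars=100)
print(len(sample), len(shortened))
```

Keeping part of the tail is a design choice: revision histories and version tables often appear at the end of policy documents.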

In [7]:
def generate_metadata_with_model(
    object_key: str,
    document_content: str,
    model_id: str = MODEL_ID,
) -> Dict:
    """
    Use the model's intelligence to analyze the document and generate
    appropriate metadata without hardcoded schemas.
    """

    filename = os.path.basename(object_key)

    # Simple, flexible prompt that lets the model decide the metadata
    prompt = f"""Analyze this document and generate metadata for an Amazon Bedrock Knowledge Base.

Document path: {object_key}
Filename: {filename}

Document content:
{document_content}

Based on your analysis of the document content and its location in the folder structure, 
generate appropriate metadata as a JSON object. Consider:

1. What type of document is this? (e.g., policy, control, procedure, guideline, etc.)
2. What topics or categories does it cover?
3. Are there any identifiers, version numbers, or dates mentioned?
4. What keywords would help someone find this document?
5. Any other relevant attributes you can extract from the content.

The folder path may provide context (e.g., /controls/ suggests NIST controls, 
/policies/ suggests company policies).

Return ONLY a valid JSON object with up to 9 top recommended metadata attributes. 
Generate no more than 9 attributes.
Use clear, consistent key names in snake_case. Example format:
{{
    "document_type": "...",
    "category": "...",
    "keywords": "...",
    ...any other relevant fields...
}}"""

    try:
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": MAX_TOKENS,
            "temperature": TEMPERATURE,
            "messages": [{"role": "user", "content": prompt}],
        }

        response = bedrock_runtime.invoke_model(
            modelId=model_id, body=json.dumps(payload)
        )

        result = json.loads(response["body"].read())
        generated_text = result["content"][0]["text"].strip()

        # Clean up markdown formatting if present
        if generated_text.startswith("```"):
            generated_text = generated_text.split("```")[1]
            if generated_text.startswith("json"):
                generated_text = generated_text[4:]
        if generated_text.endswith("```"):
            generated_text = generated_text[:-3]

        return json.loads(generated_text.strip())

    except Exception as e:
        print(f"Error generating metadata: {e}")
        # Minimal fallback
        return {"source_path": object_key, "filename": filename}
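
The markdown cleanup above assumes the reply either is bare JSON or starts with a code fence. Models sometimes add a sentence before or after the fence. A more tolerant parser could look like the sketch below; `extract_json_object` is a hypothetical helper, not something the notebook defines, and it assumes the reply contains exactly one JSON object.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def extract_json_object(text: str) -> dict:
    """Parse the JSON object contained in a model reply."""
    text = text.strip()
    # Prefer the contents of the first fenced block, if there is one
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Otherwise (or within the fence) take the outermost pair of braces
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


reply = (
    "Here is the metadata:\n"
    + FENCE + "json\n"
    + '{"document_type": "policy"}\n'
    + FENCE + "\nLet me know if you need more."
)
print(extract_json_object(reply))
```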

In [8]:
def generate_metadata_files_for_s3_path(
    s3_path: str,
    dry_run: bool = False,
    overwrite_existing: bool = False,
    model_id: str = MODEL_ID,
) -> Dict:
    
    """Generate metadata files for all documents in an S3 path, 
    relying on model intelligence to determine appropriate metadata."""
    
    bucket_name, prefix = parse_s3_path(s3_path)
    
    print(f"Processing S3 path: s3://{bucket_name}/{prefix}")
    print(f"Using model: {model_id}")
    print("-" * 60)
    
    results = {"processed": 0, "skipped": 0, "errors": 0, "files": []}
    
    # List all objects
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        if "Contents" not in page:  # Ensure contents returned
            continue
        
        for obj in page["Contents"]:
            key = obj["Key"]
            
            # Skip metadata files and folders
            if key.endswith('.metadata.json') or key.endswith('/'):
                continue

            metadata_key = f"{key}.metadata.json"
            print(f"\nProcessing: {key}")
            
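            # Check if metadata exists. Compare the returned key exactly:
            # Prefix alone would also match longer keys that merely start
            # with metadata_key.
            if not overwrite_existing:
                response = s3_client.list_objects_v2(
                    Bucket=bucket_name, 
                    Prefix=metadata_key, 
                    MaxKeys=1
                )
                existing = response.get("Contents", [])
                if existing and existing[0]["Key"] == metadata_key:
                    print("  Skipping - metadata exists")
                    results["skipped"] += 1
                    continue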
                    
            content = get_document_content(bucket_name, key)
            if not content:
                results["errors"] += 1
                continue
            
            metadata_attributes = generate_metadata_with_model(key, content, model_id)
            metadata_file = {"metadataAttributes": metadata_attributes}
            
            print(f"  Generated: {json.dumps(metadata_attributes, indent=2)}")
            
            if not dry_run:
                s3_client.put_object(
                    Bucket=bucket_name,
                    Key=metadata_key,
                    Body=json.dumps(metadata_file, indent=2),
                    ContentType="application/json",
                )
                
                print(f"  Created: {metadata_key}")
            results["processed"] += 1
            results["files"].append(metadata_key)
    print(f"\n{'='*60}\nProcessed: {results['processed']}, Skipped: {results['skipped']}, Errors: {results['errors']}")
    return results
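
The function above writes whatever the model returns into `metadataAttributes`. The prompt asks for at most 9 flat attributes, but nothing enforces that. A post-processing step could be sketched as follows. `sanitize_attributes` is a hypothetical helper, and the set of accepted value types (string, number, boolean, list of strings) is an assumption that should be checked against the Amazon Bedrock Knowledge Bases documentation.

```python
import json


def sanitize_attributes(raw: dict, max_attributes: int = 9) -> dict:
    """Keep at most `max_attributes` entries and flatten unsupported values."""
    clean = {}
    for key, value in raw.items():
        if len(clean) >= max_attributes:
            break
        if value is None:
            continue  # drop empty attributes instead of writing null
        if isinstance(value, (str, bool, int, float)):
            clean[key] = value
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            clean[key] = value
        else:
            # Nested objects and mixed lists are stored as a JSON string
            clean[key] = json.dumps(value, sort_keys=True)
    return clean


raw = {
    "document_type": "policy",
    "version": 5,
    "draft": False,
    "tags": ["security", "access"],
    "owner": {"team": "sec"},
    "empty": None,
}
print(sanitize_attributes(raw))
```

It would be applied to the return value of `generate_metadata_with_model` before building `metadata_file`.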

In [9]:
s3_path = "s3://" + INPUT_BUCKET + "/" + INPUT_PREFIX
results = generate_metadata_files_for_s3_path(s3_path, dry_run=False, overwrite_existing=True)


Processing S3 path: s3://183023889407-us-east-1-compliance-rule-generator/kb-data-source/
Using model: us.anthropic.claude-sonnet-4-20250514-v1:0
------------------------------------------------------------

Processing: kb-data-source/controls/NIST.SP.800-53r5-1-Abstract.pdf
  Generated: {
  "document_type": "abstract",
  "category": "security_controls",
  "standard": "NIST SP 800-53",
  "version": "revision_5",
  "framework": "cybersecurity_framework",
  "keywords": "NIST, security controls, risk management, cybersecurity, federal information systems",
  "document_section": "abstract_summary",
  "compliance_standard": "NIST_800_53_r5",
  "content_classification": "security_guidance"
}
  Created: kb-data-source/controls/NIST.SP.800-53r5-1-Abstract.pdf.metadata.json

Processing: kb-data-source/controls/NIST.SP.800-53r5-28-44-2-Introduction.pdf
  Generated: {
  "document_type": "cybersecurity_control_framework",
  "category": "security_controls",
  "standard": "NIST SP 800-53 Revision 5"