# Metadata filtering using Amazon Bedrock Knowledge Bases
This notebook provides a sample code walkthrough of the metadata filtering feature of Amazon Bedrock Knowledge Bases.

With metadata filtering, you can improve search results by pre-filtering the documents that are retrieved from the vector store.
For more details on this feature, please read this [blog](https://aws.amazon.com/blogs/machine-learning/amazon-bedrock-knowledge-bases-now-supports-metadata-filtering-to-improve-retrieval-accuracy/).

## 1. Import the needed libraries
The first step is to install the prerequisite packages.

In [1]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
import botocore
botocore.__version__

'1.42.2'

In [4]:
import os
import sys
import time
import boto3
import logging
import pprint
import json

# Set the path to import module
from pathlib import Path
current_path = Path().resolve()
current_path = current_path.parent
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
# print(sys.path)

from utils.knowledge_base import BedrockKnowledgeBase

In [5]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

('us-west-2', '183631345587')

In [6]:
import time

# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"
knowledge_base_name = 'metadata-filtering-kb'
knowledge_base_description = "Knowledge Base metadata filtering."
bucket_name = f'{knowledge_base_name}-{suffix}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

# Define data sources
data_source=[{"type": "S3", "bucket_name": bucket_name}]

## 2. Create a knowledge base with fixed chunking strategy
Let's start by creating an [Amazon Bedrock knowledge base](https://aws.amazon.com/bedrock/knowledge-bases/) to store video game data in CSV format. Knowledge Bases can integrate with different vector databases, including [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/features/serverless/), [Amazon Aurora](https://aws.amazon.com/rds/aurora/), [Pinecone](http://app.pinecone.io/bedrock-integration), Redis Enterprise, and MongoDB Atlas. For this example, we will integrate the knowledge base with Amazon OpenSearch Serverless. To do so, we will use the helper class `BedrockKnowledgeBase`, which creates the knowledge base and all of its prerequisites:
1. IAM roles and policies
2. S3 bucket
3. Amazon OpenSearch Serverless encryption, network and data access policies
4. Amazon OpenSearch Serverless collection
5. Amazon OpenSearch Serverless vector index
6. Knowledge base
7. Knowledge base data source

We will create a knowledge base using the fixed-size chunking strategy.

You can choose a different chunking strategy by changing the parameter value below:
```
"chunkingStrategy": "FIXED_SIZE | NONE | HIERARCHICAL | SEMANTIC"
```
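
For orientation, the chunking choice ends up in the `vectorIngestionConfiguration` that the `bedrock-agent` `create_data_source` call accepts. The sketch below builds that dictionary for each strategy. How `BedrockKnowledgeBase` does this internally is not shown in this notebook, and the function name and all numeric values (300 tokens, 20 percent overlap, and so on) are illustrative assumptions.

```python
def build_vector_ingestion_configuration(strategy: str) -> dict:
    """Build a vectorIngestionConfiguration dict for one chunking strategy.

    The numeric values are placeholders chosen for illustration only.
    """
    chunking = {"chunkingStrategy": strategy}
    if strategy == "FIXED_SIZE":
        chunking["fixedSizeChunkingConfiguration"] = {
            "maxTokens": 300,
            "overlapPercentage": 20,
        }
    elif strategy == "HIERARCHICAL":
        chunking["hierarchicalChunkingConfiguration"] = {
            # Parent chunks first, then child chunks
            "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
            "overlapTokens": 60,
        }
    elif strategy == "SEMANTIC":
        chunking["semanticChunkingConfiguration"] = {
            "maxTokens": 300,
            "bufferSize": 1,
            "breakpointPercentileThreshold": 95,
        }
    elif strategy != "NONE":
        raise ValueError(f"Unknown chunking strategy: {strategy}")
    return {"chunkingConfiguration": chunking}


fixed_size_config = build_vector_ingestion_configuration("FIXED_SIZE")
print(fixed_size_config)
```

With `NONE`, each file is treated as a single chunk, so no strategy-specific block is needed.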

In [7]:
knowledge_base_metadata = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source, 
    chunking_strategy = "FIXED_SIZE", 
    suffix = suffix
)

Step 1 - Creating or retrieving S3 bucket(s) for Knowledge Base documents
['metadata-filtering-kb-4171508']
buckets_to_check:  ['metadata-filtering-kb-4171508']
Creating bucket metadata-filtering-kb-4171508
Step 2 - Creating Knowledge Base Execution Role (AmazonBedrockExecutionRoleForKnowledgeBase_4171508) and Policies
Step 3a - Creating OSS encryption, network and data access policies
Step 3b - Creating OSS Collection (this step takes a couple of minutes to complete)
{ 'ResponseMetadata': { 'HTTPHeaders': { 'connection': 'keep-alive',
                                         'content-length': '318',
                                         'content-type': 'application/x-amz-json-1.0',
                                         'date': 'Thu, 04 Dec 2025 17:15:10 '
                                                 'GMT',
                                         'x-amzn-requestid': '2c5fe063-fdab-4925-8dff-9770a2d21078'},
                        'HTTPStatusCode': 200,
                      

[2025-12-04 17:16:41,563] p6853 {base.py:258} INFO - PUT https://94pcv6qsnq50jx9dtoxd.us-west-2.aoss.amazonaws.com:443/bedrock-sample-rag-index-4171508 [status:200 request:0.370s]



Creating index:
{ 'acknowledged': True,
  'index': 'bedrock-sample-rag-index-4171508',
  'shards_acknowledged': True}
Step 4 - Will create Lambda Function if chunking strategy selected as CUSTOM
Not creating lambda function as chunking strategy is FIXED_SIZE
Step 5 - Creating Knowledge Base
{ 'createdAt': datetime.datetime(2025, 12, 4, 17, 17, 41, 671013, tzinfo=tzlocal()),
  'description': 'Knowledge Base metadata filtering.',
  'knowledgeBaseArn': 'arn:aws:bedrock:us-west-2:183631345587:knowledge-base/NMYAZRK8CK',
  'knowledgeBaseConfiguration': { 'type': 'VECTOR',
                                  'vectorKnowledgeBaseConfiguration': { 'embeddingModelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'}},
  'knowledgeBaseId': 'NMYAZRK8CK',
  'name': 'metadata-filtering-kb-4171508',
  'roleArn': 'arn:aws:iam::183631345587:role/AmazonBedrockExecutionRoleForKnowledgeBase_4171508',
  'status': 'CREATING',
  'storageConfiguration': { 'opensearchServerlessConfi

### 2.1 Download video game dataset and upload it to Amazon S3

Now that we have created the knowledge base, let's populate it with the `video_games` dataset. The data is downloaded from [here](https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip). It describes fictional video games and contains the title, description, genre, year, publisher, and score of each game.

In [8]:
import os
import zipfile

# Download the zip file
!wget https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip --no-check-certificate

# Unzip the CSV files - they are extracted into a folder named 'video_game'
with zipfile.ZipFile('./30_generated_video_game_records.zip', 'r') as zipf:
    csv_files = [x for x in zipf.infolist() if not x.filename.startswith('__MACOSX/') and x.filename.endswith('.csv')]
    for csv_file in csv_files:
        zipf.extract(csv_file, './')

#remove original zip file
# os.remove('./30_generated_video_game_records.zip')

--2025-12-04 17:17:44--  https://aws-blogs-artifacts-public.s3.amazonaws.com/ML-16482/30_generated_video_game_records.zip
Resolving aws-blogs-artifacts-public.s3.amazonaws.com (aws-blogs-artifacts-public.s3.amazonaws.com)... 3.5.29.155, 52.217.134.89, 3.5.0.101, ...
Connecting to aws-blogs-artifacts-public.s3.amazonaws.com (aws-blogs-artifacts-public.s3.amazonaws.com)|3.5.29.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39949 (39K) [application/zip]
Saving to: ‘30_generated_video_game_records.zip’


2025-12-04 17:17:44 (645 KB/s) - ‘30_generated_video_game_records.zip’ saved [39949/39949]



Let's upload the video game data in the `video_game` folder to Amazon S3.

In [9]:
def upload_directory(path, bucket_name):
    """Upload every file under `path` to the bucket, using the bare file name as the object key."""
    for root, dirs, files in os.walk(path):
        for file in files:
            # Skip macOS Finder metadata files
            if not file.startswith('.DS_Store'):
                file_to_upload = os.path.join(root, file)
                print(f"uploading file {file_to_upload} to {bucket_name}")
                s3_client.upload_file(file_to_upload, bucket_name, file)

upload_directory("video_game", bucket_name)

uploading file video_game/6.csv to metadata-filtering-kb-4171508
uploading file video_game/7.csv to metadata-filtering-kb-4171508
uploading file video_game/5.csv to metadata-filtering-kb-4171508
uploading file video_game/4.csv to metadata-filtering-kb-4171508
uploading file video_game/1.csv to metadata-filtering-kb-4171508
uploading file video_game/3.csv to metadata-filtering-kb-4171508
uploading file video_game/2.csv to metadata-filtering-kb-4171508
uploading file video_game/23.csv to metadata-filtering-kb-4171508
uploading file video_game/22.csv to metadata-filtering-kb-4171508
uploading file video_game/20.csv to metadata-filtering-kb-4171508
uploading file video_game/21.csv to metadata-filtering-kb-4171508
uploading file video_game/25.csv to metadata-filtering-kb-4171508
uploading file video_game/19.csv to metadata-filtering-kb-4171508
uploading file video_game/18.csv to metadata-filtering-kb-4171508
uploading file video_game/30.csv to metadata-filtering-kb-4171508
uploading file vi

Now we start the ingestion job.
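
The helper method used in the next cell hides the underlying API calls. As a rough sketch of what a sync involves, the function below starts an ingestion job with the `bedrock-agent` client and polls it until it finishes. This is an assumption about the general pattern, not the helper's actual implementation; the function name and the polling interval are hypothetical.

```python
import time

# Statuses after which an ingestion job no longer changes
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def run_ingestion_job(agent_client, knowledge_base_id, data_source_id, poll_seconds=5.0):
    """Start an ingestion job and poll until it reaches a terminal status.

    `agent_client` is expected to behave like boto3.client('bedrock-agent').
    """
    job = agent_client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]

    while job["status"] not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        job = agent_client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job
```

The returned job dictionary carries the `statistics` block (documents scanned, indexed, failed), which is a quick way to confirm that the sync did what you expected.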

In [10]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_metadata.start_ingestion_job()

job 1 started successfully

{ 'dataSourceId': 'T9XOMFLWYP',
  'ingestionJobId': 'MSOPAMSNRO',
  'knowledgeBaseId': 'NMYAZRK8CK',
  'startedAt': datetime.datetime(2025, 12, 4, 17, 18, 16, 754173, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 30,
                  'numberOfMetadataDocumentsModified': 0,
                  'numberOfMetadataDocumentsScanned': 0,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 30},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2025, 12, 4, 17, 18, 28, 235594, tzinfo=tzlocal())}
........................................

Finally, we save the knowledge base ID so that we can test the solution at a later stage.

In [11]:
kb_id_metadata = knowledge_base_metadata.get_knowledge_base_id()
# Display the ID
kb_id_metadata

'NMYAZRK8CK'


### 2.2 Query the Knowledge Base with Retrieve and Generate API - without metadata

Let's test the knowledge base using the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.

We will use the following query:

`query = "A strategy game with cool graphic with score of 9.0"`

Expected result:
* Fantasy Kingdoms: Chronicles of Eldoria is a strategy RPG game with a score of 9.0.


In [12]:
query = "A strategy game with cool graphic with score of 9.0"

In [13]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_metadata,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

pprint.pp(response['output']['text'])

('Based on the search results, the game "Mech Warfare: Titan\'s Reign" seems '
 'to fit your criteria of a strategy game with a high score of 9.0. It is '
 'described as an action game where you pilot a giant mech, battle against '
 'enemy mechs, upgrade your mech, and complete missions.')


### 2.3 Prepare metadata for ingestion
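
Knowledge Bases read the metadata for an S3 document from a sidecar file stored next to it and named `<document name>.metadata.json`, which contains a `metadataAttributes` object. A minimal sketch of that file for a single document follows. The helper name, the `example.csv` file, and the attribute values are made up for illustration and are not taken from the dataset.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_sidecar(document_path, attributes):
    """Write `<document>.metadata.json` next to the document and return its path."""
    sidecar = Path(f"{document_path}.metadata.json")
    sidecar.write_text(json.dumps({"metadataAttributes": attributes}))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical one-record document
    doc = Path(tmp) / "example.csv"
    doc.write_text("title,genres,score\nExample Game,Strategy,9.0\n")

    sidecar = write_metadata_sidecar(
        doc, {"genres": "Strategy", "year": 2020, "score": 9.0}
    )
    sidecar_name = sidecar.name
    sidecar_content = json.loads(sidecar.read_text())

print(sidecar_name)
print(sidecar_content)
```

Keeping numeric attributes such as `score` as numbers rather than strings matters later, because comparison operators like `greaterThanOrEquals` only make sense on numeric values.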


In [14]:
import json
import pandas as pd

def generate_metadata(data_dir, metadata_fields):
    """Write a `<name>.csv.metadata.json` file next to every CSV file in `data_dir`.

    Assumes each CSV file holds a single record.
    """
    # Define the metadata attributes
    metadata_attributes = metadata_fields

    # Loop through all CSV files in the directory
    for filename in os.listdir(data_dir):
        filename = f'{data_dir}/{filename}'
        if filename.endswith(".csv"):
            # Read the CSV file and add the file name as an "Id" column
            df = pd.read_csv(filename)
            df["Id"] = [os.path.basename(filename)]

            # Extract the metadata attributes from the first (only) row
            metadata = {k: v[0] for k, v in df[metadata_attributes].to_dict(orient='list').items()}
            # Reorder the keys to match the requested field order
            metadata = {key: metadata[key] for key in metadata_attributes}

            # Create a JSON object in the format expected by Knowledge Bases
            json_data = {"metadataAttributes": metadata}

            # Write the JSON object to a file
            with open(f"{filename.replace('.csv', '.csv.metadata.json')}", "w") as f:
                json.dump(json_data, f)

In [15]:
data_dir = './video_game'
metadata_fields = ["Id", "genres", "year", "publisher", "score"]

generate_metadata(data_dir, metadata_fields)

In [16]:
# upload metadata file to S3
upload_directory("video_game", bucket_name)

uploading file video_game/6.csv to metadata-filtering-kb-4171508
uploading file video_game/7.csv to metadata-filtering-kb-4171508
uploading file video_game/5.csv to metadata-filtering-kb-4171508
uploading file video_game/4.csv to metadata-filtering-kb-4171508
uploading file video_game/1.csv to metadata-filtering-kb-4171508
uploading file video_game/3.csv to metadata-filtering-kb-4171508
uploading file video_game/2.csv to metadata-filtering-kb-4171508
uploading file video_game/23.csv to metadata-filtering-kb-4171508
uploading file video_game/22.csv to metadata-filtering-kb-4171508
uploading file video_game/20.csv to metadata-filtering-kb-4171508
uploading file video_game/21.csv to metadata-filtering-kb-4171508
uploading file video_game/25.csv to metadata-filtering-kb-4171508
uploading file video_game/19.csv to metadata-filtering-kb-4171508
uploading file video_game/18.csv to metadata-filtering-kb-4171508
uploading file video_game/30.csv to metadata-filtering-kb-4171508
uploading file vi

In [17]:
# delete metadata files from local
data_dir = './video_game'
for filename in os.listdir(data_dir):
    filename= f'{data_dir}/{filename}'
    if filename.endswith(".csv.metadata.json"):
        os.remove(filename)

Now start the ingestion job again. The documents themselves are unchanged and already in the S3 bucket, so this sync only needs to pick up the metadata files uploaded in the previous step.

In [18]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_metadata.start_ingestion_job()

job 1 started successfully

{ 'dataSourceId': 'T9XOMFLWYP',
  'ingestionJobId': 'M71BPOWF24',
  'knowledgeBaseId': 'NMYAZRK8CK',
  'startedAt': datetime.datetime(2025, 12, 4, 17, 19, 51, 224301, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 30,
                  'numberOfMetadataDocumentsModified': 30,
                  'numberOfMetadataDocumentsScanned': 30,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 0},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2025, 12, 4, 17, 20, 1, 672621, tzinfo=tzlocal())}
........................................

### 2.4 Query the Knowledge Base with Retrieve and Generate API - with metadata

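Create a filter that keeps only documents whose `genres` attribute equals `Strategy` and whose `score` is at least 9.0.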

In [19]:
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "genres",
                "value": "Strategy"
            }
        },
        {
            "greaterThanOrEquals": {
                "key": "score",
                "value": 9.0
            }
        }
    ]
}
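
To make the filter semantics concrete, here is a small local evaluator for the two operators used above, plus `orAll` for combining alternatives. It only illustrates how such an expression is read; the real filtering happens inside the vector store at retrieval time. The function name and the sample metadata records are hypothetical.

```python
def matches(filter_expr: dict, metadata: dict) -> bool:
    """Return True if `metadata` satisfies a (possibly nested) filter expression."""
    # Each filter expression is a dict with exactly one operator key
    operator, operand = next(iter(filter_expr.items()))

    if operator == "andAll":
        return all(matches(sub, metadata) for sub in operand)
    if operator == "orAll":
        return any(matches(sub, metadata) for sub in operand)

    value = metadata.get(operand["key"])
    if value is None:
        # A document without the attribute cannot satisfy a comparison
        return False
    if operator == "equals":
        return value == operand["value"]
    if operator == "greaterThanOrEquals":
        return value >= operand["value"]
    raise ValueError(f"Unsupported operator in this sketch: {operator}")


example_filter = {
    "andAll": [
        {"equals": {"key": "genres", "value": "Strategy"}},
        {"greaterThanOrEquals": {"key": "score", "value": 9.0}},
    ]
}

# Made-up metadata records, not taken from the dataset
sample_documents = [
    {"Id": "a.csv", "genres": "Strategy", "score": 9.0},
    {"Id": "b.csv", "genres": "Action", "score": 9.5},
    {"Id": "c.csv", "genres": "Strategy", "score": 7.2},
]

kept_ids = [doc["Id"] for doc in sample_documents if matches(example_filter, doc)]
print(kept_ids)  # ['a.csv']
```

Only the first record passes: the second fails the genre check and the third fails the score check, which is the same narrowing the knowledge base applies before similarity search results are returned.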

Pass the filter in the `retrievalConfiguration` of the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API.

In [20]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_metadata,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5,
                    "filter": one_group_filter
                } 
            }
        }
    }
)

print(response['output']['text'])

Based on the search result, Fantasy Kingdoms: Chronicles of Eldoria is a strategy RPG game with a score of 9.0. It features building and managing your own medieval kingdom, recruiting heroes, constructing buildings, and engaging in epic battles against enemies.
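
If you want to see which chunks pass the filter without generating an answer, the same filter can be sent to the `retrieve` API of the `bedrock-agent-runtime` client. The wrapper below is a minimal sketch; the function name and the shape of the returned dictionaries are choices made for this example.

```python
def retrieve_chunks(runtime_client, knowledge_base_id, query_text,
                    metadata_filter=None, top_k=5):
    """Return the retrieved chunks (text, score, metadata) for a query.

    `runtime_client` is expected to behave like
    boto3.client('bedrock-agent-runtime').
    """
    search_config = {"numberOfResults": top_k}
    if metadata_filter is not None:
        search_config["filter"] = metadata_filter

    response = runtime_client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query_text},
        retrievalConfiguration={"vectorSearchConfiguration": search_config},
    )
    return [
        {
            "text": result.get("content", {}).get("text", ""),
            "score": result.get("score"),
            "metadata": result.get("metadata", {}),
        }
        for result in response["retrievalResults"]
    ]
```

In this notebook it could be called as `retrieve_chunks(bedrock_agent_runtime_client, kb_id_metadata, query, one_group_filter)`, and every returned chunk should then carry `genres` equal to `Strategy` and a `score` of at least 9.0 in its metadata.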


As you can see, the retrieve and generate API returns the final response directly. Now let's look at the citations returned by the `RetrieveAndGenerate` API, that is, the retrieved chunks that the model used while generating the response. When the foundation model receives relevant context along with the query, it is much more likely to generate a high-quality response.

In [21]:
# response_metadata = response['citations'][0]['retrievedReferences']
# print("# of citations or chunks used to generate the response: ", len(response_metadata))
# def citations_rag_print(response_ret):
# #structure 'retrievalResults': list of contents. Each list has content, location, score, metadata
#     for num,chunk in enumerate(response_ret,1):
#         print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
#         print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
#         print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

# citations_rag_print(response_metadata)
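
As an alternative to the commented-out cell above, which looks only at the first citation, the sketch below walks every citation in a `retrieve_and_generate` response and collects the referenced chunks. The function name is hypothetical, and the sample response is hand-written to mirror the response shape; it is not output from this notebook.

```python
def collect_references(rag_response: dict) -> list:
    """Flatten all retrieved references across all citations of a response."""
    references = []
    for citation in rag_response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            references.append(
                {
                    "text": ref.get("content", {}).get("text", ""),
                    "location": ref.get("location", {}),
                    "metadata": ref.get("metadata", {}),
                }
            )
    return references


# Hand-written stand-in for a real response
sample_response = {
    "output": {"text": "An example answer."},
    "citations": [
        {
            "retrievedReferences": [
                {
                    "content": {"text": "Example chunk text"},
                    "location": {
                        "type": "S3",
                        "s3Location": {"uri": "s3://example-bucket/example.csv"},
                    },
                    "metadata": {"genres": "Strategy", "score": 9.0},
                }
            ]
        },
        {"retrievedReferences": []},
    ],
}

sample_references = collect_references(sample_response)
print(len(sample_references))  # 1
for ref in sample_references:
    print(ref["location"], ref["metadata"])
```

Checking the `metadata` of each reference is a simple way to confirm that every chunk the model saw actually satisfied the filter.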

### Clean up
Please make sure to run the cells below to delete the resources created in this notebook. If you plan to run the `dynamic-metadata-filtering` notebook in the `03-advanced-concepts` section, run it first and then come back here to delete the resources.

In [22]:
# Empty and delete S3 Bucket

objects = s3_client.list_objects(Bucket=bucket_name)  
if 'Contents' in objects:
    for obj in objects['Contents']:
        s3_client.delete_object(Bucket=bucket_name, Key=obj['Key']) 
s3_client.delete_bucket(Bucket=bucket_name)

{'ResponseMetadata': {'RequestId': '9T20TXMQKD7KXXC3',
  'HostId': 'neydIKSv5QnvCZjx9cB9pxtdkjn6anGaDVcCz1xqwxvHLVjkC7FQnXld9HIzJFtIhLHX3RYBC2vI735yFleTWPHpu+4NCDMKEj16VA5i+3U=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'neydIKSv5QnvCZjx9cB9pxtdkjn6anGaDVcCz1xqwxvHLVjkC7FQnXld9HIzJFtIhLHX3RYBC2vI735yFleTWPHpu+4NCDMKEj16VA5i+3U=',
   'x-amz-request-id': '9T20TXMQKD7KXXC3',
   'date': 'Thu, 04 Dec 2025 17:20:46 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
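
The cell above is sufficient for the 30 small files used here. For larger buckets, note that a single listing call returns at most 1,000 keys, so a paginated variant may be safer. The following is a sketch under that assumption; the function name is hypothetical, and it does not handle versioned buckets.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete all objects in a bucket page by page and return how many were deleted.

    `s3_client` is expected to behave like boto3.client('s3').
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket_name):
        # Each page holds at most 1,000 keys, which is also the
        # maximum batch size accepted by delete_objects
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted
```

After `empty_bucket(s3_client, bucket_name)` returns, `s3_client.delete_bucket(Bucket=bucket_name)` can be called as in the cell above.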

In [23]:
# print("===============================Knowledge base==============================")
knowledge_base_metadata.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)

Deleted data source T9XOMFLWYP
Bucket metadata-filtering-kb-4171508 does not exist, skipping deletion
Found role AmazonBedrockExecutionRoleForKnowledgeBase_4171508
 [{'PolicyName': 'AmazonBedrockOSSPolicyForKnowledgeBase_4171508', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockOSSPolicyForKnowledgeBase_4171508'}, {'PolicyName': 'AmazonBedrockFoundationModelPolicyForKnowledgeBase_4171508', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockFoundationModelPolicyForKnowledgeBase_4171508'}, {'PolicyName': 'AmazonBedrockCloudWatchPolicyForKnowledgeBase_4171508', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockCloudWatchPolicyForKnowledgeBase_4171508'}, {'PolicyName': 'AmazonBedrockS3PolicyForKnowledgeBase_4171508', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockS3PolicyForKnowledgeBase_4171508'}]
Detached policy AmazonBedrockOSSPolicyForKnowledgeBase_4171508 from role AmazonBedrockExecutionRoleForKnowledgeBase_4171508
Deleted policy AmazonBedro