# CSV metadata customization walkthrough
This notebook walks through sample code for the 'CSV metadata customization' feature of Amazon Bedrock Knowledge Bases, which enhances .csv file processing by separating content fields from metadata fields.

For more details on this feature, please read this [blog](https://aws.amazon.com/blogs/machine-learning/knowledge-bases-for-amazon-bedrock-now-supports-advanced-parsing-chunking-and-query-reformulation-giving-greater-control-of-accuracy-in-rag-based-applications/#:~:text=Machine%20Learning%20Blog-,Knowledge%20Bases%20for%20Amazon%20Bedrock%20now%20supports%20advanced%20parsing%2C%20chunking,accuracy%20in%20RAG%20based%20applications).

## 1. Import the needed libraries
The first step is to install the prerequisite packages.

In [1]:
%pip install --upgrade pip --quiet
%pip install -r ../requirements.txt --no-deps --quiet
%pip install -r ../requirements.txt --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [3]:
import botocore
botocore.__version__

'1.42.2'

This code is part of the setup. It:
- Adds the parent directory to the Python system path
- Imports a custom module (`BedrockKnowledgeBase`) from `utils`, which later cells depend on

In [4]:
import sys
import logging
from pathlib import Path
import os
import time
import boto3
import pprint
import json

current_path = Path().resolve()
current_path = current_path.parent

if str(current_path) not in sys.path:
    sys.path.append(str(current_path))

# Print sys.path to verify
print(sys.path)

from utils.knowledge_base import BedrockKnowledgeBase

['/opt/conda/lib/python312.zip', '/opt/conda/lib/python3.12', '/opt/conda/lib/python3.12/lib-dynload', '', '/opt/conda/lib/python3.12/site-packages', '/home/sagemaker-user/rag-workshop-amazon-bedrock-knowledge-bases']


In [5]:
#Clients
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region =  session.region_name
account_id = sts_client.get_caller_identity()["Account"]
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime') 
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
region, account_id

('us-west-2', '183631345587')

In [6]:
import time

# Get the current timestamp
current_time = time.time()

# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"
knowledge_base_name_standard = 'csv-metadata-kb'
knowledge_base_name_hierarchical = 'hierarchical-kb'
knowledge_base_description = "Knowledge Base csv metadata customization."
bucket_name = f'{knowledge_base_name_standard}-{suffix}'
foundation_model = "anthropic.claude-3-sonnet-20240229-v1:0"

# Define data sources
data_source=[{"type": "S3", "bucket_name": bucket_name}]

## 2. Create a knowledge base with a fixed chunking strategy
Let's start by creating an [Amazon Bedrock Knowledge Bases](https://aws.amazon.com/bedrock/knowledge-bases/) knowledge base to store video game data in CSV format. Knowledge Bases can integrate with different vector databases, including [Amazon OpenSearch Serverless](https://aws.amazon.com/opensearch-service/features/serverless/), [Amazon Aurora](https://aws.amazon.com/rds/aurora/), [Pinecone](http://app.pinecone.io/bedrock-integration), Redis Enterprise and MongoDB Atlas. For this example, we will integrate the knowledge base with Amazon OpenSearch Serverless. To do so, we will use the helper class `BedrockKnowledgeBase`, which creates the knowledge base and all of its prerequisites:
1. IAM roles and policies
2. S3 bucket
3. Amazon OpenSearch Serverless encryption, network and data access policies
4. Amazon OpenSearch Serverless collection
5. Amazon OpenSearch Serverless vector index
6. Knowledge base
7. Knowledge base data source

We will create a knowledge base using the fixed-size chunking strategy.

You can choose a different chunking strategy by changing the parameter value below:
```
"chunkingStrategy": "FIXED_SIZE | NONE | HIERARCHICAL | SEMANTIC"
```
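
For orientation, the sketch below shows how each strategy value could map to the `chunkingConfiguration` structure that the `CreateDataSource` API accepts under `vectorIngestionConfiguration`. The function name `build_chunking_configuration` and all token sizes are illustrative assumptions; the `BedrockKnowledgeBase` helper may use different values internally.

```python
def build_chunking_configuration(strategy: str) -> dict:
    """Return a chunkingConfiguration dict for the given strategy.

    All numeric values are illustrative assumptions, not the helper's defaults.
    """
    if strategy == "NONE":
        # Each file is treated as a single chunk.
        return {"chunkingStrategy": "NONE"}
    if strategy == "FIXED_SIZE":
        return {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 20,
            },
        }
    if strategy == "HIERARCHICAL":
        return {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Parent level first, then child level.
                "levelConfigurations": [{"maxTokens": 1500}, {"maxTokens": 300}],
                "overlapTokens": 60,
            },
        }
    if strategy == "SEMANTIC":
        return {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": 300,
                "bufferSize": 1,
                "breakpointPercentileThreshold": 95,
            },
        }
    raise ValueError(f"Unsupported chunking strategy: {strategy}")


print(build_chunking_configuration("FIXED_SIZE"))
```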

In [7]:
knowledge_base_standard = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name_standard}-{suffix}',
    kb_description=knowledge_base_description,
    data_sources=data_source, 
    chunking_strategy = "FIXED_SIZE", 
    suffix = suffix
)

Step 1 - Creating or retrieving S3 bucket(s) for Knowledge Base documents
['csv-metadata-kb-4170839']
buckets_to_check:  ['csv-metadata-kb-4170839']
Creating bucket csv-metadata-kb-4170839
Step 2 - Creating Knowledge Base Execution Role (AmazonBedrockExecutionRoleForKnowledgeBase_4170839) and Policies
Step 3a - Creating OSS encryption, network and data access policies
Step 3b - Creating OSS Collection (this step takes a couple of minutes to complete)
{ 'ResponseMetadata': { 'HTTPHeaders': { 'connection': 'keep-alive',
                                         'content-length': '318',
                                         'content-type': 'application/x-amz-json-1.0',
                                         'date': 'Thu, 04 Dec 2025 17:08:42 '
                                                 'GMT',
                                         'x-amzn-requestid': 'dcab558c-b19b-4fb6-ac33-840f53c03617'},
                        'HTTPStatusCode': 200,
                        'RequestId': 'dc

[2025-12-04 17:10:13,489] p6438 {base.py:258} INFO - PUT https://0znvqm1xha6vhp5awqpj.us-west-2.aoss.amazonaws.com:443/bedrock-sample-rag-index-4170839 [status:200 request:0.467s]



Creating index:
{ 'acknowledged': True,
  'index': 'bedrock-sample-rag-index-4170839',
  'shards_acknowledged': True}
Step 4 - Will create Lambda Function if chunking strategy selected as CUSTOM
Not creating lambda function as chunking strategy is FIXED_SIZE
Step 5 - Creating Knowledge Base
{ 'createdAt': datetime.datetime(2025, 12, 4, 17, 11, 13, 596540, tzinfo=tzlocal()),
  'description': 'Knowledge Base csv metadata customization.',
  'knowledgeBaseArn': 'arn:aws:bedrock:us-west-2:183631345587:knowledge-base/EH2ITAYFQC',
  'knowledgeBaseConfiguration': { 'type': 'VECTOR',
                                  'vectorKnowledgeBaseConfiguration': { 'embeddingModelArn': 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'}},
  'knowledgeBaseId': 'EH2ITAYFQC',
  'name': 'csv-metadata-kb-4170839',
  'roleArn': 'arn:aws:iam::183631345587:role/AmazonBedrockExecutionRoleForKnowledgeBase_4170839',
  'status': 'CREATING',
  'storageConfiguration': { 'opensearchServerlessCon

### 2.1 Download csv dataset and upload it to Amazon S3
Now that we have created the knowledge base, let's populate it with the `video_games.csv` dataset. The data is downloaded from [here](https://github.com/ali-ce/datasets/blob/master/Most-Expensive-Things/Videogames.csv). It contains video game sales data originally collected by Alice Corona and is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://github.com/ali-ce/datasets/blob/master/README.md#:~:text=Creative%20Commons%20Attribution%2DShareAlike%204.0%20International%20License.).


The Knowledge Base data source expects the data to be available in the S3 bucket connected to it. Changes to the data can be synchronized to the knowledge base using the `StartIngestionJob` API call. In this example we will use the [boto3 abstraction](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent/client/start_ingestion_job.html) of the API, via our helper classes.
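
As a rough illustration of what such a sync involves, the sketch below starts an ingestion job and polls `get_ingestion_job` until the job reaches a terminal state. The function name `run_ingestion_job`, the polling interval, and the timeout are assumptions made for this example; the notebook's helper class may implement the wait differently.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def run_ingestion_job(client, knowledge_base_id, data_source_id,
                      poll_seconds=10, timeout_seconds=1800):
    """Start an ingestion job and poll until it reaches a terminal state.

    `client` is expected to be a boto3 'bedrock-agent' client.
    """
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]

    deadline = time.monotonic() + timeout_seconds
    while job["status"] not in TERMINAL_STATES:
        if time.monotonic() > deadline:
            raise TimeoutError(f"Ingestion job {job['ingestionJobId']} did not finish in time")
        time.sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job

# Example call (requires AWS credentials and existing resources):
# job = run_ingestion_job(boto3.client("bedrock-agent"), "<knowledge-base-id>", "<data-source-id>")
```

The returned job dictionary carries the final `status` and the `statistics` block, so a caller can check for failed documents before querying.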

In [8]:
!mkdir -p ./csv_data

In [9]:
!wget https://raw.githubusercontent.com/ali-ce/datasets/master/Most-Expensive-Things/Videogames.csv --no-check-certificate -O ./csv_data/video_games.csv

--2025-12-04 17:11:15--  https://raw.githubusercontent.com/ali-ce/datasets/master/Most-Expensive-Things/Videogames.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23931 (23K) [text/plain]
Saving to: ‘./csv_data/video_games.csv’


2025-12-04 17:11:16 (77.2 MB/s) - ‘./csv_data/video_games.csv’ saved [23931/23931]



Let's upload the video game data in the `csv_data` folder to S3.

In [10]:
def upload_directory(path, bucket_name):
    """Upload every file under `path` to the bucket, using the file name as the object key."""
    for root, dirs, files in os.walk(path):
        for file in files:
            file_to_upload = os.path.join(root, file)
            print(f"uploading file {file_to_upload} to {bucket_name}")
            s3_client.upload_file(file_to_upload, bucket_name, file)

upload_directory("csv_data", bucket_name)

uploading file csv_data/video_games.csv to csv-metadata-kb-4170839


Now we start the ingestion job.

In [11]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_standard.start_ingestion_job()

job 1 started successfully

{ 'dataSourceId': 'LAAWA6XYEI',
  'ingestionJobId': 'KBDAP0NZNF',
  'knowledgeBaseId': 'EH2ITAYFQC',
  'startedAt': datetime.datetime(2025, 12, 4, 17, 11, 47, 171707, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 1,
                  'numberOfMetadataDocumentsModified': 0,
                  'numberOfMetadataDocumentsScanned': 0,
                  'numberOfModifiedDocumentsIndexed': 0,
                  'numberOfNewDocumentsIndexed': 1},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2025, 12, 4, 17, 12, 9, 995013, tzinfo=tzlocal())}
........................................

Finally, we save the knowledge base ID so that we can test the solution later.

In [12]:
kb_id_standard = knowledge_base_standard.get_knowledge_base_id()

'EH2ITAYFQC'


### 2.2 Query the Knowledge Base with Retrieve and Generate API - without metadata

Let's test the knowledge base using the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a foundation model from Bedrock.
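
To look at the retrieval step on its own, the `retrieve` API of the same runtime client returns the raw chunks without generating an answer. The sketch below is a minimal wrapper around it; the function name `retrieve_chunks` is an assumption for this example and is not part of the notebook's helpers.

```python
def retrieve_chunks(client, knowledge_base_id, query, number_of_results=5):
    """Return (score, text) pairs for the chunks retrieved for a query.

    `client` is expected to be a boto3 'bedrock-agent-runtime' client.
    """
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": number_of_results}
        },
    )
    return [
        (result.get("score"), result["content"]["text"])
        for result in response["retrievalResults"]
    ]

# Example call (requires AWS credentials and an existing knowledge base):
# for score, text in retrieve_chunks(bedrock_agent_runtime_client, kb_id_standard, "Rockstar Games"):
#     print(score, text[:80])
```

Inspecting these chunks can help explain an unexpected generated answer, since the model only sees what the retriever returns.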

```
query = "List the video games published by Rockstar Games and released after 2010"
```

Expected Results: Grand Theft Auto V, L.A. Noire, Max Payne 3


In [13]:
query = "Provide a list of all video games published by Rockstar Games and released after 2010"

In [14]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_standard,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

pprint.pp(response['output']['text'])

('Based on the search results, the only video game published by Rockstar Games '
 'and released after 2010 that is mentioned is Grand Theft Auto IV. It was '
 'released on April 29, 2008 for PlayStation 3 and Xbox 360, and on December '
 '2, 2008 for Microsoft Windows, so it does not meet the criteria of being '
 'released after 2010.')


### 2.3 Prepare metadata for ingestion


In [15]:
import csv
import json

In [16]:
def generate_json_metadata(csv_file, content_field, metadata_fields, excluded_fields):
    # Open the CSV file and read its contents
    with open(csv_file, 'r') as file:
        reader = csv.DictReader(file)
        headers = reader.fieldnames

    # Create the JSON structure
    json_data = {
        "metadataAttributes": {},
        "documentStructureConfiguration": {
            "type": "RECORD_BASED_STRUCTURE_METADATA",
            "recordBasedStructureMetadata": {
                "contentFields": [
                    {
                        "fieldName": content_field
                    }
                ],
                "metadataFieldsSpecification": {
                    "fieldsToInclude": [],
                    "fieldsToExclude": []
                }
            }
        }
    }

    # Add metadata fields to include
    for field in metadata_fields:
        json_data["documentStructureConfiguration"]["recordBasedStructureMetadata"]["metadataFieldsSpecification"]["fieldsToInclude"].append(
            {
                "fieldName": field
            }
        )

    # Add fields to exclude (all fields not in content_field or metadata_fields)
    if not excluded_fields:
        excluded_fields = set(headers) - set([content_field] + metadata_fields)
    
    for field in excluded_fields:
        json_data["documentStructureConfiguration"]["recordBasedStructureMetadata"]["metadataFieldsSpecification"]["fieldsToExclude"].append(
            {
                "fieldName": field
            }
        )

    # Generate the output file name: <source file name>.metadata.json
    # (appending to the full path avoids breaking on paths that contain extra dots)
    output_file = f"{csv_file}.metadata.json"

    # Write the JSON data to the output file
    with open(output_file, 'w') as file:
        json.dump(json_data, file, indent=4)

    print(f"JSON metadata file '{output_file}' has been generated.")

In [17]:
csv_file = 'csv_data/video_games.csv'
content_field = 'Videogame'
metadata_fields = ['Year', 'Developer', 'Publisher']
excluded_fields =['Description']

generate_json_metadata(csv_file, content_field, metadata_fields, excluded_fields)

JSON metadata file 'csv_data/video_games.csv.metadata.json' has been generated.
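
Before uploading, it can be useful to confirm that the field names passed to the generator actually exist in the CSV header and do not contradict each other. The sketch below is one possible check; the function name `check_metadata_fields` and the sample row are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def check_metadata_fields(csv_text, content_field, metadata_fields, excluded_fields):
    """Return a list of problems; an empty list means the field lists look consistent."""
    headers = csv.DictReader(io.StringIO(csv_text)).fieldnames or []
    problems = []

    # Every referenced field must be a real column.
    for name in [content_field, *metadata_fields, *excluded_fields]:
        if name not in headers:
            problems.append(f"'{name}' is not a column in the CSV header")

    # A field cannot be both included and excluded.
    overlap = set(metadata_fields) & set(excluded_fields)
    if overlap:
        problems.append(f"fields both included and excluded: {sorted(overlap)}")

    # The content field should not also be listed as metadata.
    if content_field in metadata_fields:
        problems.append(f"content field '{content_field}' is also listed as metadata")
    return problems


# Made-up sample with the column names used in this notebook.
sample = (
    "Videogame,Year,Developer,Publisher,Description\n"
    "Example Game,2011,Example Studio,Example Publisher,An example row\n"
)
print(check_metadata_fields(sample, "Videogame", ["Year", "Developer", "Publisher"], ["Description"]))
```

To run the check against the real file, pass its contents instead of `sample`, for example `open('csv_data/video_games.csv').read()`.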


In [18]:
# upload metadata file to S3
upload_directory("csv_data", bucket_name)

# delete metadata file from local
os.remove('csv_data/video_games.csv.metadata.json')

uploading file csv_data/video_games.csv to csv-metadata-kb-4170839
uploading file csv_data/video_games.csv.metadata.json to csv-metadata-kb-4170839


Now start the ingestion job again. The CSV file is the same one ingested earlier; this sync picks up the newly uploaded `video_games.csv.metadata.json` file and re-indexes the document with its metadata.

In [19]:
# ensure that the kb is available
time.sleep(30)
# sync knowledge base
knowledge_base_standard.start_ingestion_job()

job 1 started successfully

{ 'dataSourceId': 'LAAWA6XYEI',
  'ingestionJobId': '5MYI8HXTIF',
  'knowledgeBaseId': 'EH2ITAYFQC',
  'startedAt': datetime.datetime(2025, 12, 4, 17, 13, 23, 740999, tzinfo=tzlocal()),
  'statistics': { 'numberOfDocumentsDeleted': 0,
                  'numberOfDocumentsFailed': 0,
                  'numberOfDocumentsScanned': 1,
                  'numberOfMetadataDocumentsModified': 0,
                  'numberOfMetadataDocumentsScanned': 1,
                  'numberOfModifiedDocumentsIndexed': 1,
                  'numberOfNewDocumentsIndexed': 0},
  'status': 'COMPLETE',
  'updatedAt': datetime.datetime(2025, 12, 4, 17, 13, 52, 851825, tzinfo=tzlocal())}
........................................

### 2.4 Query the Knowledge Base with Retrieve and Generate API - with metadata

Create the metadata filter.

In [20]:
one_group_filter= {
    "andAll": [
        {
            "equals": {
                "key": "Publisher",
                "value": "Rockstar Games"
            }
        },
        {
            "greaterThan": {
                "key": "Year",
                "value": 2010
            }
        }
    ]
}
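
The same filter can also be assembled from small helper functions, which keeps larger filters readable. The helper names `equals`, `greater_than`, and `and_all` are assumptions for this sketch; only the resulting dictionary keys (`equals`, `greaterThan`, `andAll`) come from the filter format used above.

```python
def equals(key, value):
    """Match records whose metadata attribute equals the value."""
    return {"equals": {"key": key, "value": value}}


def greater_than(key, value):
    """Match records whose metadata attribute is greater than the value."""
    return {"greaterThan": {"key": key, "value": value}}


def and_all(*conditions):
    """Combine conditions so that all of them must hold."""
    if len(conditions) < 2:
        # Assumption: a group needs at least two conditions to be meaningful.
        raise ValueError("and_all needs at least two conditions")
    return {"andAll": list(conditions)}


rockstar_after_2010 = and_all(
    equals("Publisher", "Rockstar Games"),
    greater_than("Year", 2010),
)
print(rockstar_after_2010)
```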

Pass the filter to the `retrievalConfiguration` of the [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) API call.

In [21]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": query
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id_standard,
            "modelArn": "arn:aws:bedrock:{}::foundation-model/{}".format(region, foundation_model),
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5,
                    "filter": one_group_filter
                } 
            }
        }
    }
)

print(response['output']['text'])

Here are some video games published by Rockstar Games and released after 2010:

- L.A. Noire (2011)
- Max Payne 3 (2012)
- Grand Theft Auto V (2013)


As you can see, the Retrieve and Generate API returns the final response directly. Now let's look at the citations returned by the `RetrieveAndGenerate` API, along with the retrieved chunks the model used while generating the response. When the foundation model receives relevant context along with the query, it is much more likely to generate a high-quality response.
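
A response can contain more than one citation entry, each with its own `retrievedReferences`. The sketch below gathers the references from every citation and drops repeated chunks; the function name `collect_references` is an assumption for this example, and the chunk ID key is the one visible in the metadata printed by this notebook.

```python
def collect_references(response):
    """Gather retrievedReferences from every citation, skipping repeated chunks."""
    seen = set()
    references = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            # Prefer the chunk ID for de-duplication; fall back to the chunk text.
            chunk_id = ref.get("metadata", {}).get("x-amz-bedrock-kb-chunk-id")
            key = chunk_id or ref["content"]["text"]
            if key in seen:
                continue
            seen.add(key)
            references.append(ref)
    return references

# Example call on a retrieve_and_generate response:
# all_references = collect_references(response)
```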

In [22]:
response_standard = response['citations'][0]['retrievedReferences']
print("# of citations or chunks used to generate the response: ", len(response_standard))

def citations_rag_print(response_ret):
    # response_ret is a list of retrieved references; each has content, location, and metadata
    for num, chunk in enumerate(response_ret, 1):
        print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
        print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
        print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

citations_rag_print(response_standard)

# of citations or chunks used to generate the response:  3
Chunk 1:  L.A. Noire

Chunk 1 Location:  {'s3Location': {'uri': 's3://csv-metadata-kb-4170839/video_games.csv'}, 'type': 'S3'}

Chunk 1 Metadata:  {'x-amz-bedrock-kb-source-uri': 's3://csv-metadata-kb-4170839/video_games.csv', 'Year': '2011', 'x-amz-bedrock-kb-data-source-id': 'LAAWA6XYEI', 'x-amz-bedrock-kb-source-file-modality': 'TEXT', 'Developer': 'Team Bondi', 'Publisher': 'Rockstar Games', 'x-amz-bedrock-kb-chunk-id': '1%3A0%3AS5tb6poBAYJI6zCyGARg'}

Chunk 2:  Max Payne 3

Chunk 2 Location:  {'s3Location': {'uri': 's3://csv-metadata-kb-4170839/video_games.csv'}, 'type': 'S3'}

Chunk 2 Metadata:  {'x-amz-bedrock-kb-source-uri': 's3://csv-metadata-kb-4170839/video_games.csv', 'Year': '2012', 'x-amz-bedrock-kb-data-source-id': 'LAAWA6XYEI', 'x-amz-bedrock-kb-source-file-modality': 'TEXT', 'Developer': 'Rockstar Studios', 'Publisher': 'Rockstar Games', 'x-amz-bedrock-kb-chunk-id': '1%3A0%3ATJtb6poBAYJI6zCyGARg'}

Chunk 3:  Gr

In [23]:
%store kb_id_standard

Stored 'kb_id_standard' (str)


### Clean up
Please make sure to uncomment and run the cells below to delete the resources created in this notebook. If you plan to run the `dynamic-metadata-filtering` notebook in the `03-advanced-concepts` section, come back here afterwards to delete the resources.

In [24]:
# # Empty and delete S3 Bucket

# objects = s3_client.list_objects(Bucket=bucket_name)  
# if 'Contents' in objects:
#     for obj in objects['Contents']:
#         s3_client.delete_object(Bucket=bucket_name, Key=obj['Key']) 
# s3_client.delete_bucket(Bucket=bucket_name)
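
A single `list_objects` call returns at most 1,000 keys, so for larger buckets a paginated variant may be safer. The sketch below is one way to do that; the function name `empty_bucket` is an assumption for this example, and it does not handle versioned buckets.

```python
def empty_bucket(client, bucket):
    """Delete every object in the bucket, one page of keys at a time.

    `client` is expected to be a boto3 's3' client. Returns the number of keys deleted.
    """
    deleted = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # delete_objects accepts up to 1,000 keys, which matches the page size.
            client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted

# Example call (requires AWS credentials):
# empty_bucket(s3_client, bucket_name)
# s3_client.delete_bucket(Bucket=bucket_name)
```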

In [25]:
# print("===============================Knowledge base==============================")
knowledge_base_standard.delete_kb(delete_s3_bucket=True, delete_iam_roles_and_policies=True)

Deleted data source LAAWA6XYEI
Found bucket csv-metadata-kb-4170839
Deleted all objects in bucket csv-metadata-kb-4170839
Deleted bucket csv-metadata-kb-4170839
Found role AmazonBedrockExecutionRoleForKnowledgeBase_4170839
 [{'PolicyName': 'AmazonBedrockOSSPolicyForKnowledgeBase_4170839', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockOSSPolicyForKnowledgeBase_4170839'}, {'PolicyName': 'AmazonBedrockCloudWatchPolicyForKnowledgeBase_4170839', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockCloudWatchPolicyForKnowledgeBase_4170839'}, {'PolicyName': 'AmazonBedrockS3PolicyForKnowledgeBase_4170839', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockS3PolicyForKnowledgeBase_4170839'}, {'PolicyName': 'AmazonBedrockFoundationModelPolicyForKnowledgeBase_4170839', 'PolicyArn': 'arn:aws:iam::183631345587:policy/AmazonBedrockFoundationModelPolicyForKnowledgeBase_4170839'}]
Detached policy AmazonBedrockOSSPolicyForKnowledgeBase_4170839 from role AmazonBedrockExecu