# S3 Vectors API Testing with Boto3 SDK

This notebook demonstrates how to test the S3 Vectors API using the boto3 SDK with a custom service model. This approach provides a native AWS SDK experience with proper authentication, retry logic, and error handling.

Make sure the FastAPI server is running on localhost:8000 before executing these cells.
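
A quick way to confirm that something is listening before running the cells is a plain TCP connection attempt. The sketch below uses only the standard library. `is_port_open` is a hypothetical helper name, and the host and port simply mirror the `localhost:8000` address mentioned above. It only checks that the port accepts connections, not that the S3 Vectors API behind it is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_port_open("127.0.0.1", 8000):
        print("Server is accepting connections on 127.0.0.1:8000")
    else:
        print("Nothing is listening on 127.0.0.1:8000; start the server first")
```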

## 🔧 Environment Configuration Guide

Before running this notebook, update the configuration in the first code cell based on your environment:

### 🏠 Local Development
```python
ENDPOINT_URL = "http://localhost:8000"
AWS_ACCESS_KEY_ID = "test"
AWS_SECRET_ACCESS_KEY = "test"
```

### 🌐 Remote Development Server
```python
ENDPOINT_URL = "http://your-dev-server.com:8000"
AWS_ACCESS_KEY_ID = "your-dev-access-key"
AWS_SECRET_ACCESS_KEY = "your-dev-secret-key"
```

### ☁️ Production/Staging Environment
```python
import os  # needed for os.getenv below

ENDPOINT_URL = "https://s3vectors-api.your-company.com"
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
```

### 🐳 Docker Environment
```python
ENDPOINT_URL = "http://s3vectors-container:8000"
AWS_ACCESS_KEY_ID = "docker-test"
AWS_SECRET_ACCESS_KEY = "docker-test"
```

⚠️ **Security Note**: Never commit real credentials to version control. Use environment variables or secure credential management for production.
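
As a minimal sketch of the environment-variable approach, the helper below reads the endpoint and credentials from the environment and falls back to the local development values shown earlier. `load_settings`, `mask`, and the `S3VECTORS_ENDPOINT_URL` variable name are assumptions made for illustration; only `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are standard names.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read endpoint and credentials from environment variables.

    S3VECTORS_ENDPOINT_URL is a hypothetical variable name chosen for this
    sketch; the defaults match the local development values above.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env.get("S3VECTORS_ENDPOINT_URL", "http://localhost:8000"),
        "access_key": env.get("AWS_ACCESS_KEY_ID", "test"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY", "test"),
    }


def mask(secret: str, visible: int = 4) -> str:
    """Show only the first few characters of a secret when logging."""
    return secret[:visible] + "***"
```

Passing an explicit mapping (for example `load_settings({})`) keeps the helper easy to exercise without touching the real process environment.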

## 1. Import Libraries and Configuration

This notebook demonstrates S3 Vectors API testing with boto3 SDK using real text embeddings. The setup includes:
- S3 Vectors client configuration  
- Text embedding server integration (text-embedding-nomic-embed-text-v1.5)
- Rivers of India knowledge base for semantic search testing

In [None]:
# =============================================================================
# 🔧 CONFIGURATION - Update these settings for your environment
# =============================================================================

# S3 Vectors API Endpoint Configuration
ENDPOINT_URL = "http://127.0.0.1:8000/"  # Change to your S3 Vectors server URL
REGION_NAME = "us-east-1"               # AWS region for compatibility

# AWS Credentials (for boto3 compatibility if needed)
AWS_ACCESS_KEY_ID = "minioadmin"              # Your AWS access key or test value
AWS_SECRET_ACCESS_KEY = "minioadmin"          # Your AWS secret key or test value

# Embedding Server Configuration
EMBEDDING_URL = "http://127.0.0.1:1234/v1/embeddings"  # Local embedding server
EMBEDDING_MODEL = "text-embedding-nomic-embed-text-v1.5"  # Embedding model

# Request Configuration
REQUEST_TIMEOUT = 30                    # Request timeout in seconds
MAX_RETRIES = 3                        # Maximum number of retries for failed requests

print("🔧 Configuration Settings:")
print(f"   📡 S3 Vectors Endpoint: {ENDPOINT_URL}")
print(f"   🧠 Embedding Server: {EMBEDDING_URL}")
print(f"   🤖 Embedding Model: {EMBEDDING_MODEL}")
print(f"   🌍 Region: {REGION_NAME}")
print(f"   🔑 Access Key: {AWS_ACCESS_KEY_ID[:4]}***")
print(f"   ⏱️ Timeout: {REQUEST_TIMEOUT}s")
print(f"   🔄 Max Retries: {MAX_RETRIES}")

# =============================================================================
# 📚 LIBRARY IMPORTS
# =============================================================================

import boto3
import botocore
import json
import numpy as np
import os
import sys
import threading
import concurrent.futures
import time
import requests  # Added for embedding API calls
from botocore.loaders import Loader

print("\n📚 Libraries imported successfully!")
print(f"🐍 Python version: {sys.version}")
print(f"🔧 Boto3 version: {boto3.__version__}")
print(f"🔧 Botocore version: {botocore.__version__}")
print(f"🔧 Requests version: {requests.__version__}")
print("💡 Using boto3 SDK with S3 Vectors service model!")
print("🧠 Ready to generate real text embeddings!")

🔧 Configuration Settings:
   📡 S3 Vectors Endpoint: http://127.0.0.1:8000/
   🧠 Embedding Server: http://127.0.0.1:1234/v1/embeddings
   🤖 Embedding Model: text-embedding-nomic-embed-text-v1.5
   🌍 Region: us-east-1
   🔑 Access Key: mini***
   ⏱️ Timeout: 30s
   🔄 Max Retries: 3

📚 Libraries imported successfully!
🐍 Python version: 3.13.5 (main, Jul  1 2025, 18:16:22) [Clang 20.1.4 ]
🔧 Boto3 version: 1.40.7
🔧 Botocore version: 1.40.7
🔧 Requests version: 2.32.4
🧵 Threading support: 6 active threads
🚀 Python 3.13.5+ provides improved multithreading performance!
💡 Using boto3 SDK with S3 Vectors service model!
🧠 Ready to generate real text embeddings!


## 2. Configure Boto3 Client with S3 Vectors Service Model

Configure the boto3 client to use the S3 Vectors service model for native AWS SDK functionality.

In [103]:
# Configure boto3 to use S3 Vectors service model
print(f"üîß Setting up boto3 S3 Vectors client...")
print(f"üì° Configured Endpoint URL: {ENDPOINT_URL}")
print(f"üåç Region: {REGION_NAME}")



# Clear any conflicting environment variables that might override our endpoint
env_vars_to_clear = ['AWS_ENDPOINT_URL', 'AWS_ENDPOINT_URL_S3', 'MINIO_ENDPOINT']
for var in env_vars_to_clear:
    if var in os.environ:
        print(f"üßπ Clearing environment variable: {var}={os.environ[var]}")
        del os.environ[var]

try:
    # Ensure we use the exact endpoint URL from configuration
    actual_endpoint = ENDPOINT_URL.rstrip('/')  # Remove trailing slash for consistency
    print(f"üì° Using Endpoint URL: {actual_endpoint}")
    
    # Create S3 Vectors client (NOT standard S3 client)
    s3vectors_client = boto3.client(
        's3vectors',  # This is the key - use 's3vectors' service
        region_name=REGION_NAME,
        endpoint_url=actual_endpoint,
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
        config=boto3.session.Config(
            retries={'max_attempts': MAX_RETRIES},
            read_timeout=REQUEST_TIMEOUT,
            connect_timeout=REQUEST_TIMEOUT,
            signature_version=botocore.UNSIGNED,  # Use botocore.UNSIGNED
        ),
        verify=False  # Skip SSL verification for local development
    )
    
    # Verify the client is using the correct endpoint.
    # meta.endpoint_url is the public attribute; avoid the private _endpoint object.
    client_endpoint = s3vectors_client.meta.endpoint_url
    print(f"✅ Client endpoint verified: {client_endpoint}")
    
    if client_endpoint != actual_endpoint:
        print(f"⚠️ WARNING: Client endpoint ({client_endpoint}) differs from configured ({actual_endpoint})")
    
    print("‚úÖ Boto3 S3 Vectors client created successfully!")
    print("üîß Using S3 Vectors service with native boto3 methods")
    print("üì° Client supports: create_index(), put_vectors(), query_vectors(), etc.")
    print("üéØ Ready to use S3 Vectors operations!")
        
except Exception as e:
    print(f"‚ùå Error setting up S3 Vectors client: {e}")
    print("üö® Please check:")
    print("   1. Server is running at the configured endpoint")
    print("   2. S3 Vectors service model is available")
    print("   3. Endpoint URL is correct in configuration")
    print("   4. Service model path is properly configured")
    
    s3vectors_client = None

üîß Setting up boto3 S3 Vectors client...
üì° Configured Endpoint URL: http://127.0.0.1:8000/
üåç Region: us-east-1
üì° Using Endpoint URL: http://127.0.0.1:8000
‚úÖ Client endpoint verified: http://127.0.0.1:8000
‚úÖ Boto3 S3 Vectors client created successfully!
üîß Using S3 Vectors service with native boto3 methods
üì° Client supports: create_index(), put_vectors(), query_vectors(), etc.
üéØ Ready to use S3 Vectors operations!


## 3. S3 Vectors Client Ready

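The boto3 S3 Vectors client is now configured and ready to use. Retry behavior and error handling come from botocore. Because the client was created with `signature_version=botocore.UNSIGNED`, requests are sent unsigned, so the configured access keys are not actually used to authenticate against the local server.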

In [104]:
# Verify boto3 S3 Vectors client is ready
if s3vectors_client is not None:
    print("üöÄ Boto3 S3 Vectors client is ready!")
    print(f"📡 Endpoint URL: {s3vectors_client.meta.endpoint_url}")
    print(f"üåç Region: {s3vectors_client.meta.region_name}")
    print(f"üîß Service: S3 Vectors (with native API support)")
    print("‚úÖ Ready to test S3 Vectors operations:")
    print("   üì¶ Bucket operations: create_vector_bucket(), list_vector_buckets()")
    print("   üìä Index operations: create_index(), list_indexes(), delete_index()")
    print("   üîç Vector operations: put_vectors(), get_vectors(), query_vectors()")
    print("   üîê Policy operations: put_vector_bucket_policy(), get_vector_bucket_policy()")
    print("üí° Using native S3 Vectors boto3 client")
else:
    print("‚ùå S3 Vectors client not available")
    print("üö® Please check the server connectivity and configuration")
    print("üí° Make sure to run the previous cell successfully before proceeding")

üöÄ Boto3 S3 Vectors client is ready!
üì° Endpoint URL: http://127.0.0.1:8000
üåç Region: us-east-1
üîß Service: S3 Vectors (with native API support)
‚úÖ Ready to test S3 Vectors operations:
   üì¶ Bucket operations: create_vector_bucket(), list_vector_buckets()
   üìä Index operations: create_index(), list_indexes(), delete_index()
   üîç Vector operations: put_vectors(), get_vectors(), query_vectors()
   üîê Policy operations: put_vector_bucket_policy(), get_vector_bucket_policy()
üí° Using native S3 Vectors boto3 client


## 4. Test CreateVectorBucket

Create a new vector bucket using the boto3 S3 Vectors client. The bucket name ends with the current Unix timestamp (`boto3-test-<timestamp>`), so each run creates a uniquely named bucket.
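
If names also need to be distinguishable across machines, one option is to combine a sanitized hostname with the timestamp. The helper below is a hypothetical sketch, not part of the notebook's code. It assumes vector bucket names follow the usual S3-style constraints (lowercase letters, digits, and hyphens, at most 63 characters), which should be checked against the target server.

```python
import re
import socket
import time


def make_bucket_name(prefix: str = "boto3-test", max_len: int = 63) -> str:
    """Build a name from prefix, lowercase hostname and the current Unix time."""
    # Replace anything outside [a-z0-9-] in the hostname with a hyphen.
    host = re.sub(r"[^a-z0-9-]", "-", socket.gethostname().lower()).strip("-")
    suffix = str(int(time.time()))
    # Trim the hostname part so the whole name fits within max_len characters.
    room = max_len - len(prefix) - len(suffix) - 2
    host = host[:max(room, 0)].strip("-")
    parts = [prefix, host, suffix] if host else [prefix, suffix]
    return "-".join(parts)


print(make_bucket_name())
```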

In [105]:
# Test S3 Vectors operations with native boto3 methods
import time

print("üß™ Testing S3 Vectors operations with native boto3 methods...")
print("üì° Using S3 Vectors client with create_vector_bucket(), list_vector_buckets(), etc.")

# Test 1: List vector buckets
try:
    print("\n1Ô∏è‚É£ Testing boto3 list_vector_buckets()")
    response = s3vectors_client.list_vector_buckets()
    print("‚úÖ list_vector_buckets() successful!")
    
    # S3 Vectors response format
    buckets = response.get('vectorBuckets', [])
    print(f"üìä Found {len(buckets)} buckets")
    
    for bucket in buckets:
        print(f"  üì¶ {bucket['vectorBucketName']} (created: {bucket.get('creationTime', 'N/A')})")
        
except Exception as e:
    print(f"‚ùå Error in list_vector_buckets(): {e}")

# Test 2: Create vector bucket
bucket_name = f"boto3-test-{int(time.time())}"
try:
    print(f"\n2Ô∏è‚É£ Testing boto3 create_vector_bucket()")
    print(f"üèóÔ∏è Creating bucket: {bucket_name}")
    
    response = s3vectors_client.create_vector_bucket(vectorBucketName=bucket_name)
    print("‚úÖ create_vector_bucket() successful!")
    print(f"üìç Response: {response}")
    
except Exception as e:
    print(f"‚ùå Error in create_vector_bucket(): {e}")

# Test 3: Verify bucket was created
try:
    print(f"\n3Ô∏è‚É£ Verifying bucket creation...")
    response = s3vectors_client.list_vector_buckets()
    
    # Use correct S3 Vectors response format
    buckets = response.get('vectorBuckets', [])
    bucket_names = [b['vectorBucketName'] for b in buckets]
    
    if bucket_name in bucket_names:
        print(f"‚úÖ Bucket '{bucket_name}' successfully created!")
    else:
        print(f"‚ö†Ô∏è Bucket '{bucket_name}' not found in list")
        print(f"üîç Available buckets: {bucket_names}")
        
except Exception as e:
    print(f"‚ùå Error verifying bucket: {e}")

print(f"\nüéâ S3 Vectors boto3 client is working!")
print(f"üí° Using native S3 Vectors operations: create_vector_bucket(), list_vector_buckets()")
print(f"üîß Client has full S3 Vectors support with proper method signatures!")

üß™ Testing S3 Vectors operations with native boto3 methods...
üì° Using S3 Vectors client with create_vector_bucket(), list_vector_buckets(), etc.

1Ô∏è‚É£ Testing boto3 list_vector_buckets()
‚úÖ list_vector_buckets() successful!
üìä Found 14 buckets
  üì¶ b1 (created: 2025-08-13 04:42:54.826958+00:00)
  üì¶ boto3-test-1755055901 (created: 2025-08-13 04:42:54.827154+00:00)
  üì¶ boto3-test-1755056431 (created: 2025-08-13 04:42:54.827159+00:00)
  üì¶ boto3-test-1755057315 (created: 2025-08-13 04:42:54.827161+00:00)
  üì¶ boto3-test-1755057347 (created: 2025-08-13 04:42:54.827163+00:00)
  üì¶ boto3-test-1755057682 (created: 2025-08-13 04:42:54.827165+00:00)
  üì¶ boto3-test-1755058205 (created: 2025-08-13 04:42:54.827166+00:00)
  üì¶ boto3-test-1755059506 (created: 2025-08-13 04:42:54.827171+00:00)
  üì¶ bucket1 (created: 2025-08-13 04:42:54.827174+00:00)
  üì¶ test-bucket (created: 2025-08-13 04:42:54.827178+00:00)
  üì¶ test-bucket-2 (created: 2025-08-13 04:42:54.827180+

## 5. Test ListVectorBuckets

List all available vector buckets.
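
When many buckets exist, a single call may not return all of them. The sketch below follows a `nextToken` continuation value until it is absent. `iter_vector_buckets` is a hypothetical helper, and it assumes the server honors the `maxResults` and `nextToken` parameters of the AWS ListVectorBuckets API; the recorded output in this notebook does not show whether the local server paginates.

```python
from typing import Any, Dict, Iterator


def iter_vector_buckets(client: Any, page_size: int = 50) -> Iterator[Dict[str, Any]]:
    """Yield every bucket summary, following nextToken until it is absent."""
    kwargs: Dict[str, Any] = {"maxResults": page_size}
    while True:
        response = client.list_vector_buckets(**kwargs)
        yield from response.get("vectorBuckets", [])
        token = response.get("nextToken")
        if not token:
            break
        # Ask for the next page on the following iteration.
        kwargs["nextToken"] = token
```

With the client from the earlier cells this would be used as `for bucket in iter_vector_buckets(s3vectors_client): print(bucket["vectorBucketName"])`.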

In [106]:
# List vector buckets using native S3 Vectors method
try:
    response = s3vectors_client.list_vector_buckets()
    
    print("‚úÖ Vector buckets listed successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Count buckets using correct S3 Vectors response format
    buckets = response.get('vectorBuckets', [])
    bucket_count = len(buckets)
    print(f"üìä Total buckets: {bucket_count}")
    
    # Show bucket names
    if buckets:
        for bucket in buckets:
            print(f"  üì¶ {bucket['vectorBucketName']}")
    else:
        print("  üì≠ No buckets found")
            
except Exception as e:
    print(f"‚ùå Error listing buckets: {e}")

‚úÖ Vector buckets listed successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 04:43:31 GMT",
      "server": "uvicorn",
      "content-length": "2681",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  },
  "vectorBuckets": [
    {
      "vectorBucketName": "b1",
      "vectorBucketArn": "arn:aws:s3vectors:us-east-1:123456789012:bucket/b1",
      "creationTime": "2025-08-13 04:43:31.933168+00:00"
    },
    {
      "vectorBucketName": "boto3-test-1755055901",
      "vectorBucketArn": "arn:aws:s3vectors:us-east-1:123456789012:bucket/boto3-test-1755055901",
      "creationTime": "2025-08-13 04:43:31.933222+00:00"
    },
    {
      "vectorBucketName": "boto3-test-1755056431",
      "vectorBucketArn": "arn:aws:s3vectors:us-east-1:123456789012:bucket/boto3-test-1755056431",
      "creationTime": "2025-08-13 04:43:31.933226+00:00"
    },
    {
      "vectorBucketName": "boto3-test-1755057315",
      "

## 6. Test PutVectorBucketPolicy

Set a bucket policy for the vector bucket.

In [61]:
# Create and apply bucket policy
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3vectors:GetVectors",
                "s3vectors:QueryVectors"
            ],
            "Resource": f"arn:aws:s3vectors:*:*:bucket/{bucket_name}/*"
        }
    ]
}

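try:
    # boto3 validates `policy` as a string, so serialize the dict to JSON first.
    # Passing the dict directly raises the parameter validation error recorded below.
    response = s3vectors_client.put_vector_bucket_policy(
        vectorBucketName=bucket_name,
        policy=json.dumps(policy)
    )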
    
    print("‚úÖ Bucket policy set successfully!")
    print(json.dumps(response, indent=2, default=str))
    
except Exception as e:
    print(f"‚ùå Error setting bucket policy: {e}")

‚ùå Error setting bucket policy: Parameter validation failed:
Invalid type for parameter policy, value: {'Version': '2012-10-17', 'Statement': [{'Effect': 'Allow', 'Principal': '*', 'Action': ['s3vectors:GetVectors', 's3vectors:QueryVectors'], 'Resource': 'arn:aws:s3vectors:*:*:bucket/boto3-test-1755057682/*'}]}, type: <class 'dict'>, valid types: <class 'str'>
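
The validation error above says that `policy` must be a `str`, while the cell passed a `dict`. A minimal sketch of the usual remedy is to serialize the policy document with `json.dumps` before the call. `build_read_only_policy` is a hypothetical helper name, and the bucket name in the example is a placeholder.

```python
import json


def build_read_only_policy(bucket_name: str) -> str:
    """Return the read-only policy from this section serialized as JSON text."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3vectors:GetVectors", "s3vectors:QueryVectors"],
                "Resource": f"arn:aws:s3vectors:*:*:bucket/{bucket_name}/*",
            }
        ],
    }
    # boto3 validates `policy` as str, so the dict is serialized before the call.
    return json.dumps(policy)


policy_json = build_read_only_policy("boto3-test-example")
print(type(policy_json).__name__)  # str
```

The result can then be passed as `policy=build_read_only_policy(bucket_name)` to `put_vector_bucket_policy()`. Whether the local server accepts this particular document has not been verified here.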


## 7. Test GetVectorBucketPolicy

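Retrieve the bucket policy set in the previous step. In the recorded run the PutVectorBucketPolicy call failed parameter validation, so the response shown below contains no `policy` field.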

In [64]:
# Get bucket policy
try:
    response = s3vectors_client.get_vector_bucket_policy(
        vectorBucketName=bucket_name
    )
    
    print("‚úÖ Bucket policy retrieved successfully!")
    print(json.dumps(response, indent=2, default=str))
    
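    # Verify policy content (the policy is a JSON string, so parse it before reading fields)
    if 'policy' in response:
        policy_doc = response['policy']
        if isinstance(policy_doc, str):
            policy_doc = json.loads(policy_doc)
        policy_version = policy_doc.get('Version')
        statement_count = len(policy_doc.get('Statement', []))
        print(f"📋 Policy version: {policy_version}")
        print(f"📊 Number of statements: {statement_count}")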
        
except Exception as e:
    print(f"‚ùå Error getting bucket policy: {e}")

‚úÖ Bucket policy retrieved successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 04:03:37 GMT",
      "server": "uvicorn",
      "content-length": "15",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  }
}


## 8. Test CreateIndex

Create a vector index in the bucket.

In [107]:
# Create vector index
index_name = "test-notebook-index"

try:
    response = s3vectors_client.create_index(
        vectorBucketName=bucket_name,
        indexName=index_name,
        dimension=768,  # Updated to match text-embedding-nomic-embed-text-v1.5 dimension
        dataType="float32",
        distanceMetric="cosine",
        metadataConfiguration={
            "nonFilterableMetadataKeys": ["description", "internal_id"]
        }
    )
    
    print("‚úÖ Vector index created successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Extract index details
    if 'index' in response:
        index_info = response['index']
        print(f"üìä Index: {index_info.get('indexName')}")
        print(f"üìè Dimension: {index_info.get('dimension')}")
        print(f"üìê Distance metric: {index_info.get('distanceMetric')}")
        print(f"üìÖ Creation time: {index_info.get('creationTime')}")
        
except Exception as e:
    print(f"‚ùå Error creating index: {e}")

‚úÖ Vector index created successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 04:44:03 GMT",
      "server": "uvicorn",
      "content-length": "2",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  }
}


## 9. Test ListIndexes

List all indexes in the bucket.

In [108]:
# List indexes
try:
    response = s3vectors_client.list_indexes(
        vectorBucketName=bucket_name
    )
    
    print("‚úÖ Indexes listed successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Count indexes
    index_count = len(response.get('indexes', []))
    print(f"üìä Total indexes: {index_count}")
    
    # Show index names
    if 'indexes' in response:
        for index in response['indexes']:
            print(f"  üìä {index['indexName']}")
            
except Exception as e:
    print(f"‚ùå Error listing indexes: {e}")

‚úÖ Indexes listed successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 04:44:11 GMT",
      "server": "uvicorn",
      "content-length": "261",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  },
  "indexes": [
    {
      "vectorBucketName": "boto3-test-1755060174",
      "indexName": "test-notebook-index",
      "indexArn": "arn:aws:s3vectors:us-east-1:123456789012:index/boto3-test-1755060174/test-notebook-index",
      "creationTime": "2025-08-13 04:44:11.645535+00:00"
    }
  ]
}
üìä Total indexes: 1
  üìä test-notebook-index


## 10. Test PutVectors with Real Text Embeddings

Add vectors to the index using real text embeddings about rivers of India. This replaces random vectors with meaningful semantic representations generated by the text-embedding-nomic-embed-text-v1.5 model.

### 🌊 River Knowledge Base
- **Ganges (Ganga)**: Sacred river flowing from Himalayas to Bay of Bengal
- **Brahmaputra**: Major river supporting agriculture in northeastern India
- **Narmada**: Westward-flowing river important for hydropower and irrigation
- **Krishna**: River crucial for Deccan plateau agriculture

### 🧠 Embedding Features
- **768-dimensional vectors** from text-embedding-nomic-embed-text-v1.5
- **Semantic understanding** of geographic, religious, and economic concepts
- **Real similarity search** based on meaning, not just keywords
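
Because the index is created with a fixed dimension, it can help to check each embedding before uploading it. The sketch below is a hypothetical pre-upload check, not part of the notebook's code: `prepare_embedding` verifies the length against the expected 768 dimensions, rejects NaN or infinite values, and returns a plain list of Python floats suitable for the `{"float32": [...]}` payload.

```python
import math
from typing import List, Sequence


def prepare_embedding(values: Sequence[float], expected_dim: int = 768) -> List[float]:
    """Validate an embedding and return it as a plain list of floats."""
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    floats = [float(v) for v in values]
    if not all(math.isfinite(v) for v in floats):
        raise ValueError("embedding contains NaN or infinite values")
    return floats
```

Failing early on a mismatched dimension tends to give a clearer message than a server-side rejection of the whole batch.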

In [109]:
# Generate embeddings using the embedding server configured in the first cell
import requests

def get_text_embedding(text, model=EMBEDDING_MODEL):
    """Generate a text embedding using the configured embedding server."""
    try:
        response = requests.post(
            EMBEDDING_URL,  # use the configured URL instead of a hardcoded address
            headers={"Content-Type": "application/json"},
            json={
                "model": model,
                "input": text
            },
            timeout=REQUEST_TIMEOUT
        )
        response.raise_for_status()
        data = response.json()
        
        # Extract embedding from response
        embedding = data["data"][0]["embedding"]
        print(f"✅ Generated embedding for: '{text[:50]}...' (dimension: {len(embedding)})")
        return embedding
        
    except Exception as e:
        print(f"❌ Error generating embedding for '{text[:50]}...': {e}")
        # Fallback to a random unit vector if embedding fails.
        # NOTE: a random vector carries no semantic meaning, so any search
        # result involving it is arbitrary.
        vector = np.random.randn(768).astype(np.float32)
        norm = np.linalg.norm(vector)
        if norm > 0:
            vector = vector / norm
        return vector.tolist()

# Create sample texts about rivers of India with comprehensive information
river_texts = [
    {
        "key": "ganga-river",
        "text": "The Ganges, known as Ganga in Hindi, is the most sacred river in India. It originates from the Gangotri Glacier in the Himalayas and flows through northern India for 2,525 kilometers before emptying into the Bay of Bengal. The river is considered holy by Hindus and supports over 400 million people along its course. Major cities like Varanasi, Allahabad, and Kolkata are situated on its banks.",
        "metadata": {
            "title": "Ganges River - Sacred Waters of India",
            "category": "geography",
            "region": "Northern India",
            "length_km": 2525,
            "type": "sacred_river",
            "importance": "religious_economic"
        }
    },
    {
        "key": "brahmaputra-river", 
        "text": "The Brahmaputra is one of the major rivers of Asia, flowing through Tibet, India, and Bangladesh. In India, it flows through Assam for 720 kilometers and is known as one of the few male rivers in Hindu tradition. The river is vital for agriculture in the northeastern states and supports rich biodiversity. It eventually joins the Ganges to form the world's largest delta.",
        "metadata": {
            "title": "Brahmaputra River - The Son of Brahma",
            "category": "geography", 
            "region": "Northeastern India",
            "length_km": 720,
            "type": "major_river",
            "importance": "agricultural_biodiversity"
        }
    },
    {
        "key": "narmada-river",
        "text": "The Narmada River is the fifth-longest river in India, flowing westward for 1,312 kilometers through Madhya Pradesh, Maharashtra, and Gujarat before draining into the Arabian Sea. It is one of only three major rivers in peninsular India that flow from east to west. The river is considered sacred and has numerous ancient temples along its banks. The Sardar Sarovar Dam on this river is one of the largest infrastructure projects in India.",
        "metadata": {
            "title": "Narmada River - The Lifeline of Central India", 
            "category": "geography",
            "region": "Central India",
            "length_km": 1312,
            "type": "westward_flowing",
            "importance": "irrigation_hydropower"
        }
    },
    {
        "key": "krishna-river",
        "text": "The Krishna River is the fourth-longest river in India, flowing for 1,400 kilometers through Maharashtra, Karnataka, Telangana, and Andhra Pradesh before emptying into the Bay of Bengal. The river originates near Mahabaleshwar in the Western Ghats and is crucial for irrigation in the Deccan Plateau region. Major cities like Vijayawada and Sangli are located on its banks, and it supports extensive agricultural activities.",
        "metadata": {
            "title": "Krishna River - Waters of the Deccan",
            "category": "geography",
            "region": "South India", 
            "length_km": 1400,
            "type": "peninsular_river",
            "importance": "irrigation_agriculture"
        }
    }
]

print("üåä Creating vectors with real text embeddings about rivers of India...")
print(f"üì° Using embedding model: text-embedding-nomic-embed-text-v1.5")
print(f"üìè Expected dimension: 768")

# Generate embeddings for each river text
vectors = []
for river_data in river_texts:
    print(f"\nüîÑ Processing: {river_data['metadata']['title']}")
    
    # Get embedding for the full text
    embedding = get_text_embedding(river_data['text'])
    
    # Create vector entry
    vector_entry = {
        "key": river_data['key'],
        "data": {"float32": embedding},
        "metadata": river_data['metadata']
    }
    vectors.append(vector_entry)

try:
    response = s3vectors_client.put_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        vectors=vectors
    )
    
    print("\n‚úÖ Vectors uploaded successfully!")
    print(json.dumps(response, indent=2, default=str))
    print(f"üìä Uploaded {len(vectors)} vectors with real embeddings")
    
    for vector in vectors:
        print(f"  üåä {vector['key']}: {vector['metadata']['title']}")
        
except Exception as e:
    print(f"‚ùå Error uploading vectors: {e}")

üåä Creating vectors with real text embeddings about rivers of India...
üì° Using embedding model: text-embedding-nomic-embed-text-v1.5
üìè Expected dimension: 768

üîÑ Processing: Ganges River - Sacred Waters of India
‚úÖ Generated embedding for: 'The Ganges, known as Ganga in Hindi, is the most s...' (dimension: 768)

üîÑ Processing: Brahmaputra River - The Son of Brahma
‚úÖ Generated embedding for: 'The Brahmaputra is one of the major rivers of Asia...' (dimension: 768)

üîÑ Processing: Narmada River - The Lifeline of Central India
‚úÖ Generated embedding for: 'The Narmada River is the fifth-longest river in In...' (dimension: 768)

üîÑ Processing: Krishna River - Waters of the Deccan
‚úÖ Generated embedding for: 'The Krishna River is the fourth-longest river in I...' (dimension: 768)

‚úÖ Vectors uploaded successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 04:45:26 GMT",
      "server": "uvicorn",
      "con

## 11. Test QueryVectors

Search for the stored vectors closest to a query embedding, optionally restricted by a metadata filter.
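
The index in this notebook uses the `cosine` distance metric. As a rough illustration of what the returned `distance` values mean, the sketch below computes cosine distance as one minus cosine similarity. That formula is the common convention and an assumption here, since the server's exact computation is not shown in this notebook.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this convention smaller values mean closer matches, which is why the result handling later in this section treats low distances as more relevant.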

### Filter Format for S3 Vectors

S3 Vectors uses a structured filter format with operators. Common patterns:

```python
# Equality filter
filter = {
    "category": {
        "eq": "education"
    }
}

# Numeric comparison
filter = {
    "score": {
        "gte": 0.8
    }
}

# Multiple conditions
filter = {
    "category": {
        "eq": "education"
    },
    "score": {
        "gte": 0.85
    }
}

# Available operators: eq, neq, gt, gte, lt, lte, in, nin, exists
```
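
To make the intended semantics of these filters concrete, the sketch below evaluates a filter against a single metadata dict on the client side. It is only an illustration under assumptions: `matches` is a hypothetical helper, multiple fields are combined with AND, and a missing field fails every operator except `exists`. The server's actual implementation may differ.

```python
from typing import Any, Callable, Dict

# Comparison functions for each operator except "exists", handled separately.
_OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda value, arg: value == arg,
    "neq": lambda value, arg: value != arg,
    "gt": lambda value, arg: value > arg,
    "gte": lambda value, arg: value >= arg,
    "lt": lambda value, arg: value < arg,
    "lte": lambda value, arg: value <= arg,
    "in": lambda value, arg: value in arg,
    "nin": lambda value, arg: value not in arg,
}


def matches(metadata: Dict[str, Any], flt: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if metadata satisfies every condition in flt (AND semantics)."""
    for field, conditions in flt.items():
        for op, arg in conditions.items():
            if op == "exists":
                if (field in metadata) != bool(arg):
                    return False
                continue
            if op not in _OPERATORS:
                raise ValueError(f"unsupported operator: {op}")
            # Assumption: a missing field never satisfies a comparison.
            if field not in metadata or not _OPERATORS[op](metadata[field], arg):
                return False
    return True
```

For example, `matches({"category": "geography", "length_km": 2525}, {"category": {"eq": "geography"}})` evaluates to `True`.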

In [114]:
# Query for similar vectors with a real question about rivers of India
query_text = "Which river is considered most sacred in Hindu religion and flows from the Himalayas to the Bay of Bengal?"

print(f"üîç Query Question: {query_text}")
print("üîÑ Generating embedding for query...")

# Generate embedding for the query
query_vector = get_text_embedding(query_text)

try:
    response = s3vectors_client.query_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        queryVector={"float32": query_vector},
        topK=3,
        returnMetadata=True,   # Enable metadata return for similarity search
        returnDistance=True,   # Enable distance return for similarity search
        filter={
            "category": {
                "eq": "geography"  # Filter for geography-related content
            }
        }
    )
    
    print("‚úÖ Vector similarity search completed successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Display results with proper similarity search information
    if 'vectors' in response:
        print(f"\nüîç Similarity Search Results ({len(response['vectors'])} found):")
        print(f"‚ùì Question: {query_text}")
        for i, result in enumerate(response['vectors'], 1):
            key = result.get('key', 'Unknown')
            distance = result.get('distance', 'N/A')
            metadata = result.get('metadata', {})
            title = metadata.get('title', 'No title')
            region = metadata.get('region', 'No region')
            length_km = metadata.get('length_km', 'N/A')
            river_type = metadata.get('type', 'No type')
            importance = metadata.get('importance', 'No importance')
            
            print(f"\n  {i}. üåä {key}: {title}")
            print(f"      üìç Region: {region}")
            print(f"      üìè Length: {length_km} km")
            print(f"      üè∑Ô∏è Type: {river_type}")
            print(f"      ‚≠ê Importance: {importance}")
            print(f"      üéØ Similarity Distance: {distance:.4f}" if isinstance(distance, (int, float)) else f"      üéØ Similarity Distance: {distance}")
            
            # Interpret similarity
            if isinstance(distance, (int, float)):
                if distance < 0.3:
                    similarity_desc = "Very Relevant ‚úÖ"
                elif distance < 0.6:
                    similarity_desc = "Relevant ‚úÖ"
                elif distance < 0.9:
                    similarity_desc = "Somewhat Relevant ‚ö†Ô∏è"
                else:
                    similarity_desc = "Less Relevant ‚ùå"
                print(f"      üìä Relevance: {similarity_desc}")
    else:
        print("üîç No similar vectors found (filtering may have excluded results)")
    
except Exception as e:
    print(f"‚ùå Error in similarity search: {e}")

üîç Query Question: Which river is considered most sacred in Hindu religion and flows from the Himalayas to the Bay of Bengal?
üîÑ Generating embedding for query...
‚úÖ Generated embedding for: 'Which river is considered most sacred in Hindu rel...' (dimension: 768)
‚úÖ Vector similarity search completed successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 05:03:19 GMT",
      "server": "uvicorn",
      "content-length": "787",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  },
  "vectors": [
    {
      "key": "ganga-river",
      "metadata": {
        "title": "Ganges River - Sacred Waters of India",
        "category": "geography",
        "region": "Northern India",
        "length_km": 2525,
        "type": "sacred_river",
        "importance": "religious_economic"
      },
      "distance": 0.12343692779541016
    },
    {
      "key": "brahmaputra-river",
      "metadata": {
        "ti

In [91]:
# Verify vector dimensions and test exact text similarity
print("üîß Verifying semantic similarity search with real embeddings...")

# Check the dimensions of our query vector
print(f"üìè Query vector dimension: {len(query_vector)}")

# Test semantic similarity with exact text from stored vector
print("\nüß™ Testing exact text similarity...")
exact_ganges_text = "The Ganges, known as Ganga in Hindi, is the most sacred river in India. It originates from the Gangotri Glacier in the Himalayas and flows through northern India for 2,525 kilometers before emptying into the Bay of Bengal."

try:
    # Generate embedding for the exact text
    exact_ganges_vector = get_text_embedding(exact_ganges_text)
    
    # Query with the exact text - should return Ganges with very low distance
    similarity_response = s3vectors_client.query_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        queryVector={"float32": exact_ganges_vector},
        topK=3,
        returnMetadata=True,
        returnDistance=True
    )
    
    print("✅ Exact text similarity search results:")
    if 'vectors' in similarity_response:
        for i, result in enumerate(similarity_response['vectors'], 1):
            key = result.get('key', 'Unknown')
            distance = result.get('distance', 'N/A')
            metadata = result.get('metadata', {})
            title = metadata.get('title', 'No title')
            
            print(f"  {i}. 🌊 {key}: {title}")
            # Format numeric distances to 6 decimals; otherwise print the raw value
            print(f"      📏 Distance: {distance:.6f}" if isinstance(distance, (int, float)) else f"      📏 Distance: {distance}")
            
            # Check for very close semantic match with Ganges
            if key == "ganga-river" and isinstance(distance, (int, float)) and distance < 0.1:
                print("🎯 ⭐ EXCELLENT SEMANTIC MATCH! ⭐")
    
    print("\n📊 This demonstrates that text embeddings capture semantic meaning effectively")
    
except Exception as e:
    print(f"❌ Error in exact text similarity test: {e}")

print("\n" + "="*70)

🔧 Verifying vector similarity search configuration...
📏 Query vector dimension: 128
📊 Index configuration:
   📏 Index dimension: 128
   📐 Distance metric: cosine
   🔢 Data type: float32
✅ Vector dimensions match - similarity search should work correctly

--------------------------------------------------
🧪 Testing similarity search with a known vector...
📄 Using doc-1 vector as query (dimension: 128)
✅ Known vector similarity search results:
  1. doc-1: Machine Learning Fundamentals (distance: -0.000000)
🎯 Perfect match found! Similarity search is working correctly.
  2. doc-3: Natural Language Processing (distance: 0.887455)
  3. doc-3: Natural Language Processing (distance: 0.948686)
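
The distances reported above can be reproduced locally. The following is a minimal sketch, assuming the server defines cosine distance as `1 - cosine similarity` (the output only confirms that the metric is `cosine`, so the exact formula is an assumption). The helper name `cosine_distance` is illustrative and not part of the S3 Vectors API. It also shows why a vector compared with itself can print as `-0.000000`: floating-point rounding may leave a tiny negative remainder.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors (assumed formula)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# A vector compared with itself: distance is zero up to rounding error
identical = cosine_distance([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
# Perpendicular vectors: distance is one
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(f"identical:  {identical:.6f}")
print(f"orthogonal: {orthogonal:.6f}")
```

Under this definition, distances fall between 0 (same direction) and 2 (opposite direction), so smaller values mean closer matches.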



In [115]:
# Test semantic search with various question types about rivers of India
print("🌊 Testing semantic search with different types of questions...")
print("📚 This demonstrates how embeddings capture semantic meaning beyond keywords")

# Define test questions of different types
test_questions = [
    {
        "question": "What is the holiest river for Hindu worship?",
        "expected_match": "ganga-river",
        "explanation": "Religious/spiritual question - should match Ganges"
    },
    {
        "question": "Which river supports the most agriculture in northeastern states?",
        "expected_match": "brahmaputra-river", 
        "explanation": "Agricultural question - should match Brahmaputra"
    },
    {
        "question": "Tell me about rivers that are important for hydropower generation",
        "expected_match": "narmada-river",
        "explanation": "Energy/infrastructure question - should match Narmada"
    },
    {
        "question": "Which river is crucial for farming in the Deccan plateau?",
        "expected_match": "krishna-river",
        "explanation": "Geographic/agricultural question - should match Krishna"
    }
]

print(f"\n🔬 Running {len(test_questions)} semantic search tests...\n")

for i, test in enumerate(test_questions, 1):
    print(f"{'='*60}")
    print(f"🧪 TEST {i}: {test['explanation']}")
    print(f"❓ Question: {test['question']}")
    print(f"🎯 Expected top match: {test['expected_match']}")
    print(f"{'='*60}")
    
    try:
        # Generate embedding for the test question
        test_vector = get_text_embedding(test['question'])
        
        # Perform similarity search
        response = s3vectors_client.query_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            queryVector={"float32": test_vector},
            topK=2,  # Just get top 2 results
            returnMetadata=True,
            returnDistance=True
        )
        
        if 'vectors' in response and len(response['vectors']) > 0:
            top_result = response['vectors'][0]
            top_key = top_result.get('key', 'Unknown')
            top_distance = top_result.get('distance', 'N/A')
            top_title = top_result.get('metadata', {}).get('title', 'No title')
            
            print(f"🏆 TOP RESULT: {top_key}")
            print(f"📰 Title: {top_title}")
            print(f"📏 Distance: {top_distance:.4f}" if isinstance(top_distance, (int, float)) else f"📏 Distance: {top_distance}")
            
            # Check if prediction was correct
            if top_key == test['expected_match']:
                print("✅ ⭐ SEMANTIC SEARCH SUCCESS! ⭐")
                print("🎯 The embedding model correctly understood the semantic meaning!")
            else:
                print("⚠️ Different result than expected")
                print(f"   Expected: {test['expected_match']}")
                print(f"   Got: {top_key}")
                print("   This could still be semantically correct!")
            
            # Show second result for comparison
            if len(response['vectors']) > 1:
                second_result = response['vectors'][1]
                second_key = second_result.get('key', 'Unknown')
                second_distance = second_result.get('distance', 'N/A')
                second_title = second_result.get('metadata', {}).get('title', 'No title')
                print(f"\n🥈 SECOND: {second_key} - {second_title}")
                print(f"📏 Distance: {second_distance:.4f}" if isinstance(second_distance, (int, float)) else f"📏 Distance: {second_distance}")
        else:
            print("❌ No results found")
            
    except Exception as e:
        print(f"❌ Error in test {i}: {e}")
    
    print()  # Empty line between tests

print("🎊 Semantic search testing complete!")
print("💡 These tests demonstrate how text embeddings capture:")
print("   🔸 Religious concepts (sacred, holy, worship)")
print("   🔸 Geographic relationships (northeastern, Deccan plateau)")
print("   🔸 Functional purposes (agriculture, hydropower, irrigation)")
print("   🔸 Economic activities (farming, infrastructure)")
print("\n🚀 This is the power of semantic vector search with real embeddings!")

🌊 Testing semantic search with different types of questions...
📚 This demonstrates how embeddings capture semantic meaning beyond keywords

🔬 Running 4 semantic search tests...

🧪 TEST 1: Religious/spiritual question - should match Ganges
❓ Question: What is the holiest river for Hindu worship?
🎯 Expected top match: ganga-river
✅ Generated embedding for: 'What is the holiest river for Hindu worship?...' (dimension: 768)
🏆 TOP RESULT: ganga-river
📰 Title: Ganges River - Sacred Waters of India
📏 Distance: 0.1849
✅ ⭐ SEMANTIC SEARCH SUCCESS! ⭐
🎯 The embedding model correctly understood the semantic meaning!

🥈 SECOND: narmada-river - Narmada River - The Lifeline of Central India
📏 Distance: 0.2134

🧪 TEST 2: Agricultural question - should match Brahmaputra
❓ Question: Which river supports the most agriculture in northeastern states?
🎯 Expected top match: brahmaputra-river
✅ Generated embedding for: 'Which river supports the most agricul
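
The test loop prints each result but never aggregates them. Below is a small sketch of one way to summarize top-1 accuracy from `(expected_key, actual_key)` pairs collected during such a run. The function name `summarize_top1` and the sample pairs are hypothetical and are not results from this notebook.

```python
def summarize_top1(results):
    """Summarize (expected_key, actual_key) pairs into top-1 accuracy figures."""
    misses = [(expected, actual) for expected, actual in results if expected != actual]
    correct = len(results) - len(misses)
    accuracy = correct / len(results) if results else 0.0
    return {
        "total": len(results),
        "correct": correct,
        "accuracy": accuracy,
        "misses": misses,
    }


# Illustrative pairs only; these are NOT measured results from the notebook run
sample_results = [
    ("ganga-river", "ganga-river"),
    ("narmada-river", "krishna-river"),
]

summary = summarize_top1(sample_results)
print(f"Top-1 accuracy: {summary['correct']}/{summary['total']} ({summary['accuracy']:.0%})")
for expected, actual in summary["misses"]:
    print(f"  expected {expected}, got {actual}")
```

In the loop above, the pairs could be collected by appending `(test['expected_match'], top_key)` to a list after each query.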

## 12. Test GetVectors

Retrieve specific vectors by their keys.

In [116]:
# Get specific vectors by their keys
vector_keys = ["ganga-river", "brahmaputra-river"]

try:
    response = s3vectors_client.get_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        keys=vector_keys,
        returnData=True,
        returnMetadata=True
    )
    
    print("✅ Vectors retrieved successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Display retrieved vectors
    if 'vectors' in response:
        print(f"\n📄 Retrieved Vectors ({len(response['vectors'])} found):")
        for vector in response['vectors']:
            key = vector.get('key', 'Unknown')
            metadata = vector.get('metadata', {})
            title = metadata.get('title', 'No title')
            region = metadata.get('region', 'No region')
            vector_dim = len(vector.get('data', {}).get('float32', []))
            print(f"  🌊 {key}: {title}")
            print(f"      📍 Region: {region} ({vector_dim}D vector)")
    
except Exception as e:
    print(f"❌ Error retrieving vectors: {e}")

✅ Vectors retrieved successfully!
{
  "ResponseMetadata": {
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Wed, 13 Aug 2025 05:15:05 GMT",
      "server": "uvicorn",
      "content-length": "32905",
      "content-type": "application/json"
    },
    "RetryAttempts": 0
  },
  "vectors": [
    {
      "key": "ganga-river",
      "data": {
        "float32": [
          0.04290274530649185,
          0.07912567257881165,
          -0.17291699349880219,
          -0.0007869930122978985,
          0.036546651273965836,
          0.030878186225891113,
          0.003549055429175496,
          0.04689719155430794,
          0.01741626113653183,
          0.001391763798892498,
          0.052074551582336426,
          0.015848932787775993,
          0.11315764486789703,
          0.0031625088304281235,
          0.039050422608852386,
          -0.05309721454977989,
          0.01617870107293129,
          -0.03039400465786457,
          -0.007509204093366861,
          0.071

## 13. Test ListVectors

List all vectors in the index.

In [None]:
# List all vectors in the index
try:
    response = s3vectors_client.list_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        maxResults=10
    )
    
    print("✅ Vectors listed successfully!")
    print(json.dumps(response, indent=2, default=str))
    
    # Display vector list
    if 'vectors' in response:  # The boto3 service model returns entries under 'vectors', not 'vectorKeys'
        print(f"\n📋 Vector Keys ({len(response['vectors'])} found):")
        for vector in response['vectors']:
            key = vector.get('key', 'Unknown')
            print(f"  🔑 {key}")
    
except Exception as e:
    print(f"❌ Error listing vectors: {e}")
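
The call above fetches a single page of at most `maxResults` entries. The sketch below shows one way to walk every page, assuming the response includes a `nextToken` field while more results remain, as the AWS ListVectors API does; the custom service model used in this notebook may behave differently. `list_all_vector_keys` and `FakeClient` are hypothetical names, and the fake client exists only so the example runs without a server.

```python
def list_all_vector_keys(client, bucket, index, page_size=10):
    """Collect every vector key by following nextToken until it is absent."""
    kwargs = {"vectorBucketName": bucket, "indexName": index, "maxResults": page_size}
    keys = []
    while True:
        page = client.list_vectors(**kwargs)
        keys.extend(item["key"] for item in page.get("vectors", []))
        token = page.get("nextToken")
        if not token:
            return keys
        kwargs["nextToken"] = token  # Assumed pagination parameter


class FakeClient:
    """Hypothetical stand-in for s3vectors_client so the sketch runs offline."""

    def __init__(self, keys):
        self._keys = list(keys)

    def list_vectors(self, vectorBucketName, indexName, maxResults, nextToken=None):
        start = int(nextToken) if nextToken else 0
        end = start + maxResults
        page = {"vectors": [{"key": key} for key in self._keys[start:end]]}
        if end < len(self._keys):
            page["nextToken"] = str(end)  # Offset encoded as an opaque token
        return page


fake_client = FakeClient(["ganga-river", "brahmaputra-river", "narmada-river", "krishna-river"])
all_keys = list_all_vector_keys(fake_client, "demo-bucket", "demo-index", page_size=3)
print(all_keys)
```

Against the real server, `s3vectors_client` would be passed in place of `fake_client`, along with `bucket_name` and `index_name`.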

## 14. Test DeleteVectorBucketPolicy

Remove the bucket policy we set earlier.

In [None]:
# Delete bucket policy
try:
    response = s3vectors_client.delete_vector_bucket_policy(
        vectorBucketName=bucket_name
    )
    
    print("✅ Bucket policy deleted successfully!")
    print(json.dumps(response, indent=2, default=str))
    
except Exception as e:
    print(f"❌ Error deleting bucket policy: {e}")

## 15. Test DeleteVectors

Delete specific vectors from the index.

In [None]:
# Delete specific vectors from the index
vectors_to_delete = ["krishna-river"]  # Delete one river for testing

try:
    response = s3vectors_client.delete_vectors(
        vectorBucketName=bucket_name,
        indexName=index_name,
        keys=vectors_to_delete
    )
    
    print("✅ Vectors deleted successfully!")
    print(json.dumps(response, indent=2, default=str))
    print(f"🗑️ Deleted {len(vectors_to_delete)} vector(s): {', '.join(vectors_to_delete)}")
    
except Exception as e:
    print(f"❌ Error deleting vectors: {e}")
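
A successful response does not by itself show that the keys are gone. The following sketch checks by asking for the same keys again, assuming `get_vectors` simply omits keys that no longer exist rather than raising an error; that behaviour is an assumption about this server. `keys_still_present` and `FakeStore` are hypothetical names used so the example is runnable offline.

```python
def keys_still_present(client, bucket, index, keys):
    """Return the subset of keys that get_vectors still finds."""
    response = client.get_vectors(vectorBucketName=bucket, indexName=index, keys=keys)
    found = {item["key"] for item in response.get("vectors", [])}
    return [key for key in keys if key in found]


class FakeStore:
    """Hypothetical in-memory stand-in for s3vectors_client."""

    def __init__(self, keys):
        self._keys = set(keys)

    def delete_vectors(self, vectorBucketName, indexName, keys):
        self._keys -= set(keys)
        return {}

    def get_vectors(self, vectorBucketName, indexName, keys):
        # Assumed behaviour: missing keys are silently left out of the response
        return {"vectors": [{"key": key} for key in keys if key in self._keys]}


store = FakeStore(["ganga-river", "krishna-river"])
store.delete_vectors(vectorBucketName="demo-bucket", indexName="demo-index", keys=["krishna-river"])

leftover = keys_still_present(store, "demo-bucket", "demo-index", ["krishna-river"])
print("Still present:", leftover if leftover else "none")
```

In the notebook, the same check would take `s3vectors_client`, `bucket_name`, `index_name`, and `vectors_to_delete`.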

## 16. Clean Up - Delete Index

Delete the index we created for testing.

In [None]:
# Delete the index
try:
    response = s3vectors_client.delete_index(
        vectorBucketName=bucket_name,
        indexName=index_name
    )
    
    print("✅ Index deleted successfully!")
    print(json.dumps(response, indent=2, default=str))
    
except Exception as e:
    print(f"❌ Error deleting index: {e}")

## 17. Clean Up - Delete Vector Bucket

Delete the vector bucket we created for testing.

In [None]:
# Delete the vector bucket
try:
    response = s3vectors_client.delete_vector_bucket(
        vectorBucketName=bucket_name
    )
    
    print("✅ Vector bucket deleted successfully!")
    print(json.dumps(response, indent=2, default=str))
    
except Exception as e:
    print(f"❌ Error deleting bucket: {e}")
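
The clean-up cells each repeat the same try/except pattern. As a sketch of one possible consolidation, the helper below runs named steps in order, keeps going after a failure, and reports what went wrong. `run_cleanup` and the two fake steps are hypothetical; the simulated error message is invented for illustration and is not an actual server response.

```python
def run_cleanup(steps):
    """Run (name, callable) steps in order; return (name, error) for each failure."""
    failures = []
    for name, action in steps:
        try:
            action()
            print(f"done: {name}")
        except Exception as exc:
            # Record the failure and continue so later steps still get a chance
            failures.append((name, str(exc)))
            print(f"failed: {name}: {exc}")
    return failures


calls = []


def fake_delete_index():
    """Stand-in for the delete_index call; always succeeds."""
    calls.append("delete_index")


def fake_delete_bucket():
    """Stand-in for the delete_vector_bucket call; simulates a failure."""
    raise RuntimeError("simulated failure")


failures = run_cleanup([
    ("delete index", fake_delete_index),
    ("delete vector bucket", fake_delete_bucket),
])
print(f"{len(failures)} step(s) failed")
```

With the real client, each step could be a `lambda` wrapping the corresponding call, for example `lambda: s3vectors_client.delete_index(vectorBucketName=bucket_name, indexName=index_name)`, listed in the same order the notebook uses (index first, then bucket).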

## 🎉 S3 Vectors Testing Complete!

This notebook demonstrates S3 Vectors API operations using the boto3 SDK with real text embeddings:

### ✅ Core Operations Tested
- **Bucket Operations**: CreateVectorBucket, ListVectorBuckets, DeleteVectorBucket
- **Policy Operations**: PutVectorBucketPolicy, GetVectorBucketPolicy, DeleteVectorBucketPolicy  
- **Index Operations**: CreateIndex, ListIndexes, DeleteIndex
- **Vector Operations**: PutVectors, GetVectors, ListVectors, QueryVectors, DeleteVectors

### 🧠 Key Features Demonstrated
- **Real Text Embeddings**: Using text-embedding-nomic-embed-text-v1.5 model (768 dimensions)
- **Semantic Search**: Query vectors with natural language questions
- **Metadata Filtering**: Filter results by category, region, and other attributes
- **Similarity Ranking**: Results ranked by cosine distance with relevance scoring

### 🌊 Sample Data
- Rivers of India knowledge base with geographic, religious, and economic information
- Semantic understanding of concepts like "sacred", "agriculture", "hydropower"
- Real-world use case demonstrating vector search capabilities

All operations run through the native boto3 SDK with proper error handling! 🚀