A demonstration of semantic caching using AWS-native services that reduces Amazon Bedrock LLM costs and latency through intelligent caching of semantically similar queries.
Purpose: Showcase how to implement semantic search and caching using only AWS services - no external vector databases or third-party dependencies required.
- Semantic Similarity Matching: Different phrasings of the same question return cached responses
- Vector Embeddings: Using Amazon Titan Embeddings V2 (1024 dimensions)
- Serverless Vector Storage: Amazon S3 Vectors for native similarity search
- LLM Integration: Amazon Bedrock with Claude Haiku 4.5
- Cost Optimization: Cache hits are ~10x faster and avoid LLM costs
Deploy and test in 3 commands:
# 1. Check prerequisites
./scripts/check-prereqs.sh
# 2. Deploy infrastructure (⚠️ incurs AWS costs)
./scripts/deploy.sh
# 3. Test the demo
./scripts/test-demo.sh <API_ENDPOINT>
# 4. Clean up (IMPORTANT - avoid ongoing charges)
./scripts/cleanup.sh
- AWS CLI v2+: aws --version
- SAM CLI v1+: sam --version
- AWS Credentials: aws sts get-caller-identity
- Supported Regions: us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1
- Bedrock Models: Auto-enable on first use (no manual setup required)
- Amazon Titan Embeddings V2 (amazon.titan-embed-text-v2:0)
- Anthropic Claude Haiku 4.5 (us.anthropic.claude-haiku-4-5-20251001-v1:0 - inference profile)
- Python 3.12+: For local testing
- jq: For demo script output formatting
| Resource | Estimated Cost | Notes |
|---|---|---|
| S3 Vectors | Usage-based | Storage + query costs |
| Lambda | ~$0.20/1M requests | Plus compute time |
| API Gateway HTTP | ~$1.00/1M requests | Minimal for demo usage |
| Bedrock Embeddings | ~$0.00002/1K tokens | Titan V2 |
| Bedrock LLM | ~$0.001/1K tokens | Claude Haiku 4.5 (input) |
Estimated demo cost: < $1.00 for a few hours of testing
💡 Benefits of this architecture:
- Fully serverless - no fixed infrastructure costs
- Pay only for what you use
- No minimum baseline charges
Run ./scripts/cleanup.sh immediately after testing to avoid ongoing charges!
The semantic cache implements a 7-step flow:
1. Query → 2. Route → 3. Generate Embedding → 4. Vector Search
                                                      ↓
5a. Cache HIT (fast) ←──────────────────── Similarity ≥ 0.85?
                                                      ↓
5b. Cache MISS → 6. LLM Response → 7. Store in Cache
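A condensed sketch of this flow, assuming boto3 `bedrock-runtime` and `s3vectors` clients are passed in (the demo's actual handler.py may differ, and the metadata field name `response` here is an assumption):

```python
import json

SIMILARITY_THRESHOLD = 0.85

def is_cache_hit(distance: float, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    # For a cosine-distance index, similarity = 1 - distance,
    # so a hit requires 1 - distance >= threshold.
    return (1.0 - distance) >= threshold

def handle_query(query, bedrock, s3vectors, bucket, index):
    # Step 3: embed the query with Titan Embeddings V2 (1024 dimensions).
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": query}),
    )
    embedding = json.loads(resp["body"].read())["embedding"]

    # Step 4: nearest-neighbour lookup in the S3 Vectors index.
    result = s3vectors.query_vectors(
        vectorBucketName=bucket, indexName=index,
        queryVector={"float32": embedding}, topK=1,
        returnDistance=True, returnMetadata=True,
    )
    matches = result.get("vectors", [])

    # Step 5a: serve from cache when the best match clears the threshold.
    if matches and is_cache_hit(matches[0]["distance"]):
        return {"source": "cache", "response": matches[0]["metadata"]["response"]}

    # Steps 5b-7 (cache miss): call the LLM and store the new entry -
    # omitted here for brevity.
    return {"source": "bedrock"}
```

The threshold check is the heart of the pattern: a distance of 0.10 (similarity 0.90) is a hit at the default 0.85 threshold, while 0.20 (similarity 0.80) is a miss.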
| Layer | Service | Purpose |
|---|---|---|
| API | API Gateway HTTP API | REST endpoint with CORS |
| Orchestration | Lambda | Steps 2-7 coordination |
| Cache | Amazon S3 Vectors | Serverless vector search + storage |
| AI/ML | Amazon Bedrock | Embeddings + LLM responses |
| Observability | CloudWatch | Metrics + Dashboard |
- Semantic Matching: Different phrasings hit the same cache entry
- Vector Search: Native similarity search with S3 Vectors
- Graceful Degradation: Falls back to direct LLM if cache fails
- Fully Serverless: No VPC required, fast cold starts (~300ms)
- AWS-Native: Uses only AWS services - no external dependencies
- Cost Effective: Pay only for storage and queries used
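The graceful-degradation property can be illustrated in a few lines (a sketch; `cache_lookup` and `llm_call` are hypothetical stand-ins for the demo's cache.py and llm.py layers):

```python
import logging

logger = logging.getLogger("semantic-cache")

def answer(query, cache_lookup, llm_call):
    """Serve from cache when possible; never let a cache failure break the request."""
    try:
        cached = cache_lookup(query)
        if cached is not None:
            return {"source": "cache", "response": cached}
    except Exception as exc:
        # Degrade gracefully: log the cache error and fall through to the LLM.
        logger.warning("cache unavailable, falling back to LLM: %s", exc)
    return {"source": "bedrock", "response": llm_call(query)}
```

The key design choice is that a cache outage costs only latency and money, never availability.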
This demo uses a serverless security model without VPC:
| Control | Implementation |
|---|---|
| Authentication | IAM SigV4 for all AWS API calls |
| Encryption in Transit | TLS for all service communication |
| Encryption at Rest | AWS-managed encryption for S3 Vectors |
| Rate Limiting | API Gateway throttling (100 req/s burst) |
| Input Validation | Lambda validates all inputs |
Why no VPC? S3 Vectors and Bedrock are fully managed services accessed via authenticated APIs. VPC isolation is unnecessary and would add cold start latency.
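The input-validation control in the table above might look like this (a minimal sketch; the 2000-character limit is an arbitrary example, not the demo's actual value):

```python
import json

MAX_QUERY_LEN = 2000  # example limit, not the demo's actual value

def validate_body(raw_body):
    """Return (query, error); exactly one of the two is None."""
    try:
        body = json.loads(raw_body or "{}")
    except json.JSONDecodeError:
        return None, "request body must be valid JSON"
    query = body.get("query")
    if not isinstance(query, str) or not query.strip():
        return None, "'query' must be a non-empty string"
    if len(query) > MAX_QUERY_LEN:
        return None, f"'query' exceeds {MAX_QUERY_LEN} characters"
    return query.strip(), None
```

Rejecting malformed bodies before any Bedrock call also avoids paying for embeddings on garbage input.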
This demo showcases the core semantic caching pattern. For production use, consider:
- API Gateway Authorizers: Add Cognito, IAM, or Lambda authorizers
- API Keys: For client identification and usage tracking
- AWS WAF: Protect against common web exploits
- Lambda Provisioned Concurrency: Eliminate cold starts for consistent latency
- Cache Warming: Pre-populate cache with common queries
- Multi-Region: Deploy to multiple regions for global low latency
- X-Ray Tracing: End-to-end request tracing
- CloudWatch Alarms: Alert on error rates, latency spikes
- Custom Metrics: Track cache hit rates, cost savings
- TTL Strategy: Implement cache expiration based on data freshness needs
- Cache Invalidation: API to clear stale entries when source data changes
- Backup Strategy: Regular exports of cached data if persistence is critical
- Reserved Capacity: For Bedrock if usage is predictable
- Similarity Threshold Tuning: Balance hit rate vs response accuracy
- Response Compression: Reduce storage costs for large responses
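As an illustration of the response-compression idea, gzip plus base64 turns a cached response into a compact plain string (a sketch; whether the compressed blob lives in S3 Vectors metadata or a side store is a design decision the demo leaves open):

```python
import base64
import gzip
import json

def compress_response(payload: dict) -> str:
    # gzip then base64 so the result is a plain string, storable anywhere
    # a string field is accepted.
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")

def decompress_response(blob: str) -> dict:
    return json.loads(gzip.decompress(base64.b64decode(blob)))
```

Natural-language LLM output is repetitive enough that gzip typically shrinks it severalfold.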
cd tests
pip install -r requirements.txt
pytest unit/ -v

# Set API endpoint from deployment
export API_ENDPOINT="https://your-api-id.execute-api.us-east-1.amazonaws.com"
pytest integration/ -v

# Test cache miss (first time)
curl -X POST $API_ENDPOINT/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the capital of France?"}'
# Test cache hit (same query)
curl -X POST $API_ENDPOINT/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the capital of France?"}'
# Test semantic similarity
curl -X POST $API_ENDPOINT/query \
-H "Content-Type: application/json" \
-d '{"query": "Tell me the capital city of France"}'

| Test | Source | Latency | Similarity | Notes |
|---|---|---|---|---|
| First query | bedrock | ~3000ms | N/A | Cache miss - full LLM call |
| Same query | cache | ~300ms | 1.0 | Exact match - 10x faster |
| Similar phrasing | cache | ~400ms | ~0.93 | Semantic match |
| Different topic | bedrock | ~3000ms | N/A | Cache miss - different topic |
Environment variables (set in CloudFormation):
| Variable | Default | Description |
|---|---|---|
| SIMILARITY_THRESHOLD | 0.85 | Minimum similarity for cache hit (0.0-1.0) |
| VECTOR_BUCKET_NAME | Auto | S3 Vectors bucket name (set by deployment) |
| VECTOR_INDEX_NAME | semanticcache | S3 Vectors index name |
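Inside the Lambda these could be read with defaults mirroring the table (a sketch; the demo's code may organize configuration differently):

```python
import os

def load_config(env=None):
    # Defaults mirror the configuration table; VECTOR_BUCKET_NAME has no
    # meaningful default because the deployment injects it.
    env = os.environ if env is None else env
    return {
        "similarity_threshold": float(env.get("SIMILARITY_THRESHOLD", "0.85")),
        "vector_bucket_name": env.get("VECTOR_BUCKET_NAME", ""),
        "vector_index_name": env.get("VECTOR_INDEX_NAME", "semanticcache"),
    }
```

Reading the threshold from the environment lets you tune cache aggressiveness with a stack update, no code change.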
Access via deployment output or:
https://us-east-1.console.aws.amazon.com/cloudwatch/home#dashboards:name=semantic-cache-demo-dashboard
- Cache Hit Rate: Percentage of requests served from cache
- Latency: Response times for cache hits vs misses
- Request Count: Total API requests
- Error Rate: Failed requests by type
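Custom metrics like these are typically emitted with one put_metric_data call per request (a sketch; the namespace and metric names are assumptions, not necessarily what the demo's metrics.py uses):

```python
def put_cache_metrics(cloudwatch, hit: bool, latency_ms: float):
    # One call per request; CloudWatch can then aggregate CacheHit
    # into a hit-rate percentage on the dashboard.
    cloudwatch.put_metric_data(
        Namespace="SemanticCacheDemo",  # assumed namespace
        MetricData=[
            {"MetricName": "CacheHit", "Value": 1.0 if hit else 0.0, "Unit": "Count"},
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )
```

Pass in a boto3 `cloudwatch` client; injecting the client keeps the function trivially unit-testable with a stub.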
# View Lambda logs
aws logs tail /aws/lambda/semantic-cache-demo-handler --follow

semantic-cache-demo/
├── infrastructure/
│   └── template.yaml            # Full CloudFormation template
├── src/cache_orchestrator/      # Lambda function code
│   ├── handler.py               # Main orchestrator (Steps 2-7)
│   ├── embedding.py             # Titan Embeddings (Step 3)
│   ├── cache.py                 # S3 Vectors client (Steps 4, 5a, 7)
│   ├── llm.py                   # Claude LLM (Steps 5b, 6)
│   └── metrics.py               # CloudWatch metrics
├── scripts/                     # Deployment automation
└── tests/                       # Unit + integration tests
# Install dependencies
pip install -r src/cache_orchestrator/requirements.txt
pip install -r tests/requirements.txt
# Run tests
pytest tests/unit/ -v
# Deploy changes
cd infrastructure && sam build && sam deploy

| Issue | Quick Fix |
|---|---|
| Bedrock throttling | Request quota increase: ./scripts/request-quota-increase.sh |
| Cache always misses | Check S3 Vectors index: aws s3vectors list-indexes --vector-bucket-name <bucket> |
| High latency | Expected on first request (cold start). Check CloudWatch dashboard for metrics. |
| Deployment fails | Validate template: sam validate --template infrastructure/template.yaml |
| Tests failing | Install deps: pip install -r tests/requirements.txt && pytest tests/ -v |
Common Commands:
# View logs
aws logs tail /aws/lambda/semantic-cache-demo-handler --follow
# Check stack status
aws cloudformation describe-stacks --stack-name semantic-cache-demo
# Complete reset
./scripts/cleanup.sh && ./scripts/deploy.sh

CRITICAL: Always clean up after testing to avoid ongoing charges.
# Delete all resources
./scripts/cleanup.sh
# Verify cleanup completed
aws cloudformation describe-stacks --stack-name semantic-cache-demo
# Should return: "Stack with id semantic-cache-demo does not exist"
# Check for remaining costs
# https://console.aws.amazon.com/cost-management/home#/dashboard

The cleanup script removes:
- CloudFormation stack (all resources)
- S3 Vectors bucket and index
- CloudWatch log groups
- Local build artifacts
- Full Specification: SPEC.md
- AWS Docs: Bedrock Models | S3 Vectors
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: pytest tests/ -v
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
This project demonstrates semantic caching using AWS-native services only:
- No external vector databases - Uses Amazon S3 Vectors
- No third-party LLM providers - Uses Amazon Bedrock
- No complex infrastructure - Fully serverless, no VPC required
- Production patterns - Graceful degradation, retry logic, observability
Use Cases: Customer support bots, FAQ systems, documentation assistants, or any application with repetitive natural language queries.