[AWS] Story 7: Production Readiness #182

@mfittko

Summary

Final production readiness: load testing, operational runbooks, security review, and go-live checklist.

Epic: #174
Architecture: docs/architecture/planned/aws-ecs-cdk.md


Tasks

Load Testing

Use the existing llm-proxy benchmark tool:

# Basic load test
llm-proxy benchmark \
  --base-url "https://llm-proxy.example.com" \
  --endpoint "/v1/chat/completions" \
  --token "$PROXY_TOKEN" \
  --requests 100 --concurrency 10 \
  --json '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'

# Cache performance test
llm-proxy benchmark \
  --base-url "https://llm-proxy.example.com" \
  --endpoint "/v1/chat/completions" \
  --token "$PROXY_TOKEN" \
  --requests 50 --concurrency 10 \
  --cache --cache-ttl 300 \
  --debug \
  --json '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"cached"}]}'

  • Test with expected production load (requests, concurrency)
  • Test auto-scaling behavior (ramp up load, observe task count)
  • Test cache hit performance (Redis)
  • Identify and address bottlenecks
  • Document baseline performance metrics (latency p50/p95/p99)
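
Two of the tasks above, ramping up load and recording latency percentiles, can be sketched as small shell helpers. This is a minimal sketch under stated assumptions: `ramp_load`, `percentile`, and `BENCH_BIN` are hypothetical names that are not part of llm-proxy, the benchmark flags are only those shown in the commands above, and the latency file (one numeric value per line) is something you would produce yourself, since the benchmark's output format is not described here.

```shell
# Hypothetical helpers for illustration; not part of llm-proxy.
# Set BENCH_BIN=echo for a dry run that only prints the commands.
BENCH_BIN="${BENCH_BIN:-llm-proxy}"

# ramp_load BASE_URL STEP...  runs one benchmark per concurrency step.
ramp_load() {
  base_url="$1"
  shift
  for c in "$@"; do
    # 10 requests per concurrent worker is an arbitrary choice for this sketch.
    "$BENCH_BIN" benchmark \
      --base-url "$base_url" \
      --endpoint "/v1/chat/completions" \
      --token "$PROXY_TOKEN" \
      --requests "$((c * 10))" --concurrency "$c" \
      --json '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'
  done
}

# percentile P FILE  prints the nearest-rank P-th percentile (P is an integer)
# of a file holding one numeric latency value per line (assumed format).
percentile() {
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

A ramp such as `ramp_load "https://llm-proxy.example.com" 5 10 20 40` pairs naturally with watching the ECS task count between steps; `percentile 95 latencies.txt` then gives one of the baseline numbers to record.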

Security Review

  • Review IAM policies (least privilege)
  • Verify all secrets are stored in Secrets Manager
  • Verify encryption at rest (Aurora, Redis, logs)
  • Verify encryption in transit (TLS everywhere)
  • Review security group rules
  • Check for exposed resources
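
The encryption-at-rest and in-transit checks above can be partly scripted with the AWS CLI. A minimal sketch, assuming Redis is deployed as an ElastiCache replication group and that credentials and region are already configured; the function names are hypothetical. In `--output text` mode the CLI prints booleans as `True`/`False` and missing values as `None`, which the filter relies on.

```shell
# Hypothetical helper names; the aws subcommands and output fields are standard AWS CLI.

# Aurora clusters with their storage-encryption flag.
check_aurora_encryption() {
  aws rds describe-db-clusters \
    --query 'DBClusters[].[DBClusterIdentifier,StorageEncrypted]' \
    --output text
}

# Redis replication groups with at-rest and in-transit encryption flags.
# Assumption: Redis runs as an ElastiCache replication group.
check_redis_encryption() {
  aws elasticache describe-replication-groups \
    --query 'ReplicationGroups[].[ReplicationGroupId,AtRestEncryptionEnabled,TransitEncryptionEnabled]' \
    --output text
}

# Log groups with their KMS key. CloudWatch Logs always encrypts data at rest;
# "None" here only means no customer-managed KMS key is associated.
check_log_encryption() {
  aws logs describe-log-groups \
    --query 'logGroups[].[logGroupName,kmsKeyId]' \
    --output text
}

# Keeps only rows where some column after the name is False or None.
needs_attention() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "False" || $i == "None") { print; next } }'
}
```

For example, `check_redis_encryption | needs_attention` prints only the replication groups that have one of the two flags disabled; empty output means nothing was flagged.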

Operational Runbooks

  • Create runbook: Deployment process
  • Create runbook: Scaling (manual and auto)
  • Create runbook: Database operations (backup, restore)
  • Create runbook: Incident response
  • Create runbook: Log analysis
  • Create runbook: Secret rotation
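
As a starting point for the scaling runbook, the manual path could look like the sketch below. It assumes the service runs on ECS and that the cluster and service names are passed in by the operator; `scale_service` is a hypothetical helper name. If service auto scaling is attached, it may change the desired count again afterwards according to its policies.

```shell
# Hypothetical helper: set an ECS service's desired count, then block until
# the service reports a steady state.
scale_service() {
  cluster="$1"
  service="$2"
  count="$3"
  aws ecs update-service \
    --cluster "$cluster" --service "$service" \
    --desired-count "$count" > /dev/null &&
  aws ecs wait services-stable \
    --cluster "$cluster" --services "$service"
}
```

A call such as `scale_service my-cluster my-service 4` uses placeholder names; the real ones come from the deployed stack.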

Documentation

  • Update main README with AWS deployment info
  • Document environment variables mapping
  • Document monitoring and alerting
  • Document cost breakdown and optimization tips

Go-Live Checklist

  • All previous stories completed
  • Load testing passed
  • Security review passed
  • Runbooks reviewed by team
  • Alerting tested (trigger test alarm)
  • Rollback procedure tested
  • DNS cutover plan prepared (if applicable)
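
For the "trigger test alarm" item, one option is to force a CloudWatch alarm into the `ALARM` state and confirm the notification arrives. A minimal sketch, assuming the alarms are CloudWatch alarms; `trigger_test_alarm` is a hypothetical helper and the alarm name is supplied by the operator.

```shell
# Hypothetical helper: temporarily force a CloudWatch alarm into ALARM so the
# notification path can be verified. The alarm returns to its real state on
# its next evaluation.
trigger_test_alarm() {
  alarm_name="$1"
  aws cloudwatch set-alarm-state \
    --alarm-name "$alarm_name" \
    --state-value ALARM \
    --state-reason "Go-live checklist: manual alarm test"
}
```

Because the forced state is short-lived, the check is whether the alert reached its destination, not whether the alarm stays red.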

Production Checklist

Infrastructure

  • VPC and networking configured
  • Aurora PostgreSQL running and accessible
  • ElastiCache Redis running with TLS
  • ECS services healthy
  • ALB routing correctly
  • CloudWatch dashboards populated
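
The "ECS services healthy" item can be checked by comparing running and desired task counts. A minimal sketch with a hypothetical helper name; cluster and service names are placeholders supplied by the caller, and this only covers task counts, not ALB target health.

```shell
# Hypothetical helper: succeeds when the service's running task count equals
# its (non-zero) desired count.
ecs_service_healthy() {
  cluster="$1"
  service="$2"
  counts="$(aws ecs describe-services \
    --cluster "$cluster" --services "$service" \
    --query 'services[0].[runningCount,desiredCount]' \
    --output text)" || return 2
  # Text output is "running<TAB>desired"; split it into positional parameters.
  set -- $counts
  running="$1"
  desired="$2"
  echo "running=$running desired=$desired"
  [ -n "$desired" ] &&
    [ "$desired" -gt 0 ] 2>/dev/null &&
    [ "$running" -eq "$desired" ] 2>/dev/null
}
```

The function's exit status makes it usable in a script, for example `ecs_service_healthy my-cluster my-service || echo "service not at desired count"`.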

Security

  • HTTPS only (HTTP redirects to HTTPS)
  • Secrets in Secrets Manager
  • No hardcoded credentials
  • Security groups restrictive
  • IAM roles scoped appropriately
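
The "HTTPS only" item can be verified from outside with `curl`. A minimal sketch, assuming the ALB answers plain HTTP with a redirect status; `check_http_redirect` is a hypothetical helper, and the host is whatever name the service is published under.

```shell
# Hypothetical helper: succeeds when http://HOST/ answers with a redirect
# whose target is an https:// URL.
check_http_redirect() {
  host="$1"
  # Without -L, %{redirect_url} holds the URL a redirect would lead to.
  result="$(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{redirect_url}' "http://$host/")" || return 2
  code="${result%% *}"
  target="${result#* }"
  echo "status=$code location=$target"
  case "$code" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$target" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Running `check_http_redirect llm-proxy.example.com` against the example host from the load test section would exercise the same path a client takes.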

Operations

  • Alarms configured and tested
  • Log retention set
  • Backup policy verified
  • Auto-scaling tested
  • CI/CD pipeline tested
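
Two of the operations items, log retention and backup policy, can be spot-checked from the CLI. A minimal sketch with hypothetical helper names; it assumes logs live in CloudWatch Logs and the database is an Aurora cluster, as listed in the infrastructure checklist.

```shell
# Hypothetical helpers for illustration.

# Log groups with no retention policy (they keep logs forever), one per line.
log_groups_without_retention() {
  aws logs describe-log-groups \
    --query 'logGroups[?!not_null(retentionInDays)].logGroupName' \
    --output text | tr '\t' '\n'
}

# Aurora clusters with their automated backup retention in days.
aurora_backup_retention() {
  aws rds describe-db-clusters \
    --query 'DBClusters[].[DBClusterIdentifier,BackupRetentionPeriod]' \
    --output text
}
```

Empty output from `log_groups_without_retention` suggests every log group has a retention period set; the backup retention value still needs comparing against whatever policy the team agreed on.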

Acceptance Criteria

  • Load test shows acceptable performance at expected production load, measured against the documented baseline (latency p50/p95/p99)
  • Security review has no critical findings
  • All runbooks created and reviewed
  • Go-live checklist fully complete
  • Production deployment successful

Dependencies

  • All previous stories (1-6)

Estimated Effort

Medium-Large: 3-4 days


Notes

  • Use the existing llm-proxy benchmark tool (no k6 needed)
  • Consider a phased rollout (internal users first)
  • Have a rollback plan ready
  • Monitor closely for the first 24-48 hours
