feat: package events via SQS, remove --stage #383

Merged
drernie merged 15 commits into main from event-security on Apr 15, 2026

@drernie drernie commented Apr 15, 2026

Summary

  • Package events via SQS: EventBridge routes package-revision events through an SQS queue to a dedicated ECS sidecar consumer, replacing the API Gateway /package-event route. A dead-letter queue captures messages that fail repeatedly
  • Remove --stage — all deployments use a single prod stage; the profile determines the environment. Simplifies config, CLI, and deployment tracking
  • Secrets TTL cache with background refresh — prevents 504 on cache miss; SQS consumer applies secrets at startup so bucket filter works on first message

Test plan

  • npm test passes
  • npm run test:local — Docker container builds and serves health check
  • Deploy to dev profile, confirm SQS queue + sidecar container appear in ECS task definition
  • Trigger a package revision event via EventBridge, confirm canvas refreshes
  • Verify DLQ receives messages after repeated failures
  • Confirm --stage flag is rejected by CLI
  • npm run test:integration against deployed stack

🤖 Generated with Claude Code

drernie and others added 15 commits April 13, 2026 15:44
Promote the Unreleased section to [0.16.0] - 2026-04-11, matching the
tag. Attribute bullets to #379 and add a note for the #378 dependency
bumps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Capture #380 (gh-release action v3) and #381 (minor/patch deps) in the
CHANGELOG so the next release cut has them ready.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces EventBridge→API Gateway→ECS with EventBridge→SQS→sidecar
consumer for package-revision events. Captures:

- Problem (5s API Gateway timeout, retry storms, public endpoint).
- Process model pinned at one consumer process per task, bounded by
  asyncio.Semaphore(PACKAGE_EVENT_CONCURRENCY=5).
- Sidecar container (essential: true) in the same task def, sharing
  image and task role with the HTTP container. A consumer crash forces
  ECS to replace the task; the silent-outage risk outweighs the benefit
  of isolating the HTTP container.
- EventBridge rule filters on source + detail-type only; bucket and
  prefix are secret-derived and enforced inside
  refresh_canvas_for_package_event (see 2026-04-11-iac-integrated/
  01-iac-breakage.md).
- Single poison-message policy: never delete on failure, rely on
  maxReceiveCount=5 redrive to DLQ. Refresh function is a total
  function returning RefreshResult.
- Visibility timeout 300s to cover worst-case refresh latency
  (PackageFileFetcher + Athena poll + Benchling SDK each 30s-class).
  Heartbeat cutover documented for when P99 approaches 240s.
- Observability, rollout, verification, and out-of-scope sections.
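
The concurrency bound and the poison-message policy above can be sketched roughly as follows. This is a minimal illustration, not the consumer shipped in this PR: `handle_message`, `process_batch`, and the `ok`/`reason` fields on `RefreshResult` are hypothetical names, and `refresh` and `delete` stand in for the real refresh function and the SQS delete call.

```python
import asyncio
from dataclasses import dataclass

PACKAGE_EVENT_CONCURRENCY = 5  # value quoted in the commit message


@dataclass
class RefreshResult:
    """Outcome of one refresh attempt (field names are assumptions)."""
    ok: bool
    reason: str = ""


async def handle_message(message, refresh, delete, semaphore):
    """Process one message while holding a slot in the semaphore.

    The message is deleted only on success. On failure it is left in the
    queue, so SQS redelivers it after the visibility timeout and moves it
    to the DLQ once maxReceiveCount is exceeded.
    """
    async with semaphore:
        try:
            result = await refresh(message)
        except Exception as exc:  # defensive: treat any raise as a failure
            result = RefreshResult(ok=False, reason=repr(exc))
    if result.ok:
        await delete(message)
    return result


async def process_batch(messages, refresh, delete, semaphore):
    """Handle a batch concurrently, bounded by the shared semaphore."""
    return await asyncio.gather(
        *(handle_message(m, refresh, delete, semaphore) for m in messages)
    )
```

In a long-running consumer the semaphore would be created once per process (for example `asyncio.Semaphore(PACKAGE_EVENT_CONCURRENCY)` inside the main coroutine) and reused across batches, so the bound applies to the whole process rather than to a single receive call.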

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --stage flag was never functional — the API Gateway stage was always
"prod" regardless of the flag value. Stage was only used as a label in
deployment tracking (deployments.json), adding complexity with no benefit.

Changes:
- Remove --stage from deploy/destroy CLI commands
- Simplify DeploymentHistory.active from Record<string, DeploymentRecord>
  to DeploymentRecord | null (one active deployment per profile)
- Remove stage field from DeploymentRecord type and JSON schema
- Hardcode API Gateway stage to "prod" in CDK stack
- Add migration logic in xdg-base.ts to convert legacy deployments.json
- Update all commands, wizards, tests, Makefile, and package.json scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SQS consumer's main() created a config with s3_bucket_name="" but
never called apply_benchling_secrets() before polling. Every message was
silently skipped as "unexpected bucket" because the filter compared the
event bucket against an empty string. Also adds TTL cache (60s) to
get_benchling_secrets() to avoid per-request Secrets Manager latency.
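
A stripped-down sketch of the failure mode, assuming a config object whose bucket name starts empty: `Config`, `apply_secrets`, `should_process`, and the `"s3_bucket_name"` key are illustrative stand-ins, not the project's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Config:
    s3_bucket_name: str = ""  # empty until secrets are applied


def apply_secrets(config, secrets):
    """Hypothetical stand-in for apply_benchling_secrets()."""
    config.s3_bucket_name = secrets["s3_bucket_name"]
    return config


def should_process(config, event_bucket):
    """Bucket filter: only events for the configured bucket are handled."""
    return event_bucket == config.s3_bucket_name


config = Config()
# Before secrets are applied, every real bucket name fails the comparison.
skipped_before = not should_process(config, "example-bucket")

apply_secrets(config, {"s3_bucket_name": "example-bucket"})
# After secrets are applied at startup, the first message already matches.
accepted_after = should_process(config, "example-bucket")
```

Applying the secrets before the first poll is what makes the filter meaningful; without it the comparison is always against an empty string.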

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the TTL cache expires, return the stale cached value immediately
and refresh in a background thread. This ensures no webhook request
ever blocks on a Secrets Manager call (which takes 10-30s in VPC
environments without a VPC endpoint, exceeding the 29s API Gateway
timeout). The lock prevents multiple concurrent refreshes.
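
The stale-while-refresh behavior described here might look roughly like the sketch below. The class name, the injectable `clock`, and the choice to block once on a cold start (when there is no stale value to serve) are assumptions made for illustration; the real `get_benchling_secrets()` may differ.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Serve the cached value immediately; refresh expired entries in a thread."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None
        self._refresh_lock = threading.Lock()
        self._refresh_thread = None

    def get(self):
        if self._fetched_at is None:
            # Cold start: nothing stale to serve, so block this one time.
            self._store(self._fetch())
            return self._value
        value = self._value  # capture before any refresh can replace it
        if self._clock() - self._fetched_at >= self._ttl:
            self._refresh_in_background()
        return value

    def _store(self, value):
        self._value = value
        self._fetched_at = self._clock()

    def _refresh_in_background(self):
        # Non-blocking acquire: if a refresh is already running, do nothing.
        if not self._refresh_lock.acquire(blocking=False):
            return

        def worker():
            try:
                self._store(self._fetch())
            except Exception:
                pass  # keep the stale value; a later get() retries
            finally:
                self._refresh_lock.release()

        thread = threading.Thread(target=worker, daemon=True)
        self._refresh_thread = thread
        try:
            thread.start()
        except RuntimeError:
            self._refresh_lock.release()
            raise
```

The caller that notices the expiry still receives the old value, so request latency never includes the slow fetch; only the next call after the background thread finishes sees the refreshed secrets.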

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Separate ECS and SQS consumer log streams via streamPrefix so they can
be queried independently. Apply a server-side filter to exclude GET
/health entries, which previously filled the fetch limit and hid real
application logs.
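
As a rough illustration, the query parameters could be assembled as below. The function name and the example stream prefix are hypothetical; the keys match the CloudWatch Logs FilterLogEvents parameters that boto3 exposes as `filter_log_events`, and the exclusion pattern should be checked against the CloudWatch filter-pattern documentation before relying on it.

```python
def build_log_query(log_group, stream_prefix, limit=100):
    """Build keyword arguments for CloudWatch Logs FilterLogEvents.

    The result is shaped for a boto3 logs client, e.g.
    client.filter_log_events(**build_log_query(...)).
    """
    return {
        "logGroupName": log_group,
        # Restrict the query to one container's streams (ECS vs SQS consumer).
        "logStreamNamePrefix": stream_prefix,
        # A leading minus excludes events containing the quoted phrase, so
        # health checks are dropped server-side before the limit applies.
        "filterPattern": '-"GET /health"',
        "limit": limit,
    }


# Example: fetch only the consumer's logs (the prefix value is an assumption).
consumer_query = build_log_query("/example/log-group", "sqs-consumer", limit=50)
```

Filtering on the server matters because the fetch limit is applied to the filtered result, so health-check noise no longer crowds out application entries.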

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssing feedback

Use a dedicated .canvas_id sidecar file in S3 so canvas events persist their
canvas_id independently of entry.json, preventing entry events from overwriting
it during concurrent processing. Add immediate "Processing..." canvas feedback
on canvas creation and a best-effort direct canvas update after the export
workflow. Improve error logging with exc_info=True in canvas error handlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev profile always uses standalone deployment flow, even when the
underlying Quilt stack has BenchlingIntegration enabled. This prevents
the setup wizard from routing dev into integrated mode when testing
against shared stacks like quilt-staging. Also adds yes/no formatting
to enquirer confirm prompts and passes --yes to test:dev scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The update-standalone-redeploy flow was missing benchlingSecretArn in
the config builder call, and ran deployCommand before syncSecretsToAWS,
causing deploy to fail with "benchlingSecret is required". Reorder to
sync secrets first (matching deploy-standalone), and pass the discovered
ARN to both standalone config builders.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@drernie drernie merged commit 920f60a into main Apr 15, 2026
3 checks passed
@drernie drernie deleted the event-security branch April 15, 2026 04:41