feat: package events via SQS, remove --stage #383
Merged
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces EventBridge→API Gateway→ECS with EventBridge→SQS→sidecar consumer for package-revision events.

Captures:
- Problem (5s API Gateway timeout, retry storms, public endpoint).
- Process model pinned at one consumer process per task, bounded by asyncio.Semaphore(PACKAGE_EVENT_CONCURRENCY=5).
- Sidecar container (essential: true) in the same task def, sharing image and task role with the HTTP container. Consumer crash forces ECS to replace the task; silent-outage risk outweighs HTTP-isolation.
- EventBridge rule filters on source + detail-type only; bucket and prefix are secret-derived and enforced inside refresh_canvas_for_package_event (see 2026-04-11-iac-integrated/01-iac-breakage.md).
- Single poison-message policy: never delete on failure, rely on maxReceiveCount=5 redrive to DLQ. Refresh function is a total function returning RefreshResult.
- Visibility timeout 300s to cover worst-case refresh latency (PackageFileFetcher + Athena poll + Benchling SDK each 30s-class). Heartbeat cutover documented for when P99 approaches 240s.
- Observability, rollout, verification, and out-of-scope sections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
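The process model and poison-message policy described above can be sketched roughly as follows. This is a minimal illustration, not the PR's consumer: `Message`, `Queue`, `handle_batch`, and the fields of `RefreshResult` are hypothetical stand-ins, and the real consumer would receive messages from and delete them on SQS. Only `PACKAGE_EVENT_CONCURRENCY=5` and the "delete on success only" rule are taken from the commit message.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Protocol, Sequence

PACKAGE_EVENT_CONCURRENCY = 5  # value quoted in the commit message


@dataclass(frozen=True)
class Message:
    # Hypothetical minimal view of a received queue message.
    receipt_handle: str
    body: str


@dataclass(frozen=True)
class RefreshResult:
    # Hypothetical shape; the real RefreshResult fields are not shown in the PR.
    ok: bool
    detail: str = ""


class Queue(Protocol):
    # Hypothetical wrapper interface standing in for an SQS client.
    async def delete(self, receipt_handle: str) -> None: ...


async def handle_batch(
    queue: Queue,
    messages: Sequence[Message],
    refresh: Callable[[str], Awaitable[RefreshResult]],
    concurrency: int = PACKAGE_EVENT_CONCURRENCY,
) -> list:
    """Process messages with bounded concurrency; delete only on success."""
    semaphore = asyncio.Semaphore(concurrency)

    async def handle_one(message: Message) -> RefreshResult:
        async with semaphore:  # at most `concurrency` refreshes in flight
            try:
                result = await refresh(message.body)
            except Exception as exc:
                # Defensive: treat an unexpected raise like a failed result.
                result = RefreshResult(ok=False, detail=repr(exc))
            if result.ok:
                await queue.delete(message.receipt_handle)
            # On failure the message is left alone: it becomes visible again
            # after the visibility timeout and is redriven to the DLQ once
            # maxReceiveCount is exceeded.
            return result

    return list(await asyncio.gather(*(handle_one(m) for m in messages)))
```

Leaving failed messages untouched keeps a single failure path: the queue's redrive policy, rather than consumer code, decides when a message is poison.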
The --stage flag was never functional: the API Gateway stage was always "prod" regardless of the flag value. Stage was only used as a label in deployment tracking (deployments.json), adding complexity with no benefit.

Changes:
- Remove --stage from deploy/destroy CLI commands
- Simplify DeploymentHistory.active from Record<string, DeploymentRecord> to DeploymentRecord | null (one active deployment per profile)
- Remove stage field from DeploymentRecord type and JSON schema
- Hardcode API Gateway stage to "prod" in CDK stack
- Add migration logic in xdg-base.ts to convert legacy deployments.json
- Update all commands, wizards, tests, Makefile, and package.json scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SQS consumer's main() created a config with s3_bucket_name="" but never called apply_benchling_secrets() before polling. Every message was silently skipped as "unexpected bucket" because the filter compared the event bucket against an empty string. Also adds TTL cache (60s) to get_benchling_secrets() to avoid per-request Secrets Manager latency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
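The failure mode described above can be reproduced in a few lines. This is a sketch under assumptions: `Config`, `apply_secrets`, `should_process`, and the `s3_bucket_name` secret key are hypothetical stand-ins (the real `apply_benchling_secrets()` signature is not shown in the PR), and the fail-fast guard is an illustrative addition, not something the commit says it added.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    # Hypothetical minimal config; only the s3_bucket_name field name
    # comes from the commit message.
    s3_bucket_name: str = ""


def apply_secrets(config: Config, secrets: dict) -> Config:
    # Stand-in for apply_benchling_secrets(); the secret key name is assumed.
    return replace(config, s3_bucket_name=secrets["s3_bucket_name"])


def should_process(config: Config, event_bucket: str) -> bool:
    # Assumed guard: without it, an empty configured bucket never equals a
    # real event bucket, so every message is skipped as "unexpected bucket".
    if not config.s3_bucket_name:
        raise RuntimeError("s3_bucket_name is empty; apply secrets before polling")
    return event_bucket == config.s3_bucket_name


if __name__ == "__main__":
    config = apply_secrets(Config(), {"s3_bucket_name": "example-bucket"})
    print(should_process(config, "example-bucket"))  # matching bucket is accepted
```

Raising on an empty bucket turns the silent skip into a loud startup error, which is one way to make this class of misconfiguration visible.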
When the TTL cache expires, return the stale cached value immediately and refresh in a background thread. This ensures no webhook request ever blocks on a Secrets Manager call (which takes 10-30s in VPC environments without a VPC endpoint, exceeding the 29s API Gateway timeout). The lock prevents multiple concurrent refreshes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
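A stale-while-revalidate cache of the kind described here might look roughly like the sketch below. The class name, constructor parameters, and injected clock are hypothetical; only the 60s TTL, the background-thread refresh, and the lock that prevents concurrent refreshes come from the commit messages. The very first call still blocks in this sketch, because there is no stale value to serve yet.

```python
import threading
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class StaleWhileRevalidate(Generic[T]):
    """Serve the cached value immediately; refresh expired values in the background."""

    def __init__(
        self,
        fetch: Callable[[], T],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None
        self._refreshing = False
        self._refresh_thread: Optional[threading.Thread] = None

    def get(self) -> T:
        with self._lock:
            if self._fetched_at is None:
                # Cold start: nothing stale to serve, so this call blocks.
                self._value = self._fetch()
                self._fetched_at = self._clock()
                return self._value
            expired = self._clock() - self._fetched_at >= self._ttl
            if expired and not self._refreshing:
                # The flag ensures only one refresh runs at a time.
                self._refreshing = True
                self._refresh_thread = threading.Thread(
                    target=self._refresh, daemon=True
                )
                self._refresh_thread.start()
            return self._value  # possibly stale, but returned without waiting

    def _refresh(self) -> None:
        try:
            value = self._fetch()  # slow call happens outside the lock
            with self._lock:
                self._value = value
                self._fetched_at = self._clock()
        except Exception:
            # Keep serving the stale value; a real implementation would log this.
            pass
        finally:
            with self._lock:
                self._refreshing = False
```

The trade-off is that callers may see a value older than the TTL for as long as one refresh takes, in exchange for never waiting on the slow fetch after the first call.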
Separate ECS and SQS consumer log streams via streamPrefix so they can be queried independently. Apply a server-side filter to exclude GET /health entries, which previously filled the fetch limit and hid real application logs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssing feedback

Use a dedicated .canvas_id sidecar file in S3 so canvas events persist their canvas_id independently of entry.json, preventing entry events from overwriting it during concurrent processing. Add immediate "Processing..." canvas feedback on canvas creation and a best-effort direct canvas update after the export workflow. Improve error logging with exc_info=True in canvas error handlers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev profile always uses standalone deployment flow, even when the underlying Quilt stack has BenchlingIntegration enabled. This prevents the setup wizard from routing dev into integrated mode when testing against shared stacks like quilt-staging. Also adds yes/no formatting to enquirer confirm prompts and passes --yes to test:dev scripts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The update-standalone-redeploy flow was missing benchlingSecretArn in the config builder call, and ran deployCommand before syncSecretsToAWS, causing deploy to fail with "benchlingSecret is required". Reorder to sync secrets first (matching deploy-standalone), and pass the discovered ARN to both standalone config builders. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Route package-revision events through an SQS queue to a dedicated ECS sidecar consumer, replacing the API Gateway /package-event route, with a dead-letter queue for reliability
- Remove --stage: all deployments use a single prod stage; the profile determines the environment. Simplifies config, CLI, and deployment tracking

Test plan
- npm test passes
- npm run test:local (Docker container builds and serves health check)
- --stage flag is rejected by CLI
- npm run test:integration against deployed stack

🤖 Generated with Claude Code