
fix(monitoring): set MemoryTelemetry logger to INFO for production #3386

Merged
waleedlatif1 merged 1 commit into staging from waleedlatif1/add-aws-monitoring
Feb 28, 2026

waleedlatif1 commented Feb 28, 2026

Summary

  • Production logger defaults to ERROR-only, which silently suppressed our logger.info('Memory snapshot', ...) calls
  • Adds { logLevel: 'INFO' } override specifically for the MemoryTelemetry logger so snapshots appear in CloudWatch
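
A minimal sketch of how a per-logger level override like this can work. The `createLogger` helper, its options shape, and `DEFAULT_LEVEL` below are hypothetical stand-ins written for illustration, not the repository's actual logger API; only the `{ logLevel: 'INFO' }` option, the `MemoryTelemetry` name, and the `'Memory snapshot'` message come from this PR.

```typescript
// Hypothetical, self-contained logger; the real logger in the repo may differ.
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR'

const LEVEL_ORDER: Record<LogLevel, number> = { DEBUG: 0, INFO: 1, WARN: 2, ERROR: 3 }

// Assumption: production defaults to ERROR-only, as the summary describes.
const DEFAULT_LEVEL: LogLevel = 'ERROR'

interface LoggerOptions {
  logLevel?: LogLevel
}

function createLogger(name: string, options: LoggerOptions = {}) {
  const minLevel: LogLevel = options.logLevel ?? DEFAULT_LEVEL
  const emitted: string[] = [] // kept so the behavior is easy to inspect

  const log = (level: LogLevel, message: string): void => {
    // Drop anything below this logger's minimum level.
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return
    const line = `[${level}] [${name}] ${message}`
    emitted.push(line)
    console.log(line)
  }

  return {
    info: (message: string): void => log('INFO', message),
    error: (message: string): void => log('ERROR', message),
    emitted,
  }
}

const defaultLogger = createLogger('Default')
const telemetryLogger = createLogger('MemoryTelemetry', { logLevel: 'INFO' })

defaultLogger.info('Memory snapshot') // suppressed under the ERROR-only default
telemetryLogger.info('Memory snapshot') // emitted because of the override
```

Scoping the override to one logger keeps the rest of the production output at ERROR-only while letting the snapshots through.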

Test plan

  • Deploy to staging and verify Memory snapshot entries appear in /ecs/sim-staging/us-east-1/app CloudWatch logs every 60s

vercel bot commented Feb 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Feb 28, 2026 9:54pm |

…ion visibility

Production defaults to ERROR-only logging. Without this override,
memory snapshots would be silently suppressed.
waleedlatif1 force-pushed the waleedlatif1/add-aws-monitoring branch from 993dd39 to 4788705 on February 28, 2026 21:54
waleedlatif1 merged commit 3788660 into staging on Feb 28, 2026
11 checks passed
waleedlatif1 deleted the waleedlatif1/add-aws-monitoring branch on February 28, 2026 21:58
greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR implements memory telemetry for production monitoring and includes several defensive improvements to prevent memory leaks and unbounded growth.

Key changes:

  • Added MemoryTelemetry logger with logLevel: 'INFO' override to ensure memory snapshots appear in CloudWatch (fixes the production logger's ERROR-only default)
  • Implemented SSE connection tracking across all streaming endpoints (a2a-message, a2a-resubscribe, mcp-events, wand, workflow-execute, execution-stream-reconnect)
  • Added protection against unbounded memory growth during Redis failures by limiting pending event queues
  • Limited stderr accumulation in isolated VM workers to 64KB
  • Refactored cleanup logic in A2A resubscribe handler to eliminate duplication
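
As an illustration of the stderr cap listed above, here is a minimal sketch. The helper and constant names are hypothetical, and the choice to keep the earliest output and discard the overflow is an assumption, since the summary only states the 64KB limit. For simplicity the sketch counts string length (UTF-16 code units) rather than bytes, which matches only for ASCII output.

```typescript
// Hypothetical names; only the 64KB figure comes from the PR summary.
const MAX_STDERR_LENGTH = 64 * 1024

// Append a chunk without letting the accumulated text exceed the limit.
// Assumption: once the cap is reached, further output is discarded.
function appendCapped(current: string, chunk: string, limit: number = MAX_STDERR_LENGTH): string {
  const remaining = limit - current.length
  if (remaining <= 0) return current
  return current + chunk.slice(0, remaining)
}

// Simulate a worker that writes 100 chunks of 1 KB each to stderr.
let stderrText = ''
for (let i = 0; i < 100; i++) {
  stderrText = appendCapped(stderrText, 'x'.repeat(1024))
}
console.log(stderrText.length)
```

A worker that floods stderr then costs a fixed amount of memory instead of growing for as long as it runs.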

Monitoring improvements:
Memory snapshots now log every 60s including heap stats, RSS, active resources, and SSE connection counts by route, enabling correlation between connection leaks and memory spikes.
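
The shape of such a snapshot might look roughly like the sketch below. Every field and function name here is hypothetical; the actual payload written by `memory-telemetry.ts` is not shown in this conversation. The builder is a pure function so it can be exercised without a running server.

```typescript
// Hypothetical shapes; field names are illustrative, not taken from the PR.
interface MemoryUsageLike {
  rss: number
  heapUsed: number
  heapTotal: number
  external: number
}

interface MemorySnapshot {
  rssMB: number
  heapUsedMB: number
  heapTotalMB: number
  externalMB: number
  activeResources: number
  sseConnections: Record<string, number>
  sseTotal: number
}

// Convert bytes to megabytes, rounded to one decimal place.
const toMB = (bytes: number): number => Math.round((bytes / (1024 * 1024)) * 10) / 10

function buildSnapshot(
  usage: MemoryUsageLike,
  activeResources: number,
  sseByRoute: Record<string, number>
): MemorySnapshot {
  const sseTotal = Object.values(sseByRoute).reduce((sum, count) => sum + count, 0)
  return {
    rssMB: toMB(usage.rss),
    heapUsedMB: toMB(usage.heapUsed),
    heapTotalMB: toMB(usage.heapTotal),
    externalMB: toMB(usage.external),
    activeResources,
    sseConnections: { ...sseByRoute }, // copy so later mutations do not alter the snapshot
    sseTotal,
  }
}

const MB = 1024 * 1024
const example = buildSnapshot(
  { rss: 256 * MB, heapUsed: 128 * MB, heapTotal: 192 * MB, external: 1.5 * MB },
  12,
  { 'mcp-events': 2, wand: 1 }
)
console.log(example)
```

In a Node.js process, the inputs could plausibly come from `process.memoryUsage()` and the length of `process.getActiveResourcesInfo()`, with the builder called from a 60-second timer. Logging per-route counts next to heap and RSS figures is what makes it possible to line up a connection leak with a memory spike.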

Defensive measures:

  • Event buffers drop oldest events when exceeding limits during sustained Redis failures
  • Worker stderr is capped to prevent unbounded accumulation
  • SSE tracking uses guard flags to prevent double-decrement in all exit paths
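
The guard-flag pattern mentioned in the last bullet can be sketched as follows. The function names and the module-level `Map` are hypothetical and only illustrate the idea; the real tracker in `sse-connections.ts` may be structured differently.

```typescript
// Hypothetical in-memory tracker keyed by route name.
const activeByRoute = new Map<string, number>()

function incrementRoute(route: string): void {
  activeByRoute.set(route, (activeByRoute.get(route) ?? 0) + 1)
}

function decrementRoute(route: string): void {
  const next = (activeByRoute.get(route) ?? 0) - 1
  // Remove empty entries so the map itself cannot grow without bound.
  if (next <= 0) activeByRoute.delete(route)
  else activeByRoute.set(route, next)
}

// Registers a connection and returns a release function that is safe to
// call from every exit path (finally block, cancel handler, error handler).
function trackConnection(route: string): () => void {
  incrementRoute(route)
  let released = false // guard flag: ensures at most one decrement
  return (): void => {
    if (released) return
    released = true
    decrementRoute(route)
  }
}

const releaseA = trackConnection('mcp-events')
const releaseB = trackConnection('mcp-events')

releaseA() // e.g. called from a finally block
releaseA() // e.g. called again from cancel(); ignored by the guard

console.log(activeByRoute.get('mcp-events'))
```

Without the guard, a stream that both finishes and is cancelled would decrement twice, and the reported counts would drift below the true number of open connections.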

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • All changes are defensive improvements for production monitoring and memory leak prevention. The SSE connection tracking logic properly uses guard flags to prevent double-decrement, cleanup functions are correctly called in all exit paths (finally blocks and cancel methods), and the memory telemetry logger override is the intended fix for production visibility.
  • No files require special attention; verify in staging that Memory snapshot logs appear in CloudWatch as expected

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/monitoring/memory-telemetry.ts | New file that logs memory snapshots with INFO level override to ensure production visibility |
| apps/sim/lib/monitoring/sse-connections.ts | New SSE connection tracker using in-memory Map for memory leak diagnostics |
| apps/sim/instrumentation-node.ts | Starts memory telemetry monitoring on app initialization |
| apps/sim/app/api/a2a/serve/[agentId]/route.ts | Added SSE connection tracking with proper increment/decrement guards, refactored cleanup logic |
| apps/sim/app/api/mcp/events/route.ts | Added SSE connection tracking with cleanup guard to prevent double-decrement |
| apps/sim/lib/execution/event-buffer.ts | Added pending event limit with oldest-first eviction during Redis outages |
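
The oldest-first eviction described for `event-buffer.ts` can be illustrated with a small bounded queue. This is a sketch under assumptions: the class name, the limit of 3, and the `droppedCount` counter are hypothetical, and the actual limit and data structure used in the PR are not shown in this conversation.

```typescript
// Hypothetical bounded queue that evicts the oldest item when full.
class BoundedQueue<T> {
  private items: T[] = []
  private dropped = 0
  private readonly maxSize: number

  constructor(maxSize: number) {
    if (!Number.isInteger(maxSize) || maxSize < 1) {
      throw new RangeError('maxSize must be a positive integer')
    }
    this.maxSize = maxSize
  }

  // Add an item; if the queue is over the limit, drop from the front (oldest).
  push(item: T): void {
    this.items.push(item)
    while (this.items.length > this.maxSize) {
      this.items.shift()
      this.dropped++
    }
  }

  // Hand back everything pending (e.g. once the backing store recovers) and reset.
  drain(): T[] {
    const out = this.items
    this.items = []
    return out
  }

  get size(): number {
    return this.items.length
  }

  get droppedCount(): number {
    return this.dropped
  }
}

// Simulate five events arriving while flushes are failing, with room for three.
const pending = new BoundedQueue<string>(3)
for (const id of ['e1', 'e2', 'e3', 'e4', 'e5']) {
  pending.push(id)
}
console.log(pending.drain(), pending.droppedCount)
```

Dropping the oldest events trades completeness for a hard ceiling on memory during a sustained outage; tracking how many were dropped keeps that loss visible.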

Last reviewed commit: 4788705

greptile-apps bot left a comment

11 files reviewed, no comments

waleedlatif1 added a commit that referenced this pull request Mar 1, 2026
…ion visibility (#3386)

Production defaults to ERROR-only logging. Without this override,
memory snapshots would be silently suppressed.