
fix(otel): make service.instance.id unique per process#4891

Merged
TheodoreSpeaks merged 1 commit into staging from fix/otel-unique-instance-id
Jun 5, 2026

@TheodoreSpeaks
Collaborator

Problem

Every Sim app replica reports OTel telemetry with the same hardcoded service.instance.id (mothership-sim), because instrumentation-node.ts builds it from a constant slug. With more than one replica (the staging app runs 2 tasks), all replicas write to the same Prometheus series. Each process keeps its own independent cumulative counter, so the merged series interleaves values from two sources. Every step from the higher counter down to the lower one looks like a counter reset, and rate()/increase() treat the post-drop value as fresh increase, which inflates the result.
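To make the failure mode concrete, here is a minimal sketch of reset handling on a merged series. It is a simplified model, not Prometheus's actual implementation (it ignores extrapolation and timestamps), and the sample values and the `counterIncrease` helper are hypothetical, chosen only for illustration.

```typescript
// Simplified model of counter handling: any drop between consecutive samples
// is counted as a reset, and the post-drop value is added in full, as if the
// counter had restarted from zero.
function counterIncrease(samples: number[]): { increase: number; resets: number } {
  let increase = 0;
  let resets = 0;
  let prev = samples[0] ?? 0;
  for (const curr of samples.slice(1)) {
    if (curr < prev) {
      resets += 1;
      increase += curr;
    } else {
      increase += curr - prev;
    }
    prev = curr;
  }
  return { increase, resets };
}

// Hypothetical cumulative counter values (in cents) from two replicas.
const replicaA = [10, 12, 15];
const replicaB = [3, 4, 6];

// With a shared service.instance.id, both replicas land in one series and
// their samples alternate.
const mergedSeries = [10, 3, 12, 4, 15, 6];

const merged = counterIncrease(mergedSeries);
const trueIncrease =
  counterIncrease(replicaA).increase + counterIncrease(replicaB).increase;

console.log(merged, trueIncrease);
```

In this toy data the two replicas really add 8 cents between them, but the merged series reports 3 resets and an increase of 33, which is the same shape as the inflation described above.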

Observed on staging (grafanacloud-prom):

  • resets(hosted_key_cost_charged_USD_total[40m]) = 7+ while resets(hosted_key_used_total[...]) = 0
  • "Total cost charged" inflated to ~$0.72 from a few real cents
  • service_instance_id / instance each have exactly one value despite 2 running tasks

No-key metrics (cost_charged, throttled, queue_wait_*, queue_wait_exceeded) collide fully; key-labeled ones (used/failed/upstream) are only probabilistically protected by the differing key label.

Fix

Append hostname() (the container id under ECS/Fargate, unique per task) to service.instance.id. Each replica becomes its own clean cumulative-counter series, so sum(rate(...)) / sum(increase(...)) aggregate correctly across replicas. The mothership-sim prefix is preserved so Jaeger's clock-skew adjuster still separates Sim from Go spans.
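The change described above might look roughly like the following sketch. The actual diff in instrumentation-node.ts is not shown on this page, so the constant name `SERVICE_SLUG`, the helper `buildServiceInstanceId`, and the `-` separator are assumptions; only `hostname()` from `node:os` and the mothership-sim prefix come from the PR text.

```typescript
import { hostname } from 'node:os'

// Hypothetical name for the constant slug the PR says the id was built from.
const SERVICE_SLUG = 'mothership-sim'

// Hypothetical helper: keeps the slug as a prefix and appends the host name.
// The "-" separator is an assumption, not confirmed by the PR.
function buildServiceInstanceId(slug: string, host: string): string {
  return `${slug}-${host}`
}

// Under ECS/Fargate the PR states that hostname() is the container id,
// which is unique per task, so each replica gets its own id.
const serviceInstanceId = buildServiceInstanceId(SERVICE_SLUG, hostname())

console.log(serviceInstanceId)
```

Keeping the slug as a prefix rather than replacing it is what lets anything that matches on mothership-sim continue to recognise Sim telemetry.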

Notes

  • Emitter-only fix; historical corrupted series won't self-heal, but data after deploy is clean.
  • No dashboard changes needed — existing sum(rate(...))/sum(increase(...)) queries become correct once instance ids are unique.
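As an illustration of why the existing sum(...) queries need no changes, the sketch below groups samples by instance id before computing each series' increase, then sums the results. It is a simplified model of the aggregation, not Prometheus's real evaluation; the `Sample` type, the helper names, and the values are hypothetical, and the instance ids follow the illustrative ones in the review's sequence diagram.

```typescript
interface Sample {
  instanceId: string;
  value: number;
}

// Increase of one cumulative series; a drop is treated as a reset from zero.
function seriesIncrease(values: number[]): number {
  let total = 0;
  let prev = values[0] ?? 0;
  for (const curr of values.slice(1)) {
    total += curr < prev ? curr : curr - prev;
    prev = curr;
  }
  return total;
}

// Rough analogue of sum(increase(...)): one series per instance id, summed.
function sumIncreaseAcrossInstances(samples: Sample[]): number {
  const series: Record<string, number[]> = {};
  for (const s of samples) {
    const values = series[s.instanceId] ?? [];
    values.push(s.value);
    series[s.instanceId] = values;
  }
  let total = 0;
  for (const id of Object.keys(series)) {
    total += seriesIncrease(series[id] ?? []);
  }
  return total;
}

// Hypothetical interleaved arrivals from two replicas (values in cents).
const values = [10, 3, 12, 4, 15, 6];
const uniqueIds = values.map((value, i) => ({
  instanceId: i % 2 === 0 ? 'mothership-sim-abc123' : 'mothership-sim-def456',
  value,
}));
const sharedId = values.map((value) => ({ instanceId: 'mothership-sim', value }));

const afterFix = sumIncreaseAcrossInstances(uniqueIds);
const beforeFix = sumIncreaseAcrossInstances(sharedId);

console.log({ afterFix, beforeFix });
```

The aggregation code is identical in both calls; only the instance ids differ, which mirrors the note that the fix lives entirely on the emitter side.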

🤖 Generated with Claude Code

All app replicas shared a hardcoded service.instance.id ("mothership-sim"),
so OTel metrics from every process collapsed into one Prometheus series.
Their independent cumulative counters then interleaved, producing phantom
counter resets that corrupt rate()/increase() — staging hosted-key cost
inflated to ~$0.72 from a few cents, while no-`key` metrics (cost_charged,
throttled, queue_wait_*) were affected fleet-wide.

Append the hostname (the container id under ECS, unique per task) so each
replica gets its own series and sum(rate(...)) / sum(increase(...)) aggregate
correctly. The mothership-sim prefix is kept so Jaeger's clock-skew adjuster
still separates Sim from Go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 5, 2026 1:45am

@cursor

cursor Bot commented Jun 5, 2026

PR Summary

Low Risk
Telemetry resource attribute only; no auth, API, or business-logic changes. The only effect is on how metrics and traces are labeled after deploy.

Overview
OpenTelemetry service.instance.id in instrumentation-node.ts now includes hostname() (via node:os) after the existing mothership-sim slug, so each ECS/Fargate replica exports metrics under its own Prometheus series instead of colliding on one id.

Comments were updated to explain the per-process uniqueness requirement (cumulative counter interleaving / bad rate() and increase()) while keeping the Sim slug for Jaeger clock-skew separation from Go.

Reviewed by Cursor Bugbot for commit e4aa8ae.

@greptile-apps
Contributor

greptile-apps Bot commented Jun 5, 2026

Greptile Summary

This PR fixes a Prometheus counter-collision bug where all Fargate replicas of the Sim app were sharing the same hardcoded service.instance.id (mothership-sim). By appending os.hostname() (which resolves to the unique container ID under ECS), each replica now emits its own clean cumulative-counter series, making sum(rate(...)) and sum(increase(...)) aggregations correct.

  • Adds import { hostname } from 'node:os' and appends hostname() to the serviceInstanceId string at SDK initialisation time — one-line logic change, no schema or config changes required.
  • Historical corrupted series are not self-healed, but all telemetry produced after the deploy will be clean; existing dashboards using sum(rate(...)) need no changes.

Confidence Score: 5/5

Safe to merge — a targeted one-line change to the OTel bootstrap with no runtime risk and clear upside in production observability.

os.hostname() is a synchronous, always-available Node built-in that returns the container ID under ECS/Fargate. The change is confined to SDK initialisation, affects only the resource label emitted to the OTel backend, and introduces no new error paths. The mothership-sim prefix is preserved for Jaeger compatibility.

No files require special attention.

Important Files Changed

Filename Overview
apps/sim/instrumentation-node.ts Appends hostname() to service.instance.id so each ECS/Fargate replica owns a unique Prometheus series, fixing counter-reset inflation in rate()/increase() queries.

Sequence Diagram

sequenceDiagram
    participant R1 as Replica 1 (hostname: abc123)
    participant R2 as Replica 2 (hostname: def456)
    participant Prom as Prometheus/Grafana Cloud

    Note over R1,R2: Before fix — both use service.instance.id = mothership-sim
    R1->>Prom: "counter=10 {instance=mothership-sim}"
    R2->>Prom: "counter=3  {instance=mothership-sim}"
    Note over Prom: Series interleave → phantom reset → rate() inflated

    Note over R1,R2: After fix — unique service.instance.id per container
    R1->>Prom: "counter=10 {instance=mothership-sim-abc123}"
    R2->>Prom: "counter=3  {instance=mothership-sim-def456}"
    Note over Prom: Two clean series → sum(rate()) correct

Reviews (1): Last reviewed commit: "fix(otel): make service.instance.id uniq..."

@TheodoreSpeaks TheodoreSpeaks merged commit f7f7840 into staging Jun 5, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the fix/otel-unique-instance-id branch June 5, 2026 01:58