Skip to content

fix(metrics): fork-safe OTel init + per-process service.instance.id#178

Merged
pavanputhra merged 1 commit into
mainfrom
fix/otel-fork-safe-init
May 20, 2026
Merged

fix(metrics): fork-safe OTel init + per-process service.instance.id#178
pavanputhra merged 1 commit into
mainfrom
fix/otel-fork-safe-init

Conversation

@pavanputhra
Copy link
Copy Markdown
Contributor

Summary

When conserver spawns multiple worker processes via multiprocessing, every forked worker inherits the parent's `lib.metrics` module state — including the already-initialized OTel SDK and the cached instrument handles. The OTel resource has no per-process attribute, so every forked worker emits metrics sharing the same fingerprint at the collector. The metrics backend collapses those independent writes to a single series; the most recent worker's value overwrites the others.

Visible effect

With `CONSERVER_WORKERS=2` and `CONSERVER_VCON_CONCURRENCY=128`, the new `conserver.vcons.inflight` UpDownCounter peaks at exactly 128 (one worker's pool max) instead of the expected ~512 (2 pods × 2 workers × 128 thread capacity). `spaceAggregation: sum` across pods can't recover the cluster total — there's only one series to sum.

The same collision affects every conserver metric, not just inflight. Every counter and histogram has been showing one-worker numbers, not cluster totals.

Fix

Two parts:

  1. Track init state by `os.getpid()` instead of a bare bool. A forked child has a different pid than the parent that initialized the module, so it re-enters the init path and rebuilds its own MeterProvider in the child's address space. The parent's single-call optimization is preserved.

  2. Add `service.instance.id = "-pid-"` to the OTel resource. Each forked worker now has a unique fingerprint at the collector. Downstream queries can `spaceAggregation: sum` across instances to get true cluster totals.

The re-init in the child also clears the cached instrument handles inherited from the parent — those were bound to the parent's MeterProvider and won't export through the child's exporter.

Verification

  • New `test_otel_fork_safe_init.py` (5 tests): pid-based gate, re-init on pid change, instance.id resource attribute, endpoint-unset path, exporter-failure path.
  • `conftest.py` and `test_inflight_counter.py` updated to patch the renamed `_otel_initialized_pid`.
  • 35 tests pass locally (queue/inflight/dlq/parallel suites + new tests).
  • After deploy: `conserver.vcons.inflight` peak should reach 2 pods × 2 workers × 128 = ~512. Distinct series per `service.instance.id` should appear in the metrics backend. Dashboard queries using `spaceAggregation: sum` produce true cluster aggregates.

Risk / scope

This changes the cardinality of EVERY conserver metric series — what was 1 series per metric+attrs becomes N series (one per worker process). For us this is the intended outcome; query-time `sum by (chain.name)` recovers the previous shape minus the per-worker dimension. Worth flagging for anyone querying these metrics: `spaceAggregation: max` queries will start showing per-worker max (likely lower than current behaviour); `sum` queries will show the new cluster total.

Dashboards and alerts in `vconic-ops-deploy` already use `spaceAggregation: sum` everywhere it matters (inflight, queue depth, error rates) so they pick up the correct semantics automatically.

🤖 Generated with Claude Code

When conserver spawns multiple worker processes via multiprocessing,
each fork inherits the parent's lib.metrics module state — including
the already-initialized OTel SDK and the cached instrument handles.
The OTel resource doesn't include any per-process attribute, so every
forked worker emits metrics that share the same fingerprint at the
collector. Downstream backends collapse those independent writes into
a single series and the most recent worker's value overwrites the
others.

Visible effect: with WORKERS=2 and CONSERVER_VCON_CONCURRENCY=128, the
new ``conserver.vcons.inflight`` UpDownCounter peaks at exactly 128
(one worker's pool) instead of the expected ~512 (2 pods × 2 workers ×
128 capacity). Sum-across-pods queries can't recover the true cluster
total. The same collision affects every conserver metric, not just
inflight — every counter and histogram has been showing one-worker
numbers, not cluster totals.

Two-part fix:

1. Track init state by ``os.getpid()`` instead of a bare bool. A
   forked child has a different pid than the parent that initialized
   the module, so it re-enters the init path and rebuilds its own
   MeterProvider in the child's address space. The pid-based gate
   keeps the parent's single-call optimization intact.

2. Add ``service.instance.id = "<host>-pid-<pid>"`` to the OTel
   resource so each forked worker has a unique fingerprint at the
   collector. Downstream queries can now ``spaceAggregation: sum``
   across instances to get true cluster totals.

Also clears the cached instrument handles on re-init in the child —
they were created against the parent's MeterProvider and won't export
correctly through the child's exporter.

Tests:
- New ``test_otel_fork_safe_init.py`` covers pid-based gate, re-init
  on pid change, instance.id resource attribute, and the
  endpoint-unset / exporter-failure paths.
- Existing ``conftest.py`` and ``test_inflight_counter.py`` updated
  to patch the new ``_otel_initialized_pid`` (renamed from the old
  bare ``_otel_initialized``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pavanputhra pavanputhra merged commit b0a5a00 into main May 20, 2026
1 check passed
@pavanputhra pavanputhra deleted the fix/otel-fork-safe-init branch May 20, 2026 18:55
pavanputhra added a commit that referenced this pull request May 20, 2026
… global

Follow-up to #178. The fork-safe init landed, but
SignOz queries still showed only one worker's value (peak inflight 128
across 4 worker processes instead of the expected ~512). Root cause:
the user-provided OTel resource (``host.name``, ``service.instance.id``)
never reached the metrics backend.

OpenTelemetry Python's ``metrics.set_meter_provider()`` is single-call.
When ``opentelemetry-instrumentation`` (auto-instrumentation) is active
in the process, it registers a default MeterProvider early in startup —
before this module's lazy ``_init_otel_metrics()`` runs. Our subsequent
``set_meter_provider(our_provider)`` call is silently ignored, and
``metrics.get_meter(__name__)`` returns the global proxy bound to the
auto-instrumentation's provider — whose resource we don't control.

The visible CH symptom: ``resource_attrs`` on every conserver.*
metric only contains ``service.name`` + ``telemetry.sdk.*`` +
``telemetry.auto.version``; ``host.name`` and ``service.instance.id``
that this module sets are dropped. All four worker processes collapse
onto a single fingerprint, last-write-wins.

Fix: bind ``meter`` to OUR provider directly via
``provider.get_meter(__name__)``, never touching the global. Our
provider has the correct resource and its own export pipeline to
the OTel collector. Auto-instrumentation metrics continue to flow
through their own pipeline independently — both arrive at the
collector.

Tests updated to assert ``set_meter_provider`` is NOT called and the
``meter`` global is bound to the provider's own meter, not the global
proxy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pavanputhra added a commit that referenced this pull request May 20, 2026
… global (#179)

Follow-up to #178. The fork-safe init landed, but
SignOz queries still showed only one worker's value (peak inflight 128
across 4 worker processes instead of the expected ~512). Root cause:
the user-provided OTel resource (``host.name``, ``service.instance.id``)
never reached the metrics backend.

OpenTelemetry Python's ``metrics.set_meter_provider()`` is single-call.
When ``opentelemetry-instrumentation`` (auto-instrumentation) is active
in the process, it registers a default MeterProvider early in startup —
before this module's lazy ``_init_otel_metrics()`` runs. Our subsequent
``set_meter_provider(our_provider)`` call is silently ignored, and
``metrics.get_meter(__name__)`` returns the global proxy bound to the
auto-instrumentation's provider — whose resource we don't control.

The visible CH symptom: ``resource_attrs`` on every conserver.*
metric only contains ``service.name`` + ``telemetry.sdk.*`` +
``telemetry.auto.version``; ``host.name`` and ``service.instance.id``
that this module sets are dropped. All four worker processes collapse
onto a single fingerprint, last-write-wins.

Fix: bind ``meter`` to OUR provider directly via
``provider.get_meter(__name__)``, never touching the global. Our
provider has the correct resource and its own export pipeline to
the OTel collector. Auto-instrumentation metrics continue to flow
through their own pipeline independently — both arrive at the
collector.

Tests updated to assert ``set_meter_provider`` is NOT called and the
``meter`` global is bound to the provider's own meter, not the global
proxy.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pavanputhra added a commit that referenced this pull request May 21, 2026
…context present (#181)

* fix(metrics): bind meter to our provider directly, bypass single-call global

Follow-up to #178. The fork-safe init landed, but
SignOz queries still showed only one worker's value (peak inflight 128
across 4 worker processes instead of the expected ~512). Root cause:
the user-provided OTel resource (``host.name``, ``service.instance.id``)
never reached the metrics backend.

OpenTelemetry Python's ``metrics.set_meter_provider()`` is single-call.
When ``opentelemetry-instrumentation`` (auto-instrumentation) is active
in the process, it registers a default MeterProvider early in startup —
before this module's lazy ``_init_otel_metrics()`` runs. Our subsequent
``set_meter_provider(our_provider)`` call is silently ignored, and
``metrics.get_meter(__name__)`` returns the global proxy bound to the
auto-instrumentation's provider — whose resource we don't control.

The visible CH symptom: ``resource_attrs`` on every conserver.*
metric only contains ``service.name`` + ``telemetry.sdk.*`` +
``telemetry.auto.version``; ``host.name`` and ``service.instance.id``
that this module sets are dropped. All four worker processes collapse
onto a single fingerprint, last-write-wins.

Fix: bind ``meter`` to OUR provider directly via
``provider.get_meter(__name__)``, never touching the global. Our
provider has the correct resource and its own export pipeline to
the OTel collector. Auto-instrumentation metrics continue to flow
through their own pipeline independently — both arrive at the
collector.

Tests updated to assert ``set_meter_provider`` is NOT called and the
``meter`` global is bound to the provider's own meter, not the global
proxy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(traces): always open vcon_processing root span, link if upstream context present

Previously the vcon_processing.<chain> span was only created when
self.context carried an upstream trace context (extracted from Redis
under context:{ingress}:{vcon_uuid}). When the producer didn't store
a context — e.g. an adapter that ingests directly to the ingress list
without OTel instrumentation — _create_span_from_context short-
circuited to None and every link.* / storage.* span emitted during
the chain run became its own root trace, defeating the
trace-per-vcon model.

Refactor _create_span_from_context to always return a span context
manager. A span link to the producer's trace is attached only when
trace_id and span_id parse to non-zero values; otherwise the span
opens without links. Trace structure is preserved either way, and
the link semantics still light up automatically once producers (e.g.
the BDS adapter) begin propagating context via store_context_*.

Net: -27 lines (139 changed, 56+/83-). Run-method setup collapses
from a 25-line if/else dance to two lines; the POC trace-id
verification log goes away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant