
fix(langchain): detach existing SpanHolder token before overwrite in _create_llm_span #3958

Merged
OzBenSimhonTraceloop merged 4 commits into traceloop:main from saivedant169:fix/langchain-span-holder-overwrite
Apr 15, 2026

Conversation

@saivedant169 (Contributor) commented Apr 9, 2026

Closes #3957

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

What

In _create_llm_span(), when a SpanHolder already exists for a given run_id, the old entry was being replaced directly. The previous holder's token (from context_api.attach(set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True))) was dropped on the floor without ever being detached, leaving an orphaned entry on the OTel context stack.

This is the same class of bug as the ones fixed in #3526 and #3807, just on a different code path. It was flagged during review of #3807 and tracked separately in #3957.
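The leak pattern can be sketched with stdlib contextvars, which is the mechanism OTel's attach/detach tokens are built on. All names here (spans, create_holder, the suppress variable) are illustrative stand-ins, not the instrumentation's real API:

```python
# Sketch of the bug class: each set()/attach() returns a token that must
# later be reset()/detach()ed. Overwriting the stored token without
# resetting the old one strands the first attachment forever.
import contextvars

suppress = contextvars.ContextVar("suppress", default=False)
spans = {}  # run_id -> holder dict, modeling self.spans

def create_holder(run_id):
    token = suppress.set(True)
    spans[run_id] = {"token": token}  # any old token is dropped here

create_holder("run-1")
create_holder("run-1")  # first token is now unreachable -- the leak
# Detaching only the surviving token cannot undo the first set():
suppress.reset(spans["run-1"]["token"])
print(suppress.get())  # True -- suppression is still active
```

The second `set()` recorded `True` as its old value, so resetting it restores `True`, not the baseline `False`; only the lost first token could have done that.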

Fix

Before overwriting self.spans[run_id], check if an existing holder is present and detach its token through the existing _safe_detach_context() helper.

existing_holder = self.spans.get(run_id)
if existing_holder is not None and existing_holder.token is not None:
    self._safe_detach_context(existing_holder.token)

Scoped strictly to _create_llm_span() as described in #3957. The same pattern exists in _create_span() and _create_task_span(), but those were not called out in the issue, so leaving them out of this PR to keep it tight.

Tests

No new tests added — this is a defensive fix for a race between the runtime overwriting the registry entry and the old token remaining attached, which is hard to cover with cassette-based tests. Happy to add one if a maintainer suggests a good angle.

AI Disclosure

I used an AI tool to help draft the fix and this description.

Summary by CodeRabbit

  • Bug Fixes
    • Improved OpenTelemetry instrumentation to correctly manage context tokens and suppression flags, preventing leaks when spans are replaced or duplicated and improving telemetry stability.
  • Tests
    • Added tests validating context token lifecycle and suppression behavior to prevent regressions.

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ab61767-1d2a-4d1d-aa6f-5352b18704ae

📥 Commits

Reviewing files that changed from the base of the PR and between 39bc0d9 and abadc39.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_context_token_lifecycle.py
✅ Files skipped from review due to trivial changes (1)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py

📝 Walkthrough

Walkthrough

TraceloopCallbackHandler now detaches any existing SpanHolder context token for a run_id before creating/attaching a new LLM span and its suppression token, preventing orphaned context tokens. A new test module validates the token lifecycle and suppression state across span create/end and duplicate run_id scenarios.

Changes

Cohort / File(s) Summary
Context Token Cleanup
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
Before overwriting a SpanHolder for a run_id, the code detaches any existing context token to avoid leaving orphaned attachments; then creates/attaches the new span and suppression token with proper detach/attach ordering.
Tests — context token lifecycle
packages/opentelemetry-instrumentation-langchain/tests/test_context_token_lifecycle.py
New pytest module that sets up an in-memory tracer and exercises suppression flag lifecycle: create/end span, repeated create for same run_id, and ensures suppression is cleared and no leaked tokens remain across tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Poem

🐰
I hopped through spans both near and far,
Unhooked the tokens, fixed the tar.
No lonely tokens left to roam,
Each context finds its tidy home.
Trace trails gleam — a happy comb. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately and concisely describes the main change: detaching an existing SpanHolder token before overwriting it in the _create_llm_span method.
  • Linked Issues check — ✅ Passed: the PR implements the suggested fix from #3957 by detaching existing SpanHolder context tokens before overwriting, and adds tests covering the suppression lifecycle and the duplicate run_id regression case.
  • Out of Scope Changes check — ✅ Passed: both changes (the callback_handler.py fix and the test file addition) are scoped to the context token leak in _create_llm_span and validating the fix, with no unrelated modifications.


@max-deygin-traceloop (Contributor) commented:

Multiple test failures.

Deep dive Claude review:
Yes, the underlying bug is real. Every call to _create_llm_span() invokes _create_span() which
attaches the span to OTel context and stores the token in a SpanHolder at self.spans[run_id]. Then
_create_llm_span() overwrites that SpanHolder with a new one containing a different token (the
SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY suppression token). The original span-context token is
orphaned — it is never detached, leaving a stale entry on the OTel context stack. This is exactly the
class of bug described in #3526.

The PR inserts the detach after the new suppression token is created (line 468) but before the
SpanHolder overwrite (line 482). This causes an out-of-order ContextVar.reset():

  1. _create_span attaches span context → token_A (old_value = C_original)
  2. _create_llm_span attaches suppression → token_B (old_value = C_span)
  3. Fix detaches token_A → ContextVar.reset(token_A) restores context to C_original, wiping out the
    suppression
  4. Downstream LLM calls (OpenAI, Bedrock, etc.) no longer see the suppression flag → duplicate spans
    are created
  5. When _end_span later detaches token_B → context is restored to C_span (a dead span), corrupting
    context in a different way
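The out-of-order reset described in steps 1–5 can be reproduced with stdlib contextvars (the primitive OTel's attach/detach tokens wrap); the C_* names mirror the trace above and are purely illustrative:

```python
# ContextVar.reset(token) restores the snapshot saved at set()-time --
# it is NOT a stack pop -- so resetting an earlier token wipes anything
# layered on top of it.
import contextvars

ctx = contextvars.ContextVar("ctx", default="C_original")

token_a = ctx.set("C_span")        # 1. span context attached (remembers baseline)
token_b = ctx.set("C_suppressed")  # 2. suppression attached on top (remembers C_span)

ctx.reset(token_a)                 # 3. out-of-order detach of token_A
print(ctx.get())  # "C_original" -- the suppression was wiped

ctx.reset(token_b)                 # 5. later detach of token_B
print(ctx.get())  # "C_span" -- restored to a dead span's context
```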

Test results confirm this:

  • main (no fix): 136 passed, 0 failed
  • PR branch: 121 passed, 15 failed — all failures are duplicate span assertions ('openai.chat',
    'bedrock.chat' appearing where they shouldn't)

What a correct fix would look like

The detach must happen before the suppression token is created to maintain correct context ordering:

# Detach BEFORE creating the suppression token
existing_holder = self.spans.get(run_id)
if existing_holder is not None and existing_holder.token is not None:
    self._safe_detach_context(existing_holder.token)

# Now create suppression on top of the clean context
try:
    token = context_api.attach(
        context_api.set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True)
    )
except Exception:
    token = None

self.spans[run_id] = SpanHolder(
    span, token, None, [], workflow_name, None, entity_path
)

Alternatively (and arguably cleaner), just mutate the existing SpanHolder's token in-place instead of
creating a new one:

try:
    token = context_api.attach(...)
except Exception:
    token = None

existing_holder = self.spans[run_id]
self._safe_detach_context(existing_holder.token)
existing_holder.token = token

The existing test suite already catches this regression. The PR author's claim that "this is hard to
cover with cassette-based tests" is incorrect — the existing tests validate span names, and the
suppression breakage manifests as extra spans that are straightforward to detect. A purpose-built
test verifying context state before/after _create_llm_span would also be valuable but isn't strictly
necessary since the existing suite covers the symptom.

…_create_llm_span

Closes traceloop#3957

When _create_llm_span() is called for a run_id that already has an entry
in self.spans, the old holder's token was lost without being detached,
leaving an orphaned context_api.attach() on the OTel context stack. This
is the same class of bug as traceloop#3526 / traceloop#3807.

Defensively detach the existing holder's token before replacing the entry.
… order

Moved the existing holder detach before the suppression token attach.
The previous ordering detached after the suppression was already on the
context stack, which caused ContextVar.reset() to restore context to
before the span — wiping the suppression flag and producing duplicate
downstream spans (openai.chat, bedrock.chat, etc.).

With the corrected order:
1. Detach the orphaned span-context token (clean slate)
2. Attach the suppression token on top of the clean context
3. Store the new SpanHolder with the suppression token
@saivedant169 force-pushed the fix/langchain-span-holder-overwrite branch from 6940005 to 39bc0d9 on April 12, 2026, 14:48
@saivedant169 (Contributor, Author) commented:

Thanks for the detailed review @max-deygin-traceloop — you're right, the detach ordering was wrong.

Pushed a fix. The detach now runs before the suppression token is created, so the context stack stays correct:

  1. Detach the orphaned span-context token (clean slate)
  2. Attach the suppression token on top of the clean context
  3. Store the new SpanHolder

Should resolve the 15 test failures you saw.

coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (1)

465-485: ⚠️ Potential issue | 🔴 Critical

Critical: Detach logic placement causes the fix to break span context and suppression.

The check for existing_holder at line 469 is placed after _create_span() has already stored a new holder in self.spans[run_id] (line 332-334 inside _create_span). This means:

  1. _create_span() creates span, attaches span token, stores holder in self.spans[run_id]
  2. existing_holder = self.spans.get(run_id) finds the holder just created (not a pre-existing one)
  3. Detaches the span token — breaking the span's context
  4. Attaches suppression token (now orphaned from span context)
  5. Overwrites holder with suppression token

This detaches the span context immediately after creation, which explains the 15 test failures showing duplicate spans (openai.chat, bedrock.chat). The suppression flag is no longer attached under the span's context, so downstream instrumentation isn't suppressed.

The check must occur before calling _create_span() to handle the genuine edge case of _create_llm_span being invoked twice with the same run_id:

🐛 Proposed fix: move detach before `_create_span()` call
     def _create_llm_span(
         self,
         run_id: UUID,
         parent_run_id: Optional[UUID],
         name: str,
         request_type: LLMRequestTypeValues,
         metadata: Optional[dict[str, Any]] = None,
         serialized: Optional[dict[str, Any]] = None,
     ) -> Span:
         workflow_name = self.get_workflow_name(parent_run_id)
         entity_path = self.get_entity_path(parent_run_id)
 
+        # Detach any pre-existing holder's token BEFORE creating the new span.
+        # This handles the edge case where _create_llm_span is called twice
+        # with the same run_id, preventing orphaned context attachments.
+        existing_holder = self.spans.get(run_id)
+        if existing_holder is not None and existing_holder.token is not None:
+            self._safe_detach_context(existing_holder.token)
+
         span = self._create_span(
             run_id,
             parent_run_id,
             f"{name}.{request_type.value}",
             kind=SpanKind.CLIENT,
             workflow_name=workflow_name,
             entity_path=entity_path,
             metadata=metadata,
         )
 
         vendor = detect_vendor_from_class(
             _extract_class_name_from_serialized(serialized)
         )
 
         _set_span_attribute(span, GenAIAttributes.GEN_AI_PROVIDER_NAME, vendor)
         operation_name = (
             GenAiOperationNameValues.CHAT.value
             if request_type == LLMRequestTypeValues.CHAT
             else GenAiOperationNameValues.TEXT_COMPLETION.value
         )
         _set_span_attribute(
             span, GenAIAttributes.GEN_AI_OPERATION_NAME, operation_name
         )
 
-        # Detach any existing holder's token before creating the suppression
-        # token. The ordering matters: if we detach after attaching the
-        # suppression, ContextVar.reset() restores context to before the span,
-        # wiping the suppression flag and causing duplicate downstream spans.
-        existing_holder = self.spans.get(run_id)
-        if existing_holder is not None and existing_holder.token is not None:
-            self._safe_detach_context(existing_holder.token)
-
         # we already have an LLM span by this point,
         # so skip any downstream instrumentation from here
         try:
             token = context_api.attach(
                 context_api.set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True)
             )
         except Exception:
             # If context setting fails, continue without suppression token
             token = None
 
         self.spans[run_id] = SpanHolder(
             span, token, None, [], workflow_name, None, entity_path
         )
 
         return span
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py`
around lines 465 - 485, The detach check currently runs after _create_span() and
therefore detaches the newly created span's token; move the existing_holder
lookup and the call to _safe_detach_context(existing_holder.token) to execute
before invoking _create_span() (i.e., before the code path that stores a new
SpanHolder in self.spans[run_id]) so that only a previously stored holder is
detached; ensure the suppression token attach
(context_api.attach(...SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY...)) still
happens after creating the new span so suppression remains scoped to the new
span's context and the self.spans[run_id] assignment (SpanHolder(...)) continues
to store the correct span and token.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10e0ac21-e4d3-484d-ae17-09de4e718e05

📥 Commits

Reviewing files that changed from the base of the PR and between 6940005 and 39bc0d9.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py

@max-deygin-traceloop (Contributor) commented:

Thanks for the fix @saivedant169!
Sorry for only giving an AI review at the time. Claude claims #3957 is not actually fixed.
I'm ok with either merging it as is, since it fixes the #3526 and #3807 variation in _create_llm_span.

So it's up to you: do you want to remove the claim of fixing #3957, or continue working on it?

Here's the full Claude review, along with a test script generated to reproduce #3957:
test script

Verdict: Approve with P2 note

The second commit (39bc0d90) is correct and fixes the real root cause of the
suppression-wiping bug. One residual issue remains from the original bug report
(#3957) that the PR claims to close but doesn't.


What the fix does

_create_span() always stores a SpanHolder(span, span_token, ...) in
self.spans[run_id] before returning. In the first commit, _create_llm_span
detached that span_token after attaching the suppression token. Because
ContextVar.reset(token) unconditionally restores to the snapshot saved at
attach-time — not just a stack pop — detaching span_token wiped the suppression
that was layered on top of it:

attach(span_ctx) → span_token remembers ctx_0
attach(supp_ctx) → supp_token remembers span_ctx
detach(span_token) → ContextVar.reset to ctx_0 ← WIPES suppression

The second commit reverses the order — detach span_token first, then attach
suppression on the clean ctx_0. Suppression now persists correctly and
_end_span cleans it up properly.


P1 — None. The ordering fix is correct.


P2 — Closes #3957 is inaccurate; duplicate run_id leak is not fixed

_create_span unconditionally overwrites self.spans[run_id] (line 334 of
callback_handler.py) before _create_llm_span reaches the existing_holder = self.spans.get(run_id) check. In the duplicate run_id scenario, the old
suppression token (supp_token_1) is gone from self.spans by the time we try
to read it:

Before the 2nd _create_llm_span call:

    self.spans[run_id] = SpanHolder(span_1, supp_token_1)   # supp_token_1 remembers ctx_0

    _create_span(run_id)
      → self.spans[run_id] = SpanHolder(span_2, span_token_2)   # supp_token_1 is now unreachable

    existing_holder = self.spans.get(run_id)   # finds SpanHolder(span_2, span_token_2)
    detach(span_token_2)   # restores to supp_ctx_1, NOT ctx_0
    attach(supp_ctx_2)     # supp_token_2 remembers supp_ctx_1
    _end_span later detaches supp_token_2 → restores to supp_ctx_1

supp_token_1 is never detached → suppression is still active after the span ends.

The fix requires saving the old holder before calling _create_span:

# in _create_llm_span, BEFORE the _create_span(...) call:
old_holder = self.spans.get(run_id)
if old_holder is not None and old_holder.token is not None:
    self._safe_detach_context(old_holder.token)

span = self._create_span(run_id, ...)

A unit test that reproduces the leak is available in the PR thread. The test
asserts that after a duplicate-run_id cycle + _end_span, the suppression flag is
None. It fails with current code and passes with the fix above.

Minimum ask: remove Closes #3957 from the PR description. Ideally apply the
one-liner fix above and add the test.
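The duplicate-run_id cycle and the corrected ordering can be modeled end to end with stdlib contextvars. This is a self-contained sketch, not the instrumentation's real code: create_llm_span, end_span, and the spans dict are hypothetical stand-ins for the handler's methods and registry:

```python
# Model of the corrected flow: detach the old holder's token BEFORE the
# registry entry is overwritten, so a duplicate run_id cycle followed by
# end_span leaves no suppression behind.
import contextvars

suppress = contextvars.ContextVar("suppress", default=None)
spans = {}  # run_id -> holder dict, modeling self.spans

def create_llm_span(run_id):
    # Detach the previous holder's token before it becomes unreachable
    old = spans.get(run_id)
    if old is not None and old["token"] is not None:
        suppress.reset(old["token"])
    spans[run_id] = {"token": suppress.set(True)}

def end_span(run_id):
    holder = spans.pop(run_id)
    if holder["token"] is not None:
        suppress.reset(holder["token"])

create_llm_span("run-1")
create_llm_span("run-1")  # duplicate run_id, as in the bug report
end_span("run-1")
print(suppress.get())  # None -- no suppression leaked after the cycle
```

Deleting the `suppress.reset(old["token"])` line reproduces the leak: the final `get()` then returns True, which is the assertion the regression test in the thread relies on.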

---
Note: unrelated pre-existing leak in _create_span

_create_span calls context_api.attach(...) for association_properties (when
metadata is provided) and discards the returned token. Same class of bug, not
introduced by this PR. Worth a separate issue.

@saivedant169 (Contributor, Author) commented:

You can merge it, that's completely fine.

_create_span unconditionally overwrites self.spans[run_id] with a new
SpanHolder. When _create_llm_span is called twice with the same run_id,
the old holder — along with its suppression token — is gone from
self.spans before the existing detach logic can read it, leaking
supp_token_1 forever.

Moved the old-holder detach to run before _create_span() is called,
so the original suppression token is properly cleaned up. A second
detach remains after _create_span to handle the span_token that
_create_span just attached (otherwise the suppression would layer on
top of the span context instead of the baseline).

Added tests/test_context_token_lifecycle.py covering:
- Suppression active after _create_llm_span
- Suppression cleared after _end_span
- Duplicate run_id no longer leaks supp_token_1 (regression for traceloop#3957)

Fully closes traceloop#3957.
@saivedant169 (Contributor, Author) commented Apr 14, 2026

Thanks @max-deygin-traceloop, I went with the full fix. I moved the old-holder detach to run before _create_span(), so the original suppression token is properly cleaned up before it becomes unreachable. A second detach after _create_span handles the span_token it just attached, so suppression still layers on top of the baseline context.

Added your test as tests/test_context_token_lifecycle.py. Verified:

Re the unrelated _create_span leak for association_properties — happy to file that as a separate issue if you'd like.

max-deygin-traceloop approved these changes Apr 15, 2026
@OzBenSimhonTraceloop merged commit 976eef3 into traceloop:main on Apr 15, 2026
9 of 10 checks passed


Development

Successfully merging this pull request may close these issues.

fix(langchain): _create_llm_span() overwrites SpanHolder causing potential context token loss

3 participants