fix(langchain): detach existing SpanHolder token before overwrite in _create_llm_span#3958
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughTraceloopCallbackHandler now detaches any existing SpanHolder context token for a run_id before creating/attaching a new LLM span and its suppression token, preventing orphaned context tokens. A new test module validates the token lifecycle and suppression state across span create/end and duplicate run_id scenarios. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Multiple test failures. Deep dive Claude review: The PR inserts the detach after the new suppression token is created (line 468) but before the
Test results confirm this:
What a correct fix would look like The detach must happen before the suppression token is created to maintain correct context ordering: Detach BEFORE creating the suppression tokenexisting_holder = self.spans.get(run_id) Now create suppression on top of the clean contexttry: self.spans[run_id] = SpanHolder( Alternatively (and arguably cleaner), just mutate the existing SpanHolder's token in-place instead of try: existing_holder = self.spans[run_id] The existing test suite already catches this regression. The PR author's claim that "this is hard to |
…_create_llm_span Closes traceloop#3957 When _create_llm_span() is called for a run_id that already has an entry in self.spans, the old holder's token was lost without being detached, leaving an orphaned context_api.attach() on the OTel context stack. This is the same class of bug as traceloop#3526 / traceloop#3807. Defensively detach the existing holder's token before replacing the entry.
… order Moved the existing holder detach before the suppression token attach. The previous ordering detached after the suppression was already on the context stack, which caused ContextVar.reset() to restore context to before the span — wiping the suppression flag and producing duplicate downstream spans (openai.chat, bedrock.chat, etc.). With the corrected order: 1. Detach the orphaned span-context token (clean slate) 2. Attach the suppression token on top of the clean context 3. Store the new SpanHolder with the suppression token
6940005 to
39bc0d9
Compare
|
Thanks for the detailed review @max-deygin-traceloop — you're right, the detach ordering was wrong. Pushed a fix. The detach now runs before the suppression token is created, so the context stack stays correct:
Should resolve the 15 test failures you saw. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py (1)
465-485:⚠️ Potential issue | 🔴 CriticalCritical: Detach logic placement causes the fix to break span context and suppression.
The check for
existing_holderat line 469 is placed after_create_span()has already stored a new holder inself.spans[run_id](line 332-334 inside_create_span). This means:
_create_span()creates span, attaches span token, stores holder inself.spans[run_id]existing_holder = self.spans.get(run_id)finds the holder just created (not a pre-existing one)- Detaches the span token — breaking the span's context
- Attaches suppression token (now orphaned from span context)
- Overwrites holder with suppression token
This detaches the span context immediately after creation, which explains the 15 test failures showing duplicate spans (
openai.chat,bedrock.chat). The suppression flag is no longer attached under the span's context, so downstream instrumentation isn't suppressed.The check must occur before calling
_create_span()to handle the genuine edge case of_create_llm_spanbeing invoked twice with the samerun_id:🐛 Proposed fix: move detach before `_create_span()` call
def _create_llm_span( self, run_id: UUID, parent_run_id: Optional[UUID], name: str, request_type: LLMRequestTypeValues, metadata: Optional[dict[str, Any]] = None, serialized: Optional[dict[str, Any]] = None, ) -> Span: workflow_name = self.get_workflow_name(parent_run_id) entity_path = self.get_entity_path(parent_run_id) + # Detach any pre-existing holder's token BEFORE creating the new span. + # This handles the edge case where _create_llm_span is called twice + # with the same run_id, preventing orphaned context attachments. + existing_holder = self.spans.get(run_id) + if existing_holder is not None and existing_holder.token is not None: + self._safe_detach_context(existing_holder.token) + span = self._create_span( run_id, parent_run_id, f"{name}.{request_type.value}", kind=SpanKind.CLIENT, workflow_name=workflow_name, entity_path=entity_path, metadata=metadata, ) vendor = detect_vendor_from_class( _extract_class_name_from_serialized(serialized) ) _set_span_attribute(span, GenAIAttributes.GEN_AI_PROVIDER_NAME, vendor) operation_name = ( GenAiOperationNameValues.CHAT.value if request_type == LLMRequestTypeValues.CHAT else GenAiOperationNameValues.TEXT_COMPLETION.value ) _set_span_attribute( span, GenAIAttributes.GEN_AI_OPERATION_NAME, operation_name ) - # Detach any existing holder's token before creating the suppression - # token. The ordering matters: if we detach after attaching the - # suppression, ContextVar.reset() restores context to before the span, - # wiping the suppression flag and causing duplicate downstream spans. - existing_holder = self.spans.get(run_id) - if existing_holder is not None and existing_holder.token is not None: - self._safe_detach_context(existing_holder.token) - # we already have an LLM span by this point, # so skip any downstream instrumentation from here try: token = context_api.attach( context_api.set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True) ) except Exception: # If context setting fails, continue without suppression token token = None self.spans[run_id] = SpanHolder( span, token, None, [], workflow_name, None, entity_path ) return span🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py` around lines 465 - 485, The detach check currently runs after _create_span() and therefore detaches the newly created span's token; move the existing_holder lookup and the call to _safe_detach_context(existing_holder.token) to execute before invoking _create_span() (i.e., before the code path that stores a new SpanHolder in self.spans[run_id]) so that only a previously stored holder is detached; ensure the suppression token attach (context_api.attach(...SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY...)) still happens after creating the new span so suppression remains scoped to the new span's context and the self.spans[run_id] assignment (SpanHolder(...)) continues to store the correct span and token.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In
`@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py`:
- Around line 465-485: The detach check currently runs after _create_span() and
therefore detaches the newly created span's token; move the existing_holder
lookup and the call to _safe_detach_context(existing_holder.token) to execute
before invoking _create_span() (i.e., before the code path that stores a new
SpanHolder in self.spans[run_id]) so that only a previously stored holder is
detached; ensure the suppression token attach
(context_api.attach(...SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY...)) still
happens after creating the new span so suppression remains scoped to the new
span's context and the self.spans[run_id] assignment (SpanHolder(...)) continues
to store the correct span and token.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 10e0ac21-e4d3-484d-ae17-09de4e718e05
📒 Files selected for processing (1)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py
|
Thanks for the fix @saivedant169 ! So it's up to you, do you want to remove the claim of fix #3957, or continue to work on it? Here's full Claude review, attaching a test script generated to reproduce #3957 Verdict: Approve with P2 note The second commit ( What the fix does
attach(span_ctx) → span_token remembers ctx_0 The second commit reverses the order — detach P1 — None. The ordering fix is correct.P2 —
|
|
You can merge it that’s completely fine |
_create_span unconditionally overwrites self.spans[run_id] with a new SpanHolder. When _create_llm_span is called twice with the same run_id, the old holder — along with its suppression token — is gone from self.spans before the existing detach logic can read it, leaking supp_token_1 forever. Moved the old-holder detach to run before _create_span() is called, so the original suppression token is properly cleaned up. A second detach remains after _create_span to handle the span_token that _create_span just attached (otherwise the suppression would layer on top of the span context instead of the baseline). Added tests/test_context_token_lifecycle.py covering: - Suppression active after _create_llm_span - Suppression cleared after _end_span - Duplicate run_id no longer leaks supp_token_1 (regression for traceloop#3957) Fully closes traceloop#3957.
|
Thanks @max-deygin-traceloop went with the full fix. Moved the old-holder detach to run before Added your test as
Re the unrelated |
Closes #3957
feat(instrumentation): ...orfix(instrumentation): ....What
In
_create_llm_span(), when aSpanHolderalready exists for a givenrun_id, the old entry was being replaced directly. The previous holder'stoken(fromcontext_api.attach(set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True))) was dropped on the floor without ever being detached, leaving an orphaned entry on the OTel context stack.This is the same class of bug as the ones fixed in #3526 and #3807, just on a different code path. It was flagged during review of #3807 and tracked separately in #3957.
Fix
Before overwriting
self.spans[run_id], check if an existing holder is present and detach its token through the existing_safe_detach_context()helper.Scoped strictly to
_create_llm_span()as described in #3957. The same pattern exists in_create_span()and_create_task_span(), but those were not called out in the issue, so leaving them out of this PR to keep it tight.Tests
No new tests added — this is a defensive fix for a race between the runtime overwriting the registry entry and the old token remaining attached, which is hard to cover with cassette-based tests. Happy to add one if a maintainer suggests a good angle.
AI Disclosure
Took help of an AI tool to draft the fix and this description.
Summary by CodeRabbit