docs: add tool-call parser troubleshooting for custom LLM backends #330
Open
mason5052 wants to merge 1 commit into
Issue vxcontrol#313 reported flows that stall after a few steps when running a custom OpenAI-compatible backend (LiteLLM in front of llama.cpp serving qwen3.6-35b via LLM_SERVER_*). The backend returned malformed tool-call arguments, surfaced as 'Failed to parse tool call arguments as JSON' HTTP 500s and cascading retries. The maintainer fixed the stall in the latest build by sanitizing wrong function-call arguments.

Add a troubleshooting subsection under Custom LLM Provider Configuration that explains the root cause and how to diagnose it:

- Custom OpenAI-compatible backends must return valid tool-call (function-call) JSON; llama.cpp, SGLang, and vLLM usually require a specific tool-call parser and matching chat template, and not every setup produces valid tool calls out of the box.
- Symptoms: 'Failed to parse tool call arguments as JSON', flow stalls, looping tool calls, the 'failed to select primary docker image via llm call' start-of-flow failure, and unexpected backend HTTP errors.
- Investigation: check PentAGI and backend/proxy logs, validate with the ctester utility before a full flow, confirm the parser/chat template match the model, and update PentAGI (recent builds sanitize malformed function-call arguments).

Docs only. No tool-call parser code, provider runtime, schema, migration, or config-default changes. Wording frames compatibility as dependent on the backend's OpenAI-compatible tool-call behavior rather than claiming every llama.cpp backend is supported.
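As a rough illustration of the "check the logs" step, the sketch below counts how often the two error strings quoted above appear in a set of log lines. The function name, the sample lines, and their `level=... msg=...` layout are made up for the example; real PentAGI or proxy logs may be formatted differently.

```python
# Error strings quoted in this PR; matching is a plain, case-sensitive substring test.
SYMPTOMS = (
    "Failed to parse tool call arguments as JSON",
    "failed to select primary docker image via llm call",
)


def count_symptoms(log_lines):
    """Count how many lines contain each known symptom string."""
    counts = {symptom: 0 for symptom in SYMPTOMS}
    for line in log_lines:
        for symptom in SYMPTOMS:
            if symptom in line:
                counts[symptom] += 1
    return counts


# Hypothetical log lines, invented for illustration only.
SAMPLE_LOG = [
    'level=error msg="Failed to parse tool call arguments as JSON"',
    'level=info msg="step finished"',
    'level=error msg="failed to select primary docker image via llm call"',
    'level=error msg="Failed to parse tool call arguments as JSON"',
]

if __name__ == "__main__":
    for symptom, hits in count_symptoms(SAMPLE_LOG).items():
        print(f"{hits:3d}  {symptom}")
```

A non-zero count for either string suggests looking at the backend's tool-call parser and chat template before anything else.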
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds documentation to help debug LLM backend/tool-call formatting issues that can stall agent flows when using OpenAI-compatible backends.
Changes:
- Documented common tool-call (function-call) JSON parsing failure modes.
- Added investigation steps and pointers to logs and the `ctester` utility.
- Clarified that correct parser/chat-template configuration is required for self-hosted inference engines.
Summary
Add a troubleshooting subsection under Custom LLM Provider Configuration explaining why tool-call (function-call) parser problems on self-hosted OpenAI-compatible backends (llama.cpp / SGLang / vLLM, often behind LiteLLM) cause stalled flows, and how to diagnose them. Docs only.
Problem
Issue #313 reported that flows stop responding after a few steps when running a custom backend configured through `LLM_SERVER_*` (LiteLLM in front of llama.cpp serving `qwen3.6-35b`). The logs showed `Failed to parse tool call arguments as JSON`, surfaced through LiteLLM as an HTTP 500, followed by cascading retries and a 429. The maintainer confirmed that the stall was fixed in the latest build by sanitizing malformed function-call arguments, and that the root cause was the model side returning corrupted tool-call arguments.
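To make the failure concrete, here is a minimal sketch of checking a backend response for this problem. It assumes the standard OpenAI-compatible chat-completion shape, where each tool call carries its arguments as a JSON-encoded string under `function.arguments`. The function name, the sample payload, and the tool names in it are hypothetical, and this is not how PentAGI or `ctester` performs the check.

```python
import json


def find_malformed_tool_calls(response):
    """Return one entry per tool call whose arguments are not a JSON object."""
    problems = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            function = call.get("function") or {}
            name = function.get("name")
            try:
                # Arguments arrive as a JSON string, not as a nested object.
                parsed = json.loads(function.get("arguments"))
            except (TypeError, ValueError) as exc:
                problems.append({"name": name, "error": str(exc)})
                continue
            if not isinstance(parsed, dict):
                problems.append({"name": name, "error": "arguments are not a JSON object"})
    return problems


# Hypothetical response: the second call has truncated (invalid) JSON arguments.
SAMPLE_RESPONSE = {
    "choices": [{"message": {"tool_calls": [
        {"function": {"name": "terminal", "arguments": '{"cmd": "ls"}'}},
        {"function": {"name": "search", "arguments": '{"query": "nmap'}},
    ]}}]
}

if __name__ == "__main__":
    for problem in find_malformed_tool_calls(SAMPLE_RESPONSE):
        print(problem["name"], "->", problem["error"])
```

Running a check like this against a single captured response is a quick way to tell whether the malformed arguments originate at the backend or are introduced further along the chain.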
There is currently no documentation that explains this class of failure, even though it is a common pitfall with self-hosted backends and is closely related to the image-chooser failure (a flow's first action is an LLM tool call to pick the container image).
Solution
Add a `#### Troubleshooting: tool-call (function-call) parser errors` subsection right after the Custom LLM Provider Configuration content, covering:
- Symptoms: `Failed to parse tool call arguments as JSON`, a flow that stalls after a few steps, looping tool calls, the start-of-flow `failed to select primary docker image via llm call` error, and unexpected backend 5xx/4xx responses.
- Investigation: check PentAGI and backend/proxy logs, validate with `ctester` before a full flow, confirm the parser/chat template match the model, and update PentAGI (recent builds sanitize malformed function-call arguments).

The new content links only to the existing Testing LLM Agents section and references the image-chooser error in prose (no new anchor), so it stands on its own against `main`.
User Impact
- Explains the `Failed to parse tool call arguments as JSON` stall.
- Points users at `ctester` for pre-flight validation and at the update that sanitizes malformed arguments.
Test Plan
- `git diff --check` clean.
- Docs only: `README.md` (+20 lines). No tool-call parser code, provider runtime, schema, migration, or config-default changes.
- Verified that `failed to select primary docker image via llm call` exists in `backend/pkg/providers/providers.go` on `main`.
- Verified that `LLM_SERVER_URL` / `LLM_SERVER_KEY` / `LLM_SERVER_MODEL` / `LLM_SERVER_PROVIDER` exist in `.env.example`.
- Verified that the `ctester` utility exists and tests tool-calling agent types, and that the `#testing-llm-agents` anchor resolves.

Refs #313