docs: add tool-call parser troubleshooting for custom LLM backends by mason5052 · Pull Request #330 · vxcontrol/pentagi

mason5052 · 2026-06-04T02:55:12Z

Summary

Add a troubleshooting subsection under Custom LLM Provider Configuration explaining why tool-call (function-call) parser problems on self-hosted OpenAI-compatible backends (llama.cpp / SGLang / vLLM, often behind LiteLLM) cause stalled flows, and how to diagnose them. Docs only.

Problem

Issue #313 reported that flows stop responding after a few steps when running a custom backend configured through LLM_SERVER_* (LiteLLM in front of llama.cpp serving qwen3.6-35b). The logs showed:

Failed to parse tool call arguments as JSON: [json.exception.parse_error.101]
parse error at line 1, column 131: syntax error while parsing value -
unexpected end of input

This error surfaced through LiteLLM as an HTTP 500, followed by cascading retries and a 429. The maintainer confirmed that the stall was fixed in the latest build by sanitizing malformed function-call arguments, and that the root cause was the model side returning corrupted tool-call arguments.
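To illustrate the failure class, here is a minimal Python sketch of what happens when a tool call's arguments string is cut off mid-value. It is not PentAGI's sanitization code or the backend's parser; the function name `parse_tool_arguments` and the sample payload are hypothetical, and the fallback to an empty object is only one possible policy.

```python
import json
from typing import Optional, Tuple


def parse_tool_arguments(raw: str) -> Tuple[dict, Optional[str]]:
    """Parse a tool call's arguments string without raising.

    Returns (arguments, error). On failure the arguments are an empty dict
    and the error describes what the JSON parser rejected.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, f"invalid JSON at column {exc.colno}: {exc.msg}"
    if not isinstance(parsed, dict):
        # Tool-call arguments are expected to be a JSON object.
        return {}, f"expected a JSON object, got {type(parsed).__name__}"
    return parsed, None


# Hypothetical payload truncated mid-value, similar in shape to the
# "unexpected end of input" case in the log above.
truncated = '{"image": "example/image:tag", "reason": '
args, error = parse_tool_arguments(truncated)
print(args, error)
```

Returning the error instead of raising lets the caller decide whether to retry, log, or continue with defaults, rather than turning one bad response into a chain of retries.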

There is currently no documentation that explains this class of failure, even though it is a common pitfall with self-hosted backends and is closely related to the image-chooser failure (a flow's first action is an LLM tool call to pick the container image).

Solution

Add a #### Troubleshooting: tool-call (function-call) parser errors subsection right after the Custom LLM Provider Configuration content, covering:

- Custom OpenAI-compatible backends must return valid tool-call JSON. llama.cpp, SGLang, and vLLM usually require a specific tool-call parser and a matching chat template, and not every setup produces valid tool calls out of the box (compatibility depends on the backend, not PentAGI alone).
- Symptoms: 'Failed to parse tool call arguments as JSON', a flow that stalls after a few steps, looping tool calls, the start-of-flow 'failed to select primary docker image via llm call' error, and unexpected backend 5xx/4xx responses.
- How to investigate: check both PentAGI and backend/proxy logs, validate the provider with ctester before a full flow, confirm that the parser and chat template match the model, and update PentAGI (recent builds sanitize malformed function-call arguments).
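As a rough illustration of what "valid tool-call JSON" means, the sketch below checks a chat-completion response that has already been decoded into a Python dict. It assumes the standard OpenAI-style layout (`choices[].message.tool_calls[].function.arguments` as a JSON string). It is independent of ctester and does not reproduce what ctester does; `find_invalid_tool_calls` and the sample response are hypothetical.

```python
import json
from typing import Any, Dict, List


def find_invalid_tool_calls(response: Dict[str, Any]) -> List[str]:
    """Return one message per tool call whose arguments are not a JSON object."""
    problems: List[str] = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            function = call.get("function") or {}
            name = function.get("name", "<unnamed>")
            raw = function.get("arguments")
            if not isinstance(raw, str):
                problems.append(f"{name}: arguments is not a string")
                continue
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError as exc:
                problems.append(f"{name}: {exc.msg} at column {exc.colno}")
                continue
            if not isinstance(parsed, dict):
                problems.append(f"{name}: arguments is not a JSON object")
    return problems


# Hypothetical response with one well-formed and one truncated tool call.
sample = {
    "choices": [{
        "message": {
            "tool_calls": [
                {"function": {"name": "good_tool", "arguments": '{"query": "x"}'}},
                {"function": {"name": "bad_tool", "arguments": '{"query": '}},
            ]
        }
    }]
}
for problem in find_invalid_tool_calls(sample):
    print(problem)
```

Running a check like this against a single captured response from the backend can help separate a parser or chat-template problem from a PentAGI-side issue before starting a full flow.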

The new content links only to the existing Testing LLM Agents section and references the image-chooser error in prose (no new anchor), so it stands on its own against main.

User Impact

- Users on self-hosted backends (llama.cpp, SGLang, vLLM) get a clear explanation of the tool-call parser failure mode and a concrete diagnosis path, instead of an opaque 'Failed to parse tool call arguments as JSON' stall.
- Points users at ctester for pre-flight validation and at the update that sanitizes malformed arguments.
- No behavior change.

Test Plan

- git diff --check is clean.
- Docs-only diff: README.md (+20 lines). No tool-call parser code, provider runtime, schema, migration, or config-default changes.
- Verified that the referenced error string 'failed to select primary docker image via llm call' exists in backend/pkg/providers/providers.go on main.
- Verified that LLM_SERVER_URL / LLM_SERVER_KEY / LLM_SERVER_MODEL / LLM_SERVER_PROVIDER exist in .env.example.
- Verified that the ctester utility exists and tests tool-calling agent types, and that the #testing-llm-agents anchor resolves.
- Placed away from the README regions touched by open PRs #325 (docs: clarify "primary docker image" error is an LLM backend failure) and #327 (docs: add embedding troubleshooting for stalled flows) to avoid conflicts.
- No unrelated files included.
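For readers reproducing the setup, a small sketch like the following reports which of the LLM_SERVER_* variables named above are set in the current environment. The variable names come from this PR; which of them are actually required is not stated here, so the helper (a hypothetical `llm_server_settings`) only reports presence and never prints values.

```python
import os
from typing import Dict, Mapping

# Variable names as listed in this PR's test plan.
LLM_SERVER_VARS = (
    "LLM_SERVER_URL",
    "LLM_SERVER_KEY",
    "LLM_SERVER_MODEL",
    "LLM_SERVER_PROVIDER",
)


def llm_server_settings(env: Mapping[str, str]) -> Dict[str, bool]:
    """Map each LLM_SERVER_* name to True if it has a non-blank value."""
    return {name: bool(env.get(name, "").strip()) for name in LLM_SERVER_VARS}


# Print only set/not set so that the key is never echoed to the terminal.
for name, is_set in llm_server_settings(os.environ).items():
    print(f"{name}: {'set' if is_set else 'not set'}")
```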

Refs #313

Issue vxcontrol#313 reported flows that stall after a few steps when running a custom OpenAI-compatible backend (LiteLLM in front of llama.cpp serving qwen3.6-35b via LLM_SERVER_*). The backend returned malformed tool-call arguments, surfaced as 'Failed to parse tool call arguments as JSON' HTTP 500s and cascading retries. The maintainer fixed the stall in the latest build by sanitizing wrong function-call arguments.

Add a troubleshooting subsection under Custom LLM Provider Configuration that explains the root cause and how to diagnose it:

- Custom OpenAI-compatible backends must return valid tool-call (function-call) JSON; llama.cpp, SGLang, and vLLM usually require a specific tool-call parser and matching chat template, and not every setup produces valid tool calls out of the box.
- Symptoms: 'Failed to parse tool call arguments as JSON', flow stalls, looping tool calls, the 'failed to select primary docker image via llm call' start-of-flow failure, and unexpected backend HTTP errors.
- Investigation: check PentAGI and backend/proxy logs, validate with the ctester utility before a full flow, confirm the parser/chat template match the model, and update PentAGI (recent builds sanitize malformed function-call arguments).

Docs only. No tool-call parser code, provider runtime, schema, migration, or config-default changes. Wording frames compatibility as dependent on the backend's OpenAI-compatible tool-call behavior rather than claiming every llama.cpp backend is supported.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds documentation to help debug LLM backend/tool-call formatting issues that can stall agent flows when using OpenAI-compatible backends.

Changes:

- Documented common tool-call (function-call) JSON parsing failure modes.
- Added investigation steps and pointers to logs and the ctester utility.
- Clarified that correct parser/chat-template configuration is required for self-hosted inference engines.

Copilot AI review requested due to automatic review settings June 4, 2026 02:55

Copilot AI reviewed Jun 4, 2026

mason5052 wants to merge 1 commit into vxcontrol:main from mason5052:codex/issue-313-tool-call-parser-troubleshooting
