-
Notifications
You must be signed in to change notification settings - Fork 2
Sanitize LLM output #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Yury Tsarev <yury@upbound.io>
Signed-off-by: Yury Tsarev <yury@upbound.io>
Signed-off-by: Yury Tsarev <yury@upbound.io>
Signed-off-by: Yury Tsarev <yury@upbound.io>
tnthornton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ytsarev ! The main question I have is if we really need to rely on string parsing (which can be pretty brittle) or if there's a path for structured processing of the errors 👍
| port: 8080 | ||
| protocol: "TCP" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet this continues to flake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnthornton most probably, but it was stable during last 4 RC releases https://github.com/upbound/function-claude/actions/workflows/ci.yml over this branch
| // extractJSONFromAgentError attempts to extract JSON from langchaingo agent parsing errors. | ||
| // When the agent framework fails to parse output, it returns an error containing the raw output. | ||
| // This function extracts and cleans that output for a second parsing attempt. | ||
| // Returns the extracted content and true if extraction succeeded, empty string and false otherwise. | ||
| func extractJSONFromAgentError(err error) (string, bool) { | ||
| if err == nil { | ||
| return "", false | ||
| } | ||
| if !strings.Contains(err.Error(), "unable to parse agent output") { | ||
| return "", false | ||
| } | ||
| parts := strings.SplitN(err.Error(), "unable to parse agent output: ", 2) | ||
| if len(parts) != 2 { | ||
| return "", false | ||
| } | ||
| cleaned := stripMarkdownCodeBlocks(parts[1]) | ||
| return cleaned, true | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any possible path where we don't have to do string parsing? e.g. are the errors structured in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I checked the langchaingo source code to verify.
langchaingo uses plain sentinel errors without structured fields.
- Error definition (
agents/errors.go:19):
ErrUnableToParseOutput = errors.New("unable to parse agent output")- Error wrapping with output (https://github.com/tmc/langchaingo/blob/main/agents/conversational.go#L157):
return nil, nil, fmt.Errorf("%w: %s", ErrUnableToParseOutput, output)The error wraps the sentinel with fmt.Errorf and appends the raw output as a string. There's no custom error type with accessible fields - just a formatted string.
No structured alternative exists - ParserErrorHandler in the same file is for handling/formatting errors, not for accessing the output.
String parsing via strings.SplitN() is currently the only way to extract the embedded output. Our extractJSONFromAgentError() function uses defensive pattern matching and is covered by 8 test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Thanks for the digging into it deeper and for adding the test coverage. Hypothetically if the strings change, the tests "pop" and we can account for it.
Description of your changes
Problem
When using
claude-haiku-4-5model for cost savings, the model wraps JSON/YAML output in markdown code blocks (```json) despite prompt instructions. This caused:Similar was observed with sonnet-4-5, but there was the ability to change the behaviour with prompt modifications.
Sanitization is useful in any case and makes function-claude output and behaviour much more deterministic.
Solution
Implemented defense-in-depth markdown stripping with clean code reuse:
Key Changes
extractJSONFromAgentError()- testable function for agent error handlingresourceFrom()to return cleaned string alongside parsed resourcesstripMarkdownCodeBlocks()helper used across both pathsTest Coverage
extractJSONFromAgentError()(nil, unrelated, JSON, YAML, generic, whitespace, wrapped)resourceFrom()markdown scenarios with cleaned string validationE2E testing with https://github.com/upbound/configuration-aws-database-ai
Results
✅ Haiku model works reliably with cost savings
✅ Clean Kubernetes event messages without markdown
✅ No redundant stripping operations
✅ Works correctly in both success and error cases
I have: