We plan to add integration tests for the quality of the LLM output in chat. The idea is to look for signals that correlate with high-quality responses, rather than directly measuring response correctness (which is much harder). We can put together a test suite from examples in our dogfood Chat logs and check the following properties:
Character consistency. Give the LLM random (or even adversarial) context and questions. The response should never "break character", meaning:
- It should never apologize (no "sorry" or "apologize" in the output)
- No extreme hedging (no "unfortunately", no "I am an LLM")
- It should never say it doesn't have access to the codebase
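These checks could be sketched as a simple banned-phrase scan over the response. The function name and phrase list below are illustrative assumptions, not from the issue; a real suite would tune the list against dogfood logs:

```typescript
// Sketch of a character-consistency check: flag responses containing any
// phrase that signals the assistant has "broken character". The phrase list
// is an illustrative starting point, not exhaustive.
const BANNED_PHRASES: string[] = [
  "sorry",
  "apologize",
  "unfortunately",
  "i am an llm",
  "don't have access to the codebase",
  "do not have access to the codebase",
];

function staysInCharacter(response: string): boolean {
  const lower = response.toLowerCase();
  return !BANNED_PHRASES.some((phrase) => lower.includes(phrase));
}
```

Substring matching keeps the check cheap enough to run over large batches of logged responses; false positives (e.g. "sorry" inside a quoted code comment) would need review before the check gates anything.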
Groundedness. Ask the LLM a question like “what file contains…” and provide context with the correct file. The LLM should mention the file in its output, and the path should be totally accurate. Also test this for symbol names.
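A groundedness check could assert that every expected path and symbol appears verbatim in the response. The helper below is a hypothetical sketch (name and signature are assumptions):

```typescript
// Sketch of a groundedness check: given the file paths / symbol names that
// were supplied in context and are the known-correct answers, the response
// must cite each one exactly (no truncated or misspelled paths).
// Note: plain substring matching would also accept a longer path that merely
// contains the expected one; a real check might add word-boundary logic.
function citesExactly(response: string, expected: string[]): boolean {
  return expected.every((item) => response.includes(item));
}
```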
Non-hallucination. Ask the LLM a question like “what file contains…” and provide context that contains random files, with no relation to the question. The LLM should not mention any file or symbol directly.
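The inverse check: with only unrelated files in context, the response should commit to none of them. Again a hypothetical helper, sketched under the same assumptions as above:

```typescript
// Sketch of a non-hallucination check: the context contains only distractor
// files unrelated to the question, so a grounded response should not name
// any of them. (Catching freshly invented paths would need a broader scan,
// e.g. a path-shaped regex over the whole response.)
function avoidsDistractors(response: string, distractors: string[]): boolean {
  return !distractors.some((path) => response.includes(path));
}
```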
Once we have a test suite, we can plan a round of improvements to our preamble, default model, etc. We could also incorporate the checks into telemetry, perhaps as part of Cody Gateway.
chenkc805 changed the title from "Chat: integration tests for response quality" to "Create integration test suite that regularly runs against our default Enterprise model (Claude 3 Sonnet)" on May 6, 2024.
We can build off the existing test framework for recording and replaying LLM responses.
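That record/replay layer might look roughly like a prompt-keyed cache. The class and method names below are assumptions for illustration, not the actual framework's API:

```typescript
// Sketch of record/replay for deterministic quality tests: live LLM
// responses are recorded once, then replayed on later runs so the suite
// is stable and does not depend on the model endpoint being available.
class ReplayStore {
  private recordings = new Map<string, string>();

  record(prompt: string, response: string): void {
    this.recordings.set(prompt, response);
  }

  // Returns the recorded response, or undefined on a cache miss (in which
  // case a real harness would fall back to a live call and record it).
  replay(prompt: string): string | undefined {
    return this.recordings.get(prompt);
  }
}
```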
Original RFC: https://docs.google.com/document/d/1I-xTsAOsDjv_G-J3vt9oQKuWYLJnr8wehrRMMef1XR4/edit