
Create integration test suite that regularly runs against our default Enterprise model (Claude 3 Sonnet) #3913

Open
jtibshirani opened this issue Apr 23, 2024 · 0 comments

jtibshirani commented Apr 23, 2024

We plan to add integration tests for the quality of the LLM output in chat. The idea is to look for signals that correlate with high-quality responses, rather than directly measuring response correctness (which can be really challenging!). We can put together a test suite using a bunch of examples from our dogfood Chat logs and check the following properties:

  • Character consistency. Give the LLM random (or even adversarial) context and questions. The response should never “break character”, meaning:
    • It should never apologize (no “sorry” or “apologize” in the output)
    • It should never hedge excessively (no “unfortunately”, “I am an LLM”)
    • It should never say it doesn’t have access to the codebase
  • Groundedness. Ask the LLM a question like “what file contains…” and provide context with the correct file. The LLM should mention the file in its output, and the path should be totally accurate. Also test this for symbol names.
  • Non-hallucination. Ask the LLM a question like “what file contains…” and provide context that contains random files, with no relation to the question. The LLM should not mention any file or symbol directly.
  • … and other simple properties!
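
For illustration, the three properties above could start out as plain string checks. This is a minimal Python sketch, not the Cody implementation: the function names, the banned-phrase patterns, and any paths passed in are hypothetical and would need tuning against real chat logs.

```python
import re

# Hypothetical patterns derived from the checklist above; the exact list is an assumption.
BANNED_PATTERNS = [
    r"\bsorry\b",
    r"\bapolog\w*",
    r"\bunfortunately\b",
    r"\bI am an LLM\b",
    r"\b(?:don['’]t|do not|doesn['’]t|does not) have access to (?:the|your) codebase\b",
]


def character_violations(response: str) -> list[str]:
    """Return every banned pattern that matches the response (case-insensitive)."""
    return [p for p in BANNED_PATTERNS if re.search(p, response, re.IGNORECASE)]


def mentions_exactly(response: str, name: str) -> bool:
    """True if `name` (a file path or symbol) appears as a whole token.

    The lookarounds reject longer paths or identifiers that merely contain
    `name`, e.g. `lib/src/a.ts` or `src/a.ts.bak` when looking for `src/a.ts`.
    """
    pattern = r"(?<![\w./-])" + re.escape(name) + r"(?![\w/-]|\.\w)"
    return re.search(pattern, response) is not None


def is_grounded(response: str, expected: str) -> bool:
    """Groundedness: the correct file or symbol from the context is mentioned exactly."""
    return mentions_exactly(response, expected)


def leaked_mentions(response: str, distractors: list[str]) -> list[str]:
    """Non-hallucination: return the unrelated context names the response mentions."""
    return [d for d in distractors if mentions_exactly(response, d)]
```

A test case would then assert that `character_violations(...)` and `leaked_mentions(...)` are empty and that `is_grounded(...)` is true. Note that `leaked_mentions` only catches names taken from the supplied context, which is narrower than "no file or symbol at all"; catching invented paths would need an extra heuristic.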

We can build off the existing test framework for recording and replaying LLM responses.
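
As a rough picture of what record/replay means here, the sketch below stores each response on disk keyed by a hash of the prompt, so that most test runs never call the model. It is a generic Python illustration under my own assumptions and does not describe the existing framework: `ReplayCache`, its file layout, and the `call_model` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ReplayCache:
    """Hypothetical record/replay store: one JSON file per prompt hash."""

    def __init__(self, directory: Path, call_model: Callable[[str], str], record: bool = False):
        self.directory = directory
        self.call_model = call_model  # assumed to send the prompt to the live model
        self.record = record

    def _path(self, prompt: str) -> Path:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def complete(self, prompt: str) -> str:
        path = self._path(prompt)
        if path.exists():
            # Replay: return the stored response without calling the model.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        if not self.record:
            raise LookupError("no recording for this prompt; re-run in record mode")
        # Record: call the live model once and persist the result.
        response = self.call_model(prompt)
        self.directory.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"prompt": prompt, "response": response}), encoding="utf-8")
        return response
```

With something like this, the property checks can run against replayed responses on every commit, while a scheduled job in record mode refreshes the recordings against the live default model.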

Once we have a test suite, we can plan a round of improvements to our preamble, default model, etc. We could also incorporate the checks into telemetry, perhaps as part of Cody Gateway.

Original RFC: https://docs.google.com/document/d/1I-xTsAOsDjv_G-J3vt9oQKuWYLJnr8wehrRMMef1XR4/edit

@github-actions github-actions bot added the cody label Apr 23, 2024
@chenkc805 chenkc805 changed the title Chat: integration tests for response quality Create integration test suite that regularly runs against our default Enterprise model (Claude 3 Sonnet) May 6, 2024