We plan to add integration tests for the quality of the LLM output in chat. The idea is to look for signals that correlate with high-quality responses, rather than directly measuring response correctness (which is much harder). We can put together a test suite from examples in our dogfood Chat logs and check the following properties:
Character consistency. Give the LLM random (or even adversarial) context and questions. The response should never "break character", meaning:
- It should never apologize (no "sorry" or "apologize" in the output)
- No extreme hedging (no "unfortunately", no "I am an LLM")
- It should never say it doesn't have access to the codebase
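These checks could be sketched as a simple banned-phrase scan over the response. The function name and phrase list below are illustrative assumptions, not from the issue; a real suite would tune the list against dogfood logs:

```typescript
// Sketch of a character-consistency check: flag responses containing any
// phrase that signals the assistant has "broken character". The phrase list
// is an illustrative starting point, not exhaustive.
const BANNED_PHRASES: string[] = [
  "sorry",
  "apologize",
  "unfortunately",
  "i am an llm",
  "don't have access to the codebase",
  "do not have access to the codebase",
];

function staysInCharacter(response: string): boolean {
  const lower = response.toLowerCase();
  return !BANNED_PHRASES.some((phrase) => lower.includes(phrase));
}
```

Substring matching keeps the check cheap enough to run over large batches of logged responses; false positives (e.g. "sorry" inside a quoted code comment) would need review before the check gates anything.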
Groundedness. Ask the LLM a question like “what file contains…” and provide context with the correct file. The LLM should mention the file in its output, and the path should be totally accurate. Also test this for symbol names.
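A groundedness check could assert that every expected path and symbol appears verbatim in the response. The helper below is a hypothetical sketch (name and signature are assumptions):

```typescript
// Sketch of a groundedness check: given the file paths / symbol names that
// were supplied in context and are the known-correct answers, the response
// must cite each one exactly (no truncated or misspelled paths).
// Note: plain substring matching would also accept a longer path that merely
// contains the expected one; a real check might add word-boundary logic.
function citesExactly(response: string, expected: string[]): boolean {
  return expected.every((item) => response.includes(item));
}
```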
Non-hallucination. Ask the LLM a question like “what file contains…” and provide context that contains random files, with no relation to the question. The LLM should not mention any file or symbol directly.
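The inverse check: with only unrelated files in context, the response should commit to none of them. Again a hypothetical helper, sketched under the same assumptions as above:

```typescript
// Sketch of a non-hallucination check: the context contains only distractor
// files unrelated to the question, so a grounded response should not name
// any of them. (Catching freshly invented paths would need a broader scan,
// e.g. a path-shaped regex over the whole response.)
function avoidsDistractors(response: string, distractors: string[]): boolean {
  return !distractors.some((path) => response.includes(path));
}
```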
Once we have a test suite, we can plan a round of improvements to our preamble, default model, etc. We could also incorporate the checks into telemetry, perhaps as part of Cody Gateway.
chenkc805 changed the title from "Chat: integration tests for response quality" to "Create integration test suite that regularly runs against our default Enterprise model (Claude 3 Sonnet)" on May 6, 2024.
We can build off the existing test framework for recording and replaying LLM responses.
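That record/replay layer might look roughly like a prompt-keyed cache. The class and method names below are assumptions for illustration, not the actual framework's API:

```typescript
// Sketch of record/replay for deterministic quality tests: live LLM
// responses are recorded once, then replayed on later runs so the suite
// is stable and does not depend on the model endpoint being available.
class ReplayStore {
  private recordings = new Map<string, string>();

  record(prompt: string, response: string): void {
    this.recordings.set(prompt, response);
  }

  // Returns the recorded response, or undefined on a cache miss (in which
  // case a real harness would fall back to a live call and record it).
  replay(prompt: string): string | undefined {
    return this.recordings.get(prompt);
  }
}
```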
Original RFC: https://docs.google.com/document/d/1I-xTsAOsDjv_G-J3vt9oQKuWYLJnr8wehrRMMef1XR4/edit