improve generate unit tests command output quality #907
Conversation
vscode/src/chat/MessageProvider.ts
@@ -206,6 +206,7 @@ export abstract class MessageProvider extends MessageHandler implements vscode.D
        // TODO(keegancsmith) guardrails may be slow, we need to make this async update the interaction.
        displayText = await this.guardrailsAnnotateAttributions(displayText)
        this.transcript.addAssistantResponse(text || '', displayText)
        void this.editor.controllers.command?.onCompletion(text)
this sends the response back to the command controller to create the test files when needed
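For illustration, here is a rough TypeScript sketch of that hand-off; the controller shape and helper names are hypothetical, not the actual Cody source. The `void` in the diff simply fires the call without awaiting it, so rendering the chat response is not blocked on file creation.

```typescript
// Hypothetical sketch of the hand-off described above: once the assistant response is
// final, MessageProvider passes it to the command controller, which creates a test file
// from the generated code when the active command asked for one.
interface CommandController {
    onCompletion(text?: string): Promise<void>
}

class UnitTestCommandController implements CommandController {
    // `createTestFile` stands in for whatever editor API writes the generated tests out.
    constructor(private readonly createTestFile: (content: string) => Promise<void>) {}

    public async onCompletion(text?: string): Promise<void> {
        if (!text) {
            return // nothing to do when the response is empty
        }
        await this.createTestFile(text)
    }
}
```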
- Your answers and suggestions should based on the shared context only.
- Do not suggest anything that would break the working code.
- Do not make assumptions or fabricating additional details.
- Do not make any assumptions about the code and file names or any misleading information.
this helps prevent Cody from "hallucinating"
Since this is applied everywhere it feels like a roll of the dice. We need to record a battery of inputs and outputs to monitor for regressions. But since we don't have that yet, I guess YOLO?
Curious how you arrived at "any misleading information" being useful.
I think you are thinking about the one we use in premade? That one is applied everywhere but this one is only applied to prompts for custom commands, on top of the one in premade.
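To make that layering concrete, here is a minimal sketch; the constant and function names and the exact string composition are illustrative, not the actual prompt code:

```typescript
// Illustrative only (not the actual prompt code): the premade preamble is applied to the
// whole session elsewhere; these extra rules are appended only when building prompts for
// custom commands, on top of that preamble.
const CUSTOM_COMMAND_RULES = [
    '- Your answers and suggestions should be based on the shared context only.',
    '- Do not suggest anything that would break the working code.',
    '- Do not make assumptions or fabricate additional details.',
    '- Do not make any assumptions about the code and file names or any misleading information.',
].join('\n')

function buildCustomCommandPrompt(instruction: string): string {
    // Only custom command prompts get these extra rules appended.
    return `${instruction}\n\n${CUSTOM_COMMAND_RULES}`
}
```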
Curious how you arrived at "any misleading information" being useful.
lol, a teammate reported an issue on Slack about Cody hallucinating directory and file names, which I was able to reproduce. I saw that in the rules we set for Cody we already asked Cody not to make up file names or file paths and to only show file paths they are certain exist, etc., which we can see are not doing much.
I noticed Cody doesn't know the connections between the files, file names, file paths, etc. we share with them. Long story short, referring to anything I share with Cody as context / information has been working well for me, so I tried the "any misleading information" approach (as users can also use non-file context, so I thought why not) by adding it to my custom command starter, asked Cody the same question, and have found it helpful so far.
^ You can see it in the output channel where I added "Do not make any assumptions about the code and file names or any misleading information." in front of the same question where the teammate reported the hallucination issue.
Just tried asking Cody the same question on web and got the same hallucination problem (the code ownership it mentioned doesn't exist):
If I prepend the statement to the question, I get the "honest" response. It just refers to the actual shared context:
cc @kalanchan regarding the next release
@abeatrix I’ll be writing this one! Do you have a description of how the generate test output is improved by these changes? (E.g. before and after examples? Cases that didn’t work well before, that do now?)
@toolmantim Just added more demo videos at the end of the PR description and added a Before and After section with example test output. Let me know if you need more info :)
Super cool. I gave it a go w/ some ts and go files in various projects! The results I got were a little mixed… even though some output is really promising.
I spotted in some cases the chat output contained this for existing tests:
I am writing these unit tests in the existing test file search_test.go which already contains tests for the search package.
Which makes it sound like Cody updated the existing file, which it doesn't (it opens the temporary file)
Also this happened when there was no existing file (one was created):
I am not importing any new packages since this test file already exists and follows the patterns from the existing tests.
I feel like we might need some wider testing this week, rather than rushing it into the release. And give @beyang & @csells a chance to see how it's going too.
Just saw this after the review, sorry! 😅
Yea. Also included a note
I don't think it's possible to get results that work perfectly for all cases. This is just an improved version of what we have now. I'd suggest comparing the results between the current version and the improved version to see the quality improvements.
I guess calling this a PR for improving context for the generate unit test command is more accurate 😅
Great work, @abeatrix!
Left a few non-blocking comments.
Co-authored-by: Taras Yemets <yemets.taras@gmail.com>
Update
Updated the PR to focus on improving:
Working with @toolmantim to explore options for inserting the generated tests into new/existing test files, which will be included in a new PR.
Some feedback inline.
I realize there's a lot of reviewers on this. Wanted to chime in to help but feel free to not block on me.
@@ -17,7 +17,7 @@
    },
    "test": {
        "description": "Generate unit tests",
        "prompt": "Write unit tests to validate the expected functionality for each function in the selected code. Ensure you cover any example test cases included in comments. If test files already exist, follow similar patterns for imports, setup, and assertions used in those tests. If no tests exist, feel free to import any common test libraries as needed. Focus on testing key behaviors and edge cases. Structure the tests so they are clear, maintainable, and can be run automatically. All generated code must be complete workable code, not fragments.",
        "prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
I'd love to see some samples of this on codebases which don't have package.json, and more languages.
Grammar nit: "only includes mocks" => "only include mocks"
Yee here is a list of videos of me trying this out in different languages:
typescript: https://www.loom.com/share/3489f9ae811b41d08ca0329489580543?sid=379b3122-3ea7-4b1e-bed2-12721e1f09b6
go: https://www.loom.com/share/d1514db896c840989e05f7c74fd30b81?sid=60f8d4d6-2b6c-42d3-b016-b347666893e9
go 2: https://www.loom.com/share/952caef2067a4d748bcd10eef8a60800?sid=04ca3f58-e2c8-474f-86ee-08af03f00689
scala/java: https://www.loom.com/share/d28b1089ac8a4d9ea95816c3e476ec35?sid=adabd913-3811-4009-9391-ccd1406fe91d
python: https://www.loom.com/share/7a788b389ce84a4b85fbe2d31f4a0ab4?sid=11d9314d-31c3-4530-be6a-b4e9faa5de0b
lib/shared/src/chat/prompts/utils.ts
@@ -76,6 +80,21 @@ export function isOnlySelectionRequired(contextConfig: CodyPromptContext): boole
    return !contextConfig.none && ((contextConfig.selection && contextConfigLength === 1) || !contextConfigLength)
}

// Extract the word "unit" / "e2e" / "integration" from a test prompt text
There are a lot more relevant words here, what about: acceptance, functional, smoke, sanity (ableist but I guess we meet the user where they are), stress, performance, regression, load, user acceptance, UAT, automated, security, accessibility, a11y, fuzz, ...
But since you only use this to look for "unit", why not simplify the code to just look for "unit"? You should also look for "unittest"... probably about 1/6th of "unit test" per Google Books ngrams but growing.
My thinking was that not everyone puts unit tests in a directory called unit, but they most likely do not exist in directories called e2e 🤣 (or if they do, they hopefully aren't the only unit tests in the project for us to find). The main goal is to find a couple of unit tests as examples for the LLM to see how unit tests are set up in the current project.
We try to find test files in the current directory first; if we find one, we stop there.
If we can't find any, then we do a project-level search.
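A rough sketch of that search order (using the VS Code API, but the globs and helper names are illustrative rather than the exact implementation in this PR):

```typescript
import * as vscode from 'vscode'

// Pull the test type keyword out of the prompt text; we mainly care about "unit" here,
// and the other keywords help us skip irrelevant matches.
export function extractTestType(prompt: string): string {
    const match = prompt.match(/\b(unit|e2e|integration)\b/i)
    return match ? match[1].toLowerCase() : ''
}

// Find example test files: same directory as the selected code first, then fall back
// to a project-level search that skips e2e/integration directories.
export async function findExampleTestFiles(currentDir: vscode.Uri): Promise<vscode.Uri[]> {
    const local = await vscode.workspace.findFiles(
        new vscode.RelativePattern(currentDir, '*{test,spec}*'),
        null,
        5
    )
    if (local.length > 0) {
        return local
    }
    return vscode.workspace.findFiles(
        '**/*{test,spec}*',
        '**/{e2e,integration,node_modules}/**',
        5
    )
}
```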
export function getClaudeHumanText(commandInstructions: string, currentFileName?: string): string {
    const promptText = prompts.instruction.replace('{humanInput}', commandInstructions)
    if (!currentFileName) {
        return promptText
Should we fail if we're returning prompts containing uninterpolated strings like `{languageName}`? That would be a programming error, and it is good to detect the problem close to the source.
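Something along these lines would do it (a hypothetical helper, not code in this PR):

```typescript
// Fail fast if a prompt still contains uninterpolated placeholders such as {languageName},
// so the programming error surfaces close to where the prompt is built.
function assertFullyInterpolated(promptText: string): string {
    const leftover = promptText.match(/\{[a-zA-Z]+\}/g)
    if (leftover) {
        throw new Error(`Prompt contains uninterpolated placeholders: ${leftover.join(', ')}`)
    }
    return promptText
}

// e.g. wrap the early return shown above:
// return assertFullyInterpolated(promptText)
```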
I find Claude would just fill in the language name "in their head" if one is not provided, so I think we are safe?
Looks good. Some minor feedback about the prompt inline.
@@ -17,7 +17,7 @@
    },
    "test": {
        "description": "Generate unit tests",
        "prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
        "prompt": "Review the shared code context and configurations to identify the test framework and libraries in use, then generate a suite of unit tests for the functions in <selected> using the detected test framework and libraries. Ensure to follow the same patterns as the shared unit tests (if any), and add packages, imports, dependencies, and assertions only if used in the shared code. Pay attention to the file path of each shared context to identify if a test file for the selected code already exists. If it exists, focuse on generate new unit tests for uncovered cases. If none detected, import common unit test libraries for {languageName}. Focus on validating the key functionality with simple and clear assertions. Only includes mocks if one is detected in shared code. Before writing the tests, determine which test libraries or frameworks to import, e.g. 'No new imports needed - using existing libs'/'Importing test framework that matches shared context usage'/'Importing the defined framework' etc. Then, tell me the file path where these tests should be added and summarize test coverage. At the end, enclose only fully completed code for the new unit tests in single markdown codeblock, no fragments. New tests should validate expected functionality and cover edge cases for <selected> with all required methods imported. Do not repeat existing tests.",
Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"?
Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.
These "think through" prompts for no new imports, importing test framework, etc. are super interesting too.
Spelling/grammar nits:
- `focuse on generate` → `focus on generating` (or maybe just `generate`?)
- `Only includes mocks` → `Only include mocks`
Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"? Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.
Good question! I noticed "if a test file exists" doesn't work in some languages; even if test files were shared, it would still refer to them as "shared Python code":
So I tried to minimize the use of the word "file" and replace it with either "shared context" or "shared code".
As for configurations, we share package.json for ts/js tests when no test context is found, but for non-ts/js tests, we just share a list of files located in the root dir to see if that gives the model a hint about whether a test framework is set up. On top of that, I'm not familiar with how tests are set up in other languages, so instead of listing each config file for the model to pick up, "configurations" seems to be working so far 🙏
I am not sure if I am on the right track, though 😅
And thanks for the grammar check and for taking the time to review 🙇♀️ Going to follow your suggestion and make sure the prompt still works before merging.
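For reference, a rough sketch of that context strategy; the helper name and exact behavior are illustrative, not the code in this PR:

```typescript
import * as vscode from 'vscode'

// For TS/JS, share package.json when no test files were found; for other languages,
// share a listing of root-level file names so the model can spot configuration hints
// (e.g. pytest.ini, build.sbt, go.mod) about which test framework is set up.
export async function getConfigContext(languageId: string, root: vscode.Uri): Promise<string> {
    if (languageId === 'typescript' || languageId === 'javascript') {
        try {
            const pkg = await vscode.workspace.fs.readFile(vscode.Uri.joinPath(root, 'package.json'))
            return new TextDecoder().decode(pkg)
        } catch {
            return '' // no package.json to share
        }
    }
    const entries = await vscode.workspace.fs.readDirectory(root)
    return entries.map(([name]) => name).join('\n')
}
```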
@dominiccooney I tried making the following changes and noticed a huge difference in the results. Thought you'd be interested!
Looks like it won't generate new tests for code with an existing test file if we don't ask it to "focus":
- `focuse on generate` → `generate`
- `focuse on generate` → `focus on generating`
Here is a demo run from the latest commit build: https://www.loom.com/share/a17c8c63d1284ac889cff2d0bdc93a9c?sid=aee107b7-bf23-4344-9235-a290d897e0f9
Tested the command with Go, Python, Scala, and TypeScript files as shown in the Loom video above. I have also worked on inserting the tests automatically into their test files (see video below) in #984, which was closed as we agreed it needs more design work, as well as code changes that would take a sprint or two to finish.
improved.unit.test.demo_.create.new.test.file.and.insert.new.tests.to.existing.test.file.mp4
As discussed with Tim (ref), I am going to merge what I have now for the quality improvement so I can focus on autocompletion work.
@abeatrix thanks for showing the output from those alternative prompts! It really illustrates how steep the drop-off in performance is, even when changing single words.
Moving the changelog item for #907 from v0.8.0 to v0.12.0. It was added to the wrong release when it was merged.

## Test plan

Changelog update.
Updating the current preamble, rules, and prompt mixin with new findings from #907.

### Issue 1: Cody hallucinates about file paths and code

One of the main issues we saw in Chat is Cody hallucinating about files and code that do not exist in the shared context. I noticed this is mainly because of how we phrase our context, as Cody does not understand the relationship between the code snippets we share. For more information, see #907 (comment).

One of the many prompts I've tried to prevent hallucination is adding `Do not make any assumptions about the code and file names or any misleading information` to the question, which has been working well for me. This is also the prompt we added to all our commands. [A customer has also confirmed](https://sourcegraph.slack.com/archives/C04MSD3DP5L/p1697747191354829?thread_ts=1697626152.033629&cid=C04MSD3DP5L) that adding this to their questions stopped Cody from hallucinating and produced better responses from Chat. This is why I propose adding it to `PromptMixin`, which is added to the start of every question (instead of the preamble, which is added to the start of the session and gets left out as the conversation gets longer). I don't think this will be 100% hallucination-proof, but it works a lot better than what we currently have.

#### Before - lib/share/ui/src/components/CodeBlock is not the correct path
![Screenshot 2023-10-18 at 2 45 49 PM](https://github.com/sourcegraph/cody/assets/68532117/9e8b6438-b37c-4745-a5f5-c6abcc2ceec8)

#### After - makes well-informed answers
![Screenshot 2023-10-18 at 2 47 07 PM](https://github.com/sourcegraph/cody/assets/68532117/87d8b9ce-b8bc-4298-a7be-9a6957a58c36)

### Issue 2: The current preambles need to be updated and are not working as intended most of the time

Addressed by this PR with the following changes:
- Simplified the preamble actions and rules text to be more concise.
- Removed redundant rules that will always be false, e.g. Cody never has direct access to your file or repository even when we tell them they do.
- Removed detailed instructions about code formatting and limitations, leaving just the essential rules.
- Shortened the preamble answer to focus on the core assistant persona and capabilities.

### Issue 3: Cody will not answer in a language that is not the default language of the editor

CLOSE https://github.com/sourcegraph/cody/discussions/1011 && #988

Many users have complained that they cannot get Cody to answer their questions because Cody refuses to answer questions that are not in the default language of the editor. Addressed by this PR with the following change: remove `languagePromptMixin` from the `activate` event for editors, which allows Cody to answer in the same language as the question.

#### Before
<img width="609" alt="Screenshot 2023-10-19 at 5 08 17 PM" src="https://github.com/sourcegraph/cody/assets/68532117/6265a46e-d5e8-489b-a92d-824b56a391dc">

#### After
<img width="624" alt="Screenshot 2023-10-19 at 5 07 57 PM" src="https://github.com/sourcegraph/cody/assets/68532117/b1478a0f-8bfa-468b-abe2-69588f1deef6">

## Test plan

1. Ask Cody a question in another language that is not the same as your editor's. Cody should still be able to answer your question in the same language.
2. Ask Cody a question in the editor, and then ask the same question in Web to compare the quality. Questions where Cody hallucinates on Web should not happen in the editor when using the updated prompts.
Example question to ask in the cody repo: `Do we format the completion item before we suggest them?`
<img width="883" alt="image" src="https://github.com/sourcegraph/cody/assets/68532117/86b1c8c5-4e2c-4674-9ad6-18382dbb2813">
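As a rough illustration of the `PromptMixin` idea described above (the class shape here is simplified and hypothetical, not the actual Cody API):

```typescript
// Prepend the anti-hallucination instruction to every human question so it keeps applying
// even when the preamble falls out of the context window in long conversations.
const HALLUCINATION_GUARD =
    'Do not make any assumptions about the code and file names or any misleading information.'

class PromptMixin {
    constructor(private readonly prompt: string) {}

    // Apply all registered mixins in front of the human question.
    public static mixInto(humanInput: string, mixins: PromptMixin[]): string {
        const prefix = mixins.map(m => m.prompt).join(' ')
        return prefix ? `${prefix}\n\nQuestion: ${humanInput}` : humanInput
    }
}

// Usage: every chat question gets the guard prepended before it is sent to the LLM.
const mixins = [new PromptMixin(HALLUCINATION_GUARD)]
const prompt = PromptMixin.mixInto('Do we format the completion item before we suggest them?', mixins)
```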
Close #619
This PR contains changes that improve the current generate unit tests command:
Before and After
The tests will not always be perfect, but with the changes mentioned above, the improved tests should do the following consistently when possible:
before
In the video below, you can see the test imports the "chai" testing library instead of "vitest", the testing framework used by other test files in the project:
Test.-.before.mp4
You can also see from the context list that it includes test files from e2e and integration tests that are not relevant to unit tests:
after
In the improved version, you can see it imports the correct testing framework, "vitest", and uses more relevant test files as context:
Test.-.after.mp4
More examples
before - go
uses 'NewMockCache', which does not exist
after - go
Follow-up
Test plan
- Run the `Generate Unit Tests` command
- Run the `Generate Unit Tests` command
- Try running other commands. The changes in this PR should not change the behavior of other commands.
Demo_.Improved.Unit.Tests.mp4
It doesn't always produce perfect tests, but it consistently generates better, higher-quality tests compared to the older version of the command. It works consistently with other languages as well. More demo videos:
typescript: https://www.loom.com/share/3489f9ae811b41d08ca0329489580543?sid=379b3122-3ea7-4b1e-bed2-12721e1f09b6
go: https://www.loom.com/share/d1514db896c840989e05f7c74fd30b81?sid=60f8d4d6-2b6c-42d3-b016-b347666893e9
go 2: https://www.loom.com/share/952caef2067a4d748bcd10eef8a60800?sid=04ca3f58-e2c8-474f-86ee-08af03f00689
scala/java: https://www.loom.com/share/d28b1089ac8a4d9ea95816c3e476ec35?sid=adabd913-3811-4009-9391-ccd1406fe91d
python: https://www.loom.com/share/7a788b389ce84a4b85fbe2d31f4a0ab4?sid=11d9314d-31c3-4530-be6a-b4e9faa5de0b