improve generate unit tests command output quality #907
Conversation
vscode/src/chat/MessageProvider.ts
@@ -206,6 +206,7 @@ export abstract class MessageProvider extends MessageHandler implements vscode.D
        // TODO(keegancsmith) guardrails may be slow, we need to make this async update the interaction.
        displayText = await this.guardrailsAnnotateAttributions(displayText)
        this.transcript.addAssistantResponse(text || '', displayText)
        void this.editor.controllers.command?.onCompletion(text)
this sends the response back to the command controller to create the test files when needed
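For illustration, here is a rough TypeScript sketch of that hand-off; the controller shape and helper names are hypothetical, not the actual Cody source. The `void` in the diff simply fires the call without awaiting it, so rendering the chat response is not blocked on file creation.

```typescript
// Hypothetical sketch of the hand-off described above: once the assistant response is
// final, MessageProvider passes it to the command controller, which creates a test file
// from the generated code when the active command asked for one.
interface CommandController {
    onCompletion(text?: string): Promise<void>
}

class UnitTestCommandController implements CommandController {
    // `createTestFile` stands in for whatever editor API writes the generated tests out.
    constructor(private readonly createTestFile: (content: string) => Promise<void>) {}

    public async onCompletion(text?: string): Promise<void> {
        if (!text) {
            return // nothing to do when the response is empty
        }
        await this.createTestFile(text)
    }
}
```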
- Your answers and suggestions should based on the shared context only.
- Do not suggest anything that would break the working code.
- Do not make assumptions or fabricating additional details.
- Do not make any assumptions about the code and file names or any misleading information.
this helps prevent Cody from "hallucinating"
Since this is applied everywhere it feels like a roll of the dice. We need to record a battery of inputs and outputs to monitor for regressions. But since we don't have that yet, I guess YOLO?
Curious how you arrived at "any misleading information" being useful.
I think you are thinking about the one we use in premade? That one is applied everywhere but this one is only applied to prompts for custom commands, on top of the one in premade.
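To make that layering concrete, here is a minimal sketch; the constant and function names and the exact string composition are illustrative, not the actual prompt code:

```typescript
// Illustrative only (not the actual prompt code): the premade preamble is applied to the
// whole session elsewhere; these extra rules are appended only when building prompts for
// custom commands, on top of that preamble.
const CUSTOM_COMMAND_RULES = [
    '- Your answers and suggestions should be based on the shared context only.',
    '- Do not suggest anything that would break the working code.',
    '- Do not make assumptions or fabricate additional details.',
    '- Do not make any assumptions about the code and file names or any misleading information.',
].join('\n')

function buildCustomCommandPrompt(instruction: string): string {
    // Only custom command prompts get these extra rules appended.
    return `${instruction}\n\n${CUSTOM_COMMAND_RULES}`
}
```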
Curious how you arrived at "any misleading information" being useful.
lol, a teammate reported an issue on Slack about Cody hallucinating directory and file names, which I was able to reproduce. I saw that in the rules we set for Cody we already asked Cody not to make up file names or file paths and to only show file paths they are certain exist, etc., which we can see are not doing much.
I noticed Cody doesn't know the connections between the files, file names, file paths, etc. we share with them. Long story short, referring to anything I share with Cody as context / information has been working well for me, so I tried the "any misleading information" approach (as users can also use non-file context, so I thought why not) by adding it to my custom command starter, asked Cody the same question, and have found it helpful so far.
^ You can see it in the output channel where I added "Do not make any assumptions about the code and file names or any misleading information." in front of the same question where the teammate reported the hallucination issue.
Just tried asking Cody the same question on web and got the same hallucination problem (the code ownership it mentioned doesn't exist):
If I prepend the statement to the question, I get the "honest" response. It just refers to the actual shared context:
cc @kalanchan regarding the next release
@abeatrix I’ll be writing this one! Do you have a description of how the generate test output is improved by these changes? (E.g. before and after examples? Cases that didn’t work well before, that do now?)
@toolmantim Just added more demo videos at the end of the PR description and added a Before and After section with example test output. Let me know if you need more info :)
Super cool. I gave it a go w/ some ts and go files in various projects! The results I got were a little mixed… even though some output is really promising.
I spotted in some cases the chat output contained this for existing tests:
I am writing these unit tests in the existing test file search_test.go which already contains tests for the search package.
Which makes it sound like Cody updated the existing file, which it doesn't (it opens the temporary file)
Also this happened when there was no existing file (one was created):
I am not importing any new packages since this test file already exists and follows the patterns from the existing tests.
I feel like we might need some wider testing this week, rather than rushing it into the release. And give @beyang & @csells a chance to see how it's going too.
Just saw this after the review, sorry! 😅
Yea. Also included a note
I don't think it's possible to get results that work perfectly for all cases. This is just an improved version of what we have now. I'd suggest comparing the results between the current version and the improved version to see the quality improvements.
I guess calling this a PR for improving context for the generate unit test command is more accurate 😅
Great work, @abeatrix!
Left a few non-blocking comments.
Co-authored-by: Taras Yemets <yemets.taras@gmail.com>
Update
Updated the PR to focus on improving:
Working with @toolmantim to explore options for inserting the generated tests into new/existing test files, which will be included in a new PR.
Some feedback inline.
I realize there's a lot of reviewers on this. Wanted to chime in to help but feel free to not block on me.
@@ -17,7 +17,7 @@
    },
    "test": {
        "description": "Generate unit tests",
        "prompt": "Write unit tests to validate the expected functionality for each function in the selected code. Ensure you cover any example test cases included in comments. If test files already exist, follow similar patterns for imports, setup, and assertions used in those tests. If no tests exist, feel free to import any common test libraries as needed. Focus on testing key behaviors and edge cases. Structure the tests so they are clear, maintainable, and can be run automatically. All generated code must be complete workable code, not fragments.",
        "prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
I'd love to see some samples of this on codebases which don't have package.json, and more languages.
Grammar nit: "only includes mocks" => "only include mocks"
Yee here is a list of videos of me trying this out in different languages:
typescript: https://www.loom.com/share/3489f9ae811b41d08ca0329489580543?sid=379b3122-3ea7-4b1e-bed2-12721e1f09b6
go: https://www.loom.com/share/d1514db896c840989e05f7c74fd30b81?sid=60f8d4d6-2b6c-42d3-b016-b347666893e9
go 2: https://www.loom.com/share/952caef2067a4d748bcd10eef8a60800?sid=04ca3f58-e2c8-474f-86ee-08af03f00689
scala/java: https://www.loom.com/share/d28b1089ac8a4d9ea95816c3e476ec35?sid=adabd913-3811-4009-9391-ccd1406fe91d
python: https://www.loom.com/share/7a788b389ce84a4b85fbe2d31f4a0ab4?sid=11d9314d-31c3-4530-be6a-b4e9faa5de0b
lib/shared/src/chat/prompts/utils.ts
@@ -76,6 +80,21 @@ export function isOnlySelectionRequired(contextConfig: CodyPromptContext): boole
    return !contextConfig.none && ((contextConfig.selection && contextConfigLength === 1) || !contextConfigLength)
}

// Extract the word "unit" / "e2e" / "integration" from a test prompt text
There are a lot more relevant words here, what about: acceptance, functional, smoke, sanity (ableist but I guess we meet the user where they are), stress, performance, regression, load, user acceptance, UAT, automated, security, accessibility, a11y, fuzz, ...
But since you only use this to look for "unit", why not simplify the code to just look for "unit"? You should also look for "unittest"... probably about 1/6th of "unit test" per Google Books ngrams but growing.
My thinking was that not everyone puts unit tests in a directory called unit, but they most likely do not exist in directories called e2e 🤣 (or if they do, they hopefully aren't the only unit tests in the project for us to find). The main goal is to find a couple of unit tests as examples for the LLM to see how unit tests are set up in the current project.
We try to find test files in the current directory first; if we find one, we stop there.
If we can't find any, then we do a project-level search.
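A rough sketch of that search order (using the VS Code API, but the globs and helper names are illustrative rather than the exact implementation in this PR):

```typescript
import * as vscode from 'vscode'

// Pull the test type keyword out of the prompt text; we mainly care about "unit" here,
// and the other keywords help us skip irrelevant matches.
export function extractTestType(prompt: string): string {
    const match = prompt.match(/\b(unit|e2e|integration)\b/i)
    return match ? match[1].toLowerCase() : ''
}

// Find example test files: same directory as the selected code first, then fall back
// to a project-level search that skips e2e/integration directories.
export async function findExampleTestFiles(currentDir: vscode.Uri): Promise<vscode.Uri[]> {
    const local = await vscode.workspace.findFiles(
        new vscode.RelativePattern(currentDir, '*{test,spec}*'),
        null,
        5
    )
    if (local.length > 0) {
        return local
    }
    return vscode.workspace.findFiles(
        '**/*{test,spec}*',
        '**/{e2e,integration,node_modules}/**',
        5
    )
}
```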
export function getClaudeHumanText(commandInstructions: string, currentFileName?: string): string {
    const promptText = prompts.instruction.replace('{humanInput}', commandInstructions)
    if (!currentFileName) {
        return promptText
Should we fail if we're returning prompts containing uninterpolated strings like `{languageName}`? That would be a programming error, and it is good to detect the problem close to the source.
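Something along these lines would do it (a hypothetical helper, not code in this PR):

```typescript
// Fail fast if a prompt still contains uninterpolated placeholders such as {languageName},
// so the programming error surfaces close to where the prompt is built.
function assertFullyInterpolated(promptText: string): string {
    const leftover = promptText.match(/\{[a-zA-Z]+\}/g)
    if (leftover) {
        throw new Error(`Prompt contains uninterpolated placeholders: ${leftover.join(', ')}`)
    }
    return promptText
}

// e.g. wrap the early return shown above:
// return assertFullyInterpolated(promptText)
```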
I find Claude would just fill in the language name "in their head" if one is not provided, so I think we are safe?
Looks good. Some minor feedback about the prompt inline.
@@ -17,7 +17,7 @@
    },
    "test": {
        "description": "Generate unit tests",
        "prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
        "prompt": "Review the shared code context and configurations to identify the test framework and libraries in use, then generate a suite of unit tests for the functions in <selected> using the detected test framework and libraries. Ensure to follow the same patterns as the shared unit tests (if any), and add packages, imports, dependencies, and assertions only if used in the shared code. Pay attention to the file path of each shared context to identify if a test file for the selected code already exists. If it exists, focuse on generate new unit tests for uncovered cases. If none detected, import common unit test libraries for {languageName}. Focus on validating the key functionality with simple and clear assertions. Only includes mocks if one is detected in shared code. Before writing the tests, determine which test libraries or frameworks to import, e.g. 'No new imports needed - using existing libs'/'Importing test framework that matches shared context usage'/'Importing the defined framework' etc. Then, tell me the file path where these tests should be added and summarize test coverage. At the end, enclose only fully completed code for the new unit tests in single markdown codeblock, no fragments. New tests should validate expected functionality and cover edge cases for <selected> with all required methods imported. Do not repeat existing tests.",
Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"?
Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.
These "think through" prompts for no new imports, importing test framework, etc. are super interesting too.
Spelling/grammar nits:
- `focuse on generate` → `focus on generating` (or maybe just `generate`?)
- `Only includes mocks` → `Only include mocks`
Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"? Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.
Good question! I noticed "if a test file exists" doesn't work in some languages; even if test files were shared, it would still refer to them as "shared Python code":
So I tried to minimize the use of the word "file" and replace it with either "shared context" or "shared code".
As for configurations, we share package.json for ts/js tests when no test context is found, but for non-ts/js tests, we just share a list of files located in the root dir to see if that gives the model a hint about whether a test framework is set up. On top of that, I'm not familiar with how tests are set up in other languages, so instead of listing each config file for the model to pick up, "configurations" seems to be working so far 🙏
I am not sure if I am on the right track, though 😅
And thanks for the grammar check and for taking the time to review 🙇♀️ Going to follow your suggestion and make sure the prompt still works before merging.
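For reference, a rough sketch of that context strategy; the helper name and exact behavior are illustrative, not the code in this PR:

```typescript
import * as vscode from 'vscode'

// For TS/JS, share package.json when no test files were found; for other languages,
// share a listing of root-level file names so the model can spot configuration hints
// (e.g. pytest.ini, build.sbt, go.mod) about which test framework is set up.
export async function getConfigContext(languageId: string, root: vscode.Uri): Promise<string> {
    if (languageId === 'typescript' || languageId === 'javascript') {
        try {
            const pkg = await vscode.workspace.fs.readFile(vscode.Uri.joinPath(root, 'package.json'))
            return new TextDecoder().decode(pkg)
        } catch {
            return '' // no package.json to share
        }
    }
    const entries = await vscode.workspace.fs.readDirectory(root)
    return entries.map(([name]) => name).join('\n')
}
```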
@dominiccooney I tried making the following changes and noticed a huge difference in the results. Thought you'd be interested!
Looks like it won't generate new tests for code with an existing test file if we don't ask it to "focus":
- `focuse on generate` → `generate`
- `focuse on generate` → `focus on generating`
Here is a demo run from the latest commit build: https://www.loom.com/share/a17c8c63d1284ac889cff2d0bdc93a9c?sid=aee107b7-bf23-4344-9235-a290d897e0f9
Tested the command with Go, Python, Scala, and TypeScript files as shown in the Loom video above. I have also worked on inserting the tests automatically into their test files (see video below) in #984, which was closed as we agreed it needs more design work, as well as code changes that would take a sprint or two to finish.
improved.unit.test.demo_.create.new.test.file.and.insert.new.tests.to.existing.test.file.mp4
As discussed with Tim (ref), I am going to merge what I have now for the quality improvement so I can focus on autocompletion work.
@abeatrix thanks for showing the output from those alternative prompts! It really illustrates how steep the drop-off in performance is, even when changing single words.
Moving the changelog item for #907 from v0.8.0 to v0.12.0. It was added to the wrong release when it was merged.

## Test plan

Changelog update.
Updating the current preamble, rules, and prompt mixin with new findings from #907.

### Issue 1: Cody hallucinates about file paths and code

One of the main issues we saw in Chat is Cody hallucinating about files and code that do not exist in the shared context. I noticed this is mainly because of how we phrase our context, as Cody does not understand the relationship between the code snippets we share. For more information, see #907 (comment).

One of the many prompts I've tried to prevent hallucination is adding `Do not make any assumptions about the code and file names or any misleading information` to the question, which has been working well for me. This is also the prompt we added to all our commands. [A customer has also confirmed](https://sourcegraph.slack.com/archives/C04MSD3DP5L/p1697747191354829?thread_ts=1697626152.033629&cid=C04MSD3DP5L) that adding this to their questions stopped Cody from hallucinating and produced better responses from Chat. This is why I propose adding it to `PromptMixin`, which is added to the start of every question (instead of the preamble, which is added to the start of the session and gets left out as the conversation gets longer). I don't think this will be 100% hallucination-proof, but it works a lot better than what we currently have.

#### Before - lib/share/ui/src/components/CodeBlock is not the correct path
![Screenshot 2023-10-18 at 2 45 49 PM](https://github.com/sourcegraph/cody/assets/68532117/9e8b6438-b37c-4745-a5f5-c6abcc2ceec8)

#### After - makes well-informed answers
![Screenshot 2023-10-18 at 2 47 07 PM](https://github.com/sourcegraph/cody/assets/68532117/87d8b9ce-b8bc-4298-a7be-9a6957a58c36)

### Issue 2: The current preambles need to be updated and are not working as intended most of the time

Addressed by this PR with the following changes:
- Simplified the preamble actions and rules text to be more concise.
- Removed redundant rules that will always be false, e.g. Cody never has direct access to your file or repository even when we tell them they do.
- Removed detailed instructions about code formatting and limitations, leaving just the essential rules.
- Shortened the preamble answer to focus on the core assistant persona and capabilities.

### Issue 3: Cody will not answer in a language that is not the default language of the editor

CLOSE https://github.com/sourcegraph/cody/discussions/1011 && #988

Many users have complained that they cannot get Cody to answer their questions because Cody refuses to answer questions that are not in the default language of the editor. Addressed by this PR with the following change: remove `languagePromptMixin` from the `activate` event for editors, which allows Cody to answer in the same language as the question.

#### Before
<img width="609" alt="Screenshot 2023-10-19 at 5 08 17 PM" src="https://github.com/sourcegraph/cody/assets/68532117/6265a46e-d5e8-489b-a92d-824b56a391dc">

#### After
<img width="624" alt="Screenshot 2023-10-19 at 5 07 57 PM" src="https://github.com/sourcegraph/cody/assets/68532117/b1478a0f-8bfa-468b-abe2-69588f1deef6">

## Test plan

1. Ask Cody a question in another language that is not the same as your editor's. Cody should still be able to answer your question in the same language.
2. Ask Cody a question in the editor, and then ask the same question in Web to compare the quality. Questions where Cody hallucinates on Web should not happen in the editor when using the updated prompts.
Example question to ask in the cody repo: `Do we format the completion item before we suggest them?`
<img width="883" alt="image" src="https://github.com/sourcegraph/cody/assets/68532117/86b1c8c5-4e2c-4674-9ad6-18382dbb2813">
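As a rough illustration of the `PromptMixin` idea described above (the class shape here is simplified and hypothetical, not the actual Cody API):

```typescript
// Prepend the anti-hallucination instruction to every human question so it keeps applying
// even when the preamble falls out of the context window in long conversations.
const HALLUCINATION_GUARD =
    'Do not make any assumptions about the code and file names or any misleading information.'

class PromptMixin {
    constructor(private readonly prompt: string) {}

    // Apply all registered mixins in front of the human question.
    public static mixInto(humanInput: string, mixins: PromptMixin[]): string {
        const prefix = mixins.map(m => m.prompt).join(' ')
        return prefix ? `${prefix}\n\nQuestion: ${humanInput}` : humanInput
    }
}

// Usage: every chat question gets the guard prepended before it is sent to the LLM.
const mixins = [new PromptMixin(HALLUCINATION_GUARD)]
const prompt = PromptMixin.mixInto('Do we format the completion item before we suggest them?', mixins)
```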
Close #619
This PR contains changes that improve the current generate unit tests command:
Before and After
The tests will not always be perfect, but with the changes mentioned above, the improved tests should do the following consistently when possible:
before
In the video below, you can see the test imports the "chai" testing library instead of "vitest", the testing framework used by other test files in the project:
Test.-.before.mp4
You can also see from the context list that it includes test files from e2e and integration tests that are not relevant to unit tests:
after
In the improved version, you can see it imports the correct testing framework, "vitest", and uses more relevant test files as context:
Test.-.after.mp4
More examples
before - go
uses 'NewMockCache', which does not exist
after - go
Follow-up
Test plan
- Run the `Generate Unit Tests` command
- Run the `Generate Unit Tests` command
- Try running other commands. The changes in this PR should not change the behavior of other commands.
Demo_.Improved.Unit.Tests.mp4
It doesn't always produce perfect tests, but it consistently generates better, higher-quality tests compared to the older version of the command. It works consistently with other languages as well. More demo videos:
typescript: https://www.loom.com/share/3489f9ae811b41d08ca0329489580543?sid=379b3122-3ea7-4b1e-bed2-12721e1f09b6
go: https://www.loom.com/share/d1514db896c840989e05f7c74fd30b81?sid=60f8d4d6-2b6c-42d3-b016-b347666893e9
go 2: https://www.loom.com/share/952caef2067a4d748bcd10eef8a60800?sid=04ca3f58-e2c8-474f-86ee-08af03f00689
scala/java: https://www.loom.com/share/d28b1089ac8a4d9ea95816c3e476ec35?sid=adabd913-3811-4009-9391-ccd1406fe91d
python: https://www.loom.com/share/7a788b389ce84a4b85fbe2d31f4a0ab4?sid=11d9314d-31c3-4530-be6a-b4e9faa5de0b