
improve generate unit tests command output quality #907

Merged: 23 commits into main from bee/unit-tests on Sep 13, 2023

Conversation


@abeatrix abeatrix commented Sep 3, 2023

Close #619

This PR contains changes that improve the current generate unit tests command:

  • update the current unit test prompt with detailed directions on how to consistently produce higher-quality unit tests (ref: Anthropic docs)
  • update the context-fetching process to filter out other kinds of test files, such as e2e and integration tests, when generating unit tests
  • include import statements from the current file as context, to avoid cases where Cody fails to include import statements in tests (see the sketch after this list)
  • update the language used in the prompt to let Cody know all files come from the same codebase. I noticed Cody currently does not know the shared context comes from files, and certainly does not know that the file path we share with each context belongs to a file in the same codebase. It just doesn't make that connection until we spell it out.
  • did not make any changes that would change the behavior of the current test command
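To illustrate the import-statement bullet above, here is a minimal sketch of the idea, assuming a simple line scan is good enough; the function name and heuristic are illustrative, not this PR's actual implementation:

```ts
// Hypothetical helper: collect the import lines of the current file so they can be
// attached to the prompt as extra context. Not the actual code in this PR.
export function getImportStatements(fileContent: string): string {
    const imports: string[] = []
    for (const line of fileContent.split('\n')) {
        const trimmed = line.trim()
        // Top-level ES imports or CommonJS requires; other languages would need their own prefixes.
        if (trimmed.startsWith('import ') || (trimmed.startsWith('const ') && trimmed.includes('require('))) {
            imports.push(line)
        }
    }
    return imports.join('\n')
}
```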

Before and After

The tests will not always be perfect, but with the changes mentioned above, the improved tests should do the following consistently when possible:

  • use the correct testing frameworks
  • include imports (a common issue with the current version, where imports are incorrectly added or often missing)
  • include more relevant context

before

In the video below, you can see the test imported the "chai" testing library instead of "vitest", the testing framework used by other test files in the project:

Test.-.before.mp4

You can also see from the context list that it includes files from e2e and integration tests that are not relevant to unit tests:

image

after

In the improved version, you can see it imports the correct testing framework, "vitest", and uses more relevant test files as context:

image
Test.-.after.mp4

More examples

before - go

uses 'NewMockCache', which does not exist

image

after - go

image

Follow-up

  • a way to add generated tests easily to a test file
  • consolidate vscode-context and move it out of shared lib

Test plan

  • Start Cody from this branch
  • Select a function in a file and run the Generate Unit Tests command
  • Compare the test output with the output from the current Generate Unit Tests command

Try running other commands. The changes in this PR should not change the behavior of other commands.

Demo_.Improved.Unit.Tests.mp4
image

It doesn't always produce perfect tests, but it consistently generates better, higher-quality tests compared to the older version of the command. It works consistently with other languages as well. More demo videos:
typescript: https://www.loom.com/share/3489f9ae811b41d08ca0329489580543?sid=379b3122-3ea7-4b1e-bed2-12721e1f09b6
go: https://www.loom.com/share/d1514db896c840989e05f7c74fd30b81?sid=60f8d4d6-2b6c-42d3-b016-b347666893e9
go 2: https://www.loom.com/share/952caef2067a4d748bcd10eef8a60800?sid=04ca3f58-e2c8-474f-86ee-08af03f00689
scala/java: https://www.loom.com/share/d28b1089ac8a4d9ea95816c3e476ec35?sid=adabd913-3811-4009-9391-ccd1406fe91d
python: https://www.loom.com/share/7a788b389ce84a4b85fbe2d31f4a0ab4?sid=11d9314d-31c3-4530-be6a-b4e9faa5de0b

@abeatrix abeatrix requested a review from a team September 3, 2023 02:16
@@ -206,6 +206,7 @@ export abstract class MessageProvider extends MessageHandler implements vscode.D
// TODO(keegancsmith) guardrails may be slow, we need to make this async update the interaction.
displayText = await this.guardrailsAnnotateAttributions(displayText)
this.transcript.addAssistantResponse(text || '', displayText)
void this.editor.controllers.command?.onCompletion(text)
Contributor Author

this sends the response back to the command controller to create the test files when needed
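For illustration only, here is a sketch of what a completion hook like this might do with the response; the regex and the use of an untitled VS Code document are assumptions, not the actual command controller code:

```ts
import * as vscode from 'vscode'

// Hypothetical handler: pull the fenced code block out of the assistant's final
// response and open it in a temporary (untitled) document for the user to review.
async function onCompletion(text: string): Promise<void> {
    const match = text.match(/`{3}[\w-]*\n([\s\S]*?)`{3}/)
    if (!match) {
        return
    }
    const doc = await vscode.workspace.openTextDocument({ content: match[1] })
    await vscode.window.showTextDocument(doc)
}
```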

- Your answers and suggestions should based on the shared context only.
- Do not suggest anything that would break the working code.
- Do not make assumptions or fabricating additional details.
- Do not make any assumptions about the code and file names or any misleading information.
Contributor Author

this helps prevent Cody from "hallucinating"

Contributor

Since this is applied everywhere it feels like a roll of the dice. We need to record a battery of inputs and outputs to monitor for regressions. But since we don't have that yet, I guess YOLO?

Curious how you arrived at "any misleading information" being useful.

Contributor Author

I think you are thinking about the one we use in premade? That one is applied everywhere but this one is only applied to prompts for custom commands, on top of the one in premade.

@abeatrix abeatrix Sep 9, 2023

Curious how you arrived at "any misleading information" being useful.

lol, a teammate reported an issue on Slack about Cody hallucinating directory and file names, which I was able to reproduce. I saw that the rules we set for Cody already ask it not to make up file names or file paths, and to only show file paths it is certain exist, etc., which we can see are not doing much.
I noticed Cody doesn't know the connections between the files, file names, file paths, etc. that we share with it. Long story short, referring to anything I share with Cody as context/information has been working well for me, so I tried the "any misleading information" approach (since users can also use non-file context, I thought why not) by adding it to my custom command starter, asked Cody the same question, and have found it helpful so far.

image

^ You can see it in the output channel, where I added "Do not make any assumptions about the code and file names or any misleading information." in front of the same question where the teammate reported the hallucination issue

Just tried asking Cody the same question on web and got the same hallucination problem (the codeownership it mentioned doesn't exist):
image

If I prepend the statement to the question, I get the "honest" response. It just refers to the actual shared context:
image

@abeatrix abeatrix changed the title feat: improved generate unit tests command change: improved generate unit tests command Sep 3, 2023
@abeatrix

abeatrix commented Sep 3, 2023

cc @kalanchan regarding the next release

@toolmantim
Contributor

@abeatrix I’ll be writing this one!

Do you have a description of how the generate test output is improved by these changes? (E.g. Before and after examples? Cases that didn’t work well before, that do now?)

@abeatrix

abeatrix commented Sep 3, 2023

@toolmantim Just added more demo videos at the end of the PR description and added a Before and After section with example test output. let me know if you need more info :)

Contributor
@toolmantim toolmantim left a comment

Super cool. I gave it a go w/ some ts and go files in various projects! The results I got were a little mixed… even though some output is really promising.

I spotted in some cases the chat output contained this for existing tests:

I am writing these unit tests in the existing test file search_test.go which already contains tests for the search package.

Which makes it sound like Cody updated the existing file, which it doesn't (it opens the temporary file)

Also this happened when there was no existing file (one was created):

I am not importing any new packages since this test file already exists and follows the patterns from the existing tests.

Screenshot 2023-09-02 at 10 46 49 pm

I feel like we might need some wider testing this week, rather than rushing it into the release. And give @beyang & @csells a chance to see how it's going too.

vscode/src/custom-prompts/ToolsProvider.ts (outdated, resolved)
@toolmantim
Contributor

Just added more demo videos at the end of the PR description and added a Before and After section with example test output. let me know if you need more info :)

Just saw this after the review sorry! 😅

@abeatrix

abeatrix commented Sep 3, 2023

Just added more demo videos at the end of the PR description and added a Before and After section with example test output. let me know if you need more info :)

Just saw this after the review sorry! 😅

Yea. Also included a note

It doesn't always produce the perfect tests, but it generates better and higher quality tests consistently compared to the older version of the command.

I don't think it's possible to get results that work perfectly for all cases. This is just an improved version of what we have now. I'd suggest comparing the results between the current version and the improved version to see the quality improvements.
Agree with giving more time for others to test and compare.

@abeatrix

abeatrix commented Sep 3, 2023

I guess calling this a PR for improving context for the generate unit test command is more accurate 😅

Contributor
@taras-yemets taras-yemets left a comment

Great work, @abeatrix!
Left a few non-blocking comments.

@abeatrix abeatrix changed the title change: improved generate unit tests command change: improved generate unit tests command quality Sep 6, 2023
@abeatrix

abeatrix commented Sep 6, 2023

Update

Updated the PR to focus on improving:

  • the quality of the context fetched for /test command
  • the quality of the tests generated by the /test command

Working with @toolmantim to explore options for inserting the generated tests into new/existing test files, which will be included in a new PR.

@abeatrix abeatrix requested review from a team and toolmantim September 6, 2023 19:32
Contributor
@dominiccooney dominiccooney left a comment

Some feedback inline.

I realize there's a lot of reviewers on this. Wanted to chime in to help but feel free to not block on me.

@@ -17,7 +17,7 @@
},
"test": {
"description": "Generate unit tests",
"prompt": "Write unit tests to validate the expected functionality for each function in the selected code. Ensure you cover any example test cases included in comments. If test files already exist, follow similar patterns for imports, setup, and assertions used in those tests. If no tests exist, feel free to import any common test libraries as needed. Focus on testing key behaviors and edge cases. Structure the tests so they are clear, maintainable, and can be run automatically. All generated code must be complete workable code, not fragments.",
"prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
Contributor

I'd love to see some samples of this on codebases which don't have package.json, and more languages.

Grammar nit: "only includes mocks" => "only include mocks"


lib/shared/src/chat/prompts/templates.ts (resolved)
@@ -76,6 +80,21 @@ export function isOnlySelectionRequired(contextConfig: CodyPromptContext): boole
return !contextConfig.none && ((contextConfig.selection && contextConfigLength === 1) || !contextConfigLength)
}

// Extract the word "unit" / "e2e" / "integration" from a test prompt text
Contributor

There are a lot more relevant words here, what about: acceptance, functional, smoke, sanity (ableist but I guess we meet the user where they are), stress, performance, regression, load, user acceptance, UAT, automated, security, accessibility, a11y, fuzz, ...

But since you only use this to look for "unit", why not simplify the code to just look for "unit"? You should also look for "unittest"... probably about 1/6th of "unit test" per Google Books ngrams but growing.
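Purely as an illustration of that simplification (not the code in this PR), the check could be as small as a single regex covering both spellings:

```ts
// Matches "unit test", "unit tests", "unittest", "unit-test", and "unit_test" in a prompt.
function isUnitTestRequest(promptText: string): boolean {
    return /\bunit[\s_-]?tests?\b/i.test(promptText)
}
```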

Contributor Author

My thinking was that not everyone puts unit tests in a directory called unit, but they most likely do not exist in directories called e2e 🤣 (or if they do, they hopefully aren't the only unit tests in the project for us to find). The main goal is to find a couple of unit tests as examples for the LLM to see how unit tests are set up in the current project.
We try to find test files in the current directory first; if we find one, we stop there.
If we can't find any, we do a project-level search (a rough sketch of that order is below).
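A rough sketch of that search order, with the filtering described earlier in the PR (directory-local first, then project-wide, skipping e2e/integration files). The helper name, patterns, and limit are assumptions for illustration, not the exact context-fetching code:

```ts
import * as path from 'path'

// Illustrative patterns: "foo.test.ts", "foo_test.go", "test_foo.py", "FooTest.scala", ...
const TEST_FILE_PATTERN = /(\.|_)test\.|^test_|tests?\.(java|scala)$/i
// Skip directories that usually hold non-unit tests.
const NON_UNIT_DIRS = /(^|\/)(e2e|integration)(\/|$)/i

export function pickUnitTestContext(allFiles: string[], currentFile: string, limit = 2): string[] {
    const isUnitTest = (file: string): boolean =>
        TEST_FILE_PATTERN.test(path.basename(file)) && !NON_UNIT_DIRS.test(file)

    // 1. Prefer test files that sit next to the selected code...
    const currentDir = path.dirname(currentFile)
    const local = allFiles.filter(f => path.dirname(f) === currentDir && isUnitTest(f))
    if (local.length > 0) {
        return local.slice(0, limit)
    }
    // 2. ...and only fall back to a project-wide search if none were found.
    return allFiles.filter(isUnitTest).slice(0, limit)
}
```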

lib/shared/src/chat/recipes/custom-prompt.ts (resolved)
lib/shared/src/chat/recipes/custom-prompt.ts (outdated, resolved)
lib/shared/src/chat/prompts/utils.ts (outdated, resolved)
export function getClaudeHumanText(commandInstructions: string, currentFileName?: string): string {
const promptText = prompts.instruction.replace('{humanInput}', commandInstructions)
if (!currentFileName) {
return promptText
Contributor

Should we fail if we're returning prompts containing uninterpolated strings like {languageName}? That would be a programming error and it is good to detect the problem close to the source.

Contributor Author

I find Claude would just fill in the language name "in its head" if one is not provided, so I think we are safe?
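For reference, a minimal sketch of the check being suggested here (failing fast on leftover placeholders); this is not in the PR, just an illustration of the idea:

```ts
// Throw if a prompt still contains an uninterpolated placeholder such as {languageName},
// so the programming error is detected close to its source instead of being sent to the LLM.
function assertFullyInterpolated(prompt: string): string {
    const leftover = prompt.match(/\{[a-zA-Z]+\}/g)
    if (leftover) {
        throw new Error(`Prompt contains uninterpolated placeholders: ${leftover.join(', ')}`)
    }
    return prompt
}
```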

lib/shared/src/chat/recipes/custom-prompt.ts (outdated, resolved)
lib/shared/src/chat/recipes/custom-prompt.ts (outdated, resolved)
Contributor
@dominiccooney dominiccooney left a comment

Looks good. Some minor feedback about the prompt inline.

@@ -17,7 +17,7 @@
},
"test": {
"description": "Generate unit tests",
"prompt": "Write a suite of unit tests only for the functions inside the <selected> tags of the shared context. First, review the shared context to see if there is an existing test file that matches the file path contains the selected code. If it exists and contains tests for the selected functions, generate new unit tests for uncovered edge cases I can add to the existing suite. If none, detect what testing framework/library for writing unit test is used from the shared context. When detected, use the detected framework and libraries and follow their patterns for adding packages, imports, dependencies, setup, and assertions used in those code snippets to generate the unit tests. If no tests exist, check shared context to see if any testing libraries for {languageName} are already configured. Import those libraries as needed. If no libraries or package.json are set up, import common test libraries for {languageName}. Pay attention to file paths to see if there is an existing test file for the selected code. If yes, use the same pattern to create new unit tests for edge cases. Focus on testing key behaviors for each function. Only includes mocks if you detected one. Use descriptive variable names that explain the purpose and meaning, not just generic names like 'foo'/'bar'/'name' etc. In your answer, start by telling me which libraries you are importing and why. For example, \"No new imports because I am writing unit tests for an existing test suite\" or \"I am importing unittest since that was used in the shared Python code,\" or \"Importing Jest because it is the configured test framework for Typescript in package.json.\" Then, show the complete test code you wrote, including 1) All necessary dependencies including packages and libraries are imported 2) Setting up the test environment 3) Writing passing tests. At the end, tell me the file path where these tests should be added. The tests should validate expected functionality and cover any example test cases from comments, as well as edge cases and bugs. Code for all generated tests must be complete and workable, no fragments or comments, enclosed in a single code block.",
"prompt": "Review the shared code context and configurations to identify the test framework and libraries in use, then generate a suite of unit tests for the functions in <selected> using the detected test framework and libraries. Ensure to follow the same patterns as the shared unit tests (if any), and add packages, imports, dependencies, and assertions only if used in the shared code. Pay attention to the file path of each shared context to identify if a test file for the selected code already exists. If it exists, focuse on generate new unit tests for uncovered cases. If none detected, import common unit test libraries for {languageName}. Focus on validating the key functionality with simple and clear assertions. Only includes mocks if one is detected in shared code. Before writing the tests, determine which test libraries or frameworks to import, e.g. 'No new imports needed - using existing libs'/'Importing test framework that matches shared context usage'/'Importing the defined framework' etc. Then, tell me the file path where these tests should be added and summarize test coverage. At the end, enclose only fully completed code for the new unit tests in single markdown codeblock, no fragments. New tests should validate expected functionality and cover edge cases for <selected> with all required methods imported. Do not repeat existing tests.",
Contributor

Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"?

Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.

These "think through" prompts for no new imports, importing test framework, etc. are super interesting too.

Spelling/grammar nits:

  • focuse on generate → focus on generating (or maybe just generate?)
  • Only includes mocks → Only include mocks

Contributor Author

Where you have "if it exists", do you find "it" makes the LLM squirrelly versus reiterating "if a test file exists"? Why did we add flourishes like "shared code context and configurations"? I'm interested in how you arrived at these, determined the effect of them, because I am curious if we can scale it up.

Good question! I noticed "if a test file exists" doesn't work in some languages; even if test files were shared, it would still refer to them as "shared Python code":

image

So I tried to minimize the use of the word "file" and replace it with either "shared context" or "shared code".

As for configurations, we are sharing package.json for ts/js tests when no test context is found, but for non-ts/js tests, we just share a list of files located in the project root dir to see if that gives Cody hints about whether a test framework is set up (see the sketch below). On top of that, I'm not familiar with how tests are set up in other languages, so instead of listing each config file for the model to pick up, the word "configurations" seems to have been working so far 🙏
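A sketch of that configuration-context choice, assuming a simple language switch; the file names and the shape of the function are illustrative, not the actual vscode-context code:

```ts
// Return a hint about test configuration: package.json contents for TS/JS,
// otherwise just the list of files at the project root (go.mod, pom.xml, pytest.ini, ...).
function getConfigContext(languageId: string, rootFileNames: string[], readFile: (name: string) => string): string {
    if (languageId === 'typescript' || languageId === 'javascript') {
        const pkg = rootFileNames.find(name => name === 'package.json')
        return pkg ? readFile(pkg) : ''
    }
    return rootFileNames.join('\n')
}
```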

I am not sure if I am on the right track, though 😅

And thanks for the grammar check and for taking the time to review 🙇‍♀️ Going to follow your suggestion and make sure the prompt still works before merging.

Contributor Author

@dominiccooney I tried making the following changes and noticed a huge difference in the results. Thought you'd be interested!

Looks like it won't generate new tests for code with an existing test file if we don't ask it to "focus"

focuse on generate → generate

image

image

focuse on generate → focus on generating

image

image

@abeatrix

abeatrix commented Sep 13, 2023

Here is a demo run from the latest commit build: https://www.loom.com/share/a17c8c63d1284ac889cff2d0bdc93a9c?sid=aee107b7-bf23-4344-9235-a290d897e0f9

Tested the command with Go, Python, Scala, and TypeScript files as shown in the Loom video above.

I have also worked on inserting the tests automatically into their test files (see video below) in #984, which was closed as we agreed that more design work is required, as well as code changes that would take a sprint or two to finish.

improved.unit.test.demo_.create.new.test.file.and.insert.new.tests.to.existing.test.file.mp4

As discussed with Tim (ref), I am going to merge what I have now for the quality improvement so I can focus on autocompletion work.
We will revisit this, including the work listed under the Follow-up section, after the October release.

@abeatrix abeatrix changed the title change: improved generate unit tests command quality improve generate unit tests command output quality Sep 13, 2023
@abeatrix abeatrix merged commit 1d700d7 into main Sep 13, 2023
10 checks passed
@abeatrix abeatrix deleted the bee/unit-tests branch September 13, 2023 00:30
@dominiccooney
Contributor

@abeatrix thanks for showing the output from those alternative prompts! It really illustrates how steep the drop-off in performance is, even when changing single words.

abeatrix added a commit that referenced this pull request Sep 20, 2023
Moving changelog item for #907
from v0.8.0 to v0.12.0. It was added to the wrong release when it was
merged.

## Test plan


changelog update
abeatrix added a commit that referenced this pull request Oct 23, 2023
Updating current preamble, rules, and prompt mixin with new findings
from #907.

Issue 1: Cody hallucinates about file paths and code

One of the main issues we saw in Chat is Cody hallucinating about files
and code that do not exist in the shared context.

I noticed this is mainly because of how we phrased our context, as Cody
does not understand the relationship between each code snippet we share
with it. For more information, see
#907 (comment)

One of the many prompts I've tried to prevent hallucination is to add
`Do not make any assumptions about the code and file names or any
misleading information` to the question, which has been working well for
me. This is also the prompt we added to all our commands. [A customer
has also
confirmed](https://sourcegraph.slack.com/archives/C04MSD3DP5L/p1697747191354829?thread_ts=1697626152.033629&cid=C04MSD3DP5L)
that adding this to their questions stopped Cody from hallucinating and
produced better responses from Chat. This is why I propose adding it to
`PromptMixin`, which is added to the start of every question
(instead of the Preamble, which is added to the start of the session and
will get left out as the conversation gets longer).

I don't think this will be 100% hallucination-proof, but it works a lot
better than what we currently have.
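A minimal sketch of that `PromptMixin` idea, assuming a class that simply prepends registered instructions to each question; the names below are illustrative, not the exact lib/shared API:

```ts
const HALLUCINATION_GUARD =
    'Do not make any assumptions about the code and file names or any misleading information.'

class PromptMixin {
    constructor(private readonly prompt: string) {}

    // Applied to every question, so the rule survives long conversations,
    // unlike the preamble, which is only sent at the start of the session.
    static mixInto(humanInput: string, mixins: PromptMixin[]): string {
        const prefix = mixins.map(m => m.prompt).join(' ')
        return prefix ? `${prefix}\n\n${humanInput}` : humanInput
    }
}

// Usage: register the guard once, then apply it to every outgoing question.
const mixins = [new PromptMixin(HALLUCINATION_GUARD)]
const question = PromptMixin.mixInto('Do we format the completion item before we suggest them?', mixins)
```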

#### Before

- lib/share/ui/src/components/CodeBlock is not the correct path

![Screenshot 2023-10-18 at 2 45 49 PM](https://github.com/sourcegraph/cody/assets/68532117/9e8b6438-b37c-4745-a5f5-c6abcc2ceec8)


#### After

- makes well-informed answers

![Screenshot 2023-10-18 at 2 47 07 PM](https://github.com/sourcegraph/cody/assets/68532117/87d8b9ce-b8bc-4298-a7be-9a6957a58c36)

Issue 2: The current preambles need to be updated, and are not working as
intended most of the time.

Addressed by this PR with the following changes:
- Simplified the preamble actions and rules text to be more concise.
- Removed redundant rules that will always be false, e.g. Cody never has
direct access to your files or repository even when we tell it that it does.
- Removed detailed instructions about code formatting and limitations,
leaving just the essential rules
- Shortened the preamble answer to focus on the core assistant persona
and capabilities

Issue 3: Cody will not answer in a language that is not the default
language of the editor

CLOSE https://github.com/sourcegraph/cody/discussions/1011 &&
#988

Many users have complained that they cannot get Cody to answer their
questions because Cody refuses to answer questions that are not in the
editor's default language.

Addressed by this PR with the following changes: Remove
`languagePromptMixin` from the `activate` event for editors, which
allows Cody to answer in the same language as the question

#### Before

<img width="609" alt="Screenshot 2023-10-19 at 5 08 17 PM"
src="https://github.com/sourcegraph/cody/assets/68532117/6265a46e-d5e8-489b-a92d-824b56a391dc">

#### After

<img width="624" alt="Screenshot 2023-10-19 at 5 07 57 PM"
src="https://github.com/sourcegraph/cody/assets/68532117/b1478a0f-8bfa-468b-abe2-69588f1deef6">

## Test plan


1. Ask Cody a question in a language that is not your editor's default
language. Cody should still be able to answer your question in that
language.
2. Ask Cody a question in the editor, and then ask the same question on
the Web to compare the quality. Hallucinations that Cody produces on the
Web should not happen in the editor when using the updated prompts.

Example question to ask in the cody repo: `Do we format the completion
item before we suggest them?`

<img width="883" alt="image"
src="https://github.com/sourcegraph/cody/assets/68532117/86b1c8c5-4e2c-4674-9ad6-18382dbb2813">