
Generate embeddings from SWE-bench repos #1

Open
raymyers opened this issue Mar 18, 2024 · 5 comments
Comments

@raymyers
Owner

raymyers commented Mar 18, 2024

Background

SWE-bench has "assisted" and "unassisted" scores. Assisted means you are told which files to modify. Devin is presumed to hold the SOTA unassisted score of 14%. The Claude 3 Opus model with no agent achieves a strong 11% assisted score, as reported by zraytam. This means that a standalone "Oracle-substitute" that only guessed the relevant files to modify could get us well on the way.

Direction

A solution proposed by @AtlantisPleb involves building the Oracle-substitute on a Retrieval-Augmented Generation (RAG) process:

  • Cloning the relevant repository.
  • Vector embedding all files.
  • Generating one-line summaries for each file.
  • Performing cosine similarity search over those files using an embedding of the user input.
  • Identifying the top-k most likely files for modification.
  • Looping through each candidate file to determine if it should be patched.
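The retrieval core of the steps above (embed, cosine-rank, take top-k) can be sketched with toy vectors. This is a minimal illustration, not the actual pipeline; `top_k_files`, the file names, and the embeddings are all stand-ins:

```python
import numpy as np

def top_k_files(query_vec, file_vecs, file_names, k=3):
    """Rank files by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = file_vecs / np.linalg.norm(file_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per file
    order = np.argsort(-sims)[:k]     # indices of the k best matches
    return [(file_names[i], float(sims[i])) for i in order]

# Toy embeddings standing in for real model output.
names = ["auth.py", "db.py", "ui.py"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
print(top_k_files(query, vecs, names, k=2))
```

The candidate list this produces would then feed the per-file "should this be patched?" loop.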

This approach could potentially match Devin's scores within the next 1-2 days by evolving the process with more sophisticated LLM prompts or smarter codebase traversal methods.

Considerations

Generating embeddings for full repositories is resource-intensive, but the cost can be amortized by doing it once per repo in SWE-bench and reusing the embeddings. We are considering hosting a shared copy for experimentation using a service such as Pinecone.

The repos are not in one state: every exercise starts at a different point in time (git hash). Perhaps we can avoid redundant processing with tagging; this needs to be fleshed out.
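One way the tagging idea could be fleshed out: key the embedding cache by file-content hash rather than by commit, so files unchanged between two exercise commits are embedded only once. A minimal sketch (the `embed_fn` here is a hypothetical stand-in for a real embedding call):

```python
import hashlib

class EmbeddingCache:
    """Embed each distinct file content once, even across many commits."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # stand-in for a real embedding API
        self.by_content = {}       # sha256(content) -> embedding
        self.embed_calls = 0

    def get(self, content):
        key = hashlib.sha256(content.encode()).hexdigest()
        if key not in self.by_content:
            self.by_content[key] = self.embed_fn(content)
            self.embed_calls += 1
        return self.by_content[key]

# Two snapshots of the same repo at different commits; only main.py changed.
cache = EmbeddingCache(lambda text: [float(len(text))])
commit_a = {"setup.py": "print('a')", "main.py": "x = 1"}
commit_b = {"setup.py": "print('a')", "main.py": "x = 2"}
for snapshot in (commit_a, commit_b):
    for path, content in snapshot.items():
        cache.get(content)
print(cache.embed_calls)  # → 3, not 4: the unchanged setup.py was embedded once
```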

Related Resources

@raymyers raymyers self-assigned this Mar 18, 2024
@AtlantisPleb

Cool. Thankfully we should only need to vector-embed one snapshot of each codebase: at the commit hash specified in the SWE-bench dataset. Embeddings can use standard Pinecone (or whatever) metadata/tags to record the associated commit hash in case we need that in the future.
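A local sketch of that metadata idea, mimicking vector-DB-style metadata tagging and filtering in plain Python (this is not the Pinecone API, just an illustration of the record shape):

```python
class LocalVectorStore:
    """Toy stand-in for a hosted vector DB that tags records with metadata."""

    def __init__(self):
        self.records = []  # list of (id, vector, metadata) tuples

    def upsert(self, rec_id, vector, metadata):
        self.records.append((rec_id, vector, metadata))

    def filter_by_commit(self, commit_hash):
        """Return only records embedded at the given commit hash."""
        return [r for r in self.records if r[2].get("commit") == commit_hash]

store = LocalVectorStore()
store.upsert("repo/a.py@abc123", [0.1, 0.2], {"commit": "abc123", "path": "a.py"})
store.upsert("repo/a.py@def456", [0.1, 0.3], {"commit": "def456", "path": "a.py"})
print(len(store.filter_by_commit("abc123")))  # → 1
```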

@phact
Contributor

phact commented Mar 19, 2024

Took a stab at this in #2, lmk your thoughts.

@raymyers
Owner Author

Added get oracle command

@phact
Contributor

phact commented Mar 22, 2024

I saw in your other repo @AtlantisPleb that you're making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit. Thoughts? Do you have any observations yet on how recall performs with the descriptions instead of the chunked code?

@raymyers
Owner Author

@phact

making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit

This seems like a promising direction. My experience was that summaries worked very well in experiments last year, when I was trying to get GPT-4 to identify the relevant function in DukeNukem3D code.

What I was comparing at the time were open-source embeddings (krlvi/sentence-msmarco-bert-base-dot-v5-nlpl-code_search_net) versus one-line GPT-3.5-Turbo summaries per file chunk, feeding many summaries directly to the LLM to choose which area to "zoom in" on. So it was not an apples-to-apples comparison, and it didn't include the new OpenAI embeddings, which might perform better on code; not sure.
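Whichever representation wins, recall@k over the gold patched files is the natural way to score a comparison like this. A minimal sketch, with entirely hypothetical file names and rankings:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant files found in the top-k retrieved list."""
    hits = sum(1 for f in retrieved[:k] if f in relevant)
    return hits / len(relevant)

# Hypothetical rankings from two retrieval strategies on one task.
gold = {"models/query.py", "db/backend.py"}
by_embeddings = ["models/query.py", "utils/log.py", "db/backend.py"]
by_summaries = ["db/backend.py", "models/query.py", "readme.md"]
print(recall_at_k(by_embeddings, gold, 2))  # → 0.5
print(recall_at_k(by_summaries, gold, 2))   # → 1.0
```

Averaging this over all SWE-bench tasks would give a single number per strategy to compare.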

I made a video about it back in April 2023, so it might not be that useful anymore. Point being, summaries showed a lot of promise.
