
Generate embeddings from SWE-bench repos #1

Open
raymyers opened this issue Mar 18, 2024 · 5 comments
Comments

@raymyers
Owner

raymyers commented Mar 18, 2024

Background

SWE-bench has "assisted" and "unassisted" scores. Assisted means you are told which files to modify. Devin is presumed to hold the SOTA unassisted score of 14%. The Claude 3 Opus model with no agent achieves a strong 11% assisted score, as reported by zraytam. This means that a standalone "Oracle-substitute" that only guessed the relevant files to modify could get us well on the way.

Direction

A solution proposed by @AtlantisPleb involves building the Oracle-substitute on a Retrieval-Augmented Generation (RAG) process:

  • Cloning the relevant repository.
  • Vector embedding all files.
  • Generating one-line summaries for each file.
  • Performing cosine similarity search over those files using an embedding of the user input.
  • Identifying the top-k most likely files for modification.
  • Looping through each candidate file to determine if it should be patched.
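The retrieval core of the steps above (embed, cosine-rank, take top-k) can be sketched with toy vectors. This is a minimal illustration, not the actual pipeline; `top_k_files`, the file names, and the embeddings are all stand-ins:

```python
import numpy as np

def top_k_files(query_vec, file_vecs, file_names, k=3):
    """Rank files by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = file_vecs / np.linalg.norm(file_vecs, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity per file
    order = np.argsort(-sims)[:k]     # indices of the k best matches
    return [(file_names[i], float(sims[i])) for i in order]

# Toy embeddings standing in for real model output.
names = ["auth.py", "db.py", "ui.py"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
print(top_k_files(query, vecs, names, k=2))
```

The candidate list this produces would then feed the per-file "should this be patched?" loop.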

This approach could potentially match Devin's scores within the next 1-2 days by evolving the process with more sophisticated LLM prompts or smarter codebase traversal methods.

Considerations

Generating embeddings for full repositories is resource-intensive, but the cost can be amortized by doing it once per repo in SWE-bench and reusing the embeddings. We are considering hosting a shared copy for experimentation using a service such as Pinecone.

The repos are not in one state: every exercise starts at a different point in time (git hash). Perhaps we can avoid redundant processing with tagging; this needs to be fleshed out.
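One way the tagging idea could be fleshed out: key the embedding cache by file-content hash rather than by commit, so files unchanged between two exercise commits are embedded only once. A minimal sketch (the `embed_fn` here is a hypothetical stand-in for a real embedding call):

```python
import hashlib

class EmbeddingCache:
    """Embed each distinct file content once, even across many commits."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # stand-in for a real embedding API
        self.by_content = {}       # sha256(content) -> embedding
        self.embed_calls = 0

    def get(self, content):
        key = hashlib.sha256(content.encode()).hexdigest()
        if key not in self.by_content:
            self.by_content[key] = self.embed_fn(content)
            self.embed_calls += 1
        return self.by_content[key]

# Two snapshots of the same repo at different commits; only main.py changed.
cache = EmbeddingCache(lambda text: [float(len(text))])
commit_a = {"setup.py": "print('a')", "main.py": "x = 1"}
commit_b = {"setup.py": "print('a')", "main.py": "x = 2"}
for snapshot in (commit_a, commit_b):
    for path, content in snapshot.items():
        cache.get(content)
print(cache.embed_calls)  # → 3, not 4: the unchanged setup.py was embedded once
```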

Related Resources

@raymyers raymyers self-assigned this Mar 18, 2024
@AtlantisPleb

Cool. Thankfully we should only need to vector-embed one snapshot of each codebase: at the commit hash specified in the SWE-bench dataset. Embeddings can use standard Pinecone (or whatever) metadata/tags to record the associated commit hash in case we need that in the future.
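A local sketch of that metadata idea, mimicking vector-DB-style metadata tagging and filtering in plain Python (this is not the Pinecone API, just an illustration of the record shape):

```python
class LocalVectorStore:
    """Toy stand-in for a hosted vector DB that tags records with metadata."""

    def __init__(self):
        self.records = []  # list of (id, vector, metadata) tuples

    def upsert(self, rec_id, vector, metadata):
        self.records.append((rec_id, vector, metadata))

    def filter_by_commit(self, commit_hash):
        """Return only records embedded at the given commit hash."""
        return [r for r in self.records if r[2].get("commit") == commit_hash]

store = LocalVectorStore()
store.upsert("repo/a.py@abc123", [0.1, 0.2], {"commit": "abc123", "path": "a.py"})
store.upsert("repo/a.py@def456", [0.1, 0.3], {"commit": "def456", "path": "a.py"})
print(len(store.filter_by_commit("abc123")))  # → 1
```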

@phact
Contributor

phact commented Mar 19, 2024

Took a stab at this in #2, lmk your thoughts.

@raymyers
Owner Author

Added get oracle command

@phact
Contributor

phact commented Mar 22, 2024

I saw in your other repo @AtlantisPleb that you're making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit. Thoughts? Do you have any observations yet on how recall performs with the descriptions instead of the chunked code?

@raymyers
Owner Author

@phact

making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit

This seems like a promising direction. My experience was that summaries worked very well in experiments last year, when I was trying to get GPT-4 to identify the relevant function in DukeNukem3D code.

What I was comparing at the time were open-source embeddings (krlvi/sentence-msmarco-bert-base-dot-v5-nlpl-code_search_net) versus one-line GPT-3.5-Turbo summaries per file chunk, feeding many summaries directly to the LLM to choose which area to "zoom in" on. So it was not an apples-to-apples comparison, and it didn't include the new OpenAI embeddings, which might perform better on code; not sure.
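Whichever representation wins, recall@k over the gold patched files is the natural way to score a comparison like this. A minimal sketch, with entirely hypothetical file names and rankings:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant files found in the top-k retrieved list."""
    hits = sum(1 for f in retrieved[:k] if f in relevant)
    return hits / len(relevant)

# Hypothetical rankings from two retrieval strategies on one task.
gold = {"models/query.py", "db/backend.py"}
by_embeddings = ["models/query.py", "utils/log.py", "db/backend.py"]
by_summaries = ["db/backend.py", "models/query.py", "readme.md"]
print(recall_at_k(by_embeddings, gold, 2))  # → 0.5
print(recall_at_k(by_summaries, gold, 2))   # → 1.0
```

Averaging this over all SWE-bench tasks would give a single number per strategy to compare.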

I made a video about it back in April 2023, so it might not be that useful anymore. Point being, summaries showed a lot of promise.
