Generate embeddings from SWE-bench repos #1
Comments
Cool. Thankfully we should only need to vector-embed one snapshot of each codebase: at the commit hash specified in the SWE-bench dataset. Embeddings can use standard Pinecone (or whatever) metadata/tags to record the associated commit hash in case we need that in the future.
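The record shape described above could be sketched as follows. Everything here is illustrative: the repo name, the `deadbeef` commit hash, and the `embed` stand-in are hypothetical, and the actual Pinecone upsert is shown only as a comment, not run.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (OpenAI, open-source encoder, etc.);
    # here we just derive a dummy 2-dim vector from the text.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def make_records(repo: str, commit: str, chunks: list[str]) -> list[dict]:
    """Build vector-store records, tagging each chunk with its repo and the
    SWE-bench commit hash so one snapshot per commit is all we need."""
    records = []
    for i, chunk in enumerate(chunks):
        # Deterministic id per (repo, commit, chunk position).
        chunk_id = hashlib.sha1(f"{repo}@{commit}#{i}".encode()).hexdigest()
        records.append({
            "id": chunk_id,
            "values": embed(chunk),
            "metadata": {"repo": repo, "commit": commit, "chunk_index": i},
        })
    return records

# Hypothetical example: one snapshot of a repo at its SWE-bench commit.
records = make_records("astropy/astropy", "deadbeef",
                       ["def f(): pass", "class A: ..."])
# A real pipeline would then do something like:
#   index.upsert(vectors=records)   # pinecone client
```

Since the commit hash lives in metadata, later retrieval can filter on it if we ever need per-snapshot queries.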
Took a stab at this in #2, let me know your thoughts.
Added get oracle command.
I saw in your other repo @AtlantisPleb that you're making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit. Thoughts? Do you have any observations yet on how recall performs based on the descriptions instead of the chunked code?
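One way to answer that question empirically is file-level recall@k against the files touched by the gold patch. A minimal sketch; the file names and the two hypothetical retriever rankings below are made up for illustration:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold (patched) files appearing in the top-k retrieved files."""
    return len(set(retrieved[:k]) & gold) / len(gold)

# Hypothetical rankings from two retrievers on one SWE-bench task:
gold = {"astropy/io/fits/core.py"}
by_chunks = ["astropy/io/fits/util.py", "astropy/io/fits/core.py", "setup.py"]
by_summaries = ["astropy/io/fits/core.py", "astropy/io/fits/util.py", "setup.py"]

print(recall_at_k(by_chunks, gold, 1))     # -> 0.0
print(recall_at_k(by_summaries, gold, 1))  # -> 1.0
```

Averaging this over all tasks would give a direct chunked-embeddings vs. summaries comparison.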
This seems like a promising direction. My experience was that summaries worked very well in experiments last year, when I was trying to get GPT-4 to identify the relevant function in DukeNukem3D code. What I was comparing at the time were open-source embeddings (I made a video about it; it was April 2023, so it might not be that useful anymore). Point being, summaries showed a lot of promise.
Background
SWE-bench has "assisted" and "unassisted" scores. Assisted means you are told which files to modify. Devin is presumed to have the SOTA score of 14% unassisted. The Claude 3 Opus model with no agent achieves a strong 11% assisted score, as reported by zraytam. This means that a standalone "Oracle-substitute" that only guessed the relevant files to modify could get us well on the way.
Direction
A solution proposed by @AtlantisPleb involves building the Oracle-substitute on a Retrieval-Augmented Generation (RAG) process.
This approach could match Devin's scores within the next 1-2 days by evolving the process with more sophisticated LLM prompts or smarter codebase traversal methods.
Considerations
Generating embeddings for full repositories is resource-intensive but can be optimized by doing it once per repo in SWE-bench and reusing the embeddings. We are considering hosting a shared copy for experimentation using a service such as Pinecone.
The repos are not all in one state: every exercise starts at a different point in time (git hash). Perhaps we can avoid redundant processing with tagging; this needs to be fleshed out.
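One way to flesh out the tagging idea: key each chunk by a content hash, embed only unseen chunks, and tag each with the commits it appears at, so overlapping snapshots share work. A minimal sketch with a hypothetical in-memory store and a dummy `embed`:

```python
import hashlib

# Hypothetical store: content-hash -> {"vector": ..., "commits": set of git hashes}
store: dict[str, dict] = {}

def embed(chunk: str) -> list[float]:
    return [float(len(chunk))]  # stand-in for a real embedding model

def index_snapshot(commit: str, chunks: list[str]) -> int:
    """Index one repo snapshot; returns how many chunks were newly embedded."""
    new = 0
    for chunk in chunks:
        h = hashlib.sha1(chunk.encode()).hexdigest()
        if h not in store:
            store[h] = {"vector": embed(chunk), "commits": set()}
            new += 1
        store[h]["commits"].add(commit)  # tag: this chunk exists at this commit
    return new

# Two snapshots sharing one unchanged chunk: only the changed chunk gets re-embedded.
n1 = index_snapshot("abc123", ["shared chunk", "old version"])
n2 = index_snapshot("def456", ["shared chunk", "new version"])
```

In a real vector store the commit tags would live in metadata, and a per-exercise query would filter to the chunks tagged with that exercise's git hash.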
Related Resources