assistants API file upload #2

Merged
merged 3 commits into from
Mar 19, 2024

phact
Contributor

@phact phact commented Mar 19, 2024

Added a new CLI command, process-repo-files, which clones the repo, checks out the right commit hash, and iterates over all the files (sequentially for now), uploading them to assistants-api. This gets us embeddings and RAG for free:

python -m swe_bench_util process-repo-files

I clone to /tmp for now; not sure if that's what we want.

In the end it dumps a list of file_ids that you can give your assistant for searching.
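As a rough illustration of that flow (not the code in this PR), the upload loop might look something like the sketch below. The function name `collect_file_ids` and the `upload` callable are hypothetical; in practice `upload` would presumably wrap whatever client call sends a file to the assistants endpoint and returns its id.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def collect_file_ids(
    repo_dir: Path, upload: Callable[[Path], str]
) -> Tuple[List[str], Dict[str, str]]:
    """Upload every regular file under repo_dir, one at a time.

    `upload` is a hypothetical callable that sends one file and returns
    its file id. Returns (file_ids, errors), where errors maps each
    failed relative path to the error message.
    """
    file_ids: List[str] = []
    errors: Dict[str, str] = {}
    for path in sorted(repo_dir.rglob("*")):
        relative = path.relative_to(repo_dir)
        # Skip directories and git's own metadata.
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            file_ids.append(upload(path))
        except Exception as exc:
            # One rejected file should not abort the whole run.
            errors[str(relative)] = str(exc)
    return file_ids, errors
```

Passing the uploader in as a parameter keeps the loop testable without network access, and would make it easier to swap in a concurrent version later.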

If this is interesting I'm happy to help optimize how assistants chunks and embeds for our purposes.

@raymyers
Owner

Thanks @phact! Looks like a minor rebase is needed. At first glance I think this is a pretty good start, with just a couple of notes:

  • It looks like this would clone a repo once per SHA. Since we're going to loop through them, that will take time and space. We only need one clone per repo, and we can check out the different SHAs from that clone (or use worktrees if we want to get fancy later).
  • I would prefer to check out into a gitignored subdir of the repo, like "checkouts", instead of /tmp.
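
A minimal sketch of the two notes above, assuming repos are identified as "owner/name" and cloned from github.com. The helper names (`checkout_dir`, `git_commands`, `ensure_checkout`) and the `owner__name` directory naming are hypothetical, not part of this project.

```python
import subprocess
from pathlib import Path
from typing import List


def checkout_dir(repo: str, base: Path = Path("checkouts")) -> Path:
    """Map 'owner/name' to one local clone directory under `base`."""
    return base / repo.replace("/", "__")


def git_commands(
    repo: str, sha: str, base: Path = Path("checkouts")
) -> List[List[str]]:
    """Return the git commands needed to have `repo` checked out at `sha`.

    The clone is only included when the directory does not exist yet,
    so looping over many SHAs of one repo clones it a single time.
    """
    dest = checkout_dir(repo, base)
    commands: List[List[str]] = []
    if not dest.exists():
        # Assumption: the repo lives on github.com under this name.
        url = f"https://github.com/{repo}.git"
        commands.append(["git", "clone", url, str(dest)])
    commands.append(["git", "-C", str(dest), "checkout", "--force", sha])
    return commands


def ensure_checkout(
    repo: str, sha: str, base: Path = Path("checkouts")
) -> Path:
    """Run the commands and return the working directory."""
    for cmd in git_commands(repo, sha, base):
        subprocess.run(cmd, check=True)
    return checkout_dir(repo, base)
```

Adding `checkouts/` to `.gitignore` would keep the clones out of version control. Worktrees could replace the forced checkout if several SHAs ever need to exist side by side.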

@raymyers
Owner

I'm getting this for most files; we may need to do something to make OpenAI understand these are text files:

Dockerfile: Error code: 501 - {'message': 'Unsupported file type'}

@phact
Contributor Author

phact commented Mar 19, 2024

I was thinking about what file types we should add. Maybe the play is to use a block list like sweep does:

https://github.com/sweepai/sweep/blob/bdcd1195bc8c2a90aa277a9169dcb13eed702868/docs/pages/blogs/generating-50k-embeddings-with-gte.mdx#L22-L26
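
For illustration, a block-list filter could be sketched as below. The extensions and directory names here are placeholders chosen for the example, not copied from sweep's list, and `is_blocked` / `filter_uploadable` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, Iterator

# Illustrative entries only; the real list would need to be agreed on.
BLOCKED_EXTENSIONS = {".png", ".jpg", ".gif", ".zip", ".pyc", ".lock"}
BLOCKED_DIRS = {".git", "node_modules", "__pycache__"}


def is_blocked(path: Path) -> bool:
    """True if the file extension or any path component is block-listed."""
    if path.suffix.lower() in BLOCKED_EXTENSIONS:
        return True
    return any(part in BLOCKED_DIRS for part in path.parts)


def filter_uploadable(paths: Iterable[Path]) -> Iterator[Path]:
    """Yield only the paths that are not block-listed."""
    return (p for p in paths if not is_blocked(p))
```

Note that an extensionless file such as Dockerfile passes this filter, so the "Unsupported file type" error above would likely still need separate handling.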

@raymyers
Owner

Merging. I'm going to structure it a bit and turn my notes into a new issue.

@raymyers raymyers merged commit ff492b1 into raymyers:main Mar 19, 2024

2 participants