Store content hash to avoid embedding duplicate data #217
Since it's only 16 bytes I'm going to store this for every embedding. For the migration... how do I backfill existing rows? Since I haven't shipped the feature yet there should be very few of them out there, so I'm going to use a random hash value, which will ensure they are re-embedded next time it runs (since I likely didn't store the content).
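The dedup check described above can be sketched as follows. This is a minimal illustration, not the actual llm implementation: the table and column names (`embeddings`, `content_hash`) match the issue, but the `needs_embedding()` helper and its parameters are hypothetical.

```python
import hashlib
import sqlite3


def content_hash(content: str) -> bytes:
    # An MD5 digest is exactly 16 bytes, cheap enough to store per row
    return hashlib.md5(content.encode("utf-8")).digest()


def needs_embedding(
    db: sqlite3.Connection, collection_id: int, item_id: str, content: str
) -> bool:
    # Skip re-embedding when this item already has a row whose stored
    # content hash matches the hash of the new content
    row = db.execute(
        "select 1 from embeddings"
        " where collection_id = ? and id = ? and content_hash = ?",
        (collection_id, item_id, content_hash(content)),
    ).fetchone()
    return row is None
```

Because the backfilled rows get a random hash, they can never equal a real content hash, so this check naturally schedules them for re-embedding.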
Uses new migrations feature from simonw/sqlite-migrate#9
(See llm/docs/embeddings/python-api.md, lines 165 to 174 at a5d6b58.)
Plus this index:

```sql
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);
```

I didn't make that index unique because the same piece of content might be stored in more than one collection, resulting in multiple rows in the table. A migration backfills this for all existing rows, setting it to a random MD5 hash for rows that did not store content.
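The migration could look roughly like this. A sketch only: the real migration is written with simonw/sqlite-migrate rather than raw `sqlite3`, and the function name `m012_add_content_hash` is invented for illustration. It adds the column, backfills existing rows with 16 random bytes (so they match no real content hash and get re-embedded), and creates the non-unique index.

```python
import os
import sqlite3


def m012_add_content_hash(db: sqlite3.Connection) -> None:
    # Hypothetical migration sketch, not the actual sqlite-migrate code
    db.execute("alter table embeddings add column content_hash blob")
    # Backfill each existing row with a random 16-byte value: it will
    # never collide with a real MD5 digest of stored content, so these
    # rows are guaranteed to be re-embedded on the next run
    for collection_id, item_id in db.execute(
        "select collection_id, id from embeddings where content_hash is null"
    ).fetchall():
        db.execute(
            "update embeddings set content_hash = ?"
            " where collection_id = ? and id = ?",
            (os.urandom(16), collection_id, item_id),
        )
    # Deliberately not unique: the same content can live in multiple collections
    db.execute(
        "create index idx_embeddings_content_hash on embeddings (content_hash)"
    )
```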
Next I need to upgrade the various …

I'll implement the next step as part of …

I was going to build that here but I guess I'll mostly retire that tool instead: …
Originally posted by @simonw in #215 (comment)