
Support for plugins that implement vector indexes #216

Open · Tracked by #325
simonw opened this issue Sep 3, 2023 · 7 comments

Comments

@simonw
Owner

simonw commented Sep 3, 2023

The llm similar and collection.similar() methods currently implement the slowest possible brute-force approach.

I want to support faster approaches for this - sqlite-vss, FAISS, Pinecone and the like - but I'd like to do so through plugins.

Many vector indexes need to be rebuilt periodically, so I need an abstraction that supports that.

I added a modified column to the embeddings table with the aim of supporting this feature. I want indexes to be able to scan that table to see which items have been added or modified since the index was last built, then re-index just those records.
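
A plugin could track when it last built and fetch just the changed rows. A minimal sketch of that scan, assuming the plugin keeps its own last_indexed timestamp (that bookkeeping is hypothetical, not part of llm) and that modified holds a comparable timestamp:

import sqlite3

def rows_to_reindex(db_path, collection_id, last_indexed):
    # Fetch only the entries added or changed since this index last ran
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "select id, embedding from embeddings"
        " where collection_id = ? and modified > ?",
        (collection_id, last_indexed),
    ).fetchall()
    conn.close()
    return rows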

@simonw simonw added this to the 0.9 - embeddings milestone Sep 3, 2023
@simonw
Owner Author

simonw commented Sep 3, 2023

I think the user gets to create indexes, where an index can be assigned to one or more collections (provided those collections use the same embedding model).

Allowing a collection to have multiple indexes will support trying out different indexes and comparing their performance.

Allowing an index to cover multiple collections matters for things like running a similarity search that mixes TILs, blog posts and tweets, even though they are held in different embedding collections.

So I think there are CLI and Python methods for:

  • Defining a new index, which is associated with one or more collections
  • Causing that index to build, or rebuild
  • Running similar searches against a specified index

Do the existing llm similar and collection.similar() methods gain these abilities automatically?

Maybe yes if a collection has a "default index" defined on it; otherwise no.
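
One hypothetical shape for that plugin abstraction - purely a sketch, none of these class or method names exist in llm today:

from abc import ABC, abstractmethod
from typing import List, Tuple

class VectorIndex(ABC):
    """Hypothetical base class a vector index plugin might implement."""

    @abstractmethod
    def index(self, ids_and_vectors: List[Tuple[str, List[float]]]) -> None:
        "Add or update entries - called with rows modified since the last build."

    @abstractmethod
    def search(self, vector: List[float], number: int) -> List[Tuple[str, float]]:
        "Return (id, score) pairs for the closest matches."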

@simonw
Owner Author

simonw commented Sep 5, 2023

Also worth considering whether the trick I used here might fit into LLM somehow.

The idea there is that sometimes you want to combine a similarity search with other filters - e.g. run a SQL query to filter for just posts in a specific category, then find the most similar matches to a vector within that subset. It can actually be faster to run the filter first and then build a scratch index against just the ~1,000 rows that match it: building a FAISS index across a few thousand items is fast enough that it beats querying the whole similarity index first.

Though it's worth testing that against brute-force similarity matching too - it might turn out that below ~1,000 items you should just run the comparisons brute-force and not bother with an index at all.

In any case, will the LLM indexing mechanism need to solve for this kind of filtering as well? It might well be out of scope.
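
A sketch of that scratch-index trick with FAISS, assuming the filtered ids and their decoded embeddings are already in memory (cosine similarity via normalized inner product; names here are invented):

import faiss
import numpy as np

def scratch_search(ids, vectors, query, k=10):
    # Build a throwaway exact index over just the filtered rows
    matrix = np.array(vectors, dtype="float32")
    faiss.normalize_L2(matrix)  # normalize rows so inner product = cosine
    index = faiss.IndexFlatIP(matrix.shape[1])
    index.add(matrix)
    q = np.array([query], dtype="float32")
    faiss.normalize_L2(q)
    scores, positions = index.search(q, k)
    return [(ids[p], float(s)) for p, s in zip(positions[0], scores[0])]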

@dave1010

dave1010 commented Sep 5, 2023

If this gets added then Chroma DB may be worth trying here. It's lightweight and incredibly easy to get running compared to a few others I've tried. https://docs.trychroma.com/

@simonw
Owner Author

simonw commented Sep 8, 2023

Ooh Chroma does look good - looks very easy to get it to store an index on disk: https://docs.trychroma.com/usage-guide

Looks like the actual indexing is handled by hnswlib - actually by their fork of it: https://github.com/chroma-core/hnswlib
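
For reference, persisting and querying an on-disk collection looks roughly like this (path, collection name and vectors invented):

import chromadb

client = chromadb.PersistentClient(path="./chroma-data")  # stores the index on disk
collection = client.get_or_create_collection("demo")
collection.add(
    ids=["1", "2"],
    embeddings=[[0.1, 0.2, 0.3], [0.2, 0.1, 0.0]],
)
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
print(results["ids"], results["distances"])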

@simonw
Owner Author

simonw commented Sep 9, 2023

Another option for an index could just be PyTorch with an in-memory collection of tensors.

I ran some benchmarks that looked good:

[benchmark results screenshot]

But when I tried a rough implementation like this, it ended up slower than native Python (according to a time llm similar ... benchmark):

diff --git a/llm/embeddings.py b/llm/embeddings.py
index 32d7dc8..54741ee 100644
--- a/llm/embeddings.py
+++ b/llm/embeddings.py
@@ -9,6 +9,12 @@ from sqlite_utils.db import Table
 import time
 from typing import cast, Any, Dict, Iterable, List, Optional, Tuple, Union
 
+try:
+    import torch
+    import torch.nn.functional as F
+except ImportError:
+    torch = None
+
 
 @dataclass
 class Entry:
@@ -242,6 +248,30 @@ class Collection:
         """
         import llm
 
+        if torch is not None:
+            ids_and_embeddings = [
+                (row["id"], torch.tensor(llm.decode(row["embedding"])))
+                for row in self.db.query(
+                    "select id, embedding from embeddings where collection_id = ?",
+                    [self.id],
+                )
+            ]
+            input_vector = torch.tensor(vector)
+            scores = [
+                (id, F.cosine_similarity(input_vector.unsqueeze(0), embedding.unsqueeze(0)))
+                for id, embedding in ids_and_embeddings
+            ]
+            scores.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
+            return [
+                Entry(
+                    id=id,
+                    score=score.item(),
+                    content=None,
+                    metadata=None,
+                )
+                for id, score in scores[:number]
+            ]
+
         def distance_score(other_encoded):
             other_vector = llm.decode(other_encoded)
             return llm.cosine_similarity(other_vector, vector)

Maybe I did something wrong here though. It would be worth spending more time seeing if PyTorch against an in-memory array can speed things up.
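
One likely culprit: the diff above calls F.cosine_similarity once per row from a Python loop, which throws away most of the benefit. A batched sketch that stacks everything into a single tensor and does one matrix multiply (untested against the same benchmark):

import torch
import torch.nn.functional as F

def top_matches(embeddings, vector, number):
    # embeddings: (n, d) tensor of stored vectors; vector: (d,) query tensor
    matrix = F.normalize(embeddings, dim=1)
    query = F.normalize(vector.unsqueeze(0), dim=1)
    scores = (matrix @ query.T).squeeze(1)  # one matmul = all cosine similarities
    values, indices = torch.topk(scores, min(number, scores.shape[0]))
    return list(zip(indices.tolist(), values.tolist()))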

@jeffchuber

@simonw happy to help here

Here is how Chroma's roadmap aligns with your goals.

  • The internals of Chroma are set up to have pluggable indexes on top of collections - we haven't yet exposed this to end users, but we will fairly soon. We also plan to have a "smart index" that does brute-force KNN and then cuts over to ANN.
  • Indexes over multiple collections - while I do understand the use case, we've chosen not to prioritize this as it adds a lot of DX complexity. We instead encourage users to eat the "read amplification": query multiple collections/indexes, then cull/rerank client-side.

You may also enjoy reading this Chroma proposal, where we have put a lot of thought into the pipelines to support index/collection creation and access - chroma-core/chroma#1110
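
Querying several collections and reranking client-side, as described above, might look like this (a sketch against the chromadb API; collection names invented):

import chromadb

client = chromadb.PersistentClient(path="./chroma-data")

def search_all(names, query_embedding, n_results=10):
    # Query each collection separately, then merge by distance client-side
    merged = []
    for name in names:
        res = client.get_or_create_collection(name).query(
            query_embeddings=[query_embedding], n_results=n_results
        )
        merged.extend(zip(res["ids"][0], res["distances"][0]))
    merged.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return merged[:n_results]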

@IvanVas

IvanVas commented Nov 16, 2023

If someone ever needs to move data from llm to Chroma, below is a simple script to do so. It needs a little more work to productise it, though.

@simonw, hope it helps if you ever need to create something like llm embed-multi migrate --chroma

import sqlite3
import struct
import chromadb


def decode(binary):
    if not isinstance(binary, bytes):
        raise ValueError("Binary data must be of bytes type")
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)


client = chromadb.PersistentClient(path="chroma.db") # FIXME

collection_name = "collection"  # FIXME
collection = client.get_or_create_collection(collection_name)

# Path to your SQLite database
db_path = "/Users/username/Library/Application Support/io.datasette.llm/embeddings.db" # FIXME

# Query to retrieve embeddings
llm_collection = "fixme"  # FIXME
query = """
SELECT id, embedding, content FROM embeddings WHERE collection_id = (
    SELECT id FROM collections WHERE name = ?
)
"""


def parse_id(id_str):
    # FIXME parse metadata

    return {
        "meta1": "meta1",
    }


def main(db_path, batch_size=1000):
    # Connect to the SQLite database
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Query the embeddings table
    cursor.execute(query, [llm_collection])
    rows = cursor.fetchall()  # FIXME: assumes there is not much data; works fine for ~100k records

    # Initialize batch variables
    batch_embeddings = []
    batch_documents = []  # You'll need to adjust how you handle documents
    batch_metadatas = []  # You'll need to adjust how you handle metadatas
    batch_ids = []

    for i, (id, embedding, content) in enumerate(rows):
        parsed_id = parse_id(id)

        # Decode the binary data
        decoded_embedding = decode(embedding)

        # Append to the batch
        batch_embeddings.append(list(decoded_embedding))
        batch_documents.append(content)
        batch_metadatas.append({"meta1": parsed_id['meta1']})  # FIXME
        batch_ids.append(id)

        # When batch size is reached or end of rows
        if len(batch_embeddings) == batch_size or i == len(rows) - 1:
            collection.add(
                embeddings=batch_embeddings,
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids
            )

            print(f"Added {len(batch_embeddings)} rows to chromaDb (Total: {i + 1})")

            # Reset batch variables
            batch_embeddings = []
            batch_documents = []
            batch_metadatas = []
            batch_ids = []

    # Close the database connection
    conn.close()


if __name__ == "__main__":
    main(db_path)
