Binary embeddings #254

simonw · 2023-09-09T20:20:52Z

Refs:

Support for embedding binary files #253

Still needed:

- Decide what to do about --store - should it store the binary content in a new content_blob column? See these notes.
- The model.supports_binary boolean needs to be respected - embedding binary content against a text-only model should raise an error.
- Should there be a way to mark a model as ONLY accepting binary content? Probably yes - though the models I have played with so far, such as CLIP and ImageBind, are happy to accept either kind of content.
- ~~Lots of tests, using a mock embedding model that can handle binary content.~~
- ~~Build a plugin that uses this (I have a draft llm-clip one already).~~
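
A minimal sketch of the validation described in the list above, under stated assumptions: `FakeModel` and `validate_items` are hypothetical names for illustration, not part of the llm API. Only the `supports_binary` flag (and the `supports_text` flag named in a later commit on this PR) come from the thread itself.

```python
class FakeModel:
    """Hypothetical stand-in for an embedding model (not the llm API)."""

    def __init__(self, model_id, supports_text=True, supports_binary=False):
        self.model_id = model_id
        self.supports_text = supports_text
        self.supports_binary = supports_binary


def validate_items(model, items):
    """Raise if any item is of a kind the model does not declare support for."""
    for item in items:
        if isinstance(item, bytes):
            if not model.supports_binary:
                raise ValueError(
                    f"Model {model.model_id} does not support binary content"
                )
        elif isinstance(item, str):
            if not model.supports_text:
                raise ValueError(
                    f"Model {model.model_id} does not support text content"
                )
        else:
            raise TypeError(f"Cannot embed item of type {type(item).__name__}")


text_only = FakeModel("text-only")
both = FakeModel("clip-like", supports_text=True, supports_binary=True)

validate_items(both, ["a photo of a cat", b"\x89PNG"])  # passes silently
try:
    validate_items(text_only, [b"\x89PNG"])
except ValueError as ex:
    print(ex)
```

Checking every item before calling the model means a mixed batch fails up front rather than part-way through embedding.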

simonw · 2023-09-09T20:31:16Z

I can build plugins using:

https://www.sbert.net/examples/applications/image-search/README.html#usage
https://github.com/mlfoundations/open_clip
https://github.com/facebookresearch/ImageBind - looks like I'll need to vendor their code since it isn't on PyPI

simonw · 2023-09-09T20:53:17Z

Here's my rough CLIP plugin:

import io

import llm
from PIL import Image
from sentence_transformers import SentenceTransformer


@llm.hookimpl
def register_embedding_models(register):
    register(ClipEmbeddingModel())


class ClipEmbeddingModel(llm.EmbeddingModel):
    model_id = "clip"
    supports_binary = True

    def __init__(self):
        # The model is loaded lazily, on the first call to embed_batch()
        self._model = None

    def embed_batch(self, items):
        # Embeds a mix of text strings and binary images
        if self._model is None:
            self._model = SentenceTransformer("clip-ViT-B-32")

        to_embed = []

        for item in items:
            if isinstance(item, bytes):
                # Treat byte strings as image data and convert to an Image object
                to_embed.append(Image.open(io.BytesIO(item)))
            elif isinstance(item, str):
                to_embed.append(item)
            else:
                # Silently skipping an item would misalign results with inputs
                raise TypeError("Cannot embed {}".format(type(item).__name__))

        embeddings = self._model.encode(to_embed)
        # Convert numpy floats to plain Python floats
        return [[float(num) for num in embedding] for embedding in embeddings]

Though looking at this, it's really just a sentence transformer that can accept both binary and text. It could go in llm-sentence-transformers.

I could still have llm-clip be a plugin that just depends on this and then registers the right model.

simonw · 2023-09-09T22:15:52Z

I'm going to try the content_blob column and see how it feels.
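
To make the content_blob idea concrete, here is a rough sketch using the standard library sqlite3 module. The table layout and the `store_content` / `load_content` helpers are assumptions for illustration only; the real schema may have more columns and different names. The one detail taken from this thread is that binary content goes in a separate content_blob column.

```python
import sqlite3

# Hypothetical, simplified table: text lives in content, bytes in content_blob
SCHEMA = """
create table embeddings (
    id text primary key,
    content text,
    content_blob blob
)
"""


def store_content(db, item_id, content):
    """Store str in content and bytes in content_blob, leaving the other null."""
    if isinstance(content, bytes):
        text, blob = None, content
    elif isinstance(content, str):
        text, blob = content, None
    else:
        raise TypeError(f"Cannot store {type(content).__name__}")
    db.execute(
        "insert or replace into embeddings (id, content, content_blob) "
        "values (?, ?, ?)",
        (item_id, text, blob),
    )


def load_content(db, item_id):
    """Return the stored str or bytes, or None if the id is unknown."""
    row = db.execute(
        "select content, content_blob from embeddings where id = ?", (item_id,)
    ).fetchone()
    if row is None:
        return None
    text, blob = row
    return blob if blob is not None else text


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
store_content(db, "note", "hello world")
store_content(db, "image", b"\x89PNG\r\n")
```

Keeping the two columns separate means existing text queries against content keep working, and a reader can tell which kind of content was stored by which column is non-null.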

simonw · 2023-09-10T04:22:24Z

If a model only supports binary and does not support text, maybe we can have it treat all input as --binary even if you forget to use that flag?
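
One way that fallback might look, as a sketch only: `should_read_as_binary` is a hypothetical helper, and the `supports_text` / `supports_binary` arguments stand in for the flags on the model.

```python
def should_read_as_binary(supports_text, supports_binary, binary_flag=False):
    """Decide whether input should be read as bytes.

    A binary-only model forces binary mode even without --binary;
    otherwise the user's flag decides.
    """
    if supports_binary and not supports_text:
        return True
    return binary_flag


# Binary-only model, flag forgotten: still treated as binary
print(should_read_as_binary(supports_text=False, supports_binary=True))
# Model that accepts both: the flag decides
print(should_read_as_binary(supports_text=True, supports_binary=True))
```

This only covers the binary-only case. Passing --binary to a text-only model would still need to be rejected by separate validation.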

simonw · 2023-09-12T01:57:51Z

I'm landing this. Future tests will happen as I write the plugins.


simonw added enhancement New feature or request embeddings labels Sep 9, 2023

simonw mentioned this pull request Sep 9, 2023

Support for embedding binary files #253

Closed

simonw added this to the 0.10 milestone Sep 10, 2023

simonw mentioned this pull request Sep 12, 2023

Duplicate the --save feature from openai-to-sqlite similar #230

Open

simonw added 5 commits September 11, 2023 18:47

Work in progress binary embeddings support, refs #253

ed375c3

Write binary content to content_blob, with tests - refs #253

90b599c

Applied cog

aab280b

Ruff passes now

5730ab3

Fixed dumb bug I introduced

4aa76cb

simonw force-pushed the binary-embeddings branch from d6fae1f to 4aa76cb Compare September 12, 2023 01:47

supports_text and supports_binary embedding validation, refs #253

81051aa

simonw marked this pull request as ready for review September 12, 2023 01:57

simonw linked an issue Sep 12, 2023 that may be closed by this pull request

Support for embedding binary files #253

Closed

simonw merged commit 52cec13 into main Sep 12, 2023
20 checks passed

simonw deleted the binary-embeddings branch September 12, 2023 01:58

simonw mentioned this pull request Sep 12, 2023

Binary embeddings final cleanup #264

Closed

3 tasks

simonw added a commit that referenced this pull request Sep 12, 2023

Release 0.10a1

90ab024

Refs #229, #244, #247, #248, #254, #256, #259, #263

simonw added a commit that referenced this pull request Sep 12, 2023

Release 0.10

e83d205

Refs #225, #229, #231, #254, #256, #259
