Binary embeddings #254
I can build plugins using:
Here's my rough CLIP plugin:

    import io

    import llm
    from PIL import Image
    from sentence_transformers import SentenceTransformer


    @llm.hookimpl
    def register_embedding_models(register):
        register(ClipEmbeddingModel())


    class ClipEmbeddingModel(llm.EmbeddingModel):
        model_id = "clip"
        supports_binary = True

        def __init__(self):
            self._model = None
            self._processor = None
            self._tokenizer = None

        def embed_batch(self, items):
            # Embeds a mix of text strings and binary images
            if self._model is None:
                # Load the model lazily, on first use
                self._model = SentenceTransformer('clip-ViT-B-32')
            to_embed = []
            for item in items:
                if isinstance(item, bytes):
                    # If the item is a byte string, treat it as image data and convert to Image object
                    to_embed.append(Image.open(io.BytesIO(item)))
                elif isinstance(item, str):
                    to_embed.append(item)
            embeddings = self._model.encode(to_embed)
            return [[float(num) for num in embedding] for embedding in embeddings]

Though looking at this, it's really just a sentence transformer that can accept both binary and text. It could go in I could still have
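One thing to watch in the `embed_batch` loop of the plugin: an item that is neither `bytes` nor `str` is skipped silently, so the returned list can end up shorter than the input. Below is a minimal sketch of a stricter dispatch. The names `prepare_items` and `decode_image` are hypothetical and not part of `llm` or `sentence_transformers`; the image decoder is passed in as a parameter so the snippet runs without Pillow installed.

```python
from typing import Any, Callable, Iterable, List


def prepare_items(items: Iterable[Any], decode_image: Callable[[bytes], Any]) -> List[Any]:
    """Return one prepared entry per input item, or raise on unsupported types.

    Hypothetical helper: `decode_image` stands in for something like
    `lambda data: Image.open(io.BytesIO(data))` in the real plugin.
    """
    prepared = []
    for index, item in enumerate(items):
        if isinstance(item, bytes):
            # Binary content is treated as image data
            prepared.append(decode_image(item))
        elif isinstance(item, str):
            # Text is passed through unchanged
            prepared.append(item)
        else:
            # Fail loudly instead of dropping the item and misaligning the output
            raise TypeError(
                f"Item {index} has unsupported type {type(item).__name__}"
            )
    return prepared
```

With this shape, the number of embeddings returned always matches the number of items passed in, which matters when the caller pairs each embedding with an ID.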
I'm going to try the

If a model only supports binary and does not support text, maybe we can have it treat all input as
Force-pushed from d6fae1f to 4aa76cb.
I'm landing this. Future tests will happen as I write the plugins.
Refs:
Still needed:

- `--store` - should it store the binary content in a new `content_blob` column? See these notes.
- `model.supports_binary` boolean needs to be respected - it should raise errors if you attempt to embed binary content against a text-only model
- Lots of tests, using a mock embedding model that can handle binary content.
- Build a plugin that uses this (I have a draft `llm-clip` one already).
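The `supports_binary` check and the mock model from the list above could look roughly like the following. This is a standalone sketch and not the code in this PR: `SketchEmbeddingModel`, `embed_checked`, `MockBinaryModel` and `MockTextModel` are hypothetical names, and the fake vectors exist only so that tests have something deterministic to compare against.

```python
from typing import Iterable, List, Union

Item = Union[str, bytes]


class SketchEmbeddingModel:
    """Hypothetical stand-in for an embedding model base class."""

    model_id = "sketch"
    supports_binary = False

    def embed_batch(self, items: List[Item]) -> List[List[float]]:
        raise NotImplementedError

    def embed_checked(self, items: Iterable[Item]) -> List[List[float]]:
        # Reject binary input up front when the model is text-only
        items = list(items)
        for item in items:
            if isinstance(item, bytes) and not self.supports_binary:
                raise ValueError(
                    f"Model {self.model_id!r} does not support binary content"
                )
        return list(self.embed_batch(items))


class MockBinaryModel(SketchEmbeddingModel):
    """Mock model that accepts both text and bytes."""

    model_id = "mock-binary"
    supports_binary = True

    def embed_batch(self, items: List[Item]) -> List[List[float]]:
        vectors = []
        for item in items:
            data = item.encode("utf-8") if isinstance(item, str) else item
            # Fake two-dimensional vector: [byte length, first byte value]
            vectors.append([float(len(data)), float(data[0]) if data else 0.0])
        return vectors


class MockTextModel(MockBinaryModel):
    """Same fake vectors, but declared as text-only."""

    model_id = "mock-text"
    supports_binary = False
```

A test can then assert that `MockTextModel().embed_checked([b"..."])` raises `ValueError`, while `MockBinaryModel` accepts the same input.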