Extension of embeddings draft implementation to support local models #3463
Hello!
Pull Request overview
Details
This PR is a simple extension of the embeddings API draft, with two intentions: 1) support local embedding models and 2) inform the PydanticAI maintainers about common usage patterns that the overall design might want to incorporate.
Local models
Part 1 is rather straightforward: this PR allows users to run any embedding model from https://huggingface.co/models?library=sentence-transformers, such as any Qwen, bge, sentence-transformers, EmbeddingGemma, Nomic, Jina, mixedbread, etc. model.
Here are some usage snippets (requires `pip install sentence-transformers`; torch can be installed with or without GPU support), followed by a sample with a cosine similarity call.
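Since the exact wrapper names in the draft API may differ, here is a minimal sketch using the sentence-transformers library directly; the model name is just an example:

```python
from sentence_transformers import SentenceTransformer

# Any model from https://huggingface.co/models?library=sentence-transformers works here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["The weather is lovely today.", "It's so sunny outside!"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384) for this model
```

And the cosine similarity call, using the built-in similarity helper from recent sentence-transformers versions:

```python
# similarity() computes cosine similarity by default for most models.
similarities = model.similarity(embeddings, embeddings)
print(similarities)  # 2x2 tensor; the diagonal entries are 1.0
```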
Personal Recommendation
Part 2 of my goal doesn't involve this PR much at all. Instead, I want to inform you about how embedding models are commonly used. It wasn't always the case, but nowadays embedding models are used almost exclusively for retrieval. In this setting, users want to embed both queries and documents, and then use an efficient search system/vector database that computes dot product or cosine similarity to find the documents relevant to a given query (or rather, query embedding).
In practice, model authors have started separating the query and document paths. Two of the most recent big embedding model releases, for example, both use special prompts/prefixes for queries, such as `Instruct: Given a question, retrieve passages that answer the question\nQuery:`, while not using any prompt/prefix for documents. Some models use prompts for both queries and documents, and others even use entirely different underlying models for queries and documents, e.g. https://huggingface.co/MongoDB/mdbr-leaf-ir-asym or https://huggingface.co/jinaai/jina-embeddings-v4.

From what I've seen over the past 2 years, I can almost guarantee that the next OpenAI embedding model will also require some form of distinction between queries and documents. Cohere's and Voyage's models already do.
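To make the prompt/prefix mechanics concrete, here is a hedged sketch; the model name is just one example of a prompt-using model, and the exact instruction text varies per model:

```python
from sentence_transformers import SentenceTransformer

# Example of a model that expects an instruction prefix on queries.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Queries get the instruction prepended...
query_embedding = model.encode(
    "Which planet is known as the Red Planet?",
    prompt="Instruct: Given a question, retrieve passages that answer the question\nQuery: ",
)
# ...while documents are embedded as-is, without any prompt.
document_embedding = model.encode("Mars is often called the Red Planet.")
```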
Due to this, I think that you should support separate methods for embedding queries and documents, e.g. `embed_query` and `embed_document`. The underlying implementation for each provider (OpenAI, Cohere, Voyage, Sentence Transformers) can then use whatever that provider uses to distinguish the input types: e.g. `input_type` for Cohere and Voyage, `encode_query` vs `encode_document` for Sentence Transformers, and the same endpoint twice for OpenAI until they add support for a distinction. A sketch of what this split could look like follows below.

cc @dmontagu @DouweM as you set up the original draft implementation, and cc @ggozad as you inspired the structure via https://github.com/ggozad/haiku.rag (P.S. I really like the "slim" approach).
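For illustration, a minimal sketch of the proposed split, assuming a hypothetical `EmbeddingModel` base class; none of these names are taken from the actual draft API:

```python
from abc import ABC, abstractmethod

from sentence_transformers import SentenceTransformer


class EmbeddingModel(ABC):
    """Hypothetical base class; the real draft API likely differs."""

    @abstractmethod
    def embed_query(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_document(self, texts: list[str]) -> list[list[float]]: ...


class SentenceTransformerEmbeddingModel(EmbeddingModel):
    """Maps the query/document split onto sentence-transformers' own methods."""

    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def embed_query(self, texts: list[str]) -> list[list[float]]:
        # encode_query applies the model's query prompt/prefix, if any.
        return self.model.encode_query(texts).tolist()

    def embed_document(self, texts: list[str]) -> list[list[float]]:
        # encode_document applies the document path instead.
        return self.model.encode_document(texts).tolist()
```

A Cohere or Voyage implementation would pass the provider's `input_type` values in each method, and an OpenAI implementation would call the same embeddings endpoint from both methods until the API offers a distinction.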