Skip to content

Add Sentence Transformers for embeddings #142

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

oidebrett
Copy link

Hi there, I'm Ivo and I am loving the project,

This PR introduces support for Sentence Transformers (documentation here) as an embedding option. I've followed the documentation here.

To test, set the code/config/config_embedding.yaml to be preferred_provider: sentence-transformers and redo the db_load python -m tools.db_load https://feeds.libsyn.com/121695/rss Behind-the-Tech.

I reset the code/config/config_embedding.yaml to be preferred_provider: azure_openai as per the repo

Looking forward to your feedback!

@oidebrett
Copy link
Author

oidebrett commented May 26, 2025 via email

@jennifermarsman jennifermarsman requested a review from Copilot May 27, 2025 15:20
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for the Sentence Transformers framework as a new local embedding provider, including docs, dependency updates, and integration in the embedding factory.

  • Add documentation for the new provider
  • Introduce the sentence-transformers dependency and implementation
  • Register provider in factory, update config and README

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
docs/SentenceTransformers.md New guide on using Sentence Transformers embeddings
code/requirements.txt Add sentence-transformers package requirement
code/embedding/sentence_transformer_embedding.py Implement SentenceTransformer-based embedding functions
code/embedding/embedding.py Register sentence-transformers provider and locks
code/config/config_embedding.yaml Add provider config for sentence-transformers
code/README.md Include new embedding module in project README

@@ -22,7 +22,9 @@
"openai": threading.Lock(),
"gemini": threading.Lock(),
"azure_openai": threading.Lock(),
"snowflake": threading.Lock()
"snowflake": threading.Lock(),
"snowflake": threading.Lock(),
Copy link
Preview

Copilot AI May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two identical "snowflake": threading.Lock(), entries; remove the duplicate to avoid key override.

Suggested change
"snowflake": threading.Lock(),

Copilot uses AI. Check for mistakes.


async def get_sentence_transformer_embedding(
text: str,
model: Optional[str] = None,
Copy link
Preview

Copilot AI May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model parameter is accepted but never used; update get_sentence_transformer_embedding to pass this override to get_embedding_model or remove the unused argument.

Copilot uses AI. Check for mistakes.

"""
try:
model_instance = get_embedding_model()
embedding = model_instance.encode(text.replace("\n", " "), convert_to_numpy=True).tolist()
Copy link
Preview

Copilot AI May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling the blocking encode method in an async function can block the event loop; consider running it via loop.run_in_executor or a thread pool.

Suggested change
embedding = model_instance.encode(text.replace("\n", " "), convert_to_numpy=True).tolist()
loop = asyncio.get_running_loop()
embedding = await loop.run_in_executor(
None,
lambda: model_instance.encode(text.replace("\n", " "), convert_to_numpy=True).tolist()
)

Copilot uses AI. Check for mistakes.

model: snowflake-arctic-embed-m-v1.5

sentence-transformers:
Copy link
Preview

Copilot AI May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config key sentence-transformers does not match the lookup in get_model_name() which uses sentence_transformer; unify the provider identifier to ensure overrides work.

Suggested change
sentence-transformers:
sentence_transformer:

Copilot uses AI. Check for mistakes.

code/README.md Outdated
@@ -33,6 +33,7 @@ code/
| ├── embedding.py #
| ├── gemini_embedding.py #
| ├── openai_embedding.py #
| ├── sentence_transformer_embedding.py #
Copy link
Preview

Copilot AI May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The README entry for sentence_transformer_embedding.py has an empty comment; add a brief description of the module’s purpose.

Suggested change
| ├── sentence_transformer_embedding.py #
| ├── sentence_transformer_embedding.py # Embedding generation using Sentence Transformers

Copilot uses AI. Check for mistakes.

@jennifermarsman jennifermarsman added the enhancement New feature or request label Jun 6, 2025
@toloco
Copy link

toloco commented Jun 11, 2025

Using a version of this PR, I get this error:
SentenceTransformer("all-MiniLM-L6-v2")

@oidebrett have you come with the same issue?

ValueError: shapes (848,384) and (1024,) not aligned: 384 (dim 1) != 1024 (dim 0)


sentence-transformers:
model: all-MiniLM-L6-v2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the name should be: "sentence-transformers/all-MiniLM-L6-v2" instead of just "all-MiniLM-L6-v2"

@oidebrett
Copy link
Author

oidebrett commented Jun 11, 2025 via email

@toloco
Copy link

toloco commented Jun 12, 2025

Thanks!, silly me didn't get I've run another embedding model before...

@linjieli222 linjieli222 self-assigned this Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants