Skip to content

feat: add local_files_only to sentence-transformers embedders #9400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jantrienes
Copy link
Contributor

Related Issues

None, however a similar issue came up in the past: #1256

Proposed Changes:

Pass local_files_only argument to SentenceTransformer to allow use of embedders in offline environments. When local_files_only=True and the model is available in cache, no outside networking requests are made. When the model is not available in cache, an exception is raised. See offline mode docs of Huggingface.

Note that for proper offline use, passing local_files_only=True is necessary even if the model is available in cache, as there are other places where the flag is used in SentenceTransformer (e.g., here).

How did you test it?

Unit tests. Additionally, following snippet demonstrates the desired behavior:

>>> from haystack.components.embedders import SentenceTransformersDocumentEmbedder
>>> SentenceTransformersDocumentEmbedder(model="all-MiniLM-L12-v2", progress_bar=False, local_files_only=True).warm_up()

[... truncated ...]

OSError: We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
>>> SentenceTransformersDocumentEmbedder(model="all-MiniLM-L12-v2", progress_bar=False, local_files_only=False).warm_up()
modules.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 349/349 [00:00<00:00, 246kB/s]

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@jantrienes jantrienes requested review from a team as code owners May 16, 2025 11:08
@jantrienes jantrienes requested review from dfokina and anakin87 and removed request for a team May 16, 2025 11:08
@CLAassistant
Copy link

CLAassistant commented May 16, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels May 16, 2025
@jantrienes jantrienes changed the title feat: add to sentence-transformers embedders feat: add local_files_only to sentence-transformers embedders May 16, 2025
@coveralls
Copy link
Collaborator

coveralls commented May 16, 2025

Pull Request Test Coverage Report for Build 15117750305

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 75 unchanged lines in 12 files lost coverage.
  • Overall coverage decreased (-0.02%) to 90.425%

Files with Coverage Reduction New Missed Lines %
components/evaluators/sas_evaluator.py 1 62.3%
components/embedders/sentence_transformers_document_embedder.py 2 96.88%
components/embedders/sentence_transformers_text_embedder.py 2 96.43%
components/generators/utils.py 3 18.75%
core/pipeline/async_pipeline.py 3 67.32%
components/generators/hugging_face_local.py 4 91.76%
components/rankers/sentence_transformers_diversity.py 4 94.84%
components/rankers/transformers_similarity.py 5 91.43%
components/readers/extractive.py 7 95.71%
components/agents/agent.py 14 91.07%
Totals Coverage Status
Change from base Build 15065181550: -0.02%
Covered Lines: 10955
Relevant Lines: 12115

💛 - Coveralls

@anakin87
Copy link
Member

Hey @jantrienes, thanks for the PR.

I would say that there is no need to introduce a new parameter since we already provide config_kwargs: "Additional keyword arguments for AutoConfig.from_pretrained when loading the model configuration."

This works as expected and raises an error if the model is not available locally:

from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-mpnet-base-v2",
                config_kwargs={"local_files_only":True})
text_embedder.warm_up()

However, I agree that this feature is a bit hidden.
So I would propose to transform this PR and just improve the docstrings that describe the config_kwargs parameter: we can explicitly mention this use case.

What do you think?

@jantrienes
Copy link
Contributor Author

Hi @anakin87, thanks a lot!

We initially used config_kwargs={"local_files_only": True}, but it didn’t fully prevent network access. In our deployment environment (behind a proxy), this led to infinite timeouts. Unfortunately, I can't provide a complete repro.

After closer inspection, we found that other parts of sentence-transformers also perform network I/O. For example, here:
https://github.com/UKPLab/sentence-transformers/blob/dd8a8d7d4d5660e7b2b69e8a1ece8202417c0069/sentence_transformers/SentenceTransformer.py#L1825-L1826

While these paths are guarded by not local_files_only, the check only works if local_files_only is passed directly to the SentenceTransformer init. As far as I can tell, config_kwargs isn't unpacked.

Given that local_files_only is part of the sentence-transformers API, I thought it would be safe to include it explicitly. Alternatively, this might be something to fix upstream, but I’m not familiar enough with HF internals to say for sure.

@anakin87
Copy link
Member

Thanks. I'll take a better look then.

Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your clarification, that is correct.
I agree with this change and only left minor comments on the wording.

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@anakin87 anakin87 enabled auto-merge (squash) May 19, 2025 16:04
@anakin87 anakin87 merged commit 83b087c into deepset-ai:main May 19, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants