
perf: lazy-load fasttext quality model in context_scraping.utils #55

Open

obchain wants to merge 1 commit into sentient-agi:main from obchain:perf/lazy-fasttext-load

Conversation


@obchain obchain commented May 15, 2026

What

Stop loading the kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2 model at import time in opendeepsearch.context_scraping.utils. Move the download / fasttext.load_model behind a _get_model() accessor that caches into a module-level _model singleton, called only inside predict_educational_value.

Closes #56

Why

The clean_html / get_wikipedia_content helpers in that module are imported on the hot path of WebScraper and (transitively) of OpenDeepSearchAgent. Today, import opendeepsearch triggers:

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"))

at module scope, which:

  1. forces a HuggingFace download on the first import for every consumer, even those that disable filter_content and never reach predict_educational_value;
  2. blocks import opendeepsearch entirely on machines with no network or no write access to the HF cache;
  3. is the underlying reason behind the surprise reported in Specifying Jina Reranker in OpenDeepSearch Still Triggers Local Model Download #32 — a Jina-only setup that should be a pure HTTP call still pays a model download cost on import.

How

src/opendeepsearch/context_scraping/utils.py:

  • pulled the repo / filename constants out into module-level names
  • replaced the eager model = fasttext.load_model(...) with _model: Optional[...] = None plus a _get_model() accessor that loads-and-caches on first call
  • swapped model.predict(...) inside predict_educational_value for _get_model().predict(...)

No public API changes — predict_educational_value, clean_markdown_links, filter_quality_content, get_wikipedia_content keep their signatures and behaviour. The download still happens the first time the quality filter runs, just not on import.

Testing

  • python3 -m py_compile src/opendeepsearch/context_scraping/utils.py — clean
  • grep -n 'model\.predict\|fasttext\.load_model' src/opendeepsearch/context_scraping/utils.py — only two hits remain: the fasttext.load_model call inside _get_model and the _get_model().predict call inside predict_educational_value; no module-scope model name is left
  • Manual smoke: after the change, python3 -c "from opendeepsearch.context_scraping import utils" no longer triggers a HuggingFace request; utils.predict_educational_value(["sample"]) still works and lazily caches the model on first call (verified by patching the underlying loader and asserting it runs exactly once across repeated invocations).
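A self-contained version of that caching check might look like the following. The class is an illustrative stand-in for opendeepsearch.context_scraping.utils (the PR's actual test was a manual smoke check, not this code), and the loader is patched with unittest.mock so no download happens:

```python
from unittest import mock


class FakeUtils:
    """Stand-in for the utils module after the PR's change (illustrative)."""

    def __init__(self):
        self._model = None

    def _load_model(self):
        # Stand-in for fasttext.load_model(hf_hub_download(...)).
        return mock.Mock(predict=mock.Mock(return_value=(["__label__High"], [0.98])))

    def _get_model(self):
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def predict_educational_value(self, texts):
        return self._get_model().predict(texts)


utils = FakeUtils()
with mock.patch.object(utils, "_load_model", wraps=utils._load_model) as loader:
    utils.predict_educational_value(["sample"])
    utils.predict_educational_value(["sample"])

assert loader.call_count == 1  # loaded once, then served from the cache
```

Wrapping the loader (rather than `_get_model` itself, which runs on every invocation) is what lets the test distinguish "accessor called" from "model actually loaded".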

`opendeepsearch.context_scraping.utils` used to call
`fasttext.load_model(hf_hub_download(...))` at module import time, which
pulled `model.bin` (~hundreds of MB) from HuggingFace for every consumer
of the package — including users who never enable `filter_content` and
therefore never reach `predict_educational_value`.

Move the download/load behind a `_get_model()` lazy accessor cached in a
module-level `_model` singleton so the artifact is only fetched the
first time `predict_educational_value` is actually called.

Also drops `import opendeepsearch` time from "blocks on HuggingFace" to
"plain Python import" on machines with no model cache, which avoids the
surprise download reported in sentient-agi#32 when only the Jina cloud reranker is
configured.

Development

Successfully merging this pull request may close these issues.

import-time HuggingFace download blocks import opendeepsearch (fasttext quality model)
