
perf: lazy-load fasttext quality model in context_scraping.utils #55

Open

obchain wants to merge 1 commit into sentient-agi:main from obchain:perf/lazy-fasttext-load

Conversation


@obchain obchain commented May 15, 2026

What

Stop loading the kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2 model at import time in opendeepsearch.context_scraping.utils. Move the download / fasttext.load_model behind a _get_model() accessor that caches into a module-level _model singleton, called only inside predict_educational_value.

Closes #56

Why

The clean_html / get_wikipedia_content helpers in that module are imported on the hot path of WebScraper and (transitively) of OpenDeepSearchAgent. Today, import opendeepsearch triggers:

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"))

at module scope, which:

  1. forces a HuggingFace download on the first import for every consumer, even those that disable filter_content and never reach predict_educational_value;
  2. blocks import opendeepsearch entirely on machines with no network or no write access to the HF cache;
  3. is the underlying reason behind the surprise reported in Specifying Jina Reranker in OpenDeepSearch Still Triggers Local Model Download #32 — a Jina-only setup that should be a pure HTTP call still pays a model download cost on import.

How

src/opendeepsearch/context_scraping/utils.py:

  • pulled the repo / filename constants out into module-level names
  • replaced the eager model = fasttext.load_model(...) with _model: Optional[...] = None plus a _get_model() accessor that loads-and-caches on first call
  • swapped model.predict(...) inside predict_educational_value for _get_model().predict(...)

No public API changes — predict_educational_value, clean_markdown_links, filter_quality_content, get_wikipedia_content keep their signatures and behaviour. The download still happens the first time the quality filter runs, just not on import.

Testing

  • python3 -m py_compile src/opendeepsearch/context_scraping/utils.py — clean
  • grep -n 'model\.predict\|fasttext\.load_model' src/opendeepsearch/context_scraping/utils.py — only two hits remain: the fasttext.load_model call inside _get_model and the _get_model().predict call inside predict_educational_value; no module-scope model name is left
  • Manual smoke: after the change, python3 -c "from opendeepsearch.context_scraping import utils" no longer triggers a HuggingFace request; utils.predict_educational_value(["sample"]) still works and lazily caches the model on first call (verified by patching the underlying loader and asserting it runs exactly once across repeated invocations).
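A self-contained version of that caching check might look like the following. The class is an illustrative stand-in for opendeepsearch.context_scraping.utils (the PR's actual test was a manual smoke check, not this code), and the loader is patched with unittest.mock so no download happens:

```python
from unittest import mock


class FakeUtils:
    """Stand-in for the utils module after the PR's change (illustrative)."""

    def __init__(self):
        self._model = None

    def _load_model(self):
        # Stand-in for fasttext.load_model(hf_hub_download(...)).
        return mock.Mock(predict=mock.Mock(return_value=(["__label__High"], [0.98])))

    def _get_model(self):
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def predict_educational_value(self, texts):
        return self._get_model().predict(texts)


utils = FakeUtils()
with mock.patch.object(utils, "_load_model", wraps=utils._load_model) as loader:
    utils.predict_educational_value(["sample"])
    utils.predict_educational_value(["sample"])

assert loader.call_count == 1  # loaded once, then served from the cache
```

Wrapping the loader (rather than `_get_model` itself, which runs on every invocation) is what lets the test distinguish "accessor called" from "model actually loaded".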

`opendeepsearch.context_scraping.utils` used to call
`fasttext.load_model(hf_hub_download(...))` at module import time, which
pulled `model.bin` (~hundreds of MB) from HuggingFace for every consumer
of the package — including users who never enable `filter_content` and
therefore never reach `predict_educational_value`.

Move the download/load behind a `_get_model()` lazy accessor cached in a
module-level `_model` singleton so the artifact is only fetched the
first time `predict_educational_value` is actually called.

Also drops `import opendeepsearch` time from "blocks on HuggingFace" to
"plain Python import" on machines with no model cache, which avoids the
surprise download reported in sentient-agi#32 when only the Jina cloud reranker is
configured.

Development

Successfully merging this pull request may close these issues.

import-time HuggingFace download blocks import opendeepsearch (fasttext quality model)
